Realtime APIs: Mike Amundsen on Designing for Speed and Observability

Key Takeaways

  • The “IDC MaturityScape: Digital Transformation Platforms 1.0” report states that 71% of organizations expect to see the volume of API calls increase in the next 2 years. 59% of organizations expect the latency of a typical API request to be under 20 milliseconds, and 93% expect a latency under 50 milliseconds. 
  • According to the same report, 90% of new applications being built will feature a microservice architecture, and 35% of all production applications will be “cloud native.”
  • To meet the increasing demand placed on API-based systems, there are three areas for consideration: architecting for performance, monitoring for performance, and managing for performance.
  • Good architectural practices include avoiding a simple “lift and shift” to the cloud, embracing asynchronous processing, and reengineering data and networks.
  • Understanding a system and the resulting performance, and being able to identify bottlenecks, is critical to meeting performance targets. Services must also be monitored, both from an operational perspective (SLIs), and also a business perspective (KPIs). 

In a recent apidays webinar, Mike Amundsen, trainer and author of the recent O’Reilly book API Traffic Management 101, presented “High Performing APIs: Architecting for Speed at Scale.” Drawing on recent research by IDC, he argued that organizations will have to drive systemic changes to meet the upcoming increased demand for consumption of business services via APIs. This change in requirements relates to both an increased scale of operation and a decreased response time.

The changing nature of customer expectations, combined with new data-driven products and an increase in consumption at the edge (mobile, IoT, etc.), has meant that the need for low latency “real-time APIs” is rapidly becoming the norm.

The recent adoption of cloud technology and microservices-based architecture has enabled innovation and increased speed of development throughout the business world.

However, this technology and architecture style is not always conducive to creating performant applications; in the cloud nearly everything communicates over a virtualized network, and in a service-oriented architecture multiple separate processes are typically invoked to fulfill each business function. Amundsen stated that a series of holistic changes to how systems are designed, operated, and managed is required to meet the new demands. 

The performance imperative

Amundsen began his presentation by describing the “performance imperative” he is seeing throughout the IT industry. Focusing first on the ecosystem transformation, he referenced a 2019 IDC research report, “IDC MaturityScape: Digital Transformation Platforms 1.0.” According to this report, 75% of organizations will be completely digitally transformed in the next decade, and companies that do not embrace a modern way of working will not survive. By 2022, 90% of new applications being built will feature a microservice architecture, and 35% of all production applications will be “cloud native.”

API call volumes are increasing as more organizations embrace digital transformations. According to the same report, 71% of organizations expect to see the volume of API calls increase in the next 2 years. About 60% expect over 250 million API calls per month (~10 million per business day). There is an increasing focus on transaction response time, with 59% of organizations expecting the latency of a typical API request to be under 20 milliseconds, and 93% expecting a latency under 50 milliseconds. 
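To put the volume figure in perspective, the arithmetic can be sketched in a few lines. This is only a back-of-the-envelope illustration: the function names, the count of business days in a month, and the 30-day month are assumptions, not numbers from the IDC report.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def calls_per_business_day(monthly_calls, business_days):
    """Spread a monthly call volume evenly over the business days in that month."""
    return monthly_calls / business_days


def average_calls_per_second(monthly_calls, days_in_month=30):
    """Average request rate if traffic were perfectly flat (real traffic is burstier)."""
    return monthly_calls / (days_in_month * SECONDS_PER_DAY)
```

With 250 million calls per month, an assumed 25 business days gives 10 million calls per day, and an assumed 22 gives roughly 11.4 million, so the "~10 million" figure holds as an approximation. The flat average works out to just under 100 calls per second, and peak rates will be higher.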

Amundsen argued that to meet the increasing demand placed on organizations, there are three areas for consideration: architecting for performance, monitoring for performance, and managing for performance.

Architecting for performance

Migrating applications to a cloud vendor’s platform brings many benefits, such as the ability to take advantage of data processing or machine learning-based services to innovate rapidly, or reducing the total cost of ownership (TCO) of an organization’s platform. However, cloud infrastructure can be significantly different than traditional hardware, both in terms of configuration and performance. Amundsen cautioned that performing a “lift and shift” of an existing application is not enough to ensure performance.

The vast majority of infrastructure components and services within a cloud platform are connected over a network—e.g. block stores are often implemented as network-attached storage (NAS). Colocation of components is not guaranteed—e.g. an API gateway might be located in a different physical data center than the virtual machine (VM) a backend application is running on. The global reach of cloud platforms opens opportunities for reaching new customers and also provides more effective disaster recovery options, but this also means that the distance between your customers and your services can increase.

Amundsen suggested that engineering teams redesign components into smaller services, following the best practices associated with the microservices architectural style. These services can be independently released and independently scaled. Embracing asynchronous methods of communicating, such as messaging and events, can reduce wait states. This offers the possibility of an API call rapidly returning an acknowledgment and decreasing latency from an end-user perspective, even if the actual work has been queued at the backend. Engineers will also need to build in transaction reversal and recovery processes to reduce the impact of inevitable failures. Additional thought leaders in this space, such as Bernd Ruecker, encourage engineers to build “smart APIs” and learn more about business process modeling and event sourcing.
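The "acknowledge now, process later" idea can be sketched with an in-process queue. This is a minimal illustration rather than anything shown in the talk: the class `AsyncOrderApi`, its method names, and the status strings are hypothetical, and a production system would more likely use a durable message broker than a thread and an in-memory queue.

```python
import queue
import threading
import uuid


class AsyncOrderApi:
    """Hypothetical API facade: accept work, acknowledge at once, process later."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._status = {}
        self._lock = threading.Lock()
        # A daemon thread stands in for a separate backend worker.
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, payload):
        """Queue the work and return an acknowledgment without waiting for it."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._status[job_id] = "accepted"
        self._jobs.put((job_id, payload))
        return {"job_id": job_id, "status": "accepted"}

    def status(self, job_id):
        """Let the caller poll for the eventual outcome."""
        with self._lock:
            return self._status.get(job_id, "unknown")

    def wait_until_idle(self):
        """Block until every queued job has been handled (useful in tests)."""
        self._jobs.join()

    def _run(self):
        while True:
            job_id, payload = self._jobs.get()
            try:
                self._process(payload)
                outcome = "completed"
            except Exception:
                # A failed job is where a reversal or recovery step would hook in.
                outcome = "failed"
            with self._lock:
                self._status[job_id] = outcome
            self._jobs.task_done()

    def _process(self, payload):
        # Placeholder business logic: reject negative amounts.
        if payload.get("amount", 0) < 0:
            raise ValueError("negative amount")
```

The caller receives `{"status": "accepted"}` immediately and can poll `status(job_id)` later. The trade-off is that failures surface after the acknowledgment has been sent, which is why the reversal and recovery processes mentioned above become necessary.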

For systems to perform as required, data read and write patterns will frequently have to be reengineered. Amundsen suggested judicious use of caching results, which can remove the need to constantly query upstream services. Data may also need to be “staged” appropriately throughout the entire end-to-end request handling process. For example, results and data can be cached in localized points of presence (PoPs) via content delivery networks (CDNs) or in an API gateway, and data stores can be replicated across availability zones (local data centers) and globally. For some high transaction throughput use cases, writes may have to be streamed to meet demand, for example by writing data locally, or to a high-throughput distributed logging system like Apache Kafka, and persisting it to an external data store at a later point in time.
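As a rough illustration of caching results to avoid repeated upstream queries, the sketch below wraps a lookup function in a small time-to-live cache. The class name `TtlCache`, the `fetch` callback, and the injectable `clock` are assumptions made for the example; real deployments would usually rely on a CDN, an API gateway cache, or a dedicated cache service instead.

```python
import time


class TtlCache:
    """Cache results of an upstream lookup for a fixed number of seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # function that queries the upstream service
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: no upstream call
        value = self._fetch(key)     # miss or expired: go upstream once
        self._entries[key] = (now + self._ttl, value)
        return value
```

The time-to-live bounds how stale a response can be, so choosing it is a trade-off between upstream load and data freshness.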

Engineers may have to “rethink the network” (respecting the eight fallacies of distributed computing) and design their cloud infrastructure to follow best practices relevant to their cloud vendor and application architecture. Decreasing request and response size may also be required to meet demands. This may be engineered in tandem with the ability to increase the message volume. The industry may see the “return of the RPC,” with strongly-typed communication contracts and high-performance binary protocols. As convenient as JSON over HTTP is, a lot of computation (and time) is spent on serialization and deserialization, and the text-based message payloads sent over the wire are typically larger than their binary equivalents.
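The payload size difference is easy to see with a small comparison. The sketch below encodes the same record as JSON and as a fixed binary layout using Python's standard `struct` module, which here is only a stand-in for a real binary protocol such as Protocol Buffers. The record fields and the `ORDER_LAYOUT` format are hypothetical.

```python
import json
import struct

# Hypothetical fixed layout agreed by client and server (the "contract").
# "<" = little-endian without padding, I = uint32, d = float64, H = uint16.
ORDER_LAYOUT = struct.Struct("<IdH")


def encode_json(user_id, price, quantity):
    """Text encoding: field names travel with every message."""
    body = {"user_id": user_id, "price": price, "quantity": quantity}
    return json.dumps(body).encode("utf-8")


def encode_binary(user_id, price, quantity):
    """Binary encoding: only the values travel; the layout is known in advance."""
    return ORDER_LAYOUT.pack(user_id, price, quantity)


def decode_binary(payload):
    """Return the (user_id, price, quantity) tuple from a binary payload."""
    return ORDER_LAYOUT.unpack(payload)
```

For a record such as `(123456, 19.99, 3)`, the binary form is 14 bytes (4 + 8 + 2), while the JSON form is several times larger because it repeats the field names and spells out the numbers as text. The cost is that both sides must agree on the layout up front, which is exactly the strongly-typed contract described above.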

Monitoring for performance

Understanding a system and the resulting performance, and being able to identify bottlenecks, are critical to meeting performance targets. Observability and monitoring are critical aspects of any modern application platform. Infrastructure must be monitored, from machines to all aspects of the networking stack. Modern cloud networking may consist of many layers, from the high-level API gateway and service mesh implementations to the cloud software-defined networking (SDN) components to the actual virtualized and physical networking hardware—and these need to be instrumented effectively.

Services must also be monitored, both from an operational perspective and also a business perspective. Many organizations are increasingly adopting the site reliability engineering (SRE) approach of defining and collecting service level indicators (SLIs) for operational metrics, which consist of top-line metrics such as utilization, saturation, and errors (USE), or request rate, errors, and duration (RED). Business metrics are typically related to key performance indicators (KPIs) that are driven by an organization’s objectives and key results (OKRs). 
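As one possible reading of the RED approach, the sketch below derives request rate, error ratio, and a duration percentile from a batch of request records. The `RequestRecord` type, the choice of HTTP 5xx as "error," and the nearest-rank percentile method are assumptions for illustration; a real system would typically get these values from its metrics backend rather than compute them by hand.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    status: int          # HTTP status code returned to the caller
    duration_ms: float   # time taken to serve the request


def percentile(values, pct):
    """Nearest-rank percentile for a whole-number pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def red_summary(records, window_seconds):
    """Summarize Rate, Errors, and Duration over one observation window."""
    if not records:
        return {"rate_per_second": 0.0, "error_ratio": 0.0, "p95_ms": None}
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "rate_per_second": len(records) / window_seconds,
        "error_ratio": errors / len(records),
        "p95_ms": percentile([r.duration_ms for r in records], 95),
    }
```

A summary like this could then be compared against a target such as the sub-50-millisecond latency expectation cited earlier, with the percentile rather than the average used so that slow outliers are not hidden.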

With infrastructure emitting metrics and services producing SLI- and KPI-based metrics, the corresponding data must be effectively captured, processed, and presented to key stakeholders for this to be acted on. This is often a cultural challenge as much as it is a technical challenge.

Managing for performance

Amundsen stated that the organization should aim to create a “dashboard culture.” Monitoring top-line metrics, user traffic, and the continuous delivery process for “insight” into how a system is performing is vitally important. As stated by Dr. Nicole Forsgren and colleagues in Accelerate, the four key metrics that are correlated with high performing organizations are lead time, deployment frequency, mean time to restore (MTTR), and change failure percentage. Engineers should be able to use internal tooling to effectively solve problems, such as controlling access to functionality, scaling services, and diagnosing errors.

Adding security mitigations is often in conflict with performance (for example, additional layers of security verification can slow responses), so this is a delicate balancing act. Engineers should be able to understand systems, their threat models, and whether there are any indicators of current problems, e.g. an increased number of HTTP 503 error codes being generated, or a gateway being overloaded with traffic.

The elastic nature of cloud infrastructure typically allows almost unlimited scaling, albeit with potentially unlimited cost. However, the culture of managing for performance requires the ability to easily understand how current systems are being used. For example, engineers need to know whether scaling specific components is an appropriate response to seeing increased loads at any given point in time.

Diagnosing errors in a cloud-based microservices system typically requires an effective user experience (or developer experience) in addition to tooling that collects, processes, and displays metrics, logs, and traces. Dashboards and tooling should show top-level metrics related to a customer’s experience, and also allow engineers to test hypotheses for issues by drilling down to see individual service-to-service metrics. Being able to answer ad hoc questions of an observability or monitoring solution is currently an area of active research and product development.

Once the ability to gain insight and solve problems has been obtained, the next stage of managing for performance is the ability to anticipate. This is the capability to automatically recover from failure—building antifragile systems—and the ability to experiment with the system, such as examining fault-tolerance properties via the use of chaos engineering.

Summary

Amundsen concluded the presentation by reminding the audience of the IDC research report; the IT industry should prepare for API call volumes to increase and the requirements for transaction processing time to decrease. Supporting these new requirements will demand organization and system-wide changes. 

Software developers will have to learn about redesigning services, reengineering data, and instrumenting services for increased visibility into both operational and business metrics. Platform teams will have to consider rethinking networks, monitoring infrastructure, and managing traffic. Application operation and SRE teams will need to work across the engineering organization to enable the effective identification and resolution of problems, and also anticipate issues and enable experimentation.

A more detailed exploration into the topics discussed here can be found in Mike Amundsen’s recent O’Reilly book API Traffic Management 101.

About the Author

Daniel Bryant works as a Product Architect at Datawire, and is the News Manager at InfoQ and Chair for QCon London. His current technical expertise focuses on ‘DevOps’ tooling, cloud/container platforms and microservice implementations. Daniel is a leader within the London Java Community (LJC), contributes to several open source projects, writes for well-known technical websites such as InfoQ, O'Reilly, and DZone, and regularly presents at international conferences such as QCon, JavaOne, and Devoxx.
