Observability and Avoiding Alert Overload from Microservices at the Financial Times

Key Takeaways

  • Any microservice-based application is a distributed system, and accordingly, services do not run independently. If something fails, it can often lead to cascading failures, which complicates monitoring and alerting.
  • In order to adapt to the challenges of monitoring a microservices-based application, Wells suggested a three-pronged approach: build a system that can be supported; concentrate on "stuff that matters"; and cultivate alerts and the information they contain.
  • Log aggregation is required within any distributed system due to the volume of services and potential latency introduced via communication over a network, which means that logs may go missing or get increasingly delayed.
  • When implementing monitoring within a microservice-based system, traditional tooling like Nagios is often limited, as it does not provide a 'service-level' view, and the default (infrastructure) checks include things that cannot be fixed.
  • A core goal of monitoring and alerting is to know about problems before clients do, and accordingly the practice of running 'synthetic requests' that mimic user behaviour is vital.
  • Alerts must continually be cultivated, and if an alert is received that doesn't make sense, or does not require human interaction, it must be corrected or removed.
  • Creating alerts should be part of the normal development workflow: "code, test, alerts". In order to ensure that the development team know if an alert stops working, tests should be added to validate the alert.
  • Proactivity is required when maintaining and dealing with alerts in a non-trivial system, and out-of-date information can be worse than none at all.
  • A microservices architecture lets you move fast, but there is an associated operational cost, particularly around monitoring and observability. Make sure it's a cost you're willing to pay.

At QCon London, Sarah Wells presented "Avoiding Alerts Overload from Microservices", and cautioned that developers and operators must fundamentally change the way they think about monitoring when building a distributed microservice-based system.

Wells, a Principal Engineer at the Financial Times, began the talk by stating that knowing when there is a problem is not enough; an alert must only be triggered when an action by a human is required. A microservices architecture may allow the development team to move fast, but there is an operational cost, and the number (and complexity) of alerts generated by a microservice-based system can be overwhelming.

"A microservices architecture lets you move fast, but there is an associated operational cost. Make sure it's a cost you're willing to pay."

The Financial Times FT.com website is powered by a microservice backend, primarily utilising the Java and Go programming languages, packaged and deployed with Docker and CoreOS onto the Amazon Web Services (AWS) platform. The FT stores data within MongoDB, Elasticsearch, Neo4j and Apache Kafka.

There are 99 functional services, with 350 running instances at any given time, and 52 nonfunctional services, with 218 running instances. Wells stated that if each of the 568 service instances were checked every minute, this would result in 817,920 checks per day.

Running containers on shared Virtual Machines (VMs) requires a further 92,160 system-level checks, for a total of 910,080 checks per day. In addition, any microservice-based application is a distributed system, and accordingly, services do not run independently.

If something fails, it can often lead to cascade failures, which further complicates monitoring and alerting.
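The check volumes quoted above follow from simple arithmetic. The short sketch below reproduces them, assuming one check per service instance per minute as stated in the talk; the variable names are illustrative only.

```python
# Figures quoted in the talk.
FUNCTIONAL_INSTANCES = 350
NONFUNCTIONAL_INSTANCES = 218
SYSTEM_LEVEL_CHECKS_PER_DAY = 92_160

# One check per instance per minute.
MINUTES_PER_DAY = 24 * 60

service_instances = FUNCTIONAL_INSTANCES + NONFUNCTIONAL_INSTANCES
service_checks_per_day = service_instances * MINUTES_PER_DAY
total_checks_per_day = service_checks_per_day + SYSTEM_LEVEL_CHECKS_PER_DAY

print(service_instances)        # 568
print(service_checks_per_day)   # 817920
print(total_checks_per_day)     # 910080
```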


Wells stated that a microservice-based application makes the challenges of monitoring worse.

In order to adapt to the challenges of monitoring a microservices-based application, Wells suggested a three-pronged approach: build a system that can be supported; concentrate on "stuff that matters"; and cultivate alerts and the information they contain.

In order to build a system that can be supported, log aggregation and monitoring are essential. Log aggregation is required due to the volume of services and potential latency introduced via communication over a network, which means that logs may go missing or get increasingly delayed. This in turn means that log-based alerts may miss issues, particularly time sensitive issues. Effective log aggregation requires a method to find all related logs, and accordingly the FT team use transaction id for correlation.
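As a rough illustration of correlating aggregated logs by transaction id, the sketch below groups log lines from several services by the id they carry. The log format, the `transaction_id=` key and the service names are assumptions made for this example, not the FT's actual conventions.

```python
import re
from collections import defaultdict

# Hypothetical log format: each line carries "transaction_id=<id>".
TRANSACTION_ID_PATTERN = re.compile(r"transaction_id=(\S+)")


def group_by_transaction_id(log_lines):
    """Group aggregated log lines by the transaction id they carry."""
    grouped = defaultdict(list)
    for line in log_lines:
        match = TRANSACTION_ID_PATTERN.search(line)
        if match:
            grouped[match.group(1)].append(line)
    return dict(grouped)


# Lines as they might arrive, interleaved, from several services.
aggregated_logs = [
    "publisher   transaction_id=tid_abc123 received publish request",
    "transformer transaction_id=tid_abc123 transformed article",
    "publisher   transaction_id=tid_def456 received publish request",
    "writer      transaction_id=tid_abc123 ERROR failed to write to store",
]

grouped_logs = group_by_transaction_id(aggregated_logs)

# Every log line for one request, across all services.
for line in grouped_logs["tid_abc123"]:
    print(line)
```

With the id attached at the edge of the system and passed along on every call, a single search returns the full path of a failed request.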

When implementing monitoring, traditional tooling like Nagios is often limited, as it does not provide a 'service-level' view, and the default (infrastructure) checks include things that cannot be fixed. In a microservices-based system, monitoring should be at the service and VM level. Monitoring needs to be aggregated and made visual, and the FT technical team utilise a custom framework named SAWS (built by Silvano Dossan) and Dashing. There is also extensive use of graphing via Graphite and Grafana.
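To make the idea of a service-level view concrete, here is a minimal sketch that rolls instance-level check results up into one status per service. The three status names and the input shape are assumptions for illustration; this is not how SAWS or Dashing work internally.

```python
from collections import defaultdict


def service_level_status(instance_results):
    """Roll up (service, instance, healthy) tuples into one status per service."""
    per_service = defaultdict(list)
    for service, _instance, healthy in instance_results:
        per_service[service].append(healthy)

    status = {}
    for service, results in per_service.items():
        if all(results):
            status[service] = "healthy"
        elif any(results):
            # Some instances still serve traffic.
            status[service] = "degraded"
        else:
            status[service] = "down"
    return status


# Hypothetical instance-level results.
results = [
    ("publisher", "publisher-1", True),
    ("publisher", "publisher-2", False),
    ("writer", "writer-1", True),
    ("notifier", "notifier-1", False),
]

rollup = service_level_status(results)
print(rollup)
```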


FT.com Technical Team’s SAWS Aggregated Monitoring

When developing polyglot services, logging and monitoring integration must be made easy for any language that is used. The expectations, or operational contract, must be specified, and each service owner is responsible for implementing functionality to meet this requirement. For example, the FT healthcheck standard requires that every service expose a healthcheck endpoint over HTTP, 'http://service/__health', which returns a 200 if the service can run the healthcheck, and a JSON document containing multiple checks that can contain additional information but must return '"ok":true' or '"ok": false'.
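The contract described above might look something like the following sketch. Only the 200 status and the `"ok"` flag come from the description; the other field names (`name`, `checks`, `checkOutput`), the service name and the individual checks are hypothetical, and the HTTP layer is reduced to a function returning a status code and body.

```python
import json


def build_health_document(checks):
    """Run each named check and build a healthcheck document.

    `checks` maps a check name to a zero-argument callable returning True/False.
    Field names other than "ok" are illustrative assumptions.
    """
    results = []
    for name, check in checks.items():
        try:
            ok = bool(check())
            output = ""
        except Exception as error:  # a crashing check counts as failed
            ok = False
            output = str(error)
        results.append({"name": name, "ok": ok, "checkOutput": output})
    return {"name": "example-service", "checks": results}


def handle_health_request(checks):
    """Return (status_code, body) for a GET on the __health path."""
    # 200 means the healthcheck itself ran; per-check results live in "ok".
    return 200, json.dumps(build_health_document(checks))


def failing_queue_check():
    raise ConnectionError("queue unreachable")


status, body = handle_health_request(
    {"database reachable": lambda: True, "queue reachable": failing_queue_check}
)
print(status)
print(body)
```

Separating "the healthcheck ran" (the HTTP status) from "the service is healthy" (the `ok` flags) lets a dashboard distinguish an unreachable service from one that is up but reporting a failing dependency.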


FT.com microservices alert dashboard, which is powered by the dashing.io framework

A core goal of monitoring and alerting is to know about problems before clients do, and accordingly the practice of running 'synthetic requests' that mimic user behaviour is vital. If functionality relating to a key user journey is broken (for example, an FT editor cannot publish a new article), then this must be fixed immediately. Wells stated that engineers must learn to prioritise and "concentrate on the stuff that matters". The FT technical team have also created dashboards showing core client statistics, such as number of errors and response latency, but Wells stressed that it is "the end-to-end [business functionality] that matters" and "if you just want information, create a dashboard or report".
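A synthetic request can be as simple as exercising a key journey end to end and checking the result. The sketch below models a publish-then-read journey; `publish` and `read_back` are injected callables standing in for real HTTP calls, and the article id and marker content are made up for the example.

```python
def run_synthetic_publish_check(publish, read_back, article_id="synthetic-test-article"):
    """Exercise a publish-then-read journey and report whether it worked."""
    marker = "synthetic-content"
    try:
        publish(article_id, marker)
        return read_back(article_id) == marker
    except Exception:
        # Any failure along the journey means the journey is broken.
        return False


# In-memory stand-in for the real publishing system.
store = {}


def working_publish(article_id, content):
    store[article_id] = content


def broken_publish(article_id, content):
    raise TimeoutError("publish endpoint timed out")


healthy_result = run_synthetic_publish_check(working_publish, store.get)
broken_result = run_synthetic_publish_check(broken_publish, {}.get)

print(healthy_result)  # True
print(broken_result)   # False
```

Run on a schedule, a check like this fails when the journey fails, regardless of which underlying service caused the problem.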

Alerts must continually be cultivated, and if an alert is received that doesn't make sense, or does not require human interaction, it must be corrected or removed. If an issue occurs, and there was no alert, then one should be added as part of the fix. Key information must be included within each alert, for example, an overview of the business impact, the associated run book location, and corresponding transaction ids that triggered the issue.
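One way to enforce that every alert carries this key information is to model it explicitly. The sketch below is an assumption about how that could be structured; the field names, run book URL and message layout are hypothetical, not the FT's alert format.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    issue: str
    business_impact: str
    run_book_url: str
    transaction_ids: list = field(default_factory=list)

    def missing_fields(self):
        """Return the names of key fields that were left empty."""
        return [name for name, value in vars(self).items() if not value]

    def render(self):
        """Format the alert as a plain-text message."""
        return "\n".join(
            [
                f"Issue: {self.issue}",
                f"Business impact: {self.business_impact}",
                f"Run book: {self.run_book_url}",
                f"Transaction ids: {', '.join(self.transaction_ids)}",
            ]
        )


alert = Alert(
    issue="Article publish failed",
    business_impact="Editors cannot publish new articles",
    run_book_url="https://runbooks.example.com/publish",  # hypothetical URL
    transaction_ids=["tid_abc123"],
)
incomplete = Alert(issue="Disk usage high", business_impact="", run_book_url="")

print(alert.render())
print(incomplete.missing_fields())
```

An alert with missing fields can then be rejected during development rather than discovered by whoever is on call.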


Example FT.com alert with information including the issue, the impact, transaction IDs and a link to the associated run book

The FT team use dedicated 'Ops Cops' (on-call members of the development team, rotated regularly) to watch for issues with monitoring, and have integrated alerting within the team's Slack messaging system. A pre-defined list of emojis (each with a clearly stated purpose) is used to indicate when and how an issue is being managed and resolved.

Concluding the talk, Wells suggested that creating alerts should be part of the normal development workflow: "code, test, alerts". In order to ensure that the development team know if an alert stops working, tests should be added to validate the alert. The FT technical team subscribe to the philosophy of chaos testing and, inspired by Netflix's Simian Army and Chaos Monkey, they have created a 'Chaos Snail' (which is "smaller than a monkey, and written in Bash shell"!). Wells cautioned that proactivity is required when maintaining and dealing with alerts in a non-trivial system, and out-of-date information can be worse than none at all. Automate updates wherever possible, and find ways to share what is changing.
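Testing an alert can follow the same pattern as testing any other code. The sketch below assumes a hypothetical rule that fires after three consecutive failed checks, and verifies both that it fires when it should and that it stays quiet on a single blip; the rule, threshold and message are illustrative, not taken from the talk.

```python
def evaluate_alert(check_results, notify, threshold=3):
    """Call `notify` once if the most recent `threshold` results all failed."""
    recent = check_results[-threshold:]
    if len(recent) == threshold and not any(recent):
        notify("publish journey failing")
        return True
    return False


def test_alert_fires_on_sustained_failure():
    sent = []
    assert evaluate_alert([True, False, False, False], sent.append) is True
    assert sent == ["publish journey failing"]


def test_alert_stays_quiet_on_single_blip():
    sent = []
    assert evaluate_alert([True, True, False], sent.append) is False
    assert sent == []


test_alert_fires_on_sustained_failure()
test_alert_stays_quiet_on_single_blip()
print("alert tests passed")
```

If a later change breaks the rule or the notification path, the tests fail in the build rather than the alert silently going missing in production.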

The slides for Sarah Wells’ QCon London talk, "Avoiding Alerts Overload From Microservices" can be found on Speaker Deck, and the video can be found on InfoQ.

About the Author

Daniel Bryant is leading change within organisations and technology. His current work includes enabling agility within organisations by introducing better requirement gathering and planning techniques, focusing on the relevance of architecture within agile development, and facilitating continuous integration/delivery. Daniel’s current technical expertise focuses on ‘DevOps’ tooling, cloud/container platforms and microservice implementations. He is also a leader within the London Java Community (LJC), contributes to several open source projects, writes for well-known technical websites such as InfoQ, DZone and Voxxed, and regularly presents at international conferences such as QCon, JavaOne and Devoxx.

 
