Luke Demi, software engineer at Coinbase, writes about the changes in monitoring and logging that have taken place at Coinbase since mid-2018. Coinbase moved from a self-managed Elasticsearch cluster that served the dual purpose of log analysis and metrics visualization, to Datadog for metrics collection and managed Elasticsearch on AWS for log aggregation.
Coinbase, which is hosted on AWS, previously self-managed an Elasticsearch cluster that was shared across engineering and other functional teams. Custom application metrics were collected by parsing the standard output of applications. Amazon CloudWatch does expose OS metrics, but these would have appeared in a separate dashboard from the custom metrics.
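To illustrate what collecting metrics by parsing standard output can look like, here is a minimal Python sketch. The article does not describe Coinbase's log format or parser, so the JSON line shape and the `metric` and `value` field names are assumptions made purely for illustration.

```python
import json
from collections import defaultdict


def extract_metrics(lines):
    """Sum numeric values per metric name from application output lines.

    Assumes (hypothetically) that metrics are emitted as JSON objects with
    "metric" and "value" fields; any other line is treated as ordinary log
    output and skipped.
    """
    totals = defaultdict(float)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text log line, not a metric
        if not isinstance(record, dict):
            continue  # valid JSON, but not an object
        name = record.get("metric")
        value = record.get("value")
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if isinstance(name, str) and is_number:
            totals[name] += value
    return dict(totals)
```

A pipeline built this way ties metrics to the logging path, so any problem with log ingestion also affects metrics.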
During the 2017 cryptocurrency boom, Coinbase saw increased traffic, which in turn increased the volume of log data its applications generated. The Elasticsearch cluster was unable to scale to the new load. Load-related failures required a restart of the cluster, and the cluster was vulnerable to expensive queries. It was also impossible to collect diagnostic data unless the cluster was restarted.
The team discovered that the largest applications consumed the most Elasticsearch capacity. They were forced to reduce the log retention period to handle the increased amount of data. Demi writes that alerting solutions that work off Elasticsearch, such as ElastAlert and Watcher, "don't allow engineers to interactively build alerts" and were difficult to integrate with third-party tools.
The Coinbase engineering team needed two things: a scalable Elasticsearch setup that could serve different use cases simultaneously, and a system for metrics collection, monitoring and alerting. For logs, they opted for AWS’s managed Elasticsearch service, logically sharded by use case, with an nginx proxy in front. Coinbase’s previous cluster depended on X-Pack, a commercial offering from Elastic, for authentication and authorization, and X-Pack is not available with AWS’s solution. The proxy routes each incoming request to the correct Kibana dashboard, and also performs authentication against Coinbase’s internal single sign-on (SSO) service.
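The proxy's two jobs, authentication and routing by use case, can be sketched as a small decision function. This is not Coinbase's nginx configuration, which the article does not show. The hostnames, the `BACKENDS` table, and the idea that the SSO check yields a user name are all hypothetical, and Python is used only to make the logic explicit.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from requested hostname to the Kibana instance
# serving that use case. Real names and ports are not given in the article.
BACKENDS = {
    "logs-app.example.internal": "http://kibana-app.example.internal:5601",
    "logs-security.example.internal": "http://kibana-security.example.internal:5601",
}


@dataclass(frozen=True)
class Decision:
    status: int                     # HTTP status the proxy would answer with
    upstream: Optional[str] = None  # backend to forward to when status is 200


def route(host: str, sso_user: Optional[str]) -> Decision:
    """Decide what to do with a request.

    `sso_user` stands in for the result of an SSO check: None or an empty
    string means the request is unauthenticated.
    """
    if not sso_user:
        return Decision(status=401)  # authenticate before anything else
    upstream = BACKENDS.get(host.lower())
    if upstream is None:
        return Decision(status=404)  # no cluster registered for this use case
    return Decision(status=200, upstream=upstream)
```

Checking authentication before the routing lookup means an unauthenticated caller cannot learn which use cases exist.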
The team deployed Datadog for metrics collection and alerting. They added several security barriers: building their own Docker container instead of using the provided one, letting projects choose whether they want Datadog, and running it on a different network bridge from the other containers on the host. The team also wrapped the standard Docker socket with a proxy to prevent the agent from accessing environment variables.
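One way a Docker socket proxy could keep environment variables away from the agent is to redact them from container inspect responses before passing those responses on. The article does not say how Coinbase's proxy works, so the following is only a sketch of that idea. It relies on the Docker Engine API returning a container's environment under `Config.Env`; the function name `redact_env` is hypothetical.

```python
import copy


def redact_env(inspect_response: dict) -> dict:
    """Return a copy of a container inspect response with Config.Env emptied.

    The input is left unmodified so the proxy can still log or audit the
    original response if it needs to.
    """
    redacted = copy.deepcopy(inspect_response)
    config = redacted.get("Config")
    if isinstance(config, dict) and "Env" in config:
        config["Env"] = []  # drop every variable, not just known secrets
    return redacted
```

Emptying the whole list is a deliberately blunt choice: it avoids having to guess which variable names hold secrets.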
The Coinbase team defined service level indicators (SLIs) around the new metrics, with the relevant time series graphs for each. The new toolset has increased reliability and decreased the maintenance overhead of their logging pipeline, according to Demi. However, it is not clear how logs and metrics are correlated, since they are collected by different third-party services.