Phil Calcado on Lessons Learnt During SoundCloud's Microservice Migration

At QCon London 2015 Phil Calcado shared lessons learnt from SoundCloud’s move from a monolithic to microservices architecture, and stated that the core requirements for building a microservice platform include developing capabilities for rapid provisioning, basic monitoring and rapid application deployment.

Calcado, director of core engineering at SoundCloud, began by sharing details of SoundCloud’s “hypergrowth” period between 2011 and now, during which the number of people using the platform per month has grown from 3 million to approximately 300 million. SoundCloud is the largest audio repository online, with 11 hours of audio uploaded to the platform every minute, and users can consume content using a variety of devices, both mobile and desktop-based. SoundCloud was initially developed as a monolithic Ruby on Rails application, but in 2011 the decision was made to embrace a microservice architectural style.

Over the past year much has been written about microservices, including Martin Fowler’s “MicroservicePrerequisites” article. Calcado stated that when he read this article he began to feel nervous, but was initially unsure why. In 2011 the SoundCloud team had decided to prepare the platform for the implementation of a microservice architecture. They were conscious of a potential “microservice explosion”, where the number of new microservices being developed could become unmanageable, particularly from an operational perspective.

Calcado explained that the SoundCloud application platform was born “cloud native”, and initially a large part of their workload was run on Amazon’s AWS platform. All audio and transcoding was undertaken in AWS, but the SoundCloud application was run in a series of private data centers located in Amsterdam. Calcado stated that during 2010 and 2011 one of the most popular public application deployment platforms was Heroku, and accordingly it was used as design inspiration during the SoundCloud microservice planning. The SoundCloud team embraced the concepts behind Heroku’s 12factor.net guidelines for building PaaS-based applications, and utilised LXC as a deployment mechanism and Doozer for platform coordination.

Calcado suggested that although the initial SoundCloud architecture was much better than anything else available within their platform at the time, there were problems, primarily the creation of “noisy neighbours” in their own data centers. This was because SoundCloud’s use of LXC had no resource limits (i.e. no cgroups) and implemented a naive scheduling policy.

At the same time as SoundCloud were working on their new platform, the open source community was also developing similar solutions, such as Mesos, Docker and Kubernetes. Several of these products were being backed by the likes of Twitter, AirBnB and Google, which meant development of the software was occurring rapidly. SoundCloud were initially tempted to change their platform to utilise these public products, but decided to simplify their current solution (and also let the industry stabilise around container-based solutions) before looking to perform any migration.

On monitoring, Calcado argued that the state of telemetry tools was not good during the period of 2011 to 2012. SoundCloud had initially chosen to leverage StatsD, Graphite, Nagios and Pagerduty. This was a workable solution, but Graphite was the weakest component in the toolchain, as the tool offered a limited query language, was slow and consumed large amounts of disk space.

The SoundCloud team members who had arrived at the company from working with other large-scale systems were also not happy with the metrics push model. Ultimately this led SoundCloud to build their own monitoring system, Prometheus, which has been released as open source. Prometheus utilises a pull model for collecting metrics, and the data is sent to Icinga and Pagerduty for monitoring and alerting.
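The difference between push and pull collection can be illustrated with a minimal sketch. This is not Prometheus itself or its client library: the `Counter` class, the `/metrics` path, the `tracks_played_total` metric and the helper names are assumptions made for illustration, using only the Python standard library. The service keeps its counters in memory and exposes them over HTTP, and the collector decides when to fetch them.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """A monotonically increasing metric held in the service's memory."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount=1):
        with self._lock:
            self._value += amount

    def render(self):
        # Plain-text lines in the style of a text exposition format.
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )


def make_handler(counters):
    """Build a request handler that serves the given counters on /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = "".join(c.render() for c in counters).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the example quiet

    return MetricsHandler


def scrape(url):
    """The collector's side: fetch current values on the collector's schedule."""
    with urllib.request.urlopen(url, timeout=2) as response:
        return response.read().decode("utf-8")


def parse_samples(text):
    """Turn exposition text into a {metric_name: value} dictionary."""
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


if __name__ == "__main__":
    played = Counter("tracks_played_total", "Tracks played (hypothetical metric).")
    played.inc(3)
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), make_handler([played]))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    print(parse_samples(scrape(f"http://127.0.0.1:{port}/metrics")))
    server.shutdown()
```

In a push model the service would instead send each measurement to the collector as it happens; with pull, a service that stops answering scrapes is itself a useful signal.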

Calcado stated that it can be difficult to determine what has broken with a microservice architecture, in comparison with a monolithic application. A core lesson learned was the need to standardise monitoring dashboards. Currently in the SoundCloud architecture an application’s dashboard configuration is included as a JSON file within the associated code repository. This allows the team responsible for the service to implement a dashboard that meets the monitoring requirements.
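Keeping the dashboard definition in the repository also makes it possible to check it automatically, for example in the build. The sketch below assumes a hypothetical JSON layout (a `panels` list whose entries carry an `id`) and a hypothetical set of standard panels; the talk did not describe SoundCloud's actual schema.

```python
import json

# Assumed set of panels every service dashboard should provide (illustrative only).
REQUIRED_PANELS = {"request_rate", "error_rate", "latency"}

# Hypothetical dashboard file as it might sit next to the service's code.
EXAMPLE_DASHBOARD = """
{
  "service": "example-service",
  "panels": [
    {"id": "request_rate", "title": "Requests per second"},
    {"id": "latency", "title": "Response time"}
  ]
}
"""


def missing_panels(dashboard_json):
    """Return the standard panel ids that the dashboard does not define."""
    dashboard = json.loads(dashboard_json)
    present = {panel["id"] for panel in dashboard.get("panels", [])}
    return sorted(REQUIRED_PANELS - present)


if __name__ == "__main__":
    print(missing_panels(EXAMPLE_DASHBOARD))
```

A check like this could fail the build when a service omits a standard panel, while still leaving each team free to add panels of its own.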

An application’s operational functionality must also be exposed in a standard way, allowing monitoring data to be easily accessed, providing the capability to shut down an application, or exposing the ability to trip downstream circuit-breakers. Application platforms such as twitter-server provide examples of such functionality exposed through an HTTP interface. Calcado stated that in addition to a standardised approach to operational functionality, production-level incident resolution plans must include escalation policies all the way up to management teams. This encourages management to prioritise work that will prevent production-level on-call activity.
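As a rough illustration of what tripping a downstream circuit-breaker means, the sketch below implements a minimal breaker with a manual `trip()` method that an operational endpoint could call. The class, its parameters and its thresholds are assumptions for illustration; this is not the twitter-server or Finagle API.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timing can be tested
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def trip(self):
        """Open the breaker on demand, e.g. from an operational endpoint."""
        self._opened_at = self._clock()

    def reset(self):
        """Close the breaker and forget earlier failures."""
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        """Run func through the breaker, failing fast while it is open."""
        if self.state == "open":
            raise CircuitOpenError("downstream call rejected: breaker is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in the half-open state re-opens immediately.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self.trip()
            raise
        self.reset()
        return result
```

An operator who sees a downstream dependency misbehaving can call `trip()` to make callers fail fast instead of queueing up requests; after `reset_timeout` seconds a single trial call decides whether the breaker closes again.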

For the last of the three initial topics, Calcado argued that it is essential to create a reliable build pipeline for continuous delivery of microservices to production. The build and deployment process must also be standardised across services. Although not all of the SoundCloud services run via LXC, the use of containers is advantageous, and allows the spawning of a “mini soundcloud” on a development laptop. This is beneficial for working with dependent services in a microservice platform.

When summarising the talk, Calcado answered the question posed at the beginning of the presentation: why was he so nervous about the details of Martin Fowler’s microservice prerequisites article?

It was the first moment I realised we messed up - there are simple and incremental ways to address the issues of rapid provisioning, basic monitoring and rapid application deployment, without building your own systems

However, Calcado was comforted by a recent conversation with Adrian Cockcroft, who worked as the Cloud Architect at Netflix during their implementation of a service-based architecture, in which the SoundCloud team wondered aloud why they couldn’t create a perfect system on the first attempt. Calcado paraphrased Cockcroft’s response:

Uh? Do you think Netflix got it right first time? [...] We broke everything until we found a way to get around the problems. We only published the good things.

Calcado concluded by stating that many good things have resulted from the SoundCloud implementation of a microservice platform. In particular, he highlighted the Prometheus monitoring system, which is an essential tool for diagnosing production issues in a large-scale system; the need for a set of guidelines (“not rules”) for microservice application development, such as the 12factor.net application guidelines; and the need to build upon solid foundations for each service, such as those provided by Twitter’s Finagle application platform.

Slides from Phil Calcado’s “No Free Lunch, Indeed: Three Years of Micro-services at SoundCloud” talk can be found on the QCon London 2015 schedule web page.
