Netlify's infrastructure team shared the story of how they increased customer deployment speeds by up to 2x by optimizing their deployment algorithm, improving observability into their systems in the process.
Netlify offers hosting and backend services for web applications and static websites. It generates static files from user-provided content and serves them over its content delivery network, which spans multiple cloud vendors. The team ran into performance issues for large deployments, on the order of 50,000 files or more, due to recursive querying of their internal data structures stored in MongoDB. The infrastructure team optimized this, and as part of the exercise also improved observability into their systems.
Under the hood, Netlify's core mechanism for deploying user content is built around Merkle trees. A Merkle tree is a data structure in which each leaf node is labeled with the hash of some data, and every non-leaf node is labeled with the hash of the labels of its child nodes. The hash of a file's content is used as the file's name in the tree, letting Netlify's backend detect changes in content. A deployment "calculates those hashes and generates a new tree" based on the changes and the existing content. The tree, stored internally in MongoDB, used to be created upfront when a deployment was initiated by the customer. Subsequently, the tree was scanned and updated as the customer's files were uploaded in order to track the deployment's progress. This caused performance issues due to lock contention in MongoDB, especially for large deployments.
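The Merkle-tree idea described above can be sketched in a few lines of Python. This is an illustrative, minimal implementation, not Netlify's actual code: leaf labels are the content hashes of the files, and each level is folded into its parents by hashing the concatenated child labels, so a change to any file changes the root.

```python
import hashlib


def sha256(data: bytes) -> str:
    """Hex digest of SHA-256, used for both leaf and internal labels."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaf_hashes: list[str]) -> str:
    """Fold leaf labels into a single root label.

    Each non-leaf node's label is the hash of its children's labels,
    so any change in a leaf propagates all the way up to the root.
    """
    if not leaf_hashes:
        return sha256(b"")
    level = leaf_hashes
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            # Pair up nodes; an odd node at the end hashes alone.
            pair = level[i:i + 2]
            nxt.append(sha256("".join(pair).encode()))
        level = nxt
    return level[0]


# Files are identified by the hash of their content, so an unchanged
# file produces the same leaf and does not need to be re-uploaded.
files = {"index.html": b"<h1>Hello</h1>", "app.js": b"console.log(1)"}
leaves = sorted(sha256(content) for content in files.values())
root = merkle_root(leaves)
```

Because only changed content produces new hashes, a deployment only has to upload the files whose leaves differ from the previous tree.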
To improve performance, the first step was to add instrumentation for better visibility into the process. This revealed operations that were being repeated for each file upload. The team decided to separate the deployment's progress tracking from the core tree. They also stopped creating the entire tree upfront, instead storing the information separately and creating the final tree at the end.
InfoQ reached out to Ingrid Epure, senior software engineer at Netlify, to understand more.
The team rolled the new feature out to production with some initial hiccups. They had focused on an extensive list of tests and on telemetry to ensure confidence and visibility. In response to the initial issues, they added further checkpoints and iterated on them until they were confident in the rollout. Epure says these checkpoints include using telemetry (metrics collection) to accurately pinpoint issues, and leaving builds in a final stable state with additional guardrails.
Image courtesy: https://www.netlify.com/blog/2020/05/05/what-netlifys-infrastructure-team-learned-as-it-increased-deploy-speed-by-up-to-2x/ (used with permission)
Epure says that their monitoring is centered around "a combination of structured events, traces and metrics to answer questions about the behaviour of our systems". She elaborates:
The reality of a large scale distributed system is that you will be unable to anticipate all the ways in which it will fail, so the majority of questions engineers will encounter will tend towards unknown unknowns. For these types of scenarios, you need more than metrics-driven traditional monitoring - which relies heavily on predicting how a system may fail, and checks for those pre-defined failures. We use metrics for alarms, historical trends, and getting simple answers for a set of pre-established questions - for example, how many requests a second did service X receive. We also use events and traces to break down information by a number of interest points and drive behavioural conclusions.
Netlify uses two third-party services for event aggregation, distributed tracing, monitoring, and alerts. Beyond general observability into their systems, monitoring also plays a key role in deployment rollouts. Epure explains in more detail:
When issues happen, it is key to be able to understand the cause and mitigate them in a timely manner. In a multidimensional distributed system, which is the case of our customers' builds, we rely heavily on being able to aggregate events across the services that run a customer build - Netlify Server, Buildbot and our API - in such a way that the data tells a story and allows us to answer questions.
She adds that, compared to before, this has improved with stronger telemetry and the team's experience in using it.
Netlify's continuous deployment system runs on Kubernetes, as does each customer build. Each build runs in its own pod, which is terminated after the build completes.