The New York Times Engineering Team wrote about their approach to scaling and incident management against the backdrop of increased traffic during the November 2018 US midterm elections.
In the run-up to the US midterm elections in November 2018, the team had to prepare for additional traffic across its mobile and desktop platforms, its digital subscriber base having almost doubled since 2016. The platform had been rearchitected in 2017, a move that included transitioning from on-premises infrastructure to the cloud. This decentralized both infrastructure and incident management, with each team responsible for its own, although inter-team dependencies still had to be managed.
The key challenge was to ensure that the platform could serve real-time results and to coordinate incident handling across different teams. They tackled this with a multi-step process: an architectural and operational review, stress testing, and a dedicated group to oversee and ensure collaboration between the teams. InfoQ reached out to Prashanth Sanagavarapu, director of engineering, delivery engineering and SRE, and Vinessa Wan, principal program manager, delivery engineering and SRE at The New York Times, to understand more about the process.
An "election leadership team" was first formed which acted as "the intermediary between our SRE, company stakeholders and the engineering teams." Sanagavarapu and Wan explained how the SRE model works at New York Times:
We have a dedicated group of SREs in our Delivery Engineering group. Delivery Engineering aims to improve business velocity for product engineering teams by building tooling and automation, defining processes and partnering with teams through engagements. The SRE team uses the engagement model in working with teams for larger projects and goals, and also provides ongoing guidance to teams, either through facilitating Production Readiness Reviews (PRRs) or educating teams on things like incident management.
The architecture reviews revealed how well individual teams were covered in terms of monitoring and recovery from outages. Their monitoring stack is a mix of tools managed by their cloud service providers and self-managed ones like Prometheus and Grafana, integrated with email and Slack for alerting. Sanagavarapu and Wan shared an overview of their monitoring philosophy and setup:
The Delivery Engineering team provides guidance and automation for applications to set baseline/default monitoring with tools provided by the public cloud provider at the time of project creation. Our application development teams do set SLOs & SLIs to monitor service reliability and quality. Teams rely on monitoring and alerting for managing applications. While we are working on a centralized visualization tool that will provide shared views of reliability, teams use some common metrics and others that are specific to their systems.
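To make the SLO-driven alerting concrete, a minimal sketch along those lines is shown below: it queries a Prometheus HTTP API for a request error ratio and posts to a Slack incoming webhook when an assumed 1% error budget is breached. The Prometheus URL, metric names, threshold and webhook are illustrative placeholders, not the Times' actual configuration.

```python
"""Minimal sketch of an SLI check against a Prometheus-style monitoring stack.

The Prometheus URL, metric names, error-budget threshold and Slack webhook
are illustrative assumptions, not the Times' configuration.
"""
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.internal:9090"  # hypothetical
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
ERROR_BUDGET = 0.01  # assumed SLO: at most 1% of requests may fail

# PromQL: ratio of 5xx responses to all responses over the last 5 minutes.
QUERY = (
    'sum(rate(http_requests_total{code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total[5m]))'
)


def current_error_ratio() -> float:
    """Query the Prometheus HTTP API for the current error ratio."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def alert(ratio: float) -> None:
    """Post a short alert message to a Slack incoming webhook."""
    requests.post(
        SLACK_WEBHOOK,
        json={"text": f"Error ratio {ratio:.2%} exceeds SLO of {ERROR_BUDGET:.0%}"},
        timeout=10,
    )


if __name__ == "__main__":
    ratio = current_error_ratio()
    if ratio > ERROR_BUDGET:
        alert(ratio)
```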
To test how their systems stood up to high traffic, the team ran load tests against their production environment, keeping a constant lookout for genuine increases in traffic so they could stop the tests if required; a rough sketch of such a guarded test follows the quote below. General approaches to such load testing include shadow traffic generation, which replicates incoming traffic and is usually run against non-production environments. Given that The New York Times had migrated to the cloud, did they leverage autoscaling for their services? Sanagavarapu and Wan explain:
Our scaling needs are very spiky at times like when an article goes viral or breaking news event occurs. We do have auto scaling for our systems but they do have limits. Currently, we tend to scale our systems a bit over provisioned for anticipated traffic. Our teams also rely on common traffic patterns we learned over time and scale our systems beforehand.
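The guarded production load testing described above can be sketched roughly as follows: synthetic traffic is generated in batches, and the run aborts if organic traffic spikes, for example during a breaking news event. The target URL, concurrency, batch count and the organic-traffic check are assumptions for illustration, not the Times' tooling.

```python
"""Rough sketch of a guarded load test against production.

The target URL, concurrency, batch count and the organic-traffic check are
assumptions for illustration only.
"""
import concurrent.futures
import time

import requests

TARGET_URL = "https://example.com/elections/results"  # placeholder endpoint
WORKERS = 50              # synthetic requests per batch (assumed)
BATCHES = 20              # number of batches to run (assumed)
ORGANIC_RPS_LIMIT = 5000  # abort if real traffic exceeds this rate (assumed)


def organic_requests_per_second() -> float:
    """Placeholder: in practice this would query the monitoring stack for the
    rate of real user traffic, excluding tagged test traffic."""
    return 1200.0  # stub value so the sketch runs standalone


def hit_endpoint(_: int) -> int:
    """Send one synthetic request, tagged so it can be filtered from real traffic."""
    resp = requests.get(TARGET_URL, headers={"X-Load-Test": "true"}, timeout=5)
    return resp.status_code


if __name__ == "__main__":
    for batch in range(BATCHES):
        # Guardrail: stop the test if organic traffic spikes, e.g. breaking news.
        if organic_requests_per_second() > ORGANIC_RPS_LIMIT:
            print("Real traffic spike detected, stopping load test.")
            break
        with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
            codes = list(pool.map(hit_endpoint, range(WORKERS)))
        errors = sum(1 for code in codes if code >= 500)
        print(f"batch {batch}: {errors}/{WORKERS} server errors")
        time.sleep(1)
```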
Based on the readiness assessment, teams, grouped by tech stack and dependencies, worked out of conference rooms, interacting via video and messaging tools. Inter-team dependencies comprise "internal API dependencies and external cloud provider services"; the former are "defined by an end user experience or business workflows", say Sanagavarapu and Wan. Grouping teams with known dependencies allowed them to test the end-to-end reliability of business workflows and end-user experiences, and ultimately to handle the increased traffic.
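An end-to-end check over such a business workflow can be approximated with a script that walks a chain of dependent endpoints and fails fast when any link breaks or exceeds a latency budget. The endpoints and budget below are hypothetical placeholders rather than the Times' actual services.

```python
"""Sketch of an end-to-end check over a chained business workflow.

The endpoints and latency budget are hypothetical stand-ins for internal
API dependencies; they are not the Times' actual services.
"""
import time

import requests

# A hypothetical "live election results" workflow; each step depends on the previous one.
WORKFLOW = [
    ("home page", "https://example.com/"),
    ("results API", "https://example.com/api/elections/2018/results"),
    ("race detail page", "https://example.com/elections/2018/senate"),
]

LATENCY_BUDGET_S = 1.0  # assumed per-step latency budget


def check_workflow() -> bool:
    """Walk the workflow end to end, failing fast on the first broken dependency."""
    for name, url in WORKFLOW:
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException as exc:
            print(f"FAIL {name}: {exc}")
            return False
        elapsed = time.monotonic() - start
        ok = resp.status_code < 500 and elapsed <= LATENCY_BUDGET_S
        print(f"{'OK' if ok else 'FAIL'} {name}: {resp.status_code} in {elapsed:.2f}s")
        if not ok:
            return False
    return True


if __name__ == "__main__":
    raise SystemExit(0 if check_workflow() else 1)
```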