At this year's QCon London conference, Pablo Jensen, CTO at Sportradar, a sports data service provider, talked about the practices and procedures in place at Sportradar to ensure their systems meet expected resiliency levels (PDF of slides). Jensen noted that reliability is influenced not only by technical concerns but also by organizational structure, governance, and client support, and that it requires ongoing effort to improve continuously.
One of the technical practices Sportradar employs is regular failover testing in production (a kind of Chaos Engineering). Their fail-fast strategy is tested at the individual service level, at cluster level, and even for an entire datacenter. The latter is possible because, as Jensen stressed, production environments are created and run exactly the same way across all datacenters. From a client's point of view, the datacenters act as a single point of contact, while internally workloads can be allocated (or moved) to any live datacenter. Applications know as little as needed about the infrastructure they run on and can be deployed the same way on-premises and in the cloud (AWS), although the bulk of the work is done in Sportradar's own datacenters for cost reasons.
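The talk did not include code, but the fail-fast idea can be illustrated with a minimal sketch: a client that gives up quickly on an unresponsive datacenter and tries the next one instead of hanging. The endpoints, timeouts, and class names below are hypothetical assumptions, not Sportradar's actual setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

public class FailFastClient {

    // Hypothetical endpoints, one per datacenter; names are illustrative only.
    private static final List<URI> DATACENTERS = List.of(
            URI.create("https://dc1.example.internal/feed"),
            URI.create("https://dc2.example.internal/feed"));

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(500))   // fail fast when connecting
            .build();

    /** Try each datacenter in turn; give up quickly rather than blocking the caller. */
    public String fetch() throws Exception {
        for (URI endpoint : DATACENTERS) {
            try {
                HttpRequest request = HttpRequest.newBuilder(endpoint)
                        .timeout(Duration.ofSeconds(1)) // fail fast waiting for a response
                        .GET()
                        .build();
                return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception e) {
                // this datacenter is unhealthy or slow; fall through to the next one
            }
        }
        throw new IllegalStateException("no datacenter responded in time");
    }
}
```

Short, explicit timeouts are what make failover tests in production meaningful: a call that hangs indefinitely would otherwise hide whether the fallback path to another datacenter actually works.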
Other common resiliency strategies employed at Sportradar include circuit breakers and request throttling, handled by Netflix's Hystrix library. Jensen also mentioned decoupling live databases from the data warehouse so that reporting and data analysis cannot impact live customers.
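As an illustration of the circuit-breaker pattern mentioned above, a Hystrix command wraps a remote call and supplies a fallback that is used when the call fails, times out, or the circuit is open. The command, group, and method names below are hypothetical, not Sportradar's code.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps a call to a downstream odds service behind a Hystrix circuit breaker.
public class OddsFeedCommand extends HystrixCommand<String> {

    private final String matchId;

    public OddsFeedCommand(String matchId) {
        // Commands in the same group share circuit-breaker and load statistics.
        super(HystrixCommandGroupKey.Factory.asKey("OddsFeed"));
        this.matchId = matchId;
    }

    @Override
    protected String run() throws Exception {
        // The actual remote call; it may throw, time out, or be short-circuited.
        return fetchOddsFromRemoteService(matchId);
    }

    @Override
    protected String getFallback() {
        // Returned when run() fails or the circuit is open, keeping the caller responsive.
        return "odds-temporarily-unavailable";
    }

    private String fetchOddsFromRemoteService(String matchId) throws Exception {
        // Placeholder for the real HTTP/RPC call.
        return "odds-for-" + matchId;
    }
}
```

A caller would invoke `new OddsFeedCommand("match-42").execute()`; because Hystrix also bounds how many executions of a command group can run concurrently, the same mechanism doubles as a form of request throttling.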
In terms of governance, Sportradar puts a strong emphasis on managing dependencies and their impact (issues with third-party providers are still their #1 source of incidents), for example by classifying external service providers into three categories of accepted risk, sketched in code after the list below:
- "single-served" for non-critical services provided by a single vendor;
- "multi-regional" for a single vendor that offers some levels of redundancy (such as AWS availability zones);
- "multi-vendor" for critical services that require strong redundancy, and single vendor dependency is not acceptable.
According to Jensen, expanding the infrastructure to Google Cloud Platform has been on the cards in order to further reduce risk (moving the cloud infrastructure service from "multi-regional" to "multi-vendor"). Further, accepting that single-vendor services might fail means that dependent internal services must be designed and tested to cope with those failures. This focus on risk management also manifests internally: each business area is served by its own technical stack, hosted independently on redundant infrastructure.
With 40+ IT teams allocated to specific business areas, Sportradar also needed to set up some governance around software architecture and lifecycle. Before a new development starts, it must pass a "fit for development" gate, with agreed architecture, security, and hosting guidelines in place. Perhaps more importantly, deployment to production must pass a "fit for launch" gate to ensure marketing and client support teams are aware of the changes and ready for them.
Services are still improved after launch, as IT teams must follow a "30% rule" whereby 30% of their time is allocated to improving the stability and operability of current services, as well as improving existing procedures (such as on-call or incident procedures). Jensen highlighted the importance of iterating over established procedures, improving them, and regularly communicating and clarifying them (not following procedures correctly is still their #4 contributor to incidents).
In terms of organizational structure, aligning IT (product) teams with business areas has worked well, with centralized IT and security teams providing guidance and oversight rather than executing the work themselves. For example, security development guidelines were defined by the centralized team and iterated upon for a period of three months as the first product teams to follow them provided feedback on what worked and what did not. Only then were the guidelines rolled out to all product teams.
Finally, each service must have an on-duty team assigned before being launched. On-duty teams provide second-level technical support - roughly 0.5% of all 110,000+ client requests per year escalate to this level - throughout the entire service's lifetime. As Jensen stressed, only the best (and highest-paid) engineers in the organization work in these teams, promoting a culture of client focus and service ownership. Clients are kept in the loop on any open non-trivial incident, which is followed by a postmortem once it is closed. Jensen added that clients appreciate this level of transparency.
Additional information on the talk can be found on the QCon London website, and the video of the talk will be made available on InfoQ over the coming weeks.