Cloud applications promise high availability and accessibility to their users, but achieving that requires a disaster recovery plan. The team behind InfluxDB shared at KubeCon EU the lessons they learned from battle-testing their disaster recovery strategy on the day they deleted production.
High availability, redundancy, and business continuity should be familiar terms to anyone working in software development. The acceleration of digital transformation means that most businesses now operate on a 24/7 schedule. Even when we are inclined to believe that the products we build are indestructible, the words of Rick Spencer, VP of Product at InfluxDB, ring as true as they get: "A production level disaster will hit sooner or later; make sure you are prepared for it."
InfluxDB provides a platform for building time series (time-stamped data) applications, meaning that it serves as the data backbone for many software applications. During their talk, Spencer and Wojciech Kocjan, a platform engineer at InfluxDB, walked through the sequence of events from the day they deleted production and the lessons learned from it.
Landscape
The company runs a Kubernetes-based, partially stateful application. The storage engine uses persistent volume claims (PVCs) configured to retain the underlying volumes, along with an object store, for persistence. Most microservices are stateless or use managed databases. Following the GitOps approach, everything is stored in Git, with separate repositories for code and configuration. ArgoCD is used to deploy all instances of InfluxDB Cloud.
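That retain setting is the property that later made recovery possible. As a minimal sketch, not InfluxDB's actual tooling, a check written with the official Kubernetes Python client could warn when a persistent volume would be destroyed together with its claim:

```python
# Minimal sketch (assumed tooling, not InfluxDB's): verify that persistent
# volumes keep their data when the claims referencing them are deleted.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for pv in v1.list_persistent_volume().items:
    policy = pv.spec.persistent_volume_reclaim_policy
    if policy != "Retain":
        # A "Delete" policy means losing the PVC also destroys the volume.
        print(f"WARNING: {pv.metadata.name} uses reclaim policy {policy}")
```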
How the Damage was Done
Because all of the ArgoCD applications were defined in a single, massive configuration file, a difficult-to-spot name collision was accidentally introduced. Once the change was targeted for deployment, ArgoCD reacted by deleting and reinstalling the cluster; the faulty name targeted the "central-eu-cluster".
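The exact mechanics depend on the team's setup, but a hypothetical, simplified rendering shows how such a collision can hide in plain sight: when two documents in one file share a metadata.name, tooling that indexes the desired state by name silently keeps only the last one, and whatever the first definition deployed no longer appears in the desired state:

```python
# Hypothetical illustration: two Application documents accidentally sharing
# the same metadata.name inside one large rendered file.
import yaml

rendered = """\
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: influxdb-cloud          # original definition (hypothetical name)
spec:
  destination:
    name: central-eu-cluster
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: influxdb-cloud          # accidental reuse of the same name
spec:
  destination:
    name: another-cluster
"""

# Indexing by name keeps only the last document: last one wins, silently.
apps = {doc["metadata"]["name"]: doc for doc in yaml.safe_load_all(rendered)}

print(len(apps))  # 1, not 2: the original definition has vanished from the
                  # desired state, so a sync tool that prunes may remove what
                  # that definition had deployed.
```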
10:22 - 10:37 UTC: Following the merge of the faulty PR, the cluster was deleted; the monitoring systems started reporting API failures and customer escalations followed. In response, the support team started the incident response process.
10:42 - 10:56 UTC: The normal procedure implies that reverting the PR restores normal operation, but the team realized it would be more complicated here: the affected services are stateful, and redeploying them would generate new, empty volumes. A recovery planning process was started and the status page was updated to properly reflect the issue.
The Rebuild
11:17 - 15:26 UTC: Once the "red button" that halts the deployment pipeline was pressed, deployment lists were created and double-checked. All services dependent on existing volumes were carefully deployed with their state preserved, first manually and then, once the process had gained trust, in an automated way. The stateless services were deployed by CD in parallel, with data integrity verified along the way. Where possible, services were restored from backup instead of being redeployed.
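Because the volumes had been retained, stateful services could be pointed back at their original data. The following is a sketch of the general Kubernetes technique for doing so, with hypothetical names and not the team's exact procedure: clear the stale claimRef on the released PersistentVolume, then create a claim that binds to it explicitly.

```python
# General Kubernetes technique for reusing a retained volume after its
# original claim was deleted; all names here are hypothetical.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

PV_NAME = "storage-engine-pv-0"      # released volume holding the old data
PVC_NAME = "storage-engine-data-0"   # new claim for the rebuilt workload
NAMESPACE = "influxdb"

# 1. Remove the stale claimRef so the "Released" volume becomes "Available".
v1.patch_persistent_volume(PV_NAME, {"spec": {"claimRef": None}})

# 2. Create a claim that binds to that specific volume by name. The access
#    mode, size, and storage class must match what the volume offers.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name=PVC_NAME),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
        storage_class_name="storage-engine",  # must match the PV's class
        volume_name=PV_NAME,
    ),
)
v1.create_namespaced_persistent_volume_claim(NAMESPACE, pvc)
```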
Expecting a surge in traffic once the service was restored, the team proactively scaled up the ingress points.
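Purely as an illustration, with the deployment name, namespace, and replica count being assumptions, pre-scaling an ingress controller ahead of the returning traffic can be a single scale patch:

```python
# Illustrative only: pre-scale an ingress controller before re-enabling
# traffic; the names and the replica count are assumptions.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="ingress-nginx-controller",   # hypothetical ingress deployment
    namespace="ingress-nginx",
    body={"spec": {"replicas": 12}},   # well above steady-state capacity
)
```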
Return to Service
15:26 - 16:04 UTC: The write service was enabled, the system was tested, and functionality was validated. Tasks were enabled but allowed to fail, in order to avoid overloading the cluster. Once the tasks were completed, the query service was re-enabled.
In the Aftermath
16:04 UTC onward: Customers were notified that the system had returned to normal operation and that some of their tasks had failed, with the team helping them recover wherever possible.
The root cause analysis (RCA) was written and published.
In the modern world of distributed systems, the ever-expanding digital infrastructure is powered by data, this age's new oil, so its importance is obvious to any business. As InfluxDB is a data storage platform, continuous operation is vital for its customers. Looking back on the event, the company concluded that its mechanisms were not good enough; still, even though there was downtime, no data was lost.
After this incident, the team took steps to ensure that these kinds of changes are detected early: a basic test tool renders the YAML files and flags duplicate resources. To minimize the impact of any single fault, the file structure was changed so that each file contains only one object, and ArgoCD was reconfigured to disallow changing objects from a different namespace.
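A minimal sketch of such a duplicate check, assuming the rendered YAML files are passed on the command line and not claiming to be InfluxDB's actual tool, could parse every document and fail when a kind/namespace/name combination appears more than once:

```python
# Sketch of a pre-merge duplicate-resource check over rendered YAML files;
# the CLI wiring and defaults are assumptions.
import sys
from collections import Counter

import yaml


def duplicate_resources(paths):
    keys = []
    for path in paths:
        with open(path) as f:
            for doc in yaml.safe_load_all(f):
                if not doc:
                    continue
                meta = doc.get("metadata", {})
                keys.append((doc.get("kind"), meta.get("namespace", ""), meta.get("name")))
    return [key for key, count in Counter(keys).items() if count > 1]


if __name__ == "__main__":
    dupes = duplicate_resources(sys.argv[1:])
    for kind, namespace, name in dupes:
        print(f"duplicate resource: {kind} {namespace}/{name}")
    sys.exit(1 if dupes else 0)
```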
Beyond these preventive measures, the company prepared even more thoroughly for the eventuality of another outage: the process for handling public-facing incidents was improved, and runbooks for the previously uncovered cases were written and exercised.
You can assume that sooner or later something will happen that provokes a production outage; it is up to you to be as prepared as possible, to keep its impact to a minimum, and to make sure that each incident makes your team stronger.