Anna Shipman revealed to QCon London attendees how DevOps drives the UK's Government Digital Service (GDS). GDS aims to lead the digital transformation of the UK government, "mak[ing] digital services and information simpler, clearer and faster". Its best-known site is GOV.UK, which provides government information and services.
At GDS, developers have a lot of autonomy. They are responsible for their application's whole lifecycle. Developers deploy to production from their own laptops and support their own code, including on-call rotations. They also make their own tech choices. Given the government context in which GDS operates, all this autonomy prompted a lot of questions from the audience. Shipman explained there was a shared understanding early on that DevOps was the best way to pursue GDS' mission. For instance, developers deploy to production from their own laptops for efficiency and to be able to roll out changes quickly. Shipman gave the example of the Heartbleed bug, which was addressed within a couple of hours of its announcement.
Developers are on call on a weekly rotation. GDS has rigorous rules on which events should trigger PagerDuty alerts. Every incident that might trigger PagerDuty must have an entry in GDS' operations manual clearly explaining the steps to resolve or mitigate that incident. Only really serious events, such as server crashes, trigger PagerDuty. GDS does not have 24x7 application support requirements, so it is possible to switch to static pages when needed. Shipman considers the operations manual the most important tool the teams have to support their activities. It is this manual that allows GDS to share responsibilities across all team members. GDS has a hard rule: whenever someone finds a flaw in the manual, they must first resolve the incident at hand and then update the manual.
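The rule that every paging incident needs a manual entry lends itself to an automated check. The following is a minimal Python sketch of that idea, not GDS' actual tooling: the `Alert` type, the `pages` flag, the alert names and the dictionary standing in for the operations manual are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert definition; not taken from GDS' configuration."""
    name: str
    pages: bool  # True if this alert is allowed to page the on-call developer


def alerts_missing_manual_entry(alerts, manual):
    """Return the names of paging alerts that have no non-empty manual entry.

    `manual` is assumed to map an alert name to the text of its entry
    in the operations manual.
    """
    return sorted(
        alert.name
        for alert in alerts
        if alert.pages and not manual.get(alert.name, "").strip()
    )


if __name__ == "__main__":
    alerts = [
        Alert("server-down", pages=True),
        Alert("disk-filling", pages=False),
        Alert("cert-expired", pages=True),
    ]
    manual = {"server-down": "Switch the site to static pages, then restart the host."}
    # Lists the paging alerts that still need a manual entry.
    print(alerts_missing_manual_entry(alerts, manual))
```

A check like this could run in CI so that a new paging alert cannot be added without its mitigation steps, under the assumption that alerts and manual entries share a common identifier.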
Serious incidents trigger blameless post-mortems. The people involved in the incident first write a report that is widely shared across the organization. Shortly thereafter, all the relevant stakeholders gather for a post-mortem meeting to work out how to prevent or mitigate that incident in the future.
Scheduling deploys to production follows a simple procedure. Teams register their intention in the release plan, which is divided into 30-minute slots. When the time comes to deploy, the team must ensure it has the badger toy, which acts as an exclusive lock for deployment.
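The procedure combines two mechanisms: a schedule of 30-minute slots and a single token that only one team can hold at a time. The Python sketch below models both under stated assumptions. The `ReleasePlan` class, its method names and the team names are hypothetical, and the real badger is a physical toy rather than software.

```python
from datetime import datetime, timedelta

SLOT_LENGTH = timedelta(minutes=30)


class ReleasePlan:
    """Hypothetical model of a release plan plus a single deployment token."""

    def __init__(self):
        self._bookings = {}         # slot start time -> team name
        self._badger_holder = None  # team currently holding the token, if any

    @staticmethod
    def slot_start(when):
        """Round a datetime down to the start of its 30-minute slot."""
        return when.replace(minute=(when.minute // 30) * 30, second=0, microsecond=0)

    def book(self, team, when):
        """Register a team's intention to deploy in the slot containing `when`."""
        start = self.slot_start(when)
        if start in self._bookings:
            raise ValueError(f"slot {start} is already booked by {self._bookings[start]}")
        self._bookings[start] = team
        return start

    def take_badger(self, team):
        """Acquire the exclusive deployment token; fails if another team holds it."""
        if self._badger_holder not in (None, team):
            raise RuntimeError(f"{self._badger_holder} currently holds the badger")
        self._badger_holder = team

    def return_badger(self, team):
        """Release the token so that the next team can deploy."""
        if self._badger_holder != team:
            raise RuntimeError(f"{team} does not hold the badger")
        self._badger_holder = None


if __name__ == "__main__":
    plan = ReleasePlan()
    start = plan.book("team-a", datetime(2015, 3, 5, 10, 42))
    print(start, "to", start + SLOT_LENGTH)
    plan.take_badger("team-a")
    plan.return_badger("team-a")
```

The slot booking prevents planned conflicts, while the token guards against overlap when a deploy runs late, which is presumably why a physical lock is useful alongside the schedule.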
Each team has the freedom to make its own technological choices. Those choices are discussed with the architects and the infrastructure team. Shipman, an architect at GDS, told the audience that she felt that her role was more about listening to the teams and helping them with their choices than about decreeing governance rules. Shipman cited the case of a team that decided that PostgreSQL was a better fit than MySQL for their scenario. They joined forces with the infrastructure team and performed the migration over several weeks.
GDS uses a lot of open source tools. Among others, they use Jenkins as a CI server, Puppet for IT automation, syslog and Logstash for logging, Cucumber for acceptance testing and Icinga for monitoring. GDS also develops most of its tools and applications in the open. AlphaGov hosts the tools that are open source but not supported in any way. GDS Operations hosts tools that have a higher level of commitment, such as vCloud Tools.
The slides for Anna Shipman’s talk “Delivering GOV.UK: DevOps for the nation” can be found on the QCon London 2015 website schedule page.