At QCon San Francisco 2019, Anvita Pandit, senior developer at Google, explained Google’s internal Key Management System (KMS), which supports various Google services. This internal KMS manages the generation, distribution and rotation of cryptographic keys, and also handles other secret data. Moreover, the internal KMS supports various services on the Google Cloud Platform (GCP), including the Cloud KMS, and therefore this system needs to scale.
Implementing encryption at scale requires highly available key management, which at Google means 99.999% availability. To achieve this, Google uses several strategies, as presented by Pandit:
- Best practices for change management and staged rollouts
- Minimize dependencies, and aggressively defend against their unavailability
- Isolate by region and client type
- Combine immutable keys and wrapping to achieve scale
- Provide a declarative API for key rotation
At the beginning of the presentation, Pandit explained why Google, or any organization, should use a KMS for secrets such as third-party API keys, OAuth tokens, and cryptographic keys for encryption and decryption. Storing this data in a file or on GitHub is not a valid option.
A KMS offers separate management of key-handling and separation of trust. Pandit stated that the internal KMS service leverages the Google storage infrastructure, such as providing authentication and encryption at rest. Furthermore, the service has two main features:
- Single point of access, controlling who can access secrets
- Auditing who touches the keys using binary verification and logging (not of the secrets themselves)
The internal KMS relies on what Google calls its "root of trust", which Pandit explained thoroughly. It is described in a Google whitepaper: Encryption at Rest.
Next, Pandit used the example of a massive Gmail outage in 2014, which was caused by a bug in a cron job responsible for merging KMS configurations. The bug truncated the configurations, and the truncated result was subsequently pushed to all live services. The outage disrupted Gmail and other live services for up to two hours.
Google learned several lessons from the outage, such as the danger of a single point of failure in the process of updating KMS configurations, and the risk posed by the service’s startup and runtime dependencies. As a result, Google made changes to its internal KMS to prevent an outage of this magnitude from happening again:
- Eliminating all-at-once global rollouts of binaries and configuration
- Regional failure isolation and client isolation
- Minimizing dependencies
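The first two changes can be illustrated with a minimal sketch of a region-by-region configuration rollout. This is not Google's mechanism, which the talk summary does not describe; the function names, the `max_drop_fraction` threshold, and the `healthy` callback are all hypothetical. The idea is only that a candidate configuration is sanity-checked against the live one (a truncated file would drop most entries) and applied to one region at a time, stopping at the first failure.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def validate(old: Config, new: Config, max_drop_fraction: float = 0.1) -> bool:
    """Reject a candidate that drops too many existing entries (hypothetical check)."""
    if not old:
        return bool(new)
    missing = [name for name in old if name not in new]
    return len(missing) / len(old) <= max_drop_fraction


def staged_rollout(
    regions: List[str],
    live: Dict[str, Config],
    candidate: Config,
    healthy: Callable[[str], bool],
) -> List[str]:
    """Apply `candidate` one region at a time; stop at the first problem.

    Returns the regions that ended up running the candidate configuration.
    """
    updated: List[str] = []
    for region in regions:
        previous = live[region]
        if not validate(previous, candidate):
            break  # candidate looks truncated: no further regions are touched
        live[region] = candidate
        if not healthy(region):
            live[region] = previous  # roll back this region only
            break
        updated.append(region)
    return updated
```

With this shape, a truncated configuration is rejected before it reaches any region, and a failure that only appears at runtime is confined to the single region being updated.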
The changes led to a KMS service with the following characteristics:
- No downtime since the incident: over six nines of availability
- 99.9% of requests are served within 6ms
- Throughput of tens of millions of queries per second (QPS)
Finally, Pandit talked about key rotation, which is a best practice yet also a challenge. Key rotation limits the damage if a key is compromised and shortens the window of vulnerability; however, it introduces a risk of data loss. Furthermore, supporting key rotation at scale means meeting some design goals:
- KMS users design with rotation in mind: they choose the frequency of the rotation and can specify the time to live (TTL) of the ciphertext.
- Using multiple key versions should be no harder than using a single key, which is achieved through tight integration with Google’s cryptographic libraries.
- Losing data should be very hard, which is achieved by versioning the ciphertext.
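To make ciphertext versioning concrete, here is a minimal sketch of a keyring whose ciphertexts carry the version of the key that produced them. It is an illustration under stated assumptions, not Google's implementation: the `Keyring` class and its wire format are hypothetical, and the cipher is a toy construction (SHA-256 keystream plus HMAC tag, with one key used for both) chosen only so the example runs on the Python standard library. A real system would use a vetted AEAD from a cryptographic library.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode; NOT for production use."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def _xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


class Keyring:
    """Hypothetical keyring: immutable key versions, one of which is primary."""

    def __init__(self) -> None:
        self._versions = {}  # version number -> key bytes
        self.primary = 0
        self.rotate()

    def rotate(self) -> int:
        """Add a new key version and make it primary; old versions stay readable."""
        self.primary += 1
        self._versions[self.primary] = os.urandom(32)
        return self.primary

    def retire(self, version: int) -> None:
        """Drop an old version, e.g. once all ciphertext under it has expired."""
        if version == self.primary:
            raise ValueError("cannot retire the primary key version")
        del self._versions[version]

    def encrypt(self, plaintext: bytes) -> bytes:
        key = self._versions[self.primary]
        nonce = os.urandom(16)
        # Header = 4-byte key version + nonce, so decrypt can find the right key.
        header = self.primary.to_bytes(4, "big") + nonce
        body = _xor(plaintext, _keystream(key, nonce, len(plaintext)))
        tag = hmac.new(key, header + body, hashlib.sha256).digest()
        return header + body + tag

    def decrypt(self, blob: bytes) -> bytes:
        version = int.from_bytes(blob[:4], "big")
        key = self._versions.get(version)
        if key is None:
            raise KeyError(f"key version {version} is not available")
        nonce, body, tag = blob[4:20], blob[20:-32], blob[-32:]
        expected = hmac.new(key, blob[:-32], hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("ciphertext failed authentication")
        return _xor(body, _keystream(key, nonce, len(body)))
```

Callers only ever call `encrypt` and `decrypt`, so working with several key versions is no harder than working with one. Data becomes unreadable only when a version is explicitly retired, which is where a ciphertext TTL would come in.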
Pandit has published the slides of her presentation. Additionally, most other presentations at the conference were recorded and will be available on InfoQ over the coming months. Lastly, the next QCon, QCon London 2020, is scheduled for March 2 - 6, 2020.