At QCon New York 2017, Andrew Spyker and Amit Joshi presented "A Series of Unfortunate Container Events at Netflix". Key takeaways from running production workloads in containers on the AWS Cloud at Netflix included: expect problematic containers and workloads; the cloud must continue to evolve to support containers; container schedulers and runtimes are complex, so ops enablement is key for production systems and users need help adopting containers responsibly; and the effort has been worthwhile due to the value containers unlock.
Spyker, manager of Netflix Container Cloud, and Joshi, senior software engineer at Netflix, began the talk by introducing Titus, Netflix's container management platform. Titus schedules both batch and service jobs, manages resources, and executes containers using Docker, AWS integration and the associated Netflix infrastructure. Titus is currently deployed across multiple AWS accounts and is running within three regions. Over 5,000 AWS EC2 instances (mostly m4.4xlarge and r3.8xlarge) are used to run Titus, more than 10,000 containers run concurrently, and over a one-week period 1,000,000 containers were launched.
At Netflix there is a single cloud platform for VMs and containers, with Spinnaker providing continuous delivery, and other tooling (much of which is open source) providing telemetry, service discovery and RPC load balancing, health checks, Chaos Monkey, traffic control (Flow and Kong), and Netflix's secure secret management. Titus integrates containers with AWS EC2, including VPC connectivity (an IP per container), Security Groups, the EC2 Metadata service, IAM Roles, multi-tenant isolation (CPU, memory, disk quota and network), and rotation and management of live and S3-persisted logs.
The next section of the presentation discussed lessons the Titus team had learned from running containers in production over the past year, the first of which was "expect bad actors" (unintentional or otherwise). Spyker discussed issues such as runaway submissions that consumed all resources; the system being perceived as an infinite queue, which led some teams to submit so many jobs that the scheduler ran out of memory; the submission of invalid jobs; and jobs that fail repeatedly.
Accordingly, the Titus engine and API have been improved over the past year, and now include scheduler capacity groups, better input validation and exception handling, and rate limiting of repeatedly failing jobs (a back-off-based approach to this is sketched below). The team email of the job submitter is now also mandatory, in order to make it easier to track down the owner of a bad job. The Titus team has also examined container escape protection, and has attempted to implement user namespaces since this was enabled in Docker 1.10 (February 2016), but has encountered challenges.
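The talk did not show implementation details, but a minimal sketch of rate limiting failing jobs might track consecutive failures per job and push each relaunch further out with an increasing back-off. The class, method names and delay values below are illustrative assumptions, not Titus's actual code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: throttle restarts of repeatedly failing jobs by applying an
// increasing back-off delay per job. Names and constants are hypothetical.
public class FailingJobRateLimiter {

    private static final Duration BASE_DELAY = Duration.ofSeconds(10);
    private static final Duration MAX_DELAY = Duration.ofMinutes(30);

    private static final class JobState {
        int consecutiveFailures;
        Instant nextAllowedRestart = Instant.EPOCH;
    }

    private final Map<String, JobState> jobs = new ConcurrentHashMap<>();

    // Record a failure and double the back-off window (up to a cap) each time.
    public void recordFailure(String jobId) {
        jobs.compute(jobId, (id, state) -> {
            if (state == null) {
                state = new JobState();
            }
            state.consecutiveFailures++;
            long delaySeconds = Math.min(
                    MAX_DELAY.getSeconds(),
                    BASE_DELAY.getSeconds() * (1L << Math.min(state.consecutiveFailures, 10)));
            state.nextAllowedRestart = Instant.now().plusSeconds(delaySeconds);
            return state;
        });
    }

    // A healthy run clears the failure history for the job.
    public void recordSuccess(String jobId) {
        jobs.remove(jobId);
    }

    // The scheduler consults this before relaunching a previously failed job.
    public boolean canRestart(String jobId) {
        JobState state = jobs.get(jobId);
        return state == null || !Instant.now().isBefore(state.nextAllowedRestart);
    }
}
```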
The second lesson presented was "The Cloud isn't Perfect", and the discussion covered challenges including: cloud (API) rate limiting, which was solved by using exponential backoff with jitter (a sketch follows below); hosts initialising as bad, or becoming bad, which was addressed by adding host health checks and auto-termination where required; and how to handle entire cluster upgrades, which was solved using partitioned cluster updates.
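Exponential backoff with jitter is a common pattern for coping with cloud API rate limits; the "full jitter" variant below is a hedged sketch in Java, with placeholder constants and generic exception handling rather than anything taken from Titus itself.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Sketch of "full jitter" exponential back-off for retrying throttled API calls.
public class BackoffWithJitter {

    private static final long BASE_DELAY_MS = 100;
    private static final long MAX_DELAY_MS = 20_000;
    private static final int MAX_ATTEMPTS = 8;

    public static <T> T callWithBackoff(Supplier<T> apiCall) throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                return apiCall.get();
            } catch (RuntimeException throttled) {   // e.g. a cloud API throttling error
                if (attempt >= MAX_ATTEMPTS - 1) {
                    throw throttled;                 // give up after a bounded number of retries
                }
                // Grow the retry window exponentially (capped), then sleep a random
                // amount within it so many clients do not retry in lock-step.
                long capMs = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * (1L << attempt));
                long sleepMs = ThreadLocalRandom.current().nextLong(capMs + 1);
                Thread.sleep(sleepMs);
            }
        }
    }
}
```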
The third lesson, "Our Code isn't Perfect", discussed how the Titus team discovered issues such as disconnected containers, preventing Titus from killing containers, which was addressed by monitoring for differences between expected and actual state and reconciling aggressively; schedulers failover speed is important, including the need for data sharding and to minimise scheduler initialisation actions; operating containers requires knowledge of the Linux kernel, which was addressed by training and tooling; the need to embrace chaos testing; and providing effective dashboards and alerting is vital.
The final section of the talk presented what has worked well with Titus over the past year. The team built upon solid software, and praised Docker, ZooKeeper, Mesos and Cassandra. There was a clear focus on delivering core business value, and deciding what not to build has been just as important as deciding which features to implement. Operational enablement is being managed in three phases: from manual red/black deploys of a cluster, through a runbook for on-call, and ultimately to automated build pipeline delivery.
Measuring Service Level Objectives (SLOs) is vital: "if you aren't measuring, you don't know, and if you don't know, you can't improve". The Titus team has also begun onboarding services onto the platform in a controlled manner, using a spreadsheet with a traffic light system showing which cluster/platform features are alpha, beta or generally available.
Additional information on the talk "A Series of Unfortunate Container Events at Netflix" presented by Andrew Spyker and Amit Joshi can be found on the QCon New York website, and the video recording will be made available on InfoQ over the coming months.