Netflix, the popular movie streaming service, deploys code a hundred times per day, without Chef or Puppet, without a quality assurance department, and without release engineers. To make this possible, Netflix built an advanced in-house PaaS (Platform as a Service) that allows each team to deploy its own part of the infrastructure whenever it wants, as often as it needs. During QCon New York 2013, Jeremy Edberg gave a talk about the infrastructure Netflix built on top of Amazon's AWS to support this rapid pace of iteration.
Netflix uses a service-oriented architecture to implement its API, which handles most of the site's traffic (two billion requests per day). Behind the scenes, the API is composed of many services, each managed by a single team, allowing teams to work relatively autonomously and decide for themselves when and how often to deploy new software.
Netflix is heavily invested in DevOps. Developers build, deploy and operate their own server clusters and are accountable when things go wrong. After a failure, a session is organized in which the root cause of the issue is investigated and ways to prevent similar issues in the future are discussed, in the spirit of the five whys technique.
Deployment at Netflix is completely automated. When a service needs to be deployed, the developer first pushes the code to a source code repository. The push is picked up by Jenkins, which performs a build and produces an application package. Next, a fresh VM image is baked: starting from a base image that contains a Linux distribution plus the software all Netflix servers run, including a JVM and Tomcat, possibly further customized by the team, the application package is installed on top. The result is registered with the system as an Amazon Machine Image (AMI).
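To make the bake step concrete, below is a minimal sketch of creating and registering an AMI from an already-prepared builder instance, written against the boto3 AWS SDK. The function name, the builder instance and the naming scheme are illustrative assumptions; Netflix's actual pipeline is driven by Jenkins and its own bakery tooling, not by a script like this.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def bake_ami(builder_instance_id: str, app_name: str, version: str) -> str:
    """Register an AMI from a builder instance that already has the base
    install (Linux, JVM, Tomcat) plus the application package on top.

    `builder_instance_id` is a hypothetical, pre-provisioned EC2 instance.
    """
    response = ec2.create_image(
        InstanceId=builder_instance_id,
        Name=f"{app_name}-{version}",
        Description=f"Base OS + JVM + Tomcat + {app_name} {version}",
    )
    ami_id = response["ImageId"]
    # Block until the image is available, so deployment can use it right away.
    ec2.get_waiter("image_available").wait(ImageIds=[ami_id])
    return ami_id
```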
To deploy the VM images to its infrastructure, Netflix built Asgard. Via the Asgard web interface, VM images can be instantiated to create new EC2 clusters. Every cluster consists of at least three EC2 instances for redundancy, spread over multiple availability zones. When a new version is deployed, the cluster running the previous version is kept running while the new version is instantiated. Once the new version has booted and registered itself with Eureka, the Netflix services registry, the load balancer flips a switch that directs all traffic to the new cluster. The new cluster is monitored carefully and kept running overnight. If everything runs smoothly, the old cluster is destroyed; if something goes wrong, the load balancer is simply switched back to the old cluster.
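The switchover logic can be sketched as follows. All the helper functions here (launch_cluster, registered_instances, switch_traffic, looks_healthy, destroy_cluster) are hypothetical stubs standing in for Asgard, Eureka and load-balancer operations; the sketch only illustrates the deploy/verify/rollback flow described above, not Asgard's real API.

```python
import time

# Hypothetical stand-ins for Asgard/Eureka/load-balancer operations.
def launch_cluster(ami_id, size):
    print(f"launching {size} instances of {ami_id} across availability zones")
    return f"cluster-of-{ami_id}"

def registered_instances(cluster):
    return 3  # pretend all instances have registered with Eureka

def switch_traffic(cluster):
    print(f"load balancer now directs all traffic to {cluster}")

def looks_healthy(cluster):
    return True  # stand-in for careful monitoring (kept running overnight)

def destroy_cluster(cluster):
    print(f"terminating {cluster}")

def deploy_new_version(new_ami, old_cluster, size=3):
    # Bring up the new cluster next to the one running the previous version.
    new_cluster = launch_cluster(new_ami, size)
    # Wait until every instance has booted and registered with Eureka.
    while registered_instances(new_cluster) < size:
        time.sleep(10)
    switch_traffic(new_cluster)
    if looks_healthy(new_cluster):
        destroy_cluster(old_cluster)   # everything ran OK overnight
    else:
        switch_traffic(old_cluster)    # instant rollback
        destroy_cluster(new_cluster)
```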
Failure happens continuously in the Netflix infrastructure. Software needs to be able to cope with failing hardware, failing network connectivity and many other kinds of failure. Even when failure does not occur naturally, it is induced forcefully using the Simian Army, a collection of (software) "monkeys" that randomly introduce failure. For instance, the Chaos Monkey randomly brings servers down and the Latency Monkey randomly introduces latency into the network. Ensuring that failure happens constantly makes it impossible for teams to ignore the problem and fosters a culture in which resilience to failure is a top priority.
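As an illustration of what such a monkey amounts to, here is a minimal Chaos Monkey style sketch using boto3: pick a random instance from an Auto Scaling group and terminate it. The real Simian Army is considerably more sophisticated and configurable; this only conveys the basic idea.

```python
import random
import boto3

def chaos_monkey(asg_name: str, region: str = "us-east-1") -> None:
    """Terminate one randomly chosen instance in the given Auto Scaling group."""
    autoscaling = boto3.client("autoscaling", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)

    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    instances = [i["InstanceId"] for g in groups for i in g["Instances"]]
    if not instances:
        return  # nothing to kill

    victim = random.choice(instances)
    # The Auto Scaling group will replace the instance automatically;
    # the service has to survive the window in between.
    ec2.terminate_instances(InstanceIds=[victim])
```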
Many parts of the Netflix infrastructure are already open source and available on GitHub. Netflix's goal is to eventually release all of its infrastructure so that other companies can benefit from it.