At KubeCon in Austin, TX, attended by over 4,000 engineers, Craig McLuckie delivered a keynote on the Kubernetes journey.
InfoQ caught up with Craig McLuckie, one of the original founders of the Kubernetes project and CEO of Heptio.
InfoQ: In your keynote you talked about three things that matter in technology companies and how Kubernetes is making a big difference. First, let's talk about developer productivity, and then move on to why recoverability matters more than resiliency. How does Kubernetes help with these challenges?
Craig McLuckie: Kubernetes really helps with this in two ways.
First, it takes the operator out of the equation for a lot of common situations. With traditional technologies, a business might discover that something is awry, and an on-call engineer gets paged. As often as not, they have to restart a process to get things going again. It is a human process and happens at human speed. Kubernetes automates this, introducing control loops that manage the health of the components under management. There are some things that machines just do better than people, and this is one of them. This isn't to say that Kubernetes displaces the role of the operator; we tend to think of it as delivering 'operator power tools' that take a lot of the toil out of running applications.
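To make the control-loop idea concrete, here is a minimal sketch of the observe-compare-act pattern (an illustration for this article, not Heptio's or Kubernetes's actual code; real controllers are written in Go against client-go informers and work queues):

```go
package main

import (
	"fmt"
	"time"
)

// A toy reconciliation loop in the spirit of a Kubernetes controller.
// Real controllers watch cluster state through the API server; here the
// "world" is just a struct we mutate directly.
type replicaSet struct {
	desired int // replicas the user asked for
	running int // replicas actually alive
}

// reconcile nudges the actual state toward the desired state.
func reconcile(rs *replicaSet) {
	switch {
	case rs.running < rs.desired:
		rs.running++ // start a replacement "pod"
		fmt.Printf("started replica: %d/%d running\n", rs.running, rs.desired)
	case rs.running > rs.desired:
		rs.running-- // scale down
		fmt.Printf("stopped replica: %d/%d running\n", rs.running, rs.desired)
	default:
		fmt.Println("steady state, nothing to do")
	}
}

func main() {
	rs := &replicaSet{desired: 3, running: 3}
	rs.running = 1 // simulate two replicas crashing in the middle of the night

	// The control loop runs continuously: observe, compare, act.
	for i := 0; i < 4; i++ {
		reconcile(rs)
		time.Sleep(100 * time.Millisecond)
	}
}
```

The point of the pattern is that recovery is mechanical and immediate; no human needs to be paged to restore the desired state.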
The second big consideration is that Kubernetes is an intrinsically distributed system built on very robust bones. By design, it removes single points of failure from the system. With a properly configured cluster (set up for high availability, on appropriate infrastructure), you could walk around a data center unplugging nodes, and Kubernetes would reschedule the workloads onto the remaining nodes and push everything back into a good state. This astonishes a lot of users when they first see it in action; I remember one calling it a 'shotgun-proof system'. It is surprisingly durable.
I also want to point out that companies should think about what availability really means. In the simplest terms, it is the percentage of time your app is up. Two things drive that: how long does it take to go down, and when it does go down, how long does it take to come back up?
What people often miss is that the time to get the application running again dominates the overall availability equation. You can put a lot of effort into making sure something never goes down (resiliency), yet miss the point altogether if, in the rare situation it does go down, it takes forever to come back up (in other words, not thinking enough about recoverability).
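As a back-of-the-envelope illustration (using the standard MTBF/MTTR formulation rather than McLuckie's words), steady-state availability is roughly

availability = MTBF / (MTBF + MTTR)

where MTBF is the mean time between failures and MTTR is the mean time to recovery. A system that fails once a year (MTBF of about 8,760 hours) but takes a full day to restore is 8760/8784, or about 99.7% available. A system that fails every month (MTBF of about 730 hours) but recovers in five minutes is about 99.99% available. The second system fails twelve times as often yet accumulates roughly twenty times less downtime, which is exactly the recoverability point.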
The other thing people miss is that there are very different types of outages. For some classes of application, being 'down' one second in every thousand would have no measurable impact on the business; users forced to reload the app 0.1% of the time might think nothing of it. Other businesses would really struggle to absorb a full day's outage during peak season.
InfoQ: You stated that Kubernetes enables multi-cloud, and that it is happening in the real world. However, isn't it the case that enterprises prefer to standardize on a single vendor? Hybrid cloud seems limited to on-premises plus one cloud vendor rather than spanning different cloud vendors -- is that an accurate statement?
McLuckie: This may be true for some users, but it isn't what I am seeing with bigger companies. Most businesses of scale (enterprises) are really worried about being in a single-vendor relationship with a cloud provider. I can't remember a single CIO-level conversation in finance, retail, manufacturing, health care, etc. in the last six months where folks haven't mentioned Google and Microsoft at some point. Compare that with when I was working at Google a couple of years ago, when a lot of folks were Amazon-only shops.
Companies want to be in a second cloud provider relationship. For them, the ideal state is that cloud providers offer 'utility computing' as boring as the electric supply (110 volts at 60Hz), and they want to get it at the best price they can. They are really worried about lock-in and the implications it has for their negotiating power.
Now, this doesn't mean that individual applications are going to run in multiple clouds. There are some cases where that is true (e.g. internet-scale applications, or apps that run across different geographies), but for many folks the question is where they build their next application. What is key for them is not having to retrain their development staff to build an application in another environment. Kubernetes is in the Goldilocks zone: not so low level that you get stuck in the specifics of a given environment, but not so high level (as with most PaaS solutions) that you can't run pretty much anything you want. We designed it that way from the start.
To be clear, this isn't because Amazon is doing a bad job, far from it, but Google and Microsoft have both advanced by leaps and bounds in the last year or so and are very strong in the market.
InfoQ: In the keynote you claimed that the "Enterprise" is complicated. We may already know this, but is the claim that Kubernetes is the silver bullet here?
McLuckie: No, it really isn't, just as standardizing the OS on commodity machines wasn't a silver bullet for managing enterprise complexity when the world moved from mainframes to client-server architectures. Enterprises are always going to have to deal with unique operating requirements and conditions, and it is going to take a very long time to move all applications to Kubernetes, if that ever happens at all.
If you spend enough time working on an architecture that touches core businesses today and trace the dependency chain far enough, you will, as often as not, find a mainframe. In fact, as best I understand it, mainframe sales peaked this decade. It is going to be a very long time before people remove VMs from their data centers.
Kubernetes does, however, help, and it has the potential to be as disruptive to application development as the transition from mainframes to client-server was back in the day. We will see a good number of existing applications move to Kubernetes quickly, but it will likely take some time for a lot of traditional apps to move. When things are running, there is a tremendous amount of organizational inertia to overcome, not just technically but culturally. And change is slow.
InfoQ: Let's now change gears and talk about your recent startup journey. Your company Heptio talks about a Kubernetes undistribution. Can you elaborate on what this exactly means and how it might help enterprises?
McLuckie: Over the past couple of years, we have seen a pretty marked change in the relationship between enterprises and the open source communities that power them.
First, they see open source as a great way to mitigate the threat of lock-in. Kubernetes creates a consistent environment to run apps and abstracts them from the infrastructure they run on. Having said that, as long as the cloud provider maintains coherence with upstream, there are few reasons why an enterprise would not want to use a hosted version of Kubernetes (Google Kubernetes Engine, Azure Kubernetes Service, or Amazon Elastic Container Service for Kubernetes). These will all emerge as great choices for a lot of situations. There are, however, many situations where enterprises can't use one of those offerings: they don't run on-premises, or the enterprise may have unique requirements that aren't naturally met by those platforms.
The second thing that is changing is enterprises' relationship with the open source communities themselves. Savvy enterprises realize that getting something accepted by the upstream community positions them better, since they then don't have to maintain it themselves. It is natural for them to find a partner who can help close gaps in the ecosystem with upstream-friendly solutions that also work in cloud provider based environments.
The Heptio Kubernetes Subscription delivers a lot of the positive attributes of a traditional distribution: a single accountable vendor delivering well-defined reference architectures, a stable installation framework, 24x7x365 support, and so on. But it comes with several benefits over a traditional distribution:
- We are committed to staying fastidiously 'upstream', meaning that from your app's perspective you never have to worry about where Kubernetes ends and the distribution begins. This gives you a high degree of flexibility around where you run your app (on HKS, or on a cloud provider's managed service).
- We are committed to relentlessly improving our support model through tools like Sonobuoy and more progressive mechanisms to qualify and maintain your clusters. We accept that most enterprise environments are 'snowflakes' and recognize that, at the end of the day, observed consistency is more important than controlling the process of building the cluster. We will go into more detail on this a little later.
- We are committed to closing the gaps we see in bringing new classes of workloads to Kubernetes in an upstream friendly fashion. You can already see this with the projects we have sponsored, all of which have a customer story behind them.
InfoQ: Ark and Sonobuoy solve some of the common concerns enterprises have around managing Kubernetes clusters. Can you provide some technical details about these products, and also share some of the roadmap for these and other products complementing the Kubernetes project?
McLuckie: Ark came about because one of our earliest customers was struggling with backup and recovery for stateful Kubernetes workloads. The advice they had been given, to back up the underlying state store (etcd), wasn't working for them. As we started working on the solution, it became clear that the value of Ark goes well beyond backup and recovery: it is a really robust way to migrate workloads from environment to environment, which is a challenge in every vertical we work with. Customers value being able to copy production environments for experimentation, and in some cases need to move workloads from unmanaged Kubernetes clusters (either on-premises or in the cloud) to the managed solutions cloud providers offer.
We will continue to invest in two directions: driving the overall availability of workloads by refining Ark's features so that it plays well with more demanding workloads, and introducing capabilities that enhance choice by making it an effective migration tool. We want to give users the flexibility not only to move workloads from one environment to another, but ultimately to maintain copies of workloads in other environments, so that if, for example, there were a major cloud provider outage, they could quickly get up and running on a different provider.
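For a flavor of what this looks like in practice, the Ark CLI at the time of writing drives backups and restores with commands along these lines (a sketch based on the project's quickstart; names and flags may change between releases):

```
# Back up everything labelled app=nginx, including associated volumes
ark backup create nginx-backup --selector app=nginx

# Later, possibly in a different cluster pointed at the same object store,
# recreate those resources from the backup
ark restore create nginx-backup
```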
Sonobuoy started as a tool to manage down the complexity of support. It was clear from the very beginning that 'configuration drift' happens in production environments, and we found ourselves doing a lot of the same work on every support call. That work often took the form of a painful 'game of telephone' with lots of high-latency back and forth. We asked ourselves what the best way was to tell whether a cluster was 'good', and figured that the upstream Kubernetes conformance tests would be a good place to start. This helps our users ensure that their cluster looks like the clusters a given release was qualified on. The tool has subsequently become the underlying framework for vendor certification through the Cloud Native Computing Foundation's Kubernetes certification program.
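The workflow Sonobuoy enables is deliberately simple; against any cluster it looks roughly like this (subcommand details may vary by version):

```
# Launch the conformance test suite as workloads inside the target cluster
sonobuoy run

# Check on progress while the tests execute
sonobuoy status

# Pull down the results tarball for inspection, or to attach to a support case
sonobuoy retrieve
```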
Looking ahead, we want to go beyond just running conformance tests and help users get ahead of security and optimization problems. A great example is the recent Tesla security incident, where cryptominers found their way into a production cluster. We have created extensions to Sonobuoy that will be available to our HKS subscribers and that capture our perspective on what a good cluster should look like. They provide helpful insight into how to optimize clusters for given workloads and availability needs. The Tesla example is exactly the kind of thing we hope to help users avoid. It isn't enough to just deliver binaries (as with a distribution); we aim to deliver expertise and opinions captured in code. But we have to do it in a way that is true to our values around avoiding proprietary runtime components; we can't take away our customers' control of their environment.
Beyond Ark and Sonobuoy, we have started working on Contour, a modern way to handle load balancing for Kubernetes clusters, and ksonnet - a simpler way to create Kubernetes configurations for real world use. More about those in the future.
InfoQ: I am going to ask the exact question I asked Brendan Burns. A lot of talks at KubeCon were aimed at making development boring ("it should just work"). Where are we as a community today? And if you let your imagination run a little wild, what will this look like five years from now compared to where we are today?
McLuckie: It is a great question, and yes, development is still too interesting. I hope that within the next five years we won't be talking about Kubernetes any more than we talk about the Linux kernel. It really needs to fade into the background. If we do our job right, I think a few things will be true.
Most open source projects' and ISVs' (independent software vendors') installation instructions will start with 'pick a certified Kubernetes cluster of your choice'. Step two will be 'run this kubectl command'. Kubernetes will unlock the ability for third-party software to run anywhere and, with some work, will make it easy for those vendors to provide alternatives to cloud provider managed services. There will be a lot of situations where you may choose to use a cloud service, but you should be able to get a similar experience from someone who is not a cloud provider, on infrastructure you control.
I believe that for the developer workflow, we will move away from closed PaaS offerings to a place where companies can assemble PaaS-like capabilities from best-of-breed parts. Some of those may be domain specific and apply to a particular industry. Enterprises will be able to quickly assemble a soup-to-nuts solution that provides a simple path from code to production with strong guardrails, plus the ability to 'break glass' when needed and run custom capabilities.
If we do our jobs right, we will also see a shift from ticket-driven infrastructure management, where human operators perform a lot of functions, to a world of API-driven management, where a lot of the common things enterprises need are delivered by internal teams that specialize in those functions. We will start to see the SRE (site reliability engineering) discipline emerge as enterprises sponsor specialized teams to deliver services to the broader organization and build deep operations specialization.
The entire video and details of the keynote session are on the conference website.