Key Takeaways
- People don't really care about moving to microservices per se. What they really care about is increasing feature velocity. In order to apply many people to a problem, you need to divide them up into teams, because people simply can't communicate effectively within very large groups.
- You can organize your people as independent, cross-functional, and self-sufficient feature teams that own an entire feature from beginning to end. When you do this, you end up breaking up that monolithic process that was the gating factor for feature velocity.
- A microservice system of any complexity cannot be instantiated fully locally, and therefore a hosted development platform must provide developer isolation and developer-driven real-time deployments.
- A service (mesh) proxy like Envoy is a good way to implement developer isolation through smart routing, and it can also provide developer-driven deployments using techniques like canary releasing. Projects like Istio and the Ambassador API gateway provide a user-friendly control plane for Envoy.
InfoQ recently sat down with Rafael Schloming, CTO and chief architect at Datawire, and discussed the challenges that face modern software-driven organizations. Although the implementation of microservices is often simply a side effect of the desire to increase velocity through application decomposition and decoupling, there are inherent developer workflow and deployment requirements that must be met. Schloming here elaborates further on this and discusses how Kubernetes and the Envoy service proxy (with control planes like Istio and Ambassador) can meet this need.
InfoQ: A key premise of your recent QCon San Francisco presentation appeared to be that organizations that are moving from a monolithic application to a microservice-based architecture also need to break up their monolithic process. Can you explain a little more about this?
Rafael Schloming: This is actually based on the premise that people don't really care about moving to microservices per se — what they really care about is increasing feature velocity. Microservices simply happen to be a side effect of making the changes necessary to increase feature velocity.
It's pretty typical for organizations as they grow to get to a point where adding more people doesn't increase feature velocity. When this happens, it is often because the structure and/or process the organization uses to produce features have become the bottleneck, rather than the headcount.
When an organization hits this barrier and starts investigating why features seem to be taking much longer than seems reasonable given the resources available, the answer is often that every feature requires the coordination of too many different teams.
This can happen across two different dimensions. Your people can be divided into teams by function: product versus development versus QA versus operations. Your people can also be divided up by component: e.g., front end versus domain model versus search index versus notifications. When a single feature requires coordinating efforts across too many different teams, the gating factor for delivering the feature is how quickly and effectively those different teams can communicate. Organizations structured in this way are effectively bottlenecked by a single monolithic process that requires each feature to be understood (at some level) by far too much of the organization.
InfoQ: So how do you fix this?
Schloming: In order to apply many people to a problem, you need to divide them up into teams somehow, because people simply can't communicate effectively in very large groups. When you do this you are making a set of tradeoffs. You are creating regions of high-fidelity communication and coordination within each team, and creating low-fidelity communication and relatively poorer coordination between teams.
To improve feature velocity in an organization, you can organize your people as independent, self-sufficient feature teams that own an entire feature from beginning to end. This will improve feature velocity in two ways. First, since the different functions (product, development, QA, and operations) are scoped to a single feature, you can customize the process to that feature area — e.g., your process doesn't need to prioritize stability for a new feature that nobody is using. Second, since all the components needed for that feature are owned by the same team, the communication and coordination necessary to get a feature out the door can happen much more quickly and effectively.
When you do this, you end up breaking up that monolithic process that was the gating factor for feature velocity, and you create many smaller processes owned by your independent feature teams. The side effect of this is that these independent teams deliver their features as microservices. The fact that this is a side effect is really important to understand. Organizations that look to gain benefit directly from microservices without understanding these principles can end up exacerbating their problems by creating many small component teams and worsening their communication problems.
InfoQ: Could you explain how this relates to the three development phases you mentioned that applications progress through: prototyping, production, and mission-critical?
Schloming: Each phase represents a different tradeoff between stability and velocity. This in turn impacts how you optimally go about the different kinds of activities necessary to deliver a feature: product, development, QA, and operations.
In the prototyping phase, there is a lot of emphasis on putting features in front of users quickly, and because there are no existing users, there is relatively little need for stability. In the production stage, you are generally trying to balance stability and velocity. You want to add enough features to grow your user base, but you also need things to be stable enough to keep your existing users happy. In the mission-critical phase, stability is your primary objective.
If the people in your organization are divided along these lines (product, development, QA, and operations), it becomes very difficult to adjust how many resources you apply to each activity for a single feature. This can show up as new features moving really slowly because they follow the same process as mission-critical features, or as mission-critical features breaking too frequently because the process has been loosened to accommodate the faster release of new features.
By organizing your people into independent feature teams, you can enable each team to find the ideal stability versus velocity tradeoff to achieve its objective, without forcing a single global tradeoff for your whole organization.
InfoQ: Another key premise from the talk was that teams building microservices must be cross-functional and able to get self-service access to the deployment mechanisms and the corresponding platform properties like monitoring, logging, etc. Could you expand on this?
Schloming: There are really two different factors here. First, if your team owns an entire feature, then it needs expertise in all the components that go into that feature, from front end to back end and anything between. Second, if your team owns the entire lifecycle of a feature from product to operations, your team needs expertise in all these different engineering-related activities — it can't just be a dev team.
Of course, this can require a lot of expertise, so how do you keep the team small? You need to find a way for your feature teams to leverage the work of other teams in the organization without the communication pathways between teams getting in the critical path of feature development. This is where self-service infrastructure comes into play. By providing a self-service platform, a feature team can benefit from the work that a platform team does without having to file a ticket and wait for a human to act upon it.
InfoQ: What kind of tooling can help with self-service access for deployment, and also to the platform?
Schloming: Kubernetes provides some great primitives for this sort of thing — e.g., you can use namespaces and quotas to allow independent teams to safely coexist within a single cluster. However, one of the bigger challenges here comes with maintaining a productive development workflow as your system increases in complexity. As a developer, your productivity depends heavily on how quickly you can get feedback from running code.
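To make those primitives concrete, here is a rough sketch of the namespace-plus-quota pattern (the team name and resource limits are purely illustrative):

```yaml
# One namespace per feature team keeps each team's resources separate.
apiVersion: v1
kind: Namespace
metadata:
  name: team-checkout
---
# A quota caps what the team can consume, so many teams can safely
# share a single cluster without starving one another.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-checkout-quota
  namespace: team-checkout
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    pods: "50"
```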
A monolithic application will typically have few enough components that you can wire them all together by hand and run enough of the system locally that you have rapid feedback from running code as you develop. With microservices, you quickly get to the point where this is no longer feasible. This means that your platform, in addition to being able to run all your services in production, also needs to provide a productive development environment for your developers. This really boils down to two problems:
- Developer isolation: With many services under active development, you can't have all your developers share a single dev cluster, or everything is broken all the time. Your platform needs to be able to provision isolated copies of some or all of your system purely for the purpose of development.
- Developer/real-time deployments: Once you have access to an isolated copy of the system, you need a way to get the code from your fingertips running against the rest of the system as quickly as possible. Mechanically, this is a deployment because you are taking source code and running it on a copy of prod.
Developer deployments are pretty different in some other important respects, though. When you deploy to production, there is a big emphasis on strict policies and careful procedures: e.g., passing tests, canary deploys, etc. For developer deployments, there is a huge productivity win from being able to dispense with the safety and procedure and focus on speed: e.g., running just the one failing test instead of the whole suite, not having to wait for a git commit and webhook, etc.
InfoQ: Could you explain these problems and how to solve them in a little more depth?
Schloming: For developer isolation, there are two basic strategies:
- Copy the whole Kubernetes cluster.
- Use a shared Kubernetes cluster, but copy individual resources (such as Kubernetes services, deployments, etc.) for isolation, and then use request routing to access the desired code.
Almost any system will grow to the point of requiring both strategies.
To implement developer isolation, you need to ensure all your services are capable of multiversion deployments, and you need a layer-7 router, plus a fair amount of glue to wire it all into a safe and productive workflow on top of git. For multiversion deployments, I've seen people use everything from sed to envsubst to fancier tools like Helm, ksonnet, and Forge for templating their manifests. For a layer-7 router, Envoy is a great choice and super easy to use, and it is available within projects like Istio and the Ambassador API gateway, which add a more user-friendly control plane.
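As a sketch of that kind of smart routing, here is roughly what a header-based developer route looks like with Istio's traffic-management resources (the search service, the subset names, and the x-dev-version header are all illustrative assumptions):

```yaml
# Requests carrying a developer's header go to that developer's copy of
# the service; all other traffic goes to the stable version.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: search
spec:
  hosts:
  - search
  http:
  - match:
    - headers:
        x-dev-version:       # hypothetical header set by the developer's tooling
          exact: alice
    route:
    - destination:
        host: search
        subset: alice        # the developer's in-progress deployment
  - route:
    - destination:
        host: search
        subset: stable       # default route for everyone else
---
# Subsets map to pod labels, so a "multiversion deployment" is just each
# version of the service deployed with a distinguishing version label.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: search
spec:
  host: search
  subsets:
  - name: stable
    labels:
      version: stable
  - name: alice
    labels:
      version: alice
```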
For developer/real-time deployments, there are two basic strategies:
- Run your code in the Kubernetes cluster, and optimize the build/deploy times.
- Compile and run your code locally, and then route traffic from the remote Kubernetes cluster to your laptop, and from the code running on your laptop back to your remote cluster.
Both these strategies can significantly improve developer productivity. Tools like Draft and Forge are both geared towards the first strategy, and there are tools like kube-openvpn and Telepresence for the second.
One thing is for sure: there is still a lot of DIY required to wire together a workable solution.
InfoQ: You mentioned the benefit that service-mesh technology, like Envoy, can provide for interservice communication ("east-west" traffic) in regard to observability and fault tolerance. What about ingress ("north-south" traffic)? Are there benefits to using similar technology here?
Schloming: Yes. In fact, in terms of bang for buck, this is the place I would look to deploy something like Envoy first. By placing Envoy at the edge of your network, you have a powerful tool to measure the quality of service that your users are seeing, and this is a key building block for adding canary releases into your dev workflow, something that is critical for any production or mission-critical services you have.
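As a sketch of what that can look like at the edge, here is roughly how a weighted canary route can be expressed with an Ambassador Mapping (the service names are illustrative, and the exact apiVersion varies by release):

```yaml
# Send most /search/ traffic to the stable service...
apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: search-stable
spec:
  prefix: /search/
  service: search-stable
---
# ...and divert a small, measurable slice to the canary. Comparing the
# quality of service between the two tells you whether to roll forward.
apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: search-canary
spec:
  prefix: /search/
  service: search-canary
  weight: 10
```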
InfoQ: How do you think the Kubernetes ecosystem will evolve over the next year? Will some of the tools you mention become integrated within this platform?
Schloming: I certainly wouldn't be surprised to see deeper integration between Envoy and Kubernetes. One thing I certainly hope to see is some stabilization. Kubernetes and Envoy are both foundational pieces of technology. Together they provide the core parts of an extremely flexible and powerful platform, but you really need to spend a while becoming an expert in order to leverage them.
With regard to the larger ecosystem, I think we'll see more projects geared toward allowing non-experts to leverage some of the benefits these tools can offer.
InfoQ: Is there anything else you would like to share with InfoQ readers?
Schloming: The Datawire team is working on a range of open-source tooling for improving the Kubernetes developer experience, and so we are always keen to get feedback from the community. Readers can contact us through our website, Twitter, or Gitter, and you can often find us speaking at tech conferences.
The video from Schloming’s QCon San Francisco 2017 talk “Microservices: Service-Oriented Development” can be found on InfoQ alongside a summary of the talk.
About the Interviewee
Rafael Schloming is Co-founder and Chief Architect of Datawire. He is a globally recognized expert on messaging and distributed systems and an author of the AMQP specification. Previously, Rafael was a principal software engineer at Red Hat.