The inaugural Cloud Native Computing Foundation (CNCF) ServiceMeshCon 2019 took place in November in San Diego as part of the KubeCon and CloudNativeCon “Day Zero” program. Key takeaways from the day included: a service mesh manages all service-to-service communication and provides dynamic service discovery, traffic management, and support for cross-cutting, nonfunctional requirements; there are clear benefits to adopting this new technology, but also tradeoffs, such as additional operational complexity; and service mesh technology is rapidly becoming part of the platform “plumbing”, with the interesting innovation in this space happening at the higher-level abstractions and the human-focused control planes.
Lin Sun, senior technical staff member at IBM, kicked off the day by presenting a timeline of service mesh developments. The timeline began in 2015 with the release of the Netflix OSS stack, which included Java/JVM-based microservice communication libraries such as Ribbon, Eureka, and Hystrix; used in combination, these libraries offered service mesh-like functionality. Other highlights on the timeline included the releases of Istio in May 2017, Linkerd 2.0 in July 2018, Consul Connect and SuperGloo in November 2018, the Service Mesh Interface (SMI) in May 2019, and Maesh and Kuma in September 2019.
InfoQ has been tracking the topic that we now call service mesh since late 2013, when Airbnb released SmartStack. This technology offered an out-of-process service discovery mechanism (using HAProxy) for the emerging “microservices”-style architecture. In late 2014, Netflix released Prana, a “sidecar” process that allowed application services written in any language to communicate via HTTP with standalone instances of the JVM-based Netflix OSS libraries mentioned previously. In 2016, the NGINX team began talking about “The Fabric Model”, which was very similar to a service mesh, but required the use of their commercial NGINX Plus product for implementation.
Next to take the ServiceMeshCon stage was William Morgan, CEO of Buoyant, the company behind the Linkerd service mesh and one of the first organisations to use the term “service mesh”. The primary message of the talk was that a service mesh is akin to platform “plumbing”: although vitally important, it is fundamentally a supporting technology and shouldn’t be a focal point in and of itself. Morgan announced the release of a new product from Buoyant, Dive, a SaaS-based “team control plane” for platform teams operating Kubernetes. Dive adds higher-level, human-focused functionality on top of the Linkerd service mesh, providing a service catalog, an audit log of application releases, a global service topology, and more.
Ana-Maria Calin, systems engineer at Paybase, and Risha Mars, software engineer at Buoyant, provided an overview of how to debug a service mesh. Service mesh technologies are new, and as such there are often edge-case issues or bugs hidden within releases, especially beta releases. As a service mesh sits on the critical path for satisfying all user requests, the ability to debug any related issues is essential, both for confirming that a bug is located within the mesh implementation (as opposed to the application code or another part of the platform) and for providing guidance to the maintainers of a service mesh project.
The core advice included creating a test environment that allows the bug to be reliably replicated, raising service mesh log levels to generate and collect more relevant information, capturing traffic (for example, via tcpdump and Wireshark), and using sidecar containers with debugging tools to inspect the applications and the mesh implementation, as sketched below.
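As a rough sketch of that advice in practice (the annotation, image, and names below are illustrative assumptions, not taken from the talk), a Linkerd-meshed pod could be annotated for debug-level proxy logging and given a packet-capture sidecar:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: orders
  annotations:
    # Linkerd 2.x annotation to raise the injected proxy's log level
    config.linkerd.io/proxy-log-level: debug
spec:
  containers:
  - name: app
    image: example/orders:1.0        # hypothetical application image
  # Debugging sidecar: containers in a pod share a network namespace,
  # so tcpdump here observes all traffic passing through the mesh proxy
  - name: debug
    image: nicolaka/netshoot
    command: ["tcpdump", "-i", "any", "-w", "/tmp/mesh.pcap"]
```

A capture written this way can be copied off the pod with kubectl cp and opened in Wireshark for offline analysis.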
Fuyuan Bie and Zhimeng Shi, software engineers at Pinterest, described how the platform team at Pinterest had created a custom mesh control plane, named Tower. The control plane is written in Go and manages a data plane of Envoy proxies that run alongside applications and handle all traffic, from the edge (ingress) through to all service-to-service communication. They suggested that creating a bespoke service mesh is not appropriate for all (or even many) organisations, but their project began before the current service meshes had become established. Pinterest also has a complicated mixture of applications developed using “C++/Java/Python/Node/Go/Elixir” that are deployed onto “thousands of clusters ranging from IaaS to Dockerized services to Kubernetes”.
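Tower’s internals were not shown in detail, but the split described is the standard Envoy pattern: each proxy boots with a minimal static configuration pointing at the control plane, then receives its listeners, routes, and clusters dynamically over the xDS APIs. A minimal, illustrative bootstrap (the control plane address and all names are assumptions) might look like this:

```yaml
# Illustrative Envoy bootstrap: the proxy fetches its configuration
# dynamically from a control plane (such as Tower) over the xDS APIs.
node:
  id: orders-sidecar          # hypothetical identifiers
  cluster: orders
dynamic_resources:
  lds_config:
    ads: {}                   # listeners delivered via the ADS stream
  cds_config:
    ads: {}                   # clusters delivered via the ADS stream
  ads_config:
    api_type: GRPC
    grpc_services:
    - envoy_grpc:
        cluster_name: xds_cluster
static_resources:
  clusters:
  - name: xds_cluster         # the only statically configured endpoint
    type: STRICT_DNS          # is the control plane itself
    connect_timeout: 1s
    http2_protocol_options: {}
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: tower-control-plane.example   # hypothetical address
                port_value: 18000
```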
The Pinterest team had four goals for the implementation of a service mesh: unification of traffic control across platforms and consistency of configuration; safety and velocity, enabling the fast and safe release of new functionality; enforcing transport security, for example, implementing TLS across all service-to-service communication; and enabling “responsibility segregation”, allowing developers to configure core communication properties while the platform and security teams override global properties where appropriate or required.
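Tower’s own configuration format was not shown, but as a purely illustrative comparison (not Pinterest’s implementation), the “TLS everywhere” goal is expressed in Istio as a single mesh-wide resource that a security team can own, independently of the per-service configuration managed by developers:

```yaml
# Illustrative only: mesh-wide mutual TLS enforcement in Istio,
# shown as a generic example of the "enforcing transport security" goal
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # the root namespace scopes this mesh-wide
spec:
  mtls:
    mode: STRICT            # reject plaintext service-to-service traffic
```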
In the afternoon sessions, Christian Posta, global field CTO at Solo.io, presented “The Truth about the Service Mesh Data Plane”. Service mesh technology offers many benefits, but as with the decision to adopt any new technology, the tradeoffs should be carefully analyzed. Referencing the work of the CNCF Universal Data Plane API working group, Posta explored the spectrum of service mesh data plane options: from an implementation in code or a library (e.g. the Netflix OSS stack), through an out-of-process sidecar proxy (e.g. Envoy), to a shared gateway per domain, and ultimately a single centralized gateway. He stated that engineers should learn about this spectrum and make an informed choice when implementing a service mesh.
Posta also provided a demonstration of implementing WebAssembly-based plugins for Envoy-based service meshes and gateways, and discussed the release of Solo.io’s WebAssembly Hub.
Michelle Noorali, senior software engineer at Microsoft, took to the stage to provide an update on the Service Mesh Interface (SMI). The SMI specification provides an abstraction layer on top of different service mesh implementations, with the goals of enabling tooling to be built on top of the interface and allowing compliant implementations to be swapped. She demonstrated traffic splitting, emitting traffic metrics, and configuring traffic access control using SMI alongside Istio, Linkerd, and HashiCorp’s Consul, broadly along the lines of the example below.
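For example, a canary release expressed against SMI’s TrafficSplit API looks broadly like the following (the service names and weights are illustrative, and the exact apiVersion has varied across SMI releases):

```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: checkout-canary
spec:
  service: checkout          # the root service that clients address
  backends:                  # weighted backends; the underlying mesh
  - service: checkout-v1     # (Istio, Linkerd, Consul) performs the
    weight: 900m             # actual traffic shifting (~90%)
  - service: checkout-v2
    weight: 100m             # ~10% canary traffic
```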
When polled during the event, approximately 70% of the self-selecting crowd of ~300 attendees indicated that they had experimented with a service mesh. Fewer than 10 attendees indicated that they were using a service mesh in production.
More details on all the ServiceMeshCon sessions can be found on the event website, and the CNCF has uploaded all of the ServiceMeshCon presentation recordings to their YouTube channel.