The Envoy Proxy has taken the cloud native platform world by storm. Practically every large cloud vendor has integrated Envoy into their stack, and many end user organisations are leveraging this proxy within their platform, either at the edge or within a service mesh.
In this podcast, we sit down with Matt Klein, software plumber at Lyft and creator of Envoy, and discuss topics including the continued evolution of the popular proxy, the strength of the open source Envoy community, and the value of creating and implementing standards throughout the technology stack. We also explore the larger topic of cloud native platforms, and discuss the tradeoffs between using a simple and opinionated platform and something that is bespoke and more configurable, but also more complex. Related to this, Matt shares his thoughts on when and how to make the decision within an organisation to embrace technology like container orchestration and service meshes.
Finally, we explore the creation of the new Envoy Mobile project. The goal of this project is to expand the capabilities provided by Envoy all the way out to mobile devices powered by Android and iOS. For example, most current user-focused traffic shifting that is conducted at the edge is implemented with coarse-grained approaches via BGP and DNS, and using something like Envoy within mobile app networking stacks should allow finer-grained control.
Key Takeaways
- The Envoy Proxy community has gone from strength to strength over the last year, from the inaugural EnvoyCon that ran alongside KubeCon NA 2018, to the increasing number of code contributions from engineers working across the industry.
- Attempting to create a community-driven “universal proxy data plane” with clearly defined APIs, like Envoy’s XDS API, has allowed vendors to collaborate on a shared abstraction while still allowing room for “differentiated success” to be built on top of this standard
- Google’s gRPC framework is adopting the Envoy XDS APIs, as this will allow both Envoy and gRPC instances to be operated via a single control plane, for example, Google Cloud Platform’s Traffic Director service.
- There is a tendency within the software development industry to fetishise architectures that are designed and implemented by the unicorn tech companies, but not every organisation operates at this scale.
- However, there has also been industry pushback against the complexity that modern platform components like container orchestration and service meshes can introduce to a technology stack. Adopting these platform components provides the best return on investment when an organisation's software architecture and development teams have reached a certain size.
- Function-as-a-Service (FaaS)-type platforms will most likely be how engineers interact with software in the future. Business-focused developers often do not want to interact with the platform plumbing.
- Envoy Mobile is building on prior art, and aims to expand the capabilities provided by Envoy all the way out to mobile devices using Android and iOS. Most current end user traffic shifting is implemented with coarse-grained approaches via BGP and DNS, and using something like Envoy instead will allow finer-grained control.
- Using Envoy Mobile in combination with Protocol Buffers 3, which supports annotations on APIs, can facilitate working with APIs offline, configuring caching, and handling poor networking conditions. One of the motivations for this work is that small increases in application response times can lead to better business outcomes.
Show Notes
What has been happening in the last year? -
- 04:10 So much happens in this space - on the Lyft side we've been focussing internally on various efforts including the IPO.
- 04:25 My focus internally has mostly been on reliability, which has been interesting.
- 04:35 Part of that effort has been the Envoy Mobile project.
There was EnvoyCon, which was a great community effort, and also had a diverse range of speakers and topics. -
- 05:05 When we thought about putting together the conference, putting together the CFP, I was concerned that no-one was going to submit any proposals.
- 05:15 It would have been embarrassing if we had a conference and no proposals!
- 05:35 We had 70 or 80 proposals for our first year single-track conference.
- 05:40 Most of them were really fantastic proposals. We had a room for 250-300 people and it sold out almost immediately.
- 05:50 We then moved to a bigger room and sold that out.
- 05:55 I was blown away by the conference - so much fun.
Has the decision to not create a company based around Envoy been freeing? -
- 06:20 I think it's less about being free, and more about being able to focus on technical purity - we've been able to do an incredible job of making technology-first decisions.
- 06:50 We don't have to worry about changes hurting someone's business or having an impact from an income perspective.
- 07:00 We've been able to do an increasingly good job of having a lot of people who are competitors collaborate on the project.
- 07:10 We see cloud providers doing code review for each other, or people who are starting companies based on Envoy collaborating with each other.
- 07:30 We've been able to do that because we've built the project in a way that allows others to have success on top, through extensibility or APIs.
- 07:45 That has allowed people to come and collaborate on the data plane, which is becoming a commodity thing, and also compete on other areas.
- 07:55 That creates a very collaborative community, and there's some truth that because I don't have a company myself, it's easier for me to act as an independent human router instead of commercialising those interactions.
We had a great keynote at QCon SF on cultivating the environment for success. -
- 08:50 I don't want to claim that I'm making all of the technical decisions, but I do think there's something to be said for setting the tone of how people collaborate or communicate.
- 09:15 I've been told many times over the last few years that people who had sworn off working with certain open-source communities come to work with Envoy and are excited to collaborate.
- 09:35 That's one of the things I'm most proud of - you create this situation where you could not pay all of the people working on Envoy to work at a single company - it would be impossible.
- 09:50 There might be geographical or financial reasons, and yet all these people are working together on the same thing.
- 10:15 There are times when I look at the team, and it's the best team I've ever worked with from a raw talent perspective.
- 10:25 They work together so well, and they work across all these different companies.
Can you break down one of your recent tweet streams about complexity within software development? -
- 11:05 There are things that I like about Twitter; it gives me a pulse of what's going on in the larger community.
- 11:20 One of the themes that I've seen is pushback against complexity, asking whether Kubernetes or Service Mesh is too complicated.
- 11:35 What I was trying to say in the thread is that it is too complicated, for many people.
- 11:50 We aren't building technology for technology's sake - there is some vertical business, which might be ride-sharing, search, or shopping.
- 12:10 If you focus on doing only what that business needs for that particular vertical it can be relatively simple.
- 12:25 Because we're in this hype-driven environment, driven by Twitter and conferences, people see some of the technical architectures at these large scale companies.
- 12:45 In the cloud native space these days, there is a growing community that needs to sell software to people to justify their company's existence.
- 13:00 There are people that need this software - but we're in this tumultuous period where we have a lot of signal and noise from big-company thought leadership and vendor marketing.
- 13:15 There are cases in which we have companies and organisations that might be inclined to adopt technical solutions whose complexity is not required for the business at this time.
- 13:30 What I was describing to people is: start with what you need, and if you're lucky enough to grow to a size where you need a microservices architecture and are having scaling problems, you might need to use Kubernetes or a "service mesh" to solve networking problems.
- 13:55 That's where I think there is some confusion - there's pushback about using service mesh.
- 14:00 The pushback is misplaced; it's not that a service mesh is not needed, but that it is needed only at a particular scale.
- 14:10 Organisations need to be honest when they're at the scale where they need to invest in microservices, service mesh, container orchestration or Kubernetes.
Have you got any advice for a tech lead about when to start looking around for technology like Kubernetes or service mesh? -
- 14:50 The answer I'm going to give you is entirely subjective, not data-driven.
- 15:00 For me, what I've seen, is that it starts to come down to how many engineers are working on the same product.
- 15:15 What I've seen many times is that simple solutions (a monolith, a PaaS, or a single database) are going to work well up to a certain size of development team.
- 15:40 Typically, as companies find success and their physical scale grows, the number of features and developers grows with it.
- 15:55 There are exceptions; when Instagram sold they had 12 people in their ops team.
- 16:00 Typically, as scale grows, the number of people grow.
- 16:10 When the number of people grow, then for social reasons, a microservice architecture evolves.
- 16:15 That tends to be when the problems occur - around orchestration, packaging, versioning, networking and observability.
- 16:25 For me, that number is subjective, but at around 80-100 engineers, when a company is predicting that its engineering team will double in size over the next couple of years, that's when to start preparing.
- 16:45 If you want to be ahead of the curve, the time to start looking at some of the technologies is then, because it's better than being behind the curve.
- 16:55 Many companies have been in that painful place.
- 17:00 There's a sweet spot, looking at current head count, number of services, plans around adopting microservices architectures, what the personnel growth looks like.
- 17:15 When I'm seeing companies in the 100 developer range, and are starting to look at microservices, that is when I would start looking at some of these things.
What are your thoughts on standardisation versus evolution driven by market dynamics? -
- 18:00 In isolation, standards are good - we can have common tooling, we can write to a particular set of interfaces or abstractions that can work across providers.
- 18:15 In reality, it typically takes a lot of time to converge on a particular standard - perhaps not at the lower layers like networking, which tend to be simpler and are driven by proof of concepts and IETF standards like QUIC.
- 18:45 When it comes to a standard like Kubernetes or containers or service mesh or SMI - these are harder.
- 18:55 The reason it's harder is that there's typically a lot more competition in the market space for adding features.
- 19:05 What we see with the ingress specification in Kubernetes is that the specification ends up providing the lowest common denominator.
- 19:15 You see deployments with the nginx ingress controller with a lot of nginx annotations, and if you're using some other controller there are a lot of that controller's annotations (see the sketch after this list).
- 19:25 My fear from the SMI perspective is that it will end up in a similar situation.
- 19:30 I'm not against it - I don't think it can hurt, so we might as well try.
- 19:40 If you wind up in a situation where everyone is adding a lot of custom annotations, then you're effectively locked in to whatever thing they have chosen.
- 19:55 Maybe over time some of those things go back into the standard, but I'm less convinced that it will end up being effective.
- 20:00 At the same time, I don't think it can hurt, so no harm in trying.
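To make the annotation point above concrete, here is a minimal sketch of an Ingress resource built with the Kubernetes `networking/v1` Go types. The service name ("rides") and host are hypothetical; the nginx-ingress annotation keys are real ones. The portable part of the spec covers only host, path and backend, so controller-specific behaviour tends to spill into annotations, which is where the lock-in creeps in.

```go
package example

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// rideIngress is a hypothetical Ingress for a "rides" service. The portable
// part of the resource (host, path, backend) is the lowest common
// denominator; richer behaviour such as rewrites, timeouts and rate limits
// ends up in nginx-ingress-specific annotations that other controllers
// simply ignore.
func rideIngress() *networkingv1.Ingress {
	pathType := networkingv1.PathTypePrefix
	return &networkingv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{
			Name: "rides",
			Annotations: map[string]string{
				// nginx-ingress specific behaviour, meaningless to other controllers.
				"nginx.ingress.kubernetes.io/rewrite-target":     "/",
				"nginx.ingress.kubernetes.io/proxy-read-timeout": "30",
				"nginx.ingress.kubernetes.io/limit-rps":          "50",
			},
		},
		Spec: networkingv1.IngressSpec{
			Rules: []networkingv1.IngressRule{{
				Host: "api.example.com",
				IngressRuleValue: networkingv1.IngressRuleValue{
					HTTP: &networkingv1.HTTPIngressRuleValue{
						Paths: []networkingv1.HTTPIngressPath{{
							Path:     "/rides",
							PathType: &pathType,
							Backend: networkingv1.IngressBackend{
								Service: &networkingv1.IngressServiceBackend{
									Name: "rides",
									Port: networkingv1.ServiceBackendPort{Number: 80},
								},
							},
						}},
					},
				},
			}},
		},
	}
}
```

Moving to a different ingress controller would leave the spec untouched but require rewriting every annotation, which is the lowest-common-denominator problem Matt describes.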
Would involving more end users in the creation of the specifications be useful? -
- 20:45 I don't know that you're going to find end users that are going to spend their time developing a spec for things that don't exist.
- 20:50 You're in a catch-22 situation: sometimes you need vendors to come together and create this kind of spec.
- 21:05 I think it's a worthwhile experiment, but I don't know what proportion of people the specification is going to be useful for.
- 21:25 If I compare it to the Envoy XDS APIs, which are effectively driven by end users (some of whom are now vendors as well), we are now trying to standardise some of them.
- 21:55 There's no easy answer - there are APIs like XDS which are evolving in production and being used in real-world use-cases.
- 22:05 Something like SMI is a little more theoretical - I can see its purpose and hopefully it will be successful, but it's unclear.
What does the working group for the Universal Data Plane SIG look like - is it CNCF led? -
- 22:30 We're doing it under the CNCF umbrella, but what happened is that we developed the APIs called XDS for our Envoy proxy.
- 22:40 gRPC is adopting those APIs, which is pretty cool - gRPC is Google's RPC framework built on protocol buffers, and it supports lookaside load balancing with a central control plane.
- 23:00 Google is interested in having both the gRPC library and Envoy be able to talk to the same control plane so they can be managed by the same system.
Is that related to Google Cloud Platform's Traffic Director service? -
- 23:25 Yes; they want to be able to use Traffic Director to be able to direct traffic over both gRPC and Envoy.
- 23:40 We realised that there were portions of the Envoy API that were Envoy specific; we weren't thinking about other load balancers.
- 23:50 We decided that if we wanted to be serious about doing the API we need to look holistically at the universal portion and the Envoy specific portion.
- 24:05 We had our first kick-off meeting for the group, and I was happy to have the HAProxy folks come (they aren't committed at this point).
- 24:20 We are committed to working this out; if people beyond Envoy and gRPC want to figure out how to adopt the data plane API with XDS, we would like that to happen.
- 24:40 If you have a standard proposal, like SMI, and you want people to be able to bring their own solutions to the data plane layer, this is the way to do that.
- 24:50 If we can standardise XDS, even just across HAProxy, Envoy, and gRPC - you'll start to see an ecosystem of vendors that can write to a particular input/output API and get metrics from it (see the sketch after this list).
- 25:15 At that point it doesn't matter whether it's Envoy or HAProxy, and you can have tools for analysis, DDoS protection and so on.
- 25:30 We have to be honest with ourselves; there's going to be a standard API and things that are extra.
- 25:40 For instance, if one load balancer supports traffic tapping and another doesn't, and a vendor wants to build a traffic tapping product, how do we negotiate over the standard API whether that capability is supported?
- 25:55 It's a fascinating topic and something that we need to explore, but there isn't a grand plan and we're not going to write a 400 page spec - we're going to stay productive and pragmatic.
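The "one control plane, many data planes" idea is easier to picture with an example of the kind of resource such a control plane serves. Below is a minimal sketch of an XDS endpoints resource (a `ClusterLoadAssignment`, served over EDS), built with the generated types from the go-control-plane project. The cluster name "pricing" and the addresses are hypothetical; in a real deployment a management server would stream this same resource to Envoy proxies and, as gRPC adopts XDS, to gRPC clients as well.

```go
package example

import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
)

// pricingEndpoints builds the EDS resource for a hypothetical "pricing"
// cluster. Any XDS-speaking client - an Envoy proxy or, increasingly, a
// gRPC library - can consume this resource from the same control plane and
// load balance over the listed hosts without touching DNS.
func pricingEndpoints() *endpointv3.ClusterLoadAssignment {
	hosts := []string{"10.0.0.10", "10.0.0.11"}

	var lbEndpoints []*endpointv3.LbEndpoint
	for _, host := range hosts {
		lbEndpoints = append(lbEndpoints, &endpointv3.LbEndpoint{
			HostIdentifier: &endpointv3.LbEndpoint_Endpoint{
				Endpoint: &endpointv3.Endpoint{
					Address: &corev3.Address{
						Address: &corev3.Address_SocketAddress{
							SocketAddress: &corev3.SocketAddress{
								Address: host,
								PortSpecifier: &corev3.SocketAddress_PortValue{
									PortValue: 8080,
								},
							},
						},
					},
				},
			},
		})
	}

	return &endpointv3.ClusterLoadAssignment{
		ClusterName: "pricing",
		Endpoints: []*endpointv3.LocalityLbEndpoints{{
			LbEndpoints: lbEndpoints,
		}},
	}
}
```

The standardisation question in the discussion is essentially about which of these resource types and fields are universal, and which remain Envoy-specific extensions.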
Yeah, I guess that if you disrupt a revenue stream for a proxy vendor, that's going to be a hard sell. -
- 26:25 Yes - and we've seen that with some of the other proxies.
- 26:30 Envoy has disrupted the software proxy scene, and I'm proud of that.
- 26:35 I'm also excited about the competition, just because it makes everything better.
- 26:45 I was excited to see the HAProxy 2.0 blog post [https://www.haproxy.com/blog/haproxy-2-0-and-beyond/] - they are getting into more of the cloud native aspects.
- 26:50 Competition improves the whole industry, and it pushes us into some of these common APIs.
Would it be possible to use Envoy to combine and integrate FaaS platforms, or create an open platform? -
- 27:40 I am actually a big believer that functional platforms are the future of how people are going to interact with software.
- 27:50 On Twitter, I said that business application developers typically want to do a few things: read and write from a database, call APIs, and deal with events and streaming data.
- 28:15 There's a huge mess which we expose to everyone today about load balancing and service discovery.
- 28:25 I think the functional paradigm makes it better.
- 28:35 The problem is that a hosted functional platform like Heroku or Fargate or Azure Functions is a very opinionated platform.
- 28:50 You have to give them the code and write functions in a certain way and call certain APIs, and that's just the way that it works.
- 29:00 In the industry, we're seeing this struggle that the more knobs and flexibility you give people, the more complex it is.
- 29:10 The more opinionated it is, and the less things you can call, the less complex it is.
- 29:20 When you ask "will there be an open functional platform" that can be hosted locally, maybe with Envoy, that will happen.
- 29:35 At the same time, if you're the big three cloud providers, something like Fargate is the future - not providing VMs on EC2.
- 29:55 The vast majority of people who want to write distributed systems just want to write some code, and the more opinionated you are, the more you help people out of the box.
- 30:10 The issue is that the cloud provider solutions we see today aren't going to be able to run Google.com.
- 30:35 In a ten year timeframe, those systems are going to be able to run a Google.com or a large real-time platform, because that's where the development is going.
- 30:50 If I can run a real-time system like Lyft on a real time functional platform without worrying about containers, and I can just provide code into a CI/CD pipeline with a managed substrate, that's the holy grail.
With Envoy Mobile, am I going to be carrying Envoy around with me on my cell phone? -
- 31:30 You will if you're using Lyft - though I can't speak for anyone else.
- 31:35 Reaction so far has been very good.
- 31:45 There is quite a bit of precedent on using cross-platform networking code on mobile.
- 31:50 Google has a cross-platform networking library called Cronet, which is a C++ library using the same code as Chrome on the server and browser side.
- 32:00 They package a bunch of the networking code for both Android and iOS.
- 32:05 All of Google's networking code uses this library, and third party companies like Snapchat also use it.
- 32:15 The reason that people use it on the server side is that they want consistent networking across platforms.
- 32:30 Why would you want to write a different stack to use QUIC on mobile when you can use a common core, regardless of what platform or version you are on?
- 32:40 You also want to use modern TLS standards like 1.3; you don't want to wait for everyone to get on a phone that natively supports it.
- 32:50 We feel there's a strong precedent - Facebook also does this (but they haven't open-sourced it).
- 33:05 We think that Envoy Mobile can do what those libraries do - consistent networking and retry policy - and we also feel it can do a lot of other Envoy stuff like analytics, tracing and metrics.
- 33:30 Having the XDS or management APIs being able to expand to mobile means that you can do real-time traffic shaping, load shedding and fault tolerance that doesn't rely on DNS.
- 33:45 If you look at state of the art mobile traffic shaping for high scale mobile applications, it's typically related to BGP or DNS - that's what it comes down to.
- 34:00 There are other mechanisms, where the app might be aware of other regions, but typically it boils down to BGP- or DNS-based traffic shifting.
- 34:10 XDS would give us finer granularity to apply policy to mobile phones and users, sending them to different data centres for performance or load-shedding reasons (see the sketch after this list).
- 34:30 You could imagine applying authentication, security, switching routes on the fly - there's a lot of opportunity there.
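To give a sense of what finer-grained, non-DNS traffic shifting could look like, here is a minimal sketch of an XDS route configuration that shifts 5% of traffic to an alternate data centre, again using the go-control-plane generated types. The cluster names and domain are hypothetical, and this is a control-plane-side resource rather than Envoy Mobile client code (whose APIs are Kotlin and Swift); the point is that a control plane could push a policy like this to devices instead of waiting on DNS TTLs or BGP convergence.

```go
package example

import (
	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// mobileRoutes is a hypothetical RDS resource that sends 5% of API traffic
// to a secondary data centre. Updating the weights in the control plane
// takes effect on the next XDS push, with no DNS or BGP changes involved.
func mobileRoutes() *routev3.RouteConfiguration {
	return &routev3.RouteConfiguration{
		Name: "mobile_api_routes",
		VirtualHosts: []*routev3.VirtualHost{{
			Name:    "api",
			Domains: []string{"api.example.com"},
			Routes: []*routev3.Route{{
				Match: &routev3.RouteMatch{
					PathSpecifier: &routev3.RouteMatch_Prefix{Prefix: "/"},
				},
				Action: &routev3.Route_Route{
					Route: &routev3.RouteAction{
						ClusterSpecifier: &routev3.RouteAction_WeightedClusters{
							WeightedClusters: &routev3.WeightedCluster{
								Clusters: []*routev3.WeightedCluster_ClusterWeight{
									{Name: "api_dc_primary", Weight: wrapperspb.UInt32(95)},
									{Name: "api_dc_secondary", Weight: wrapperspb.UInt32(5)},
								},
							},
						},
					},
				},
			}},
		}},
	}
}
```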
What about the IoT space? -
- 34:45 I took a 3-4 month break at Lyft and worked on scooters - I shipped Lyft for scooters.
- 34:55 I did the networking on the firmware for the scooter board.
- 35:05 The IoT space is very interesting - a lot of people who are used to phones and big computers have an impression that Linux runs everywhere and has megabytes of RAM.
- 35:25 There's a category of IoT that is running Linux and is capable of running Envoy Mobile, but there are many IoT devices that are running very small and limited real-time operating systems which aren't big enough to run Envoy.
- 35:50 I think in the IoT/Linux space, it is conceivable that there is interest for the same reasons.
- 36:05 If they're running Linux on an SoC then they are probably doing something more complicated that might benefit from an Envoy-like solution.
- 36:10 A device running a little real-time OS on a Cortex M4 with 128k of RAM is probably doing something simple.
- 36:25 Right now we're targeting Android and iOS - talking to other users, there are a lot of common problems that people are having.
- 36:35 There's the policy, the common networking, and the third thing we can do is to move to protobuf APIs.
- 37:00 You can attach annotations onto those APIs, and you could imagine how you could start to provide offline APIs or caching, beyond the HTTP caching available today.
- 37:20 Let's say that I wanted to do caching of a streaming API - for example, at Lyft we might have a streaming API that is providing pricing data (I'm making this up).
- 37:30 Let's say I open a streaming connection to Lyft to get streamed pricing data, and every so often the server sends updated data.
- 37:40 If the user backgrounds the app and comes back within 30 seconds, I have to start back up again and request the pricing data again.
- 37:50 What if I could apply a caching policy that allows me to get the last stream's chunk if it's within a minute, while I open a new stream and update the data.
- 38:10 These are caching policies that aren't available in the HTTP caching specification, but imagine if we could do it with annotations on streaming APIs (see the sketch after this list).
- 38:20 You could imagine annotations on APIs where the data is required for the app to work, and additional data which improves the experience but isn't required for correct operation.
- 38:35 If I'm in poor networking situations, maybe I can automatically drop the non-required data.
- 38:50 If you have a networking stack that can run end-to-end, and understand networking annotations and status, you can improve the responsiveness of the client.
- 39:00 We see time and time again that improving responsiveness of the application leads to better business outcomes.
- 39:10 Users are notoriously fickle; if things don't load, they will typically leave.
- 39:20 This is where technologies like QUIC come from, trying to make the application perform better at the core networking level, but we think we can do this at the API level as well.
- 39:30 Caching, handling poor networking conditions, deferred connections and so on can be implemented at the Envoy layer, so that you don't have to build them in both iOS and Android individually.
- 39:50 Obviously there are concerns about battery life and code size, but from the early data we've seen, we think we can make it work.
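To illustrate the idea being described, here is a minimal sketch of the kind of per-RPC policy that annotations on a protobuf API could drive in a shared client networking layer. Everything here is hypothetical - Envoy Mobile does not ship these annotations today, and the method names, policy fields and thresholds are made up - but it shows how a single annotated definition could replace hand-coded caching and degraded-network behaviour in both the iOS and Android apps.

```go
package example

import "time"

// StreamPolicy is a hypothetical per-RPC policy of the kind that could be
// derived from annotations on a protobuf API definition, rather than being
// hand-coded separately in the iOS and Android apps.
type StreamPolicy struct {
	// How long the last streamed chunk may be replayed from cache when the
	// app returns from the background, before a fresh stream must be opened.
	CacheMaxAge time.Duration
	// Whether the data is required for the app to function at all, or only
	// improves the experience; optional data can be dropped on poor networks.
	RequiredForOperation bool
}

// policies maps fully qualified RPC methods to their annotated behaviour.
// The method names and values are illustrative only.
var policies = map[string]StreamPolicy{
	"/pricing.v1.PricingService/StreamPrices": {
		CacheMaxAge:          time.Minute,
		RequiredForOperation: true,
	},
	"/marketing.v1.PromoService/StreamBanners": {
		CacheMaxAge:          0, // never replayed from cache
		RequiredForOperation: false,
	},
}

// shouldReplayFromCache decides whether the shared networking layer can hand
// the app the last cached chunk while it re-opens the stream, based on the
// method's policy and how stale the cached data is.
func shouldReplayFromCache(method string, age time.Duration) bool {
	p, ok := policies[method]
	return ok && p.CacheMaxAge > 0 && age <= p.CacheMaxAge
}

// shouldFetch decides whether a request is worth making at all when the
// device reports a degraded network.
func shouldFetch(method string, networkDegraded bool) bool {
	p, ok := policies[method]
	if !ok {
		return true // no annotation: fall back to normal behaviour
	}
	return !networkDegraded || p.RequiredForOperation
}
```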
How can people get involved? -
- 40:15 This is entirely in the open: https://github.com/lyft/envoy-mobile - there's a page at https://envoy-mobile.github.io as well.
- 40:25 It's early days; we're targeting an alpha for Lyft employees in the next 8-10 weeks, so we're trying to move this along aggressively.
- 40:40 We open-sourced this because we think there's a need for this in the industry, and we're being transparent where we are at with the project.
- 40:45 If people are interested, we'd love to have you come and collaborate and look at issues or design, so get involved if you want.