At the inaugural EnvoyCon, held in Seattle, USA, the eBay engineering team talked about running Envoy at the "edge" of their network as a replacement for their existing hardware-based load balancers. Key learnings included: testing is vital in order to verify performance characteristics across a range of real-world scenarios; migration between old and new edge systems must be carefully controlled, from both an organisational and a technical perspective; and having a "programmable edge" provides many advantages, but also presents several challenges.
Bala Madhavan and Qiu Yu, both software engineers (members of technical staff) at eBay, began the talk by explaining that eBay runs its data centers in the US, but operates points of presence (PoPs) globally in order to reduce latency and provide better performance for end users. They use a "software powered infrastructure", with a Kubernetes cluster in each PoP and a "north-south" gateway running at the edge of each cluster that is responsible for managing all external ingress and egress traffic.
The eBay edge team runs Envoy within multiple containers in each Kubernetes cluster. Readiness and liveness probes ensure that containers are running as expected, and that any dynamic recovery process completes successfully. Layer 7 routing and deployment control are managed via bespoke Kubernetes Custom Resource Definitions (CRDs), and Ingress annotations are used for custom feature specifications. They have implemented a custom discovery service that is responsible for deployment and routing, and that handles the following Envoy xDS management APIs: the listener discovery service (LDS), route discovery service (RDS), cluster discovery service (CDS), and endpoint discovery service (EDS). A minimal sketch of such a control plane is shown below.
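The talk did not detail eBay's control-plane implementation, but the xDS pattern described is commonly built on the open source go-control-plane library. The following Go sketch, which assumes a recent version of that library and uses hypothetical node and cluster names, shows the core idea: the control plane publishes a versioned snapshot of resources, and Envoy pulls them via the xDS APIs.

```go
package main

import (
	"context"
	"log"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

func main() {
	// Snapshot cache keyed by Envoy node ID (ADS disabled, no custom logger).
	snapshots := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, nil)

	// A versioned snapshot holding one (deliberately minimal) cluster; a real
	// deployment would populate listeners (LDS), routes (RDS), and endpoints
	// (EDS) in the same map, and give the cluster a type, timeout, and so on.
	snap, err := cachev3.NewSnapshot("v1", map[resourcev3.Type][]types.Resource{
		resourcev3.ClusterType: {&clusterv3.Cluster{Name: "backend"}}, // hypothetical name
	})
	if err != nil {
		log.Fatal(err)
	}

	// Any Envoy instance identifying itself as "edge-node" now receives "v1".
	if err := snapshots.SetSnapshot(context.Background(), "edge-node", snap); err != nil {
		log.Fatal(err)
	}
}
```

Serving the snapshot to Envoy over gRPC (via the library's server package) is omitted here for brevity.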
Migration from a hardware-based load balancer implementation to an Envoy-powered edge solution has been gradual. The initial step of the migration consisted of identifying and closing any feature gaps between the existing solution and Envoy, with the eBay team contributing the associated code upstream to the open source Envoy project. Additional support was added for integration with a proprietary certificate management system, and several customisations were made to the data plane (which have not yet been contributed upstream). These customisations handle connection prefetching, allow requests to certain routes to be dropped or reset based on bespoke requirements, and provide a "default upstream cluster" for graceful error handling (sketched below). Support for BBR, Google's TCP congestion control algorithm, has also been added, as have unspecified "SSL and TCP optimisations".
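The talk did not show how the "default upstream cluster" is implemented, but the idea can be illustrated in a few lines of Go using only the standard library: requests that match known routes go to the primary upstream, and everything else falls through to a default backend that can serve a graceful response. The hostnames, ports, and paths here are placeholders.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical upstreams: a primary cluster and a catch-all default.
	primary, _ := url.Parse("http://primary.internal:8080")
	fallback, _ := url.Parse("http://default.internal:8080")

	mux := http.NewServeMux()
	mux.Handle("/api/", httputil.NewSingleHostReverseProxy(primary)) // known routes
	mux.Handle("/", httputil.NewSingleHostReverseProxy(fallback))    // graceful default

	log.Fatal(http.ListenAndServe(":8443", mux))
}
```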
Madhavan and Yu presented response time graphs demonstrating that the new Envoy-powered implementation currently performs better than the previous hardware load balancer. With the addition of a caching layer implemented as an Envoy HTTP filter that supports dynamic object caching -- using external cache stores, with Envoy's RingHash load balancer sharding cached data across them -- the time-to-first-byte (TTFB) improved by an order of magnitude.
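Envoy's RingHash load balancer implements consistent hashing internally; the self-contained Go sketch below shows the underlying mechanics of how such a policy pins a cached object's key to one of several cache stores, so that adding or removing a store only remaps a small fraction of keys. The store names, key format, and replica count are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// ring is a minimal consistent-hash ring: each store owns several points on
// the ring, and a key belongs to the first point clockwise from its hash.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(stores []string, replicas int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, s := range stores {
		for i := 0; i < replicas; i++ {
			p := hash(s + "#" + strconv.Itoa(i)) // virtual node for smoother spread
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// storeFor returns the cache store responsible for the given object key.
func (r *ring) storeFor(key string) string {
	h := hash(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"cache-a", "cache-b", "cache-c"}, 100)
	for _, key := range []string{"/item/1", "/item/2", "/item/3"} {
		fmt.Println(key, "->", r.storeFor(key))
	}
}
```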
In regard to observability, a Prometheus cluster is run in each PoP, and a central Grafana instance is used for visualisation. Alerting is implemented using Alertmanager and external checks, along with static and dynamic thresholds for anomaly detection. The entire edge stack is installed and configured within Kubernetes via a Helm chart that acts as the single source of truth for deployments.
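The talk did not cover which custom metrics, if any, eBay exports beyond Envoy's built-in stats, but a PoP-local Prometheus typically scrapes both Envoy's own Prometheus endpoint and any sidecar exporters. As an illustration, the Go snippet below uses the official client_golang library to publish a hypothetical readiness gauge, labelled by PoP, on a /metrics endpoint.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// edgeReady is a hypothetical gauge a PoP-local exporter might publish
// alongside Envoy's own stats; each PoP's Prometheus scrapes it, and the
// central Grafana queries across PoPs.
var edgeReady = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "edge_proxy_ready",
		Help: "1 if the edge proxy container passed its readiness probe.",
	},
	[]string{"pop"},
)

func main() {
	prometheus.MustRegister(edgeReady)
	edgeReady.WithLabelValues("ams1").Set(1) // hypothetical PoP label

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```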
Summarising the key lessons learned, Madhavan and Yu placed priority on the need for testing. Hardware and software cross-validation is important, and real user monitoring (RUM) is essential in order to ensure that end users encounter no performance degradation. Data center (DC) and PoP failovers should be tested, and additional chaos experiments also run. The lessons learned for "smooth workload migration" included a recommendation to use cross-functional (team) planning and verification, and to shift traffic between edge solutions in a controlled manner while observing operational metrics closely.
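The talk did not describe the mechanism eBay used to shift traffic, but the controlled approach can be sketched as simple weighted routing: a percentage weight determines what share of requests reaches the new Envoy edge, and operators only raise the weight while operational metrics stay healthy. Everything in this Go sketch, including the backend names, is illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shifter sends a configurable share of traffic to the new edge solution;
// the weight is raised gradually while metrics are observed.
type shifter struct {
	envoyWeight int // percentage of traffic (0-100) sent to the Envoy edge
}

func (s *shifter) route() string {
	if rand.Intn(100) < s.envoyWeight {
		return "envoy-edge"
	}
	return "hardware-lb"
}

func main() {
	s := &shifter{envoyWeight: 10} // start small, e.g. 10%
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[s.route()]++
	}
	fmt.Println(counts) // roughly 10% envoy-edge, 90% hardware-lb
}
```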
Challenges noted included the scaling of operations when increasing the number of PoPs running the new system, detecting configuration drift across PoPs, and providing tooling support for cluster management. Implementing a "programmable edge" with the API-driven Envoy proxy has provided many advantages, but challenges were also encountered. Examples included the "build versus buy" decision for the control plane, integration with existing (legacy) components, deciding how to act on metrics and associated performance data, and determining how operations teams with hardware expertise should debug and troubleshoot a software-based solution.
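Configuration drift detection across PoPs was not demonstrated in the talk; one simple approach is to fingerprint each PoP's rendered configuration and compare it against the hash of the source of truth (for example, the Helm-rendered manifests). The configurations and PoP names in this Go sketch are stand-ins.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint returns a short, stable hash of a rendered configuration.
func fingerprint(config []byte) string {
	sum := sha256.Sum256(config)
	return hex.EncodeToString(sum[:8]) // a short prefix is enough for comparison
}

func main() {
	// Hash of the Helm-rendered source of truth (stand-in content).
	expected := fingerprint([]byte("listener: 443\ncluster: backend\n"))

	// Configurations as actually deployed in each PoP (stand-in content).
	popConfigs := map[string][]byte{
		"ams1": []byte("listener: 443\ncluster: backend\n"),
		"sjc1": []byte("listener: 443\ncluster: backend-v2\n"), // drifted
	}

	for pop, cfg := range popConfigs {
		if got := fingerprint(cfg); got != expected {
			fmt.Printf("config drift in %s: %s != %s\n", pop, got, expected)
		}
	}
}
```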
The PDF slides for "Running Envoy as an Edge Proxy" can be found on the EnvoyCon Sched page, and the recording of the presentation can be found on the CNCF YouTube channel.