Uber replaced its legacy architecture, built on the WAMP protocol, with a new solution that takes advantage of GraphQL subscriptions. The main drivers for creating a new architecture were challenges around reliability, scalability, and observability/debuggability, as well as technical debt impeding the team’s ability to maintain the existing solution.
Uber has a large customer base divided into many personas (riders, drivers, eaters, couriers, and merchants). The company supports live (phone, chat) and non-live (inApp messaging) support channels. Historically (between 2019 and early 2023), only 1% of customer interactions (contacts) were served by the live chat channel, while 58% were served via inApp messaging (non-live).
Avijit Singh, staff software engineer at Uber, explains the importance of the live support channel for effectively and cost-efficiently handling customer inquiries:
[Live chat] channel is in the sweet spot for Uber, as it has a good CSAT score (customer satisfaction rating, measured in the range of 1 to 5) while generally reducing CPC (cost per contact). This channel allows for a higher automation rate, higher staffing efficiency (as agents can work on multiple chats at the same time), and high FCR (first contact resolution), which are all beneficial to Uber while providing quality support for customers.
The company wanted to move at least 40% of all customer contacts to live chat support, but the existing architecture suffered from many issues and limitations, including 46% of customer messages not being delivered in time, a lack of observability and health monitoring, very low throughput (10 tps), and an inability to scale horizontally. Lastly, the WAMP protocol and libraries were deprecated, resulting in poor insight and debugging capabilities, while the upgrade path remained challenging.
The Legacy Architecture (Source: Uber Engineering Blog)
The team set out to create a new architecture that could initially scale to 40% of the overall contact volume and subsequently to 80% within a year. The architecture was meant to support observability and debuggability from the outset, favor stateless services, and achieve a message-delivery success rate of over 95.5%.
The new architecture consists of the front-end UI used by agents and a few back-end microservices. Chat Channel Manager, Contact Service, and Router are involved in accepting contacts and assigning them to available agents. Agent State Service is responsible for tracking agents’ active contact sessions. Services publish events and notifications to Apache Kafka, keeping routing and agent-state tracking in sync, as sketched below.
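A minimal sketch of how a service could publish an agent-state event to Kafka for the routing side to consume; the topic name, event shape, and use of kafkajs are assumptions for illustration, not details confirmed by Uber:

```typescript
import { Kafka } from 'kafkajs';

// Hypothetical broker address, client id, and topic name.
const kafka = new Kafka({ clientId: 'agent-state-service', brokers: ['kafka:9092'] });
const producer = kafka.producer();

// Publish an agent-state change so routing can decide which agents
// are free to accept new contacts (illustrative event shape).
async function publishAgentStateChange(agentId: string, state: 'AVAILABLE' | 'BUSY' | 'OFFLINE') {
  await producer.connect();
  await producer.send({
    topic: 'agent-state-events',
    messages: [
      { key: agentId, value: JSON.stringify({ agentId, state, updatedAt: Date.now() }) },
    ],
  });
  await producer.disconnect();
}
```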
At the heart of the solution is the GraphQL Subscription Service, which supports bidirectional communication using WebSockets. The UI integrates with the server using the graphql-ws library. The team implemented several reliability-oriented enhancements, including automatic reconnections to resume interrupted chat sessions and heartbeats between the agent’s browser and the chat service.
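The graphql-ws client exposes options that map to the reliability features described above. The endpoint, subscription fields, and retry values below are hypothetical; the sketch only illustrates how automatic reconnection and keep-alive pings can be configured on the client side:

```typescript
import { createClient } from 'graphql-ws';

// Hypothetical endpoint; the actual Uber schema and URL are not public.
const client = createClient({
  url: 'wss://agent-chat.example.com/graphql',
  // Retry dropped connections so an interrupted chat session can resume.
  retryAttempts: 10,
  shouldRetry: () => true,
  // Periodic pings act as a heartbeat between the agent's browser and the server.
  keepAlive: 10_000,
});

// Subscribe to new messages for the agent's active contact (illustrative fields).
const unsubscribe = client.subscribe(
  {
    query: `subscription OnMessage($contactId: ID!) {
      messageReceived(contactId: $contactId) { id sender text sentAt }
    }`,
    variables: { contactId: 'contact-123' },
  },
  {
    next: (event) => console.log('message event', event.data),
    error: (err) => console.error('subscription error', err),
    complete: () => console.log('subscription closed'),
  },
);
```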
The New Architecture (Source: Uber Engineering Blog)
Engineers conducted comprehensive load testing before the production rollout and validated that the new solution can handle around 10,000 connections on a single server and scale horizontally. The team also invested effort in enhancing observability artifacts to provide comprehensive insights into system performance and SLAs. Since the rollout of the new architecture, the live chat channel has handled 36% of the overall contact volume, with much more reliable message delivery (error rates reduced from 46% to 0.45%).
Recently, InfoQ reported that The Guardian is also using GraphQL in its newsroom collaboration and asset-sharing tool, Pinboard.