Key Takeaways
- Despite the progress made in realtime systems design, implementing services like chat still requires a lot of work.
- There are few databases that can push data to clients at scale, so realtime applications often have to split responsibilities for pushing updates and persisting changes to separate subsystems.
- Application logic can be vastly simplified if underlying services provide a consistent view of the state. Providing a strongly consistent chat API reduces the amount of integration work for our customers.
- Coordinating two or more systems requires careful design to avoid inconsistencies and race conditions from creeping into the application logic.
- Making clients responsible for coordination can hurt latency and increase development costs for applications. Often, putting that logic on the server-side can alleviate those problems.
Realtime chat has become a common feature of modern applications. These days it's not only messaging apps and social networks that let users talk to each other over the Internet - chat is crucial in healthcare, e-commerce, gaming and many other industries.
Because of this shift, many developers face the task of implementing a chat backend for their applications. As the Platform Lead and then the CTO at Pusher, I went through this journey with the team that built Chatkit - our realtime chat service.
From the interface point of view chat seems simple, but implementing a reliable backend to support it poses many interesting software design challenges. Our problem was even more demanding due to our scaling requirements: thousands of customers and millions of their end-users rely on our products and we expected similar adoption for our chat product.
With over 8,000 developers using our service, we’re happy to shed some light on what makes the system capable of handling high-volume chat applications.
Many other use cases pose similar challenges, so understanding how to build reliable and easy to maintain realtime applications can be a valuable skill for any software engineer.
Realtime is still a challenge
Since WebSockets became standardised by the IETF in 2011, building realtime web and mobile applications has become much less of a hassle. Over 90% of Web browsers support WebSockets, so fallbacks are not as critical as they were seven years ago. Encryption has also become much cheaper, making it easier to run WebSocket connections over TLS, which significantly improves client connectivity.
Despite great improvements on the client-side, implementing realtime business logic on the backend still requires substantial experience from software engineers. Modern realtime systems often need to be highly distributed, which means developers get challenged with consistency and availability trade-offs. Most of those trade-offs revolve around databases.
Given the abundance of database management systems, it can be surprising how hard it still is to find one that supports realtime updates. Relational databases are a great fit for many applications, but they can't push data to clients - at least not to millions of them. Non-relational options don't have a strong realtime proposition either. RethinkDB was the closest we've gotten so far to a general-purpose realtime storage solution, but it struggled to gain traction and became unmaintained.
Managed realtime storage services do exist, but they often require sacrifices in application design. They also become expensive at scale, as generic realtime synchronisation algorithms and data structures are hard to optimise.
Technically, it’s still possible to build a chat system with a database that doesn’t support pushing realtime updates.
The most popular approaches that don't require a realtime database are polling (client- or server-side) and signalling (clients receive notifications when they need to pull data). Unfortunately, although polling is conceptually simple, it becomes expensive at scale - handling all those requests for data takes plenty of CPU cycles, memory and bandwidth.
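To make that cost concrete, here is a minimal client-side polling sketch; the `/rooms/:id/messages` endpoint, the message shape and the two-second interval are assumptions for illustration, not part of any real API.

```typescript
// Minimal client-side polling sketch. The endpoint, message shape and poll
// interval are illustrative assumptions - the point is that every connected
// client repeats this request continuously, whether or not anything changed.
interface Message {
  id: number;
  text: string;
}

async function pollLoop(roomId: string, onMessage: (m: Message) => void) {
  let sinceId = 0;
  while (true) {
    const res = await fetch(`/rooms/${roomId}/messages?since=${sinceId}`);
    const messages: Message[] = await res.json();
    for (const m of messages) {
      onMessage(m);
      sinceId = Math.max(sinceId, m.id); // only ask for newer messages next time
    }
    await new Promise((resolve) => setTimeout(resolve, 2000)); // poll interval
  }
}
```

Multiply that loop by every open chat window and the request volume grows far faster than the actual message volume.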
Recombine, not reinvent
Because we needed our chat service to be push-based and there were no off-the-shelf databases that could push data to clients, we needed to develop a clever solution for storage.
We considered writing an in-house storage system that could reliably handle data for thousands of our customers, but we quickly realised how epic that task would be. Spending months on just building the foundations of the service didn’t sound appealing to anyone.
The key design decision we made was splitting the problem into two smaller tasks:
- First, we needed to find a system that could distribute messages to end users with realtime latency guarantees.
- Second, we needed to fill the persistence gap, but the storage solution would not be concerned with pushing updates.
For the realtime part, the choice was easy for us - we had been using Redis in production for years with great success. Redis is excellent for low latency updates, but due to its in-memory nature it’s not great for storage.
For the long-term storage, we chose PostgreSQL - another database well understood among our engineers. PostgreSQL is famous for its consistency guarantees and in our case it would fill the data durability gap left by Redis.
The data models we designed for Redis and Postgres lend themselves well to sharding, giving us room to scale the service horizontally in the future.
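To make the split concrete, here is a rough sketch of what the write path looks like under this kind of design; it uses the `pg` and `ioredis` npm packages, and the table, column and channel names are made up for illustration rather than taken from Chatkit.

```typescript
// Sketch of the write path: Postgres is the source of truth, Redis pub/sub
// handles the realtime fan-out. Schema and channel names are illustrative.
import { Pool } from "pg";
import Redis from "ioredis";

const pool = new Pool();   // connection settings come from the environment
const redis = new Redis(); // defaults to localhost:6379

async function sendMessage(roomId: string, senderId: string, text: string) {
  // 1. Persist first: a message that is stored but not broadcast can still be
  //    recovered later by a history query.
  const { rows } = await pool.query(
    `INSERT INTO messages (room_id, sender_id, text)
     VALUES ($1, $2, $3)
     RETURNING id, created_at`,
    [roomId, senderId, text]
  );

  // 2. Fan out to everyone currently subscribed to the room's Redis channel.
  await redis.publish(
    `room:${roomId}`,
    JSON.stringify({ id: rows[0].id, senderId, text, createdAt: rows[0].created_at })
  );
}
```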
Once we became confident we could manage both systems in isolation, integrating them into a unified chat backend became our next challenge.
Synchronisation
Combining two distributed systems always creates tricky synchronisation problems.
Because our service would talk to the databases over a network, we had to deal with the consequences of the CAP theorem, meaning that during network partitions, our system would have to either drop consistency or become unavailable to end users. We also thought about the PACELC theorem - an extension to CAP - which states that even in the absence of network partitions the system has to sacrifice either latency or consistency.
True to our API design philosophy, we decided to optimise for the convenience of our customers.
Reducing consistency guarantees would require much more effort from developers using our chat service, so we picked consistency over both availability (in the event of network partitions) and latency (when the network is fully functional). Even though availability and latency took a hit, we believed our customers would appreciate having to write less integration code and their users wouldn’t notice the difference in real-life scenarios.
To illustrate the trade-off, I will describe how applications obtain messages for chat rooms:
Whenever an end user opens a chat in their application, their chat client needs to fetch historical messages and subscribe to new ones. As implied in the previous section, in our design those two operations reach two independent systems: clients retrieve old messages from PostgreSQL and receive new messages in realtime from Redis. This arrangement makes the CAP and PACELC theorems manifest themselves in three ways: data loss, data duplication and incorrect ordering.
The potential for data loss arises from the ephemerality of Redis’ publish-subscribe system, as clients only receive messages published after establishing a subscription. If a client fetched historical data before establishing a Redis subscription, it wouldn’t be able to tell whether it missed any messages.
Subscribing to Redis before fetching old messages solves the first problem, but leaves the issues of data duplication and ordering open.
The problem with ordering is easy to spot - the client performs a history query to Postgres after establishing the Redis subscription, meaning the query can return messages that are hours or days old after brand-new messages have already started flowing through the subscription. This requires the client to buffer the realtime messages until the history query completes.
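A minimal sketch of that order of operations, assuming generic `subscribe` and `fetchHistory` functions rather than Chatkit's actual client API, could look like this:

```typescript
// Sketch of the client-side order of operations; `subscribe`, `fetchHistory`
// and `deliver` are illustrative stand-ins, not Chatkit's public API.
interface Message {
  id: number;
  text: string;
}

async function openRoom(
  subscribe: (onMessage: (m: Message) => void) => void,
  fetchHistory: () => Promise<Message[]>,
  deliver: (m: Message) => void
) {
  const buffer: Message[] = [];
  let historyLoaded = false;

  // 1. Subscribe first, so nothing published from this point on can be missed.
  subscribe((m) => {
    if (historyLoaded) {
      deliver(m);
    } else {
      buffer.push(m); // hold live messages until the history query completes
    }
  });

  // 2. Then fetch older messages; any overlap with the buffer is resolved by
  //    the merge/de-duplication step described further below.
  const history = await fetchHistory();
  history.forEach(deliver);

  // 3. Replay whatever arrived while the query was in flight.
  buffer.forEach(deliver);
  historyLoaded = true;
}
```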
Duplication is much harder to notice, as it requires several operations to happen within a short period of time.
After a client establishes a subscription to Redis, it will send a query for historical messages to Postgres. While that query is retrieving data, another user can post a message to the same chat room. With the right timing, Postgres could persist that message before the first client's history query finishes, so the new message appears in the query result. Hence, the first client could receive the message twice - once from Redis and once from Postgres.
To handle this scenario, Chatkit uses totally ordered message identifiers. When the history query completes and the list of messages buffered by the Redis subscription gets merged with the query result, the client de-duplicates by checking message identifiers.
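A sketch of that merge step, assuming numeric, totally ordered identifiers purely for illustration (the real identifier format is an implementation detail):

```typescript
// Sketch of merging history with buffered realtime messages and dropping
// duplicates, relying on totally ordered message ids.
interface Message {
  id: number;
  text: string;
}

function mergeAndDeduplicate(history: Message[], buffered: Message[]): Message[] {
  const seen = new Set<number>();
  return [...history, ...buffered]
    .sort((a, b) => a.id - b.id) // a total order on ids gives a total order on messages
    .filter((m) => {
      if (seen.has(m.id)) {
        return false; // drop the copy that arrived from both Postgres and Redis
      }
      seen.add(m.id);
      return true;
    });
}
```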
There are many implementation details that can make or break the above logic, but written correctly, the algorithm ensures that our chat clients provide a reliable stream of updates.
Moving to the server
Although implementing the above logic in web and mobile clients is technically possible, we decided to push the subscription logic to the server. Some reasons behind that decision would be common for any chat implementation, some are specific to our needs.
First, we had to support Chatkit on three platforms - Web, iOS and Android. Each of those platforms has its quirks, like different network APIs and concurrency primitives. Not only would we need to reimplement complex subscription logic three times, we'd also have to adapt it to each environment, making it more challenging for our engineers to maintain the libraries.
Second, the client-side implementation would be much worse in terms of latency, especially on slow networks. It would take two client-server round trips (one to the realtime API, one to the storage API) plus two server-side round trips (one from the realtime API to Redis, one from the storage API to Postgres). Moving the coordination logic to the server side lets the client retrieve historical data and subscribe to updates within one round trip, as sketched below. On mobile networks, this can save seconds from the application load time.
Third, by abstracting this logic with a server API, we leave the door open for many optimisations. There are many places where buffers and caches can improve latency and efficiency of both subscriptions and historical queries. Server-side logic is much easier to control and evolve, especially in our case, as we can’t force our customers to upgrade client-side libraries in their applications.
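As a rough illustration of that server-side coordination, here is a sketch using the `pg` and `ioredis` packages; the `send` callback stands in for whatever transport (WebSocket, server-sent events) delivers events to the client, and all names and queries are illustrative rather than Chatkit's actual implementation.

```typescript
// Server-side sketch: one client subscription triggers the Redis subscribe,
// the Postgres history query, the buffering and the replay on the server.
import { Pool } from "pg";
import Redis from "ioredis";

const pool = new Pool();

async function subscribeToRoom(roomId: string, send: (event: unknown) => void) {
  const subscriber = new Redis(); // pub/sub requires a dedicated connection
  const buffer: string[] = [];
  let replayed = false;

  // Subscribe first, exactly as in the client-side sketch, but on the server.
  await subscriber.subscribe(`room:${roomId}`);
  subscriber.on("message", (_channel, payload) => {
    if (replayed) {
      send(JSON.parse(payload));
    } else {
      buffer.push(payload); // hold live messages while the history query runs
    }
  });

  // One server-side history query replaces a second client-server round trip.
  const { rows } = await pool.query(
    `SELECT id, sender_id, text, created_at
     FROM messages
     WHERE room_id = $1
     ORDER BY id DESC
     LIMIT 50`,
    [roomId]
  );

  send({ history: rows.reverse() }); // oldest-first page of recent messages
  buffer.forEach((payload) => send(JSON.parse(payload)));
  replayed = true;
}
```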
About the Author
Pawel Ledwon is a software engineer with over 10 years of experience building distributed systems for fast growing startups. He approaches engineering leadership and management with a new set of practices and mental models derived from the theory of complex systems. Because of this different approach, Paweł has successfully grown and led Pusher's platform team over the past 5 years. You can find some of his thoughts on technical leadership on Hackernoon, The Mission and The Startup.