Transcript
Hello, everyone. I'm Bing, an engineer on the infrastructure team at Slack. Today, I will talk about scaling Slack. Before the talk, I want to do a quick poll. Raise your hand if you have used Slack before. Wow, I see maybe 95 percent of you have used Slack. That's fantastic.
So, for those who haven't used Slack, I will spend 30 seconds to introduce what Slack is. It is a messaging app, the conversation is organized by channels, you can send and receive messages, get notifications, send files, search for content, and use the tools and services that you already use inside of Slack. Today, there are 1,000 apps available inside Slack. Here are some examples of them.
Our Mission
Slack is where work happens. Our mission is to make people's working lives simpler, more pleasant, and more productive. We launched Slack about four years ago. At first, we only had small teams with a few dozen users; today, we have teams with hundreds of thousands of users. The design decisions we made back then made sense at the time, but they have since shown their limitations. I will talk about the challenges and problems we ran into, how we solved them, and where we are going next.
Here are some numbers about Slack's scale. We have over 6 million daily active users and over 9 million weekly active users, and people stay connected for over two hours on a typical working day. Over half of our daily active users are outside of the U.S.
Cartoon Architecture
This is a simplified version of Slack's architecture. Slack uses a typical stack: in the center, there is a web application that implements all of Slack's business logic. It is backed by MySQL and a job queue system that executes asynchronous jobs.
Another important component of the stack is the messaging server, which sends real-time events to users. It uses the WebSocket protocol, a full-duplex communication protocol over HTTP; it is bi-directional, so both the server and the client can push data to the other end. Our client, the web app, pushes data over the WebSocket to the messaging server.
The Challenge
The problem we had was that users experienced slowness connecting to Slack. The problem first showed up in 2015. To understand why that's a challenge, we will take a look at the login flow. Imagine that you are arriving at your office, you've had your first coffee, you open your laptop and launch Slack. What happens then is your client sends an HTTP POST to the server, the server validates the token and sends back a snapshot of your team. Here is a screenshot of the response. The top-level objects are the team, the list of channels, the list of private channels, your IMs, your group IMs, users, and bots.
On the right, this shows the response size for different numbers of users and channels on the team. You don't need to read too much into the exact numbers, but the takeaway is that the response size can become fairly big, and it grows with the size of the team. I will come back to this in a moment. In this response, there's also a WebSocket URL, which the client uses to establish the WebSocket connection with the messaging server; then the real-time events flow in and the client is ready to use. You can navigate channels, send messages, and start a conversation with your co-workers.
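To make the boot flow concrete, here is a minimal Go sketch of the sequence just described: POST a token, receive a team snapshot plus a WebSocket URL. The endpoint path, field names, and struct shape are illustrative assumptions, not Slack's actual API.

```go
// Sketch of the client boot flow: post a token, get back a team snapshot
// and the WebSocket URL used to reach the messaging server.
// All names here are hypothetical, not Slack's real API.
package boot

import (
	"encoding/json"
	"net/http"
	"net/url"
)

// TeamSnapshot mirrors the top-level objects in the boot response.
type TeamSnapshot struct {
	Team     json.RawMessage   `json:"team"`
	Channels []json.RawMessage `json:"channels"`
	IMs      []json.RawMessage `json:"ims"`
	Users    []json.RawMessage `json:"users"`
	Bots     []json.RawMessage `json:"bots"`
	WSURL    string            `json:"url"` // used to open the WebSocket
}

func BootClient(token string) (*TeamSnapshot, error) {
	// 1. POST the token; the server validates it and returns the team snapshot.
	resp, err := http.PostForm("https://slack.example.com/api/boot",
		url.Values{"token": {token}})
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var snap TeamSnapshot
	if err := json.NewDecoder(resp.Body).Decode(&snap); err != nil {
		return nil, err
	}
	// 2. The client would now dial snap.WSURL and consume real-time events
	//    to keep this snapshot up to date.
	return &snap, nil
}
```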
On that WebSocket connection, there are over 100 types of events; essentially anything that happens in real time inside Slack is sent as an event. Examples of such events include text messages, typing indicators, file uploads, file comments, user profile changes, user presence changes, reactions, pins, stars, app installations, etc. In short, the client first downloads a snapshot of the team, then it establishes a WebSocket connection and real-time events flow in on that connection, so the client eventually holds a consistent snapshot of your team. This design made a lot of sense in the early days. The Slack objects, users, channels, messages, are stored locally on the client side. This enabled us to support extremely rich user experiences, and it allowed us to move fast and experiment with new features and designs. It worked great for small teams. However, it is not so great for big teams.
Problems
There are a few problems. First, the team payload is too big: it takes too long for the server to assemble it, to transmit it over the internet, and for the client to process it. It contributes to a large client memory footprint; the client needs to process the payload and keep it up to date. And it is expensive to reconnect to Slack. Imagine you take a break in the middle of the day; when you come back 15 minutes later, the client has to sync up the events that happened during that time. There can also be a reconnect storm: if there's a networking blip, the users all reconnect at the same time, overloading the servers and causing cascading failures, and it takes time to fix.
Improvements
We were aware of this problem and made incremental improvements. First, we reduced the initial team payload in three different ways. The client stores the timestamps of the objects, so the server only needs to send objects that were updated since last time. We removed certain objects, like IMs that are not currently open. The idea is to establish the WebSocket connection as fast as possible; that data can then be loaded in parallel while the connection is being set up.
Third, we simplified certain objects. For example, we simplified the user image URLs into a hash code. On a 10,000-user team, this change alone saved us a few megabytes of data. We also changed how clients bootstrap. Instead of loading the whole world, the client can choose to load one channel first. This is especially useful on mobile.
Imagine this use case: you receive a push notification on your phone, you tap it, and all you want is to be able to reply to that message as fast as possible. So, instead of loading the whole world, the mobile client loads that one channel first, with the current message, a short message history, and the users that are mentioned in the history. This was first implemented on mobile, and then the web client adopted it. The web client also does a full load in parallel; this reduced user-perceived latency by more than 30 percent.
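As a rough illustration of the timestamp-based payload reduction described above, here is a minimal Go sketch; the request fields and filtering logic are assumptions for illustration, not Slack's actual protocol.

```go
// Sketch of timestamp-based delta loading: the client reports the newest
// update it has seen per collection, and the server returns only objects
// changed after that. Names are illustrative, not Slack's API.
package deltasync

type CachedObject struct {
	ID      string
	Updated int64 // unix timestamp of the last change the client has seen
}

// DeltaRequest is what the client sends on boot instead of asking for everything.
type DeltaRequest struct {
	UsersSince    int64 // newest user update the client already has
	ChannelsSince int64 // newest channel update the client already has
}

// BuildDelta is the server-side filter: return only objects that changed
// after the client's reported timestamps.
func BuildDelta(req DeltaRequest, users, channels []CachedObject) (outUsers, outChannels []CachedObject) {
	for _, u := range users {
		if u.Updated > req.UsersSince {
			outUsers = append(outUsers, u)
		}
	}
	for _, c := range channels {
		if c.Updated > req.ChannelsSince {
			outChannels = append(outChannels, c)
		}
	}
	return outUsers, outChannels
}
```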
A few other things that we did: we implemented rate limits. When servers are overloaded, rate limits kick in so we only allow a percentage of traffic to go through. We set up POPs, points of presence, to terminate SSL connections closer to the users. We implemented a load-testing framework: we created big teams and simulated different traffic patterns in a controlled way, so we can find the bottlenecks proactively.
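A percentage-based admission limiter of the kind described here is easy to sketch. This is a generic illustration with hypothetical names, not Slack's implementation.

```go
// Sketch of a percentage-based rate limiter: admit only a configurable
// fraction of incoming requests so overloaded servers can shed load.
package ratelimit

import (
	"math/rand"
	"sync/atomic"
)

type Limiter struct {
	allowPercent atomic.Int64 // 0..100
}

func New(percent int64) *Limiter {
	l := &Limiter{}
	l.allowPercent.Store(percent)
	return l
}

// SetPercent is adjusted by operators (or automation) when servers are overloaded.
func (l *Limiter) SetPercent(p int64) { l.allowPercent.Store(p) }

// Allow returns true for roughly allowPercent% of calls.
func (l *Limiter) Allow() bool {
	return rand.Int63n(100) < l.allowPercent.Load()
}
```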
Support New Product Feature
We are not only dealing with the scale of teams; we are also dealing with new product features. For example, in late 2015, we launched group IMs. This caused users to be in more channels, which made the initial payload bigger. We also launched reactions; this makes messages bigger, and the reactions are synced to the client. These changes also make the client more likely to dump its local storage, because it is easier to reload everything from scratch. Despite all of this, we managed to keep the situation under control with incremental improvements. However, we knew that was not enough. We would still run into problems if even bigger teams signed up with us, and we were at risk of outages when many clients dump their caches all at once.
Client Lazy Loading
It was time for an architectural change. The idea is simple: client lazy loading. Download less data up front, and download more on demand. This means a re-architecture of the client, because clients can no longer assume that all the data is available locally. Sometimes the data is fetched from remote servers; when it is fetched remotely, round-trip latency is introduced, but we still want the user experience to be seamless regardless of where the data is fetched from. So, we decided that we wanted a new service. This service answers client queries on demand. It is backed by a cache and deployed on the edge network, so it provides fast data access. We call this service Flannel. Why the name Flannel, you may ask? Any guesses? As we know, naming is difficult in computer science. When we were choosing the name, the engineer happened to be wearing a flannel shirt.
Flannel: Edge Cache Service
In short, Flannel is a query engine backed by a cache in edge locations. We examined the objects in the initial payload and moved the biggest ones out: users, then channel membership, then channels. We tweaked the login process. Now users establish a WebSocket connection to Flannel first. If the user is the first user on his or her team, then Flannel does not have the team data in cache; it loads the data from the DB, and then it establishes the WebSocket connection on behalf of the user. If the user is not the first user on the team, then Flannel has a cache, so it skips step two and establishes the WebSocket connection for the user directly.
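Here is a minimal sketch of that connect path in Go, with hypothetical names: the first user on a team triggers a DB load, while later users hit the warm cache and go straight to the WebSocket.

```go
// Sketch of Flannel's connect path: load team data from the DB only for the
// first connecting user of a team, reuse the cache for everyone after.
package flannel

import "sync"

// TeamData is whatever Flannel caches for a team: users, channels, membership.
type TeamData struct {
	Users    map[string]string
	Channels map[string]string
}

type Cache struct {
	mu         sync.Mutex
	teams      map[string]*TeamData
	loadFromDB func(teamID string) (*TeamData, error) // called only on cold start
}

// Connect skips the DB load when the team is already cached; otherwise it
// fetches the data before proxying the WebSocket on behalf of the user.
func (c *Cache) Connect(teamID string) (*TeamData, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if data, ok := c.teams[teamID]; ok {
		return data, nil // warm cache: go straight to the WebSocket
	}
	data, err := c.loadFromDB(teamID) // first user on the team: cold start
	if err != nil {
		return nil, err
	}
	if c.teams == nil {
		c.teams = make(map[string]*TeamData)
	}
	c.teams[teamID] = data
	return data, nil
}
```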
Flannel is essentially a man in the middle. It sits on the WebSocket connection between the client and the messaging server. On that WebSocket connection, as I mentioned earlier, there are over 100 different types of events. Flannel passes all of them through to the client, and it uses a handful of them to update its cache. Those events are user- and channel-related: for example, channel creation, user creation, a user changing their profile, or a user joining or leaving a channel.
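The man-in-the-middle behavior can be sketched roughly as follows: forward every event to the client, and apply only the user- and channel-related ones to the cache. The event type names here are assumptions, not Slack's actual wire protocol.

```go
// Sketch of Flannel's event proxying: pass everything through to the client,
// apply only user/channel events to the edge cache.
package flannel

// Event is one real-time event flowing from the messaging server to a client.
type Event struct {
	Type    string
	Payload []byte
}

// cacheUpdater is whatever applies user/channel events to the edge cache.
type cacheUpdater interface {
	Apply(ev Event)
}

// Proxy forwards every event downstream; the type names in the switch are
// illustrative stand-ins for the handful of cache-relevant events.
func Proxy(in <-chan Event, toClient chan<- Event, cache cacheUpdater) {
	for ev := range in {
		switch ev.Type {
		case "user_created", "user_changed",
			"channel_created", "channel_joined", "channel_left":
			cache.Apply(ev) // keep the edge cache up to date
		}
		toClient <- ev // pass through everything to the client
	}
}
```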
Flannel is deployed to seven different locations. The red dot is the main region, US-East. It has the full Slack stack: the web application, DB, messaging server, and Flannel. Flannel is deployed to six other remote regions, and only Flannel runs in those remote regions. We use AWS and Google Cloud for the remote regions. Here are some UI examples that are powered by Flannel. The quick switcher is a way to navigate to channels and users: when you type, it gives you a list of channel and user suggestions, and when you select one, you navigate to it directly. If you have not used it, I recommend it. It is Ctrl+K on Mac and Windows.
Mention suggestions: in the message box, when you start to type @ something, it gives you a list of suggestions and tells you whether each user is in the current channel or not. The channel header: if you are currently in the #general channel, it shows basic information about the channel, such as the member count and the channel topic. The channel sidebar tells you more about the channel. The team directory shows everyone inside the team and lets you search it. In short, anything related to users, channels, channel membership, and autocomplete is powered by Flannel.
Flannel Results
Flannel was launched early this year. There was a test team with 200,000 users; we were not able to load this team without Flannel, and now we are able to load it. Flannel serves over 5 million simultaneous connections at peak and over a million queries per second.
Here is a side-by-side comparison of the team directory loading with and without Flannel. On the left-hand side it has already loaded, and on the right-hand side it is still spinning; it takes several more seconds.
Web Client Iterations
That is not all of it; Flannel has kept evolving. The design of Flannel, client lazy loading, requires the client to load data asynchronously on demand. However, before Flannel, the web client was implemented with the assumption that data is always available locally. So, right before it is about to render something like a user pop-up, it reads from local storage. It was impossible for us to change every place in the client code to support lazy loading all at once. So what Flannel did was implement just-in-time annotation.
Let me explain with an example. Suppose you send a message and you mention Bob in it, but Bob is not in the Slack client's cache, so the client has no way to render the message at all. Flannel sits on the WebSocket connection; it sees the chat message and uses a heuristic to know that the client does not have Bob. So it generates a synthetic user-change event about Bob. The client still uses the old code: it sees the user-change event, it updates its local store with Bob, and then it is able to render the message without problems. In short, just-in-time annotation met our needs without changing 1,000 places in the client code. Today, our client has evolved; it is able to load data asynchronously everywhere, so we no longer need just-in-time annotation, but it was instrumental in letting us roll out Flannel while the web client made incremental changes to support it.
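A rough sketch of the just-in-time annotation idea, with hypothetical names: when a message mentions a user the client presumably does not know about, Flannel emits a synthetic user-change event before delivering the message.

```go
// Sketch of just-in-time annotation: synthesize user events for mentioned
// users the client hasn't been sent yet. Names and heuristics are illustrative.
package flannel

type Message struct {
	Text       string
	MentionIDs []string // user IDs mentioned in the message
}

type UserEvent struct {
	Type string
	User string // user ID whose full object would ride along
}

// Annotate returns the synthetic user_change events that must be delivered
// *before* the message, so old client code can render it from local data.
func Annotate(msg Message, knownToClient map[string]bool) []UserEvent {
	var synthetic []UserEvent
	for _, id := range msg.MentionIDs {
		if !knownToClient[id] {
			synthetic = append(synthetic, UserEvent{Type: "user_change", User: id})
			knownToClient[id] = true // remember this client now knows the user
		}
	}
	return synthetic
}
```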
Old Way of Cache Updates
Let's take a closer look. Why does Flannel sit on the WebSocket connection? It needs to receive real-time events to update its cache. However, there's a downside. Imagine this use case: today, you are at QCon, and you only check Slack occasionally after talks. You want your co-workers to know that, so they can adjust their assumptions about how quickly you will respond to messages. So, you change your Slack status to "at a conference". This status change is broadcast to all your co-workers. If you are on a 1,000-user team, the messaging server sends out 1,000 such events, and Flannel receives 1,000 copies of the same event. This is very inefficient.
Publish/Subscribe (Pub/Sub) to Update Cache
What we did is implement publish/subscribe (pub/sub): Flannel subscribes to the lists of users and channels it cares about with the messaging server, and gets updates for those. I will not go into the implementation details, but it provides several benefits. It saves Flannel CPU; we moved from JSON to Thrift events, so the schema is easier to manage; and it gives us flexibility for cache management.
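A rough sketch of what such a subscription might look like from Flannel's side; the broker interface and topic naming are assumptions for illustration.

```go
// Sketch of Flannel subscribing to the users and channels of a team it
// caches, instead of receiving per-user fan-out copies of every event.
package flannel

type broker interface {
	Subscribe(topic string) error
}

// SubscribeTeam registers interest in exactly the objects Flannel caches for
// a team, so each update arrives once rather than once per connected user.
func SubscribeTeam(b broker, userIDs, channelIDs []string) error {
	for _, id := range userIDs {
		if err := b.Subscribe("user." + id); err != nil {
			return err
		}
	}
	for _, id := range channelIDs {
		if err := b.Subscribe("channel." + id); err != nil {
			return err
		}
	}
	return nil
}
```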
Flexibility for Cache Management
Previously, Flannel would load the team data when the first user connected, and unload it when the last user disconnected. It had to do it this way because that was the only way to keep the cache up to date: when the last user disconnects, the WebSocket connection is torn down, and Flannel's cache would go stale pretty quickly.
Today, with pub/sub, we decouple which events Flannel receives from who is connected to Flannel, so it can keep the cache around longer. When the next user connects, Flannel will have a warm cache, which helps with user connect time. Pre-warmed caches are currently under development.
Next Step
Let's take another, closer look. With pub/sub, does Flannel still need to be on the WebSocket path? No, we can move Flannel out of the path. There is no need to couple a query engine with the real-time event path. Once Flannel is moved out of the WebSocket path, it provides better isolation in our architecture. This is the next thing we are going to work on.
Evolution with Product Requirements
So far, I have been talking about the evolution of Flannel driven by performance requirements. There are also product requirements. A big one is Enterprise Grid. Grid is Slack's solution for big enterprises. Inside a grid, you can create an unlimited number of federated teams, by department, office location, or anything you define. Each grid team provides the same experience that many of you are familiar with. Across teams, you can create shared channels, shared across a few teams or the entire grid. At the grid level, it provides management, privacy, and security controls. When Flannel was first implemented, we didn't have Grid in mind.
Before Grid
One important design decision in Flannel is team affinity. This means that users from the same team go to the same Flannel host if they are in the same geographic location. This is to guarantee cache efficiency. With shared channels, many different teams share the same objects, which made Flannel cache duplicate copies of the same objects repeatedly.
Now
It was clear we needed to do better. We introduced a grid-aware cache, which means that, on a single Flannel host, we store shared objects in one separate key space, which is referenced by multiple teams.
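A minimal sketch of the grid-aware layout, with hypothetical names: shared objects are stored once in a shared key space, and each team keeps only references to them.

```go
// Sketch of a grid-aware cache: objects shared across teams are stored once
// in a shared key space; each team keeps only references. Names illustrative.
package flannel

type GridCache struct {
	shared map[string][]byte   // shared key space: objectID -> object, stored once
	teams  map[string][]string // teamID -> IDs of shared objects it references
}

// PutShared stores a shared object once and records that teamID references it,
// instead of duplicating the object under every team.
func (g *GridCache) PutShared(teamID, objectID string, obj []byte) {
	if g.shared == nil {
		g.shared = make(map[string][]byte)
		g.teams = make(map[string][]string)
	}
	if _, ok := g.shared[objectID]; !ok {
		g.shared[objectID] = obj
	}
	g.teams[teamID] = append(g.teams[teamID], objectID)
}

// GetForTeam resolves a team's references back into objects.
func (g *GridCache) GetForTeam(teamID string) [][]byte {
	var out [][]byte
	for _, id := range g.teams[teamID] {
		out = append(out, g.shared[id])
	}
	return out
}
```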
Grid Awareness Improvements
With this change, we saved a lot of Flannel memory: on average, 22 GB per host, and 1.1 TB across the fleet. For our biggest customer, it also improved their DB shard CPU usage; the CPU idle increased from 25 percent to 90 percent. That is because shared objects are more likely to be cached, so Flannel needs to go to the DB to fetch data fewer times. For the exact same reason, it improved user connect latency: P99 latency dropped from 40 seconds to 4 seconds.
Future
In the future, we plan to implement scatter and gather: when a request comes in, one Flannel host fetches data from different key spaces, which live on different Flannel hosts, and combines the results. We think this will further improve our cache efficiency. So far, I have been talking about the evolution of Flannel. Let's switch gears and look at client pub/sub. This is an interesting feature for performance.
Expand Pub/Sub to Client Side
If you still remember the example I brought up earlier: you are at QCon, you change your Slack status to "at a conference", and this is broadcast to all your co-workers. In fact, this is unnecessary. Most likely, you talk to a few dozen co-workers most of the time, so only they need to know about your status change. Instead of broadcasting everything to everyone, client-side pub/sub lets clients subscribe to the list of users and channels that they are interested in, and only get updates for those. Reducing the number of events clients have to process greatly improves client performance.
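A sketch of how a client might manage such subscriptions, assuming a hypothetical subscribe/unsubscribe message pair: it diffs the set of users and channels currently in view against its current subscriptions and sends only the changes.

```go
// Sketch of client-side pub/sub: the client subscribes only to the users and
// channels it currently cares about. The wire messages are assumptions.
package client

type Subscriptions struct {
	current map[string]bool          // IDs we are currently subscribed to
	send    func(msgType, id string) // sends subscribe/unsubscribe over the WebSocket
}

// Update diffs the desired set against the current one and sends only the
// changes, so the server stops broadcasting everything to everyone.
func (s *Subscriptions) Update(desired []string) {
	next := make(map[string]bool, len(desired))
	for _, id := range desired {
		next[id] = true
		if !s.current[id] {
			s.send("subscribe", id)
		}
	}
	for id := range s.current {
		if !next[id] {
			s.send("unsubscribe", id)
		}
	}
	s.current = next
}
```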
Presence Events
The first thing we moved to client pub/sub was presence updates. This is the feature where, when a user goes online or offline, we render it on their avatar. Presence events used to be 60 percent of all events by volume, due to their volatile nature. On a 1,000-user team, if each member updates their presence just once, it generates a million events, and it grows quadratically with team size. Now the client keeps a list of users in the current view and only subscribes to their presence; when the user changes their view, the client changes its subscription list. This reduced presence events by a factor of five across the fleet, and we are moving more events into this model. Lastly, let's take a quick tour of the messaging server.
Messaging Server
What is the messaging server? You can think of it as a big router: it routes events to users. When an event happens on the team, a message is created, a reaction is added, or a file is uploaded, the messaging server fans that event out to the list of users who are supposed to receive it. The messaging server maintains metadata so that it knows how to route the traffic: what channels are on the team and who is connected to which channels.
Limitations
It was initially sharded by team, meaning that one host was responsible for your team, and all the events for that team went through that server. That is a single point of failure: when that server goes down, you cannot connect to Slack.
We also have shared channels now. Shared channels mean that there is a lot of state shared among different teams, and we need to copy messages around. Imagine a shared channel that is shared among a few thousand teams; with team sharding, this makes state management practically impossible. It was clear that we needed a plan B.
Topic Sharding
We changed from team sharding to topic sharding. What is a topic in this context? A topic is a group of affiliated objects that you define; so a topic can be a user, a specific group of users, a channel (shared or unshared), a DM, a group DM, a team, or a grid. Events are sent to topics. Topic sharding is a natural fit for shared channels: events from a shared channel are sent to that channel's topic and delivered to all channel members, regardless of which team each user is on. It also reduced user-perceived failures. Now, when a server goes down, it means that a list of topics is unavailable; very likely, you can still connect to Slack, but you may not be able to access all of your channels.
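A minimal sketch of topic sharding: events are routed by a hash of the topic rather than by team, so losing one shard only makes a subset of topics unavailable. The hashing scheme and types are illustrative, not Slack's implementation.

```go
// Sketch of topic sharding: route events by the topic they belong to
// (a channel, a user, a DM, a team, ...) instead of by team.
package msgserver

import "hash/fnv"

type Topic struct {
	Kind string // "channel", "user", "dm", "team", ...
	ID   string
}

// ShardFor picks which messaging-server shard owns a topic.
func ShardFor(t Topic, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(t.Kind + ":" + t.ID))
	return h.Sum32() % numShards
}

// Route sends an event for a shared channel to the channel's topic shard;
// all members receive it from there, regardless of which team they are on.
func Route(t Topic, event []byte, shards []chan []byte) {
	shards[ShardFor(t, uint32(len(shards)))] <- event
}
```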
Other Improvements
With the server re-architecture, we improved failure recovery. Previously, when a server went down, it required human intervention: someone had to come in, run a script, and migrate the affected teams to a healthy server. Today, this is automatic, which reduced the failure recovery time from minutes to seconds.
The messaging server supports pub/sub, and both Flannel and clients subscribe to the lists of topics they are interested in and get updates for those. This improves Flannel and client performance. Also, the messaging server previously ran only in the main DC; today, we have moved part of its functionality to the edge. This reduced the networking bandwidth between the main region and the remote regions.
Our Journey
This has been a two-year journey. We had a problem, we made a bunch of incremental changes to keep the situation under control, and then we made a big architectural change. This architectural effort spanned different teams: mobile, PM, and product engineering.
Our first design of this architecture was not the most optimal, and we knew that. But we decided to do it that way because it was the faster and easier way to ship the change, and then we took the time to evolve the architecture. I think this is the biggest reason we succeeded in such a big effort.
Journey Ahead
And we have a long way ahead. Currently, we are working on pre-warmed Flannel caches, moving Flannel out of the WebSocket path, and expanding pub/sub. Our team is also responsible for the job queue and the storage layer. If you are interested, get in touch. Thank you.
Now, I will take questions.
About your messaging server: why did you implement pub/sub yourself, rather than using middleware such as RabbitMQ?
We discussed that within our team. The thing is, we already had a system there where we had implemented most of the logic. When we did the re-architecture, we felt we didn't need to spend the effort to bring in something completely new; we could just tailor our existing system to support all those new features.
So, if the client is doing a lot of lazy-loading, as opposed to those big payloads coming across the wire, how are you protecting yourself against, you know, a death by reconnect? So if your service goes down and you have these clients across the world trying to re-connect to your service, how are you protecting yourself against, sort of, a thundering herd problem?
Honestly, it depends on how many reconnects we get; we can only support up to a certain point. For example, two weeks ago, we actually had an incident where the whole service was down because of a thundering herd problem. We lost half of our WebSocket connections, and all of them reconnected to Slack at once. We don't handle that well enough today. What happened two weeks ago is that the AWS ELB went down because it could not scale fast enough, and we had to set up a new ELB cluster and scale it up over time. That's how we recovered and worked through the traffic.