
How We Created a High-Scale Notification System at Duolingo


Summary

Vitor Pellegrino and Zhen Zhou discuss how they built and tested an on-demand notification system, what it takes to manage resources and site reliability at the same time, and how to mitigate reliability issues.

Bio

Vitor Pellegrino works in Site Reliability Engineering @Duolingo. Zhen Zhou is a Software Engineer @Duolingo, previously a theoretical computer science enthusiast @CMU.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Pellegrino: I lead different parts of the platform organization. We're going to talk about something that we had to do specifically for one event, which happens to be the Super Bowl. Who here knows the Super Bowl, or vaguely knows what it is? Who here uses Duolingo? Duolingo is the largest language learning app, delivering high quality content to most of our learners, who actually use us for free. We hope to develop the best education app and make that available to anybody. This is what we stand for. Another important thing, especially for this presentation, is our social presence. Duolingo very much creates new trends. There are a lot of things that our mascot does.

More specifically, Duo, our mascot, this green owl, is very well known for going to great lengths to make sure that people do not miss their daily practice. A little bit more about this. Does anybody want to take a guess at what this number is about? This is actually the number of people that learn with us every single day. A little bit under 27 million come back for their daily practice. This one here? That's the Super Bowl. That's actually the number of people that watched the last Super Bowl, which makes it the largest number of people watching the same broadcast in the history of television.

To put that into perspective, that's roughly the population of the United Kingdom, Spain, and Portugal combined. These are the people that are just there watching the Super Bowl. For most of you, that might be a little bit of old news, but it is an American football event; it is the grand final of it. Not only that, people actually get together, and that is almost a cornerstone of American culture. People get together. They watch it. Think about any big event that you can think of. A very important thing throughout this Super Bowl event is actually the ads.

Duolingo, and Duo specifically, because he goes to great lengths to make sure that people do not miss practice, made this just the perfect event for us: to make sure that they would not miss practice regardless of whether they were actually watching the Super Bowl match with friends or whatnot. The ads, as I said, were a very important part of it. What does a Super Bowl ad look like? As you can clearly see here, it's very high production value. There is humor. There are things exploding. There are big names and a very high budget to spend. We wanted to do our own version of it, and that's what we went with. Yes, do your Duolingo. Do not forget: you're having fun, but make sure to do your practice. That's exactly what we went for, all 5 seconds of glory. That's all the time that we had to make sure that 4 million people actually got a notification on their mobile phones. This is pretty much to set the stage and explain the work that we did.

This talk is going to be divided into two parts, the first being the general architecture: what was actually in place that allowed us to try something ridiculous like that. In the second part, Zhen is going to lead the deep dive into the software and the architecture that we had to put in place to make sure that this would work.

General Architecture

First things first: what does the general architecture of Duolingo look like? We've been around for a little bit over a decade, and like many companies that started then, we did everything right from the get-go. Of course not. We actually started as a big monolithic application. Over time, we broke that into what is now hundreds of microservices. Most of them are written in a few languages; we're going to talk a little bit more about that. At the scale we're talking about, to make sure that roughly 26.9 million people do not miss their daily practice, some of them have to sustain a load of tens of millions of requests per minute.

Some of them handle, in the single digits, millions of requests per second. That's the scale we're talking about. Our technology stack, from our beginnings as a Python 2 application, has evolved into mostly Python 3 for most of the services. We have a fair chunk of new things being built on the JVM: some Scala, a little bit, and Kotlin is actually where most of the new things are going. A lot of that is done using Terraform. Also, very consistently, most of these services are relatively simple. They're usually your backends with a database and some caches.

Most of them use RDS, which is Amazon's managed database service. Some of that was MySQL, some was Postgres. We use a whole lot of DynamoDB, which is their serverless database, more of a key-value store; we're one of the largest users of it. Virtually everything runs on Amazon Elastic Container Service, ECS. There is one thing you might be wondering: why ECS? Why not Kubernetes? A lot of people are using Kubernetes these days. We're thinking about that too. First and foremost, the reduced complexity was very important for us in the beginning.

When we started, some of the things were not exactly there in Kubernetes. A lot of this drive to microservices happened around 2018. The container wars were not fully settled then. Also very important were the AWS-native constructs. The vast majority of things that we do run on AWS, and it was important for us to have that native integration. Also, the size of the platform team: I've been with the company for a little bit over a year, but I know that around that time we only had a couple of people working on that, to support the whole company. This was very important for us. That's the reason why we did not pick Kubernetes.

As for how we create those microservices: internally, you're going to see the term Galaxy App used very liberally, which means a set of applications that have been created using the same set of best practices and a set of reusable modules. It's a very light-touch, small approach towards microservices. Mostly what they define is a shared communication paradigm. A lot of them use gRPC; there is some OpenAPI.

Then we make them easier to develop by also shipping prebaked CI/CD pipelines for these apps, and the shareable Terraform modules. That is our platform right now, the platform that supports hundreds of microservices currently. This is how you create one of these new Galaxy Apps. This is OpsLevel, which is an internal developer platform. I know in another talk we're going to have something on Backstage; those are equivalent, this just happens to be the one that we use. If you want to create a new service, that's how you go: you go there, click a button, pick one of the four templates that we provide, and then you get going. Clusters, databases, all of that is not abstracted away; however, it's all predefined for you, so you don't need to pick your desired sizes and such.

Resiliency Tooling

Talking about the Galaxy Apps, now I'm going to talk about a few tools that are very important for us to achieve the kind of resiliency that we talked about, and to maintain some sanity while we scale the company to these numbers. The first one is what we internally call Zombie mode. Zombie mode is a way we have to make sure that whenever we have problems in the backend, they do not affect the learner's experience. Let's say we're completely down for a while; you can still do your lessons while we're recovering on the backend. How does that work? Here's the overall architecture. It's a fairly simple solution, and it pretty much looks like this.

We have a script runner, which is mostly a lambda function; it can be triggered automatically, or somebody can invoke it directly. This lambda function does three things. First, it writes to a known file that tells the client it should run in a degraded mode. The client is the device that you have in your pocket; it already has some lessons prebaked in cache. If there are any problems, it has the ability to keep running and not disrupt the learner's ability to continue. That's also broken down per platform.

Let's say web is having issues; we are able to toggle that separately from Android or iOS. We also keep track of who has been seeing that, how many users have been impacted by this degraded mode. We also notify operations channels. More often than not, we actually know that something is going bad after we have already applied Zombie mode to the specific app. Here's an example, so you can see the timeline. We had an incident affecting most of the Android users. At 2:25 p.m., we detected that the rate of users completing the sessions they created had dropped below 70%; that's our key North Star to make sure that people are able to continue using the platform.

Once we detect that this is below what we deem acceptable, we declare an incident. You can see, yours truly was chosen as the incident commander for this one. We also put 80% of the users in this degraded mode. All of that happens in the span of one minute. That's Zombie mode.
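As a rough sketch of what such a toggle could look like — the bucket, metric namespace, and topic names here are hypothetical, not Duolingo's actual ones — a minimal Lambda handler covering the three steps described above might be:

```python
import json
import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")

def handler(event, context):
    platform = event["platform"]            # "web", "android", or "ios"
    fraction = event.get("fraction", 1.0)   # share of users to degrade

    # 1. Write the known flag file that clients poll, broken down per platform.
    s3.put_object(
        Bucket="client-feature-flags",      # hypothetical bucket name
        Key=f"zombie/{platform}.json",
        Body=json.dumps({"degraded": True, "fraction": fraction}).encode(),
    )

    # 2. Record that users are being put into degraded mode, so the impact
    #    can be tracked (and streaks repaired later).
    cloudwatch.put_metric_data(
        Namespace="ZombieMode",             # hypothetical namespace
        MetricData=[{
            "MetricName": "DegradedModeFraction",
            "Dimensions": [{"Name": "Platform", "Value": platform}],
            "Value": fraction,
        }],
    )

    # 3. Notify the operations channel.
    sns.publish(
        TopicArn=event["ops_topic_arn"],    # hypothetical SNS topic
        Message=f"Zombie mode enabled for {platform} ({fraction:.0%} of users)",
    )
    return {"ok": True}
```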

Now the users are able to continue. As part of handling an incident, you want to make sure that you understand what impact people are actually seeing. Here comes Jeeves. Jeeves is an internal tool that is very important for keeping track of what's happening around us. Think about what people are saying about us on Reddit, our support channels, our Twitter; all of the places we can gather information from, we fetch with this tool. You can see that it also understands different languages. I do not speak Hungarian, and I don't know what this first one means, but it's something that we were able to detect was happening consistently enough.

Then we also use some AI models, based on what people are talking about, to infer what the problem might be. Unfortunately, I cannot give you specifics, but if this slide were not blurred, you would probably see something like this: users are reporting problems, and they were complaining on social media that they cannot see something about pillows. Support people told me this has been confirmed, not confirmed, ignored, and so on. That feeds into other systems that we have, such as this one, which is a way we can generate time-series visualizations of those reports and mentions. This, for me, when I joined the company, was actually very useful, especially when you are in an incident.

If you've ever been in a situation where you're asking your support people, how many tickets are we getting for this? Is this actually affecting real users? As an incident commander, I can go there any time and look at how many tickets have been opened since a given time, also sliced by things like which version of the app, and whatnot. This is also showing the same events; in this case it was just crashes. Those are Jeeves' main use cases. How does it work? This is the actual architecture. As I said, it gets reviews from the app and reports from customer support, and it creates Jira tickets for us.

Also, internally, we have this concept of dogfooding, which means we actually try our own stuff. One important thing is what we call shake to report. You can see the bars for the movements as you're shaking it. As a Duolingo employee, if you see something, you can shake your phone and use that to report that something's going on. We use that extensively for testing, making sure the quality of the content we're producing is what we expect. All of that flows into the same Jeeves flow. That is pretty much the other side of it. We create those internal formats that people can query, which can be plugged into dashboards. We also produce reports for every single team, and team leads have an overall view of the quality of their area.

Lastly, Freeze Gun. Remember when I said about Zombie mode that we keep track of users that are seeing this degraded mode? This is also very important, because we use this tool called Freeze Gun. If you use Duolingo for a while, you might be very proud of your streak. I'm going strong at 385 days without missing a lesson. The last thing we want to do is make people lose their progress because we had some incident. We make sure that we're preserving their streak if they have seen one of these degraded modes. If you want to know more about this, there's a blog post we did about this culture of dogfooding, https://blog.duolingo.com/dogfooding-app/. It talks a little bit more about Jeeves. That's it.
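The talk doesn't describe Freeze Gun's internals, but as a hedged sketch of the idea — with a hypothetical DynamoDB table and field names — the repair logic could look something like this: any learner recorded as having seen degraded mode gets the incident day bridged in their streak.

```python
from datetime import date, timedelta
import boto3

dynamodb = boto3.resource("dynamodb")
streaks = dynamodb.Table("user_streaks")        # hypothetical table name

def freeze_streaks(impacted_user_ids, incident_day: date):
    """Repair streaks for users who saw Zombie mode on incident_day."""
    for uid in impacted_user_ids:
        item = streaks.get_item(Key={"user_id": uid}).get("Item")
        if not item:
            continue
        last = date.fromisoformat(item["last_practice_date"])
        # If the only gap is the incident day itself, extend the streak
        # over it so the learner doesn't lose their progress.
        if last == incident_day - timedelta(days=1):
            streaks.update_item(
                Key={"user_id": uid},
                UpdateExpression="SET last_practice_date = :d",
                ExpressionAttributeValues={":d": incident_day.isoformat()},
            )
```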

Deep Dive: Superb Owl

Everything was fine for a while. Things were chugging along just nicely. Then one day comes Vitor, the marketing manager. Yes, I had this idea, Zhen, I don't know what to think about this. It should be easy for you guys in engineering. We're thinking, there's the Super Bowl thing, and people say we do crazy stuff. Let's go beyond TikTok. Let's just go crazy, do an ad in the Super Bowl. So far, so good. It really needs to be something a little bit crazier this time. I'm not sure if you've seen, people are talking about Duo doing some weird stuff. That was pretty much our idea. There was one thing, very importantly: can you make it so that as people are seeing this, they get a notification at the same time?

Zhou: It's hard.

Pellegrino: What can possibly go wrong?

Zhou: What could possibly go wrong? The second part of the talk is a deep dive into the Superb Owl service. We designed a service called Superb Owl to send the Super Bowl notification. As you may notice, we reordered the letters a little bit. That is a pure coincidence. We've already talked about where the impossible task came from. Basically, our marketing manager, Vitor, came up to me one day and said, we have to do this. I told him, I don't think that's possible. He said he already paid for it. That's a different story. Sending 4 million notifications in 5 seconds is a huge challenge, and at Duolingo, we have an operating principle to embrace challenges.

To summarize the challenges that we faced in this project, we put them into three bullet points: speed, scale, and timing. I'll go into details. The first challenge is speed. If you're quick with math, 4 million notifications in 5 seconds is roughly 800,000 notifications per second. London has a population of 8.9 million people, last time I checked. By the time I finish this slide, basically everyone in London gets a notification. That's how fast it goes. The second challenge is scale. We built a lot of our services on AWS, but unfortunately, we have to pay for it. It's not free.

I'm advised by Cassandra not to talk about cost, because we are already paying for a Super Bowl ad. That is part of the concern. The other part of the concern is: how do we actually scale up the service fast enough so that we can serve all these notifications really fast? How do we make sure that we scale up our backend systems enough so that they can endure the incoming traffic? These are all problems that we had to face. The third challenge is, of course, timing. The Super Bowl, just like any other sports event, is unpredictable. Just like in the World Cup, where you have added time and you don't know how long it will be, the same goes for the Super Bowl: there are timeouts.

People get injured all the time. We bought our ad, but the problem is we don't know exactly when it'll air, and it would be totally different if we sent the notification at some random time rather than right at the moment the ad shows on TV. In this case, we had to build a system so that a marketer can ideally press a button and the notification shows up immediately on millions of users' phone screens, right after the ad airs. That is the third challenge.

To summarize all these challenges into technical requirements, we have three. The first one is to send notifications at the target rate. Very straightforward. The second one is to get all the cloud resources we need just in time. The third is to ensure resiliency and idempotency. Idempotency basically means that doing something multiple times has the same effect as doing it once. In this case, you could imagine that multiple marketing managers will be watching the ad, and they'll want to make sure that the system works. They'll want to press that button multiple times.

As human beings, we all want to make sure that that very important thing happens. You don't necessarily want multiple notifications to pop up on a user's screen, especially when they are the same notification. That's just spamming them. It's important to make sure only one notification ends up on the user's screen. What is our solution to this problem? We built an asynchronous system. Now you might be thinking we're crazy: this is such a high-throughput system, and you're building it asynchronously? Here is the system diagram that we have. I'll walk you through some of the use cases.

Imagine you're a marketing manager and you came up with an idea. You know which users you want to send notifications to, so you have a list of user IDs, but you're a few months in advance, or maybe a few days. What do you do? You want to create a marketing campaign, a push notification campaign, through this system. To start with, an engineer will create a campaign with a list of user IDs. The server will acknowledge the request and then asynchronously start fetching data from DynamoDB. Then, after it fetches the data, it puts the data in S3 and logs the results to CloudWatch. Simple as that.
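A condensed sketch of that first use case — table, bucket, and field names are hypothetical, and a real implementation would use DynamoDB's batch_get_item rather than per-user reads — the API acknowledges immediately, and the audience is materialized into S3 in the background:

```python
import json
import threading
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def create_campaign(campaign_id: str, user_ids: list[str]) -> dict:
    # Acknowledge right away; the heavy lifting happens asynchronously.
    threading.Thread(target=_materialize_audience,
                     args=(campaign_id, user_ids), daemon=True).start()
    return {"status": "accepted", "campaign_id": campaign_id}

def _materialize_audience(campaign_id: str, user_ids: list[str]) -> None:
    users = dynamodb.Table("users")             # hypothetical table name
    records = []
    for uid in user_ids:                        # batch_get_item in practice
        item = users.get_item(Key={"user_id": uid}).get("Item")
        if item and "device_token" in item:
            records.append({"user_id": uid,
                            "platform": item["platform"],
                            "device_token": item["device_token"]})
    s3.put_object(Bucket="superb-owl-campaigns",  # hypothetical bucket
                  Key=f"{campaign_id}/audience.json",
                  Body=json.dumps(records).encode())
    # In the real system, the result is logged to CloudWatch.
    print(f"campaign {campaign_id}: {len(records)} users materialized")
```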

Let's assume that a few days have passed and we're on Super Bowl Day, and you are a cloud operations engineer wondering how to scale up all the resources. That's the second use case: preparing the send. The first step starts with a cloud operations admin scaling up the ASG, which is short for Auto Scaling Group. After that, an engineer will scale up the workers in the ECS console. After that, the workers will fetch all the data we stored in S3 and hold it in memory. Remember, we're on game day, so it won't need to be there for much longer, really.

The last preparation step is, of course, that when all the workers start up, they log their complete status to CloudWatch so that we know everything is ready. Then we wait for the ad to show up on the screen and for someone to press the button. That's our third use case. This is what happens when a marketing manager hits the go button to send out the campaign. The API server will dispatch 50-plus messages to an intermediate FIFO SQS queue. After processing these SQS messages, the interim worker tier will dispatch more messages to the next SQS queue. The last tier of workers will then send the notifications by calling the batch APNs/FCM APIs, Apple's and Google's respective push messaging APIs. After all that, the workers log their processing time to CloudWatch for future analysis.
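A sketch of the fan-out shape of that send path, with hypothetical queue URLs and message fields: each interim worker turns one coarse "shard" message into many finer ones for the next tier, which is how one button press multiplies into hundreds of thousands of batch sends.

```python
import json
import boto3

sqs = boto3.client("sqs")

def interim_worker(in_queue: str, out_queue: str, fan_out: int = 50):
    """Consume one shard message; emit `fan_out` smaller shards downstream."""
    resp = sqs.receive_message(QueueUrl=in_queue, MaxNumberOfMessages=1,
                               WaitTimeSeconds=5)
    for msg in resp.get("Messages", []):
        shard = json.loads(msg["Body"])       # e.g. {"lo": 0, "hi": 50000}
        step = max(1, (shard["hi"] - shard["lo"]) // fan_out)
        entries = [
            {"Id": str(i),
             "MessageBody": json.dumps(
                 {"lo": lo, "hi": min(lo + step, shard["hi"])})}
            for i, lo in enumerate(range(shard["lo"], shard["hi"], step))
        ]
        # SQS batch sends accept at most 10 entries per call; a FIFO
        # destination would also need a MessageGroupId per entry.
        for i in range(0, len(entries), 10):
            sqs.send_message_batch(QueueUrl=out_queue,
                                   Entries=entries[i:i + 10])
        sqs.delete_message(QueueUrl=in_queue,
                           ReceiptHandle=msg["ReceiptHandle"])
```

The last tier would consume these fine-grained shards and call the APNs/FCM batch endpoints instead of another SQS queue.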

Let's come back to the technical requirements from the beginning, now that we've told you how we designed the system. The first question is, does it really send the notifications at the target rate? Our desired rate is to send 4 million notifications within 5 seconds. That is 800,000 messages per second. Some of you who have worked with AWS or SQS may know that it has an in-flight message limit, around 120,000 messages. How do we get past that? We actually did it in a very simple way, by batching the users.

We put 500 iOS users in the same batch and 250 Android users in the same batch, and we resolved that problem. The second problem is, can we provision all the cloud resources we need in time? To solve this problem, we got a technical contact from AWS who helped us draft an IEM, an Infrastructure Event Management document, which included detailed steps and action items that we could take to follow up and address our system's deficiencies. We also used spot instances to save money. Last but not least, we used a dedicated ECS cluster. You might be wondering why we did that. I'll talk more about that.
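To make the batching arithmetic concrete, here's a toy sketch (the even iOS/Android split is made up for illustration): with the batch sizes from the talk, 4 million users collapse into only a few thousand queue messages, comfortably under the in-flight limit.

```python
def batches(user_ids, size):
    """Yield fixed-size chunks of the audience."""
    for i in range(0, len(user_ids), size):
        yield user_ids[i:i + size]

# Illustrative numbers: 2M iOS + 2M Android users.
ios_users = [f"user-{n}" for n in range(2_000_000)]
android_users = [f"user-{n}" for n in range(2_000_000)]

ios_msgs = sum(1 for _ in batches(ios_users, 500))          # 4,000 messages
android_msgs = sum(1 for _ in batches(android_users, 250))  # 8,000 messages
print(ios_msgs + android_msgs, "SQS messages instead of 4,000,000")
```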

The most interesting topic: can we ensure resiliency? That is, can we ensure that when multiple marketers press the go button all at the same time, only one notification shows up on a user's screen? In this case, we leveraged a solution from AWS: the FIFO queue from the SQS service. It deduplicates messages by identifier, and it has a deduplication window of 5 minutes. When multiple marketers press the button at the same time, all these messages will be sent to the queue, but only one will effectively go downstream. Bear in mind, though, it has limited capacity, so we had to design the system around the queues in some way. If you're building this on your own, an alternative would be using a cache or a table.
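A minimal boto3 sketch of that trick (the queue URL is hypothetical): every press of the go button for the same campaign carries the same MessageDeduplicationId, so within the FIFO queue's 5-minute deduplication window, only the first message is actually enqueued.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = ("https://sqs.us-east-1.amazonaws.com/"
             "123456789012/superb-owl.fifo")   # hypothetical queue

def press_go_button(campaign_id: str) -> None:
    # Repeated presses share a deduplication ID, so within the 5-minute
    # window they collapse into a single downstream message.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=campaign_id,
        MessageGroupId="campaign-send",
        MessageDeduplicationId=campaign_id,
    )
```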

Testing the System

Let's come back to the system diagram; we've checked all the marks. It seems that we're done. That's how you build systems. No, actually, we have to test it, unfortunately. At Duolingo, we are obsessed with testing. We do a ton of A/B testing. We even have an operating principle of test it first. How do you really A/B test this system? The talk on Threads by Meta mentioned that they invited some users to test their features. This is a marketing campaign; we can't invite users to test our feature. We had to come up with our own ways of testing the system. Here are the three ways we tested the system, which I'll go into in detail. The first test that we ran was on throughput.

Basically, this is what we did right after we built the MVP. We wanted to know whether the system could actually deliver what it advertised. At first, we started simple, by testing with silent notifications. Basically, we send some payload to a user's phone, but it won't actually show up on their screen. It won't show up as a notification, but we know that it's delivered. At that time, we found a bottleneck: the thread count. I didn't mention that we built this entire system using Python, which is not exactly a performance-friendly language, but we're using Python, and a lot of threads means a lot of waiting time in this case. To address this problem, we tested with different numbers of threads.
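For context, "silent" here means a payload with no user-visible alert. On APNs, for example, it looks roughly like this (FCM has an equivalent data-only message type):

```python
# Silent (background) push: delivered to the device, never shown to the user.
silent_payload = {"aps": {"content-available": 1}}

# Visible push, for contrast (the copy here is made up).
visible_payload = {
    "aps": {
        "alert": {"title": "Practice time!", "body": "Keep your streak alive."},
        "sound": "default",
    }
}
```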

We decreased the number of threads from 10 to 5 to 1, and it seemed that the bottleneck was gone. Was it really? No. We then tested with a larger audience. We changed the audience size from 500,000 to 3 million, and discovered that we had a bottleneck in task count: namely, we couldn't scale up enough cloud resources. We experimented with different ideas, and one thing we did was to pack multiple processes into the same task. As you can see, we tested with different configurations, and eventually we got to a state where we were comfortable. As I just mentioned, we also tested the cloud resources.

It's important to have a system that can send that many notifications all at the same time, but if you cannot scale it up, you don't have a system. The first thing we tested was whether we could scale up the Superb Owl service. That was actually pretty easy. Then we went on to test whether we could scale up the backend, because we anticipated users clicking on the notification and coming back to the app, which would drive a lot of requests to our backend. That turned out to be pretty simple as well. What's the problem? The problem is that on Super Bowl Day, we have to scale up both of them. That turned out to be a problem.

We discovered that if you're already scaling up the backend and you want to scale up the Superb Owl service, you put in the request to scale up, but then it has to wait in the queue to get resources allocated. This happens when you have different services in the same ECS cluster. We discovered that problem early on, so we ended up giving it a dedicated ECS cluster, which resolved the problem: it doesn't have to wait in the queue. The last thing we wanted to test was whether we could scale up both the Superb Owl service and the backend in less than 3 hours, because, after all, we don't want to pay AWS too much money.

After we did the cloud tests, we actually tested with real users. You might remember me saying we can't test with real users the way Meta did with Threads, but I was actually lying. We did some tests with real users, but we did it in a smart way. We didn't want to leak the marketing creative, so we sent out various themed notifications toward the end of the year. In October, it was a Halloween-themed notification.

In November, it was year in review. In January, it was, please come back from your New Year break. Thanks to the Zombie mode that Vitor mentioned earlier, we were a lot more comfortable testing with real users, because we knew that we could detect issues and fix them reliably. Our lesson learned here is to send yourself a copy before sending it to the users. Don't ask me how I know this. The last step of testing the system was to make the system foolproof and write a foolproof plan. I'll use a real example from our system here.

As you can see, this is a very simple go button that nobody can miss. Everybody can press it. We thought it was perfect, until one day we realized that you could actually click on the square area around the circle, and it would send the notification. While that's highly unlikely, we don't really want that to happen. We don't want anyone to make that mistake and feel bad about themselves. We discarded that and made it a square button. It's not pretty, but it works.

The Day of the Super Bowl

On the day of the Super Bowl, we had a clear, well-written playbook. We had engineers on Zoom. We had marketers in front of their TVs holding popcorn, or at least that's what I expected them to be doing. We had the square go button. Everybody was waiting anxiously for the ad to happen, and the ad didn't happen. We waited 20, 40 minutes, and finally the ad happened, someone pressed the button, and the notifications went out. The result of our campaign was really great.

Ninety-nine percent of the notifications went out in 5.7 seconds, and 95% went out in 3.9 seconds. We got a lot of positive buzz on the Slack channel internally, and over on Twitter as well. We even had people write blog posts and journal articles about it. If you're interested, you can read those blog posts, and The Wall Street Journal piece, which is mostly about the marketing campaign but also has a portion on the notification.

Lessons Learned

To summarize what we learned in this project, we had some lessons learned. The first one is to always build a solid foundation. As Vitor introduced, we have so many wonderful tools, like the Galaxy App setup, which really sped up our development here, and Zombie mode and Freeze Gun, which empowered us in our development.

Pellegrino: It would have been impossible for us to try something like that had we not had those things in place. I think we also learned a little bit from the talk from Meta: they were able to do all of that because they already had things in place. With us it was no different. This was very important.

Zhou: Our second learning is to be open-minded about the system design, but rigorous with testing. At the very beginning, we had so many potential designs for the system. People proposed all sorts of things, using lambda, or even scheduling the notification on a user's phone, which didn't quite work because we don't know exactly what time the ad would air. We ended up choosing a design that we could reliably iterate on and test, which gave us the end product. Our third learning is to always build a system with resilience and robustness.

In this case, specifically, we want to build the system so that marketing managers cannot make mistakes. They could press the same button multiple times, it's just human behavior. It's normal. You have to build your system to prepare for those cases. Last but not least, things can always go wrong, so accept it. I'm not saying to ignore all the errors or just try not to fix anything. I'm saying that having problems is normal, just like what we saw with the square and the round button.

I mentioned earlier, don't ask me why we need to send a copy to ourselves: because we almost accidentally sent some nonsense, internal IDs, to our users, because we didn't proofread it. To deal with those problems, different people have different mechanisms. Our way was to build a playbook with the steps written out. A human being can follow it step by step and check things so that we minimize the chance of errors.

Pellegrino: In fact, actually two people pressed the button.

What Is Next?

That was what we did and the results of the campaign. I'd like to present a little bit about what we're thinking about next. These are a few topics. We don't have all the answers; these are things we're thinking about. I talked about why we don't use Kubernetes. We're actually looking at it more, and we're probably going to start using it more aggressively. The reason for that, as I heard in the architecture talks in this track, is that we want to start building something that looks more like APIs that actually speak to the problems the developers are facing.

In the beginning, as I said, we had our big monolith, and then we had perhaps one or two databases, and then that went to the hundreds. We were very good at doing something very low touch, which made our jobs as the platform people relatively simple in maintaining all of that. Now the next step for us is: how can we deliver that in a way that makes sense to the end users, which are our fellow Duolingo engineers? The next thing we're looking into is edge computing. Zombie mode is effectively an offline mode for us. It's very important for us to start exploring what that experience would look like, instead of doing it only in extreme situations.

How can we do that more aggressively? How can we do some fetching close to the end users' devices, or even on the devices themselves? This is something we're thinking about. Last but not least, MLOps. It's 2024; we all know what's happening around us. We also want to figure out how we can make the delivery of those AI models, and all of the tasks around that, more productive.

Questions and Answers

Participant 1: How did you consider the difference in latency you're going to get from different broadcasts? For example, you could receive the ad a couple of seconds later over the internet than someone receiving it from over-the-air TV. Also, how did you decide what the audience for this notification would be? Did you just go for North Americans, or people in the same time zone, or globally?

Zhou: The first question that we thought about is, how do we decide the actual time that it shows up on TV, because there are so many platforms. We ended up having three marketers watching it from different streams, so that whoever saw it the earliest would press the button; that way nothing could go wrong from that moment. Because it was the first year that we were doing a marketing campaign like this, what we set out to do initially was three cities in the United States, because it is the most relevant sporting event in the United States.

At the beginning it was New York, L.A., and, I think, Pittsburgh, but in the end we expanded to seven different cities. You may have seen in some of my slides that during our testing we tested with 3 million users at the beginning, because that was our initial estimate, but that number grew as we decided to air the ad in more areas and target more users in the notification campaign.

Participant 2: How did you identify those users, was it geolocation or subscription?

Pellegrino: It was basically based on when people last used the app. We had an idea about where they were based, and because we knew which regions we wanted to target, that's how we got the user base.

Participant 3: Did you track the success of the notification campaign in terms of user effect? In terms of knowing that users interacted with it: was it worth it in the end?

Pellegrino: It depends on how you measure. If you are only looking at one dimension, it might be worth it or not. For us, it was more about the opportunity to do branding, to do something that would reinforce the idea of Duo being perceived as this character. I think a lot of these things are actually measured in the long run. If you look only at how many new users we got, perhaps that's not going to tell you the whole story. People were very happy with the results. We got what we wanted.

Participant 4: There's one part of the system that you don't have control over, which is how fast Google and Apple will actually send a notification. Did you talk to them before?

Zhou: That definitely came up: first of all, we don't have control over their systems. Second of all, we had to do some investigation to understand whether they have rate limits against doing things this way. We reached out to them and asked whether they have specific rate limiting around sending notifications in a bursty pattern, which they don't. The first one is a hard problem, because we don't really have control over their API. We only have control over our internal system. We built it with the goal in mind that we want to measure only the performance of our internal system.

It's definitely a harder problem to approach when you're working with a vendor where you don't have control over their system. It really depends. If we were to say we wanted to send 400 million notifications within 5 seconds, we might as well talk to some of the representatives from those companies, Google or Apple, because I think at that level you would actually want to be more sure about the performance of their APIs and the guarantees, the SLOs, they can offer.

Pellegrino: We've seen that, not exactly on the mobile part, but on the AWS side. It was very important. Whenever you're doing something like that, maybe one takeaway, if I can emphasize it: work together with your vendors to understand their limitations. This is not an everyday thing for us, and perhaps not even for them.

Participant 5: How much of an extra increase was this on your cloud bill? Was it a lot?

Pellegrino: It was a lot for a brief amount of time.

Participant 6: After preparing for all this, did you have a contingency for the AWS region going down at the wrong time? Then the second thing is, you have the campaign data in the S3 bucket; is it all in just one bucket? Are all these workers competing to read from one bucket, or did you split the bucket?

Zhou: What we did was actually all in the same bucket, because the workers can be scaled up over a long period of time, maybe over the period of an hour or so. They only need to fetch the information into memory once, so there isn't actually that much competition. They're not refreshing it all the time, so we don't run into a bottleneck there. We had that AWS contact and held him accountable in some way: you need to make sure you do your job, and we have to follow all these steps, all these contingency plans that we made. If their system goes down, I don't think there's a backup plan in that case.

Pellegrino: They pinky promised it. That's a lot of the discussions that we had there. We tried to make the plan foolproof, but we knew exactly where it could potentially break, so we were OK accepting that a regional outage at that moment would probably be a bad deal for us, too.

 


 

Recorded at: Sep 24, 2024
