Orchestrating Resilience: Building Modern Asynchronous Systems


Summary

Sai Pragna Etikyala discusses her journey at Twilio, sharing practical examples from their projects, the challenges they faced, and how they overcame them.

Bio

Sai Pragna Etikyala is a Technical Lead at Twilio, currently leading the team responsible for A2P 10DLC compliance for messaging. Before joining Twilio, she worked at Amazon Web Services, Yahoo, and Cerner. Throughout her tenure at these companies, she developed robust end-to-end solutions and successfully managed complex operations.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Etikyala: I am Sai Pragna Etikyala. I am currently working at Twilio as a technical lead. I've been with Twilio for almost four and a half years now. Before that I worked with AWS, Cerner, Yahoo. I'm very thrilled to be here to talk to you about how we're building modern asynchronous systems in terms of resiliency. Twilio is a customer engagement platform. Essentially, you can engage with your customers via voice, messaging, video, and email. I particularly work on the messaging side of Twilio. If you ever received a message from Amazon saying your order is on the way, or DoorDash saying your pizza is delivered at your doorstep, those messages were probably delivered by Twilio. Twilio makes it very easy for you to engage with your customers on your application using different channels. When you think of SMS, lately there has been so much spam on SMS, it's been annoying. There's a lot of smishing that's happening over SMS. Smishing is a term for phishing over SMS. Attackers are getting very innovative and creative with their smishing attacks. The risk is only going up every single day.

A2P Messaging Compliance

In order to combat the spam and smishing in the telecommunication ecosystem and make SMS a reliable channel, a new regulation came up. Basically, all the telecom giants like T-Mobile, Verizon, and AT&T came together and said, we're going to put out some regulations, and we're going to make people follow them. This would help combat spam and smishing attacks, and also make the SMS channel a reliable one. The initiative is called A2P messaging compliance. A2P stands for application to person. A2P applies to all the messages that are sent from an application to an individual, and the compliance is applied to that communication. It aims to combat smishing and spam by regulating SMS traffic.

What does this compliance say? Basically, to be compliant while sending SMS messages to your customers, you need to register, and the registration has three different steps: brand registration, campaign registration, and number registration. Let's assume I run a pizza shop. Brand registration is me going and submitting my information: this is my pizza shop, this is my business name, this is my EIN, all the details. I'm registering myself as a business, and someone's going to verify that. The next is campaign registration. Campaign registration is when I go and say, I'm a pizza shop, and I want to send two-factor authentication messages for account setup and login, and I also want to send delivery notifications when my pizza gets delivered to my customers. Here, your use case is vetted against what you're doing. For example, if I'm a pizza shop and I say I want to send charity messages, it's probably not a valid use case; you've got to explain yourself more. Essentially, you're registering before you can even send messages. The third step is number registration. That's essentially saying, again, I'm a pizza shop, and I'm going to use these 10 phone numbers to send messages to my customers. That is A2P registration.

The Problems

We're here to talk about resilience, so why am I talking about A2P and compliance and messaging? That's what I'm getting at. So far, everything sounds good: there was a problem in the ecosystem and we're combating it using regulation. The problem, however, was that we built this system to enable our customers to register for A2P and be compliant themselves. Ours was a very event-driven architecture; like many others here, we had lots of queues. We ended up building a state machine that would handle this entire registration process. It was manageable in the beginning, but as the system evolved, it became complex. Essentially, this state machine evolved into one big file with 2000 lines of code that orchestrates everything. The problem was that the state machine didn't know the sequence of steps it needed to perform to finish a registration. For example, if I were to do a campaign registration, I would take details from my customer saying, I want to send two-factor authentication messages. I would send all these details to AT&T, Verizon, and T-Mobile, and wait for responses from all of them. These are steps that happen in a registration process. My state machine doesn't really look at it as a sequence of steps. My state machine looks at it as: I am getting event X, my database is in state Y, so I'm going to perform action Z.

That made things complex when we were implementing even the simplest of things. Let's say we have 10 steps, and all I want to do is add a small step in between, say step 5.2. We had to think in terms of state machines and state transitions, instead of thinking just in terms of, I need to add this new step in the middle of a sequence of steps. This thinking in terms of state machines and handling state transitions became very complex, and was often very error prone. Our system was primarily event driven. We were getting acknowledgments from different places. For example, if we send the campaign to AT&T, at a much later time, AT&T would send a webhook back. All the time, these campaigns are being manually reviewed: someone's going to a dashboard and saying, approve or reject this campaign. The state machine is getting input signals from a lot of places. Eventually, our code ended up becoming this very complex maze, which was hard for us to understand. Obviously, onboarding people became very hard. Our engineering team wasn't scalable. Also, just the cherry on top, this A2P ecosystem, this entire regulation, is a relatively new thing. We are integrating with these systems as they are being built, which means our system was supposed to be built with a lot of flexibility so we could keep changing things based on how the ecosystem reacts. That was hard. Building flexible systems became hard for us.

This is a very small example of what our state machine code looks like. Essentially, all this piece of code is doing is: if my database is in pending state, like if my campaign is in pending state, and I get a create campaign message, then run this block of code. If my campaign is in waiting-on-review state, and I got a review received message, then execute this other block of code. That's exactly what it's doing; it's very simple. The problem, as you can see, is that you don't really know what comes first and what gets executed next. All you know is there are these two disjoint events that are getting executed. In the bigger picture, when you have lots of events and lots of states, it'll be really hard to piece everything together to understand what's happening from step 1 to step 10.
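
To make that concrete, here is a minimal sketch in Java of the kind of state/event dispatch being described. The class, enum, and handler names are illustrative assumptions, not Twilio's actual code.

    // Hypothetical state-machine dispatch: behavior depends on the pair
    // (current database state, incoming event), not on any explicit sequence.
    void onMessage(Campaign campaign, Message message) {
        if (campaign.getState() == State.PENDING
                && message.getType() == MessageType.CREATE_CAMPAIGN) {
            handleCreateCampaign(campaign, message);
        } else if (campaign.getState() == State.WAITING_ON_REVIEW
                && message.getType() == MessageType.REVIEW_RECEIVED) {
            handleReviewReceived(campaign, message);
        }
        // ...in the real file, many more state/event combinations follow,
        // and nothing tells you which branch is step 1 and which is step 10.
    }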

Challenges in Building Resilient Workflows

We had a lot of challenges building resilient workflows. There was a lot of overhead when we were building our workflows using a state machine. The first one is state management. This is the same piece of code that I was showing earlier. You're just saying, if my database is in pending state, do this; and if my database is in waiting-on-review state and you got a review received message, then execute this piece of code. This is all good. What if we get the review received message while our campaign is still in pending state? That could happen. You could be using an eventually consistent database, and when you pull the record out, the database wasn't updated yet, so you still got pending. Probably, I need to go handle it there as well. What if I got an out-of-sequence event from somewhere, and I got a review received message in an empty state? I should probably go ahead and handle that as well. What if I get the same message while the database is in some other pending registration state? I should probably handle that differently. This is just two states we're talking about. Imagine if you are handling 40 messages with 20 different database states, it will get very complex for you to ensure that you are handling your message accurately in all states. Because in some states, you want to handle it differently; in other states, you might not want to handle it at all and ignore those events. Again, it becomes very complex and very easily error prone, because when you're building new functionality, you're not just focusing on adding step 5.2. Instead, you now have to look at all your states and add code to ensure that this message is reliably handled in every one of them.

Next, retry mechanisms. Handling retries almost became a task as complex as implementing the primary logic, sometimes even more so. There are a couple of ways you could do your retry mechanisms. One is storing a retry count in the database. You could say, I retried once, it failed, so I'm going to increment the counter in the database; I retried a second time, and I'm going to increment the counter again. A huge overhead. Or you can get creative and embed it in your SQS message, or something like that. Essentially, you dequeue a message, process it, and if it fails, re-enqueue the message, incrementing the retry count. A huge overhead. Different code paths would demand you handle it differently, which would make everything inconsistent. Now think of this other scenario, where I've sent this campaign to AT&T for approval. AT&T would normally send a webhook back to me in 2 hours or so, but I haven't received that webhook; the webhook somehow got lost. Now I need a mechanism to ensure that I can retry after 24 hours. How would I do that? In an event-driven system with a state machine, I would run a cron job to crawl the database and check whether any campaigns have been stuck in the same status for longer than 24 hours, then grab those and retry them. A huge overhead, running cron jobs on your production database.
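
As an illustration of the "embed the retry count in the message" approach described above, here is a minimal Java sketch. The QueueMessage and QueueClient types are hypothetical stand-ins for whatever queue is in use (SQS in the talk), not a specific API.

    private static final int MAX_ATTEMPTS = 3;

    // Dequeue, process, and on failure re-enqueue with an incremented retry count.
    void process(QueueMessage message) {
        try {
            handle(message);
        } catch (Exception e) {
            int attempts = message.getRetryCount();
            if (attempts + 1 < MAX_ATTEMPTS) {
                // Every code path has to remember to do this, and to do it the same way.
                queueClient.enqueue(message.withRetryCount(attempts + 1));
            } else {
                deadLetter(message);
            }
        }
    }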

The next challenge is auditability. The first step in building reliable workflows is ensuring that we have good auditability. In these complex systems, it's on developers to push out the right logs in the right places, so auditability becomes a huge hassle. The number of hours we spent digging into these logs and piecing things together when we got a customer escalation, or when we had a third-party API outage and were trying to assess the impact and recover those customers, was a huge pain. The next one is observability. Again, it's a huge overhead on the developer to figure out: how many registrations are in flight right now, how many of them succeeded, how many of them failed? Are we seeing any continuous errors with downstreams? Are we seeing any retries? All these metrics have to be pushed by the developer, which is basically another overhead on the developer.

Solutions

We were at the stage where we said, we have to do something about it, because we can't live with this system. We knew we had to rearchitect, and we started looking for options. At that point in time, we didn't know what options were out there; either we had to find an outside solution that would help us, or we would build a system ourselves. Thankfully, there were lots of options. We evaluated Temporal, Apache Airflow, Step Functions, and Netflix Conductor. This was our evaluation criteria. First of all, we wanted to write workflows as code, because we had complex business use cases with a lot of branching logic, and having workflows as code means better readability and better writability. It's more flexible having workflows as code. It would also be a big plus to have something that could run Java, because our existing code base was already in Java. Another criterion was that we already had the system built and wanted to reuse as much code as possible, so we were looking for an option that wouldn't require a significant overhaul of our system.

The next thing is dynamic changes. How we define dynamic changes is: we had very long-running workflows. Some of our workflows run for almost 90 or 100 days. We kicked off these workflows, and they're already in flight waiting for different approvals. Workflows were kicked off with, let's say, 10 steps, and now we have a new requirement saying, can you just add these extra 2 steps, the 11th and 12th, at the end of the workflow. When we push code changes for the 11th and 12th steps, all the workflows that are in flight need to pick up and execute these new steps, because we couldn't afford to have the existing workflows miss new updates for 90 days. That is what I would define as dynamic changes, and that was a really hard requirement for us. Then we wanted rate limited requests to downstreams. We were integrated with a lot of third-party APIs, and these APIs had very low rate limits, like one or two requests per second. To ensure that we adhered to those rate limits and didn't get throttled by our downstreams, in our previous architecture we introduced lots of queues, managed lots of queues, enqueued requests, dequeued requests, and handled failures. Something that would help us easily rate limit requests to downstreams would be a huge plus for us.

Then we were also looking for something with a managed cloud, because we didn't want to maintain the infrastructure. We took our evaluation criteria and compared them against all the options that were available out there. The first one was Temporal. In Temporal, you define workflows as code. There are SDKs available in multiple languages, including Java, Go, Python, and PHP. It also supports dynamic execution. That's what I was talking about, where we had workflows running for 90 days and we wanted these old workflows to pick up all the new updates. Temporal sat as a wrapper on top of our existing code; it didn't require a significant overhaul of our system, so check. It helped us rate limit requests to downstreams. It was just a neat configuration, I think the configuration was global queue rate limiting or something like that, and with just that one configuration we could rate limit requests to our downstreams. We didn't have to set up new queues just to handle rate limits to downstreams. Another neat thing about Temporal is that the execution is completely decoupled from the server, the orchestrator. Temporal offers a managed cloud option where they take care of infrastructure for you, so you don't have to think about maintaining the infrastructure yourselves. Temporal is also open source, so you can see what's going on in the background.

The next thing we evaluated was Apache Airflow. In Apache Airflow, you also write workflows as code, and you write them in Python. Apache Airflow has a very active open source community; if you find a bug and report it, you will probably get a response in a couple of days. Apache Airflow has been around for 10-plus years, so there are a lot of plugins available. Apache Airflow works differently from the way Temporal works. Apache Airflow supports static workflow execution. Essentially, when you're starting the workflow, you say, these are my 10 steps that I need to process in this order. Apache Airflow takes those 10 steps, stores them in its database, and then orchestrates them exactly in that order. It also has managed cloud options. Most of the use cases that I've seen for Apache Airflow were data pipeline use cases.

The next was AWS Step Functions. It integrates very seamlessly with AWS services, and lambdas are primarily used to run the steps in AWS Step Functions. The workflows are written using drag and drop interfaces. Essentially, if you're already in the AWS ecosystem, you're running your code using lambdas, and using DynamoDB triggers, and so on, probably you could consider using AWS Step Functions. Especially if you have very simple use cases, and you're already in the AWS ecosystem, AWS Step Functions would be a no-brainer. Our final decision was Temporal. Temporal was the one that checked all the boxes for us.

Architecture Migration Using Temporal

This is how we started migrating our old architecture to the new architecture using Temporal. The first thing is, we started running it in hybrid mode. We started very small. For example, if your registration or your workflow has 10 steps, and I need to add just this one step, 5.2, in between, we ran it in hybrid mode, such that the old code executed the first five steps, Temporal executed the 5.2 step, and then Temporal handed execution off to the old code again. We adopted Temporal in any small initiative we could. Temporal also let us trigger child workflows that are owned by different teams. For example, I could be this parent workflow, and I want to trigger a workflow that is owned by a completely different team. They could host and run that child workflow in their own code base. Child workflows and parent workflows don't have to share the same code base, so we were able to draw team boundaries very easily and then rearchitect our platform. We ended up finally sketching out our workflow designs, what our workflows should look like, and our parent-child relationships. Then we split the workflows, built our end-to-end testing plan, and started running in hybrid mode.
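
For the child workflow part, here is a minimal sketch of how a parent workflow might start a child workflow that another team's workers run, using the Temporal Java SDK. The interface name and task queue are illustrative assumptions.

    // Inside the parent workflow implementation: build a stub for a child workflow
    // that is hosted and executed by a different team's worker fleet.
    OtherTeamRegistrationWorkflow child = Workflow.newChildWorkflowStub(
            OtherTeamRegistrationWorkflow.class,
            ChildWorkflowOptions.newBuilder()
                    .setTaskQueue("other-team-task-queue") // the other team's workers poll this queue
                    .build());

    // Calling the stub runs the child workflow on their workers;
    // the parent simply waits for the result.
    child.register(brandId);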

Immediate Impact

The very first immediate impact we saw was a change in our thought process. We could stop thinking in terms of states and state transitions, and we could finally start thinking about it as a simple sequence of steps. Add a new step in between? Just go ahead and add that step in between. We could stop thinking about how this one step that I'm adding would impact the rest of the 20 steps that I already have. Things became simpler; our thought process became simpler. We noticed that teams had a very small learning curve. Teams loved it, and they adopted it in every initiative. Remember how we started very small, running it hybrid, saying we'll just add this one step of the entire registration in Temporal. We saw that get migrated very fast; in no time, we had migrated the entire end-to-end workflow to Temporal. We were also able to scale engineering teams. That was a big problem for us before, where onboarding new people to the team was so hard. It would take a couple of months for people to get used to the existing code base and start making progress on it. After adopting Temporal and switching to the new architecture, we were able to scale engineering teams very fast.

This was the code that we previously discussed, where we had the state machine handling messages in different states and executing different blocks of code. On the other hand, the same thing using Temporal is just three steps. Step one, you're creating a campaign. Step two, you're doing an await: you're waiting for the campaign to get approved or rejected. Step three, once you have the review result, you are marking the campaign as approved in the database, or marking it as rejected. This is not just pseudocode; this is what your production code would look like, very simple. You just have three steps: first, create the campaign; then you're waiting for it; then you're marking it in the database. You can write it as a simple sequence of steps. The best part was that code was reusable during our migration. This is what I was talking about before, where Temporal sat as a wrapper on top of our existing code. For example, we already had code written for loading a campaign from our database. All we had to do was write an interface with the right annotations, and in the implementation class, just call the already existing code that we have. That's it. We didn't have to rewrite our code; Temporal was just sitting as a wrapper on top of it.
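
Since the slide code isn't reproduced in this transcript, here is a minimal sketch of what such a three-step campaign workflow could look like with the Temporal Java SDK. All type and method names (CampaignRegistrationWorkflow, CampaignActivities, and so on) are illustrative assumptions, not Twilio's actual code; each top-level type would live in its own file.

    import io.temporal.activity.ActivityInterface;
    import io.temporal.activity.ActivityOptions;
    import io.temporal.workflow.SignalMethod;
    import io.temporal.workflow.Workflow;
    import io.temporal.workflow.WorkflowInterface;
    import io.temporal.workflow.WorkflowMethod;
    import java.time.Duration;

    @WorkflowInterface
    public interface CampaignRegistrationWorkflow {
        @WorkflowMethod
        void register(String campaignId);

        @SignalMethod
        void reviewReceived(String result); // the carrier webhook / manual review lands here
    }

    public class CampaignRegistrationWorkflowImpl implements CampaignRegistrationWorkflow {
        private final CampaignActivities activities = Workflow.newActivityStub(
                CampaignActivities.class,
                ActivityOptions.newBuilder()
                        .setStartToCloseTimeout(Duration.ofMinutes(5))
                        .build());

        private String reviewResult;

        @Override
        public void register(String campaignId) {
            activities.createCampaign(campaignId);          // step 1: create the campaign
            Workflow.await(() -> reviewResult != null);     // step 2: wait for approval or rejection
            if ("APPROVED".equals(reviewResult)) {          // step 3: record the outcome
                activities.markApproved(campaignId);
            } else {
                activities.markRejected(campaignId);
            }
        }

        @Override
        public void reviewReceived(String result) {
            this.reviewResult = result;
        }
    }

    // The activity interface is the "wrapper": its implementation simply delegates
    // to the service code that already existed before the migration.
    @ActivityInterface
    public interface CampaignActivities {
        void createCampaign(String campaignId);
        void markApproved(String campaignId);
        void markRejected(String campaignId);
    }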

Building Resilient Workflows Made Easy

We've discussed all the problems that we had in our previous architecture. Now I'm going to go over those and talk about how the new architecture helped us. State management: we didn't have to think in terms of state management anymore. There was no concept of state management for us; it was completely abstracted away. Things were as simple as writing a sequence of steps: this is step one, step two, step three, that's it. No issues. Next, retry mechanisms. If you remember from earlier, to ensure that we retried a certain failure, let's say I want to retry three times, I would either persist it in the database, saying I've retried three times, or I would embed it in the SQS message or whatever queue message, which was a lot of development overhead. On the other hand, when we moved to Temporal, all we had to do was set maximum attempts. It was just a simple configuration: set maximum attempts, and that's pretty much it. There was no other retry overhead for us.
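
For reference, here is roughly what that "set maximum attempts" configuration looks like with the Temporal Java SDK; the specific timeouts and counts are illustrative.

    import io.temporal.activity.ActivityOptions;
    import io.temporal.common.RetryOptions;
    import java.time.Duration;

    // Retries become configuration on the activity options instead of hand-rolled
    // counters in the database or in queue messages.
    ActivityOptions options = ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofSeconds(30))
            .setRetryOptions(RetryOptions.newBuilder()
                    .setInitialInterval(Duration.ofSeconds(1))
                    .setBackoffCoefficient(2.0)
                    .setMaximumAttempts(3) // "set maximum attempts" and you're done
                    .build())
            .build();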

The other bigger issue that we discussed with retries was timeouts. Essentially, I sent a request to AT&T for approval, AT&T sent me a webhook back, but the webhook got lost somewhere, and I want to retry this operation after 24 hours. In the previous implementation, we wrote a cron job that crawls the database to check which campaigns went past the SLA of 24 hours, finds those, and retries them. That's what we had to do previously. Now, I just set a timeout here, and if this action times out, meaning we waited for 24 hours and haven't gotten an approval from AT&T, I retry the action. That's it. It's simply handled in code with two extra steps.
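
The lost-webhook case maps to a wait with a timeout. Here is a minimal sketch inside a Temporal workflow method, assuming the signal-based workflow above; resubmitCampaign is a hypothetical activity name for "send the campaign again".

    // Step 2 with a deadline: wait up to 24 hours for the review signal to arrive.
    boolean received = Workflow.await(Duration.ofHours(24), () -> reviewResult != null);
    if (!received) {
        // The webhook never showed up within the SLA, so resubmit and wait again.
        activities.resubmitCampaign(campaignId);
        Workflow.await(() -> reviewResult != null);
    }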

Then, next is auditability. Previously, to build these reliable workflows, the first step was ensuring that we have good auditability to understand where things went wrong so that we can go fix them. After migration, we had auditability out of the box. The way Temporal works is that it uses the concept of event sourcing, and it stores every event that you received as a log. You can see all the events that you've received, all the inputs that you've given to a certain downstream, and all the outputs that you got. Everything is essentially logged, which made our debugging very efficient and easy. The next thing was observability. Again, all the metrics that we discussed came out of the box for us. Essentially, we were at a state where we got state management abstracted away, we got metrics out of the box, we got auditability out of the box, and our retries were pretty much configuration, which made everything very easy. This concept of the queues and the state management being abstracted away from you is called durable execution.

We've discussed all the features of a durable executor, and all the things that come out of the box for you. There are some more features that we haven't discussed. My favorite is global queue rate limiting: to ensure that you're not getting throttled by your downstream and that you're adhering to its rate limits, all you have to do is set a small configuration, max queue activities per second, and that's it. In our registration flow, we had a lot of manual approvals for campaigns, brands, and so on, so signaling was another thing that we used very heavily. There's also scheduling: you can schedule tasks that run at a specific time, and you have cron jobs that you can run out of the box.
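
For the rate limiting piece, here is roughly what that worker-side configuration looks like in the Temporal Java SDK; the task queue name, the limit of 2 activities per second, and the local connection are illustrative assumptions.

    import io.temporal.client.WorkflowClient;
    import io.temporal.serviceclient.WorkflowServiceStubs;
    import io.temporal.worker.Worker;
    import io.temporal.worker.WorkerFactory;
    import io.temporal.worker.WorkerOptions;

    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // One setting caps activity executions across ALL workers polling this task queue,
    // which keeps requests under the downstream's low rate limit without extra queues.
    Worker worker = factory.newWorker("a2p-registration-queue",
            WorkerOptions.newBuilder()
                    .setMaxTaskQueueActivitiesPerSecond(2.0)
                    .build());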

Key Takeaways

So far, we've discussed the challenges that we faced when we built an asynchronous system using a state machine, and also the value that durable executors provide: everything is abstracted away from you, you can just focus on your functionality, and building resilient, durable workflows becomes much easier. We've also looked at different workflow orchestrator options.

Questions and Answers

Participant: How did you handle some of the things you've talked about, like the signals, and the retries, and all that stuff, where you showed a little bit of code? Is that distributed across the system that Temporal is managing, or is that [inaudible 00:33:22]?

Etikyala: If you're asking whether it's global, like whether it works across distributed systems, yes. It's not local to a single host. If you set your rate limit to your downstream as 3 or whatever, that is distributed across all the hosts. Even if you're running 20 workers, that rate limit of 3 is shared across those 20 workers. It's global.

Participant: I imagine, since you're in a big company, there are a lot of teams with thousands of engineers, so this problem is not very special. I'm curious why this discovery process was done in your team. I would imagine something like this at a platform level should have already existed and should have already been built by other teams.

Etikyala: Before we even got into this whole problem with the state machine and ended up with this complex thing, we thought this problem was very unique to us, something only we were facing. Now, if I take a step back and look at all the other solutions that I've built, I would use a workflow orchestrator like Temporal anytime I have to manage my own queue. You would think this is a very niche use case, but I think it comes up literally every day. If you have data pipeline tasks, you can use an orchestrator. For example, if you use DoorDash, it's an entirely asynchronous process; you kick off an asynchronous process in the background. The way I think of it is: if I'm managing a queue by myself, I would abstract it away using a durable executor.

Participant: Are there other Twilio teams that already have something like this, something similar?

Etikyala: Other Twilio teams? We had some of the teams using Apache Airflow, and we had some of the teams using Temporal. There was actually nothing built at Twilio for this.

Participant: I've used a lot of workflow engines over the years; it's obviously changed a lot from the earlier BPMN days and that kind of thing. Over time I've found, especially in applications where there are a lot of these workflows, that the problem is the workflow engines can be so generic, and usually the backing state store for the workflow engine becomes a huge performance bottleneck. I've encountered that with a couple of different technologies. I'd be curious if that's something you encountered with Temporal, and if there's anything you had to do to overcome the performance of the workflow engine itself.

Etikyala: One of our evaluation criteria was choosing a workflow orchestrator with a managed cloud option. We are now on Temporal Cloud. We haven't seen any issues so far. You can basically sit down with the team and say, this is what I'm expecting. This is the rate limit that I'm expecting, and your cluster will be scaled up to handle that much. There's also autoscaling and all that. We haven't experienced any bottlenecks on cloud so far.

Participant: How does Temporal communicate with your application service? What if your application service is down, how does it manage that?

Etikyala: The way Temporal works is that it decouples the application code and the execution of the code from the server. The server is just orchestrating. Even if your application is dead for some time, the orchestrator still has an understanding of where exactly the workflow stopped. If your application stopped after executing step five, that is recorded in the server, and when the application comes back up again, it will ask the server, "I don't know where I am. Where do I start executing?" Then Temporal will send the entire workflow history to the application, and the application can check, I've executed steps 1, 2, 3, 4, 5, so I'm going to start executing from step 6.

Participant: The application, is that via a queue, or a gRPC call?

Etikyala: A gRPC call. You're not managing any queues. It's completely abstracted away.

Participant: So your point-to-point connections are gRPC calls?

Etikyala: Yes.

Participant: My question is more around the business side of the A2P registration solution, the workflow. At what layer does the state come in? Your current workflow, does it support Twilio's A2P registrations on the console, or does it power the APIs that are exposed to your ISVs, for instance? Because I understand that, for your ISVs to also do A2P registration, there are a bunch of APIs that they have to call. Typically, each of those APIs is somewhat isolated, and they are not exposed in terms of a workflow; in Twilio, you just call this API, then this API, then this API in some specified order. Now that you've explained the workflow, I'm just wondering how to marry that in my head with the APIs that are exposed. Or is your workflow essentially powering the registration that you expose on the Twilio console itself?

Etikyala: No. Basically, we have these backend APIs, and both the Twilio console and the public APIs that you're talking about end up triggering the same exact APIs, and those APIs, whenever you kick off a registration, will kick off a Temporal workflow on the backend.

Participant: It seems that the capability of being able to orchestrate all those workflows is central for Twilio. My question is regarding that dependency, because even though Temporal is open source, as I understood, you're using another cloud service, another SaaS service, so you've created a dependency on another company that can really change their business model in the future. With that dependency, what was the reasoning behind that choice?

Etikyala: Temporal is open source. Yes, we're using the managed cloud, but our code and its execution are entirely in-house. Our code is not executed on Temporal Cloud at all; Temporal Cloud just orchestrates. If Temporal Cloud changes their plans, or we want to move away from Temporal Cloud, all we have to do is spin up our own Temporal cluster in-house. In fact, we still have Temporal clusters in-house, and we're migrating those workflows to cloud. We always have a backup: if we were ever to change our path and say we don't want to use a hosted solution, we want on-premise, we could do that. Again, the way Temporal works, your application code runs entirely on your hosts, so the cloud is just orchestrating. That's why we're not worried about that.

Participant: I have a question related to managing upgrades to the workflow. For example, you defined a workflow with, say, 5 steps, and now you have 10 workflows running at step number 3, hypothetically. You need to add a new step before step 3, or after the last one. How did you manage those upgrades for the in-flight workflows, making sure they're not broken?

Etikyala: Definitely. Versioning has been a pain for us; basically, if you have five steps, just adding a new step in between was a pain. The way we would manage it is obviously using versioning. Or, in situations where we're making a lot of changes to workflows, we just clone that workflow as a v2 or v3 workflow. Temporal announced a better way of versioning; maybe that's something that will help you out.
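
As one hedged sketch of how an in-place change can be guarded with the Temporal Java SDK's versioning API: the change ID and the extraValidation activity below are illustrative, not the speaker's actual code.

    // Inside the workflow method: getVersion records a marker in new histories and
    // replays old histories without it, so in-flight workflows are not broken.
    int version = Workflow.getVersion("add-extra-validation", Workflow.DEFAULT_VERSION, 1);
    if (version == 1) {
        activities.extraValidation(campaignId); // only executions that recorded version 1 run this
    }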

Participant: In your system, you're not thinking about it as a state machine but as a workflow. Does this apply to all asynchronous systems, or is it only specific to this problem?

Etikyala: I think it applies to all event-driven asynchronous systems. In any event-driven or asynchronous system, you have a queue and you're managing that queue. When you dequeue something and process it, what if it fails? You have to re-enqueue that message, and sometimes you might end up in a loop. So yes, it would apply to any asynchronous or event-driven system. If you're thinking of adding a queue and managing it yourself, probably think of using an orchestrator instead.

 


 

Recorded at:

Feb 19, 2024
