
Building a Reliable Kafka Data Processing Pipeline with Lily Mara

In today's episode Thomas Betts talks to Lily Mara, Engineering Manager at OneSignal in San Mateo, California. She manages the infrastructure services team, which is responsible for in-house services used by other OneSignal engineering teams. They will discuss how to build a reliable Kafka data processing pipeline. OneSignal improved the performance and maintainability of its highest-throughput HTTP endpoints, which are backed by a Kafka consumer in Rust, by making it an asynchronous system. This shift from synchronous to asynchronous can simplify the operational cost, but it introduces new challenges, which we will dive into.

Key Takeaways

  • OneSignal has a primary pipeline that can be simplified as taking a request in an HTTP handler and shoving something into Postgres. Doing this synchronously in the HTTP handler would not scale.
  • Because Postgres would be overwhelmed by the load, Kafka was introduced to more carefully control the load.
  • The HTTP handler used a Kafka producer written in Go, but OneSignal had to write its own Kafka consumer in Rust, its preferred language.
  • Safely and reliably scaling out requires thoughtful sharding and partitioning of the workload. OneSignal uses a customer app ID as a key factor in their sharding and Kafka partitioning strategies, which keeps customer messages in the correct order they were received.
  • Moving from sync to async processing required new metrics to measure service health. Two of these were message lag (the number of messages sitting in Kafka awaiting processing) and time lag (the time between a message being received and processed).

Transcript

Intro

Thomas Betts: Hey folks. Before we get to today's podcast, I wanted to let you know that InfoQ's international software development conference, QCon, is coming back to San Francisco from October 2nd to the 6th.

At QCon, you'll hear from innovative senior software development practitioners talking about real world technical solutions that apply emerging patterns and practices to address current challenges. Learn more at qconsf.com. We hope to see you there.

Hello and welcome to the InfoQ Podcast. I'm Thomas Betts. Today I'm joined by Lily Mara. Lily is an engineering manager at OneSignal in San Mateo, California. She manages the infrastructure services team, which is responsible for in-house services used by other OneSignal engineering teams.

Lily is the author of Refactoring to Rust, an early access book by Manning Publications about improving the performance of existing software systems through the gradual addition of Rust code.

Today, we'll be talking about building a reliable Kafka data processing pipeline. OneSignal improved the performance and maintainability of its highest-throughput HTTP endpoints, which are backed by a Kafka consumer in Rust, by making it an asynchronous system.

The shift from synchronous to asynchronous can simplify the operational costs, but it introduces new challenges, which we'll dive into. Lily, welcome to the InfoQ Podcast.

Lily Mara: Thanks, Thomas. I'm really excited to be here and to have an opportunity to chat about this with more people.

Messaging at OneSignal [01:06]

Thomas Betts: So you spoke at QCon New York and the title of your talk was, and this is a little long, "How to Build a Reliable Kafka Data Processing Pipeline, Focusing on Contention, Uptime and Latency." That's a lot of stuff to focus on.

I want to get to all those things and we'll break it down, but I want to back up a bit. Let's start with what does OneSignal do for the people that aren't familiar with it, and then what were the data processing pipelines specifically that you were trying to optimize?

Lily Mara: Definitely. I do also want to say that talk title was not chosen by me. The track hosts I think filled it in as a kind of temporary holding thing until I filled it in with something a little bit snappier, but I never ended up doing that and it was so long I couldn't get it to fit on the title slide in my slide deck. So my title slide just said “The One About Kafka.”

I think at the original conference where I gave that talk, it had a snappier title, but QCon likes the very specific titles that walk everybody through what to expect.

I think the original snappy title of the talk was A Kafkaesque Series of Events because there's quite a confounding series of steps and debuggings and incidents that we went through during that process at OneSignal.

So OneSignal is a customer messaging company. We allow our customers who are website builders, app builders, marketers, developers to contact their own user bases more effectively.

So this might be things like sending people real-time information about live events happening in the world. It might be letting people know about promotions that are customized to their own particular interests or needs. It might be letting them know that they have abandoned carts on the website that they should come back and take a look at.

There's all kinds of ways that people use our products and it would be probably a waste of my time to try and come up with all of them because customers are so much more creative with open-ended platforms like ours than our developers ever could be.

Before Kafka, OneSignal had a Rails monolith [03:08]

Thomas Betts: But messaging is the core element, and OneSignal is definitely trying to provide communication pathways. And so you're doing that with Kafka underneath it, or did you have something before you introduced Kafka?

Lily Mara: So before we used Kafka, OneSignal was a single Rails monolith that absolutely everything happened inside of, right? We had Rails for HTTP handling and we sent asynchronous jobs off to Sidekiq for processing.

I think from the origin, I think we were always doing some amount of asynchronous processing because the nature of our original product offering, we've since grown much broader, but the nature of our original product offering was you have a large user base and you want to send that user base or some subsection of that user base, a large messaging campaign.

You want to let everybody know that there's a sale and you can't really do that synchronously in an HTTP handler because you don't want to be doing N things in an HTTP handler. You want to be doing one thing, and that one thing is shoving a little bit of metadata into Postgres and then firing off a Sidekiq job.

And then that Sidekiq job can have queuing applied to it so that it doesn't overwhelm your database. And later on when there's capacity to process that notification delivery, you can use your Postgres database to enumerate all the users that a notification is going to go to and deliver the messages.

So that was kind of our first foray into asynchronous stuff, and it was really there from the beginning.

But yeah, what I want to talk about today is several years down the line for OneSignal was the point where we started having some trouble with one of our seemingly more simple but much more high throughput HTTP endpoints.

This is the endpoint that allowed our customers to update properties on each subscription in the database, like each device record, each app installed on an iPhone, each user profile on a website, things of this nature. We allow our customers to store key value pairs as well as some other related metadata on every subscription that's out there.

And anytime we got an HTTP request from a customer, we were taking the metadata from that request and just shoveling it directly into Postgres. And CRUD APIs work fine up to a certain scale, but there will come a point eventually as you start scaling up and up and up where those updates and requests are going to start to overwhelm Postgres, right?

Postgres is a great database, but it certainly has some limitations and we started to run into those pretty aggressively.

Why async processing was needed [05:39]

Thomas Betts: And so you're talking about the input stream, like you were getting more requests than you can handle. It wasn't that the async going out was being too slow, it was you got thousands or millions of requests a second and that was overloading Postgres?

Lily Mara: That's correct, yeah. So we're at the point now where we haven't even introduced this asynchronous processing for this particular endpoint. So this is all happening synchronously inside of our HTTP handlers.

And at that point we decided we want to take this synchronous work and we want to add queuing to it so that we can much more carefully control how much concurrency is hitting Postgres and how much load we're applying to Postgres.

And that was the point that we shifted from a synchronous workload in our HTTP handler and we turned that into a Kafka produce that was later picked up by a Rust Kafka consumer, which we wrote and that Kafka consumer would apply the data change to our Postgres database.

And there were a lot of interesting engineering challenges that came along with this, not the least of which was at that time especially absolutely no one was writing Kafka consumers in Rust.

We based our stack on the RdKafka library. There's a Rust wrapper library for RdKafka that we use very heavily, but beyond the basic, "Please give me another Kafka message," there really wasn't much in the way of control for writing a high level Kafka consumer.

So we actually developed our own library for controlling concurrency within our consumers, which we very originally called Kafka Framework internally. And we've had talks to open source this because presumably someone else is just as silly as us and wants to write production Kafka consumers in Rust, but we have so far not been able to do that.

Why OneSignal chose to write a Kafka consumer in Rust [07:23]

Thomas Betts: So what was the decision behind using Rust if it didn't exist already? Did you look at other options or is it, "We have other things. We're using Rust and we're going for Rust on all these other platforms, we want to use it here"?

Lily Mara: So we had bet big on Rust pretty early. One of the core pieces of OneSignal's technology is OnePush, which is the system that powers all of our deliveries.

I think we wrote about that publicly on our engineering blog in, I want to say like 2016, which is about a year after Rust's 1.0 release, maybe even less than a year, maybe like eight or 10 months after the 1.0 release of Rust.

So we were pretty early in putting Rust into production at OneSignal, and we were really happy with both the speed of runtime once it's deployed in production, the low overhead, the low cost of running it because of that low overhead, and also frankly, the ease of development.

Everybody talks about how Rust is kind of difficult to learn, but I think once you have a certain level of competency with it, it actually lends itself to being a very productive language for people to use in production.

So many classes of bugs that you have to worry about in other languages, you just don't even have to think about those in Rust. And that's really convenient for folks.

Everybody who reports to me at least is, I would say, primarily a Rust developer, and they also do some amount of Go at this point, but I have taken people from absolutely no Rust experience whatsoever to being pretty proficient Rust developers. So I have definitely seen what it takes for folks to get spun up there and it's, I think, honestly much less than the internet Hype Machine makes it out to be.

The tech stack: Go, Kafka, Rust, and Postgres [09:06]

Thomas Betts: You went with Rust, so how did it fit into the stack at that point? You had an HTTP endpoint on the Rails monolith; were you changing that?

Lily Mara: So at that point the HTTP handling for that particular endpoint had already been moved out of Rails. So I want to say in 2018, we had taken some of our highest throughput HTTP endpoints and we had actually moved them to a Go HTTP server.

We had wanted to put them in a Rust HTTP server, but at that point there really wasn't a Rust HTTP library that we felt was kind of production grade. There really wasn't anything that had the kind of feature set that we were looking for in an HTTP library.

I think the situation is much different now, but we're talking about five years ago. I think the libraries were Iron and I don't remember what the other ones were, but it was a pretty complex story for HTTP in Rust.

And compared with, "Write something that manages concurrency," the level of engineering that we would have to put into, "Write your own HTTP server library," was enormous and we weren't ready to take that on.

Thomas Betts: No, that makes sense. So you had Go for the HTTP server, and then how did you get to the Rust code? Where does that live in the stack?

Lily Mara: So basically everything that we have today, we have Ruby and Go at the HTTP layer. So touching customer data directly, and then I would say everything underneath that customer data handling layer is a mix of Rust and Go.

In the future, we are developing new services exclusively in Rust. So everything that's kind of under the skin of OneSignal is written in Rust with some older stuff that is written in Go.

Thomas Betts: So is there a Rust service that the Go HTTP just calls and says, "Please handle this request," and then it's very thin at the HTTP endpoint?

I'm trying to go back to you said that you had a lot of requests coming in and the Rust into Kafka is going to be really quick, but now we've got Go on top of it. I'm just trying to understand where's the handoff go from one to the next?

Lily Mara: So we've got a Go HTTP server and basically what it does is it dumps an event into Apache Kafka, right? And then later on we have a Rust Kafka consumer that picks that event off of Kafka. So those two things don't speak to each other directly. They kind of use Kafka as a bridge to go in between them.

Thomas Betts: Gotcha. So the Go is straight into Kafka, and it's the consumer pulling it out of Kafka is what you wrote in Rust?

Lily Mara: That is correct. Yeah.

Thomas Betts: It wasn't the Kafka producer. I was conflating the two in my head, but you had a Go Kafka producer, and now a Rust Kafka consumer. And so again, that Go HTTP server, very thin. All it does is sticks it into Kafka and you're done. And then it's async after that.

The Rust Kafka consumer [11:43]

Thomas Betts: And so now it's running, you've got Rust, and what does it do and how does it handle concurrency?

Lily Mara: That system more or less takes the event data off of Kafka, deserializes it into a struct that the Rust library can understand, and then it sends Postgres update calls across the wire to Postgres, and the concurrency story there went through a lot of phases.

There's far too much traffic, of course, for us to just process one event at a time per process–that wouldn't really work for us. Kafka has some inbuilt support for concurrency already. It has a data structure called partitions, and basically each partition of a topic inside of Kafka is an independent stream of events.

So you can consume those totally independently from one another. And this is kind of the first layer of concurrency that we have. This particular data stream is divided into, I want to say 720 partitions. So we could, in theory, if we did no additional concurrency support, we could consume 720 events concurrently and be sending Postgres 720 Postgres update calls concurrently.

That's turned out to not be enough concurrency for us, because we had more events than that. So what we did is we developed a strategy for doing concurrency within each partition.

And so we would have a number of worker threads for each partition and we would ask Kafka for a number of events at once and shuffle those events to the various worker threads and process them concurrently.

Maybe this works fine for certain workloads, but we had an issue which was we didn't want to process any updates for the same subscription concurrently or out of order.

So if my iPhone has two property updates coming for it, one setting A to 10, one setting A to 20, if the order that we received those at the HTTP layer was 10 and then 20, we don't want to apply 20 to the database and then 10 milliseconds later apply 10. We want the most recent thing that a customer sent us to be the thing that gets persisted into the database.

Why a customer would send us updates like that is not really our concern. We just want to be accurate to what the customer does send us.

Thomas Betts: Right. Can't control what the customer is going to do with your system.

Lily Mara: You certainly cannot.

Thomas Betts: You at least acknowledge it could be a problem.

Lily Mara: Yes, we acknowledged it could be a problem. And so what were we to do? We can't have unlimited concurrency. So we use a property on each event, which is the subscription ID, more or less the row ID in Postgres.

We hash that and we make a queue in memory that's for each one of the worker threads. So each worker thread, maybe there's four worker threads per partition, each one of those four worker threads has a queue that is dedicated to it.

And this is like a virtual queue because there's a Kafka partition that's like the real queue. We can't really muck around with that too much. But for each worker thread, we have kind of a virtual queue, like a view onto the Kafka partition that only contains events that hash to the same worker thread ID.

And what this means is if there are two updates that are destined for the same row in Postgres, the same subscription, that means they're going to have to be processed by the same worker thread, which means they're going to sit on the same queue, which means they are going to be processed in the order that we receive them at the HTTP layer.
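The routing described above can be sketched roughly as follows. This is a minimal illustration, not OneSignal's Kafka Framework: the `Update` struct, the `worker_for` and `route` functions, and the choice of `DefaultHasher` are all assumptions made for the example. It only shows how hashing the subscription ID keeps every update for one row on a single queue, in arrival order.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::VecDeque;
use std::hash::{Hash, Hasher};

/// One property update read from a Kafka partition (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Update {
    subscription_id: String,
    value: u32,
}

/// Pick the worker whose virtual queue owns this subscription.
/// The same ID always hashes to the same worker index.
fn worker_for(subscription_id: &str, workers: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    subscription_id.hash(&mut hasher);
    (hasher.finish() % workers as u64) as usize
}

/// Split one batch from a partition into per-worker queues,
/// preserving the order in which the updates were read.
fn route(batch: Vec<Update>, workers: usize) -> Vec<VecDeque<Update>> {
    let mut queues = vec![VecDeque::new(); workers];
    for update in batch {
        let idx = worker_for(&update.subscription_id, workers);
        queues[idx].push_back(update);
    }
    queues
}

fn main() {
    let batch = vec![
        Update { subscription_id: "iphone-1".into(), value: 10 },
        Update { subscription_id: "web-7".into(), value: 1 },
        Update { subscription_id: "iphone-1".into(), value: 20 },
    ];
    // Four workers per partition, as in the example in the conversation.
    for (i, queue) in route(batch, 4).iter().enumerate() {
        println!("worker {}: {:?}", i, queue);
    }
}
```

In a real consumer each queue would be drained by its own worker thread. Because both `iphone-1` updates land on one queue, 10 is always applied before 20.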

How to handle partitioning [15:18]

Thomas Betts: So you've got a couple of layers. You said the partition is the first layer and then you've got these virtual queues. How are you partitioning?

You said you've got 720 partitions, you said that was the max. Are you using the max and then what are you doing to separate that? Is there some sort of logical structure for choosing those partitions to help you down the line so you don't have some data going to one place and some going to another?

Lily Mara: Absolutely. So 720 is not the maximum number of partitions you can use in Kafka. I'm not sure what the maximum is. I would assume it's something obscenely large that no one actually wants to use in production, but yeah, it is definitely not the max.

We picked 720 because it has a lot of divisors. And so when you start up Kafka consumers, Kafka will assign each consumer a number of partitions to consume from.

And basically we wanted a number of partitions that would very easily divide into an even number of partitions per consumer. And also the second half of your question, an even number of partitions should be assigned to each database.
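To make the "lots of divisors" point concrete, here is a small sketch that enumerates the consumer counts that split 720 partitions evenly. The `divisors` helper and the sample consumer counts are illustrative choices, not anything from OneSignal's tooling.

```rust
/// All positive divisors of `n`, in ascending order.
fn divisors(n: u32) -> Vec<u32> {
    (1..=n).filter(|d| n % d == 0).collect()
}

fn main() {
    let partitions: u32 = 720;
    let options = divisors(partitions);
    println!("{} consumer counts divide {} evenly", options.len(), partitions);

    // A few sample fleet sizes; each gets a whole number of partitions.
    for consumers in [8u32, 16, 24, 48] {
        println!("{} consumers -> {} partitions each", consumers, partitions / consumers);
    }
}
```

720 factors as 2^4 × 3^2 × 5, which gives it 30 divisors, so many different fleet sizes get an even share.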

So the partitioning strategy that we use is to divide messages onto particular partitions that align with Postgres databases. So the way our system works is internally we have a number of Postgres databases that are sharded and they're sharded based on the customer's app ID or dataset ID, right?

I think it's the first byte of their app ID determines which database shard they're going to live on. So occasionally when we are having performance issues or incidents or things like that, we will publish a status page update that says, "Customers who have app IDs beginning with D might be having some performance issues right now." That is because we're having some problems with one particular Postgres database.

So we use this to determine which Kafka partition things are going to be assigned to. So this means that one customer's app, they are always going to have all of their Kafka events go to the same Kafka partition.

And datasets which are going to the same Postgres server, they're going to be either on the same or slightly different Kafka partition, not necessarily guaranteed to be the same, but it could be the same.
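One way such a shard-aligned partitioner could look is sketched below. Almost everything here is an assumption made for illustration: the conversation only says that the first byte of the app ID picks the Postgres shard, that one app's events always land on one partition, and that partitions divide evenly among databases. The figure of 16 shards, the hex-digit reading of "first byte", the contiguous block layout, and the function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const PARTITIONS: u32 = 720;
const SHARDS: u32 = 16; // assumption: one Postgres shard per leading hex digit

/// Shard index taken from the first hex character of the app ID (assumed format).
/// Returns None for an empty ID or a non-hex first character.
fn shard_for(app_id: &str) -> Option<u32> {
    app_id.chars().next()?.to_digit(16)
}

/// Partition for an app: a stable slot inside its shard's block of partitions.
fn partition_for(app_id: &str) -> Option<u32> {
    let per_shard = PARTITIONS / SHARDS; // 45 partitions per shard
    let mut hasher = DefaultHasher::new();
    app_id.hash(&mut hasher);
    let slot = (hasher.finish() % per_shard as u64) as u32;
    Some(shard_for(app_id)? * per_shard + slot)
}

fn main() {
    let app_id = "d3adbeef-0000"; // made-up app ID
    println!(
        "app {} -> shard {:?}, partition {:?}",
        app_id,
        shard_for(app_id),
        partition_for(app_id)
    );
}
```

Under this layout two apps on the same shard may or may not share a partition, but neither can land on a partition that belongs to a different database.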

Thomas Betts: So that's part of the customer app ID. So that's in the request that's coming in. So from that HTTP request, "Here's the data I want to update and here's who I am." That's when you start having the key that says, "I know where to send this data to Kafka," and that's going to correspond to the shard in Postgres as well?

Lily Mara: That's correct.

Thomas Betts: Okay. So that's built into the design at the data level of how your data is structured?

Lily Mara: That's correct. At the end of the day, most of the problems that our company faces are Postgres scaling problems. So as we have gotten our stuff to be more and more and more reliable, a lot of that work has come down to making more of our systems aware of the fact that they have to queue things up based on which Postgres server particular kinds of traffic are going to.

Thomas Betts: I think that covers a lot of the partitioning, which is complicated and I think you've explained it pretty well, at least I can understand it now.

Changing metrics for sync vs. async [18:24]

Thomas Betts: When you shifted from synchronous to asynchronous workflows with this data processing, that changes how you measure it. I guess you had performance issues before.

How do you know that this improved performance, and how did the way you measure the state of the system have to change because it's now asynchronous? It's not the same... the server's not getting overloaded, the CPU is not spiking, but is it healthy?

Lily Mara: Definitely. Yeah. So this very dramatically changed how we thought about our metrics. It very dramatically increased the scope of monitoring that we had to do on our systems.

When you have a synchronous system that's just serving a bunch of HTTP traffic, you care about how long does each request take to serve, and how many successes, and how many errors do I have.

I don't want to say it's simple, because of course it is not simple, but you could be pretty effective by monitoring those three things and driving response time down and driving errors down.

But as soon as you move to an asynchronous system like this, if those are the only three metrics that you care about, you might not ever be processing any of your updates.

A successful HTTP response doesn't mean very much in a world where actually applying the data that's in that HTTP request might happen 12 hours later. And of course we don't want it to happen 12 hours later, but these are the nature of some of the metrics we have to track now.

The key one that we originally started looking at was message lag, the number of messages that are currently sitting in Kafka that have not yet been processed. So normally when the system is healthy, this is fairly close to zero and it might go as high as a couple of hundred thousand and it can drop down to zero again very, very quickly. That is kind of normal, nominal operation.

When things start to get to be in a bad state, we can get tens or I think in rare cases we may have even seen hundreds of millions of messages sitting in Kafka waiting to be processed. Right?

This is not a good thing, and this is an indication that your system is not healthy because you have a bunch of messages that you expected to be able to process that you somehow are lacking the capacity to process.
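Message lag is commonly derived from offsets: for each partition, the end of the log minus the position the consumer group has committed, summed over all partitions. The sketch below assumes those two numbers have already been fetched. The `PartitionOffsets` struct and `message_lag` function are hypothetical names, not part of any Kafka client library.

```rust
/// Offsets for one partition: where the log ends and where the
/// consumer group has committed (hypothetical struct for illustration).
struct PartitionOffsets {
    log_end: u64,
    committed: u64,
}

/// Total number of messages written to Kafka but not yet processed.
fn message_lag(partitions: &[PartitionOffsets]) -> u64 {
    partitions
        .iter()
        // saturating_sub guards against a momentarily stale log_end reading.
        .map(|p| p.log_end.saturating_sub(p.committed))
        .sum()
}

fn main() {
    let offsets = vec![
        PartitionOffsets { log_end: 1_000, committed: 900 },
        PartitionOffsets { log_end: 50, committed: 50 },
    ];
    println!("message lag: {}", message_lag(&offsets));
}
```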

There's a number of different reasons why this could be happening. Maybe your Postgres server is down, maybe there's... in one case for us, maybe you have a customer who sent a ton of updates that were destined for a single row in Postgres, and this kind of destroys your concurrency logic because all of the messages are being queued for a single row in Postgres, which means all of your worker threads are sitting there doing nothing.

But the nature of why you have a bunch of messages in lag is kind of broad. We're talking about what are the metrics, that you have the number of messages sitting in Kafka.

And then one that we've started looking at more recently as a helpful metric is we're calling time lag because every message that goes into Kafka has a timestamp that's on it that says this is the time that this message was added to the Kafka queue. So we've recently added a metric that reports the timestamp of the most recently processed message on the consumer side.

So we're able to determine, "Right now it's 2:30 PM. Well, the last message that we processed was actually put onto Kafka at noon." So why are there two and a half hours between when this message was produced and when we're consuming it? That means we are probably pretty far off from keeping up with reality and we need to update things.
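The time lag calculation itself is a subtraction: wall-clock time minus the produce timestamp of the most recently processed message. A minimal sketch follows, with a hypothetical `time_lag` function and timestamps expressed as offsets from the Unix epoch purely to keep the example self-contained.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Time lag: how far the produce timestamp of the most recently
/// processed message trails behind the current wall-clock time.
fn time_lag(now: SystemTime, last_processed_produced_at: SystemTime) -> Duration {
    // Clock skew could put the message "in the future"; report zero then.
    now.duration_since(last_processed_produced_at)
        .unwrap_or(Duration::ZERO)
}

fn main() {
    // Noon and 2:30 PM on an arbitrary day, as in the example above.
    let noon = UNIX_EPOCH + Duration::from_secs(12 * 3600);
    let half_past_two = UNIX_EPOCH + Duration::from_secs(14 * 3600 + 30 * 60);
    let lag = time_lag(half_past_two, noon);
    println!("time lag: {} minutes", lag.as_secs() / 60);
}
```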

This metric can be a little bit confusing for folks though, because in a normal state when things are nominal, that timestamp is always ticking up at a fairly steady state. It's always basically keeping up with reality, right?

It's 2:30 now, and the most recently processed message was also at 2:30. That makes sense. And if you have an elevated production rate, you have more HTTP requests than can be handled by your Kafka consumer coming in for a sustained period of time, you're maybe going to see a slightly lower slope to that line. You're not going to be keeping up with reality and at 2:30 you're still going to be processing messages from 2:00 PM or something like that.

But something kind of confusing can happen if you have a pattern like... Let's say your normal state of message production is like 2000 messages per second are coming across the wire. If you have say a one-minute period where that jumps up to 20,000 messages per second, say you have a big burst of traffic that is abnormal, that happens over a short period of time, and enqueues a bunch of messages into Kafka, your keeping up with real-time metric might actually plateau because say from two o'clock to 2:10, you have somebody sending you 20,000 messages a second.

So over this 10 minute period, you have maybe 12 million messages come across the wire and they're all sitting there in Kafka waiting to be processed. You can't process 20,000 messages a second. Maybe you can process 4,000. You can process slightly above your steady state of incoming messages, but you are going to fall behind reality. And so your time latency is going to kind of plateau. It's going to go, this maybe starts at 2:00 PM and by 2:30 you have maybe processed up to like 2:03.

And this can be kind of confusing for people because if they're just looking at a message lag graph, if they're looking at the thing that shows me number of messages sitting in Kafka, it looks like there was a big growth in the number of messages sitting in Kafka.

But now that line is starting to slope down really dramatically. Now it's starting to return back to normal really dramatically because we are now processing those messages and it's starting to go back to normal.

But your time latency is going to keep getting worse and worse and worse and worse until you manage to catch up with this because your real-time, the wall time, that's continuing to tick forward at one second per second. But this time latency, this thing that's measuring how far the wall time is from the time of the most recently processed Kafka message, that's continuing to fall further and further behind reality.

So that can be a little bit confusing for folks to wrap their head around, but we found it to be a little bit more useful for actually determining how real-time is the system.

Number of messages sitting in Kafka is great as a first baseline health metric, but it's not great for answering the question, how far behind reality are we? Maybe there's 20 million messages in lag, but maybe they've only been there for two minutes. So is that really a problem? I don't know.
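The divergence between the two metrics can be checked with a small model of the burst described above. It assumes the round numbers from the conversation (20,000 messages per second from 2:00 to 2:10, 2,000 per second afterwards, a consumer that handles 4,000 per second) plus two simplifications that are not from the source: no backlog at 2:00 and perfectly constant rates. All names are illustrative.

```rust
const BURST_RATE: u64 = 20_000; // messages/s during the burst
const STEADY_RATE: u64 = 2_000; // messages/s after the burst
const BURST_SECS: u64 = 600; // burst lasts from 2:00 to 2:10
const CAPACITY: u64 = 4_000; // messages/s the consumer can process

/// Messages produced in the first `t` seconds after 2:00.
fn produced(t: u64) -> u64 {
    if t <= BURST_SECS {
        BURST_RATE * t
    } else {
        BURST_RATE * BURST_SECS + STEADY_RATE * (t - BURST_SECS)
    }
}

/// Produce time, in seconds after 2:00, of the `n`-th message.
fn produced_at(n: u64) -> u64 {
    let burst_total = BURST_RATE * BURST_SECS;
    if n <= burst_total {
        n / BURST_RATE
    } else {
        BURST_SECS + (n - burst_total) / STEADY_RATE
    }
}

/// (message lag, time lag in seconds) at `t` seconds after 2:00.
fn lags(t: u64) -> (u64, u64) {
    // The consumer runs flat out until it has caught up with the producer.
    let consumed = produced(t).min(CAPACITY * t);
    (produced(t) - consumed, t - produced_at(consumed))
}

fn main() {
    for t in [600u64, 1800, 3000, 5400] {
        let (messages, seconds) = lags(t);
        println!(
            "t+{:>2} min: message lag {:>9}, time lag {:>2} min",
            t / 60,
            messages,
            seconds / 60
        );
    }
}
```

In this model the message lag peaks at 9.6 million at 2:10 and falls from then on, while the time lag keeps growing: 8 minutes at 2:10, 24 minutes at 2:30, and 40 minutes at 2:50, when the last burst message is finally processed. Both reach zero at 3:30. With these exact rates the consumer has reached about 2:06 by 2:30, so the ballpark figures quoted in conversation differ a little, but the shape of the two curves is the point.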

Thomas Betts: Yeah, I like the default response is the queue length for a queue should be close to zero. That's standard, how do you know any queue is healthy? And Kafka is just much bigger than typical queues, but it's the same idea of how many messages are waiting.

And the time lag seems like, okay, that was giving you one metric and the first thing you do is, "Okay, there's a problem, we can see there's a problem, but we don't understand why there's a problem." And then the time lag is also the backend.

So for both of those, because you're talking about the producer side and the consumer side building up, but for both of those, you wanted to start digging in and starting to do analysis. What did you get to next? Did you have to add more metrics? Did you have to add more logging dashboards? What was the next step in the troubleshooting process?

Lily Mara: Yeah, definitely. So we have found distributed tracing really valuable for us. At OneSignal we use Honeycomb and we use OpenTelemetry absolutely everywhere in all of our services.

And that has been really invaluable for tracking down both performance issues and bugs and behavior problems, all kinds of stuff across the company, across the tech stack. We produce OpenTelemetry data from our Ruby code, from our Go code, from our Rust code. And that has been really, really valuable to us.

OpenTelemetry and Honeycomb [26:29]

Thomas Betts: And so that OpenTelemetry data is in that message that's being put into Kafka. So it comes in from the request and then it always sticks around. So when you pull it out, whether it's a minute later or two and a half hours later, you still have the same trace.

So you go look in Honeycomb and you can see, "This trace ran for two and a half hours," and then you can start looking at all the analysis of what code was running at what time, and maybe you can start seeing, "Well, this little bit of it sat and waited for an hour and then it took a sub-second to process." Right?

Lily Mara: We don't actually tie the produce to the consume. We have found that to be rather problematic. We did try it at one point when we were kind of very optimistic, but we found that that runs into some issues for a number of reasons.

One of those issues is that we use Refinery for sampling all of our trace data. Refinery is a tool that's made by Honeycomb, and more or less it stores data from every span inside of a trace and uses that to determine a kind of relevancy score for every trace and uses that to determine whether or not a trace should be sent to Honeycomb.

And that's because each trace that you send up to the storage server costs you money to store. So we have to make a determination if we want to store everything or not. And of course, we cannot afford to store absolutely everything.

So if we were to match up the produce trace with the consume trace, we would have to persist that data in Refinery for potentially several hours. And that is problematic. That system was designed for stuff that is much more short-lived than that, and it doesn't really work with that super well.

One thing that we have found that works a bit better is span links. So I am not sure if this is part of the OpenTelemetry spec or if it's something that's specific to Honeycomb, but more or less you can link from one span to another span, even if those two things are not a part of the same trace.

So we can link from the consume span to the root of the produced trace. And this is helpful for mapping those two things up, but it's generally not even super valuable to link those two things up.

You might think that it is helpful to have the HTTP trace linked up with the Kafka consumed trace, but in general, we have found that the Kafka stuff has its own set of problems and metrics and concerns that really are kind of orthogonal to the HTTP handling side.

You might think, "Oh, it's useful to know that this thing was sitting in the Kafka queue for two and a half hours," or whatever, but we can just put a field on the consume span that says, "This is the difference between the produced timestamp and the time it was consumed."

So there's not really a whole lot of added value that you would be getting by linking up these two things.
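As an illustration of the time-lag field described above, here is a minimal Python sketch. It is not OneSignal's code (their consumer is written in Rust), and the names `ConsumeSpan`, `time_lag_ms`, and `start_consume_span` are hypothetical. It assumes the produce timestamp is available in milliseconds since the Unix epoch, which is how Kafka records carry their timestamps.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConsumeSpan:
    """Hypothetical stand-in for a tracing span; a real tracer supplies its own type."""
    name: str
    fields: dict = field(default_factory=dict)


def time_lag_ms(produce_timestamp_ms: int, consumed_at_ms: int) -> int:
    """Milliseconds a message waited in Kafka before it was consumed.

    Clamped at zero because producer and consumer clocks can disagree slightly.
    """
    return max(0, consumed_at_ms - produce_timestamp_ms)


def start_consume_span(produce_timestamp_ms: int,
                       consumed_at_ms: Optional[int] = None) -> ConsumeSpan:
    """Create the consume span and record the time lag as a plain field on it."""
    if consumed_at_ms is None:
        consumed_at_ms = int(time.time() * 1000)
    span = ConsumeSpan(name="kafka.consume")
    span.fields["time_lag_ms"] = time_lag_ms(produce_timestamp_ms, consumed_at_ms)
    return span
```

With a field like this on the consume span, "how long did this message sit in Kafka?" can be answered from the consume trace alone, without keeping the produce trace around for hours.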

Measure the consumer and producer sides independently [29:20]

Thomas Betts: Well, it's helpful. I mean, I think you've just demonstrated that someone like me comes in with a naive approach, like, "Oh, clearly this is the structure of the message I want to have, from when the person put in the request to when I handled it."

And you're actually saying you separate them: issues on the consuming side are different from issues on the producing side, and basically in the middle is Kafka.

And so are we having problems getting into Kafka and is it slow or are we having problems getting stuff out of Kafka into Postgres and that's our problem? And you tackle those as two separate things. It's not the consumer thing.

Lily Mara: Yeah, absolutely. And if we want to do things like analyze our HTTP traffic coming in to determine what shape that has, how many requests each customer is sending us in each time span, we actually can just look at the data that are in Kafka as opposed to looking to a third-party monitoring tool for that. Because Kafka is already an event store that basically stores every single HTTP request that's coming across our wire.

So we've developed some tools internally that allow us to inspect that stream and filter it based on... Basically we add new criteria to that every time there's an investigation that requires something new. But we have used that stream very extensively for debugging, or maybe not debugging, but investigating the patterns of HTTP traffic that customers send us.

So yeah, combination of looking at the Kafka stream directly to determine what data are coming in, as well as looking at monitoring tools like logs, like metrics, like Honeycomb for determining why is the consumer behaving the way it is.

They sometimes have the same answer: the consumer is behaving the way it is because we received strange traffic patterns. But it is also often the case that the consumer is behaving the way it is because of a bug in the consumer, or because of a network issue, or because of problems with Postgres that have nothing to do with the shape of traffic that came into the tool.
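The per-customer traffic analysis described above can be sketched in a few lines of Python. This is only an illustration, not OneSignal's internal tooling: the `Record` shape, its field names (`app_id`, `timestamp_ms`, `path`), and the example paths are assumptions, and the records are taken from any iterable rather than from a real Kafka client.

```python
from collections import Counter
from typing import Callable, Iterable, NamedTuple


class Record(NamedTuple):
    """Hypothetical shape of one HTTP request event read back from Kafka."""
    app_id: str
    timestamp_ms: int
    path: str


def requests_per_window(
    records: Iterable[Record],
    window_ms: int = 60_000,
    keep: Callable[[Record], bool] = lambda r: True,
) -> Counter:
    """Count requests per (app_id, window start), skipping records `keep` rejects."""
    counts: Counter = Counter()
    for record in records:
        if keep(record):
            # Round the timestamp down to the start of its window.
            window_start = record.timestamp_ms - record.timestamp_ms % window_ms
            counts[(record.app_id, window_start)] += 1
    return counts
```

The `keep` predicate stands in for the investigation-specific criteria mentioned above: each new investigation can pass a different filter without changing the counting logic.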

Current state and future plans [31:13]

Thomas Betts: So what's the current state of the system? Have you solved all of your original needs to make this asynchronous? Is it working smoothly or is it still a work in progress and you're making little tweaks to it?

Lily Mara: I don't know if things could ever really be described as done and we can walk away from them. Our systems are very stable. I've especially been pleased with the stability of our Rust applications in production.

Something that we have noted in the past as almost an annoyance point with our Rust applications is they're so stable that we will deploy them to production and then the next time we have to make a change to them, we'll realize that they are very behind on library version updates because they have just kind of never needed anything to be done to them. They have really, really great uptime.

So the systems are very reliable, but as our customers' traffic patterns change, we do continue to have to change our systems to evolve with those. We are exploring new strategies for making our things concurrent in different ways.

As we add features that require us to look at our data in different ways, we are also changing concurrency strategies as well as data partitioning strategies, all this type of stuff.

Thomas Betts: So all the typical, "Make it more parallel. Split it up more. How do we do that, but how do we have to think about all those things?"

A lot of interesting stuff. And if you have lots of changes in the future, hope to hear from you again. What else are you currently working on? What's in your inbox?

Lily Mara: I am currently pitching a course to O'Reilly about teaching experienced developers how to move over to Rust as their primary development language. I think there is a lot of interest in Rust right now across the industry.

You have organizations like Microsoft saying they're going to rewrite core parts of Windows in Rust because of all of the security issues that they have with C and C++ as their core languages.

I believe last year there was a NIST recommendation. It was either a NIST or a DOD recommendation that internal development be done in memory-safe languages or something like that.

And I believe there are also a number of large players in the tech industry who are looking into Rust. And I think there's a lot of people who are nervous because Rust is talked about as this big scary thing that is hard to learn.

It's got ownership in there. What is that? And so the goal of this course is to help those people who already have experience in other languages realize that Rust is relatively straightforward, and you can definitely pick up those concepts and apply them to professional development in Rust.

Thomas Betts: Sounds great. Well, Lily Mara, once again, thank you for joining me today on the InfoQ Podcast.

Lily Mara: Thank you so much, Thomas. It was great to talk to you.

Thomas Betts: And we hope you'll join us again for another episode in the future.

Lily Mara: Absolutely.
