Today on the InfoQ Podcast, Wes Reisz talks with two of the engineers currently working on removing the dependency on ZooKeeper in Kafka. ZooKeeper is used to maintain the metadata store required to operate Kafka. While ZooKeeper's removal from Kafka will reduce operational complexity and improve some of the scalability aspects of the platform, it is a huge undertaking that represents major changes to the overall architecture. Jason Gustafson and Colin McCabe are two engineers working on the change. On the podcast, the three discuss why the team made this decision, what the ramifications are, and explore what both the near-term and future state will be for upgrading and operating Kafka.
Key Takeaways
- Removing the dependency on ZooKeeper improves Kafka's scalability and simplifies its architecture; however, it is a major change that involves many moving parts.
- Not including the Alpha (which won’t have an upgrade process), the upgrade path for a post-KIP-500 Kafka release will go through a bridge release. A bridge release is an intermediary release required to encapsulate ZooKeeper properly before the actual upgrade. The bridge release enables the new quorum to be a drop-in replacement for ZooKeeper.
- Several structural changes will be made in how metadata is stored during the creation of the post-KIP-500 quorum. These include the use of deltas and a more compact binary RPC format.
- A KIP-500 alpha release is expected by the end of the year, with innovator/early-adopter-stage releases following next year. Future releases will support operating in both KIP-500 and non-KIP-500 mode, and support for both modes of operation will continue for quite some time.
Transcript
00:20 Introduction
00:20 Wesley Reisz: If you've ever worked with Kafka, you know that ZooKeeper's an essential part of the architecture. ZooKeeper, a centralized service for maintaining config in a distributed system, is used to store a Kafka cluster's metadata. In 2019, a KIP, a Kafka Improvement Proposal, was introduced to remove ZooKeeper and use an internal RAFT-based service to manage metadata. In other words, use Kafka to maintain Kafka's metadata. Today on The InfoQ Podcast, we're talking about KIP-500 with two of the engineers leading the effort, breaking down where it stands today and what work is being done.
00:52 Wesley Reisz: Hello, and welcome to the InfoQ podcast. I'm Wes Reisz, chair of QCon San Francisco and co-host of the podcast. Today on the podcast, we're talking to Colin McCabe and Jason Gustafson. Colin and Jason are software engineers at Confluent and committers to Kafka. Both are directly involved with leading KIP-500. Colin, Jason, welcome to the podcast.
01:11 Jason Gustafson: Thanks for having me, Wes.
01:12 Colin McCabe: Thanks Wes, it's great to be here.
01:13 Given so many large scale deployments of Kafka today, why is there a need to remove ZooKeeper from Kafka?
01:13 Wesley Reisz: I want to jump right into the problem. Just taking a few examples, one company in particular, I was just looking at their engineering blog from maybe a year ago. As an example, they have a hundred clusters, 4000 brokers, a hundred thousand topics, and over seven trillion messages a day; most companies aren't even remotely close to that kind of scale. Given that kind of scale, why the need to remove ZooKeeper from Kafka?
01:39 Colin McCabe: I think there's a few different things that this work addresses. One of them is scalability, and we can talk a little bit more about that later, but of course there's other dimensions to this too, right? One of them is we'd like to simplify the architecture to make it easier to understand, easier to maintain, easier to extend. The other is we'd like to simplify deployment. Right now users have to deploy two distributed systems in order to use Kafka, they have to deploy ZooKeeper and then they have to deploy the brokers themselves. So I think the deployment advantage is maybe the easiest one to understand right out of the box, but the other advantages are actually very significant too. Going a little bit more into the simplicity aspect. The problem we have now is that our interaction with ZooKeeper is relatively complex.
02:24 Colin McCabe: In a sense, we're maintaining the same state in multiple places. Kafka has this node called the controller node, and the controller node is responsible for managing Kafka replication as well as many other important duties in the cluster, but because the store of record for Kafka is currently ZooKeeper, that information needs to be maintained not only on the controller, but also in ZooKeeper. Indeed, if the controller has a piece of information and ZooKeeper doesn't, we can't really consider that to be durable yet. Sorry, this may seem a little abstract, but it can become very concrete, right? If the state on the controller diverges from the state in ZooKeeper, then it can become very confusing to analyze those interactions. By combining those two systems into one metadata system, which is inside Kafka itself, we can significantly simplify the architecture.
03:12 Wesley Reisz: So simplifying the architecture, reduce operational complexity and you also mentioned a little bit about scalability?
03:19 Colin McCabe: Yeah. So part of the issue we have is that we sometimes have to load a lot of data from ZooKeeper. One example of this is when the controller fails over, and by fail over I mean the old controller goes away and a new controller is created. Or, equivalently, when the cluster is starting up. In both of those scenarios, the new controller needs to load all of that data from ZooKeeper. The length of time that takes is actually proportional to the amount of metadata that we have. That isn't really scalable as we start doubling, quadrupling the amount of metadata we have. Doubling and quadrupling those load times really starts to hurt.
03:54 Jason Gustafson: Yeah. Just to be clear, we think ZooKeeper is actually quite a great system. I think it gets a really bad rep in Kafka because Kafka is using it in a lot of ways which are not ideal. We have to be kind of tricky about how we manage watches because we didn't have proper RPCs in place to talk between brokers, so we had to be really careful about how we do watches. I think it gets a bad reputation in Kafka specifically because of some of these poor usage patterns, but nevertheless, we still think there's a big opportunity to be had by giving Kafka access to the log itself and the ordering of events and all of that. It's also the case that today, because of some of these poor usage patterns, people will kind of blame ZooKeeper and say it's responsible for a problem in Kafka when, in fact, it wasn't really. Really, the problems that we've had with ZooKeeper over the years have been quite minimal. I think it's really been quite a resilient system for us.
04:51 How does Kafka use ZooKeeper for metadata?
04:51 Wesley Reisz: Nine years of operating with it definitely can't be a bad thing. I guess we jumped right into the problem. Let's back up a second and level set. When we talk about metadata and how Kafka operates, let's talk about ZooKeeper and how metadata is stored in ZooKeeper.
05:08 Jason Gustafson: Yeah. So as Colin was saying, ZooKeeper is the source of truth for all of the metadata. In the Kafka architecture, we really have two levels of synchronization that need to happen to get everything into a consistent state. First, there's the controller that needs to load up the current state from ZooKeeper. Then it needs to propagate that state out to the cluster. So what kind of state are we talking about? We're talking about topic configurations, we're talking about security ACLs, we're talking about partition metadata, such as, for each topic partition, who's the leader. So we have basically these two levels where things can go wrong. Furthermore, we don't have any kind of guaranteed ordering on the way that the updates are seen by individual pieces of the cluster.
05:48 Jason Gustafson: So different brokers may see things in different orders, and they may actually reach different states at different times. Then part of the problem is just that the controller is taking the state from ZooKeeper and sending it out to all the brokers. If there's any kind of bug there, then you can get yourself into a permanently inconsistent state. What we want to do here is bring that state into a log structure, which solves the ordering problem: everyone sees the same events, and in the same order. It also gives us an easier way to tell which brokers have caught up to the log, which ones have not, and which ones have all the state that is necessary. It gives us all of these things. Basically, it's a way of bringing the cluster into an easier-to-understand, consistent state.
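To make that concrete, here is a minimal sketch; the class and method names are hypothetical, not Kafka's actual internals. It shows the consistency property an ordered metadata log gives you: any two brokers that apply the same records in the same order end up with identical state.

```java
// Illustrative sketch only, not Kafka's real classes: two brokers that apply
// the same metadata records in the same order converge on identical state,
// which is the consistency property the log structure provides.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrderedMetadataLogDemo {

    // Hypothetical in-memory view of partition leadership, built from the log.
    static Map<String, Integer> applyAll(List<String[]> metadataLog) {
        Map<String, Integer> leaders = new HashMap<>();
        for (String[] record : metadataLog) {
            // record[0] = topic-partition, record[1] = new leader id
            leaders.put(record[0], Integer.parseInt(record[1]));
        }
        return leaders;
    }

    public static void main(String[] args) {
        // The same ordered log of leader-change records, as read by two brokers.
        List<String[]> log = List.of(
                new String[]{"orders-0", "1"},
                new String[]{"orders-0", "3"},   // the later record supersedes the earlier one
                new String[]{"payments-0", "2"});

        Map<String, Integer> brokerA = applyAll(log);
        Map<String, Integer> brokerB = applyAll(log);

        // Same events, same order, same resulting state on both brokers.
        System.out.println(brokerA.equals(brokerB));   // prints: true
    }
}
```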
06:31 What will Kafka look like in a post KIP-500 world?
06:31 Wesley Reisz: So, fast forward just a bit now. In a post-KIP-500 world, are we talking about just one install, with everything together? What's it going to look like post-KIP-500?
06:42 Colin McCabe: I think the clearest part is that there won't be ZooKeeper. So we know that. To be a little more concrete about what there will be: there's going to be a metadata quorum, which is inside Kafka itself, and that's going to be managed using the RAFT replication algorithm. That metadata quorum will perform many of the same functions that ZooKeeper performs today, such as storing metadata and storing it in a distributed way, making it durable and so on. In contrast to the way ZooKeeper works, the leader of that quorum will actually have a special role in the system. The leader of the quorum will be the controller, and because of this, we know that the controller always has access to the most up-to-date information in the metadata quorum: because the controller is the leader of the metadata quorum, by definition it always has that up-to-date information. Now, aside from that, there are a few other changes to the system as well. So currently, as Jason just said, the controller will push out these updates to the brokers.
07:38 Colin McCabe: When a partition has certain changes, the controller is responsible for pushing out those updates. In the new system, though, the brokers will pull those updates from the quorum. We're sort of reversing the direction of the control flow, and the reverse direction has several advantages. One of them is that when you're pushing updates, you never quite know when to stop your retries, right? When is a node totally gone? It's kind of a tricky question. On the other hand, when a node is pulling information, as long as the node is around, it knows that it needs to pull this, it needs to get this new information. Moving to a pull model actually simplifies how we do metadata propagation. More than that, though, because we know the metadata will be propagated in an ordered fashion, we can treat metadata as a series of events. We know that that's a very powerful paradigm to use. Ironically, many of the users of Kafka know this, and they build their systems in this way, to be event driven. We're sort of applying the principles of Kafka to Kafka's own metadata. In a way.
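Here is a minimal sketch of the pull model Colin describes, using made-up class names rather than Kafka's real internals: each broker remembers the last metadata offset it has applied and asks the quorum for anything newer, so the controller never has to decide when to give up on pushing to a node that may be gone.

```java
// Minimal sketch of the pull model; class names are made up, not Kafka's real
// internals. The broker tracks the last metadata offset it has applied and
// fetches anything newer, so the controller never has to decide when to stop
// retrying a push to a node that may or may not still exist.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class MetadataQuorum {
    private final List<String> records = new CopyOnWriteArrayList<>();

    void append(String record) { records.add(record); }

    // Return every record at or after the requested offset.
    List<String> fetchFrom(int offset) {
        return records.subList(offset, records.size());
    }
}

class Broker {
    private int nextOffset = 0;

    // The broker drives propagation: it pulls whatever it is missing.
    void poll(MetadataQuorum quorum) {
        for (String record : quorum.fetchFrom(nextOffset)) {
            System.out.println("applying: " + record);
            nextOffset++;
        }
    }
}

public class PullPropagationDemo {
    public static void main(String[] args) {
        MetadataQuorum quorum = new MetadataQuorum();
        quorum.append("create topic payments");
        quorum.append("leader change payments-0 -> broker 2");

        Broker broker = new Broker();
        broker.poll(quorum);   // catches up on both records
        broker.poll(quorum);   // nothing new, nothing to do
    }
}
```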
08:37 RAFT was selected for the metadata quorum. Why RAFT?
08:37 Wesley Reisz: You mentioned RAFT, was RAFT the only option that you considered when you were planning this? Or were there other considerations?
08:44 Jason Gustafson: That's a good question. Yeah. I think RAFT was the main one. Let me just take a little bit of a step back and describe how consensus systems work in general: they work on a replicated log. ZooKeeper itself is built on this with the ZAB protocol, which is very similar to RAFT. Internally, what Kafka has is also a replicated log, and the approach that we used is, in fact, very similar to RAFT. So for a given partition, there is a set of replicas. For each partition, we elect a leader, and the leader basically will replicate to the followers, which is the same thing as in RAFT. The only real difference is that in Kafka the leader is elected by the controller.
09:24 Jason Gustafson: Whereas in RAFT, you have a protocol for election. It was a natural transition here to take the existing log architecture that we've already built in Kafka and then add an election protocol on top of it so that we could elect the leaders. Other than that, we're able to reuse basically all of the log layer, which is a significant piece here. Additionally, it opens the door toward using RAFT for individual partitions. We can talk about that a little bit more later, but for all of these reasons, basically because we were already 80% of the way to RAFT, it made sense to go ahead and do leader election and get to the full 100% of RAFT.
10:02 Wesley Reisz: Today you might install a cluster across multiple availability zones, for example, but not across regions. Will it change that footprint on how clusters are installed?
10:12 Jason Gustafson: Yeah, I would say initially, probably not. Initially there still is a centralized quorum which is responsible for all of the metadata updates. Now, one thing to keep in mind about Kafka is that that centralized quorum is only involved when there are leader changes. When we're talking about the replication protocol itself, it's not going through that metadata quorum. It's possible, by tuning replication and whatnot, to support cross-region replication. Then it's only when you're doing rolling restarts, or there are failures, that we need to involve the controller.
10:43 Wesley Reisz: Absolutely. That's a good point.
10:44 Jason Gustafson: Additionally, though, I think this work does open the door. Once you kind of bring the consensus protocol in-house into Kafka, then you have some opportunities here. For example, I mentioned that there's partition replication with RAFT, and then in that case, we wouldn't even need the controller quorum in order to elect leaders, we can do rolling restarts basically, without relying on that metadata quorum. Additionally, I think it makes sense to consider sharding of that metadata quorum so that you can locate quorums based on regional quorums, if you will, so that the changes that you make locally are done on local quorums. All of this is more in the line of future work, but it all builds on taking this consensus into Kafka and really owning that piece.
11:24 What are some of the other important KIPs related to removing ZooKeeper from Kafka?
11:48 Wesley Reisz: I want to come back in just a minute to the future state, the roadmap for Kafka and how some of the things involved with KIP-500 are affecting it. Before we do that, there's some other KIPs involved with removing ZooKeeper from Kafka. Let's talk about some of those. KIP-500 is the main thing, but there's a bunch of supporting KIPs that are in play. Are there any others that, in particular here that you'd like to talk a bit about?
12:45 Jason Gustafson: There are a few that we can talk about. There's one thing I did want to mention: although we proposed the main part of this work, the ZooKeeper removal, just a year ago, this has actually been a longer-term movement for Kafka in general. If we rewind all the way back to Kafka 0.8, we had clients that had to contact ZooKeeper, all the tooling had to contact ZooKeeper. Basically, ZooKeeper was right in the path of basically everything that you could do. We've been on a transition here for several years, where first we built a new producer that never talks to ZooKeeper. We built a new consumer, it never talks to ZooKeeper. We had to build a lot of the functionality that we were using ZooKeeper for, like a group coordination protocol. Additionally, we didn't have an admin client up until a few years ago as well. So most of the tooling that we had involved direct writes to ZooKeeper.
13:28 Jason Gustafson: ZooKeeper doesn't make a great API mechanism for this; it was basically write to ZooKeeper and have the controller listening for notifications and such. How do you actually give an error back in that case? These kinds of problems. So what we did is we created an admin client that lets us communicate directly with the brokers. It also solved a lot of the security problems, because one of the operational issues with running two systems is that they each have their own security requirements and their own protocols. Integrating with all that is more work, but we take the clients and we say, okay, well, the clients are only going to talk to the brokers, and then we start on this road. Basically, we wouldn't have even been able to propose the ZooKeeper removal if we didn't already have the bulk of the work in the clients done.
14:22 Jason Gustafson: However, after getting the big part of it done, there's actually a significant amount of effort that goes into that last little 20%. There were one or two, maybe two or three, straggling tools that were still relying on ZooKeeper. I think we were using ZooKeeper for quotas; that was one of them. We were using it in some obscure cases to update SASL passwords, stuff like this. Getting that last delta was actually a pretty significant amount of work. I don't know all those KIPs off the top of my head, but there are probably four or five of those KIPs out there, to change the tooling, basically, to leverage admin APIs and to support redirection, because one of the problems is the tools assumed, basically, that you can always talk to the controller, whereas in the new architecture, that's not necessarily the case. We have to be able to support redirection so that the request can always find the right home.
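As a hedged illustration of the direction Jason describes, here is what tooling looks like when it goes through the Kafka AdminClient rather than writing to ZooKeeper directly; the bootstrap address and topic name below are placeholders.

```java
// A hedged example of the direction described above: a tool talks to the
// brokers through the public AdminClient API instead of writing znodes into
// ZooKeeper. The bootstrap address and topic name are placeholders.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // The broker validates the request and returns a real error code,
            // which the old "write to ZooKeeper and wait for the controller to
            // notice" pattern could not do cleanly.
            NewTopic topic = new NewTopic("example-topic", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```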
14:22 Where are you on the implementation timeline of KIP-500?
14:22 Wesley Reisz: Where are you on the implementation? You have the main KIP, you've got all the tooling things that you talked a bit about, and there are other ones, like the snapshots with KIP-630; there are a few that are out there. So where are you on the delivery for removing ZooKeeper from Kafka?
14:37 Colin McCabe: There's a few different phases of this. Obviously there's the tools phase where we want to isolate the tools from ZooKeeper. We're nearly done with that one.
14:46 Wesley Reisz: Curious, what's nearly done mean?
14:47 Colin McCabe: Nearly done means that we have APIs for everything, but there are a few cases where we're going to need to have a slightly different approach. The main case where we need a slightly different approach is when we want to configure something and Kafka hasn't even been started. There are one or two cases where we were doing that. We're going to need a slightly different way of doing that in a world where there isn't ZooKeeper.
15:08 Jason Gustafson: Yeah. So just to add a little bit to that, I think the nice thing about having ZooKeeper is that you can cheat on some of the bootstrapping problems, because you can preload everything that you need. With KIP-500, you lose that. So you've got this chicken-and-egg problem popping up all over the place that we have to think our way through.
15:25 Colin McCabe: Yeah. I mean, I think that we will have some mechanism for pre-loading certain things. We're basically going to have a way of getting off the ground for the cluster. I actually think the way of getting off the ground will probably be cleaner without ZooKeeper, because it'll be a lot more like a traditional mkfs than what we have now, which is sort of a little bit ad hoc, but that is work that needs to be done. Coming to the larger question of where we are: we are about to merge the RAFT work. So that KIP was accepted recently, KIP-595. That's about to be merged into trunk, hopefully in a month or less. In parallel with that, we've been working on the changes to the controller and the changes to the broker that we need to make KIP-500 work.
16:08 Colin McCabe: We're not currently in a state where you can run this, but we hope that we will be basically in a few months. Our goal is to have an alpha by the end of the year that anyone can run. Just basically pick up Kafka and run it in KIP-500 mode. We expect that when you're in the alpha mode there's going to be some rough patches, some things that don't work perfectly, but basically it'll be running. From there, of course, there's a lot of other stuff that we do want to do. We want to be able to support upgrades. We want to be able to support all kinds of compatibility stuff. That's all very important, but this year is basically all about getting the first version out. I think we're making significant progress there.
16:44 Jason Gustafson: I think there need to be realistic expectations, because we're involving basically a total rewrite of the whole metadata layer. So even though we're going to try our best with testing and whatnot, you're never going to get that right the first time. It's going to take some iterations here. That's why we're releasing this as an alpha, so people can begin using it and hopefully the community can give us some good feedback so that we can iron it out.
17:07 What will the KIP-500 Alpha release look like?
17:07 Wesley Reisz: What does it look like to have an alpha release for Kafka? Is it an experimental mode where you can enable something? What's that experience look like?
17:15 Colin McCabe: Yeah. So it's going to be exactly what you said, an experimental mode that you can enable and you can run your cluster in this mode. Now, initially we're not going to support upgrading from non-KIP-500 clusters to KIP-500 clusters. We will, of course support that later on. That's very important. Initially, when you run in alpha, it's basically going to be a sort of isolated thing.
17:34 Wesley Reisz: Is there an implication of running an alpha cluster in KIP-500 mode alongside other clusters that aren't, when there's replication going on between them?
17:42 Colin McCabe: I don't think so. We're not planning on changing the APIs that would be involved here. It sort of depends on what tools you're using to do replication, but the ones that I'm familiar with, it wouldn't be an issue.
17:52 Jason Gustafson: Yeah, I think that's right. Yeah. So I think, I'm not sure if you're referring to tools like MirrorMaker, that lets you replicate across clusters and whatnot. I think one thing is that the KIP-500 broker, at a metadata level, of course it's a total rewrite, but at the client level, it's still using all of the same APIs. It's compatible with all of the clients going backwards, everything except the ZooKeeper-based ones.
18:14 Colin McCabe: Yeah, I mean, if we talk about MirrorMaker, MirrorMaker 2, Replicator, those will all be fine.
18:18 Can you discuss snapshotting and log compaction?
18:18 Wesley Reisz: As part of this work with KIP-500, you're doing some work around snapshotting, some things with log compaction. Can you talk a bit about what the differences are, how they work together and what the new things will be with Kafka?
18:31 Jason Gustafson: As a part of KIP-500, for the RAFT log itself, the metadata log, because it's a replicated, append-only log, you always need some way to deal with garbage over time. How do you get rid of it to keep the log from growing forever? So the approach that we've taken is a little bit more conventional from the perspective of the RAFT literature: we use a snapshot. Whereas Kafka typically has something called compacted topics. The distinction between the two is really that you can kind of think about compaction as an in-place snapshot. Basically, we rewrite the full log contents and then we do an atomic swap.
19:09 Jason Gustafson: With the snapshot, it's actually a separate file, which contains all the state of the system at some point. The reason why we wanted to do that is because, for the metadata quorum for the controller, all of the state is already in memory. So compaction involves us doing a pass through all of the log in order to kind of build up that state and then rewrite it. Whereas, snapshots, we can do that much more quickly because we already have the state in memory and already in a structure that's amenable to snapshots. This is just one example, but there's a lot of cases in KIP-500 where we are able to build some of these building blocks, which actually have helped us solve problems with other parts of the system. As an example of that, we have all of the consumer group metadata, consumer group offsets, and all of this stuff is managed by something similar to the metadata quorum.
19:54 Jason Gustafson: It's a replicated log approach, but basically it's a single topic, an internal topic, where we maintain all of the state. Then the leader of each individual partition becomes the coordinator for the group. The reason why this is relevant is that whenever there's a failover, just as with the controller, the new coordinator has to reload all of its state. Currently that involves actually scanning through the log. With snapshots, we're able to build it up directly from a snapshot, which can be done much more frequently because we also have all that state in memory. It gives us this building block, which lets us solve other problems in the system.
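Here is a minimal, illustrative sketch (not Kafka's real snapshot format) of the two recovery paths Jason contrasts: replaying every record in the log to rebuild state, versus reloading a snapshot that was written straight from the already-materialized in-memory state.

```java
// Illustrative sketch (not Kafka's real snapshot format) of the two recovery
// paths: replaying every record in the log versus loading a snapshot that was
// written directly from the already-materialized in-memory state.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SnapshotVsReplayDemo {

    // Recovery by replay: the cost grows with the length of the log.
    static Map<String, String> recoverByReplay(List<String[]> log) {
        Map<String, String> state = new HashMap<>();
        for (String[] keyValue : log) {
            state.put(keyValue[0], keyValue[1]);
        }
        return state;
    }

    // Recovery from a snapshot: no scan of the log is needed, because the
    // snapshot already contains the final value for every key.
    static Map<String, String> recoverFromSnapshot(Map<String, String> snapshot) {
        return new HashMap<>(snapshot);
    }

    public static void main(String[] args) {
        List<String[]> log = List.of(
                new String[]{"group-1/offset", "10"},
                new String[]{"group-1/offset", "20"},   // the earlier record is now garbage
                new String[]{"group-2/offset", "5"});

        Map<String, String> replayed = recoverByReplay(log);
        Map<String, String> snapshot = recoverFromSnapshot(replayed);
        System.out.println(replayed.equals(snapshot));  // same end state, cheaper reload
    }
}
```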
20:30 Wesley Reisz: Okay. If I hear you right, then compaction is really about cleaning up logs, getting rid of any garbage that might be in them, and snapshots are really about being able to quickly and frequently capture state so that we can restore it quickly from an in-memory data structure. All right, let's go back to the upgrade process. You talked a little bit before about how, initially, when you get the alpha out, there's not really going to be an upgrade process, but you are planning an upgrade process, an upgrade strategy. Can you talk a little bit about what it will mean, and what it'll look like, to go from a pre-KIP-500 release to a KIP-500 release and the future state of Kafka?
21:08 Why do you need bridge releases for upgrading to KIP-500?
21:08 Colin McCabe: In order to upgrade to a KIP-500 cluster, you're going to want to upgrade to one of the bridge releases that we'll have, and the bridge release is a release where you can do this upgrade and you can do that without taking any downtime, which is actually very important to us.
21:24 Wesley Reisz: Okay. Can you dive in just a little bit more? What does it mean to be a bridge release here?
21:29 Colin McCabe: The bridge release is a release that can be upgraded to the KIP-500 mode. That implies several things. One thing that it implies is that ZooKeeper is pretty well-isolated and there aren't tools that are going to talk directly to ZooKeeper or anything like that. Another thing is that the brokers have to be somewhat isolated from ZooKeeper as well. These are things that we're doing at the level of the code to make this possible. The bridge release may actually be a set of releases. It's pretty likely that we'll maintain compatibility for a while to allow people time to transition. What that looks like from the user's point of view is that if you're running a release that's prior to the bridge release and you want to be running a KIP-500 release, you have to go through a bridge release.
22:12 Colin McCabe: Your upgrade path leads through one of the bridge releases. I think this will become a bigger issue once we stop having ZooKeeper compatibility, which will be a while from now. Once we do that, then basically the bridge release will become a lot more relevant, because you'll have this stepping stone that you'll have to go through to upgrade from a really old release to a really new release. This is something that Kafka hasn't done in the past. We've always supported upgrading directly from the very oldest release to the very newest release. This is a slight change to that, where basically, in order to go from non-KIP-500 to KIP-500, you'll need to go through a bridge release.
22:45 Jason Gustafson: We're very realistic about the fact that people are not going to upgrade right away. Even at Confluent, in our own cloud service, we're not going to push this out to all the clouds as soon as it's available. It's going to take time for people to get used to it, to learn the operations and get used to all the new, additional metrics that come with it. Making all that happen can take several years. That's why Colin mentioned that once we get to the bridge release, we're going to continue support for the ZooKeeper version of Kafka for some time, to give people a longer time to upgrade.
23:14 Colin McCabe: Yeah, absolutely. It's definitely important to give people a path. They're not going to want to upgrade on day one. Although we anticipate there'll be some early adopters who really want to try it out, and a lot of those people are us, there are actually a lot of people who aren't us who want to run the latest, coolest stuff. I think that's going to be a factor too. So over time, just like any software, this will transition from being bleeding edge to being a lot more tested and solid.
23:38 Wesley Reisz: So really what I'm hearing is that when you're going to be upgrading to KIP-500, there's an intermediary release that you'll need to install first, one that makes some operational changes to that older cluster state so that you're ready for KIP-500.
23:56 Jason Gustafson: It might be useful to go through an example. Let's say that 4.0 is the bridge release and you've got 0.11. Now, the first thing that you have to do when you want to get to KIP-500 is upgrade to 4.0, and then you're able to take one of the later releases. Basically, in order to make the big metadata changes, we need to be able to isolate ZooKeeper, because it changes the picture when everything out there is still talking to ZooKeeper and we have to do an upgrade from this state where everyone is talking to ZooKeeper into this new state where nobody is talking to ZooKeeper. The bridge release basically allows us to narrow the number of edge cases and whatnot that we have to reason about.
24:32 Colin McCabe: Yeah, that's well said. Basically, we need to encapsulate ZooKeeper properly so that we can actually do this upgrade and have the new system, the new quorum be a drop-in replacement for ZooKeeper. That's the essence of what the bridge release is all about. Now, I should note that you can run the bridge release in KIP-500 mode, or you can run it in legacy mode. So you actually have the option of not using KIP-500 here.
24:54 Wesley Reisz: What does it mean to run it in legacy mode or KIP-500 mode? Is it just the same APIs, just using different metadata stores?
25:03 Colin McCabe: Essentially. Yes. I mean, from the perspective of a client, it'll look the same, right? From the perspective of what's going on under the hood, it's going to be very different because in the KIP-500 case, it will be using the new metadata quorum for its metadata. Whereas, in the legacy case, it'll be continuing to use ZooKeeper. Similar to how we do now. Of course, in the bridge release, we are going to try to encapsulate ZooKeeper somewhat. There'll be some changes, and that gets to the heart of all the KIPs that we've been talking about that are moving us along this path.
25:31 Wesley Reisz: So then once we're on the bridge release, we're able to do rolling upgrades and move to the next version of Kafka, just like we did in the past, is that correct?
25:39 Colin McCabe: In order to do a rolling upgrade, you're going to want to bring up the new controller quorum first, and then you'll be rolling your nodes, similar to how we do today. If you're familiar with Kafka, then you'll already know about the concept of a rolling upgrade, where basically you take down a node, you install new software and then you bring up the node again. This is a very important concept in Kafka, because when you're running a big production deployment, you don't want to take an hour of downtime, or more, in order to upgrade the software. That's definitely something our users don't want. Being able to support this sort of hitless upgrade to KIP-500 mode is very important to us. I should say, though, that the initial alpha version of KIP-500 will not include this upgrade facility, because the initial alpha version is sort of just intended to try it out and to find out where the gaps are, give us a chance to kick the tires. When it is production ready, though, in a later release, we'll certainly support this hitless upgrade to KIP-500.
26:34 Are there any structural data changes with KIP-500?
26:34 Wesley Reisz: Got it. Makes sense. Let's talk about some of the internals. What about the structural representation of data? Are there any changes between pre-KIP-500 and KIP-500 that we need to be aware of?
26:45 Colin McCabe: Yeah, definitely. I'd break it down into two things. One of them is that we're now using deltas rather than just updates. This is very significant. So just to give an example, if you're removing a broker, to express that in the form of a delta, you would say remove broker X. If you wanted to express that in the form of a key value paradigm, you would have to literally remove references to that broker from every key value pair that had it. In the case of ZooKeeper, this actually means you're sending a lot of updates over the wire. These key value updates all get translated into ZooKeeper ops. We're sort of replacing a large number of ops with a single delta in this case.
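To picture the difference, here is an illustrative sketch (made-up record names, not Kafka's actual metadata records) contrasting a single "remove broker" delta with the key-value style, where every assignment that mentions the broker has to be rewritten and each rewrite becomes its own update on the wire.

```java
// Illustrative only: contrasts a single "remove broker" delta with the
// key-value style, where every assignment that mentions the broker must be
// rewritten and each rewrite becomes its own update on the wire.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DeltaVsKeyValueDemo {
    public static void main(String[] args) {
        // Partition -> replica assignment, the kind of per-key state involved.
        Map<String, List<Integer>> assignments = Map.of(
                "orders-0", new ArrayList<>(List.of(1, 2, 3)),
                "orders-1", new ArrayList<>(List.of(2, 3, 4)),
                "payments-0", new ArrayList<>(List.of(1, 3, 4)));

        int removedBroker = 3;

        // Key-value style: touch every entry that references broker 3.
        long keyValueUpdates = assignments.values().stream()
                .filter(replicas -> replicas.contains(removedBroker))
                .count();
        assignments.values().forEach(replicas -> replicas.remove((Integer) removedBroker));
        System.out.println("key-value updates sent: " + keyValueUpdates);

        // Delta style: one record in the metadata log expresses the change,
        // and every reader applies it to its own copy of the state.
        String delta = "REMOVE_BROKER " + removedBroker;
        System.out.println("delta records sent: 1 (" + delta + ")");
    }
}
```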
27:24 Colin McCabe: Aside from the deltas paradigm, the other big change we're making is that we're going to use a more extensible and compact format. We're going to use the same format that we use for Kafka RPCs for the metadata. This format has some advantages. One of them is that, being binary, it just tends to be a little more compact than the JSON that we were using before. Another is that we've actually thought a lot about how we extend this format over time. We have things like versioning. We have things like optional fields, and those things don't always work perfectly when you're trying to do it in JSON; it's a lot more ad hoc. In fact, not all of the metadata even is JSON today; there are still some fields which are sort of free text. Systematizing that, I think, will be a big advantage of the new scheme.
28:08 Wesley Reisz: The format that you're using, what's the format, is it a Kafka format?
28:14 Colin McCabe: It's a Kafka format. It's basically a binary format that Kafka uses for its RPC. I call it Kafka RPC.
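As a rough illustration of why a versioned binary encoding is easier to evolve and more compact than ad hoc JSON, here is a simplified, hypothetical record layout; it is not the actual Kafka RPC wire format, just a sketch of the versioning-plus-optional-fields idea.

```java
// A simplified, hypothetical encoding, not the actual Kafka RPC wire format.
// It shows the versioning idea: a reader branches on the version number and
// simply stops before fields it predates, which is hard to do cleanly with
// ad hoc JSON.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class VersionedRecordDemo {

    // Version 2 adds an optional "rack" field; version 1 records omit it.
    static ByteBuffer writeBrokerRecord(short version, int brokerId, String rack) {
        byte[] rackBytes = rack == null ? new byte[0] : rack.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 + 2 + rackBytes.length);
        buf.putShort(version);
        buf.putInt(brokerId);
        if (version >= 2) {
            buf.putShort((short) rackBytes.length);
            buf.put(rackBytes);
        }
        buf.flip();
        return buf;
    }

    static void readBrokerRecord(ByteBuffer buf) {
        short version = buf.getShort();
        int brokerId = buf.getInt();
        String rack = "";
        if (version >= 2) {
            byte[] rackBytes = new byte[buf.getShort()];
            buf.get(rackBytes);
            rack = new String(rackBytes, StandardCharsets.UTF_8);
        }
        System.out.println("broker " + brokerId + ", version " + version + ", rack '" + rack + "'");
    }

    public static void main(String[] args) {
        readBrokerRecord(writeBrokerRecord((short) 1, 5, null));
        readBrokerRecord(writeBrokerRecord((short) 2, 5, "rack-a"));
    }
}
```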
28:20 What's the long-term roadmap look like after KIP-500?
28:20 Wesley Reisz: All right, so end of the year, you're looking for an alpha release getting KIP-500 out. What's the rest of the roadmap look like? What's long term?
28:28 Jason Gustafson: I think we mentioned a couple of possibilities before. Having a consensus layer, having a RAFT replication layer inside Kafka, basically opens up some doors which we didn't have before. Just to mention one of them: one of the problems the replication protocol in Kafka suffers from today is latency. If you've got a replication factor of three, then typically, in order to accept a write, you've got to hit all three replicas in the steady state. With RAFT, of course, it's majority based, so you can do two out of three. Latency gets better right away. Additionally, your resilience improves. You don't have this controller blast radius covering the whole cluster, where you lose the controller and you're dead in the water. Instead, you're able to elect leaders at a partition level and basically continue along with the replication. That's probably unlikely to happen soon, but it is something that we want to do. Near term, the reliance on this metadata quorum can be seen as a weakness in Kafka.
29:23 Jason Gustafson: So when it is unavailable, then there's nowhere that you can send a create-topic request, for example. We still have this problem. I think that having RAFT in Kafka gives us a path here where, instead of just having a single metadata quorum, for example, we can have multiple metadata quorums. It can be sharded. In that case, losing one of them doesn't necessarily mean you lose all of them, and we're able to improve availability that way. All of the work here is speculative, but these are kind of reasonable next steps based on this building block. There's a bunch of other stuff. I mentioned the work that is tied into KIP-500 and some of the core improvements there. We're pretty interested in scaling the number of partitions, for example. This work right here does address part of that problem at the cluster level, but there are also some improvements at the broker level, so an individual broker can hold more partitions. I think those things are going to be the next focus. I would say that is probably going to be a bigger focus once we get all of KIP-500 in, and then the controller sharding and such.
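A small, illustrative calculation of the latency point Jason made earlier about a replication factor of three: ISR-style replication acknowledges a write only after all in-sync replicas have it in the steady state, while a RAFT-style quorum needs only a majority. The numbers below are just arithmetic, not a benchmark.

```java
// Just arithmetic, not a benchmark: ISR-style replication waits for all
// in-sync replicas in the steady state, while a RAFT-style quorum only needs
// a majority, which is where the latency improvement comes from.
public class QuorumSizeDemo {
    static int raftMajority(int replicationFactor) {
        return replicationFactor / 2 + 1;   // e.g. 2 of 3, 3 of 5
    }

    public static void main(String[] args) {
        for (int rf : new int[]{3, 5}) {
            System.out.printf("replication factor %d: ISR acks = %d, RAFT acks = %d%n",
                    rf, rf, raftMajority(rf));
        }
    }
}
```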
30:22 Wesley Reisz: What would you like to see coming up, Colin?
30:24 Colin McCabe: So the big things that I'm most interested in are what's the mapping between partitions and logs, I think that's very important. Another thing is, can we make partitioning more convenient and more extensible? Those are the big things that I think are going to be really interesting to us in the future. We have ideas. We've been throwing ideas against the wall. People have discussed them in the community. We don't have a concrete proposal for these things yet.
30:47 Wesley Reisz: Jason, Colin, thanks for joining us on the InfoQ podcast. If you want to learn more about Confluent, what's going on or track what's happening in the Kafka community, be sure to check out the Confluent blog.
31:00 Jason Gustafson: Yeah. And if you're interested in any of the details, all of the proposals here are posted online on the Apache wiki. In addition, we'll have a blog post in the next month or so which covers the RAFT implementation in more detail.
31:12 Wesley Reisz: Colin, Jason, thanks for joining us.
31:14 Colin McCabe: Thank you.
31:14 Jason Gustafson: Thank you, Wes.