Martin Thompson discusses consensus in distributed systems, and how Aeron uses Raft for clustering in the upcoming release. Martin is a Java Champion with over two decades of experience building complex and high-performance computing systems. He is most recently known for his work on Aeron and Simple Binary Encoding (SBE). Previously he was co-founder and CTO of LMAX, where he created the Disruptor.
Key Takeaways
- Aeron is a messaging system designed for modern multi-core hardware. It is highly performant with a first class design goal of making it easy to monitor and understand at runtime. The product is able to simultaneously achieve the lowest latency and highest throughput of any messaging system available today.
- Aeron uses a binary format on the wire rather than a text based protocol. This is largely done for performance reasons. Text is commonly used in messaging to make debugging simpler but the debugging problem can be solved using tools like Wireshark and the dissectors that come with it.
- A forthcoming release of Aeron will support clustering. Raft was chosen over Paxos for this because it is stricter, which means there are fewer potential states the system can be in, making it easier to reason about.
- Raft is an RPC-based protocol, expecting synchronous interactions. Aeron is asynchronous by nature, but the underlying Aeron protocol was designed to support consensus, meaning that a lot of things which would typically need to be done synchronously can be done asynchronously and/or in parallel.
- Static clusters will be added to Aeron first, with dynamic clustering after that, and then cryptography, again with the intention of keeping latency low and throughput high.
What is Aeron, and why you created another messaging system?
- 1:30 Having worked on many other messaging systems, getting involved with another one seems like a crazy idea.
- 1:35 It was born out of a client’s needs - we were discussing what the ideal messaging system would look like, given the world is quite different now.
- 1:50 We have much higher bandwidth and lower latency networking, multi-core systems - we have a very different layout from what we used to have before.
- 2:05 What can we learn from the modern hardware, and what can we do better to take advantage of that?
- 2:15 In the end, we decided we should do it very differently, and the client convinced me to do it.
- 2:30 The condition I put forward was that it must be open-source, with the main goal of me not having to be involved with another messaging system in the future.
What was the use case the team needed to solve?
- 2:40 Two things: this is a company working at the most extreme end of performance - so performance was a key requirement.
- 2:50 They were close to having what they needed with the commercial tools - but the second thing they needed was transparency.
- 2:55 Understanding what is going on in a distributed system (or even a concurrent system) is a very difficult thing.
- 3:00 So a key design goal was to build something that is easy to monitor and understand at runtime.
Could you elaborate on the description of Aeron on the Git repository?
- 3:30 We can deliver messages (reliably) over various mediums.
- 3:40 For example, with shared memory, data doesn’t get lost or take different routes.
- 3:45 Once you go over real networks, packets can go out of order from when they were sent, they can be lost or duplicated.
- 3:55 Reliable means that the stream of messages from a source to a given destination will be put back in the order they were sent eventually.
- 4:10 It’s similar to how TCP works; it sequences everything, works out if there is any loss, and retransmits if necessary.
- 4:20 We went with different design goals; TCP is a wonderfully designed protocol for wide area networks that may be quite lossy, and is designed for that.
- 4:30 In our data centres, we tend to get very little loss and greater bandwidth with lower latency.
- 4:40 We started designing Aeron with those assumptions - but the base transport has similar characteristics to TCP, and reliability slightly better - but I like comparing to TCP as a benchmark.
So Aeron is a lower-level messaging platform than Kafka?
- 5:25 They are actually quite different - “messaging” is a tag we put on many different things.
- 5:30 In many ways, Kafka could run on top of Aeron, using it as a transport instead of TCP.
- 5:45 Kafka doesn’t deal with the reliability of the messages on the network - it deals with the framing of the messages.
- 5:50 Aeron is like TCP+ - it has the ability to frame messages.
- 6:00 TCP is a byte-oriented data stream from A to B, but it doesn’t know where messages begin and end.
- 6:10 What we’re finding is that over time, we’re adding more features to Aeron, so that the line between Aeron and Kafka is becoming more blurred.
- 6:25 Kafka is a high-level product with enterprise features; Aeron started as a low-level transmission medium for transport, and it has evolved upwards, adding features and becoming richer and more similar to other enterprise messaging systems.
It has a distributed log as well?
- 6:45 It’s got lots of things - but no dependencies.
- 6:50 If you take other systems, like Kafka, it has a lot of dependencies - TCP, Zookeeper and so on.
- 7:00 With Aeron, you have no dependencies other than a language and a network stack.
- 7:05 If going remote, we only need a UDP transport to run on.
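As a rough illustration of that point, here is a minimal sketch using Aeron’s public Java API (the channel URI, stream ID and embedded media driver are standard Aeron concepts, but treat the details as illustrative - the exact API can differ between versions). Switching between UDP and shared-memory IPC is just a change of channel URI:

```java
import io.aeron.Aeron;
import io.aeron.Publication;
import io.aeron.Subscription;
import io.aeron.driver.MediaDriver;
import org.agrona.concurrent.UnsafeBuffer;

import java.nio.ByteBuffer;

public class AeronHello
{
    public static void main(final String[] args)
    {
        // An embedded media driver: no brokers or external services, just the JVM and UDP.
        try (MediaDriver driver = MediaDriver.launchEmbedded();
             Aeron aeron = Aeron.connect(new Aeron.Context()
                 .aeronDirectoryName(driver.aeronDirectoryName())))
        {
            // Swap "aeron:udp?endpoint=localhost:40123" for "aeron:ipc" to use shared memory instead.
            final String channel = "aeron:udp?endpoint=localhost:40123";
            final int streamId = 10;

            try (Subscription sub = aeron.addSubscription(channel, streamId);
                 Publication pub = aeron.addPublication(channel, streamId))
            {
                final UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(64));
                final int length = buffer.putStringAscii(0, "hello");

                // offer() returns a negative value while back-pressured or not yet connected.
                while (pub.offer(buffer, 0, length) < 0)
                {
                    Thread.yield();
                }

                // Poll until the fragment arrives, then print it.
                while (sub.poll((buf, offset, len, header) ->
                    System.out.println(buf.getStringAscii(offset)), 10) <= 0)
                {
                    Thread.yield();
                }
            }
        }
    }
}
```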
What kind of metrics can Aeron deliver?
- 7:30 In the IPC space, we have some of the best latency and throughput of anything out there, depending on whether you have contended publication or not.
- 7:45 It’s possible to get in the high tens of millions of small messages per second, running over Aeron over various transports.
- 7:55 Small messages test the mechanics of the entire stack - you have to add envelopes and headers, and uniquely identify messages to be able to hand them off.
- 8:15 It’s easy to publish good numbers for bandwidth by publishing small numbers of large messages, but that’s just a copying exercise.
- 8:20 When it comes to byte copying, we can totally saturate pretty much any transport we run over.
- 8:30 If you’re running over 10Gb ethernet, we can just run that at almost full capacity.
- 8:35 Most commercial products can start to push that when publishing large messages - to make it interesting, publish large numbers of small (say, 32 byte) messages.
- 8:45 They don’t get close to maximum bandwidth, whereas we can do that on small or large messages.
- 8:55 A lot of this is due to the choice of algorithms up and down our stack, which work nicely whether it’s small or large messages.
- 9:00 Also, the algorithms work well from both throughput and latency perspectives - with the high-end commercial products, you have to decide if you are targeting latency or bandwidth - you can’t have both.
- 9:10 We’ve shown with Aeron that this is a fallacy - we don’t have a separate throughput or latency mode, we just have the standard mode of operation.
- 9:20 The result is that we get the lowest latencies of any product out there, and the highest throughput simultaneously.
Have you ported Aeron to work on Java 9 or 10 yet?
- 9:30 We have got it working on Java 9 and 10, and it works on Java 8.
- 9:35 We didn’t target Java 7 because Aeron relies on some intrinsics on x86, like LOCK XADD, which allows us to increment a number and get the value before it was incremented as a single instruction.
- 10:00 That scales very nicely on x86 hardware - we use that with some of the core of our algorithms to append to the log.
- 10:10 As you said earlier, we have a distributed log; one machine is used to build the log, and it is replicated to one or more target machines.
- 10:20 If it is a single machine, we use unicast - if it is multiple machines we use multicast.
- 10:30 The ability to append to that log from multiple threads requires instructions like LOCK XADD to perform well.
- 10:35 Java 8 introduced the ability to call that - the instruction has been around on x86 for some time, but Java 8 provided access to it from the language (see the sketch after this list).
- 10:45 We need to get the data into a memory-mapped file; we’re moving these files over the network and avoiding the Java heap.
- 11:00 We don’t have the ability, through normal language semantics, to perform the changes we need to make to these files.
- 11:10 Things like Unsafe are used - but as we’ve gone through Java 9 and Java 10, we’ve got less dependency on Unsafe and more language features to deal with this.
- 11:25 We’re still missing some features that the language doesn’t have - particularly with memory-mapped files and buffers - so if we need to integrate with network buffers or other languages, we have to do things slightly off the normal trodden path.
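As a rough illustration of the LOCK XADD point referenced above (not Aeron’s actual term-buffer implementation), Java 8’s getAndAdd intrinsic lets multiple threads claim non-overlapping regions of a log with a single atomic instruction on x86:

```java
import java.util.concurrent.atomic.AtomicLong;

// A toy multi-producer log: each writer claims space with one atomic
// fetch-and-add (compiled to LOCK XADD on x86), then writes into its own
// region without further coordination.
public class ClaimOnlyLog
{
    private final byte[] buffer;
    private final AtomicLong tail = new AtomicLong();

    public ClaimOnlyLog(final int capacity)
    {
        this.buffer = new byte[capacity];
    }

    /** Claim {@code length} bytes and return the offset to write at, or -1 if the log is full. */
    public long claim(final int length)
    {
        final long offset = tail.getAndAdd(length); // value before the increment
        return (offset + length <= buffer.length) ? offset : -1;
    }

    public void write(final long offset, final byte[] message)
    {
        System.arraycopy(message, 0, buffer, (int)offset, message.length);
    }
}
```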
Can you talk about Aeron’s protocols?
- 11:55 On the wire, we have a protocol for communicating between the various endpoints for the exchange of messages.
- 12:05 Those messages are encoded in a binary format.
- 12:10 Sometimes people confuse encoding and protocol; the protocol is the sequence of messages that can flow between different nodes.
- 12:20 Within those messages, data is encoded in a particular encoding.
- 12:35 We could encode them in text - XML, JSON - and in fact a lot of messaging systems that have been around for some time have used those encodings.
- 12:45 Text is a very inefficient way of encoding data.
- 12:55 Why do people use text based encodings for storing files on disk or transmitting on the wire?
- 13:05 The reason is all about being able to be easy to debug and understand.
- 13:20 Humans don’t read ASCII or Unicode very well themselves - but we have good tools that read ASCII and Unicode very well for us.
- 13:30 ASCII and Unicode themselves are both encodings; just a specific form.
- 13:40 If you have tools to understand what you’ve got - like Wireshark and its dissectors for network messages - now it becomes easy to debug and understand.
- 14:05 The ability to edit text files is a very common skill.
- 14:10 I don’t see it as a problem with the encoding, as much as a problem with the skillset and the tools available for developers to use.
- 14:20 We deal with text-based encodings during the development process and when chasing bugs - that’s not how the system is used the rest of the time.
- 14:30 Computers don’t read text; they read binary, and operate on binary.
- 14:40 We have to convert to and from those encodings - like two’s complement for a number - we don’t deal with ASCII representations of numbers, we deal with the two’s complement representation that’s stored in a word.
- 14:50 We have to go through that conversion all of the time, even though it’s not needed in the live system - it’s only needed for the development process (see the sketch after this list).
- 14:55 Fix the real problem; don’t fix the symptom.
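A small, hypothetical illustration of the conversion being described - the same integer written as the two’s-complement word the CPU already uses versus as ASCII text that has to be formatted and parsed:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EncodingCost
{
    public static void main(final String[] args)
    {
        final int price = -1234567;

        // Binary: the two's-complement word the CPU already works with - 4 bytes, no parsing.
        final ByteBuffer binary = ByteBuffer.allocate(Integer.BYTES);
        binary.putInt(price);

        // Text: format to ASCII digits on the way out, parse them again on the way in.
        final byte[] text = Integer.toString(price).getBytes(StandardCharsets.US_ASCII);
        final int decoded = Integer.parseInt(new String(text, StandardCharsets.US_ASCII));

        System.out.printf("binary: %d bytes, text: %d bytes, round-tripped: %d%n",
            binary.position(), text.length, decoded);
    }
}
```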
Does Aeron support clustering today?
- 15:40 There is support in the repository, and we’ll be making a release shortly which will have it more supported.
What were the design goals for this in Aeron?
- 15:45 The interesting design goal is the type of applications that people want to build.
- 15:50 If you’re working in the financial industry and some of the other complex modelling spaces, we want to build interesting, complex models of the domain in memory.
- 16:00 To do that, we want to run in memory so we have the richness of the model and no impedance mismatch with its in-memory representation; there’s no ORM between that and a database.
- 16:10 The downside to a rich in-memory model is what happens if the machine crashes? How do we recover that?
- 16:20 The way we can recover it is if the model is built on a number of different nodes, and another node can take over if the primary node fails.
- 16:30 We now have a reliable system, and we can snapshot it to disk.
- 16:40 How do we run these across a number of nodes and agree on what is in memory? That’s where consensus comes in.
- 16:50 We need some sort of agreement that each of the machines have got a realistic version of the model in memory, so in the event of failure another node can take over.
- 17:05 The interesting thing is - as you build models, we’re living in a world where everything is asynchronous.
- 17:10 The node that’s rebuilding the model could be further ahead or further behind the one that has responded to the customer.
- 17:20 If the customer places an order into the system and the transaction completes, you want to ensure that it is honoured.
- 17:30 If a follower does not have the data yet, there’s no point in replying to the customer as you won’t have a reliable system.
- 17:35 You need consensus that you have the data on a number of nodes, so that in the event of a failure you can proceed.
- 17:45 That raises an interesting question - what does “a number of nodes” mean?
- 17:50 We need enough nodes so that in the event of a dispute we can resolve it.
- 18:05 Consensus algorithms work on a majority - when there’s a majority we can break deadlocks where they may be present.
Being able to acknowledge a message by repeating it back is a way to show that it has been received successfully.
- 18:45 Going back to the original work of Claude Shannon describing information theory in the 1940s, he highlighted that we get loss.
- 18:55 A message having been transmitted - even one that has been received - is not a guarantee that it’s been understood and that it’s being dealt with.
- 19:05 The simplest way to ensure that is to read it back.
- 19:10 We can do checksums, ways of checking the validity of something, but it’s still the same concept.
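A quick sketch of the checksum idea using the JDK’s CRC32 - purely an illustration of validity checking, not anything Aeron-specific:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class MessageChecksum
{
    public static long checksum(final byte[] payload)
    {
        final CRC32 crc = new CRC32();
        crc.update(payload); // checksum is sent alongside the message
        return crc.getValue();
    }

    public static void main(final String[] args)
    {
        final byte[] message = "order:42,qty:100".getBytes(StandardCharsets.US_ASCII);
        final long sent = checksum(message);

        // The receiver recomputes the checksum; a mismatch means corruption in transit.
        final boolean intact = checksum(message) == sent;
        System.out.println("intact = " + intact);
    }
}
```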
Why did you choose Raft for your consensus algorithm?
- 19:35 An excellent question! There’s multiple parts to the answer to that.
- 19:40 One is that Raft fits very well with Aeron.
- 19:45 Why not alternatives like Paxos, atomic broadcast, viewstamped replication, or virtual synchrony?
- 19:55 If you go back to the 1980s we had a lot of discovery and subsequent refinement.
- 20:00 Paxos was just about first, but very closely followed by virtual synchrony and viewstamped replication.
- 20:05 If you go through those they all have some fundamental principles.
- 20:10 The nice step taken with Raft is that it is designed to be understandable.
- 20:15 It had a different design goal - although similar to Paxos, it is more strict.
- 20:20 By being stricter, it has fewer failure cases and less complexity.
- 20:25 Both Paxos and Raft are based on the concept of electing a leader.
- 20:30 The leader determines the correct order of any given events.
- 20:30 The leader that can be elected in Paxos may not have the latest view of the world.
- 20:45 Then you have to deal with the complexity of reassembling that view.
- 20:50 Raft makes a much greater simplification; only a leader who has seen the latest view of the world can be elected.
- 21:00 That simplifies recovery - much simpler to prove and much easier to understand.
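That election restriction comes down to a small comparison in Raft’s vote-granting rule. A sketch following the Raft paper’s lastLogTerm/lastLogIndex comparison (field names are illustrative):

```java
// Raft's election restriction: a voter grants its vote only if the
// candidate's log is at least as up-to-date as its own, so an elected
// leader always holds every committed entry.
public final class VoteRule
{
    public static boolean candidateLogIsUpToDate(
        final long candidateLastLogTerm,
        final long candidateLastLogIndex,
        final long voterLastLogTerm,
        final long voterLastLogIndex)
    {
        if (candidateLastLogTerm != voterLastLogTerm)
        {
            return candidateLastLogTerm > voterLastLogTerm; // higher term wins
        }

        return candidateLastLogIndex >= voterLastLogIndex;  // same term: longer log wins
    }
}
```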
What do you mean by “Raft is more understandable”?
- 21:10 The number of potential states the system can be in is less, so it’s easier to fit in your head.
What is leader election and what states can the system be in?
- 21:30 At a simple level, Raft has the concept of nodes having one of three roles.
- 21:40 Those are follower, candidate and leader.
- 21:45 When nodes start up, they come up automatically in the follower state, so they don’t know who the leader is.
- 21:50 With the basic version of Raft, you start a random timer because there’s no co-ordination.
- 22:10 One of the things that Raft does is set random timers on a number of nodes, which trigger them to try to become leader (see the sketch after this list).
- 22:20 That means that they don’t all try to become leader at the same time.
- 22:30 At that stage, it will request votes from other members.
- 22:40 Two random timers could happen at the same time, resulting in a collision.
- 22:50 In that case, you re-start with new random timers so another collision is avoided.
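A minimal sketch of those roles and the randomised election timeout, as described above (illustrative only - Aeron Cluster’s actual implementation differs):

```java
import java.util.concurrent.ThreadLocalRandom;

public class RaftMember
{
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private Role role = Role.FOLLOWER; // every node starts as a follower
    private long electionDeadlineMs;

    /** Randomise the timeout so nodes don't all stand for election at the same time. */
    void resetElectionTimeout(final long nowMs)
    {
        electionDeadlineMs = nowMs + ThreadLocalRandom.current().nextLong(150, 300);
    }

    /** Called periodically; returns true if this node should request votes from the other members. */
    boolean onTimerTick(final long nowMs)
    {
        if (role != Role.LEADER && nowMs >= electionDeadlineMs)
        {
            role = Role.CANDIDATE;       // stand (or re-stand) for election
            resetElectionTimeout(nowMs); // a fresh random timeout guards against repeated collisions
            return true;
        }

        return false;
    }
}
```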
What is the concept of heartbeat and timers after a leader is elected?
- 23:05 Once you have a leader, it sequences events coming into the system and tells the followers about the data and what they can consume.
- 23:20 The followers may be given more data than they are allowed to consume; they can only consume up to the point at which the cluster has reached (quorum) consensus.
- 22:40 That naturally gives the system a liveness property - the leader can be deduced to be alive if it is still sending data.
- 22:45 If the leader doesn’t have data that needs to be replicated, the leader sends out a heartbeat to indicate that it’s still alive.
- 24:00 Whenever you don’t have heartbeats or data for a given time period, you have to assume that the leader is not responsive.
- 24:10 There may have been a network partition, the leader may have died, or experienced a long GC pause (which is the most common case these days).
- 25:30 Network partitions do happen, but infrequently - much more often the effect of a partition occurs because of a GC pause or some other delay.
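The liveness check being described can be sketched as a deadline that is pushed out whenever data or a heartbeat arrives from the leader (the timeout value and names here are illustrative):

```java
public class LeaderLivenessTracker
{
    private final long timeoutMs;
    private long deadlineMs;

    public LeaderLivenessTracker(final long timeoutMs, final long nowMs)
    {
        this.timeoutMs = timeoutMs;
        this.deadlineMs = nowMs + timeoutMs;
    }

    /** Any append of data or an empty heartbeat from the leader resets the deadline. */
    public void onLeaderMessage(final long nowMs)
    {
        deadlineMs = nowMs + timeoutMs;
    }

    /**
     * If nothing has arrived within the timeout, assume the leader is unresponsive
     * (dead, partitioned, or stalled in a long GC pause) and start an election.
     */
    public boolean leaderSuspect(final long nowMs)
    {
        return nowMs >= deadlineMs;
    }
}
```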
Is Raft a CP system?
- 24:50 It has to be consistent, and it has to be available - so you have to cope with partitions when they happen.
So with 5 nodes and a network partition, two nodes get cut off?
- 25:25 If the leader dies, you’re left with four nodes - and if you’re partitioned 2+2 then, since you need a majority of votes to be elected, neither of those two groups can elect a leader.
- 25:45 At this point the cluster has stopped, awaiting the reboot of another node on one side of the partition in order to be able to continue.
What would happen if you have more nodes with a majority on one side?
- 26:00 You can never have two leaders elected, one on each side of the partition.
- 26:05 You set up your network configuration to maximise the likelihood that a group which can form a majority survives a network partition, so that the cluster can continue.
What issues are there with Raft?
- 26:25 As far as electing leaders and agreeing on the contents of a distributed log, Raft does a good job.
- 26:30 One thing it doesn’t do is say how to implement it efficiently, which is a different problem.
- 26:40 There are other protocols out there, and many of them work - but in the interesting cases some of them have a bit more flexibility in how reads are handled.
- 26:50 You can get increased performance at the expense of consistency.
What happens with Raft under byzantine network conditions?
- 27:20 If all nodes follow the protocol as they should, it copes with the scenario of a machine expiring for some reason.
- 27:30 It doesn’t deal with two distinct cases.
- 28:40 The first is bugs - if you have a bug in your code, it can’t address that.
- 27:45 You might be able to detect it, but it doesn’t tell you how to progress.
- 27:50 The other is where you have a rogue actor, and someone is injecting rogue messages into your network pretending that a certain state is happening when it isn’t.
- 28:00 For example, someone is claiming to be the new leader - the followers have to follow that leader; that’s part of the protocol.
- 28:15 No follower should claim to be leader without having won an election with a majority first.
- 28:20 If you break the rules, you have broken the protocol.
- 28:40 You’d have to use cryptography to solve that, but doing so has a lot of performance implications.
- 29:00 The synchronous interactions expand the time period in which things happen.
- 29:10 It takes longer for consensus to be reached and greatly reduces the throughput of a distributed system.
- 29:15 Aeron is designed to be asynchronous by nature, and the protocol was designed with a longer term view of consensus protocols in the future.
- 29:25 Todd Montgomery and I both have experience in these types of protocols, and we’ve designed it so that things that would have happened serially can be done in parallel and asynchronously in Aeron, which increases the throughput.
- 29:40 For example, the archiving to disk happens completely in parallel with tracking the position for consensus.
- 30:00 We designed the system so that as much as possible can happen in parallel.
- 30:05 As a result, the throughput figures are phenomenal.
- 30:10 If we benchmark our archiver on its own, it can comfortably run at over 2 gigabytes per second, if you have an SSD that can handle the load.
What’s on the roadmap?
- 30:35 The next release will have static cluster support where the number of nodes are known at startup.
- 30:40 The next step after that will be to allow dynamic clusters, where members can come and go.
- 31:00 This will allow us to do things like graceful leader step-down, hot upgrades of a cluster.
- 31:05 Cryptography is coming after that - one of the things about existing messaging systems is that they are all very slow with encryption because they fall back to SSL.
- 31:15 We’re going to use native support on the processor, which will allow us to have much higher rates of encrypted messaging.
Does changing the membership of nodes change the quorum?
- 31:35 The only time you can agree a change is when you have a running, functional cluster, and the leader agrees that change.
- 31:45 Going from three to four is interesting - you have to ask what the quorum is when going from three to four.
- 32:00 It’s still three for a majority - it’s always N/2+1 (see the quorum sketch after this list).
- 32:20 You also see people running two-node clusters, often in the finance industry where there is a primary and a secondary.
- 32:30 The thing is - if you lose one of them, you are now running with no backup or no ability to recover if something goes wrong.
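The N/2+1 arithmetic mentioned above, as a small hypothetical helper to make the membership point concrete:

```java
public final class Quorum
{
    /** Majority needed for consensus: always N/2 + 1 (integer division). */
    public static int majority(final int clusterSize)
    {
        return clusterSize / 2 + 1;
    }

    public static void main(final String[] args)
    {
        System.out.println(majority(3)); // 2
        System.out.println(majority(4)); // 3 - growing from 3 to 4 still needs 3
        System.out.println(majority(5)); // 3
        System.out.println(majority(2)); // 2 - lose one node and the cluster cannot make progress
    }
}
```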
What other languages is Aeron built in?
- 32:50 One of the nice things we have got is that we’ve got the system built primarily in three languages.
- 32:55 We have a native implementation in C/C++, we have Java and we have C#.
- 33:00 The native implementation opens up a lot of interesting opportunities - on Linux we can use sendmmsg and recvmmsg and get a 50% throughput increase over Java, because Java’s NIO doesn’t expose some of the more recent networking calls.
- 33:35 We can also get direct access by putting assembly inside our native code that lets us access the on-chip cryptography that’s available on a lot of modern processors.