Today on The InfoQ Podcast, Wes Reisz talks with Peter Bourgon. Peter is a distributed systems engineer working at Fastly. His area of interest is coordination-free replicated systems. The two engineers talk about the space of Conflict-Free Replicated Data Types (CRDTs), specifically in the context of edge compute. Topics covered on the podcast include edge compute, CRDTs, the CAP theorem, and the challenges of building distributed systems.
Key Takeaways
- CRDTs (conflict-free replicated data types) are a class of coordination-free replication systems (systems that don’t require strategies such as leader election).
- An easy way to think of a CRDT is as an associative, commutative, and idempotent data structure plus the operations needed to use it.
- The edge is an overloaded term that people tend to define based on where they sit on a spectrum between the customer and the data center. Fastly’s edge is away from the data center, but does not extend all the way out to the telephone pole or handset.
- Raft and gossip are two alternative approaches to a coordination-free replication system like CRDTs.
- To get the properties of a CRDT and have useful data types, you have to pay a cost in size and often bytes on the wire. These are challenges that continue to need solutions.
- Modern distributed systems and data structures like CRDTs require you to start thinking about state in the system itself. It’s not unusual for a system today to give you multiple results back that the application will have to handle or merge.
Show Notes
What makes the edge space interesting?
- 01:20 It's a unique combination of things - at a high level, I'm interested in large scale systems, so distributed systems in general is a good home for me.
- 01:35 I got attracted to CRDTs and co-ordination free large scale distributed systems because their properties appeal to me.
- 01:50 Leader elections, heartbeats and other mechanisms in distributed systems aren't needed in co-ordination free systems; CRDTs are a great way to achieve this.
- 02:05 There was an opportunity in our product portfolio and market for a system that might match all of these properties.
What is meant by edge?
- 03:00 If you're working in a data centre, edge is anything that's closer to the customer than you are; if you are a mobile provider, the edge is the handset.
- 03:20 My take on it is influenced by our (Fastly's) network topology; we think the edge is where we sit.
- 03:30 Our points of presence are fewer in number and higher powered than some other edge cloud providers'.
- 03:40 We have bigger points of presence on internet exchange points, which are maybe a bit more distant than other companies, but we can do more there.
- 03:55 For us, that's the edge - away from the customer data centre, but not all the way to their handset.
What are Conflict-Free Replicated Data Types?
- 04:45 There are lots of ways of defining a CRDT in a formal grammar.
- 05:15 The informal definition is a data type, in combination with the operators that you can use with it.
- 05:30 A stack isn't a CRDT, though, because its operations don't satisfy associativity, commutativity and idempotency.
- 05:45 Associativity allows you to group operations however you like, as with addition: (1+2)+3 = 1+(2+3).
- 06:05 Commutativity allows you to swap the order of individual arguments, as with addition: 1+2 = 2+1.
- 06:10 Idempotence means that combining the same element repeatedly gives the same result; plain addition only has this property for zero, which is why integers with addition don't quite work.
- 06:55 If instead of integers you have sets of integers, and instead of addition you have union, then all three properties hold (see the sketch after this list).
- 07:10 ({1} U {2}) U {3} results in {1,2,3} regardless of grouping, {1} U {2} is the same as {2} U {1}, and {1} U {1} is {1}.
- 07:30 It turns out that you can leverage these properties in interesting ways, so that from the view of a distributed systems designer you don't need to keep track of nearly as much.
- 07:45 If there's a network outage, everything is fine provided that you make best efforts to move the data forward.
- 08:15 If you can build a system out of these building blocks, then you can mostly ignore faults.
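To make the three properties concrete, here is a minimal sketch in Go of the set-union example above. The `GSet` type (a "grow-only set") and its method names are illustrative, not taken from any particular library:

```go
package main

import "fmt"

// GSet is a grow-only set, one of the simplest CRDTs: the only
// operations are Add and a Merge defined as set union.
type GSet map[int]struct{}

// Add inserts an element into the set.
func (s GSet) Add(v int) { s[v] = struct{}{} }

// Merge folds another replica's state into this one via union.
// Union is associative, commutative, and idempotent, so replicas
// can exchange state in any order, any number of times, and still
// converge to the same value.
func (s GSet) Merge(other GSet) {
	for v := range other {
		s[v] = struct{}{}
	}
}

func main() {
	a, b := GSet{}, GSet{}
	a.Add(1)
	a.Add(2)
	b.Add(2)
	b.Add(3)

	a.Merge(b) // a is now {1,2,3}
	b.Merge(a) // b is also {1,2,3} -- order didn't matter
	a.Merge(b) // merging again changes nothing (idempotence)

	fmt.Println(len(a), len(b)) // 3 3
}
```

Because merging is just union, replicas can re-send state opportunistically - repeating a message after a network outage is harmless, which is exactly the fault tolerance described above.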
How might you use a CRDT in edge?
- 08:45 A lot of the examples are sets of integers, but in a key-value store you often want to be able to store arbitrary information.
- 09:00 In CRDTs, you can't necessarily leverage any semantics or properties of the information to make your job easier or resolve conflicts.
- 09:10 You get a big blob of bytes and you need to be able to use it.
- 09:15 The important property is that you have deterministic mergeability.
- 09:20 An example I like to use: if you have a set of operations on a key, the important thing is that no matter what order the operations come out of the bag, you get to the same end state.
- 10:05 If you have a big blob of bytes, it's difficult to do that.
- 10:10 One solution is a last-writer-wins register (sketched in code after this list) - you take the blob of bytes and attach a timestamp, or something with a time component, so that you have some sense of total order.
- 10:30 It's more like a monotonically increasing UUID; each node in the system has a generator of these IDs, like a vector clock or Lamport clock.
- 10:45 Every time a write happens, you can attach a timestamp to it, and when you go to merge these values together, the winning write is the one with the biggest timestamp.
- 11:05 What this means is that if one of your nodes has a clock with serious skew, it's going to win all these contests, and the system will be deterministic but not what your customers want.
- 11:30 That's a problem, but it does give you a nice API where you have a single get/set.
- 11:40 We think that's going to be great for a lot of use cases.
- 11:50 If you want to get more sophisticated, there's a multi-value register, which leverages a dotted version vector set.
- 12:05 Instead of just doing a set with a value, you give it a causal context as well, which is a compressed history of all of the operations that have occurred on the key up to this point.
- 12:40 It's basically like the event stream of the change.
- 12:50 You see the history of the register for the system, so you get to see what happens and can observe a partial ordering.
- 13:25 If two people make a write with the same history, then at some point when you merge them together, a get returns two answers.
- 13:45 As the application author, you have to do the merge and write a single value back.
- 13:50 It's up to the application as to how you deal with it.
- 14:10 If you want systems with these nice properties and low latency, and the system doesn't deal with conflicts for you, we have to make the applications more complicated.
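Here is a rough sketch in Go of the last-writer-wins register discussed above, using a logical (Lamport-style) clock rather than wall-clock time, with a node ID as a tie-breaker. The type and method names are illustrative, not Fastly's API:

```go
package main

import "fmt"

// LWWRegister holds an opaque blob of bytes plus the "timestamp" of
// the write that produced it; on merge, the bigger timestamp wins.
type LWWRegister struct {
	Value  []byte
	Clock  uint64 // logical clock, not wall time
	NodeID string // tie-breaker so every replica resolves ties identically
}

// Set records a local write, advancing the logical clock.
func (r *LWWRegister) Set(value []byte) {
	r.Clock++
	r.Value = value
}

// Merge keeps whichever write "happened later"; ties fall back to the
// node ID so the outcome is deterministic everywhere.
func (r *LWWRegister) Merge(other LWWRegister) {
	if other.Clock > r.Clock ||
		(other.Clock == r.Clock && other.NodeID > r.NodeID) {
		r.Value, r.Clock, r.NodeID = other.Value, other.Clock, other.NodeID
	}
}

func main() {
	a := LWWRegister{NodeID: "a"}
	b := LWWRegister{NodeID: "b"}
	a.Set([]byte("hello"))
	b.Set([]byte("world"))

	a.Merge(b)
	b.Merge(a)
	fmt.Printf("%s %s\n", a.Value, b.Value) // both converge to "world"
}
```

A version using wall-clock timestamps would have exactly the skew problem mentioned above: a node with a fast clock wins every contest, deterministically but wrongly.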
You've got all these POPs distributed, and you're using these CRDTs to build a distributed key value store?
- 14:50 When people design systems, they are typically designing for a single data centre.
- 15:00 You have pretty good technology, connections are fast and reliable - and everything is physically close, so latency is pretty low.
- 15:20 The main constraint in edge system design is that you can't ignore the speed of light any more.
- 15:25 The logical system that you're trying to build occupies a physical space that cannot cheat physical laws.
- 15:55 The speed of light is the core constraint in this space, and the faults that come with it can't be ignored.
- 16:05 CRDTs are the things that allow us to work productively at this scale.
- 16:10 We can't fall back to circuit breaker patterns, retry loops, or leader elections when we're not sure messages will get there within a day.
- 16:20 We're talking about Earth bound systems at the moment, but when we go interplanetary all this becomes more important.
With Raft, you can run into problems when there is latency caused by distance between POPs. Is that correct?
- 17:10 You could say that; you can tweak the timeouts in a Raft installation, and in theory you could make it work at a large scale, but you have to bear in mind every operation has to go through a quorum.
- 17:30 It has to go to the leader, go through a round, append to the log, and you have to get the response back to the client.
- 17:35 All of those things are subject to timeouts; if you have timeouts sufficient for operation at a global scale, then your client will be waiting seconds for each response.
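As a rough back-of-envelope illustration (the numbers here are assumptions for the sake of the example, not figures from the podcast): if POPs on different continents see something like a 150 ms round trip between them, a single Raft write needs at least the client-to-leader hop plus a leader-to-quorum round trip before it can acknowledge - two round trips, or roughly 300 ms in the best case - and redirects, retries, or a leader election multiply that further.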
Can you recap the CAP theorem?
- 18:00 Broadly speaking, it states that if you are designing a distributed system, you can choose either consistency or availability in the face of partitions.
- 18:05 A partition is a network fault, and they happen all the time - they are inevitable.
- 18:15 Raft's choice in this situation is consistency, and there's a whole set of protocols like Raft and Paxos that you can use to do this.
- 18:25 Because every operation has to go through a quorum, they don't usually work outside of a single data centre if you want to have reasonable latency.
- 19:10 It depends on the type of system you're using, but in a key value store every write will pay some latency - and even reads pay it, if you want them to be consistent.
- 19:40 Now we're into the formal language, but there are whole classes of consistency models, like serialisability and linearisability.
- 19:50 Raft gives you one of the stronger, more formal models; AP systems instead choose availability in the face of partitions.
- 20:00 There are also eventually consistent systems, which is what CRDTs give you, so you can have things that are out-of-sync; you can do local operations quickly and sync later.
- 20:20 CRDTs make sure that overall process is still formally consistent and correct.
- 20:35 Cassandra is an AP system, though I'm not sure about all the ways it communicates.
- 20:40 I guess there are ways you can configure it to have Cassandra clusters that are very large - but I've not used Cassandra in anger.
Why not Gossip for a problem at the edge?
- 21:00 It gets into a slightly different aspect of system design, which is data ownership or location.
- 21:15 In any sort of edge compute system, the whole point is that you have physical devices that are spread out across a large physical space.
- 21:30 What you want to do is establish a single logical state space, like a distributed key value store.
- 21:45 The other constraint is that these devices are often not huge, so you can't keep the entire data set on every device - you have to have a heuristic.
- 22:00 For example, you might store it all in a central place, or distribute it based on use - it's a physical constraint on these kinds of systems.
- 22:10 If you set a key to a value in some place, it has to live on the device in that place to start, but where does it go eventually?
- 22:25 What's the protocol for getting that data if another device requests it, and what is the information hierarchy?
- 22:30 How do you achieve that while achieving good latency?
- 22:35 Gossip is one way to achieve that - there's lots of gossip protocols - and they all guarantee that the information will propagate eventually, often with upper bounds.
- 22:55 This is often not compatible with real-time key/value use - a couple of seconds might be OK, but if it takes six hours to get to another part of the globe, it may not be acceptable.
- 23:15 I'm glossing over the gossip space so I may be missing some advanced stuff, but generally gossip presents some challenges to data locality.
- 23:30 At the other extreme is how a typical database works, with a primary and read replicas.
You talked about designing edge as a hub and spoke system?
- 24:00 This is about data locality - if you have a customer in one location who writes a key, and it is read from another location, how do you move that information?
- 24:15 There's several ways of solving this problem, and I'm not sure there is a correct answer.
- 24:20 If at one end of a spectrum you have gossip, and at the other end you have a strict hierarchy - a single write primary with multiple read secondaries - then maybe there is a space in the middle.
- 24:40 You could have each site - each node, in edge terminology - be its own fully realised store of state.
- 24:50 Maybe the way they communicate with each other is via a centralised upstream, like a hub and spoke.
- 25:05 They could sync the state they've received and transmit it upstream; you can extend this to an n-ary tree and build it up fractally.
- 25:25 You can have your sites be physically constrained, with limited CPU or RAM, while the upstream can be in the cloud, where disk is effectively free and almost infinite.
- 25:50 Every POP has a single upstream, but that upstream may have its own chained upstream.
- 26:00 If you wanted to have strong consistency at the edge, I guess you could do that - but that's not how we do it at Fastly.
- 26:10 The main point is that with a POP the edge is physically closer, so you have more leverage and breathing room in terms of what access patterns you use.
- 26:25 The other thing to point out is the recency bias - a POP/site/device shouldn't have data that it doesn't need.
- 26:40 It turns the system into an LRU cache holding the most recent stuff, with an authoritative upstream you can fetch from if you need to (sketched below).
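A minimal sketch in Go of the "LRU cache with an authoritative upstream" read path just described. The `Cache` type and `upstream` callback are hypothetical simplifications, and real eviction and sync logic are omitted:

```go
package main

import "fmt"

// Cache is a toy model of a POP: recent keys are served locally, and
// misses fall back to the authoritative upstream (the hub).
type Cache struct {
	local    map[string][]byte
	upstream func(key string) ([]byte, bool) // stand-in for the hub RPC
}

// Get answers from local state when it can, otherwise asks upstream
// and keeps a copy so the next read is local.
func (c *Cache) Get(key string) ([]byte, bool) {
	if v, ok := c.local[key]; ok {
		return v, true
	}
	v, ok := c.upstream(key)
	if ok {
		c.local[key] = v // recency bias: keep what was just used
	}
	return v, ok
}

func main() {
	hub := map[string][]byte{"greeting": []byte("hello")}
	pop := &Cache{
		local:    map[string][]byte{},
		upstream: func(k string) ([]byte, bool) { v, ok := hub[k]; return v, ok },
	}
	v, _ := pop.Get("greeting") // local miss, fetched from the hub
	fmt.Println(string(v))
	v, _ = pop.Get("greeting") // now served from local state
	fmt.Println(string(v))
}
```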
Consul uses Raft within a cluster and then gossip between clusters - is that similar?
- 27:20 I think if you squint and look at it from far away, sort of - but any globally distributed data system is going to converge towards the same sorts of architectures.
- 27:40 I would point out some distinctions between Consul and this system - Consul is supposed to be a system of record; that's its whole reason for being.
- 28:00 That's why it uses Raft internally and I don't know the details of the site-to-site communication link.
- 28:15 You're going to pay a latency cost even when setting something inside a site.
- 28:20 In my system, since it is an LRU cache with an authoritative upstream, it's much more important that we have low latency, so we make different engineering decisions.
- 28:35 Everyone's into edge compute - Amazon with Lambda, and CDNs are getting into the space as well.
- 28:40 Fastly has the Compute@Edge initiative, which was the origin of my project.
- 28:55 We use WebAssembly (Wasm) to run customer code in the HTTP request/response cycle in the CDN.
- 29:05 More than anything else, we are bound by latency for all this stuff.
- 29:10 We cold-boot customer code on the order of single-digit microseconds, so any state component of this system has to operate within an order of magnitude of that.
- 29:25 Even Consul, configured as tightly as possible in this kind of network, is still looking at double-digit milliseconds for writes, which is a different class of system.
What are the trickier problems you've run into?
- 30:10 There's a couple of fundamental - potentially even unsolved - problems with CRDTs at the moment.
- 30:15 In order to get all these nice properties - automatic mergeability and correctness - and have useful data types, you have to pay a cost in size.
- 30:30 There's a lot of duplication, or data amplification; even with the simplest key/value, like a blob of bytes, if you want to store a megabyte, that could balloon to 100 megabytes.
- 31:00 You can't charge the customer for this kind of thing - how do you get it small enough to get the properties you need while keeping the system live?
- 31:10 Every CRDT-based system has to grapple with this at some point - so problems where CRDTs would otherwise fit may not be commercially viable because of it.
- 31:20 We do cheat in clever ways to make this tractable - that's going to be ongoing engineering.
- 31:30 The fact that we don't have to do leader elections and quorums is great, but the way we deal with faults is simply by replicating our communication.
- 31:45 That means we are putting more bytes on the wire than we strictly need to; Fastly and other edge cloud systems have pretty good networks, but that's not always true, so it's another constraint.
- 32:00 The final one - maybe the most important - if you're developing an app, you really want a database with insert and select and it just works.
- 32:10 You don't want to have to think about the speed of light, or how to model your state layer if there are conflicts - you want the database to take care of you.
- 32:30 What we're doing now is stepping into a more complex model, where you can't make those claims; you have to think about the state system in the application and how to merge those conflicts.
- 32:40 More concretely, the API it gives you is not just set and get: you have to deal with the causal context and with multiple values (see the sketch after this list).
- 33:10 I'm not sure that the market is ready for this - a lot of applications want to work with simple state paradigms, so a lot of my job is easing people into these APIs.
- 33:20 I think of Basho's Riak database - they had CRDTs as a first-order thing.
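To illustrate the API shape this implies, here is a toy single-node sketch in Go. The `Context` token stands in for the compressed causal history (the dotted version vector set mentioned earlier); the types, names, and the merge-by-concatenation are all hypothetical, not Fastly's actual API:

```go
package main

import "fmt"

// Context stands in for the opaque causal context a real store
// would return; here it is just a version counter.
type Context int

// mvStore is a toy multi-value register: it keeps sibling values for
// each key, plus the version that plays the role of the causal context.
type mvStore struct {
	siblings map[string][][]byte
	version  map[string]Context
}

// Get returns all concurrent values for a key, plus the context the
// caller must hand back on Set.
func (s *mvStore) Get(key string) ([][]byte, Context) {
	return s.siblings[key], s.version[key]
}

// Set with the current context replaces everything the caller
// observed; Set with a stale context adds a sibling instead.
func (s *mvStore) Set(key string, value []byte, ctx Context) {
	if ctx == s.version[key] {
		s.siblings[key] = [][]byte{value}
	} else {
		s.siblings[key] = append(s.siblings[key], value)
	}
	s.version[key]++
}

func main() {
	s := &mvStore{siblings: map[string][][]byte{}, version: map[string]Context{}}

	// Two writers read the same (empty) history, then write concurrently:
	_, ctx := s.Get("k")
	s.Set("k", []byte("left"), ctx)
	s.Set("k", []byte("right"), ctx) // stale context -> becomes a sibling

	values, ctx2 := s.Get("k")
	fmt.Println(len(values)) // 2: a get hands the application both answers

	// The application merges the siblings (here, naively concatenating)
	// and writes a single value back with the context it observed:
	merged := append(values[0], values[1]...)
	s.Set("k", merged, ctx2)

	values, _ = s.Get("k")
	fmt.Println(len(values), string(values[0])) // 1 leftright
}
```

The burden this shifts onto the application author is the point: deciding how to collapse siblings into one value is domain-specific, and no general-purpose store can do it for you.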
Your talk "Infinite parallel universes - state of the edge" recorded at QCon London 2020 should be on InfoQ soon. What other recommendations do you have on CRDTs?
- 34:30 One of the things that draws me to this space is that it's an active area of research - there's no CRDTs book.
- 34:40 You're reading academic papers that have come out in the last ten years.
- 34:50 I can name names, and you can search for their research and follow their blogs.
- 35:00 Martin Kleppmann, a researcher at Cambridge, has done a lot of work in this space and is now working on Automerge.
- 35:25 Professor Carlos Baquero is on the cutting edge of CRDT research; his latest work includes delta CRDTs.
- 35:40 Christopher Meiklejohn is working on a programming language called Lasp, which brings CRDTs to the language level.
- 35:55 You can learn a lot from Kyle Kingsbury who writes the "Call Me Maybe" Jepsen testing series.
- 36:30 I wish there was a book, but the area is too nascent and moving too quickly for any specific book to keep up.