Michael Nygard - Redefining CAP


1. We are here at QCon 2012 and Michael is speaking on Exploiting Holes in CAP Theorem. My first question for you just to start things is: can you just introduce yourself to our audience?

Sure, I’d be happy to. I’ve been an application architect and developer for about 20 years now, with a background that you’d call either “well-rounded” or “checkered” depending on how charitable you are being. I’ve done work in a lot of different industries, including a particular emphasis on large-scale commerce and retail over the last 15 years. In the early 2000s I made a move over into IT Operations and got to live with some of our software in production and see how well the apps that we wrote survived the real world, and discovered the answer was not very well at all. So I took that experience and wrote a book called “Release It”, about how to design software to survive contact with the real world, and then I tried to bring that experience back through talks at conferences like this one.

   

2. That was not your only book, I think you’ve done quite a few books, is that right?

I’ve been co-author on several others, including “97 Things Every Software Architect Should Know” and O’Reilly’s “Beautiful Architecture”.

   

3. I mentioned you are speaking on Exploiting Holes in CAP, so before we get into that, can you maybe just set the stage for us and explain the Consistency, Availability and Partition Tolerance problem?

Sure, this problem was actually named by Eric Brewer, who gave the opening keynote today. It was back around the early 2000s or late ’90s when he posed that it was impossible to deliver a distributed system that had all three of these characteristics simultaneously: that it was always consistent, always available and could survive network partitions, where parts of the system couldn’t communicate with other parts of the system. That was a conjecture until 2002, when it was proven as a theorem of Computer Science by Gilbert and Lynch in a very widely cited paper.

   

4. You’ve been kind of testing the assumptions of CAP and you’ve come up with what you call loopholes, what are some of the loopholes?

First I have to explain how you can find loopholes in something that I just said was proven as a theorem of Computer Science. Like any proven theorem, it relies on certain definitions and assumptions, so the CAP Theorem as proved relies on, for example the nodes being part of an asynchronous message passing network in which the only data they can receive about the outside world is messages passed between each other. It relies on a particularly strict definition of consistency, the idea that not only do all nodes share the same current state but they all have the same view of the entire history of that state. There are other definitions of consistency and there are other kinds of networks you can construct and so when I talk about loopholes, what I’m really talking about is exploring the space around those assumptions.

So for example, there is an older definition of consistency, dating back to the ’70s and early ’80s, in which we describe consistency as a set of assertions about the nature of the data in the database. Now this version of consistency is not talking about the nodes or the computers in the network each sharing the same view over the data, but it’s saying that there must be certain things which are true about the data before and after a transaction, but not necessarily during, and it says that there are certain relationships among the data that have to hold. For instance, in a classic sort of balance transfer example, we would say: “The total amount of money represented in all the accounts in the database should be the same before and after the transaction”. If you’ve created or destroyed money as a result of a transaction, you’ve done something wrong. Well, that is also a kind of consistency, and it turns out there are ways that we can achieve that kind of consistency with availability and partition tolerance. It’s not CAP because I’ve changed the definition, so this is why we refer to it as a loophole.
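The balance transfer assertion can be sketched in a few lines of Python. This is only a toy in-memory illustration, not something from the talk: the `Ledger` class, the account names and the amounts are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy in-memory ledger; account names and amounts are hypothetical."""
    balances: dict[str, int] = field(default_factory=dict)

    def total(self) -> int:
        # The quantity the consistency assertion talks about.
        return sum(self.balances.values())

    def transfer(self, src: str, dst: str, amount: int) -> None:
        if amount < 0 or self.balances[src] < amount:
            raise ValueError("invalid transfer")
        before = self.total()
        # Between these two statements the assertion is briefly false:
        # money has left src but has not yet arrived at dst.
        self.balances[src] -= amount
        self.balances[dst] += amount
        # It only has to hold before and after the transaction, not during.
        assert self.total() == before, "transfer created or destroyed money"


ledger = Ledger({"alice": 100, "bob": 50})
ledger.transfer("alice", "bob", 30)
```

Note that the check says nothing about which node sees which value; it is purely a statement about the data on either side of the transaction.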

   

5. What are some of those specifically?

Redefining consistency is a very handy trick. There is another trick; in my talk I caution people that not all of these are serious, some of them are a little bit playful. For example, many people say that the problem is shared, mutable state, but I could change the definition and say the problem is shared, readable state. And so I proposed an extension to the famous language HQ9+ called Distributed HQ9+, in which every node can increment the distributed register but the language actually has no means for ever reading the register. So you can claim that it’s consistent, available and partition tolerant, but maybe not all that practical, so that is another one.

On a more serious note, when we say that the network is an asynchronous message passing network, that is a limited kind of network; it resembles UDP rather than TCP. So if we construct networks that have more information, or we enable our computers to get information from sources other than the network, then we have other modes available to us. One of them that has been making news lately is Google’s Spanner database, which actually uses both GPS time and an atomic clock to time-stamp every transaction, so that even if the TCP-based network is partitioned, you can heal after the partition and reconstruct the linearized history, because you’ve got satellites in orbit around the Earth providing very precise time signals to the database nodes.
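As a rough illustration of the idea of rebuilding one history from externally time-stamped transactions, here is a minimal Python sketch. It is not how Spanner is implemented; it assumes every event carries a perfectly trusted timestamp, and the `Event` type, `merge_histories` function and sample operations are hypothetical.

```python
import heapq
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: float  # assumed to come from a trusted external clock
    node: str
    op: str


def merge_histories(*logs: list[Event]) -> list[Event]:
    """Merge per-side logs (each already time-ordered) into one history."""
    return list(heapq.merge(*logs, key=lambda e: e.timestamp))


# Two sides of a partition kept accepting writes independently.
side_a = [Event(1.0, "a", "x=1"), Event(4.0, "a", "x=3")]
side_b = [Event(2.5, "b", "x=2")]

# After the partition heals, the timestamps alone give a single order.
history = merge_histories(side_a, side_b)
```

The sketch leans entirely on the clocks being trustworthy, which is exactly the extra information that a pure asynchronous message passing network does not have.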

   

6. You talk about something called PACELC, what is PACELC?

So PACELC is not a concept that I’ve created; it actually comes from Doctor Daniel Abadi from Yale University, and it’s a mnemonic device to let you analyze a database as to how it behaves when there is and when there isn’t a partition. He says that CAP is not the complete formulation because it leaves out latency, and we have to think about how the database handles trading off latency versus consistency and how it handles the trade-off both when there is and when there isn’t a partition. So PACELC means: when there is a partition, how do I deal with availability versus consistency, else, when there isn’t a partition, how do I trade off latency versus consistency. He actually uses PACELC to characterize pretty much all of the databases that are out there and available today.
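One way to read the mnemonic is as two independent choices, which can be written down as a small Python data structure. This is just a sketch of the classification scheme; the type names are assumptions, and the example profile does not describe any particular product.

```python
from dataclasses import dataclass
from enum import Enum


class OnPartition(Enum):
    """What the system favors while a partition exists (the "PAC" half)."""
    AVAILABILITY = "A"
    CONSISTENCY = "C"


class Otherwise(Enum):
    """What the system favors in normal operation (the "ELC" half)."""
    LATENCY = "L"
    CONSISTENCY = "C"


@dataclass(frozen=True)
class Pacelc:
    on_partition: OnPartition
    otherwise: Otherwise

    def label(self) -> str:
        # Conventional shorthand such as "PA/EL" or "PC/EC".
        return f"P{self.on_partition.value}/E{self.otherwise.value}"


# Hypothetical store that stays available under partition and
# favors low latency the rest of the time.
profile = Pacelc(OnPartition.AVAILABILITY, Otherwise.LATENCY)
```

Here `profile.label()` yields the shorthand `PA/EL`; the other three combinations follow the same pattern.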

   

7. [...] So what does all of this mean for the average developer, Joe Programmer, I mean why should he even care?

[Michael's full question: That is interesting because in Eric Brewer’s talk this morning, he basically said that partitioning is the constant and it’s really consistency versus availability, and then he further qualifies and says that the problem is in this right hand corner of 100% consistency versus 100% availability, so there are concepts of weak consistency, for example, where you can have all three. So what does all of this mean for the average developer, Joe Programmer, I mean why should he even care?]

It’s a perfectly good question. If you are not building at extremely high scale, if you are not building across multiple physical datacenters and if you are not actually a vendor of a database or building a database, does this matter? There are a lot of applications for which extreme high availability shouldn’t be an issue. I worry a little bit about the faddishness that we can kind of instigate at conferences. There are many applications which really should be favoring consistency and sacrificing availability and just running on a plain old ordinary relational database, and for those applications that’s perfectly fine. For the ones that do have those more challenging constraints (scale, distribution, very high availability requirements), I think the way that you normally address the theoretical concerns of CAP or PACELC or any of these others is to draw boundaries and think about it as you are building your application architecture.

So part of what I talk about is bounded consistency: creating perhaps a section of your infrastructure or a section of your application behavior in which you have high consistency and partition tolerance, and maybe another area where you have high availability and partition tolerance. P is kind of the invariant. But I can selectively make that trade-off in different areas of my application by taking advantage of my knowledge of the domain or of my knowledge of the architecture. So this is what I refer to as bounded consistency, and that is what I recommend for the typical application architect or the typical developer: to think about these boundaries and be deliberate about choosing them.
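A deliberately chosen boundary could look something like the following Python sketch. It is an assumption-laden illustration rather than a recommended design: the area names, the `BOUNDARIES` table and the `handle` function are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    CP = "favor consistency"
    AP = "favor availability"


# Hypothetical boundaries for a retail application, chosen per area
# from knowledge of the domain.
BOUNDARIES = {
    "payment": Mode.CP,
    "product_catalog": Mode.AP,
}


def handle(area: str, partitioned: bool) -> str:
    """Decide how a request to the given area behaves under a partition."""
    mode = BOUNDARIES[area]
    if not partitioned:
        return "served"
    if mode is Mode.CP:
        # Refuse rather than risk acting on inconsistent data.
        return "rejected"
    # Stay available and accept that the answer may be out of date.
    return "served_possibly_stale"
```

With this table, a partition makes `handle("payment", True)` refuse the request while `handle("product_catalog", True)` keeps answering, which is the selective trade-off described above.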

   

8. You mentioned your book “Release It” and I always thought that was kind of a play-on release IT but anyway it’s been 5 years now since the book came out, so a lot has changed in 5 years. What has really changed in the deployment space that maybe if you are going to write a new book it would be in the new book?

So I think there are two things that are enormously significant. If I were going to write a new book, and this is not a commitment, but if I were, I would probably address DevOps and Continuous Delivery. These are not precisely the same things but they are clearly kind of in related spaces. “Release It” I wrote under the presumption of sort of traditional IT operations: software development creates applications and essentially throws them over the wall to groups of people who have to run the app but don’t actually know how it was built or what it’s made of. I think with Continuous Delivery you can get into production much earlier and drive out a lot of those problems with the production environment that are so vexing, all of those 1001 configuration details that are different between dev and production. DevOps encourages a much higher level of transparency and it encourages operations to get more involved in the application architecture and more aware of the business metrics and the behavior of the application and how it affects the business. All of those are positive things and they accelerate the learning cycle, so that is one big area.

   

9. Not to put you on the spot, so if you don’t have these handy we understand, but are there any architectural design patterns that you might be able to share with our audience?

You mean newer ones since I wrote the book? There are a couple of things that I’ve observed more than once. One is something that I’m tentatively referring to as a force multiplier bug. This is essentially a kind of denial of service attack that you provoke upon yourself by causing others to send you a lot of traffic. The place where I first saw this was actually a bug in some JavaScript code where it was just a missing equal sign: it was a single equals instead of a double equals, and that caused a polling loop on a timer to fire a new Ajax request and set a new timer every time it came around. So from the inside, as we were monitoring the system, it looked exactly like a distributed denial of service attack, except it was completely self-induced. So that force multiplier idea is something that I’ve seen pop up a couple of times.

One other one that I’ve seen is something that I’m referring to as transaction stuttering. This is a little more specific to a relational database-driven system that has a batch load on one side and transactional behavior on the other side. The batch load is trying to update many rows, the transactional side is trying to select from many rows; one of them is sort of walking through a sequence of rows and the other is sporadically picking out random rows. And what can happen is they interfere with each other: the sporadic rows on the transactional side will cause the batch load to pause, to block, because locks are being held, and at the same time the batch load is holding lots of locks, thousands maybe, which causes the transactional side to slow down. And so when you have that behavior the batch load takes many times longer than it ought to and the transactional side times out in a lot of places. This is a fairly specialized pattern, it doesn’t happen in every kind of application, so I haven’t written it up anywhere yet.

   

10. You mentioned Continuous Delivery and indeed it forms an entire track here at QCon, so how does Continuous Delivery address the anti-patterns that you’ve described earlier?

One of them is it doesn’t let them survive as long into production, so you tend to catch the anti-patterns earlier and it’s much easier to fix them when they are smaller and there are fewer of them, so that is one big benefit. Another one is that Continuous Delivery causes people to design things differently, so you are more likely to put in things like feature flags if you are doing Continuous Delivery, you are much more likely to design features in a way that they are less coupled to each other. Those design changes actually just tend to provoke better design in general, I mean we always say less coupling is good and so Continuous Delivery sort of lets you do that.
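For readers who have not used them, a feature flag in its simplest form is just a runtime switch around a new code path. The Python sketch below is a minimal illustration under that assumption; the flag name `new_checkout` and the `checkout` function are hypothetical, and real systems typically load flags from configuration rather than a module-level dictionary.

```python
# Hypothetical flag store; the new code ships to production switched off.
FLAGS = {"new_checkout": False}


def is_enabled(flag: str) -> bool:
    # Unknown flags default to off so unfinished features stay hidden.
    return FLAGS.get(flag, False)


def checkout(cart_total: float) -> str:
    if is_enabled("new_checkout"):
        return f"new flow: {cart_total:.2f}"
    return f"old flow: {cart_total:.2f}"
```

Because the new path is isolated behind one check, it can be deployed continuously and turned on later, which also nudges the two flows toward being less coupled to each other.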

   

11. I just want to get your take on the Agile Lifecycle Management movement. ALM has actually been around for a long time, and it wasn’t always called Agile, it was Application Lifecycle Management, but there has been a lot of emphasis on ALM in the last year or two. What is your take on the ALM landscape right now?

So this is one of those questions that is going to make me really unpopular with a bunch of product vendors, I’m afraid. I actually think that ALM is focusing on the wrong thing. If we begin talking about things once it’s entered the software development cycle and we think it’s done once we’re done with it, I think we are taking too narrow a view. We are only looking at the middle piece of a much longer cycle, so my preferred viewpoint is more of a Lean development approach, where I want to look at the entire cycle time from when somebody comes up with a concept until it’s actually deployed into production and generating revenue or reducing costs, whatever it’s meant to do. And in that way we avoid just optimizing this middle piece of a longer lifecycle.

Michael: So thank you for coming by today and I appreciate you taking the time!

Happy to be here!

Feb 13, 2013
