We are here today in Las Vegas at the RICON distributed systems conference with Dave McCrory, CTO of Basho Technologies, host of the conference.
InfoQ: We spoke yesterday to Basho CEO Adam Wray and got the business perspective. But can you give us some technical perspective on Riak?
Dave: Sure, Riak is a distributed database. It falls into the NoSQL category. It is primarily written in Erlang, but you can access the data in any language. There’s what we call Riak Core, which you can think of as all of the really complex data algorithms along with how you do things like clustering and other components of a distributed system. Out of that is Riak KV, which is the key-value component. The actual flow would be that a client makes API calls, say to read or write data, via Protocol Buffers. Those calls get routed through that Riak Core that I described. The actual data is stored in one of three storage engines that we currently support:
- Bitcask, which we actually wrote and that specializes more in a lower memory utilization and is capable of doing a different type of overall load than something like LevelDB.
- LevelDB, which can deal with a lot more memory and will consume more memory but will give you some performance advantages. It offers tiering, where a customer could opt to put the most frequently accessed data in faster storage and less frequently accessed data in slower, spindle-based storage.
- In-memory store, which gives you the advantages of working in memory, but you take on the risk of volatility.
So you have all of these trade-offs and the back-end storage engines options follow a similar pattern to what traditional databases have today. There are many relational databases that give you different storage engine options.
In addition, we recently added a search capability that’s based on Solr/Lucene, and that’s running in the JVM. A Solr instance or a Solr core runs on each of the Riak nodes. This gives you full text search and indexing on top of Riak. So when you are doing writes into Riak that are supposed to be indexed as the write comes in, the data is also sent to Solr where it’s indexed so you can search on it.
InfoQ: How do you keep those two stores in sync?
Dave: The consistency is done a couple of ways. Inside Riak itself, by default we save three copies of the data. The three copies are written to three different physical nodes. Then we check the data and based on configurable number, we decide that the write is complete. For example we can return a success after say two of the writes are complete. If the setting is three we can say all three must be complete before we return a success. But what happens after that and how you ensure consistency, is what is missing from a lot of data stores today. Most of them stop there and don’t then ensure correctness of the data. We have the technology that we call Active Anti-Entropy (AAE) where we are constantly looking at read requests, and periodically looking at all the content inside each of the nodes and comparing all three copies of the data to ensure that they are actually synchronized and correct.
InfoQ: But a database is a moving target, how do you verify the synchronization?
Dave: Not all of the data is constantly being mutated at any given time. You have some data that’s at rest and some that’s in flight. So Riak automatically periodically checks data that’s at rest. For data that’s being read, we can provide the answer to the requester and check the other copies to see whether you would get the same result. If the answer is yes, then you know that the data is correct. If the answer is no, you need to take a corrective action. So if you have two of the nodes with a value of say, one, and one that’s valued at zero, then you likely need to correct the one that’s valued zero, unless there’s been a write in between. That allows you to maintain a level of correctness and the fact that we are actively doing it addresses a big operational challenge. And we are effectively doing the same thing with Solr, independently. We also have a version of AAE that lets us check the correctness of our Solr indexes to ensure the data is correct and up to date. There are only a few reasonable strategies to do this but a lot of the NoSQL stores just punt on this entirely and ignore it, and assume that all writes all the time will always be correct and that simply isn’t true; it defies the laws of physics.
InfoQ: What you’re describing sounds similar to a distributed cache. How is that distinguished from say a Gemfire or Oracle Coherence?
Dave: Three of the challenges Riak addresses are availability, persistence and replication of data. Gemfire is interesting. Gemfire is an in-memory distributed object store, which allows you to do things concurrently. It’s very high performance. The financial sector are the heaviest users of Gemfire, at least that I know. Gemfire though looks at the world as an object, and you have to decide what you want the behavior for conflict resolution of your objects to be. Is it a NoSQL store? It depends how you want to classify it. Because it is purely in-memory you don’t get long term persistence, and it’s making the assumption that you’re going to be satisfied with volatility. They do have another product that’s based on Gemfire called SQLFire, and that simply takes Gemfire and provides a SQL interface over it, allowing you to persist the data in a similar fashion to a more traditional SQL database. But that exposes a new set of problems and you still have to deal with replication and you have no guarantee of correctness with that either. Riak has built in replication both at the cluster level and even across data centers.
With Oracle Coherence, again you are talking about what you might call an in-memory data grid. There, you’re dealing with trying to replicate data, but in many cases you have locking. How you deal with resolution in Gemfire with locks and how you deal with Coherence with locks is different than what we’re doing in Riak, where we don’t really do locking at all. So there is no lock or mutex that we are putting in place in Riak. You’re always able to write the data, which means the application is always available for writes and if you write at two sites that conflicts, we allow that. Then, depending on what the data is, if you use something like CRDTs it may be possible for us to resolve it or solve it for you, without you having to do any work. If that is not something you want, we also have the ability to simply present you with the fact that there is a conflict, and that there are multiple bits of data, and we can provide you with all of the choices. We can say, you have written A and B to this key and you can resolve the conflict yourself programmatically.
InfoQ: Can you talk more about the CRDTs
Dave: CRDTs are about data types. There are multiple descriptions for what CRDT stands for: Convergent Replicated data types, or Commutative Replicated Data types, Conflict Free Replicated Data Types. It is a way of choosing a set of semantics around say a flag or a binary value, or something else where it doesn’t matter what number of operations you perform, ultimately there is a single resolution you can come to automatically. So for a flag, it might simply be that it is either flagged or unflagged, and by having that, you could for example say “flag wins.”
Think of CRDTs as data types that ease the development effort in two ways. First, they are built into Riak so the developer doesn’t have to create them, and more importantly, the conflict resolution is also built in so Riak can resolve it automatically.
InfoQ: Is CRDT a Basho concept?
Dave: No it is a research concept that has come out of several papers. There’s a canonical research paper "CRDTs: Consistency without concurrency control", But Basho has worked closely with the researchers and made them available in the most recent release of Riak.
InfoQ: Can you talk about how this all plays into reliability, scalability, availability, security, etc?
Dave: We focus on availability, scalability and correctness over anything else. To achieve that, we have been more on the availability and partition tolerance side of things, so in the CAP theorem, more AP vs. C (by providing an eventually consistent system). Recently in 2.0 we introduced a version that could provide tighter consistency as well. This is something we are in the process of maturing so that we can also provide a strong consistency solution; let people decide what level of consistency is appropriate for their use case, which is really where all of this heads. Depending on what data I am trying to store and what my requirements are, will determine the level of consistency. And you’re necessarily making a trade-off, so greater consistency might lower the performance. But what is key is putting the choice in the hands of the developers to use the right type of consistency for the task at hand from the same platform.
InfoQ: You said Riak is written in Erlang. Why Erlang?
Dave: Erlang has been around for a very long time. It was designed by Ericsson for phone switches, because you never want your phone switch to go down. We have taken Erlang and adopted it to be distributed and scalable. Erlang is really good for handling network failures because it is by default a mesh topology. It is not perfect though; the Erlang community is good and vibrant, while the Java community size is an order of magnitude greater. But the people in the Erlang community are incredibly smart and talented individuals, and we have leveraged Erlang to our advantage thanks to its capabilities. We have also stayed away from its weaknesses, which is why we are leveraging C to perform low level tasks such as writing data in storage engines. So writing something to disk is handed off to Bitcask or LevelDB. That’s simply not a strength of Erlang, so we are leveraging the right tool for the job, instead of trying to use a single tool and making it solve all problems. And with Riak 2.0 we implemented Solr in a JVM which shows that we can bring on Java developers to help advance the platform as well.
InfoQ: What about ACID transactions?
Dave: Today we do not support ACID transactions. At some point it will come but not today. There are so many new use cases to solve before ACID transactions. If you need ACID transactions, you can put those in a traditional database, and source all other work to a NoSQL solution. You could probably scale that pretty well; that’s been the strategy for big cloud providers at scale.
InfoQ: How does Riak scale to handle multi petabyte repositories?
Dave: Riak scales extremely well; it is designed to handle very, very large sets of data, into the many PB’s. We have several customers that have already gone into the multi-PB scale scenarios. It doesn’t mean there are no limits to the scale, but we have seen that there are enough points of scalability to let us do a very large number of things in the tens of petabytes, which is the sweet spot for the large data sets. We are not seeing people attempting to go beyond that very often.
InfoQ: I have seen some of the presentations, and spoken to some of your developers here at RICON, and it seems like you have some amazing talent here!
Dave: We have spent a great deal of effort cultivating our talent pool, and we have an incredible talent pool. They are solving very difficult computer science problems. That adds to the value of Riak: we have tackled a lot of difficult problems that some companies have decided to punt on, and that’s going to come back and bite them, especially if they didn’t design their systems up front to deal with some of these issues. That’s a level of talent we are very proud of at Basho.