NoSQL data stores offer alternative data storage options for non-relational data types like document based, object graphs, and key-value pairs. Can a distributed cache be used as a NoSQL store? Greg Luck from Ehcache wrote about the similarities between a distributed cache and a NoSQL data store. InfoQ caught up with him to talk about this use case and its advantages and limitations.
InfoQ: Can you explain how a distributed caching solution is comparable with a NoSQL Data Store?
Greg Luck: A distributed cache typically keeps its data in-memory and is designed for low latencies. A NoSQL store is a DBMS without the R (relations), and usually also lacks support for transactions and other sophisticated features. It is the joins of table relations that are the most problematic part of SQL for systems that don't support relations, and the source of how the NoSQL moniker comes about.
One type of NoSQL store is a key-value store. Some examples are Dynamo, Oracle NoSQL Database and Redis. Caches are also key-value stores, so these two are related. While many cache implementations can be configured for persistence, much of the time they are not because the system of record is in a database, and they are aimed at performance rather than persistence. In contrast, a NoSQL store is always aimed at persistence.
But a persistent cache can be used in much the same way as a key-value NoSQL store. NoSQL is also addressed at Big Data, which usually means bigger than you want to put in a single RDBMS node. It is typically thought of as starting at a few terabytes and moving up into petabytes.
Distributed Caches are typically aimed at lowering latencies to valuable transactional data which tends to be smaller in size perhaps ending much where Big Data begins. Because caches store data in memory there is a higher cost per unit stored which also tends to limit their size. If they are relying on heap storage they might have as little as 2GB per server node. It depends on the distributed cache. Ehcache stores off-heap as well and stores hundreds of GBs per server for multi-terabyte caches.
A persistent, distributed cache can address some NoSQL use cases. A NoSQL store can also address some caching use cases, albeit with higher latencies.
InfoQ: What are the similarities between Distributed Cache and NoSQL DB from an architecture stand-point?
Greg: Both want to deliver more TPS and scalability than an RDBMS. To do so, both do less than an RDBMS with the idea that simplifying the problem and leaving out problematic things like joins, stored procedures and fixed ACID transactions gets you there.
Both tend to use proprietary rather than standardized interfaces, although in the area of Java caches we now have JSR 107, which will provide a standard caching API for Spring and Java EE programmers.
Both scale out using partitioning of data transparent to the clients. The non-Java products also scale up quite well. With Terracotta BigMemory we are pretty much unique in scaling up on the Java platform. Finally both deploy on commodity hardware and operating systems making them ideal for use in the cloud.
InfoQ: What are the architectural differences between these two technologies?
Greg: NoSQL and RDBMS are generally on disk. Disks are mechanical devices and exhibit large latencies due to seek time as the head moves to the right track and read or write times dependent on the RPM of the disk platter. NoSQL tends to optimise disk use, for example, by only appending to logs with the disk head in place and occasionally flushing to disk. By contrast, caches are principally in memory.
NoSQL and RDBMS have thin clients (think Thrift or JDBC) that always send data across the network whereas some caches like Ehcache use in-process and remote storage so that common requests are satisfied locally. In a distributed caching context, each application server will have the hot part of the cache in-process so that adding application servers does not necessarily increase network or back end load.
RDBMS is focused on being the System of Record ("SOR") for everyone. NoSQL comes in flavors and so wants to be the SOR for a specific type of data, namely key-value pairs, documents, sparse databases (wide columnar) or graphs. Caches are focused on performance and are typically used with an RDBMS or a NoSQL store where the type of data is SOR. But oftentimes a cache might store the result of a web service call, a computation of even a business object that might require tens or hundreds of calls to an SOR to build up.
Caches like Ehcache tend to run partly within the operating system process of the application and partly in their own processes on machines across the network. Not all of them do this: memcache is an example of a cache that only stores data across the network.
InfoQ: What type of applications are best candidates for using this approach?
Greg: Following on from the earlier question, distributed caches fit in with your existing applications. They often require very little work to implement whereas NoSQL requires a large, jarring architectural change.
So the first types of applications suitable to distributed caching are the ones you already have and in particular applications that need to:
- scale out because of increasing use or load
- have lower latencies to achieve their required SLAs
- minimize use of very expensive infrastructure such as mainframes
- reduce their use of paid for web service calls
- cope with extreme load peaks (e.g. Black Friday sales events)
InfoQ: What are the limitations of this approach?
Greg: Caches, being in memory, are limited in size by the cost of memory and their technical limitations in how much memory they can use (more on this below).
A cache, even if it does provide persistence, is unlikely to be a good choice for your system of record. They purposely avoid sophisticated tooling for backup to and recovery from disk although simple tools may exist. RDBMSs have rich backup, restore, migration, reporting and ETL features, which have been developed over the last 30 years. And NoSQL tends to lay in between.
Caches provide programming APIs for mutating and accessing data. By contrast, NoSQL and RDBMS provide tools that can execute scripting style languages (e.g. SQL, UnSQL, Thrift).
But the key thing to remember is that the cache is not trying to be your SOR. It coexists very easily with your RDBMS and for that reason simply does not need the sophisticated tools of the RDBMS.
InfoQ: How do you see the distributed cache solutions, NoSQL Databases as well as the traditional RDBMS working together in the future?
Greg: Because distributed caches are anywhere from 1 to 3 orders of magnitude faster than RDBMS or NoSQL depending on deployment topology and data access patterns, those looking for lower latencies will use a cache as a complement to NoSQL just as they do today with RDBMS.
Where there will be a difference is that it is often hard to scale/or there is impact on the programming contract or CAP tradeoffs when you try to scale an RDBMS beyond a single node, whereas with NoSQL you treat it as a multi-node installation even when you are running against one node. So when you scale up there is no change to these things. With RDBMS a cache is added to avoid these scale out difficulties. Often the cache solves the problem to the capacity requirements of the system and you need to go no further. So the cache gets added when scale out is needed.
For NoSQL, scale out is built-in, so the cache will get used when lower latencies are required.