Thanks to the NoSQL movement, data storage is no longer a solved problem. Many are working hard to build new storage solutions, and even more are willing to use them. On the flip side, if you are a programmer or a solution architect who needs a data storage solution for an application, you face the daunting task of understanding and weighing the tradeoffs involved before making a decision. This article explores the data needs of end-user applications and the associated tradeoffs. It provides guidelines and selection criteria that enable architects and developers to make an informed choice among data storage options.
Introduction
A few years ago, most systems were small and relational databases could handle them without any trouble, so the storage choices for architects and programmers were simple. However, the size and scale of these systems have grown significantly over the last few years. High-tech companies like Amazon and Google faced the challenge of scale before others, and they soon observed that relational databases could not scale to handle their use cases. Eric Brewer's CAP theorem explains this phenomenon: a distributed system can have only two of the three properties Consistency, Availability, and Partition tolerance. Motivated by the CAP theorem and Werner Vogels' (Amazon CTO) work on loose consistency, Google and Amazon came up with two storage solutions called Bigtable and Dynamo respectively. These solutions scale by relaxing ACID (Atomicity, Consistency, Isolation, Durability) guarantees. They gave rise to the NoSQL movement, which in turn produced many new data storage solutions.
If the pre-NoSQL era was boring for people who built storage solutions, the NoSQL/NewSQL era has been exactly the opposite. Architects now have a wide variety of storage choices: local memory, relational databases, files, distributed caches, column family storage, document storage, name-value pairs, graph DBs, service registries, queues, tuple spaces, etc.
Programmers now face the same dilemma we all face when choosing a TV at Best Buy - too many choices. Different applications have different needs from a data store. For example, an online retail application needs transactions, while a search application like Google needs high scalability and eventually consistent data. The recommendations in this article provide a good starting point for programmers and solution architects who want to find the right solution in the data storage haystack. I believe the reader will walk away with a broader understanding of data storage models and of which model should be used when.
Data Storage Requirements
To decide on the best data solution for a given application, we should understand what the application expects from a data storage solution. Some expectations, like reliability of data, data security, vendor-agnosticism, availability, and a simple user experience, are shared by most applications. Others, however, like the type of stored data and the support needed for search, vary from application to application, and we shall focus on those. For this discussion, we will focus on the following characteristics of an application.
- Types of data stored
- Scalability requirements
- Nature of data retrieval (i.e. Types of Queries)
- Consistency requirements
Let's discuss each decision parameter in detail, and break the values of each property into several categories so that we can analyze them case by case.
Decision Parameters in Selecting a Data Storage Solution
Types of Data
Data is usually one of three types: unstructured, structured, or semi-structured.
Unstructured Data: We classify data as unstructured when the application or the storage that handles it does not understand its structure. Examples of unstructured data are documents written in natural language, images, videos, and raw observational data. Storage handles unstructured data as files or blobs (sequences of bytes). Applications either present unstructured data as-is to humans who make sense of it (e.g. the Web) or convert it to structured data through data processing (e.g. AI, statistical techniques, and the semantic web).
Structured Data: Structured data has a well-defined structure (e.g. an employee record in a database), and it is possible to extract or search by specific parts of the data. A good example of structured data is a relational database, where each column is predefined to have a type and a name.
Structured data always has an associated schema that describes its structure. Often, we define the schema beforehand to validate data, to provide user-friendly interfaces/APIs, and to improve performance. Any data stored in a given relational database has to match the schema of that database.
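To make the role of a schema concrete, the following is a minimal sketch in Python of schema-driven validation; the "employee" schema and field names are invented for illustration. A relational database enforces this kind of check automatically on every insert.

```python
# A minimal sketch of schema-driven validation, using a hypothetical
# "employee" schema. Relational databases enforce this automatically;
# here we do it by hand to illustrate the idea.
EMPLOYEE_SCHEMA = {"id": int, "name": str, "salary": float}

def validate(record, schema):
    """Reject records whose fields are missing or have the wrong type."""
    if set(record) != set(schema):
        raise ValueError("record fields do not match schema")
    for field, expected_type in schema.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"field {field!r} is not a {expected_type.__name__}")

validate({"id": 7, "name": "Alice", "salary": 65000.0}, EMPLOYEE_SCHEMA)  # ok
# validate({"id": "7", "name": "Alice"}, EMPLOYEE_SCHEMA)  # raises ValueError
```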
Relational databases use a two-dimensional schema. However, there are other schema models, like the star schema often used in data mining (OLAP) use cases.
Semi-structured Data: Semi-structured data has some structure (e.g. a tree, a graph, or a list) in terms of relationships or hierarchy. However, it defines only an abstract structure. For example, XML syntax and semantics define relationships between data elements, but they do not limit the type of data included within an XML document. Some examples of semi-structured data are XML, graphs, name-value pairs, and HTML documents.
Need for Scalability
A system is scalable if it can handle a higher load (in terms of users, required throughput, size of files, etc.) given more resources. Clifford Neuman's paper "Scale in Distributed Systems" discusses scalability in detail.
There are two ways for a system to scale: vertical scalability and horizontal scalability. Vertical scalability adds more resources to the same computer (e.g. more memory, a faster machine) and typically needs no changes to the application or special features from it; it is almost orthogonal to the application, but hardware limits the amount of resources possible within a single server. Horizontal scalability adds more nodes (computers) to the system, and support for it needs to be planned from the early stages of the architecture lifecycle. Therefore, we focus only on horizontal scalability in this discussion. To facilitate the discussion in this article, we define three scalability classes.
- Low scalability (2-5 nodes)
- Scalable (10s of nodes)
- Highly scalable (100s to 1000s of nodes)
The low scalability class roughly denotes the limits of RDBMSs, which can be scaled by adding a few replicas; data synchronization is expensive, however, and RDBMSs usually do not scale beyond 2-5 nodes. The "scalable" class roughly denotes data sharded (partitioned) across many nodes, and "highly scalable" means ultra-scalable systems like Google's.
It is worth noting that "the number of nodes" in these definitions means the number of nodes in the data storage tier, not in the application tier. For example, in some use cases even the low scalability setup may support hundreds of application nodes if the load is small.
Nature of Retrieval
Applications store data for three primary use cases: archival, analytics (OLAP), and online transaction processing (OLTP). Archival use cases store data as a safeguard or for history, OLAP use cases analyze a collection of data to compute some result, and OLTP use cases are driven by end users, with clients waiting at the other end for a response.
Data storage solutions support data retrieval in different ways.
- Sequential access - the client accesses data item by item; this model does not provide random access. It is mainly used in archival use cases.
- Key-based access - this model stores each data item against a key, and the client later retrieves the data item using the same key. This model provides random access but does not support searching.
- Query-based access - the user provides a query, and the data storage returns data that matches the query. User queries come in many forms.
- Search by properties (WHERE clause equivalent)
- JOIN based queries (they may include WHERE clause). These queries join data from multiple tables together.
- Semi-structured data supports queries that use the underlying data structure. XQuery/XPath for tree-like data and RDF Data Query Language for triple data are examples of such query forms.
- Temporal queries - OLAP use cases also have very complex queries, which often involve temporal (time-based) conditions (e.g. give me the maximum and minimum temperature over the last 6 hours).
OLAP queries often take a long time to process the data and therefore are better implemented using methods like MapReduce.
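As an illustration, the temporal query above (minimum and maximum temperature over the last 6 hours) maps naturally onto the map/reduce style. The sketch below is a minimal single-process Python illustration; the readings and values are invented, and a real system would shard both phases across many nodes.

```python
import time

# Hypothetical readings: (unix_timestamp, temperature_celsius),
# taken 1, 3, 5, and 8 hours ago respectively.
readings = [(time.time() - h * 3600, t)
            for h, t in [(1, 21.5), (3, 19.0), (5, 23.2), (8, 18.1)]]

cutoff = time.time() - 6 * 3600

# "Map" phase: keep only readings inside the time window.
in_window = [temp for ts, temp in readings if ts >= cutoff]

# "Reduce" phase: fold the selected values into min/max aggregates.
print(min(in_window), max(in_window))  # 19.0 23.2
```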
Users retrieve unstructured data using keys, and unstructured data does not directly support search. There are two main methods to search unstructured data.
- a. Extracting or associating metadata with unstructured data items (e.g. attributes of a file), and using them to search for data items.
- b. Building an index based on the content of the data items. Although the exact meaning of each data item is not known, an index over its content can still answer useful queries. For example, most document search systems work in this manner by indexing words.
Semi-structured data cannot support detailed search the way structured data does, but it can support search using its structure. For example, we can query XML using tools like XPath and XQuery, and graph-based structures using graph searching techniques.
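For example, Python's standard library supports a subset of XPath, which is enough to show how queries over semi-structured data follow its structure rather than a fixed schema. The XML snippet here is invented for illustration; note that the two employee elements do not even share the same fields.

```python
import xml.etree.ElementTree as ET

# An invented XML fragment; the elements need not share a schema.
doc = ET.fromstring("""
<staff>
  <employee dept="eng"><name>Alice</name></employee>
  <employee dept="sales"><name>Bob</name><phone>555-1234</phone></employee>
</staff>
""")

# XPath-style query: find names of employees in the "eng" department.
for name in doc.findall("./employee[@dept='eng']/name"):
    print(name.text)  # Alice
```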
Finally, structured data mainly supports query-based retrieval, although key-based retrieval is also supported. We use the following four classes of data retrieval for structured data.
- Retrieve by Key
- Retrieve by WHERE clause or its equivalent
- Need Join Queries
- Retrieve by offline queries
Consistency
We talk about consistency in terms of data operations like adding, editing, or deleting data. By consistency, we generally mean three guarantees. First, failed operations do not leave side effects: data changes are either fully applied or not applied at all. Second, when multiple users write or read concurrently, data changes are applied such that readers always see each write as one atomic operation; in other words, readers never see an intermediate value. Finally, when there are multiple copies of the same data, the system applies updates to all copies in the same order, ensuring consistency.
We define three classes of consistency based on scope and how fast the changes are applied.
- Eventual Consistency: each data copy eventually comes to the same consistent state given enough time.
- Consistent Operations: each single operation is applied in an atomic manner. The difference between this and transactions is that transactions can apply multiple operations atomically.
- Transactions: a series of operations can be applied with ACID guarantees.
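The difference between the last two classes is easy to see in code. Below is a minimal sketch assuming a simple in-memory store guarded by a lock, with invented account data: a consistent operation makes one change atomic, while a transaction groups several changes (here, a transfer between two hypothetical accounts) into one atomic unit that either fully applies or leaves no side effects.

```python
import threading

balances = {"alice": 100, "bob": 50}  # invented data
lock = threading.Lock()

def deposit(account, amount):
    """A single consistent operation: readers never see a half-applied update."""
    with lock:
        balances[account] += amount

def transfer(src, dst, amount):
    """A transaction: both changes apply together, or (on failure) not at all."""
    with lock:
        if balances[src] < amount:
            raise ValueError("insufficient funds")  # no side effects left behind
        balances[src] -= amount
        balances[dst] += amount

transfer("alice", "bob", 30)
print(balances)  # {'alice': 70, 'bob': 80}
```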
Choosing the Right Storage Solution
This section provides data storage recommendations. Each of the following subsections considers unstructured, semi-structured, and structured data in turn; the other properties (scalability, consistency, and the types of queries needed) are discussed within those sections.
Handling Unstructured Data
[Table 2: Recommendations for unstructured data]
Table 2 shows the recommendations for unstructured data, where file systems are the de facto solution. Different file system architectures provide different levels of scalability. For instance, file systems like the Network File System (NFS) have limited scale and work well only within a LAN, but systems like Lustre are scalable and work across WANs. On the other hand, when individual files are not very large, key-value based systems can be used, and they are massively scalable.
Due to the lack of structure, unstructured data only supports search through metadata or through a content index. In the table above, different rows show the different search methods.
Metadata is data about files, and it is often represented as name-value pairs, although in some cases the values can be of a complex type. File systems keep metadata, but it is often limited to a default set of properties. Metadata catalogs keep track of user-defined metadata about files and facilitate searching for files based on that metadata.
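A minimal sketch of metadata-based search over files, using only the default metadata a file system keeps (size and modification time); a metadata catalog generalizes this idea to arbitrary user-defined attributes. The thresholds below are invented.

```python
import os
import time

def find_recent_large_files(root, min_bytes, max_age_seconds):
    """Search files by metadata alone, without reading their contents."""
    cutoff = time.time() - max_age_seconds
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            if info.st_size >= min_bytes and info.st_mtime >= cutoff:
                yield path

# Files over 1 MB modified within the last day.
for path in find_recent_large_files(".", 1_000_000, 24 * 3600):
    print(path)
```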
Another alternative is building an index: a mapping from substrings (often words) in files to the files that contain them. Search on substrings is then supported through the index. Google uses this model to great effect.
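A minimal sketch of such an inverted index in Python: it maps each word to the set of files containing it, so a search never scans file contents at query time. The document contents here are invented.

```python
from collections import defaultdict

# Invented document contents keyed by file name.
documents = {
    "a.txt": "distributed storage scales out",
    "b.txt": "relational storage favors consistency",
}

# Build the inverted index: word -> set of files containing it.
index = defaultdict(set)
for filename, text in documents.items():
    for word in text.lower().split():
        index[word].add(filename)

print(index["storage"])                 # {'a.txt', 'b.txt'}
print(index["storage"] & index["scales"])  # {'a.txt'} -- AND of two terms
```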
Since unstructured data storage does not understand the structure of the data, it does not provide any consistency guarantees for the data stored in it. Parallel access is often handled through locks, where one process has exclusive access to a file while the data is being updated. The Google File System has the concept of append-only files, to which many processes can append records concurrently.
Semi-Structured Data
[Table 3: Recommendations for semi-structured data]
Table 3 above depicts the recommendations for storing and managing semi-structured data. Semi-structured data comes in many forms (e.g. XML, graphs), and the rows in the table represent those types. Each type often has data storage optimized for it, and that is often the best storage choice for the type.
However, data should only be stored as semi-structured data if one wants to store it as-is and query it through its structure. Otherwise, if the structure of the data is known, it is better to convert it to structured data and store it in structured form. For example, receiving XML as requests does not mean that one should use an XML database as the storage mechanism. Most of the time, the schema of the XML is well known, and it makes sense to extract the data and store it in structured form.
At the same time, if the application never needs to search within the XML through XPath, the XML can be stored in unstructured storage, as XML database features are of no use to the application. XML databases help only when there is XML whose schema is not known and the application needs to search within that data using tools like XPath. The same is true for many semi-structured data types. In addition, it is worth noting that semi-structured data stores are not widely used and have not been proven much in production. Therefore, they should not be used unless they provide some advantage over the structured model.
A good example of using semi-structured storage is storing a social graph. Since a social graph does not fit nicely into a structured model like the relational data model, and since applications need to perform graph-based queries that are often very inefficient under the relational model, it often makes sense to keep social graphs in a graph database.
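The query patterns that make graph storage attractive look like the following sketch: a friends-of-friends traversal over a small invented social graph, a query that would require repeated self-joins in a relational model.

```python
from collections import deque

# An invented social graph as an adjacency list.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob", "eve"},
    "eve": {"dave"},
}

def within_hops(start, max_hops):
    """Breadth-first traversal: everyone reachable within max_hops edges."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if hops == max_hops:
            continue
        for friend in friends[person] - seen:
            seen.add(friend)
            queue.append((friend, hops + 1))
    return seen - {start}

print(within_hops("alice", 2))  # {'bob', 'carol', 'dave'}
```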
Structured Data
[Table 4: Recommendations for structured data]
Table 4 summarizes the recommendations for structured data. It presents three dimensions, scale, consistency, and search support, in a 2D form. The rows show different types of queries. We categorize columns first by scale and then by consistency. Each cell provides recommendations for a specific scale, consistency model, and type of query. The acronyms used in the table are as follows:
KV = Key-Value Systems, CF = Column Families, Doc = Document Based Systems
The recommendations in the table are based on the following general guidelines:
- Only databases have out-of-the-box support for transactions. It may be an option to implement transactions using a transaction manager (e.g. Atomikos, ZooKeeper), but that is likely to be slower than DB transactions.
- NoSQL solutions do not support JOIN operations, except in offline processing where MapReduce can be used.
- If the use case doesn't need search functionality but just key-based retrieval, the key/value model is the best choice. Since key/value stores are often implemented as Distributed Hash Tables (DHTs), they are highly scalable and fast.
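The reason key/value stores scale so well is visible in a few lines: hashing the key immediately identifies the owning node, so a lookup touches exactly one machine. A minimal sketch follows, with invented node names; a real DHT would use consistent hashing so that data movement is limited when nodes join or leave.

```python
import hashlib

nodes = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical storage nodes

def owner(key):
    """Hash the key to pick the node that stores it -- O(1), no coordination."""
    digest = hashlib.md5(key.encode()).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]

print(owner("user:42"))  # the single node to contact for this key
print(owner("user:43"))  # likely a different node -- load spreads by hashing
```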
Let's consider each case in detail.
For the small-scale category, relational databases (DBs) are the de facto solution. Since databases are widely used, accepted, and battle-tested, you should use them unless you need more scalability. However, in cases where transactions and joins are not required, one can use key-value storage, column families, or document databases; one potential reason to do so is a plan to scale out later. Unless there are perceivable benefits, though, one should stay with databases. A typical deployment in this case is a database cluster that replicates data across 2-3 nodes. This provides significant benefits if the application has a read-dominated workload, but the benefits of clustering are less significant for write-dominated use cases.
When search is not needed, key-value based systems are the best choice. Column families are also useful, but the former is more scalable. The only exception is when transactions are needed, in which case either a database or a custom transaction implementation is required.
Offline data processing cases (such as data warehouses, Business Intelligence (BI), and indexing) are not affected by most of the parameters, with transactions being the only exception. There are custom multi-dimensional schemas (e.g. the star schema) for these cases. Another alternative is to use NoSQL solutions like document-based systems and column families. In either case, joins can be supported through MapReduce queries.
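The following sketch shows how an offline join works in the MapReduce style: both data sets are mapped to (join key, record) pairs, and the reduce step combines records that share a key. This is a single-process Python illustration with invented data; a real system would shard the map and reduce phases across nodes.

```python
from collections import defaultdict

orders = [(1, "order-A"), (2, "order-B"), (1, "order-C")]  # (customer_id, order)
customers = [(1, "alice"), (2, "bob")]                     # (customer_id, name)

# Map phase: tag every record with its join key and group by that key.
grouped = defaultdict(lambda: {"orders": [], "names": []})
for cid, order in orders:
    grouped[cid]["orders"].append(order)
for cid, name in customers:
    grouped[cid]["names"].append(name)

# Reduce phase: emit the cross product per key -- the joined rows.
for cid, group in grouped.items():
    for name in group["names"]:
        for order in group["orders"]:
            print(cid, name, order)
```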
JOIN queries work reliably only with databases, and do not work with large-scale systems. Parallel databases have explored this topic and developed methods to join two data sets that reside on different machines. However, they have limited scalability and do not work with all types of queries and data. Hence, this approach might work with medium-scale systems when the amount of information that needs to be transferred while carrying out a join is small.
NoSQL storage like column families and document databases does not inherently support searches with WHERE clauses; support for those is implemented in a scatter-and-gather style, where queries are scattered to all nodes and the results are collected. This approach can handle medium-scale systems, but it is unlikely to work at large scale.
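Below is a minimal sketch of scatter-and-gather over invented in-memory shards: the WHERE-style predicate is sent to every shard, each shard filters locally, and the results are merged. The cost of touching every node on every query is what limits this approach at large scale.

```python
# Invented data spread across three shards.
shards = [
    [{"id": 1, "city": "NYC"}, {"id": 2, "city": "LA"}],
    [{"id": 3, "city": "NYC"}],
    [{"id": 4, "city": "SF"}, {"id": 5, "city": "NYC"}],
]

def scatter_gather(predicate):
    """Scatter the predicate to every shard, gather and merge the matches."""
    results = []
    for shard in shards:  # in a real system these queries run in parallel
        results.extend(row for row in shard if predicate(row))
    return results

print(scatter_gather(lambda row: row["city"] == "NYC"))  # rows 1, 3 and 5
```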
Use Cases
This section discusses some use cases, potential data solutions for those cases, and the rationale behind each choice of data storage solution.
- Document/Video/Image Repositories: Since the system serves the data as-is and does not understand it, these are typically handled through either file systems or metadata catalogs. Most use cases support searching documents through text indexing, and support searching videos and images by associating metadata with each item and storing it in a metadata catalog. For example, most web sites are served from a file system, and most news agencies keep a repository of images and videos annotated with metadata for quick retrieval.
- Social Networks and Relationships: Social graphs (e.g. Facebook), dependency graphs, and relationship graphs have a graph structure as their basis. Furthermore, most queries on that type of data can be cast as graph searches. Therefore, semi-structured graph data storage is the best choice for such use cases.
- Analytical Processing: Analytical processing needs very fast write throughput to collect data, and then walks through the collected data to process it and generate results. These results are either presented directly to users or accessed by them at a later time. Column family storage is often used for these use cases, as it provides high write throughput. Since these use cases run data processing queries in an offline mode, the lack of support for JOINs in the data storage implementation does not cause problems; complex processing can often be done using MapReduce methods.
- Web-based Retail Systems: Retail systems generate many read queries compared to the few write queries that actually buy things. One potential design keeps the data about transactions in an RDBMS while keeping the data about items in a column family store. The column family store serves the data for most listings, while most of the data about buying an item is stored in the RDBMS.
Conclusion
This article provides data storage recommendations based on four decision parameters for a given application: type of data, scalability, consistency, and supported queries. We also describe the rationale for these choices while providing rules of thumb useful to architects. It is worth noting that some use cases need more than one type of data storage, as they inherently involve several types of data; it makes sense to use two or more data storage types in those scenarios. The analytical processing use case is an example. However, such hybrid solutions cannot support transactions across multiple data stores; if needed, transactions can be added using an external transaction manager.
It is likely that there are other decision parameters and aspects of storage decisions not covered in this article. We would appreciate your feedback on any such omissions or improvements.
About the Author
Srinath Perera works as a Senior Software Architect at WSO2 Inc., where he oversees the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at the Lanka Software Foundation and teaches as visiting faculty at the Department of Computer Science and Engineering, University of Moratuwa. He is a co-founder of Apache Axis2, has been involved with the Apache Web Services project since 2002, and is a member of the Apache Software Foundation and the Apache Web Services project PMC. Srinath is also a committer on the Apache open source projects Axis, Axis2, and Geronimo.
His primary research interest is scale in distributed systems. He has worked on Web Services, SOA, Grid, Cloud, systems management, and e-Science, and he has published many peer-reviewed papers as well as technical articles.
Srinath received his Ph.D. and M.Sc. in Computer Science from Indiana University, Bloomington, USA, and his Bachelor of Science in Computer Science and Engineering from the University of Moratuwa, Sri Lanka.