Nowadays, NoSQL databases co-exist with relational databases in enterprise data architecture. However, NoSQL data management currently lacks mature methods and tools to manage NoSQL data as well as relational data.
Most existing NoSQL databases are designed with more consideration on application performance, less on high level business models, data integration and data standardization. There is a gap between data modeling and physical data aspects of NoSQL databases.
In this article, I discuss a solution for managing both NoSQL and relational databases using the Unified Data Modeling techniques. Unified data modeling supports features like document schema of NoSQL databases and reverse engineering of data from an existing database. It also supports visual refactoring of existing databases.
With data growth on 4V’s (Volume, Variety, Velocity, Value), data management is evolving from scaling up to scaling out. Hundreds of thousands of small servers create distributed computing instead of a single powerful machine. Data needs to be transformed to a different model to support distributed computing.
Current relational databases all follow the 3rd normalization. With ACID transaction model (Atomic, Consistent, Isolated, Durable), it is good to use relational databases when one data set has only one copy in the database. It means modifying one copy at a time. However, data needs aggregation when it’s queried from multiple different applications. So data needs to be distributed, and data schema needs to be de-normalized according to the business requirements. Schemas should be designed for enabling distributed query. This requires each data set to contain enough information to run the executed queries separately in different data nodes.
Based on the above, using logical model describing business requirements and de-normalizing schema to physical data model is fundamental when building NoSQL databases. Instead of writing data to NoSQL database directly from program without data modeling.
Also, data is expected to be more varied, so flexibility is needed to match the native format of information, it could be stored in Document, Graph, or Key Value database. The structure of data changes with agile business. But a strong pre-defined schema is limited by this kind of business scenario. Modifying an existing column or adding a new column requires recreating the table in relational databases, but it’s flexible to add new attribute or composite object in NoSQL database.
On the other hand, predefined schema is not enforced before writing data in NoSQL. Schema on read means load the data as-is and apply your own lens to the data when you read it back out. Schema is required for reading and understanding data, but it is a big challenge for those who are not developers of Map Reduce program to understand. This is because schema is hidden in map-reduce program. So most DBAs and data analysts cannot access and understand these schemas. This is why data modeling is the key for the better understanding of enterprise data.
Moreover, streaming data, compared to traditional batch data sets, has different requirements (real-time, append only). Data may also need to be transformed to support multiple concurrent data processing systems. Data model is needed to help understand and re-structure data for data analysis. Data integration system can extract dimension data from original streaming data to data warehouse according to data model design.
Therefore, data model is the key for data architecture to meet business requirements in both RDBMS and NoSQL databases.
ACID (Atomicity, Consistency, Isolation, and Durability) of RDBMS has been and will always be one of the most important requirements in the database. In the future, mixing RDBMS and NoSQL will be a typical scenario in enterprise architecture. Unified Modelset can be used to describe data schema of RDBMS and NoSQL databases.
Here is a summary of the differences between RDBMS and NoSQL databases.
Figure 1. Differences between RDBMS and NoSQL databases
It is a big challenge to manage these differences and leverage both relational and NoSQL databases in different business scanarios. That’s why we need a unified approach to managing these databases.
CA ERwin Unified Data Modeler supports data modeling for RDBMS and two major NoSQL data types (Document and ColumnFamily). It also supports data discovery and data migration between RDBMS & NoSQL databases.
Methodology of Unified modelset managing RDBMS and NoSQL data.
It provides a solution of documenting business requirements and physical data schema, and then manages data across RDBMS and NoSQL databases.
Notation for concept / logical model descript business requirement
- Concept / logical model notation consists:
- Entity
- Properties
- Relationship
- Tags
Figure 2. Concept / Logical model Definition
Basic concept mapping among Logical, RDBMS and NoSQL Data Models
Conceptual / Logical |
RDBMS |
NoSQL |
Entity |
Table |
Collection / Column Family |
Entity instance |
Row |
Document / Row |
Property |
Column |
Key / Column |
Property of an entity instance |
Cell Value |
Field Value |
Domain |
Data type |
Data type (Some NoSQL database no data type, all of value is plain text.) |
Relationship |
Constraint |
Reference, Embedded, Additional table, row across multiple column families. |
Key |
Index |
Index, Additional table, Reference |
Unique identification |
Primary Key |
Row Key |
Unified Modeling Procedure
Business concepts and requirements can be described using logical model which could be transformed to target physical models. This is a transformation process based on query patterns and data production patterns described in logical model. The transformation rule that is pre-defined as schema transformation strategy. And which strategy to be affected depends on the tags describing the query patterns and data production patterns. The tags and rules here can also be modified and customized by user. At the end of the procedure, raw data can be formatted according to the designed physical model and migrate to database as forward engineering.
Business scenarios of unified modelset usage:
- Document data schema of RDBMS and NoSQL, support Reverse Engineering from existing databases.
- Migrate data (forward engineering) between different databases according to designed Unified modelset. Also support refactor of the existing database based on Unified modelset.
- Create Unified modelset from scratch, generating new physical model bound to database, and then wrap up the model as UDBC data source. Manage Data read / write by UDBC for data management.
- Build data warehouse from production database, which is similar to scenario 2. Reverse engineering from existing database, design your data warehouse physical model. Migrate data to finally build the data warehouse.
Here is sample of the database design of a film (movie) library. The business requirement is listed as below:
- A films could be acted by multiple actors and actresses. A film can belong to multiple categories.
- An actor or actress could act multiple films.
- A category could include multiple films.
Entities & Relationships are identified in concept/logical model:
Entities – Actor, Category, Film
Relationships – (Actor – Film N: N); (Film – Category N: N)
Figure 3 is the logical model diagram of a film library described by ER-Diagram. Figure 4 is the physical RDBMS model diagram described by ER-Diagram.
Figure 3. Logical Model with traditional ER diagram notation
Figure 4. RDBMS physical model with traditional ER-Diagram
However, the existing ER Diagram is not enough to describe NoSQL data. So we create Unified modelset for data modeling on RDBMS and NoSQL. It describes business model, query patterns and data production patterns. The modelset consists of one logical model and multiple physical models.
Figure 5. Unified Modelset Definition
Figure 6 is the logical model of the sample of a film library above. Some tags are attached with entities and properties describing data query patterns and production patterns.
Figure 6. Logical model with Unified Modelset notation
Notation of Document physical model in Unified Modelset
In a document database, such as MongoDB or Couchbase, everything related to a database object is encapsulated together as a Document.
- Collection identified as Entity
- Nested-document identified as Entity
- Nested document identified as Relationship between nested and parent document
- Array identified as One-to-Many Relationship
- Reference identified as Relationship
Figure 7 is the document-based physical model diagram. Its notation reflects real data structure (document & nested document). Figure 8 shows the real data with model designed for MongoDB, like the Actor and Category nested into Film as an example.
Figure 7. Document base physical model
Figure 8. Real data persist according to designed model in MongoDB
Notation of Column Family physical model in Unified Modelset
A column family is a NoSQL database that contains columns of related data. It is a tuple (pair) that consists of a key-value pair, where the key is mapped to a value that is a set of columns.
Typical column family databases: HBase, Cassandra
- CF (Column Family) identified as Entity
- Table identified as Entity, it set a relationship with entity identified from parent CF.
- Column Qualifier identified as Property or Index (Tag)
- One row across multiple CFs identified as Relationship between crossed Entities.
Figure 9 is the physical model diagram of the sample of a film library.
Figure 9. Column Family physical model
Query Patterns in Unified Modelset
Query patterns are critical for NoSQL data modeling. De-normalizing logical model to physical model depends on how data is queried. So query patterns need to be described in data model, in which entities are usually queried together. These entities need to be aggregated in NoSQL physical data storage.
- Aggregation, aggregated data are data combining several measurements. When data are aggregated, groups of observations are replaced with summary statistics based on those observations. Its can describe entities.
- One-to-One, used to describe relationship. Each row in one database table is linked to one and only one other row in another table. It is FK (Foreign Key) constraint in RDBMS. It could be embedded in each-other’s document or separate documents in document database.
- One-to-Many, used to describe relationship. Each row in the related to table can be related to many rows in the relating table. For example, a mother can have many children, but a child can have only one mother. It is FK constraint in RDBMS. It could be embedded in each other’s document.
- Many-to-Many, used to to describe relationship. One or more rows in a table can be related to 0, 1 or many rows in another table. For example, a video can be hired by many customer, and customers can hire many videos. It is an additional table in RDBMS. It could be generated as reference in each other document or additional documents in document database.
- Frequent query, used to describe relationship. Aggregating associated entities into a single composite entity if possible. It would be better to create index in physical database.
Production Data Patterns in Unified Modelset
Production data patterns are used to describe physical character of database in production environment.
- Big Volume, used to describe entities, including those containing large number of records. So it would be better to design an aggregated entity.
- Strong Consistency & Eventual Consistency, used to describe entities.
It could be used for Transition Validation,
And entity with “Strong Consistency” should be prevented from being embedded into other entities and replicated.
- ReadOnly / Append / Update, used to describe entities
Frequently-updated data should be prevented from being embedded into other entities, as updating distributed entities could lead to large scale of data to be locked.
“ReadOnly” and “Append” entity could be either embedded in or not.
Schema Inference in Unified Modeler
Since NoSQL database don’t have schema, data schema needs inference from raw data, instead of being extracted directly from database like RDBMS. Methods of schema inference:
- Schema coverage on statistics of records. Statistics approached by
- Whole database records
- Top N records
- Percentage of total records
- Recursion of a depth of N levels
- Machine Learning, feedback collection from UI, dimension building in classifier, and continual improvement of accuracy in schema inference process.
Figure 10. Reverse Engineering from MongoDB in Unified Modeler
UDBC (Unified Database Connector) in Unified Modeler
Once we have Unified Modelset to describe enterprise data architecture, including business requirements in logical models and data schema in physical models, we can extend to data discovery based on data model. UDBC (Unified Database Connector) provides an interface of enterprise data from outside based on unified modelset.
Figure 11. UDBC (Unified Database Connector) architecture
About the Author
Allen Wang is Director from CA technologies ERwin development. He’s responsible for ERwin global engineering and ERwin Unified Modeler product management, leading team successfully released 8 major releases since 8.0. Member of Industry standard committee OMG & DAMA. Driving Big Data research project in cooperation between CA and Tsinghua University. Help product entering new market. Focus on complex big data environment of enterprises, manage data with unified model and data mining. Five patents owner relative with data modelling and NoSQL. Guest tutor of Fudan University, teaching big data lesson for master degree student. This article is also contributed by ERwin development team and edited by Xiaoyuan Yuan.