Key Takeaways
- The advent of Big Data is characterized by the 5Vs: Volume, Variety, Velocity, Veracity, and Value
- DataBase Plus allows applications to communicate with standardized services using SQL queries. It also allows for feature plugins.
- Database Plus is a technology and development concept compatible with any database, that can eliminate switching costs and vendor lock-in.
- Apache ShardingSphere is an open source ecosystem that can create a distributed and encrypted database and can enhance them with sharding, elastic scaling, and encryption features.
- Using AWS’ Aurora database as the storage node, it is possible to build a distributed database cluster with ShardingSphere running as a computing node on EC2.
Background
Data traffic increased to levels we’d never have thought possible just little over a decade ago - and it’s not slowing down. The exponential increase in data traffic has ushered us into the era of Big Data coming from a number of sources ranging from mobile apps, social networks, customer databases or data from IoT (Internet of Things) devices.
The advent of Big Data has meant new attention dedicated to new technologies, new challenges, and new skill sets to be developed to keep up with the newly found database tech development pace.
Concurrently, what has now come to be called "traditional databases", have been losing ground to innovative database solutions. The emergence of data lakes as the preferred storage solution for Big Data, has meant that traditional data warehouses that were built on relational databases (traditional databases) and capable of storing only structured data, have been struggling. Data lakes are usually based on Hadoop clusters, or NoSQL databases for example.
This post starts off by introducing the challenges that emerged since the Internet has gone mainstream, to then present you with some ideas and a hands-on guide to solving some of these issues - data distribution and security in particular.
Considering how these two issues related to different areas of Big Data, you'd be pressed to think about the number of databases that can potentially be used. To keep things simple and clear, this post considers how to upgrade conventional databases such as MySQL, PostgreSQL, or SQLServer with distributed and secure features. This type of solution converts your database cluster into a sharding distributed system and enhances it with useful features such as data encryption and traffic governance. These advantages are certainly no small feat, but there's more: upgrading and migrating your database cluster will generate a net positive if you consider the cost-benefit equation.
Big data creating database challenges
Before getting down to the practical aspects, let's first look at the 5V characteristics of Big Data and its challenges:
- Volume. The amount and size of the data are too large to manage and efficiently use.
- Variety. There is a wide range of data types, including the likes of structured data, unstructured data, and hybrid data.
- Velocity. The increasing speed is a result of thriving Internet traffic.
- Veracity. The accuracy of data determines executives' confidence in business decisions and prospects.
- Value. Data accumulation and analysis creates new opportunities for businesses to discover new potential markets/products, and to make better informed decisions.
Considering the Veracity and Value points of the previous list are more pertinent to data analytics, they're not the focus of this piece. Considering the Volume, Variety and Velocity points, across industries multiple types of businesses are looking for solutions to the following issues:
- How to store and efficiently manage unprecedented amounts of data?
- How to flexibly expand database instances on-demand?
- Considering that in some cases data is collected from multiple sources and then combined into a single result, how to manage structured and unstructured data in tandem?
- How to protect users' privacy in an online system with the minimum amount of refactoring?
There are multiple possible solutions to these problems. Examples include finding a new database vendor or developing middleware or plugins.
However, if you are using or considering an open source traditional DBMS, you can refer to the rest of this post as a recommendation for your own system to evolve it to or create a distributed secure database system that leverages traditional DBMS. Depending on if you prefer PostgreSQL, MySQL, or RDS the following steps can be applied to your system.
Introducing the architecture
Apache ShardingSphere
Apache ShardingSphere is an open source ecosystem that can transform any database into a distributed database system, and enhance it with sharding, elastic scaling, encryption features and much more.
The project's definition can already give you a hint - it can assist you in converting your existing databases into distributed databases, and improve the eventual new system with useful features.
The procedure is fairly straightforward. To achieve this outcome, all you have to do is import the project into your database system (therefore creating a sharding database system), scaling it out on-demand and if you wish to do so, encrypting the data for privacy protection. The following figure gives you an overview of the proposed architecture.
As shown in the previous figure, our distributed database system consists of Apache ShardingSphere (in this case ShardingSphere-Proxy) positioned in between applications and databases such as MySQL, PostgreSQL, Aurora or any other SQL92 database.
In this system, ShardingSphere serves as the computing node receiving user requests, while databases serve as storage nodes to store the data and perform some local computing. Connecting applications will send their queries to ShardingSphere in the same way they would send their queries to a DBMS.
Traditionally SQL would be used to query the databases. Nevertheless, due to the addition of multiple new features in the proposed distributed database system (think auto-scaling, encryption, SQL audit), a SQL-like language is required to operate the new features.
To answer this requirement, without creating a new barrier or learning curve, the project uses Distributed SQL (DistSQL), to allow for a seamless transition. This means that you can just login to ShardingSphere, enter your SQL and DistSQL to create a sharding table, encrypted table, or start a scaling job. In the following sections I will demonstrate its magic and versatility.
Before we proceed, let's take the previous figure one step further. If the previous figure gave you an overview of the deployment architecture including ShardingSphere, the following figure "zooms in" to give you an indepth look.
As you can see from the figure, ShardingSphere not only acts as a computing node in the distributed database system, but also includes a multitude of useful features, and two clients being ShardingSphere-Proxy and ShardingSphere-JDBC.
Database Plus
Database Plus is the guiding development concept the Apache ShardingSphere project follows. It is a concept for a distributed database system that takes it up a notch, and goes beyond simple data sharding.
It was first conceived with the goal of creating a standardized layer and ecosystem positioned above existing and fragmented databases, to provide unified SQL operation services, and minimize databases' differentiation. As a result, applications communicate with a standardized service, rather than requiring tweaking to match each different database. ShardingSphere functions as a standard database server for end users by leveraging traditional DBMS and noSQL databases (TODO).
Feature plugins
The term "feature plugins" that we use in the Database Plus terminology, refers to the fact that all of these features can function both independently as well as concurrently.
This means that a Database Plus based database system is adaptable and "pluggable", simplifying the combination of multiple feature plugins for end users. ShardingSphere currently supports sharding, read/write splitting, database gateway, data encryption, distributed privileges, shadow database, and other features.
Clients
Two clients are included in the ecosystem, for independent or concurrent deployment.
ShardingSphere-Proxy is a transparent database proxy that also functions as a database server. As a result, it should be deployed independently on a server. Currently, PostgreSQL and MySQL work well with ShardingSphere.
ShardingSphere-JDBC is a lightweight Java framework that extends the Java JDBC layer. It integrates into your JDBC application.
A concurrent deployment of the two clients is an option, with ShardingSphere-JDBC acting as a high-performance driver and ShardingSphere-Proxy acting as a management client.
Guide to create a secure distributed database cluster leveraging your DBMS
Following the comprehensive architecture introduction, let's now jump into a step-by-step guide for creating a sharding and secure Aurora database system. In other words, we use ShardingSphere's sharding plugin, data encryption plugin, and ShardingSphere-Proxy for PostgreSQL to build a distributed database system, as shown in the following figure.
The final solution you will have once you go through the steps in the guide, will look like the figure below.
The application considers ShardingSphere + PostgreSQL instances to be a distributed database and treats ShardingSphere similarly to PostgreSQL. From the user perspective, there is only one logic table, namely t_user. However, this table is made up of four actual tables, from t_user_0 to t_user_3, which are located in two different PostgreSQL instances.
The logic table t_user has a column tel that stores users' phone numbers. Since phone numbers are considered sensitive data, they must be encrypted when stored in a database. To accomplish this, two columns tel cipher and tel plain are created in the actual tables to save the corresponding ciphertext and plaintext (optional, shown here only for demonstration purposes).
The ecosystem's user friendliness means that end users do not need to be concerned with these columns in the actual table, or how to map a logic column to its actual columns. They simply construct their SQL statement using the logic column and plaintext data, and ShardingSphere will complete the entire process of sharding the data, as well as automatically encrypting and decrypting data.
ShardingSphere-Proxy executes all of these processes in the background, greatly simplifying your life as a user, only requiring you to deal with the logic table t user with a logic column tel. However, before running a SQL query, users shall guide ShardingSphere on how to shard and encrypt data.
Step-by-step demo
The following demonstration is performed on AWS using the Aurora database as the storage nodes, with ShardingSphere running as a computing node on EC2.
-
Create EC2 for ShardingSphere-Proxy.
-
Create Aurora databases.
-
Deploy ShardingSphere-Proxy.
-
Login to ShardingSphere-Proxy.
-
Initialize the Aurora databases.
-
Initialize ShardingSphere with encryption rule and sharding rule by SQL and Distributed SQL.
-
Insert rows for test on ShardingSphere-Proxy.
-
Run test query SQLs.
-
Check the actual data in the Aurora databases.
Wrap-up
This post focuses on the creation of a distributed and secure database on Aurora using ShardingSphere, while giving the possibility to add many interesting features as well.
The same guide can be used for a variety of other supported database types, to serve as storage nodes for this distributed database. Thanks to the power of open source, there probably are many other solutions that could solve similar issues, and I hope that the readers of this post can find the most suitable solution for their environment.