Grab engineering recently shared its experience designing its backend data platform in a blog post.
GrabApp is an application through which customers select and buy their daily needs from merchants. To be scalable and manageable, the data platform and its ingestion pipeline had to be designed as a distributed, fault-tolerant system. To design this system, as stated in the blog post, the engineers classified queries into two main classes: write queries, such as creating or updating an order, and read queries, such as getting an order by its order ID or reading statistics related to a user ID. These two types of read/write queries map naturally onto transactional and analytical queries.
Transactional queries are also known as OLTP, or online transaction processing; analytical queries are also known as OLAP, or online analytical processing. These two types of data storage and their usage are illustrated in the following figure.
Database platform for online ordering in Grab
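To make the split concrete, here is a minimal sketch of how a service might route the two query classes to the two stores. The class, table, and column names (OrderService, orders, amount, userId) are illustrative assumptions, not details from Grab's post.

```python
# Hypothetical sketch of the read/write split behind a two-store design.
class OrderService:
    def __init__(self, dynamodb_table, mysql_conn):
        self.oltp = dynamodb_table  # transactional store (e.g. DynamoDB)
        self.olap = mysql_conn      # analytical store (e.g. MySQL)

    def create_order(self, order: dict) -> None:
        # Write query: a single key-value write, keyed by order ID.
        self.oltp.put_item(Item=order)

    def get_order(self, order_id: str):
        # Read query by primary key stays on the transactional store.
        return self.oltp.get_item(Key={"orderID": order_id}).get("Item")

    def user_statistics(self, user_id: str):
        # Statistics-style read queries go to the analytical store.
        with self.olap.cursor() as cur:
            cur.execute(
                "SELECT COUNT(*), SUM(amount) FROM orders WHERE userId = %s",
                (user_id,),
            )
            return cur.fetchone()
```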
For transactional queries, the engineers used AWS DynamoDB. DynamoDB is a scalable, elastic key-value store that supports strongly consistent reads by primary key. A sample DynamoDB table is illustrated in the following figure.
As explained in the blog post:
Each DynamoDB table has many items with attributes. Each item has a partition key and sort key. The partition key is used for key-value queries, and the sort key is used for range queries. In our case, the table contains multiple order items. The partition key is order ID. We can easily support key-value queries by the partition key. Batch queries like ‘Get ongoing orders by passenger id’ are supported by DynamoDB Global Secondary Index (GSI). A GSI is like a normal DynamoDB table, which also has keys and attributes.
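Both access patterns from the quote can be sketched with boto3, the AWS SDK for Python. The table name (orders) and the GSI name and keys (passengerId-state-index, passengerId, state) are assumptions for illustration; Grab's actual schema is not described at this level of detail.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table name

# Key-value query: fetch the items of one order by its partition key
# (order ID); the sort key distinguishes the items within the order.
order_items = table.query(
    KeyConditionExpression=Key("orderID").eq("order-123")
)["Items"]

# Batch query via a Global Secondary Index: ongoing orders by passenger ID.
# The index name and its key attributes are assumptions for illustration.
ongoing = table.query(
    IndexName="passengerId-state-index",
    KeyConditionExpression=(
        Key("passengerId").eq("passenger-42") & Key("state").eq("ONGOING")
    ),
)["Items"]
```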
Alongside DynamoDB's query features, the engineers used its time to live (TTL) attribute to manage data retention, keeping both the size of the table and its cost under control.
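Enabling TTL in DynamoDB is a one-time table setting plus an epoch-seconds attribute on each item. A minimal boto3 sketch, with the attribute name (expireAt) and the 90-day retention window assumed for illustration:

```python
import time

import boto3

client = boto3.client("dynamodb")

# One-time setting: tell DynamoDB which attribute holds the expiry time.
client.update_time_to_live(
    TableName="orders",  # hypothetical table name
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expireAt"},
)

# Each item then carries its expiry as a Unix epoch timestamp in seconds;
# DynamoDB deletes expired items in the background at no extra write cost.
ninety_days = 90 * 24 * 60 * 60  # retention window is an assumption
client.put_item(
    TableName="orders",
    Item={
        "orderID": {"S": "order-123"},
        "expireAt": {"N": str(int(time.time()) + ninety_days)},
    },
)
```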
To serve analytical queries, Grab's engineers used MySQL. The following figure shows the OLAP architecture.
Using Kafka to connect data stream to both OLTP and OLAP data stores
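On the MySQL side, the statistics-style read queries mentioned earlier become plain SQL aggregations. A sketch using PyMySQL, with connection details and schema assumed for illustration:

```python
import pymysql

# Connection details and schema are assumptions for illustration.
conn = pymysql.connect(
    host="olap-mysql", user="reader", password="...", database="orders_olap"
)

# Analytical read: per-user order statistics over the last 30 days.
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT userId, COUNT(*) AS order_count, SUM(amount) AS total_spent
        FROM orders
        WHERE created_at >= NOW() - INTERVAL 30 DAY
        GROUP BY userId
        """
    )
    for user_id, order_count, total_spent in cur.fetchall():
        print(user_id, order_count, total_spent)
```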
Kafka is used to stream data into both the OLTP and OLAP data stores. Because even a highly available Kafka cluster cannot rule out producer failures, Amazon Simple Queue Service (SQS) is used as a retry mechanism to keep the pipeline reliable and available. This is explained in the blog post:
Even if Kafka can provide 99.95% SLA, there is still the chance of stream producer failures. When the producer fails, we will store the message in an Amazon Simple Queue Service (SQS) and retry. If the retry also fails, it will be moved to the SQS dead letter queue (DLQ), to be consumed at a later time. On the stream consumer side, we use back-off retry at both stream and database levels to ensure consistency. In a worst-case scenario, we can rewind the stream events from Kafka. It is important for the data ingestion pipeline to handle duplicate messages and out-of-order messages. Duplicate messages are handled by the database-level unique key (for example, order ID + creation time).
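A rough sketch of that failure-handling path: fall back to SQS when the Kafka produce fails, and make ingestion idempotent with a database unique key. The queue URL, topic name, and schema are hypothetical, and the DLQ hand-off would come from the SQS queue's redrive policy rather than application code.

```python
import json

import boto3
from kafka import KafkaProducer  # kafka-python client

sqs = boto3.client("sqs")
producer = KafkaProducer(bootstrap_servers="kafka:9092")
RETRY_QUEUE_URL = "https://sqs.region.amazonaws.com/.../order-events-retry"

def publish_order_event(event: dict) -> None:
    try:
        # Normal path: publish the order event to the Kafka stream.
        producer.send("order-events", json.dumps(event).encode()).get(timeout=5)
    except Exception:
        # Producer failure: park the message in SQS for later retry.
        # Repeated failures move it to the DLQ via the queue's redrive policy.
        sqs.send_message(QueueUrl=RETRY_QUEUE_URL, MessageBody=json.dumps(event))

def ingest(conn, event: dict) -> None:
    # Consumer side: a unique key on (orderID, created_at) makes duplicate
    # or re-delivered events harmless, as the post describes.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT IGNORE INTO orders (orderID, created_at, state) "
            "VALUES (%s, %s, %s)",
            (event["orderID"], event["createdAt"], event["state"]),
        )
    conn.commit()
```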
Designing a two-layer architecture that separates the OLTP and OLAP data platforms is a clear way to ensure the engineering team can focus on the scalability, availability, and observability of each layer.