AWS updated DynamoDB with the ability to publish near real-time notifications of data changes. This new capability – called DynamoDB Streams – spawned two additional features for the NoSQL database-as-a-service: DynamoDB Triggers fire based on specific data changes found in a DynamoDB Stream, and cross-region replication is driven by a DynamoDB Streams-based architecture.
In a blog post about the release, Amazon CTO Werner Vogels described the ongoing transition to distributed systems that embrace event-driven processes.
The velocity and variety of data that you are managing continues to increase, making your task of keeping up with the change more challenging as you want to manage the systems and applications in real time and respond to changing conditions. A common design pattern is to capture transactional and operational data (such as logs) that require high throughput and performance in DynamoDB, and provide periodic updates to search clusters and data warehouses. However, in the past, you had to write code to manage the data changes and deal with keeping the search engine and data warehousing engines in sync. For cost and manageability reasons, some developers have collocated the extract job, the search cluster, and data warehouses on the same box, leading to performance and scalability compromises. DynamoDB Streams simplifies and improves this design pattern with a distributed systems approach.
Vogels defined DynamoDB Streams as “a time-ordered sequence, or change log, of all item-level changes made to any DynamoDB table.” All processing is asynchronous, so users can enable DynamoDB Streams on new or existing tables without impacting performance. It is up to the user to determine what information is written to the stream: primary keys only, the pre-change item, the post-change item, or both images. Change records are available in the stream for up to 24 hours. Stream consumers connect to an API endpoint and issue requests for shards that contain the stream records. The interface to DynamoDB Streams purposefully resembles that of Amazon Kinesis, the real-time data processing service from AWS. DynamoDB Streams had been available in preview since last fall, and this release marks general availability for all customers. While DynamoDB Streams itself is available at no charge, users pay to read data from the stream.
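To make the flow concrete, the sketch below (not part of the announcement) uses the AWS SDK for JavaScript to enable a stream on a hypothetical table named MyTable, choose a view type, and page through one shard’s change records. It is a minimal illustration: production code would wait for the stream to become active, follow NextShardIterator, and handle shard splits.

```javascript
var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({region: 'us-east-1'});
var streams = new AWS.DynamoDBStreams({region: 'us-east-1'});

// Enable a stream on an existing table, choosing what each change record
// carries: KEYS_ONLY, OLD_IMAGE, NEW_IMAGE, or NEW_AND_OLD_IMAGES.
dynamodb.updateTable({
  TableName: 'MyTable',  // placeholder table name
  StreamSpecification: {StreamEnabled: true, StreamViewType: 'NEW_AND_OLD_IMAGES'}
}, function (err, data) {
  if (err) { return console.error(err); }
  var streamArn = data.TableDescription.LatestStreamArn;

  // Consumers discover shards, request an iterator, and page through records,
  // much as they would with an Amazon Kinesis stream. In practice, wait for
  // the stream to become active before reading.
  streams.describeStream({StreamArn: streamArn}, function (err, desc) {
    if (err) { return console.error(err); }
    streams.getShardIterator({
      StreamArn: streamArn,
      ShardId: desc.StreamDescription.Shards[0].ShardId,
      ShardIteratorType: 'TRIM_HORIZON'  // start from the oldest record still retained
    }, function (err, iter) {
      if (err) { return console.error(err); }
      streams.getRecords({ShardIterator: iter.ShardIterator}, function (err, result) {
        if (err) { return console.error(err); }
        result.Records.forEach(function (record) {
          // eventName is INSERT, MODIFY, or REMOVE; dynamodb holds whichever
          // images were selected by the StreamViewType above.
          console.log(record.eventName, JSON.stringify(record.dynamodb));
        });
      });
    });
  });
});
```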
The concept of “triggers” may make an experienced DBA shudder, and Vogels recognized the challenge of leveraging triggers on traditional databases.
From the dawn of databases, the pull method has been the preferred model for interaction with a database. To retrieve data, applications are expected to make API calls and read the data. To get updates from a table, customers have to constantly poll the database with another API call. Relational databases use triggers as a mechanism to enable applications to respond to data changes. However, the execution of the triggers happens on the same machine as the one that runs the database and an errant trigger can wreak havoc on the whole database. In addition, such mechanisms do not scale well for fast-moving data sets and large databases.
To achieve a truly scalable, high-performance, and flexible system, we need to decouple the execution of triggers from the database and bring the data changes to the applications as they occur.
Vogels thinks that the solution is DynamoDB Triggers – a combination of DynamoDB Streams with AWS Lambda. Lambda is the event-driven compute service that runs Java and JavaScript functions. Via DynamoDB Triggers, these functions run outside the database and respond to data changes included in DynamoDB Streams. AWS evangelist Jeff Barr explained and demonstrated DynamoDB Triggers in a recent blog post.
You can think of the combination of Streams and Lambda as a clean and lightweight way to implement database triggers, NoSQL style! Historically, relational database triggers were implemented within the database engine itself. As such, the repertoire of possible responses to an operation is limited to the operations defined by the engine. Using Lambda to implement the actions associated with the triggers (inserting, deleting, and changing table items) is far more powerful and significantly more expressive. You can write simple code to analyze changes (by comparing the new and the old item images), initiate updates to other forms of data, enforce business rules, or activate synchronous or asynchronous business logic. You can allow Lambda to manage the hosting and the scaling so that you can focus on the unique and valuable parts of your application.
Barr pointed to a sample Lambda function that developers can use to configure a simple trigger. Users who want to build out more complex scenarios should follow the proper patterns for building secure, stateless, event-driven Lambda functions. There is no extra cost for DynamoDB Triggers beyond the cost of executing the Lambda functions.
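For illustration, a trigger function along these lines might look like the following Node.js sketch. It is not Barr’s sample; the event shape shown (Records, eventName, and the old/new item images) is what Lambda delivers when the stream is configured to include both images.

```javascript
// Minimal sketch of a Node.js Lambda function attached to a DynamoDB stream.
exports.handler = function (event, context) {
  event.Records.forEach(function (record) {
    var change = record.dynamodb;
    if (record.eventName === 'MODIFY') {
      // Compare the old and new item images to decide whether business
      // logic should fire, as described in Barr's post.
      console.log('Item changed from', JSON.stringify(change.OldImage),
                  'to', JSON.stringify(change.NewImage));
    } else {
      // INSERT or REMOVE: log the affected item's key.
      console.log(record.eventName, JSON.stringify(change.Keys));
    }
  });
  context.succeed('Processed ' + event.Records.length + ' stream records.');
};
```

Lambda hosts and scales this code on the developer’s behalf; the function only has to express what should happen in response to each change.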
The final component of this announcement was a cross-region replication solution built upon DynamoDB Streams. By default, DynamoDB replicates its data across multiple availability zones in a given region. In the past, developers who wanted to copy data to additional regions for disaster recovery, latency, or migration purposes had to design their own replication solution. AWS now offers an application that leverages DynamoDB Streams to copy data from a source table to a replica in another region. The application is deployed by the CloudFormation provisioning engine and consists of:
- A master table in DynamoDB with Streams enabled.
- A Replication Coordinator and DynamoDB Connector running in the Amazon EC2 Container Service and deployed via the Elastic Beanstalk packaging service. The Replication Coordinator reads from the DynamoDB Stream and batches records meant for the replica table (sketched below).
- A DynamoDB table for replication metadata and another for Kinesis checkpoint management.
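Conceptually, the replication path boils down to replaying each stream record against the replica table, as in the hedged sketch below. This is not the AWS-provided application, which adds checkpointing (via the Kinesis checkpoint table above), batching, and failure handling; the table and region names here are placeholders.

```javascript
var AWS = require('aws-sdk');
// Replica lives in a different region from the master table.
var replica = new AWS.DynamoDB({region: 'eu-west-1'});

// Apply a single DynamoDB Streams record to the replica table.
function applyRecord(record, callback) {
  var change = record.dynamodb;
  if (record.eventName === 'REMOVE') {
    // Deletes are replayed against the replica using the item's key.
    replica.deleteItem({TableName: 'MyTable-replica', Key: change.Keys}, callback);
  } else {
    // Inserts and modifications overwrite the replica with the new image,
    // which already arrives in DynamoDB's attribute-value format.
    replica.putItem({TableName: 'MyTable-replica', Item: change.NewImage}, callback);
  }
}
```

Because the new image arrives already encoded in DynamoDB’s attribute-value format, it can be written to the replica without transformation, which is part of what makes the Streams-based design a natural fit for replication.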
Users can only create single-master configurations with this model; any changes written directly to a replica are not propagated back to the master table and risk being overwritten. While there is no charge for the cross-region replication application itself, the user is responsible for the throughput units provisioned for the replica DynamoDB table, data transfer charges, the cost of reading the data stream, the charge for the EC2 container instances, and the cost of the SQS queue that interacts with the cross-region application.