When you store data in a database, you often also want to put the same data in a cache and a search engine. The challenge then is how to keep all data in sync when you want to stay away from distributed transactions and dual writes. One solution is to use a change data capture (CDC) tool that captures and publishes changes in a database. In a presentation at MicroXchg Berlin, Gunnar Morling described Debezium, an implementation of CDC using Apache Kafka to publish changes as event streams.
Morling, a software engineer at Red Hat, describes Debezium as an open source CDC tool built on top of Kafka that reads a database's transaction log and creates streams of change events. Other applications can then consume these events asynchronously, in the correct order, and update their own data stores according to their needs.
Transaction logs are append-only files used for transaction rollback and replication, and for Morling they are an ideal source for capturing changes made in a database, since they contain all changes made and in the correct order. Each database has its own format and API for reading its log, so Debezium provides connectors for several databases. On the output side, Debezium produces one generic, abstract event representation for Kafka, independent of the source database.
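Debezium's connectors are typically deployed through Kafka Connect and configured with the connection details of the database whose log should be read. As a rough illustration, the sketch below registers a MySQL connector through Kafka Connect's REST API; the host names, credentials, logical server name and exact property names are assumptions and depend on the Debezium version in use.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {

    public static void main(String[] args) throws Exception {
        // Hypothetical connector configuration; property names follow the Debezium MySQL
        // connector, but hosts, credentials and the server name are made up.
        String config = """
            {
              "name": "inventory-connector",
              "config": {
                "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                "database.hostname": "mysql",
                "database.port": "3306",
                "database.user": "debezium",
                "database.password": "dbz",
                "database.server.id": "184054",
                "database.server.name": "dbserver1",
                "database.include.list": "inventory",
                "database.history.kafka.bootstrap.servers": "kafka:9092",
                "database.history.kafka.topic": "schema-changes.inventory"
              }
            }
            """;

        // Kafka Connect exposes a REST API for registering connectors; the port is the default one.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once the connector is running, changes to each captured table appear as events on their own Kafka topic.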
Besides using CDC for updating caches, services and search engines, Morling notes some other use cases, including:
- Data replication. Commonly used for copying data into another type of database or into a data warehouse.
- Auditing. By keeping the history of events, possibly after enriching each event with metadata, you will have an audit of all changes made. One example of metadata that may be interesting is the current user.
In a microservices-based system, services commonly need data owned by other services, but Morling points out that we should stay away from shared databases. Using REST calls to fetch the data increases the coupling between services; instead, we can use CDC for such scenarios. By creating streams of changes, other services can subscribe to these changes and update their local databases, as in the sketch below. This pipeline of events is asynchronous, which means services can fall behind in case of network problems, for instance, but they will eventually catch up; the system is eventually consistent.
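A minimal sketch of such a subscribing service follows, using a plain Kafka consumer and an in-memory map standing in for the service's local database. The broker address, topic name and the assumption that events are serialized as plain JSON (without schema wrapping) are hypothetical; the operation codes ("c" for create, "u" for update, "r" for snapshot read, "d" for delete) follow Debezium's change event envelope.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class LocalViewUpdater {

    // Stands in for the consuming service's own data store, keyed by the record key.
    private static final Map<String, JsonNode> localView = new ConcurrentHashMap<>();
    private static final ObjectMapper mapper = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service-customer-view");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One topic per captured table; the name below is made up.
            consumer.subscribe(Collections.singletonList("dbserver1.inventory.customers"));

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    apply(record.key(), record.value());
                }
            }
        }
    }

    private static void apply(String key, String value) throws Exception {
        if (value == null) {
            // Tombstone record following a delete; remove the entry from the local view.
            localView.remove(key);
            return;
        }
        JsonNode envelope = mapper.readTree(value);
        String op = envelope.path("op").asText();

        switch (op) {
            case "c": // create
            case "u": // update
            case "r": // snapshot read
                localView.put(key, envelope.path("after"));
                break;
            case "d": // delete
                localView.remove(key);
                break;
            default:
                // Ignore operations this sketch does not handle.
                break;
        }
    }
}
```

Because Kafka retains the events, a service that falls behind simply continues reading from its last committed offset and catches up.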
Morling points out that exposing internal table structures to other services, as in the previous example, is problematic, but this can be avoided using the Outbox pattern. Instead of capturing changes in the tables used by the business domain, a separate events table is used. When a change in the domain should cause an event to be published, a record is added to this events table within the same transaction. The events table thereby becomes an outbox holding the events that should be sent to other systems.
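A minimal sketch of the write side of this pattern, using plain JDBC, could look as follows. The outbox columns (id, aggregate type, aggregate id, event type, payload) follow the structure Morling describes for the pattern, but the table names, schema, JDBC URL and credentials here are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class OrderService {

    public void placeOrder(UUID customerId, String orderLinesJson) throws Exception {
        // Hypothetical JDBC URL, credentials and schema.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/orderdb", "app", "secret")) {
            conn.setAutoCommit(false);
            try {
                UUID orderId = UUID.randomUUID();

                // 1. The actual change to the business domain.
                try (PreparedStatement insertOrder = conn.prepareStatement(
                        "INSERT INTO purchase_order (id, customer_id, order_lines) VALUES (?, ?, ?)")) {
                    insertOrder.setObject(1, orderId);
                    insertOrder.setObject(2, customerId);
                    insertOrder.setString(3, orderLinesJson);
                    insertOrder.executeUpdate();
                }

                // 2. The event to be published, written to the outbox table in the same transaction.
                try (PreparedStatement insertEvent = conn.prepareStatement(
                        "INSERT INTO outbox (id, aggregatetype, aggregateid, type, payload) VALUES (?, ?, ?, ?, ?)")) {
                    insertEvent.setObject(1, UUID.randomUUID());
                    insertEvent.setString(2, "order");
                    insertEvent.setString(3, orderId.toString());
                    insertEvent.setString(4, "OrderCreated");
                    insertEvent.setString(5, orderLinesJson);
                    insertEvent.executeUpdate();
                }

                // Both inserts commit or roll back together; Debezium later picks the outbox row
                // up from the transaction log and publishes it as an event to Kafka.
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```

Since the outbox row and the domain change are written in one local transaction, no event is published for a rolled-back change and no change is committed without its event, without needing distributed transactions.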
When you want to move to microservices, you often have an existing monolithic application from which you want to extract services. Morling notes that CDC can be used to verify that a new service does the right thing: by keeping the write requests directed at the monolith and capturing the resulting changes, those changes can be used to verify the behaviour of the new service.
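One way to picture such a verification step, sketched under the assumption that both the monolith's table and the new service's table are captured by Debezium into their own topics, is to compare the latest change event per key from the two streams. The topic names, broker address and comparison logic below are all hypothetical.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Arrays;
import java.util.Map;
import java.util.Objects;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class ChangeStreamComparator {

    // Both topic names are made up; each would be produced by its own Debezium connector.
    private static final String MONOLITH_TOPIC = "monolith.public.orders";
    private static final String SERVICE_TOPIC = "orderservice.public.orders";

    // Latest change event per key from each side.
    private static final Map<String, String> monolithState = new ConcurrentHashMap<>();
    private static final Map<String, String> serviceState = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "migration-verifier");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList(MONOLITH_TOPIC, SERVICE_TOPIC));

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    if (MONOLITH_TOPIC.equals(record.topic())) {
                        monolithState.put(record.key(), record.value());
                    } else {
                        serviceState.put(record.key(), record.value());
                    }
                    compare(record.key());
                }
            }
        }
    }

    private static void compare(String key) {
        String fromMonolith = monolithState.get(key);
        String fromService = serviceState.get(key);
        // Only flag keys both sides have seen; in practice a grace period is needed,
        // since the two streams are not synchronized.
        if (fromMonolith != null && fromService != null && !Objects.equals(fromMonolith, fromService)) {
            System.out.printf("Mismatch for key %s:%n  monolith: %s%n  service:  %s%n",
                    key, fromMonolith, fromService);
        }
    }
}
```

In practice one would compare only the relevant fields of the "after" state rather than the raw events, since source metadata and timestamps will differ between the two streams.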
Morling has written a blog post describing the Outbox pattern in more detail. Martin Kleppmann described in an earlier blog post how logs can be used to build a solid data infrastructure, and why dual writes are a bad idea.
Most presentations at the conference were recorded and will be available over the coming months.