In a recent post, Expedia Group, a global travel technology company that manages over 70 TB of data, described how it migrated Cassandra clusters with over 50 tables and thousands of connections to ScyllaDB. The primary motivation for the migration was to leverage ScyllaDB's built-in Change Data Capture (CDC) capabilities for improved data consistency and reduced operational complexity.
Database migrations like this pose significant challenges, especially when maintaining continuous application access and handling large data volumes. In particular, for this migration, the critical goals were zero downtime, continuous TLS connections, data consistency, and accommodating latency-sensitive applications.
The migration started with the Cassandra cluster called Identity, which stored sensitive data for user authentication and sessions. The job involved transferring 1 TB of uncompressed data (3 TB with replication), and the critical nature of this data meant that any downtime or data loss would disrupt Expedia Group’s platforms.
The existing architecture, composed of Cassandra’s Change Data Capture (CDC) paired with Debezium (a third-party change-tracking library), introduced additional points of failure. If Debezium stopped processing events, the CDC-enabled tables could become blocked, leading to instability across the cluster. Given these requirements and constraints, the team evaluated two migration tools: the SSTable Loader and the Scylla Migrator.
Scylla Migrator was ultimately selected for its checkpointing ability, parallel migration support, and built-in validator. Checkpointing mattered in particular because the team needed to be able to pause and resume the migration. Scylla Migrator also runs on Spark, which parallelizes the data transfer.
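To illustrate why checkpointing makes a pause-and-resume migration possible, here is a minimal Python sketch. It is not Scylla Migrator's implementation: the savepoint file format, the `copy_range` callable, and the function names are all hypothetical, and the sketch processes ranges sequentially where Spark would process them in parallel.

```python
import json
import os

# Token bounds of the Murmur3 partitioner used by Cassandra and ScyllaDB.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ranges(split_count):
    """Divide the token ring into split_count contiguous (start, end) ranges."""
    step = (MAX_TOKEN - MIN_TOKEN + 1) // split_count
    ranges = []
    start = MIN_TOKEN
    for i in range(split_count):
        # The last range absorbs any remainder so the whole ring is covered.
        end = MAX_TOKEN if i == split_count - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def load_savepoint(path):
    """Return the set of ranges already migrated, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {tuple(r) for r in json.load(f)}


def save_savepoint(path, done):
    """Write the savepoint atomically so a crash cannot leave a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)


def migrate(copy_range, savepoint_path, split_count):
    """Copy every token range that the savepoint does not list as done.

    copy_range is a hypothetical callable that copies one (start, end)
    token range from the source cluster to the target cluster.
    """
    done = load_savepoint(savepoint_path)
    for token_range in split_token_ranges(split_count):
        if token_range in done:
            continue  # copied in an earlier run, skip it
        copy_range(token_range)
        done.add(token_range)
        save_savepoint(savepoint_path, done)
    return done
```

If the job is interrupted, calling `migrate` again with the same savepoint path only copies the ranges that were not yet recorded, which is the property the team relied on.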
Preparing the Spark cluster required fine-tuning both the Spark and Scylla Migrator settings. Memory allocations and the number of Spark worker instances were adjusted to maximize throughput without adding latency to the applications still served by Cassandra. TLS posed an additional challenge: Cassandra required TLS, but Scylla Migrator did not support the self-signed certificates used by the team. To resolve this, the team temporarily disabled TLS during the migration and, after completion, configured a new data center in ScyllaDB with TLS enabled, restoring secure connections without disrupting the migration.
Technical complexities surfaced around null key values and large tables. During migration, some tables turned out to contain null values in primary key columns, which ScyllaDB could not accept. To address this, the team identified and purged the problematic rows before re-running the migration job. Large tables also required a temporary scale-up of the ScyllaDB cluster to handle the higher write load, along with adjustments to the split count and the number of worker cores.
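The check for problematic rows can be sketched in a few lines of Python. This is only an illustration under assumed names: rows are modeled as dictionaries, and the key columns `user_id` and `session_id` are hypothetical, not Expedia's schema.

```python
def split_by_key_validity(rows, key_columns):
    """Separate rows into (valid, invalid) by whether every key column is set."""
    valid, invalid = [], []
    for row in rows:
        # A missing column and an explicit None both count as a null key.
        if any(row.get(col) is None for col in key_columns):
            invalid.append(row)
        else:
            valid.append(row)
    return valid, invalid


# Hypothetical sample rows; the second one has a null key column.
rows = [
    {"user_id": "u1", "session_id": "s1", "token": "abc"},
    {"user_id": "u2", "session_id": None, "token": "def"},
]
valid_rows, rows_to_purge = split_by_key_validity(rows, ["user_id", "session_id"])
```

Rows in `rows_to_purge` would be reviewed and removed from the source before the migration job is re-run.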
Once the data transfer was complete, the team conducted a validation phase to confirm data accuracy, focusing on value comparison rather than timestamps. Due to latency considerations, validation jobs ran with reduced resources. Scylla validator reports confirmed data consistency, and dual writes verified both clusters’ integrity before switching applications to ScyllaDB.
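A value-based comparison that ignores timestamps can be sketched as follows. This is an assumption-laden illustration, not the Scylla Migrator validator: rows are plain dictionaries, and the `writetime` column name and the `compare_tables` function are hypothetical.

```python
def index_by_key(rows, key_columns):
    """Map each row's primary key tuple to the row itself."""
    return {tuple(row[col] for col in key_columns): row for row in rows}


def compare_tables(source_rows, target_rows, key_columns,
                   ignored_columns=("writetime",)):
    """Return a list of (key, reason) differences between two row sets.

    Only column values are compared; columns listed in ignored_columns
    (for example a write timestamp) are skipped.
    """
    source = index_by_key(source_rows, key_columns)
    target = index_by_key(target_rows, key_columns)
    differences = []
    for key, src_row in source.items():
        dst_row = target.get(key)
        if dst_row is None:
            differences.append((key, "missing in target"))
            continue
        for col, value in src_row.items():
            if col in ignored_columns:
                continue
            if dst_row.get(col) != value:
                differences.append((key, f"column {col} differs"))
    return differences
```

With this approach, two rows that hold the same values but were written at different times count as equal, so a row copied during migration does not produce a false mismatch.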
The migration process provided several key insights. Running Scylla Migrator on a separate CDC instance minimized the load on production traffic. Additionally, data types requiring original timestamps posed challenges due to CQL protocol limitations rather than any restriction within Scylla Migrator. Ultimately, the migration to ScyllaDB streamlined CDC processes and reduced AWS costs by requiring fewer nodes than Cassandra while maintaining high availability.
Other companies have also migrated to ScyllaDB, for example:
- Cobli: A Brazilian logistics company that migrated from Apache Cassandra to ScyllaDB to improve performance and reduce costs. It achieved a 10x performance improvement and a 50% reduction in infrastructure costs.
- mParticle: A customer data platform that migrated from Apache Cassandra to ScyllaDB to improve performance and reduce latency. It achieved a 10x improvement in query performance and a 50% reduction in latency.
In conclusion, the migration to ScyllaDB improved efficiency, stability, and cost-effectiveness for Expedia Group, meeting its stringent standards for data availability and low latency while offering a simpler CDC solution.