Atlassian recently introduced Lithium, an in-house ETL platform built for dynamic data movement. Lithium streamlines tasks such as cloud migrations, scheduled backups, and in-flight data validations by supporting ephemeral pipelines and tenant-level isolation. Because resources are allocated dynamically rather than held for peak load, the platform delivers significant cost savings alongside efficiency and scalability.
Lithium addresses the need for runtime pipeline provisioning, a requirement for tasks like user-initiated migrations and periodic backups. Traditional ETL systems are not designed to handle such on-demand provisioning efficiently, prompting Atlassian to create Lithium to minimize resource usage while delivering high-performance data movement.
Niraj Mishra, principal engineer at Atlassian, told InfoQ, "The primary advantage of utilizing ephemeral pipelines is the ability to scale infrastructure according to demand rather than maintaining it for peak loads." He added that leveraging elastic Kafka clusters allows Atlassian to avoid provisioning large clusters for brief spikes, resulting in significant infrastructure cost savings; however, elastic cluster support is still under development.
Integrating the Lithium SDK allows consumer services like Jira and Confluence to host their processors closer to their data and domain logic. This model empowers services to focus solely on their processing logic while leveraging Lithium's abstraction for efficient Kafka-based data handling. Mishra states, "In a system like Flink, data processing occurs within dedicated clusters, necessitating that processors have access to the data required for transformation." However, the components of the Lithium data plane operate within the confines of the service that owns the data, allowing them direct access to any information needed to enhance the messages.
Bring your own host (Source)
Another benefit is that these data plane components can leverage the domain logic embedded within the services to process messages rather than packaging this logic as separate JARs to run on Flink. Mishra explained, "This approach eliminates the need for additional Flink infrastructure to be monitored, hosted, and managed. Consequently, it empowers service teams to take ownership of their stream processors, thereby minimizing team friction and competing priorities."
The platform operates with an AWS MSK-based Control Plane that orchestrates pipeline management, resource allocation, and Kafka topic lifecycles, as well as a Data Plane that performs the actual data movement. This separation ensures streamlined operations: the Control Plane broadcasts commands, and the Data Plane executes them using provisioned processors.
Lithium's Control Plane and Data Plane (Source)
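The command-broadcast split between the two planes can be illustrated with a toy dispatcher. This is a minimal sketch, not Lithium's actual protocol: the command fields (`action`, `workplan_id`) and handler names are assumptions, and the control topic is simulated by a single in-memory message.

```python
import json

def handle_command(message: str, registry: dict) -> str:
    """Dispatch a control-plane command to the matching data-plane handler."""
    cmd = json.loads(message)
    handler = registry[cmd["action"]]  # KeyError here would surface an unknown command
    return handler(cmd["workplan_id"])

# Hypothetical data-plane handlers keyed by command action.
registry = {
    "provision": lambda wp: f"provisioned processors for {wp}",
    "pause":     lambda wp: f"paused {wp}",
    "teardown":  lambda wp: f"tore down {wp}",
}

# In practice the Control Plane would publish such commands on a Kafka
# control topic; here we hand one broadcast message straight to the dispatcher.
print(handle_command('{"action": "provision", "workplan_id": "wp-42"}', registry))
```

Keeping the command envelope declarative like this is what lets a single Control Plane drive processors hosted inside many different consumer services.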
Lithium's data pipelines are called Workplans. Each Workplan consists of three main processor types - Source, Transform, and Sink - which extract, transform, and load data, respectively. These modular processors can be distributed dynamically across services, enabling localized execution of domain-specific logic and scalable parallel data processing.
What makes Workplans unique is their ability to be provisioned dynamically and operate ephemerally, existing only for the task's duration. Each Workplan runs in complete isolation, with dedicated Kafka topics and processor instances, ensuring tenant-level data security and preventing pipeline interference. Their design supports advanced capabilities like pause, rewind, and in-progress remediation, making them highly adaptable to complex, real-time data workflows.
Sample Workplan Specification (Source)
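The Source/Transform/Sink structure can be sketched as a small processor chain. This is a hypothetical illustration, not the Lithium SDK: the `Workplan` class, its field names, and the record shape are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass
class Workplan:
    """Hypothetical Workplan wiring: extract -> transform steps -> load."""
    source: Callable[[], Iterator[dict]]       # Source: yields records
    transforms: list[Callable[[dict], dict]]   # Transform: chained steps
    sink: Callable[[dict], None]               # Sink: loads each record

    def run(self) -> None:
        for record in self.source():
            for step in self.transforms:
                record = step(record)
            self.sink(record)

# Usage: a toy migration that normalizes issue keys on the way through.
loaded: list = []
wp = Workplan(
    source=lambda: iter([{"key": "jira-1"}, {"key": "jira-2"}]),
    transforms=[lambda r: {**r, "key": r["key"].upper()}],
    sink=loaded.append,
)
wp.run()
print(loaded)  # [{'key': 'JIRA-1'}, {'key': 'JIRA-2'}]
```

Because each stage is just a pluggable callable, the same shape accommodates the runtime swapping of transformers and validators that the remediation features below rely on.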
InfoQ spoke with Mishra about Lithium, its implementation and its future.
InfoQ: In the article, you highlight using Kafka Streams' exactly-once semantics to avoid duplicates. What challenges did you face in ensuring transactional integrity across multiple pipeline stages, and how did you address them?
Niraj Mishra: Lithium's ETL pipeline combines the Kafka producer/consumer APIs with the Kafka Streams API. A Kafka Streams topology is used in the transform phase, where multiple transformation steps can be interconnected as nodes within the topology. Kafka Streams provides an exactly-once processing guarantee, ensuring that duplicates are not produced during the transformation step.
However, this does not render the entire pipeline an exactly-once system. For instance, the Kafka producers used in the extract phase may produce messages more than once, and the Kafka consumers in the sink phase can also consume messages multiple times.
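The reason the extract phase is only at-least-once can be shown with a toy simulation of a producer retry: the broker persists the write, but if the acknowledgement is lost, the producer resends and the log gains a duplicate. This is a deliberately simplified model, not Kafka client code.

```python
def send_with_retry(log: list, message: str, ack_lost_once: bool) -> None:
    """Simulate an at-least-once producer: retries on a lost ack duplicate the write."""
    log.append(message)        # broker persists the first attempt
    if ack_lost_once:
        log.append(message)    # ack never arrived, so the producer resends

log: list = []
send_with_retry(log, "row-1", ack_lost_once=True)
send_with_retry(log, "row-2", ack_lost_once=False)
print(log)  # ['row-1', 'row-1', 'row-2'] - 'row-1' is duplicated
```

The same failure mode applies on the consuming side: a sink that crashes after processing but before committing its offset will re-read the message, which is what the transactional sink described next guards against.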
To achieve transactional integrity across the remaining stages, Lithium employs several mechanisms (the transform phase is already handled by Kafka Streams). We have developed a transactional sink that consumers can utilize to ensure that messages are processed in the sinks exactly once. The platform accomplishes this by managing consumer offsets within a database table, which resides in the same database where the Sink writes the consumed data. Writes to both the primary table and the offset table occur within the same transaction, guaranteeing that messages are processed in the database exactly once.
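The transactional-sink pattern Mishra describes can be sketched with SQLite standing in for the sink's database. The table and column names are assumptions, not Lithium's schema; the essential point is that the data write and the offset write share one transaction, so a redelivered message (same or older offset) is detected and skipped.

```python
import sqlite3

def init(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE records (msg_offset INTEGER, payload TEXT)")
    conn.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, last_offset INTEGER)")

def transactional_sink(conn: sqlite3.Connection, part: int, offset: int, payload: str) -> bool:
    """Process one message exactly once; returns False for a duplicate delivery."""
    with conn:  # one transaction covers both the data and the offset write
        row = conn.execute(
            "SELECT last_offset FROM offsets WHERE part = ?", (part,)
        ).fetchone()
        if row is not None and offset <= row[0]:
            return False  # already processed - drop the redelivery
        conn.execute("INSERT INTO records VALUES (?, ?)", (offset, payload))
        conn.execute(
            "INSERT INTO offsets VALUES (?, ?) "
            "ON CONFLICT(part) DO UPDATE SET last_offset = excluded.last_offset",
            (part, offset),
        )
        return True

conn = sqlite3.connect(":memory:")
init(conn)
print(transactional_sink(conn, 0, 5, "a"))  # True  - first delivery is written
print(transactional_sink(conn, 0, 5, "a"))  # False - redelivery is skipped
```

If the process crashes mid-transaction, the database rolls back both writes together, so on restart the offset table still points at the last fully processed message.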
InfoQ: Distributed ephemeral pipelines can make debugging complex. What observability features, logging, and monitoring capabilities have you built into the Lithium platform to help developers and operators quickly identify and resolve real-time issues?
Mishra: Debugging ETL jobs, which operate across multiple services, can indeed pose challenges without the proper tooling support from the platform. To address this, we developed our own monitoring tool called Lithium Lens. As the name suggests, it is designed to diagnose issues in running jobs and can also be utilized to check their status.
Lithium Lens is a user-friendly application offering significant insight and visibility into the platform and individual jobs. Its most valuable features include Pipeline Visualization, which presents a diagram of all the data plane components involved in a job; Stream Topology Visualizer, which renders stream topologies as diagrams and shows the input/output throughput across their nodes; and Timeline View, which offers a comprehensive timeline of the submitted job, complete with all relevant information.
In addition to these features, we also maintain SignalFx monitoring dashboards for each job running on the platform. Each job is assigned a monitoring ID, and all job-level metrics are tagged with it. This allows us to compile the important metrics and charts into a per-job dashboard, linked conveniently from Lithium Lens.
InfoQ: Given Lithium's current capabilities, what future enhancements do you envision for the platform?
Mishra: One of the unique selling points of the platform is its remediation capabilities. The platform allows for updates to Workplans at runtime, enabling modifications to the transformers and validators utilized in the Workplan. This functionality empowers users to rectify invalid data by incorporating the necessary transformers without restarting the Workplan from the beginning.
However, implementing this requires the addition of new transformers and validators to the appropriate service, followed by a deployment process for that service. This deployment can take considerable time, depending on the specific services involved.
As a next step, we are focused on developing capabilities that will allow new processors to be integrated into the Workplan without necessitating a comprehensive deployment process for individual services. This enhancement will immensely benefit support staff and on-call engineers dealing with production issues.