At QCon San Francisco 2019, Chris Riccomini presented "The Future of Data Engineering". The key takeaway of his talk was that data engineering should work toward an end goal: a fully automated, decentralized data warehouse.
Riccomini stated that his intention with this talk was to provide a survey of the current state of the art for data pipelines, ETL, and data warehousing, and to look ahead at where data engineering may be heading:
The two primary areas where I think we're going are towards more real-time data pipelines and towards decentralized and automated data warehouse management.
Riccomini, a software engineer at WePay, provided a view of the future of data engineering based on his blog post from July of this year. He began by offering his definition of data engineering: "a data engineer's job is to help an organization move and process data". In his view, "move" refers to streaming or data pipelines, and "process" refers to data warehouses and stream processing.
In his talk Riccomini provided an overview of the various stages of data engineering, from an initial "none" stage up to a "decentralization" stage:
- Stage 0: None
- Stage 1: Batch
- Stage 2: Realtime
- Stage 3: Integration
- Stage 4: Automation
- Stage 5: Decentralization
Each stage reflects the situation an organization is in, and each comes with its own challenges. Riccomini described the stages by following the journey WePay made to reach the final stage of a fully decentralized and automated warehouse management system. He pointed out that the stages can give an organization a perspective on where it currently stands and suggest what its future could look like.
Furthermore, Riccomini said that WePay is at a particular stage -- some companies are further ahead, and some behind, but these stages can help to build a roadmap. The first stage, "none", as Riccomini classifies it, is when an organization is small, has a monolithic architecture, and needs a data warehouse quickly. WePay was at this stage in 2014, and faced problems such as query timeouts and a lack of sophisticated analytic features.
The next stage is "batch", where organizations still have a monolithic architecture but need to scale and require more features such as reports, charts, and business intelligence. Riccomini said WePay was at the batch stage in 2016:
If you want something up and running, it's a really nice place to start with.
However, an organization can run into issues as it grows, such as workflow timeouts and database operations impacting the pipelines.
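WePay's batch pipeline ran on Airflow, as noted later in the talk. As a minimal sketch of what such a scheduled batch load might look like, the example below defines a daily Airflow DAG; the DAG name, table, and export logic are hypothetical and not taken from the talk.

```python
# Minimal sketch of a batch-stage pipeline on Apache Airflow.
# The table name and export/load logic are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def export_orders_to_warehouse(**context):
    # Hypothetical: dump the day's slice of the monolith's `orders` table
    # and load it into the data warehouse (e.g. BigQuery).
    print(f"Exporting orders for {context['ds']}")


with DAG(
    dag_id="orders_daily_load",
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",  # batch: one load per day
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    PythonOperator(
        task_id="export_orders",
        python_callable=export_orders_to_warehouse,
    )
```

Because such loads run on a fixed schedule, growing data volumes eventually push individual runs past their time budget, which is the kind of workflow timeout Riccomini described.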
Next, Riccomini discussed the real-time stage, which he considers the modern era of data engineering. At this stage, data engineering is a first-class citizen, and the organization has an Apache Kafka-like infrastructure. It is the stage WePay was at in 2017, captured in a blog post by Riccomini.
Still, as Riccomini stated, there were problems with the real-time setup:
- The pipeline for the datastore was still on Airflow
- No pipeline at all for Cassandra or Bigtable
- BigQuery needed logging data
- Elasticsearch needed data
- Graph DB needed data
Next in the talk, he presented the "integration" stage, where the architecture is no longer a monolith. In order to reduce the number of systems being used, Riccomini said integration is necessary. WePay leverages Apache Kafka, together with Waltz. WePay is currently at the integration stage; however, the architecture has become complex, as WePay onboards more and more systems onto Kafka. Hence, WePay started to think about automation, which is the subsequent stage.
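To make the integration stage concrete, the sketch below shows how a downstream loader might consume change events from a Kafka topic using the confluent-kafka client; the topic name, consumer group, and sink logic are illustrative assumptions rather than WePay's actual configuration.

```python
# Illustrative consumer for the integration stage: services publish change
# events to Kafka, and downstream systems consume and apply them.
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "warehouse-loader",        # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments.orders"])    # hypothetical change-event topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        change_event = json.loads(msg.value())
        # Hypothetical sink: apply the change to the warehouse table.
        print("Applying change:", change_event)
finally:
    consumer.close()
```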
Riccomini explained that the "automation" stage covers two types of automation:
- Automated operations, such as creating and configuring Kafka topics, creating BigQuery views, and leveraging automation tools such as Terraform, Salt, Chef, and Spinnaker (see the sketch after this list).
- Automated data management, by setting up a data catalog, including schema, versioning, and encryption, and configuring access through policies for Role-Based Access Control (RBAC), Identity and Access Management (IAM), and Access Control Lists (ACLs), again leveraging tools like Terraform.
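As a sketch of the first type of automation, the snippet below creates Kafka topics programmatically with the confluent-kafka AdminClient instead of by hand; the topic names and settings are hypothetical, and in practice tools such as Terraform or Spinnaker would drive this kind of provisioning.

```python
# Illustrative automated operations: provision Kafka topics from code.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topics = [
    NewTopic("payments.orders", num_partitions=6, replication_factor=3),
    NewTopic("payments.refunds", num_partitions=6, replication_factor=3),
]

# create_topics returns a dict of topic name -> future; wait on each result.
for topic, future in admin.create_topics(topics).items():
    try:
        future.result()
        print(f"Created topic {topic}")
    except Exception as exc:
        print(f"Failed to create {topic}: {exc}")
```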
With automated data management, Riccomini pointed out that regulations like GDPR, SOX, and HIPAA play a role, and organizations need to ask the following questions:
- Who gets access to this data?
- How long can this data be persisted?
- Is this data allowed in this system?
- Which geographies must data be persisted in?
- Should columns be masked?
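As an illustration (not from the talk), the answers to these questions could be captured as a machine-readable policy that automation then enforces; all field names below are hypothetical.

```python
# Hypothetical data-management policy encoding the questions above.
DATASET_POLICY = {
    "dataset": "payments.orders",
    "readers": ["analytics-team"],              # who gets access
    "retention_days": 365,                      # how long data may persist
    "allowed_systems": ["bigquery", "kafka"],   # where the data may live
    "allowed_regions": ["us"],                  # geographic restrictions
    "masked_columns": ["card_number", "ssn"],   # columns to mask
}
```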
However, Riccomini said that automation still requires data engineers to manage configuration and deployment.
Lastly, Riccomini went on to explain the final stage, "decentralization", in which an organization has a fully automated real-time data pipeline. However, the question is: does it require one team to manage this? According to Riccomini, the answer is "no"; he stated that in the future, multiple data warehouses will be able to be set up and managed by different teams. In his view, traditional data engineering will evolve from a monolithic data warehouse to so-called "microwarehouses", where every team manages its own data warehouse.
Riccomini has published the slides of his presentation. Additionally, this and other presentations at the conference were recorded and will be available on InfoQ over the coming months. The next QCon, QCon London 2020, is scheduled for March 2-6, 2020.