Key Takeaways
- Data lineage is a critical component of the data pipeline root cause and impact analysis workflow.
- To better understand the relationship between source and destination objects in the data warehouse, data teams can use field-level lineage.
- The complexity of SQL queries can make it challenging to build field-level lineage manually.
- Open source tools like ANother Tool for Language Recognition (ANTLR) can be used to automate the parsing of query logs for root cause analysis.
- Elasticsearch and the open source ELK stack provide a powerful and flexible database for storing metadata and query logs for lineage.
- Automating lineage creation and abstracting metadata down to the field-level cuts down on the time and resources required to conduct root cause analysis.
"What happened to this dashboard?"
"Where did our column go?"
"Why is my data ... wrong?"
If you’ve been on the receiving end of these urgent messages, late night phone calls, or frantic emails from business stakeholders, you’re not alone. Data engineers are no strangers to the schema changes, null values, and distribution errors that plague even the healthiest data systems. In fact, as data pipelines become increasingly complex and teams adopt more distributed architectures, such data quality issues only multiply.
While data testing is an important first step when it comes to preventing these issues, data lineage has emerged as a critical component of the data pipeline root cause and impact analysis workflow. Akin to how Site Reliability Engineers or DevOps practitioners might leverage commands like git blame to understand where software broke in the context of larger systems, data lineage gives data engineering and analytics teams visibility into the health of their data at each stage in its lifecycle, from ingestion in the warehouse or lake to eventual analysis in the business intelligence layer. As part of a larger approach to data observability, lineage is critical when it comes to understanding the ins and outs of broken data.
Data lineage refers to a map of the dataset’s journey throughout its lifecycle, from ingestion in the warehouse or lake to visualization in the analytics layer. In short, data lineage is a record of how data got from point A to point B. In the context of data pipelines, lineage traces the relationships across and between upstream source systems (i.e., data warehouses and data lakes) and downstream dependencies (e.g., analytics reports and dashboards), providing a holistic view of data as it evolves. Lineage also highlights the effect of system changes on associated assets, down to individual columns.
Figure 1. TABLE-LEVEL LINEAGE. Table-level lineage with upstream and downstream connections between objects in the data warehouse and tables.
Due to the complexity of even the most basic SQL queries, however, building data lineage from scratch is no easy feat. Historically, lineage has been mapped manually, requiring an almost encyclopedic knowledge of a company’s data environment and how each component interacts with the others.
Adding further complexity to the equation, keeping manual lineage up to date becomes more challenging as companies ingest more data, layer on new solutions, and make data analysis more accessible to additional users through codeless analytics tools and other reporting software.
In this article, we walk through our journey to building field-level lineage for our customers, focusing on our backend architecture, key use cases, and best practices for data engineering teams planning to build field-level lineage for their own data systems.
High-level business requirements
Our company, Monte Carlo, is the creator of a popular data observability platform that notifies data engineers when pipelines break. We work with companies like Fox, Intuit, and Vimeo to alert them to, root cause, and fix data quality issues before they become a serious problem for the business, from broken dashboards to inaccurate ML models. Such data quality issues can have big ramifications, including lost revenue, poor decision making, model drift, and wasted time.
For the past several years, data teams across industries have relied on table-level lineage to conduct root cause analysis (RCA) and impact analysis of upstream data issues on downstream tables and reports. While useful at the macro level, table-level lineage doesn’t provide teams with the granularity they need to understand exactly why or how their data pipelines break.
As with building any new product functionality, the first step was to understand user requirements, and from there, scope out what could actually be delivered in a reasonable timeframe. After speaking with our customers, we determined that our solution required a few key features:
- Fast time to value: Our customers wanted to quickly understand the impact of code, operational, and data changes on downstream fields and reports. We needed to abstract the relationships between data objects down to the field-level; the table level was too broad for quick remediation.
- Secure architecture: Our customers did not want lineage to access user data or PII directly. We required an approach that would access metadata, logs, and queries, but keep the data itself in the customer’s environment.
- Automation: Existing field-level lineage products often take a manual approach, which puts more responsibility in the lap of the customer; we realized that there was a quicker way forward, with automation, that would also update data assets based on changes in the data lifecycle.
- Integrations with popular data tools: We needed a knowledge graph that could automatically generate nodes across an entire data pipeline, from ingestion in the data warehouse or lake down to the business intelligence or analytics layer. Many of our customers required integration with data warehouses and lakes (Snowflake, Redshift, Databricks, and Apache Spark), transformation tools (dbt, Apache Airflow, and Prefect), and business intelligence dashboards (Looker, Tableau, and Mode), which would require us to generate every possible join and connection between every table in their data system.
- Extraction of column-level information: Our existing table-level lineage was mainly derived from parsing query logs, which couldn’t extract parsed column information, the metadata necessary to help users understand anomalies and other issues in their data. For field-level lineage, we’d need to find a more efficient way to parse queries at scale.
Based on this basic field-level lineage, we could also aggregate the metadata in further steps to serve different use cases, such as operational analytics. For example, we could pre-calculate for a given table and each of its fields how many downstream tables are using the field. This would be particularly useful when it came to identifying the impact of data quality issues on downstream reports and dashboards in the business intelligence layer.
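As an illustration, here is a minimal sketch of that kind of pre-calculation, assuming a simplified, hypothetical edge structure rather than our production model:

```typescript
// Hypothetical field-level lineage edge: one source column feeding one
// destination table. This shape is illustrative only.
interface LineageEdge {
  sourceTable: string;
  sourceField: string;
  destinationTable: string;
}

// For a given table, pre-calculate how many distinct downstream tables
// consume each of its fields.
function downstreamTableCounts(
  table: string,
  edges: LineageEdge[],
): Map<string, number> {
  const consumers = new Map<string, Set<string>>();
  for (const edge of edges) {
    if (edge.sourceTable !== table) continue;
    const tables = consumers.get(edge.sourceField) ?? new Set<string>();
    tables.add(edge.destinationTable);
    consumers.set(edge.sourceField, tables);
  }
  return new Map([...consumers].map(([field, tables]) => [field, tables.size]));
}
```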
After all, who doesn’t want a painless root cause analysis?
Use cases
At its most basic, our field-level lineage can be used to massively reduce the time to detection (TTD) and time to resolution (TTR) of data quality issues, with the goal of bringing down the total amount of time it takes customers to root cause issues in their data pipelines. In an analytics capacity, data lineage can be used for a variety of applications, including:
- Review suspicious numbers in a revenue report. One of our customers, a 400-person FinTech company, generates monthly revenue forecasts using data collected and stored in Snowflake and visualized in Looker. They used field-level lineage to trace which table in their warehouse held the source field for the "suspicious" numbers in one such report, and through this process, realized that the culprit for the data issue was a dbt model that failed to run.
- Reduce data debt. Many of our customers leverage our data observability solution to deprecate columns in frequently used data sets to ensure that outdated objects aren’t being used to generate reports. Field-level lineage makes it easy for them to identify if a given column is linked to downstream reports.
- Managing personally identifiable information (PII). Several of our customers deal with sensitive data and need to know which columns with PII are linked to destination tables in downstream dashboards. By quickly connecting the dots between columns with PII and user-facing dashboards, customers can remove the information or take precautions to deprecate or hide the dashboard from relevant parties.

These use cases just scratch the surface of how our customers leverage field-level lineage. By integrating it with their existing root cause analysis workflows, getting to the bottom of these questions can save time and resources for analysts and engineers across their companies.
Solution architecture
Figure 2. LINEAGE GRAPH. Field-level lineage with hundreds of connections between objects in upstream and downstream tables.
When it came to actually building field-level data lineage for our platform, the first thing we needed to do was architect a way to understand which columns belonged to which source tables. This was a particularly challenging task given that most data transformations leverage more than one data source. Further complicating matters, we needed to recursively resolve the original sources and columns in the event that some of the source tables were aliases of existing subqueries derived from other subqueries.
The sheer number of possible SQL clause combinations made it extremely difficult to cover each and every possible use case. In fact, our original prototype only covered about 80% of possible combinations. So, to cover every possible clause and clause combination across all of our data warehouse, data lake, extract, transform, load (ETL), and business intelligence integrations (dozens of tools!), we chose to individually test each clause with one another and ensure that our solution still worked as intended before moving on to the next use case.
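To give a flavor of the recursive resolution mentioned above, here is a hedged sketch under simplified assumptions: each subquery alias maps its output columns to the references underneath it, and resolution recurses until it bottoms out at a real table. The `Scope` structure is illustrative, not what our parser actually produces:

```typescript
// A column reference: a table (or subquery alias) plus a column name.
interface ColumnRef {
  table: string;
  column: string;
}

// Hypothetical scope: subquery alias -> (output column -> underlying reference).
type Scope = Map<string, Map<string, ColumnRef>>;

// Follow aliases recursively until we reach a column on a real table.
function resolveSource(ref: ColumnRef, scope: Scope): ColumnRef {
  const subquery = scope.get(ref.table);
  if (!subquery) return ref; // not an alias: this is a real table
  const underlying = subquery.get(ref.column);
  return underlying ? resolveSource(underlying, scope) : ref;
}
```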
Figure 3. FIELD-LINEAGE ARCHITECTURE. The back-end architecture of our field-level lineage solution, built on top of Snowflake and Elasticsearch.
Data model
Figure 4. FIELD-LEVEL LINEAGE STRUCTURE. The structure of our field-level lineage includes several downstream destination fields per upstream table.
At a foundational level, the structure of our lineage incorporates three elements:
- The destination table, stored in the downstream report
- The destination fields, stored in the destination table
- The source tables, stored in the data warehouse
As previously mentioned, there is a virtually unlimited number of possible relationships between destination and source objects, which required us to leverage a data model flexible enough to capture multiple queries at once.
We decided to use a logical data model, with the table_mcon ID and the hashed field-level lineage object together forming the ID for the document. For the same destination table, there could be several different queries that update it. Using the destination table’s mcon and the hashed field-level lineage object, we are able to capture all the different lineage combinations for a given destination table.
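A minimal sketch of that document-ID scheme, assuming a SHA-256 hash over the serialized lineage object (the serialization and hash choice here are illustrative):

```typescript
import { createHash } from "crypto";

// Build the document ID from the destination table's mcon and a hash of the
// field-level lineage object. A production version would canonicalize key
// order before serializing.
function lineageDocumentId(tableMcon: string, lineage: object): string {
  const digest = createHash("sha256")
    .update(JSON.stringify(lineage))
    .digest("hex");
  return `${tableMcon}:${digest}`;
}
```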
Below, we share one of our proposed index schemas:
Figure 5. FIELD-LEVEL LINEAGE QUERY. Lineage query between a destination (analytics report) and one or more source tables in the warehouse.
In our lineage model, we have one destination table. For each of the fields in the destination table, there is a list of source tables and source columns that define the field, referred to as selected fields. Our model also contains another list of source tables and columns containing the non-selected fields.
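Concretely, a single lineage document under this model might look like the following shape; the field names are ours for the sketch, not the production index schema:

```typescript
// One source column in some source table.
interface SourceRef {
  source_table: string;
  source_column: string;
}

// One lineage document, keyed by destination table.
interface FieldLineageDocument {
  destination_table_mcon: string; // unique ID of the destination table
  selected_fields: Array<{
    destination_field: string; // a field in the destination table
    sources: SourceRef[]; // the source columns that define its values
  }>;
  non_selected_fields: SourceRef[]; // columns that shape rows (WHERE, JOIN, ...)
}
```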
Figure 6. DATA MODEL. Our lineage’s data model exposes the relationship between source tables and destination tables.
In our case, the model incorporates one denormalized data structure that contains edges from the fields in a destination table to their source fields in one or more source tables.
Figure 7. JSON QUERY. A JSON query that simplifies a complex WHERE query to identify the root cause of a data quality issue, down to the field level.
Above, we offer a real example of how field-level lineage can 'simplify' a complex query. The WITH clause contains nine temporary tables, with some of the tables using other temporary tables defined before them. Additionally, in the main query, there could be joins between real tables, temporary tables declared in the WITH clause, and subqueries.

In each query or subquery’s SELECT clause, there are fields that apply additional functions, expressions, and subqueries. In even more complex examples, lineage can reflect queries that have many nested layers of subqueries, and even more complex expressions.
The red rectangle shows the selected fields in the lineage derived from this query. The selected fields are the fields that define the result table. The fields in the purple rectangle are extracted as non-selected fields. Non-selected fields have an impact on the rows fetched from source tables, but they don’t contribute to the field values in the result table. Keeping unaffected lineage out of view makes for a more intuitive UI and a quicker root cause analysis process.
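As a scaled-down illustration of the distinction, consider the hypothetical query below, far simpler than the nine-CTE example in the figure:

```typescript
// A hypothetical query with a single temporary table in its WITH clause.
const query = `
  WITH recent_orders AS (
    SELECT order_id, customer_id, amount
    FROM orders
    WHERE created_at >= '2023-01-01'
  )
  SELECT c.name, SUM(r.amount) AS total_spend
  FROM recent_orders r
  JOIN customers c ON r.customer_id = c.id
  GROUP BY c.name
`;

// Selected fields (they define the result table's values):
//   customers.name -> name
//   orders.amount  -> total_spend
// Non-selected fields (they shape which rows are fetched, not the values):
//   orders.created_at, orders.customer_id, customers.id
```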
Database and deployment details
Compared to building table-level lineage, field-level lineage is far more challenging. To query the field-level lineage data efficiently, we needed to store it in Elasticsearch or a graph database. Since we already leveraged Elasticsearch, we decided it would be a good fit for our use case.
We chose to store our denormalized data structure in Elasticsearch, which enables easy and flexible querying. While our Elasticsearch instance stores metadata about our customers’ data, the data reflected in these lineage graphs remains in the customer’s environment.
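For example, here is a hedged sketch of the kind of lookup this design enables: finding every lineage document in which a given warehouse column feeds a destination field. It assumes the illustrative schema above, with `selected_fields.sources` mapped as a nested field, and a hypothetical index name:

```typescript
import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

async function findLineageForColumn(table: string, column: string) {
  // Nested query so that source_table and source_column must match on the
  // same SourceRef object, not across different ones.
  const result = await client.search({
    index: "field_lineage", // hypothetical index name
    query: {
      nested: {
        path: "selected_fields.sources",
        query: {
          bool: {
            filter: [
              { term: { "selected_fields.sources.source_table": table } },
              { term: { "selected_fields.sources.source_column": column } },
            ],
          },
        },
      },
    },
  });
  return result.hits.hits.map((hit) => hit._source);
}
```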
Figure 8. HYBRID DEPLOYMENT MODEL. A hybrid deployment model that allows our field-level lineage to store non-sensitive metadata and keep customer data in their environment.
Our approach to deploying the field-level lineage incorporates two key parts: (1) a control plane managed by the Monte Carlo team, and (2) a data plane in the customer’s environment.
The control plane hosts the majority of the lineage’s business logic and handles non-sensitive metadata. It communicates with the data plane and delegates sensitive operations (such as processing, storing, or deleting data) to it. The control plane also provides web and API interfaces, and monitors the health of the data plane. The control plane runs entirely in the Monte Carlo environment and typically follows a multi-tenant architecture, though some vendors offer a single-tenant control plane (often at a price premium) that runs in a customer-dedicated, completely isolated VPC.
The data plane typically processes and stores all of the customer’s sensitive data. It must be able to receive instructions from the control plane, and pass back metadata about its operations and health. The interface between the control and data plane is implemented by a thin agent that runs in the customer’s environment.
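As a rough sketch of that contract, the agent interface could be as thin as the following; the operation names are illustrative, not our actual agent API:

```typescript
// Commands flow from the control plane to the agent in the data plane.
interface AgentCommand {
  operation: "collect_query_logs" | "delete_data" | "health_check"; // illustrative
  payload?: Record<string, unknown>;
}

// Responses carry metadata and health information only, never raw customer data.
interface AgentResponse {
  status: "ok" | "error";
  metadata?: Record<string, unknown>;
}
```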
At its essence, separating the customer’s data from the managed software gives our customers greater agility while data compliance and security remain in the hands of the customer.
Other technologies used in our solution
To build the rest of our architecture, we leveraged multiple technologies, including:
- AWS & Snowflake. We built our field-level lineage functionality on top of AWS (one of our platform’s public cloud hosts) and a Snowflake data warehouse. We chose Snowflake as the processing power behind our lineage due to its flexibility and compatibility with our customers’ data architectures.
- Queryparser and ANTLR. As mentioned earlier, we leveraged a queryparser with a Java-based lambda function to parse through the query logs and generate lineage data. We support basic Redshift, BigQuery, AWS Athena, Snowflake, Hive, and Presto queries by defining the grammars in ANTLR.
- AWS Kinesis. Whenever we collect queries, we process them in real time. To do this, we use Kinesis to stream the data to different components of our normalizer, so we can parse and resolve tables in real time (see the sketch after this list). We then store the resulting lineage data in Elasticsearch. We chose Elasticsearch over Postgres, Amazon DynamoDB, and other solutions due to its ability to scale and support a broader variety of queries.
- Homegrown data collector. We generated our automatic lineage based on queries executed at our clients’ data warehouses and lakes. We utilized a data collector to extract all the query histories from the client’s data center, then processed them once the data entered our system.
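Below is a minimal sketch of the Kinesis step referenced above, using the AWS SDK for JavaScript v3; the stream name and record shape are hypothetical:

```typescript
import { KinesisClient, PutRecordCommand } from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "us-east-1" });

// Publish one collected query-log record for downstream parsing and
// table resolution.
async function publishQueryLog(queryText: string, warehouseId: string) {
  await kinesis.send(
    new PutRecordCommand({
      StreamName: "query-log-stream", // hypothetical stream name
      PartitionKey: warehouseId, // keeps one warehouse's queries ordered
      Data: new TextEncoder().encode(JSON.stringify({ warehouseId, queryText })),
    }),
  );
}
```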
By leveraging a mix of proprietary and open source technologies, we built a production version of field-level lineage that met the needs for detail and abstraction necessary to make root cause analysis more intuitive and seamless for our customers.
We could have taken several different paths to build our backend, and there were no obvious answers, at least at the outset. For our broader data observability architecture, we already used ANother Tool for Language Recognition (ANTLR) as a queryparser generator to read, process, execute, and translate structured text or binary files. After consulting with our broader engineering team, we decided that we could utilize ANTLR for field-level lineage as well.
By experimenting with our table-level queryparser and defining ANTLR grammars to extract columns, we were able to access the field-level relationships for more basic SELECT clause use cases. From there, we were confident that we could build a fully functional backend.
However, we quickly ran into parsing performance issues when working with complex queries. In some cases, the queries were too long to be easily parsed: some used WITH clauses that defined subqueries, and those subqueries were referenced in the main queries themselves. Dialects also introduced their own ambiguities: for example, in some dialects a column without quotes around it is parsed as a column, while a quoted token is parsed as a string. To fix this, we modified the grammar of our query log parser to better support Presto, Redshift, Snowflake, and BigQuery, each with its own parsing nuances and complexities.
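To make the quoting nuance concrete, here is a simplified classifier of our own, not the ANTLR grammar itself; the dialect behavior is summarized from their documented quoting rules:

```typescript
type Dialect = "snowflake" | "redshift" | "presto" | "bigquery";

// Classify a token as a column identifier or a string literal. In ANSI-style
// dialects (Snowflake, Redshift, Presto), double quotes delimit identifiers;
// in BigQuery's standard SQL, double quotes delimit string literals and
// backticks delimit identifiers.
function classifyToken(token: string, dialect: Dialect): "column" | "string" {
  if (token.startsWith("'")) return "string"; // single quotes: always a string
  if (token.startsWith('"')) {
    return dialect === "bigquery" ? "string" : "column";
  }
  if (token.startsWith("`")) return "column"; // BigQuery-style identifier
  return "column"; // bare word: an identifier
}
```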
This SQL query complexity manifested itself in another design challenge: architecting our user interface. To build useful lineage, we needed to ensure that our solution provided enough rich context and metadata about the represented tables and fields without burdening the user with superfluous information.
Between the backend and frontend complexities, we’d need to abstract away this spider web of relationships and possible interactions in order to deliver on our vision of offering a truly powerful product experience for our customers. We’d need to architect a Magnolia tree with only the most relevant blossoms, leaves, and roots showing.
User interface
When it came to building the frontend interface, we needed to decide which technologies to use and determine the most useful and intuitive way to display field-level lineage to our customers. To strike the right balance between showing too much and too little information (the goldilocks zone for data lineage, as we liked to call it), we again turned to our customers for early feedback.
We also wanted to augment lineage as opposed to just automating it. In other words, we needed to determine whether we could automatically highlight which connections were likely to be relevant to a given use case or issue. After all, the most effective iteration of our product wouldn’t just surface information; it would surface the right information at the right time.
To answer these questions, we released beta versions of our lineage product with the most upstream layer of information and the most downstream layer of information for several customers, asking them whether or not the feature fell within their goldilocks zone. Our larger data observability product serves data engineers and data analysts across cloud and hybrid environments, helping them manage analytical data across hundreds to thousands of sources.
Consistently, customers considered two layers (the upstream data source and the downstream data field) to be the ideal number for root cause analysis. If users wanted a full view of all the field-level relationships in a specific pipeline, they would simply need to click on the fields, then click through associated connections to those fields to traverse the field-level relationships layer by layer.
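A hedged sketch of that layer-by-layer expansion, assuming a simple, hypothetical edge list keyed by `table.field` strings:

```typescript
// One field-to-field lineage edge, keyed by "table.field" strings.
interface FieldEdge {
  upstream: string;
  downstream: string;
}

// Given the fields currently visible, return the next layer of connected
// fields in one direction, i.e., one click's worth of expansion.
function expandOneLayer(
  visible: Set<string>,
  edges: FieldEdge[],
  direction: "upstream" | "downstream",
): Set<string> {
  const next = new Set<string>();
  for (const edge of edges) {
    if (direction === "downstream" && visible.has(edge.upstream)) {
      next.add(edge.downstream);
    }
    if (direction === "upstream" && visible.has(edge.downstream)) {
      next.add(edge.upstream);
    }
  }
  return next;
}
```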
Figure 9. FIELD-LEVEL LINEAGE. Field-level lineage UI featuring the source table and the destination reports.
This exercise taught us something interesting: our customers care most about either (a) the most downstream layer (business intelligence objects, such as reports in tools like Looker or Tableau) or (b) the most upstream layer (the source table or field stored in the warehouse or lake, which is frequently the root cause of the issue). The middle layers (i.e., data transformations) were deemed less important when it came to conducting early root cause analysis.
The most downstream layer, BI reports and dashboards, comprises the end products that data consumers use for their day-to-day work. This matters because customers want to know the direct impact on consumer-facing data products, so they know how to communicate with end users, e.g., "Hey finance analyst, don't use this number; it's outdated."
The most relevant upstream layer, the source table in the warehouse or lake, is leveraged when customers trace the lineage layer by layer and find the upstream table or field with a data quality issue. Once they find it, they can fix the issue in that table or field and solve the problem.
To write the field-level lineage UI, we created a reusable component using React and TypeScript to make sure we could easily port the field lineage UI to other parts of the product or even to other Monte Carlo products in the future. To display both the upstream and downstream lists, we used React Virtuoso, a React component library that makes it easy to render large data sets efficiently.
Since fields could potentially have tens of upstream or downstream tables, which in turn could have hundreds or thousands of fields, rendering all of those components without affecting the performance of the app was crucial. Virtuoso, which supports lists with items of different sizes, gave us the flexibility to only render the relevant components for a given lineage graph.
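A minimal sketch of what such a virtualized list looks like with React Virtuoso; the `Field` shape and styling are illustrative, not our production component:

```tsx
import { Virtuoso } from "react-virtuoso";

// Illustrative field shape for one row in the lineage list.
interface Field {
  table: string;
  name: string;
}

// Render a potentially huge list of upstream or downstream fields; Virtuoso
// only mounts the rows that are currently in view.
export function FieldList({ fields }: { fields: Field[] }) {
  return (
    <Virtuoso
      style={{ height: 400 }}
      data={fields}
      itemContent={(_index, field) => (
        <div>
          {field.table}.{field.name}
        </div>
      )}
    />
  );
}
```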
How our field-level lineage works
Figure 10. LINEAGE QUERY DETAILS. A lineage query that helps users understand table-to-field relationships.
Currently, our product displays two types of field relationships:
- SELECT clause lineage: field relationships defined by the SQL clause SELECT; these are field-to-field relationships where a change in an upstream field directly changes the downstream field.
- Non-SELECT lineage: field relationships defined by all other SQL clauses, e.g., WHERE; these are field-to-table relationships, where the downstream fields are often shaped by a filtering or ordering logic defined by upstream fields.
For a chosen field, its upstream non-SELECT lineage fields are the filtering/ordering fields that produce the chosen field, and its downstream non-SELECT lineage fields are the fields that result from the filtering/ordering logic the chosen field defines.
Lessons learned
Like any agile development process, our experience building field-level lineage was an exercise in prioritization, listening, and quick iteration. Aside from some of the technical knowledge we gained about how to abstract away query log complexity and design ANTLR frameworks, the vast majority of our learnings can be applied to the development of nearly any data software product:
- Listen to your teammates and take everyone’s advice into consideration: When we began to work with the query parser, we underestimated how challenging it would be to parse queries. One of our teammates had previous experience working with the parser, warned us of its quirks, and suggested we might build a different one to complete the task. If we had listened to our teammate early on, we would have saved a good chunk of time.
- Invest in prototyping: Our customers are our north star. Whenever we create new product features, we take care to consider their opinions and preferences. Doing so effectively requires sharing a product prototype. To speed up the feedback cycle and make these interactions more useful for both parties, we shipped an early prototype to some of our most enthusiastic champions within weeks of development. While this first iteration was not perfect, it allowed us to demonstrate our commitment to meeting customer demands and gave us some early guidance as we further refined the product.
- Ship and iterate: This is a common practice in the software engineering world and something we take very seriously. Time is of the essence, and when we prioritize one project we have to ensure we are optimizing the time of everyone involved in that project. When we began working on this feature, we didn’t have time to make our product "perfect" before showing our prototype to customers, and moving forward without perfection allowed us to expedite development. Our repeatable process included building out the functionality of the feature, showing it off to our customers, asking for feedback, then making improvements and/or changes where needed.
We predict that more and more data teams will start adopting automatic lineage and other knowledge graphs to identify the source of data quality issues as the market matures and data engineering teams look to increase efficiencies and reduce downtime.
Until then, here’s wishing you as few "what happened to my data?!" fire drills as possible.