Data lineage is the process of tracking data flow over time to understand its origin, transformation, and final destination in a data pipeline. By documenting the relationship between enterprise data in various business and IT applications, data lineage helps provide a comprehensive record of data throughout its lifecycle.
Data lineage provides a complete picture of an organization’s data story from the source through all processes and changes to storage or consumption. It aims to show the data flow from start to finish, including all transformations the data underwent, what changed, and why. Different data lineage implementation techniques produce different outcomes - read our guide to make the right choice for your business.
Data lineage is commonly used to gain context about historical processes as well as trace errors back to their root cause. It helps users make sure their data is coming from a trusted source, has been transformed correctly, and is loaded to the specified location. It provides a stepwise record of how data arrived at its current form, including both transformations made to the data and its journey through different business systems.
Components of data lineage
Data lineage components include:
-
Data Origin: Tracking data generated, uploaded, and altered by business users and applications, and identifying the first instance of that data or its source. This involves listing all elements and tracking them back to their origin one by one.
-
Data Flow: Tracking the complete data flow from start to finish and documenting the relationship between enterprise data in various business and IT applications. The entire data flow is mapped to understand, document, and visualize data at all stages.
-
Data Transformations: Data lineage helps to model all transformations the data underwent, how the data was transformed, what changed, and why and and illustrates the different dependencies across the data ecosystem.
-
Metadata: Metadata provides information about the data assets, such as the type, format, structure, author, date created, date modified, and file size. Metadata allows users of data lineage tools to fully understand how data flows through the data pipeline.
Different approaches to implementing data lineage
There are different approaches to implementing data lineage, including a manual approach and an automated approach.
The manual approach involves identifying the data elements, tracking their origin, noting sources and links, and creating a map. This approach can be time-consuming but is still a must-have for any business.
The automated approach uses data lineage tools that track all of the components. These tools can easily sort and organize lineage data, saving time and money by comparison to manual approaches.
Data lineage is essential for businesses to ensure data quality, good decision-making, regulatory compliance, and audits. It provides a compliance mechanism for auditing, improving risk management, and ensuring data are stored and processed in line with data governance policies and regulations. Data lineage gives visibility into changes that may occur as a result of data migrations, system updates, errors, and more, ensuring data integrity throughout the data lifecycle.
Pick a comprehensive tool that will solve all your lineage problems.
Read our guide to data lineage technologies to understand what is meant by harvesting, mapping and visualising when it comes to lineage, along with a perspective on emerging technologies such as graph databases and machine learning.