AWS News Blog

Announcing the general availability of data lineage in the next generation of Amazon SageMaker and Amazon DataZone


Today, I’m happy to announce the general availability of data lineage in Amazon DataZone, following its preview release in June 2024. This feature is also available as part of the catalog capabilities in the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI.

Traditionally, business analysts have relied on manual documentation or personal connections to validate data origins, leading to inconsistent and time-consuming processes. Data engineers have struggled to evaluate the impact of changes to data assets, especially as self-service analytics adoption increases. Additionally, data governance teams have faced difficulties in enforcing practices and responding to auditor queries about data movement.

Data lineage in Amazon DataZone addresses the challenges faced by organizations striving to remain competitive by using their data for strategic analysis. It enhances data trust and validation by providing a visual, traceable history of data assets, enabling business analysts to quickly understand data origins without manual research. For data engineers, it facilitates impact analysis and troubleshooting by clearly showing relationships between assets and allowing easy tracing of data flows.

The feature supports data governance and compliance efforts by offering a comprehensive view of data movement, helping governance teams to quickly respond to compliance queries and enforce data policies. It improves data discovery and understanding, helping consumers grasp the context and relevance of data assets more efficiently. Additionally, data lineage contributes to better change management, increased data literacy, reduced data duplication, and enhanced cross-team collaboration. By tackling these challenges, data lineage in Amazon DataZone helps organizations build a more trustworthy, efficient, and compliant data ecosystem, ultimately enabling more effective data-driven decision-making.

Automated lineage capture is a key feature of data lineage in Amazon DataZone: it automatically collects and maps lineage information from AWS Glue and Amazon Redshift. This automation significantly reduces the manual effort required to keep lineage information accurate and up to date.

Get started with data lineage in Amazon DataZone
Data producers and domain administrators get started by setting up data source run jobs for the AWS Glue Data Catalog and Amazon Redshift sources in Amazon DataZone to periodically collect metadata from the source catalogs. Additionally, data producers can hydrate lineage information programmatically by creating custom lineage nodes through APIs that accept OpenLineage-compatible events. Existing pipeline components, such as schedulers, warehouses, analysis tools, and SQL engines, can send data about datasets, jobs, and runs directly to the Amazon DataZone API endpoint. As this information arrives, Amazon DataZone populates the lineage model and maps the events to assets already cataloged. As new lineage events are captured, Amazon DataZone maintains versions of previously captured events, so users can navigate to earlier versions if needed.
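As a sketch of what such a programmatic event might look like, the Python snippet below builds a minimal OpenLineage-style RunEvent. The job, namespace, and dataset names are hypothetical, and the boto3 PostLineageEvent call is shown commented out because it requires an Amazon DataZone domain and credentials:

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, input_table, output_table):
    """Build a minimal OpenLineage RunEvent (sketch; namespaces are hypothetical)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "my-pipeline", "name": job_name},
        "inputs": [{"namespace": "my-datalake", "name": input_table}],
        "outputs": [{"namespace": "my-datalake", "name": output_table}],
        "producer": "https://example.com/my-scheduler",
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


event = build_run_event("nightly_orders_etl", "raw.orders", "curated.orders")

# Send the event to Amazon DataZone (domain ID below is a placeholder):
# import boto3
# datazone = boto3.client("datazone")
# datazone.post_lineage_event(
#     domainIdentifier="dzd_1234567890",
#     event=json.dumps(event).encode("utf-8"),
# )
print(json.dumps(event, indent=2))
```

The same scheduler or SQL-engine hook that already emits OpenLineage events to other backends can typically be pointed at this endpoint without changing the event shape.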

From the consumer’s perspective, lineage helps with three scenarios. First, a business analyst browsing for an asset can go to the Amazon DataZone portal, search for an asset by name, and select the one that interests them to dive into the details. They are initially presented with the Business Metadata tab and can move through the neighboring tabs. To view lineage, the analyst goes to the Lineage tab, which shows the asset’s lineage one level upstream and downstream. To trace the asset back to its origin, the analyst can keep choosing upstream nodes until they reach the source. Once the analyst is sure this is the correct asset, they can subscribe to it and continue with their work.
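The upstream walk the analyst performs in the portal can also be done programmatically. The sketch below assumes a client exposing a get_lineage_node operation whose response lists upstreamNodes, modeled loosely on the Amazon DataZone GetLineageNode API; the stub client, node IDs, and response shape are hypothetical stand-ins for illustration:

```python
def find_source_nodes(client, domain_id, node_id, seen=None):
    """Walk upstream from a lineage node; return nodes with no further upstream (the sources)."""
    if seen is None:
        seen = set()
    if node_id in seen:          # guard against cycles in the graph
        return []
    seen.add(node_id)
    node = client.get_lineage_node(domainIdentifier=domain_id, identifier=node_id)
    upstream = node.get("upstreamNodes", [])
    if not upstream:             # no parents: this node is a source
        return [node_id]
    sources = []
    for parent in upstream:
        sources.extend(find_source_nodes(client, domain_id, parent["id"], seen))
    return sources


# Minimal in-memory stub standing in for a boto3 DataZone client,
# so the traversal can be demonstrated without an AWS account.
class StubClient:
    GRAPH = {
        "dashboard_tbl": {"upstreamNodes": [{"id": "curated_orders"}]},
        "curated_orders": {"upstreamNodes": [{"id": "raw_orders"}]},
        "raw_orders": {"upstreamNodes": []},
    }

    def get_lineage_node(self, domainIdentifier, identifier):
        return {"id": identifier, **self.GRAPH[identifier]}


sources = find_source_nodes(StubClient(), "dzd_example", "dashboard_tbl")
print(sources)  # -> ['raw_orders']
```

Against a real domain, the stub would be replaced with boto3's DataZone client; the traversal logic stays the same.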

Second, if a data issue is reported, for instance when a dashboard unexpectedly shows a significant increase in customer count, a data engineer can use the Amazon DataZone portal to locate and examine the relevant asset. On the asset details page, the data engineer navigates to the Lineage tab to view the upstream nodes of the asset in question. The engineer can dive into the details of each node, its snapshots, the column mapping between table nodes, the jobs that ran in between, and the query that was executed in each job run. Using this information, the data engineer can spot that a new input table was added to the pipeline, introducing the uptick in customer count, because the table wasn’t part of the previous snapshots of the job runs. This confirms that a new source was added and that the data shown in the dashboard is accurate.

Lastly, a steward responding to questions from an auditor can go to the asset in question and navigate to its Lineage tab. Traversing the graph upstream to see where the data comes from, the steward notices that the data originates from two different teams, for instance two different on-premises databases, each with its own pipeline, until the pipelines merge. While navigating the lineage graph, the steward can expand the columns to verify that sensitive columns are dropped during the transformation processes and respond to the auditors with details in a timely manner.

How Amazon DataZone automates lineage collection
Amazon DataZone now enables automatic capture of lineage events, helping data producers and administrators streamline the tracking of data relationships and transformations across their AWS Glue and Amazon Redshift resources. Automatic capture of lineage events from AWS Glue and Amazon Redshift is opt-in, because some of your jobs or connections might be for testing and you might not need lineage captured for them. With the integrated experience, these services provide an option in your configuration settings to opt in to collecting and emitting lineage events directly to Amazon DataZone.

These events should capture the various data transformation operations you perform on tables and other objects, such as table creation with column definitions, schema changes, and transformation queries, including aggregations and filtering. By obtaining these lineage events directly from your processing engines, Amazon DataZone can build a foundation of accurate and consistent data lineage information. This will then help you, as a data producer, to further curate the lineage data as part of the broader business data catalog capabilities.
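As an illustration, operations like the ones above (column definitions and transformation queries) are typically expressed as OpenLineage facets attached to the datasets and the job. The payload below is a hedged sketch with hypothetical table, column, and namespace names:

```python
import json

# Sketch of OpenLineage facets for one transformation: a schema facet
# describing the output table's column definitions, and a SQL job facet
# carrying the query that produced it.
output_dataset = {
    "namespace": "my-datalake",
    "name": "curated.orders",
    "facets": {
        "schema": {
            "fields": [
                {"name": "order_id", "type": "bigint"},
                {"name": "order_total", "type": "decimal(10,2)"},
            ]
        }
    },
}

job_facets = {
    "sql": {
        "query": "SELECT order_id, SUM(amount) AS order_total "
                 "FROM raw.orders GROUP BY order_id"
    }
}

print(json.dumps({"outputs": [output_dataset], "job": {"facets": job_facets}}, indent=2))
```

When the processing engine emits facets like these with each run, the catalog can render both the column-level structure and the exact query behind a lineage edge.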

Administrators can enable lineage when setting up the built-in DefaultDataLake or the DefaultDataWarehouse blueprints.

Data producers can view the status of automated lineage while setting up the data source runs.

With the recent launch of the next generation of Amazon SageMaker, data lineage is available as one of the catalog capabilities in Amazon SageMaker Unified Studio (preview). Data users can set up lineage using connections, and that configuration automates the capture of lineage in the platform for all users to browse and understand the data. Here’s how data lineage looks in the next generation of Amazon SageMaker.

Now available
You can begin using this capability to gain deeper insights into your data ecosystem and drive more informed, data-driven decision-making.

Data lineage is generally available in all AWS Regions where Amazon DataZone is available. For a list of Regions where Amazon DataZone domains can be provisioned, visit AWS Services by Region.

Data lineage costs are dependent on storage usage and API requests, which are already included in the Amazon DataZone pricing model. For more details, visit Amazon DataZone pricing.

To get started with data lineage in Amazon DataZone, visit the Amazon DataZone User Guide.

— Esra
Esra Kayabali


Esra Kayabali is a Senior Solutions Architect at AWS, specializing in analytics, including data warehousing, data lakes, big data analytics, batch and real-time data streaming, and data integration. She has more than ten years of software development and solution architecture experience. She is passionate about collaborative learning, knowledge sharing, and guiding the community in their cloud technology journey.