Amazon DataZone: Automate Data Discovery

Overview

Cut the time spent manually entering data attributes in the data catalog, a process that also introduces errors. Generate business context and recommended analyses for datasets to improve data discovery results. Understand where your data came from and which sources will be impacted by changes. Richer data in the business data catalog also improves the search experience. Reduce the time you spend searching for and using data from weeks to days.

Key features

The Amazon DataZone business data catalog acts as a federated organizational registry where technical metadata can be published as assets, and you can add enriched business context. You can make data visible with business context for all your users to find, understand, and trust data quickly and easily.

Automate adding business names and descriptions to data, so that you can understand its context without deciphering cryptic technical names. This automation is powered by large language models (LLMs) to increase accuracy and consistency.

Faceted search works on top of the business data catalog to help data consumers and producers find data assets using familiar structural information, such as table and column names, as well as business terms.
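As a rough illustration of narrowing a catalog search by facets, the sketch below assembles keyword arguments for the `search_listings` operation of the boto3 `datazone` client. The helper names, the domain ID, and the facet attribute names are hypothetical; check which filter attributes your domain supports before relying on them.

```python
from typing import Any


def build_listing_search(domain_id: str, text: str, facets: dict[str, str]) -> dict[str, Any]:
    """Build search arguments: free text plus zero or more facet filters."""
    request: dict[str, Any] = {
        "domainIdentifier": domain_id,
        "searchText": text,
        "maxResults": 25,
    }
    # One filter clause per facet; sorted so the request is deterministic.
    clauses = [
        {"filter": {"attribute": name, "value": value}}
        for name, value in sorted(facets.items())
    ]
    if len(clauses) == 1:
        request["filters"] = clauses[0]
    elif clauses:
        # Several facets must all match, so combine them with "and".
        request["filters"] = {"and": clauses}
    return request


def run_search(request: dict[str, Any]) -> list[dict[str, Any]]:
    """Send the request. Needs boto3 and AWS credentials; not exercised here."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("datazone")
    return client.search_listings(**request).get("items", [])
```

Keeping the request builder separate from the network call makes the facet logic easy to check without an AWS account.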

For each dataset, generate a list of the most valuable columns and their likely analytics uses.

With data quality statistics in Amazon DataZone, data consumers can see data quality metrics from AWS Glue Data Quality or third-party systems. Data consumers can trust the data sources they use for decisions, and have data quality context as they search for assets. Producers and IT teams can also use APIs to incorporate data quality statistics from third-party systems into a unified, out-of-console portal. Data producers can bring in AWS Glue Data Quality results on a schedule to make sure that the scores stay current, even as the data continues to change.
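A minimal sketch of pushing third-party quality results through the API, assuming the boto3 `datazone` client's `post_time_series_data_points` operation and the `amazon.datazone.DataQualityResultFormType` form type. The helper names, ruleset name, and sample evaluations are made up for illustration, and the content fields should be verified against the current form type definition.

```python
import json
from datetime import datetime, timezone
from typing import Any


def build_quality_form(
    ruleset_name: str,
    evaluations: list[dict[str, Any]],
    observed_at: datetime,
) -> dict[str, Any]:
    """Wrap rule evaluations from an external tool in a data quality form."""
    passed = sum(1 for e in evaluations if e["status"] == "PASS")
    total = len(evaluations)
    content = {
        "evaluations": evaluations,
        "evaluationsCount": total,
        # Share of passing rules as a percentage; 0.0 when there are no rules.
        "passingPercentage": round(100.0 * passed / total, 1) if total else 0.0,
    }
    return {
        "formName": ruleset_name,
        "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
        "timestamp": observed_at,
        "content": json.dumps(content),  # the API expects content as a JSON string
    }


def publish_quality(domain_id: str, asset_id: str, form: dict[str, Any]) -> None:
    """Attach the form to an asset. Needs boto3 and credentials; not run here."""
    import boto3

    boto3.client("datazone").post_time_series_data_points(
        domainIdentifier=domain_id,
        entityIdentifier=asset_id,
        entityType="ASSET",
        forms=[form],
    )


sample_form = build_quality_form(
    "nightly_checks",  # hypothetical ruleset name
    [
        {"types": ["Completeness"], "applicableFields": ["order_id"], "status": "PASS"},
        {"types": ["Uniqueness"], "applicableFields": ["order_id"], "status": "PASS"},
        {"types": ["MaximumLength"], "applicableFields": ["country"], "status": "FAIL"},
        {"types": ["Completeness"], "applicableFields": ["country"], "status": "PASS"},
    ],
    datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Running a function like `publish_quality` on a schedule is one way to keep scores current as the data changes.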

Understand the movement of data over time. Data lineage can raise trust and an organization’s data literacy by helping data consumers understand where data came from, how it changed, and how it is consumed. You can reduce the time spent mapping a data asset and its relationships, troubleshooting and developing pipelines, and asserting data governance practices.
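To make the idea of impact analysis concrete, here is a small sketch that is not DataZone's implementation. It assumes lineage has been captured as OpenLineage-style run events, each listing `inputs` and `outputs` datasets with a `namespace` and `name`, and it walks the resulting graph to find everything downstream of a changed dataset. The sample events and function names are invented for illustration.

```python
from collections import defaultdict, deque
from typing import Any


def dataset_key(dataset: dict[str, str]) -> str:
    """Identify a dataset by namespace and name."""
    return f'{dataset["namespace"]}/{dataset["name"]}'


def build_downstream_index(events: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Map each input dataset to the datasets produced directly from it."""
    downstream: dict[str, set[str]] = defaultdict(set)
    for event in events:
        outputs = [dataset_key(d) for d in event.get("outputs", [])]
        for source in event.get("inputs", []):
            downstream[dataset_key(source)].update(outputs)
    return downstream


def impacted_by(changed: str, downstream: dict[str, set[str]]) -> set[str]:
    """Breadth-first walk collecting every dataset downstream of `changed`."""
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for target in downstream.get(current, ()):
            if target != changed and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical events: raw_orders -> clean_orders -> sales_report
events = [
    {
        "inputs": [{"namespace": "example", "name": "raw_orders"}],
        "outputs": [{"namespace": "example", "name": "clean_orders"}],
    },
    {
        "inputs": [{"namespace": "example", "name": "clean_orders"}],
        "outputs": [{"namespace": "example", "name": "sales_report"}],
    },
]
index = build_downstream_index(events)
affected = impacted_by("example/raw_orders", index)
```

Reversing the edges in the index would answer the opposite question, namely where a given dataset came from.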

Group data assets into defined packages (data products) tailored for specific business use cases to streamline cataloging and enable data consumers to easily discover and subscribe to the data. Data producers can curate a collection of relevant assets, add business context, and publish it as a single data product. This makes it simpler for data consumers to locate all the data assets they need for a particular use case. Consumers can subscribe to all assets within a data product through a single approval workflow. Data producers can manage the product's lifecycle, including editing the asset collection, unpublishing or deleting the product, and maintaining subscriptions. Amazon DataZone also offers API support for data product workflows, facilitating integration and automation.
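As one possible way to script this workflow, the sketch below builds arguments for the boto3 `datazone` client's `create_data_product` operation. The helper names and all identifiers are placeholders, and the parameter shapes reflect our reading of the API, so confirm them against the current reference before use.

```python
from typing import Any


def build_data_product_request(
    domain_id: str,
    project_id: str,
    name: str,
    description: str,
    asset_ids: list[str],
) -> dict[str, Any]:
    """Bundle existing catalog assets into one data product definition."""
    if not asset_ids:
        raise ValueError("a data product needs at least one asset")
    return {
        "domainIdentifier": domain_id,
        "owningProjectIdentifier": project_id,
        "name": name,
        "description": description,
        # Each item points at an asset that is already in the catalog.
        "items": [{"identifier": a, "itemType": "ASSET"} for a in asset_ids],
    }


def create_data_product(request: dict[str, Any]) -> dict[str, Any]:
    """Create the product. Needs boto3 and credentials; not exercised here."""
    import boto3

    return boto3.client("datazone").create_data_product(**request)


product_request = build_data_product_request(
    "dzd_example",       # placeholder domain ID
    "project_example",   # placeholder owning project ID
    "Sales analytics",
    "Orders and customer tables curated for sales reporting",
    ["asset_orders", "asset_customers"],
)
```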

Use cases

Videos

AWS re:Invent 2023 - How to build a business catalog with Amazon DataZone (21:37)
AWS re:Invent 2023 - Understand your data with business context (55:40)

FAQs

What kind of information is in the Amazon DataZone business data catalog?

In the Amazon DataZone business data catalog, business metadata provides information authored or used by business people and gives context to organizational data. This could include the following information:

  • Ownership: Modern data-centric organizations employ a distributed data stewardship process where lines of business (LOBs) are responsible for managing their own data. A catalog tracks that ownership so interested parties can find and request access to data as part of their business tasks.
  • Classification: Data discovery is a key task that business metadata can support. Data discovery uses centrally defined corporate ontologies and taxonomies to classify data sources and helps you find relevant data objects.
  • Relationships: You can use the Amazon DataZone business data catalog to add relationship information as metadata. As with a technical dataset schema, the business data catalog shows relationships between objects in the catalog, such as those between databases, datasets, and their columns.
  • Schema: AI recommendations can use the technical and business schema to generate recommended descriptions and usage for data.
  • Origin and consumption: Data lineage and impact analysis, as well as custom mappings from OpenLineage, are linked in the business data catalog.

What can I catalog with Amazon DataZone?

Amazon DataZone supports data assets published directly from the AWS Glue Data Catalog and Amazon Redshift. These two sources can be used to catalog data in the following locations:

  • Amazon Simple Storage Service (Amazon S3) data lakes
  • Many AWS purpose-built databases, such as Amazon Relational Database Service (Amazon RDS), through an AWS Glue crawler
  • More than 100 Amazon AppFlow connectors, which bring in data from third-party applications such as Snowflake, Salesforce, and Google Analytics