AWS Big Data Blog

Analyzing your data catalog: Query SageMaker Catalog metadata with SQL

As your data and machine learning (ML) assets grow, tracking which assets lack documentation or monitoring asset registration trends becomes challenging without custom reporting infrastructure. You need visibility into your catalog’s health, without the overhead of managing ETL jobs. The metadata feature of Amazon SageMaker provides this capability to users. Converting catalog asset metadata into Apache Iceberg tables stored in Amazon S3 Tables removes the need to build and maintain custom ETL pipelines. Your team can then query asset metadata directly using standard SQL tools. You can now answer governance questions like asset registration trends, classification status, and metadata completeness using standard SQL queries through tools like Amazon Athena, Amazon SageMaker Unified Studio notebooks, and BIsystems.

This automated approach reduces ETL development time and gives your team visibility into catalog health, compliance gaps, and asset lifecycle patterns. The exported tables include technical metadata, business metadata, project ownership details, and timestamps, partitioned by snapshot date to enable time travel queries and historical analysis. Teams can use this capability to proactively monitor catalog health, identify gaps in documentation, track asset lifecycle patterns, and make sure that governance policies are consistently applied.

How metadata export works

After you enable the metadata export feature, it runs automatically on a daily schedule:

  1. SageMaker Catalog creates the infrastructure — An Amazon Simple Storage Service (Amazon S3) table bucket named aws-sagemaker-catalog is created with an asset_metadata namespace and an empty asset table.
  2. Daily snapshots are captured — A scheduled job runs once per day around midnight (local time per AWS Region) to export updated asset metadata.
  3. Metadata is structured and partitioned — The export captures technical metadata (resource_id, resource_type), business metadata (asset_name, business_description), project ownership details, and timestamps, partitioned by snapshot_date for query performance.
  4. Data becomes queryable — Within 24 hours, the asset table appears in Amazon SageMaker Unified Studio under the aws-sagemaker-catalog bucket and becomes accessible through Amazon Athena, Studio notebooks, or external BI tools.
  5. Teams query using standard SQL — Data teams can now answer questions like “How many assets were registered last month?” or “Which assets lack business descriptions?” without building custom ETL pipelines.

The export evaluates catalog assets and their metadata properties in the domain, converting them into Apache Iceberg table format. The data flows into downstream analytics operations immediately, with no separate ETL or batch processes to maintain. The exported metadata becomes part of a queryable data lake that supports time-travel queries and historical analysis.

In this post, we demonstrate how to use the metadata export capability in Amazon SageMaker Catalog and perform analytics on these tables. We explore the following specific use-cases.

  • Audit historical changes to investigate what an asset looked like at a specific point in time.
  • Monitor asset growth view how the data catalog has grown over the last 30 days.
  • Track metadata improvements to see which assets gained descriptions or ownership over time.

Solution overview

AWS Cloud architecture diagram showing data pipeline from Amazon SageMaker Catalog to Amazon S3 Tables with daily export, connecting to query engines including Amazon Athena, Amazon Redshift, and Apache Spark

Figure 1 – SageMaker catalog export to S3 Tables

The architecture consists of three key components:

  1. Amazon SageMaker Catalog exports asset metadata daily to Amazon S3.
  2. S3 Tables stores metadata as Apache Iceberg tables in the aws-sagemaker-catalog bucket with ACID compliance and time travel.
  3. Query engines (Amazon Athena, Amazon Redshift, and Apache Spark) access metadata using standard SQL from the asset_metadata.asset table.

What metadata is exposed?

SageMaker Catalog exports metadata in the asset_metadata.asset table:

Metadata Type Fields Description
Technical metadata resource_id, resource_type_enum, account_id, region Resource identifiers (ARN), types (GlueTable, RedshiftTable, S3Collection), and location
Namespace hierarchy catalog, namespace, resource_name Organizational structure for assets
Business metadata asset_name, business_description Human-readable names and descriptions
Ownership extended_metadata['owningEntityId'] Asset ownership information
Timestamps asset_created_time, asset_updated_time, snapshot_time Creation
Custom metadata extended_metadata['form-name.field-name'] User-defined metadata forms as key-value pairs

The snapshot_time column supports point-in-time analysis and query of historical catalog states.

Prerequisites

To follow along with this post, you must have the following:

For SageMaker Unified Studio domain setup instructions, refer to the SageMaker Unified Studio Getting started guide.

After you complete the prerequisites, complete the following steps.

  1. Add this policy to our IAM user or role to enable metadata export. If using SageMaker Unified Studio to query the catalog, add this policy to the AmazonSageMakerAdminIAMExecutionRole managed role.
{ "Version": "2012-10-17", 
"Statement": [ 
{
 "Effect": "Allow",
 "Action": [ "datazone:GetDataExportConfiguration",
 "datazone:PutDataExportConfiguration"
 ],
 "Resource": "*"
 },
 {
 "Effect": "Allow",
 "Action": [
 "s3tables:CreateTableBucket",
 "s3tables:PutTableBucketPolicy"
 ],
 "Resource": "arn:aws:s3tables:*:*:bucket/aws-sagemaker-catalog" 
} 
]
}
  1. Grant describe and select permissions for SageMaker Catalog with AWS Lake Formation. This step can be performed in the AWS Lake Formation console.
    1. Select Permissions -> Data permissions and choose Grant.

      AWS Lake Formation Grant Permissions interface showing principal type selection with IAM users and roles option selected and AmazonSageMakerAdminIAMExecutionRole assigned

      Figure 2 – AWS Lake Formation grant permission

    2. Under Principal type, select Principals, IAM users and roles and the AWS managed AmazonSageMakerAdminIAMExecutionRole execution role.
    3. Choose Named Data Catalog resources.
    4. Under Catalogs, search for and select <account-id>:s3tablecatalog/aws-sagemaker-catalog.
    5. Under Databases, select asset_metadata database.
      AWS Lake Formation Grant Permissions page showing Named Data Catalog resources method with s3tablescatalog/aws-sagemaker-catalog selected, asset_metadata database, and asset table configured

      Figure 3 – AWS Lake Formation catalog, database, and table

      AWS Lake Formation Grant Permissions interface showing table permissions with Select and Describe checked, grantable permissions section, and All data access radio button selected

      Figure 4 – AWS Lake Formation grant permission

    6. For Table, select asset.
    7. Under Table permissions, check Select and Describe.
    8. Choose Grant to save the permissions.

Enable data export using the AWS CLI

Configure metadata export using the PutDataExportConfiguration API. The Amazon DataZone service automatically creates an S3 table bucket named aws-sagemaker-catalog with an asset_metadata namespace, and schedules a daily export job. Asset metadata is exported once daily around midnight local time per AWS Region.

The SageMaker Domain identifier is available on domain detail page in the AWS Management Console. Accessing the asset table through the S3 Tables console or the Data tab in SageMaker Unified Studio can require up to 24 hours.

AWS CLI command to enable SageMaker catalog export:

aws datazone put-data-export-configuration --domain-identifier <domain-id> --region <region> --enable-export

Use this AWS CLI command to validate the configuration is enabled:

aws datazone get-data-export-configuration --domain-identifier <domain-id> --region <region>
{
    "isExportEnabled": true,
    "status": "COMPLETED",
    "s3TableBucketArn": "arn:aws:s3tables:<region>:<account-id>:bucket/aws-sagemaker-catalog",
    "createdAt": "2025-11-26T18:24:02.150000+00:00",
    "updatedAt": "2026-02-23T19:33:40.987000+00:00"
}

Access the exported asset table

  1. Navigate to Amazon SageMaker Domains in the AWS Management Console.
  2. Select your domain and select Open.

    Amazon SageMaker Domains management page showing an Identity Center based domain with Available status, created February 26, 2026, with Open unified studio button highlighted

    Figure 5 – Open Amazon SageMaker Unified Studio

  3. In SageMaker Unified Studio, choose a project from the Select a project dropdown list.
  4. To query SageMaker catalog data, select Build in the menu bar and then choose Query Editor. To create a new project, follow the instructions in the Amazon SageMaker Unified Studio User Guide.

    SageMaker Unified Studio project overview dashboard showing IDE and Applications, Data Analysis and Integration with Query Editor highlighted, Orchestration, and Machine Learning and Generative AI categories

    Figure 6 – Open SageMaker Unified Studio Query Editor

The asset_metadata.asset table is available in Data explorer. Use Data explorer to view the schema and query data to perform analytics from.

  1. Expand Catalogs in Data explorer. Then, select and expand s3tablecatalog, aws-sagemaker-catalog, asset_metadata, and asset.
  2. Test querying the catalog with SELECT * FROM asset_metadata.asset LIMIT 10;.
SageMaker Unified Studio Query Editor with Data Explorer showing Lakehouse hierarchy including s3tablescatalog, aws-sagemaker-catalog, asset_metadata database, and asset table schema with SQL SELECT query

Figure 7 – Query SageMaker catalog

Queries for observability and analytics

With setup complete, execute queries to gain insights on catalog usage and changes. To monitor asset growth, and view how the data catalog has grown over the last five days:

SELECT 
    DATE (snapshot_time) as date,
    COUNT (*) as total_assets
FROM asset_metadata.asset
WHERE 
     DATE (snapshot_time) >= CURRENT_DATE - INTERVAL '5' DAY
GROUP BY DATE (snapshot_time)
ORDER BY date DESC;
SageMaker Unified Studio Query Editor showing SQL aggregation query on asset_metadata.asset table with results displaying date and total_assets columns, returning 42 assets for March 7-8, 2026"

Figure 8 – Query asset growth

Use the catalog to track metadata changes to determine which assets gained descriptions or ownership over time. Use this query to identify assets that gained business descriptions over the past five days by comparing today’s snapshot with the earlier snapshot.

SELECT
    t.asset_id,
    t.resource_name,
    p.business_description as description_before,
    t.business_description as description_now
FROM asset_metadata.asset t
JOIN asset_metadata.asset p ON t.asset_id = p.asset_id
WHERE DATE(t.snapshot_time) = CURRENT_DATE
    AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '5' DAY
    AND p.business_description IS NULL
    AND t.business_description IS NOT NULL;

Investigate asset values at a specific point in time using this query to retrieve metadata from any snapshot date.

SELECT
     asset_id,
     resource_name,
     business_description,
     extended_metadata['owningEntityId'] as owner,
     snapshot_time
FROM asset_metadata.asset
WHERE asset_id = 'your-asset-id'
     AND DATE(snapshot_time) = DATE('2025-11-26');

Clean up resources

To avoid ongoing charges, clean up the resources created in this walkthrough:

  1. Disable metadata export:

Disable the daily metadata export to stop new snapshots:

aws datazone put-data-export-configuration \
  --domain-identifier <domain-id. \
  --no-enable-export \
  --region <region>
  1. Delete S3 Tables resources:

Optionally, delete the S3 Tables namespace containing the exported metadata to remove historical snapshots and stop storage charges. For instructions on how to delete S3 tables, see Deleting an Amazon S3 table in the Amazon Simple Storage Service User Guide.

Conclusion

In this post, you enabled the metadata export feature of SageMaker Catalog and used SQL queries to gain visibility into your asset inventory. The feature converts asset metadata into Apache Iceberg tables partitioned by snapshot date, so you can perform time-travel queries, monitor catalog growth, track metadata completeness, and audit historical asset states. This provides a repeatable, low-overhead way to maintain catalog health and meet governance requirements over time.

To learn more about Amazon SageMaker Catalog, see the Amazon SageMaker Catalog documentation. To explore Apache Iceberg table formats and time-travel queries, see the Amazon S3 Tables documentation.


About the Authors

Photo of Author Ramesh Singh

Ramesh is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon SageMaker team. He is passionate about building high-performance ML/AI and analytics products that help enterprise customers achieve their critical goals using cutting-edge technology.

Photo of Author Pradeep Misra

Pradeep is a Principal Analytics and Applied AI Solutions Architect at AWS. He is passionate about solving customer challenges using data, analytics, and Applied AI. Outside of work, he likes exploring new places and playing badminton with his family. He also likes doing science experiments, building LEGOs, and watching anime with his daughters.

Photo of Author - Rohith Kayathi

Rohith is a Senior Software Engineer at Amazon Web Services (AWS) working with Amazon SageMaker team. He leads business data catalog, generative AI–powered metadata curation, and lineage solutions. He is passionate about building large-scale distributed systems, solving complex problems, and setting the bar for engineering excellence for his team.

Photo of AUthor - Steve Phillips

Steve is a Principal Technical Account Manager and Analytics specialist at AWS in the North America region. Steve currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.