How Amazon is moving to integrate catalogs to improve data discovery with Amazon SageMaker

Enterprises face challenges when teams create data assets outside of central data catalogs. This adds overhead for discovery and limits collaboration. Amazon’s Business Data Technologies (BDT) team has built an enterprise data catalog (Andes) for sharing datasets under well-defined policies. However, teams also created catalogs of local datasets and other non-tabular assets, such as dashboards and metrics, outside Andes. This made it difficult to discover all assets in a consolidated way.

In this post, we share how Amazon.com is working to integrate catalogs by extending its enterprise data catalog, Andes, with Amazon SageMaker.

Need for expanding catalog and governance from datasets to data assets

Without a single solution, users had to search multiple catalogs depending upon the asset type. Teams spent considerable time indexing the different catalogs and identifying the right one for their task. This slowed them down and took time away from solving the business problems.

To address these challenges, the BDT team identified four critical capabilities:

Multimodal catalog – Data consumers required the ability to blend enterprise data with local datasets and use them together for specific use cases. Teams sought to discover not only datasets, but also assets such as metrics, dashboards, and business files, to obtain a complete view of available resources. This necessitated a catalog that consolidates datasets and data assets in one location.
Uniform governance and enforcement – To maintain best data protection practices and support business goals, teams needed consistent enterprise-wide data governance in which they request access once and the system enforces that access uniformly across all compute engines, avoiding fragmented or redundant access management. Internal systems also needed trusted identity propagation, so that user identity is preserved across AWS and internal systems for consistent enforcement.
Multi-approval workflows – The solution needed to support multiple approval workflows within a single system, using Andes for dataset approvals and a custom workflow for dashboard approvals, to maintain governance and visibility across data assets.
Delegated ownership – While enterprise teams retain overarching governance responsibility, business-specific data stewards required the ability to modify select attributes and apply appropriate tags to assets produced by their respective producers and consumers.
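The multi-approval requirement can be pictured as a small routing table from asset type to approval workflow. The following is a minimal sketch only: the asset type names, the `ApprovalWorkflow` enum, and the `route_approval` function are hypothetical and do not reflect Andes or SageMaker APIs.

```python
from dataclasses import dataclass
from enum import Enum


class ApprovalWorkflow(Enum):
    ANDES = "andes"    # dataset approvals stay with Andes
    CUSTOM = "custom"  # for example, dashboard approvals


# Hypothetical routing table; the asset type names are assumptions.
WORKFLOW_BY_ASSET_TYPE = {
    "dataset": ApprovalWorkflow.ANDES,
    "dashboard": ApprovalWorkflow.CUSTOM,
}


@dataclass(frozen=True)
class AccessRequest:
    requester: str
    asset_id: str
    asset_type: str


def route_approval(request: AccessRequest) -> ApprovalWorkflow:
    """Pick the approval workflow for a request based on its asset type."""
    try:
        return WORKFLOW_BY_ASSET_TYPE[request.asset_type]
    except KeyError:
        raise ValueError(
            f"No approval workflow registered for {request.asset_type!r}"
        ) from None
```

Keeping the routing in one table means a new asset type can be given a workflow without changing how requests are submitted, which is one way to keep several workflows inside a single system.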

Solution: Unify datasets and data assets with Amazon SageMaker

Amazon chose to extend Andes with Amazon SageMaker to enhance the discovery experience. SageMaker offers native support for multimodal catalogs and integrates with enterprise identity management, making it the ideal foundation for extending Andes’ governance model.

Rather than broadcasting assets across multiple domains, a single enterprise-wide domain standardizes and synchronizes data assets in one place. This domain is associated with AWS IAM Identity Center, which is connected to Amazon’s corporate identity system to maintain best data protection practices by limiting direct permissions and using corporate identity and group-based permissions.

Architecture diagram showing how Amazon SageMaker integrates with enterprise data catalog Andes and AWS IAM Identity Center

This integrated architecture directly addresses the identified challenges:

Single-pane asset discovery – Datasets and data assets are accessible through a single, consolidated view, avoiding the need to navigate across disparate systems or domains. This simplifies discovery and reduces the time to insight for teams across the organization.
Extended governance – Governance of both enterprise-wide and local datasets is orchestrated through a single system.
Extended observability – Trusted Identity Propagation (TIP) through AWS IAM Identity Center allows human users to access data interactively using their corporate identities. This provides audit-trail visibility into who is accessing what data, supporting audits and the organization’s observability requirements.
Amazon tool integration – Integration with Git and other internal systems automates management of accounts, permissions, and approvals. This reduces manual overhead and helps keep access controls tightly aligned with existing business workflows.

Design overview

This section describes the key features and design of the Amazon SageMaker integration. The technical implementation consists of three core components:

1) Catalog connectors

Amazon built connectors and ingestion paths to bring data assets into Amazon SageMaker while maintaining business continuity and preserving existing governance:

Andes integration: SageMaker provides APIs to synchronize assets from external catalogs. BDT extended this to bring Andes datasets, with their rich metadata and business context, into the integrated experience. The integration preserves Andes’ permission model and governance workflows, keeping existing security standards and best practices intact.
Account onboarding: Teams onboard their own AWS accounts through a self-service, AWS Lambda-based integration. When creating projects, SageMaker queries this service to determine which accounts a user’s identity can access.
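To make the account lookup concrete, here is a minimal sketch of what a Lambda-style handler for it could look like. The post only states that the integration is AWS Lambda-based and that SageMaker queries it; the event shape (`user`, `groups`), the in-memory `ONBOARDED_ACCOUNTS` registry, the account IDs, and the response format are all assumptions made for illustration.

```python
import json

# Hypothetical registry: corporate group -> onboarded AWS account IDs.
# A real integration would presumably read this from a durable store.
ONBOARDED_ACCOUNTS = {
    "team-analytics": ["111111111111", "222222222222"],
    "team-ml": ["333333333333"],
}


def lambda_handler(event, context):
    """Return the onboarded AWS accounts reachable through the caller's groups.

    The event fields ("user", "groups") are assumed, not a documented contract.
    """
    groups = event.get("groups", [])
    # Collect accounts across all groups, dropping duplicates.
    accounts = sorted(
        {acct for group in groups for acct in ONBOARDED_ACCOUNTS.get(group, [])}
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"user": event.get("user"), "accounts": accounts}),
    }
```

Resolving accounts from group membership rather than from per-user entries matches the group-based permission approach described earlier, although the actual lookup logic may differ.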

2) Delegated ownership

When data systems scale across business units, centralized governance teams need to delegate permissions for catalog enrichment, policy enforcement, and metadata management.

Catalog enhancement allows business teams to define and publish their own business glossaries, curated vocabularies of domain-specific terms, definitions, and relationships, directly within the catalog. Allowing business owners to author and maintain these glossaries increased accuracy and discoverability of catalog assets. Data consumers across the enterprise benefit from clearer, more consistent terminology.
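Delegated ownership can be thought of as a guarded edit: a steward may change only a limited set of attributes, and only on assets in their own business unit. The sketch below illustrates that idea under stated assumptions; the `Asset` class, the `STEWARD_EDITABLE` set, and `apply_steward_edit` are hypothetical names, not part of Andes or SageMaker.

```python
from dataclasses import dataclass, field

# Attributes a business steward may edit; this set is an assumption.
STEWARD_EDITABLE = {"description", "glossary_terms", "tags"}


@dataclass
class Asset:
    asset_id: str
    business_unit: str
    attributes: dict = field(default_factory=dict)


def apply_steward_edit(asset: Asset, steward_unit: str, attribute: str, value) -> Asset:
    """Apply an edit if the steward owns the asset's unit and the attribute is delegated."""
    if steward_unit != asset.business_unit:
        raise PermissionError(
            f"Steward of {steward_unit!r} cannot edit assets of {asset.business_unit!r}"
        )
    if attribute not in STEWARD_EDITABLE:
        raise PermissionError(f"Attribute {attribute!r} is reserved for central governance")
    asset.attributes[attribute] = value
    return asset
```

For example, a steward for a retail business unit could attach glossary terms to a retail dashboard, while an attempt to change an attribute outside the delegated set would be rejected and left to the central governance team.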

3) Integration with consumption and access tooling

Teams discover data in SageMaker Unified Studio and consume it through both SageMaker Unified Studio and internal tooling:

Data discovery: SageMaker Unified Studio integrates with Amazon-wide Identity Center, allowing almost all Amazon users to authenticate and search for cataloged assets. This integration addresses the data discovery problem by providing enterprise-wide visibility into available data resources.
Integrated development environment: SageMaker Unified Studio provides built-in tooling, including a Query Editor for SQL analytics and Amazon SageMaker AI for machine learning (ML), which helps teams access data, build models, and collaborate across organizational boundaries.
Code repository integration: SageMaker Unified Studio supports full Git operations for managing code. Query code and notebook code persist to GitFarm (Amazon’s internal Git system), allowing teams to view and manage their work through Amazon’s standard version control system.
Native analytics integration: Projects directly connect to AWS analytics engines including Amazon Athena for SQL, AWS Glue and Amazon EMR for Apache Spark, and Amazon Redshift for data warehousing. User-authored jobs use Andes governance and permissions across engines for consistent access control.
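The "request access once, enforce everywhere" idea behind consistent access control can be sketched as a single grant store that every engine consults. This is an illustrative model only: `GrantStore`, its methods, and the engine labels are assumptions and do not describe how Andes actually enforces permissions.

```python
from dataclasses import dataclass

# Engine labels taken from the list above; the identifiers are assumptions.
ENGINES = ("athena", "glue", "emr", "redshift")


@dataclass(frozen=True)
class Grant:
    principal: str
    dataset: str


class GrantStore:
    """Single source of truth for approved access, shared by all engines."""

    def __init__(self) -> None:
        self._grants: set[Grant] = set()

    def approve(self, principal: str, dataset: str) -> None:
        # One approval, recorded once, independent of any engine.
        self._grants.add(Grant(principal, dataset))

    def is_allowed(self, principal: str, dataset: str, engine: str) -> bool:
        if engine not in ENGINES:
            raise ValueError(f"Unknown engine {engine!r}")
        # The decision does not vary by engine, so enforcement is uniform.
        return Grant(principal, dataset) in self._grants
```

Because the grant is keyed on principal and dataset rather than on engine, a user approved for a dataset gets the same answer whether the job runs on SQL or Spark.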

SageMaker implementation results

The SageMaker catalog now encompasses various types of data assets from across the organization, expanding from datasets alone to a complete inventory of data, dashboards, metrics, models, and other data assets, while maintaining best practices and appropriate access and use guardrails.

“SageMaker provides a unified catalog that makes discovery and sharing of data assets, metrics and dashboards across teams straightforward, with direct integration to Andes datasets. SageMaker delivers deep integration through Git repository connections and enterprise identity management that aligns with existing Amazon workflows.”

– Gerry Moses, Sr. Principal TPM, Amazon

Faster data discovery – Data consumers can go to one place to locate trusted, high-quality assets with significantly less friction, which reduces the time from question to insight. By surfacing well-documented, governed assets through an enriched catalog, teams can confidently identify the right data for their use cases without navigating sprawling, inconsistent inventories or relying on tribal knowledge.
Improved collaboration – The unified catalog breaks down data silos by making curated assets discoverable and reusable across Amazon. When teams can build on shared, authoritative datasets rather than creating redundant copies, data proliferation is reduced.

Conclusion

By integrating their existing governance tooling with Amazon SageMaker to build a centralized data catalog, BDT is creating a foundation for faster, more efficient data discovery across teams. Amazon SageMaker helped unify diverse data types with their existing catalog and enabled collaboration across teams to help them find the right data. By integrating with existing governance frameworks, BDT demonstrates how organizations can expand their catalog capabilities while preserving existing enterprise investments.

To learn more and get started with Amazon SageMaker Unified Studio, visit aws.amazon.com/sagemaker/unified-studio or the AWS console.

AWS Big Data Blog
