AWS Big Data Blog

Automate data discovery and centralized management with AWS Glue Data Catalog

Managing sensitive data across sprawling data environments is hard. In this post, we show you how to tackle data discovery, classification, and governance across your databases, data warehouses, and object storage to regain visibility and control over your data landscape. As you build new features, products, and services, your data naturally spreads across multiple systems to meet immediate application and business needs. Different teams spin up their own data stores, and before long, you’re dealing with a complex web of repositories—often with limited visibility into what exists where. This data sprawl becomes most challenging when you must understand and protect your sensitive data. Security teams often struggle to maintain accurate inventories of data categorization and classification. Stakeholders demand comprehensive insights into data classification and processing activities, usually on tight deadlines, and keeping up-to-date data inventories becomes increasingly daunting as your data grows. Without automation, you’re left with manual processes that stretch over weeks, leave room for human error, and create unnecessary business risk.

The need for automation

In a typical manual scenario, creating a new database triggers a chain of time-consuming events. The governance team reviews the new data source, documents its contents, and scans for sensitive data. The security team assesses its configuration and access controls. Days or weeks pass before you fully understand this new asset’s sensitivity.

With automation, creating a new database triggers immediate action. The system detects the new source, catalogs its structure, identifies sensitive data, and updates a central inventory within minutes, supporting proper governance from the moment you create it. Here’s how it works on AWS: When you create an Amazon Simple Storage Service (Amazon S3) bucket for customer orders, you add tags such as Business Function, Data Owner, and Purpose. After the bucket is in use, the system detects it, creates catalog entries, analyzes data patterns, identifies sensitive information, and updates governance records without additional input from you. This gives your organization real-time visibility. Security teams instantly see which repositories contain sensitive information. Governance teams generate up-to-date inventory reports on demand, and data teams immediately understand sensitivity levels, helping them use data responsibly.

Solution overview

The solution uses key AWS services across three layers that work together for comprehensive data visibility and categorization.

Detection Layer: Continuously monitors your AWS environment for new resource creation. When you provision an Amazon S3 bucket, Amazon Relational Database Service (Amazon RDS) database, or Amazon DynamoDB table, Amazon EventBridge rules capture this activity and initiates the governance workflow, so no data source goes unnoticed.

Architecture: EventBridge triggers Lambda and SQS to create Glue crawlers and ETL jobs for new S3 data sources
Figure 1 Automated data source discovery (S3 example) workflow using EventBridge Rules and Lambda functions

Processing Layer: After a new source is detected, AWS Glue crawlers analyze its schema while specialized jobs scan for sensitive data patterns. The system also extracts metadata from resource tags, enriching your understanding of each repository’s purpose and ownership.

Architecture: Glue PII Detection jobs scan S3, DynamoDB, and Aurora; Lambda updates Glue Data Catalog
Figure 2 PII detection and processing workflow using AWS Glue jobs and DynamoDB staging

Management Layer: Maintains a central source of truth about your data assets. AWS Glue Data Catalog provides a unified view across your organization, tracking schema changes and sensitivity levels. This layer also manages the processing workflow state and generates insights for stakeholders.

Architecture: Lambda extracts S3 bucket metadata via EventBridge, stores in DynamoDB, updates Glue Data Catalog
Figure 3 Tag-based metadata capture and Data Catalog update workflow

Setting up the solution

This solution uses AWS Cloud Development Kit (AWS CDK) for deployment, organized into four stacks that build upon each other.PrerequisitesBefore deployment, verify that you have:

  • Access to an AWS account with permissions to create resources in Amazon S3, AWS Lambda, Amazon DynamoDB, AWS Glue, and Amazon EventBridge
  • Node.js (version 18 or later) and npm installed
  • Access to a terminal to run AWS CDK CLI commands
  • Basic familiarity with AWS Console navigation

Step 1: Infrastructure deployment

Deploy four stacks using AWS CDK. Each establishes components for data discovery, cataloging, and PII detection.

  1. BaseInfraStack: Deploys core infrastructure—Amazon Virtual Private Cloud (Amazon VPC), DynamoDB tables for state management, EventBridge rules for monitoring, and Lambda functions for orchestration.
  2. GlueAssetsStack: Sets up S3 buckets for AWS Glue ETL scripts and deploys PySpark code for PII detection.
  3. GlueJobCreationStack: Creates Data Catalog databases and deploys Lambda functions that automate the creation of AWS Glue crawlers and PII detection jobs for newly discovered data sources.
  4. ReportingStack: Deploys Lambda functions that process PII detection results and tag metadata, updating the Data Catalog accordingly.

To deploy these stacks, you will use the AWS CDK CLI, running the following commands:

# Clone and prepare repository
git clone https://github.com/aws-samples/automated-datastore-discovery-with-aws-glue.git
cd automated-datastore-discovery-with-aws-glue
npm install
npx cdk bootstrap

# Deploy infrastructure stacks sequentially
npx cdk deploy BaseInfraStack
npx cdk deploy GlueAssetsStack
npx cdk deploy GlueJobCreationStack
npx cdk deploy ReportingStack
CloudFormation Stacks console showing four stacks with CREATE_COMPLETE status including BaseInfraStack

Figure 4 CloudFormation console showing successful stack deployment

Step 2: Verify initial setup

In the AWS Management Console, open DynamoDB and find the glueJobTracker table. This table is a critical component of the framework:

  • Purpose: Central state management – tracks processing states and configurations for discovered data sources.
  • Current state: The table should be empty because no discovery processes have been triggered yet.
  • Structure: Tracks states such as Data Catalog entry creation and PII detection job setup for each data source.

By verifying this table, you confirm that the infrastructure is ready to begin tracking new data sources.

DynamoDB glueJobTracker table scan returning zero items, showing empty table before pipeline execution

Figure 5 Empty DynamoDB glueJobTracker table before execution

Solution in action

This solution runs automatically in production through EventBridge triggers and scheduled AWS Glue crawlers. The following walkthrough executes each step manually so you can observe the workflow.You follow the journey of a newly created S3 bucket containing sensitive data, seeing how the solution discovers, catalog, and processes it through each stage.

Step 3: Create a new S3 bucket

  1. Open the Amazon S3 console.
  2. Choose Create bucket.
  3. Enter a unique name for your bucket (for example, demo-customer-data-20250819).
  4. In the Tags section, add the following tags:
    1. Key: gdpr-scan, Value: true
    2. Key: Business Function, Value: Sales – US
    3. Key: Data Classification, Value: Confidential
  5. Keep other settings as default and choose Create bucket.
S3 bucket properties with versioning disabled and data classification tags

Figure 6 S3 console showing new bucket creation with tags

Step 4: Upload sample data

  1. In the S3 console, open your newly created bucket.
  2. Choose Upload.
  3. Create a new file named customer_orders.csv with the below content.
  4. Upload this file to a folder named orders/ in your bucket.
order_id,customer_id,email,ssn
ORD001,CUST1001,john.doe@example.com,***-**-****
ORD002,CUST1002,john.doe2@example.com,***-**-****

S3 upload succeeded showing customer_orders.csv uploaded

Figure 7: S3 console showing uploaded CSV file in the orders folder

Step 5: Verify automated detection

  1. Open the DynamoDB console.
  2. Navigate to the glueJobTracker table.
  3. Choose the Items tab.
  4. You should see a new item with an s3_location matching your bucket name.
40918.pngDynamoDB glueJobTracker table scan returning 1 item with S3 data source entry

Figure 8 DynamoDB console showing detected bucket entry in glueJobTracker table

Step 6: Initiate catalog creation

  1. Open the AWS Lambda console.
  2. Find the function with a name containing s3GlueCatalogCreator.
  3. Choose the function name to open its details.
  4. Choose the Test tab.
  5. Create a new test event with an empty JSON object {}.
  6. Choose Test to invoke the function.
  7. Check the execution result for a successful response.
Lambda function execution succeeded with logs showing Glue table, crawler, and DynamoDB item creation

Figure 9 Lambda console showing successful function execution

Step 7: Run the AWS Glue crawler

  1. Navigate to the AWS Glue console.
  2. In the left sidebar, choose Crawlers.
  3. Find the crawler with a name related to your S3 bucket.
  4. Select the crawler and choose Run crawler.
  5. Wait for the crawler to complete (typically 3–5 minutes).
AWS Glue crawler running

Figure 10 Glue console showing crawler in “Running” state

Step 8: Verify schema discovery

  1. In the AWS Glue console, go to Databases in the left sidebar.
  2. Choose the s3_source_db database.
  3. You should see a new table corresponding to your uploaded data.
  4. Choose the table name to view its schema.
Glue Data Catalog table Version 3 showing 4-column CSV schema with no sensitive data annotations yet

Figure 11 Glue console showing detected table schema

Step 9: Execute PII detection

  1. Return to the Lambda console.
  2. Find and open the function with a name containing s3GlueCreator.
  3. Use the Test tab to invoke this function with an empty JSON object {}.
  4. After successful execution, go to the AWS Glue console.
  5. Navigate to Jobs in the left sidebar.
  6. Find the newly created PII detection job (it should contain your bucket name).
  7. Select the job and choose Run job.
  8. Monitor the job execution in the Glue console.
AWS Glue PII detection job running with 10 DPUs, G.1X worker type, Glue version 4.0, showing run details

Figure 12 Glue console showing PII detection job in “Running” state

Step 10: Review PII detection results

  1. Open the DynamoDB console.
  2. Navigate to the piiDetectionOutputTable.
  3. In the Items tab, you should see new entries related to your data.
  4. These entries will show detected PII types and confidence scores.
DynamoDB piiDetectionOutputTable scan showing 2 PII detection results identifying USA_SSN and EMAIL types

Figure 13 DynamoDB console showing PII detection results in piiDetectionOutputTable

Step 11: Verify Data Catalog updates

  1. Open the AWS Lambda console.
  2. Find the function with a name containing ReportingStack-PIIReportS3.
  3. Choose the function name to open its details.
  4. Choose the Test tab.
  5. Create a new test event with an empty JSON object {}.
  6. Choose Test to invoke the function.
  7. Check the execution result for a successful response.
  8. Return to the AWS Glue console.
  9. Go to Databases > s3_source_db > Your table.
  10. Review the schema. PII columns should now have comments indicating their classification.
Glue Data Catalog table Version 7 with schema showing sensitive data element comments for EMAIL and USA_SSN

Figure 14 Glue console showing updated table schema with PII classifications

Note: While we focus on S3 data sources in this walkthrough, the framework extends to other data stores, offering a unified approach for PII detection and compliance management, so organizations can automatically discover, catalog, and monitor sensitive data elements across your entire data ecosystem. For more information, see aws-samples/automated-datastore-discovery-with-aws-glue.

Best practices and operational excellence

As you implement this solution, consider these key practices for effective results:

  • Design your tagging strategy to capture essential business context about each data source. Implement automated tag enforcement through AWS Organizations for consistency across teams.
  • Monitor automated workflows regularly and configure retention policies for processed data to manage costs.
  • For enhanced security, configure VPC endpoints for services such as Amazon S3, DynamoDB, and other data sources. This keeps traffic within the AWS network, which is especially important when processing sensitive data. Verify that server-side encryption (SSE) is enabled on your data stores. This solution uses AWS Key Management Service (AWS KMS) keys for DynamoDB tables and SSE-S3 for S3 buckets by default, aligning with data-at-rest encryption best practices.
  • For teams with multiple AWS accounts, implement cross-account discovery and cataloging to maintain a comprehensive view of your data landscape.
Multi-account EventBridge buses aggregate events to central account with Lambda and Glue Data Catalog

Figure 15 Centralized Storage of Glue PII Detection Results in AWS Data Catalog

Clean up

To avoid ongoing charges and remove the resources created by this solution, follow these steps:

  1. Empty and delete the S3 buckets created for sample data and AWS Glue assets.
  2. Delete the AWS CloudFormation stacks in reverse order of creation:
    1. ReportingStack
    2. GlueJobCreationStack
    3. GlueAssetsStack
    4. BaseInfraStack
  3. Manually delete any remaining resources:
    1. DynamoDB tables (glueJobTracker, piiDetectionOutput, tagCaptureTable)
    2. AWS Glue databases and crawlers
    3. Lambda functions
    4. EventBridge rules
  4. Review your AWS account to ensure that all related resources have been removed.

Remember, deleting these resources will remove all data and configurations associated with this solution. Make sure that you have saved any important information before proceeding with the clean-up.

Conclusion

In this post, you learned how to build an automated data governance framework using AWS Glue Data Catalog. You set up detection, processing, and management layers that automatically discover, catalog, and classify your data sources.This approach improves how you manage sensitive data assets. Teams spend less time on manual discovery and categorization, freeing them to derive value from data. The system gives you current insights into your data landscape and automatically identifies sensitive data, creating a trusted source of truth that helps teams work efficiently while maintaining controls.You can extend this framework with custom sensitivity patterns for your industry. Its modular design supports continuous improvement and integrates with existing workflows. This turns data governance from a manual burden into an efficient process that scales with your organization.


About the authors

Ramakrishna Natarajan

Ramakrishna Natarajan

Ramakrishna is a Senior Partner Solutions Architect at Amazon Web Services. He is based out of London and helps AWS Partners find optimal solutions on AWS for their customers. He specialises in Telecommunications OSS/BSS and has a keen interest in evolving domains such as Data Analytics, AI/ML, Security and Modernisation. He enjoys playing squash, going on long hikes and learning new languages.