AWS Big Data Blog

Improve DynamoDB analytics with AWS Glue zero-ETL schema and partition controls

You store transactional data in Amazon DynamoDB and get single-digit millisecond performance. However, when you want to run analytics, machine learning (ML), or reporting on that same data, you face a gap: your flexible, semi-structured DynamoDB schemas don’t align with the flat, columnar formats that analytics engines require. Bridging this gap typically means building and maintaining custom ETL pipelines, which adds development cost and operational overhead.

AWS Glue Zero-ETL integration removes that pipeline work. It replicates your DynamoDB tables to Apache Iceberg tables in Amazon Simple Storage Service (Amazon S3), which you can then query directly with Amazon Athena. During setup, you can configure two capabilities that shape how the replicated data looks and performs: schema unnesting flattens nested attributes into individual columns, and data partitioning organizes data so your queries scan only what they need.

In this post, you learn how to replicate Amazon DynamoDB data to Apache Iceberg tables in Amazon S3 through a zero-ETL integration. We walk through the challenges that the DynamoDB nested, schema-flexible data model introduces for analytics workloads, and show you how to configure schema unnesting and data partitioning for a sample product catalog table. We also cover how to query the replicated data in Amazon Athena using standard SQL.

Semi-structured data meets analytics

Your product catalog in DynamoDB contains items with nested attributes like product details, pricing tiers, and inventory information. A typical item looks like this:

{
  "product_id": "P-1001",
  "name": "Wireless Headphones",
  "productdetails": {
    "brand": "AudioTech",
    "category": "Electronics",
    "weight_kg": 0.25,
    "specification": {
       "color": "Black",
       "storage": "128GB"
    }
  },
  "pricing": {
    "list_price": 79.99,
    "discount_pct": 10
  },
  "created_at": 1701388800000
}

This structure supports fast transactional reads and writes. However, when you replicate this data for analytics, you face two decisions:

  • You must decide whether to flatten nested maps like productdetails into individual columns or preserve them as-is.
  • You must choose how to organize the data on disk so that queries filtering by brand or date range scan only relevant partitions.

With AWS Glue Zero-ETL, you address both decisions through configurable schema unnesting and data partitioning.

Solution overview

You replicate data from your DynamoDB table through AWS Glue Zero-ETL into Apache Iceberg tables stored in Amazon S3, then query the results with Amazon Athena. The following diagram illustrates the end-to-end architecture:

Data flow diagram showing AWS data pipeline: DynamoDB source table → AWS Glue zero-ETL integration → Apache Iceberg on Amazon S3 → Amazon Athena analytics query.

AWS Glue zero-ETL ingests data from Amazon DynamoDB, writes it in Apache Iceberg format to your Amazon S3 data lake, and makes it available for SQL queries in Amazon Athena—with no pipelines to build or maintain. With this integration, you:

  • Save development time by skipping custom code and ETL job management
  • Keep DynamoDB performance intact because replication doesn’t consume your table’s provisioned read/write capacity
  • Get data within 15 minutes of changes in the source table
  • Query with standard tools because data lands in Apache Iceberg format, an open table format that AWS natively supports for high-performance analytics

During setup, you configure two output settings:

  1. Schema unnesting in Zero-ETL: You choose how nested attributes appear in the target. Flattening nested maps into individual columns streamlines your queries and reduces complexity.
  2. Data partitioning in Zero-ETL: You choose how data is organized into partitions. When you filter on a partition column, the query engine reads only matching data instead of scanning everything, cutting both query time and cost.

Schema unnesting

Schema unnesting transforms complex, nested DynamoDB structures into formats that analytics engines can query directly, eliminating the need for post-processing transformations. When you create a zero-ETL integration, you can choose one of three unnesting options.

Each option changes how nested DynamoDB attributes appear in the target table. The right choice depends on your analytics tools and how consistent your DynamoDB schemas are.

Option 1: No unnesting

This option preserves the original nested structure. DynamoDB maps and lists remain as structured columns in the target.

Using the product example, the target table contains a product_id column that holds the DynamoDB partition key and a value column that holds the full item as a nested structure.

Recommended for: Workloads where your analytics tools natively support querying nested data and you want to preserve the DynamoDB structure unchanged.

Option 2: Unnest one level

This option flattens top-level maps into individual columns. Lists remain nested.

With this option, productdetails and pricing each become separate columns.

Recommended for: Scenarios where your DynamoDB items have a consistent schema and you want to balance structure preservation with query simplicity.

Option 3: Unnest all levels (default)

This option recursively flattens nested structures using dot notation and produces the flattest schema.

For the product table, this creates columns such as productdetails.brand, productdetails.category, productdetails.specification.color, productdetails.specification.storage, pricing.list_price, and pricing.discount_pct. Each column is directly queryable without nested access patterns.

Recommended for: Analytics tools that prefer flat schemas when your DynamoDB items have a reasonably consistent structure. Note that deeply nested or highly variable schemas can produce very wide tables.
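To make the transformation concrete, here is a minimal Python sketch (not AWS Glue's actual implementation) of how "unnest all levels" maps a nested item to flat, dot-notation columns:

```python
def flatten(item, prefix=""):
    """Recursively flatten nested dicts into dot-notation column names."""
    columns = {}
    for key, value in item.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            columns.update(flatten(value, name))  # recurse into nested maps
        else:
            columns[name] = value  # scalars (and lists) become columns
    return columns

# Sample product item from the post
product = {
    "product_id": "P-1001",
    "productdetails": {
        "brand": "AudioTech",
        "specification": {"color": "Black", "storage": "128GB"},
    },
    "pricing": {"list_price": 79.99, "discount_pct": 10},
}

print(sorted(flatten(product)))
```

Running this produces the same dot-notation column names shown above, such as productdetails.specification.color and pricing.list_price.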

Data partitioning

You can speed up your queries and reduce costs by partitioning your replicated data. Partitioning divides data into logical segments on disk.

When you include a filter on a partition column in your query, the query engine skips irrelevant segments entirely. This behavior is called partition pruning: instead of scanning the entire dataset, the engine reads only the data that matches your filter conditions. For large tables, partition pruning can reduce both query runtime and cost significantly.
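A toy Python sketch illustrates the idea (the real engine works on Iceberg file metadata, not in-memory dicts): rows are grouped into one partition per brand, and a filter on the partition column reads only the matching group.

```python
from collections import defaultdict

rows = [
    {"product_id": "P-1001", "brand": "AudioTech"},
    {"product_id": "P-2001", "brand": "TechCo"},
    {"product_id": "P-2002", "brand": "TechCo"},
]

# Write path: group rows into one "partition" per brand value.
partitions = defaultdict(list)
for row in rows:
    partitions[row["brand"]].append(row)

# Read path: a WHERE brand = 'TechCo' filter touches only that partition;
# the AudioTech partition is never scanned.
scanned = partitions["TechCo"]
print(f"{len(scanned)} of {len(rows)} rows scanned")
```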

Default partitioning

If you don’t specify partition columns, AWS Glue Zero-ETL partitions data using the DynamoDB primary key with bucketing. This approach supports general-purpose queries without requiring manual configuration. For specific query patterns or performance requirements, you can define custom partitioning strategies described in the subsections that follow.

Identity partitioning

Identity partitioning uses raw column values to create partitions. You apply this strategy to low-to-medium cardinality columns such as brand, category, or AWS Region. To partition the product table by productdetails.brand and create a separate partition for each brand, use this configuration:

{
  "partitionSpec": [
    {
      "fieldName": "productdetails.brand",
      "functionSpec": "identity"
    }
  ]
}

With this setup, AWS Glue creates one partition directory per unique brand value. When you query for a specific brand, Athena reads only that partition.

Important: Avoid identity partitioning on high-cardinality columns such as primary keys or timestamps. This creates many small partitions, which degrades both ingestion and query performance.

Time-based partitioning

Time-based partitioning organizes data by timestamp at a chosen granularity: year, month, day, or hour. You apply this strategy to time-series data and time-range queries. To partition the product table by month on the created_at column, which stores epoch milliseconds, use this configuration:

{
  "partitionSpec": [
    {
      "fieldName": "created_at",
      "functionSpec": "month",
      "conversionSpec": "epoch_milli"
    }
  ]
}

The conversionSpec parameter tells AWS Glue how to interpret the source timestamp. Supported values: epoch_sec (Unix seconds), epoch_milli (Unix milliseconds), and iso (ISO 8601 format).

Note: The original column values remain unchanged. AWS Glue transforms only the partition column values to timestamp type in the target table.
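The following Python sketch shows how a "month" function combined with epoch_milli conversion could derive a partition value from created_at (the naming and logic here are illustrative; the actual Iceberg transform is internal to AWS Glue):

```python
from datetime import datetime, timezone

def month_partition(epoch_milli):
    """Interpret epoch milliseconds as UTC and return a year-month bucket."""
    ts = datetime.fromtimestamp(epoch_milli / 1000, tz=timezone.utc)
    return f"{ts.year:04d}-{ts.month:02d}"

# created_at from the sample product item (December 1, 2023 UTC)
print(month_partition(1701388800000))  # → 2023-12
```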

Multi-level partitioning

You can combine strategies for a hierarchical scheme. To partition first by month and then by brand, use this configuration:

{
  "partitionSpec": [
    {
      "fieldName": "created_at",
      "functionSpec": "month",
      "conversionSpec": "epoch_milli"
    },
    {
      "fieldName": "productdetails.brand",
      "functionSpec": "identity"
    }
  ]
}

This scheme supports efficient queries that filter by date range, brand, or both. Place higher-selectivity columns first in the hierarchy and align the scheme with your most common query patterns.
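As an illustration of the resulting hierarchy, this Python sketch derives a month-then-brand partition path for an item (the directory naming is hypothetical; Iceberg's physical layout may differ):

```python
from datetime import datetime, timezone

def partition_path(item):
    """Build a two-level partition path: month of created_at, then brand."""
    ts = datetime.fromtimestamp(item["created_at"] / 1000, tz=timezone.utc)
    month = f"{ts.year:04d}-{ts.month:02d}"
    brand = item["productdetails"]["brand"]
    return f"created_at_month={month}/productdetails.brand={brand}/"

item = {"created_at": 1701388800000, "productdetails": {"brand": "AudioTech"}}
print(partition_path(item))
```

A query filtering on December 2023, on AudioTech, or on both would touch only the directories matching its filters.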

Best practices

Keep these guidelines in mind when you configure your integration:

  • Avoid identity partitioning on high-cardinality columns such as primary keys, timestamps, or system-generated IDs. This leads to partition explosion and degrades performance.
  • Apply only one time-based function per column. For example, don’t partition col1 by year, month, day, and hour simultaneously.
  • Match conversionSpec to your actual data format. If your timestamps are in epoch milliseconds, use epoch_milli, not epoch_sec or iso.
  • Choose granularity based on data volume. High-volume tables benefit from finer granularity (day or hour). Lower-volume tables work well with coarser granularity (month or year).
  • Account for timezone implications with ISO timestamps. AWS Glue Zero-ETL normalizes timestamp partition values to UTC.
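The conversionSpec mismatch in the third guideline is easy to demonstrate. In this sketch, an epoch-second timestamp misconfigured as epoch_milli gets divided by 1000 and collapses into January 1970, so every row lands in the same wrong partition:

```python
from datetime import datetime, timezone

created_at_sec = 1701388800  # epoch seconds (December 1, 2023 UTC)

# Correct: interpreted as epoch seconds.
correct = datetime.fromtimestamp(created_at_sec, tz=timezone.utc)

# Misconfigured as epoch_milli: the value is scaled down by 1000 first.
wrong = datetime.fromtimestamp(created_at_sec / 1000, tz=timezone.utc)

print(correct.year, wrong.year)  # → 2023 1970
```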

Prerequisites

To implement the AWS Glue Zero-ETL integration with a DynamoDB source, you will need:

  1. An AWS account with permissions that follow the principle of least privilege
  2. An AWS Glue database (for example, ddb_zero_etl_demo_db) with an Amazon S3 bucket associated as the database location (setup instructions)
  3. AWS Glue Data Catalog settings updated with an AWS Identity and Access Management (IAM) policy that grants fine-grained access control for zero-ETL (setup instructions)
  4. An IAM role named zetl-role for zero-ETL to use when accessing data from your DynamoDB table
  5. A DynamoDB source table (for example, product) configured for zero-ETL integration (setup instructions)

Walkthrough: Create the zero-ETL integration

Complete these steps to create a zero-ETL integration with DynamoDB as the source and Apache Iceberg tables in Amazon S3 as the target.

Step 1: Select the source type

  1. Open the AWS Glue console.
  2. In the navigation pane, under Data Integration and ETL, choose Zero-ETL integrations.
  3. Choose Create zero-ETL integration.
  4. Select Amazon DynamoDB as the source type, then choose Next.

AWS Glue console showing Step 1 of creating a Zero-ETL integration — selecting a source type from 14 available data sources including Amazon DynamoDB, Facebook Ads, Instagram Ads, MySQL, Oracle, PostgreSQL, and Microsoft SQL Server

[Figure 1: Selecting Amazon DynamoDB as the zero-ETL source type]

Step 2: Configure source and target

  1. In Source details, select your DynamoDB table (for example, product).
  2. In Target details:
    • Select the current account as target.
    • Choose the catalog and target database (for example, ddb_zero_etl_demo_db).
    • Select the IAM role (for example, zetl-role).

AWS Glue console Step 2 — configuring source and target for a zero-ETL integration with Amazon DynamoDB "product" table as source and an AWS Glue catalog database "ddb_zero_etl_demo_db" as target

[Figure 2: Configuring source DynamoDB table and target database]

Step 3: Configure output settings

  1. Under Schema unnesting, select Unnest all fields.
  2. Under Data partitioning, select Specify custom partition keys.
  3. Enter the partition key (for example, productdetails.brand) and set the function to Identity.
  4. Choose Next.

AWS Glue Zero-ETL integration output settings showing schema unnesting set to "Unnest all fields," custom partition key "productdetails.brand" configured with Identity function, and target table named "product."

[Figure 3: Configuring schema unnesting and partition key settings]

Step 4: Set integration details

  1. Optionally configure encryption and replication settings. The default refresh interval is 15 minutes.
  2. Enter a name for the integration (for example, ddb-zero-etl-demo).
  3. Choose Next.

AWS Glue Zero-ETL integration Step 3 — configuring security with AWS managed KMS key, replication refresh interval set to 15 minutes, and integration named "ddb-zero-etl-demo"

[Figure 4: Configuring encryption and replication settings]

Step 5: Review and create

  1. Review your settings and choose Create and launch integration.
  2. The integration shows as Active within about a minute.

AWS Glue Zero-ETL integration Step 4: Review and Create — showing DynamoDB "product" table as source, Glue database "zett_target" as target with IAM role "zett-role," and partition key "productdetails.brand" with Identity function

[Figure 5: Review and create summary]

AWS Glue Zero-ETL Integration Details page showing "ddb-zero-etl-demo-test" integration with status "Creating," DynamoDB "product" table as source, Glue database "ddb_zero_etl_demo_db" as target, and a 15-minute refresh interval

[Figure 6: Integration active with successful status]

Query the replicated data

After the integration is active and the initial replication completes (typically 15–30 minutes), you can query the data in Amazon Athena.

Preview the replicated data

  1. Open the Amazon Athena console.
  2. In the query editor, select your target database (for example, ddb_zero_etl_demo_db).
  3. Run a preview query:
SELECT * FROM "ddb_zero_etl_demo_db"."product" LIMIT 10;

Verify schema unnesting

With Unnest all fields selected, nested attributes appear as individual columns with dot notation:

SELECT "productdetails.brand", "productdetails.category", "pricing.list_price" 
FROM "ddb_zero_etl_demo_db"."product"
WHERE "productdetails.category" = 'Electronics';

Verify partition pruning

Queries that filter on the partition column (productdetails.brand) automatically skip irrelevant partitions:

SELECT product_id, name, "pricing.list_price"
FROM "ddb_zero_etl_demo_db"."product"
WHERE "productdetails.brand" = 'AudioTech';

Amazon Athena Query Editor showing a completed SQL query selecting brand, category, and product ID from a DynamoDB zero-ETL Glue catalog table, returning two results: Samsung SmartPhone P22445 and TechCo SmartPhone P12345

[Figure 7: Athena query to retrieve the data from Apache Iceberg lakehouse]

You can verify the partition structure by navigating to the Amazon S3 bucket associated with your database. The data is organized into directories like the following:

Amazon S3 bucket browser showing the "data/" folder in "ddb-zero-etl-demo-bucket" with two partitioned folders: "productdetails.brand=Samsung/" and "productdetails.brand=TechCo/" — confirming Iceberg partition structure from DynamoDB zero-ETL integration

[Figure 8: Amazon S3 bucket organization for the identity partition productdetails.brand]

Clean up

To avoid ongoing charges, delete the resources in this order:

  1. Delete the zero-ETL integration. In the AWS Glue console, navigate to Zero-ETL integrations, select your integration, and choose Delete. Existing replicated data remains in the target, but new changes stop replicating.
  2. Delete the replicated table. In the AWS Glue Data Catalog, navigate to Tables, select the replicated table, and delete it.
  3. Delete the AWS Glue database. In the Data Catalog, select the database and delete it.
  4. Delete the Amazon S3 data. Empty and delete the S3 bucket associated with the database.
  5. Delete the DynamoDB table. If you created it for this walkthrough, delete the source table.
  6. Delete IAM resources. Remove the IAM role and policies created for the integration.

Conclusion

You configured schema unnesting and data partitioning for a DynamoDB zero-ETL integration, replicated a product catalog table to Apache Iceberg tables in Amazon S3, and verified the results in Amazon Athena. Unnesting flattened nested attributes into directly queryable columns. Partitioning helped the query engine skip irrelevant data, reducing both query time and cost. To take your integration further, try monitoring replication lag and data freshness with Amazon CloudWatch metrics. You can also experiment with different partitioning strategies on a staging table before applying them to production workloads, testing time-based partitioning alongside identity partitioning to find the optimal scheme for your query patterns. For broader analytics coverage, query the same Iceberg tables from Amazon Redshift Spectrum or Amazon EMR alongside Athena.


About the authors

Raju Ansari

Raju is a Senior Software Development Engineer at AWS, specializing in building scalable, secure, serverless solutions that simplify data analytics and AI agent development. He helps organizations modernize their data analytics infrastructure and develop cutting-edge AI agentic applications. Currently, Raju focuses on building foundational AI services, including Amazon Bedrock Agents, which enable developers to create intelligent, autonomous applications at scale. Outside of work, Raju is passionate about giving back to the tech community. He actively volunteers at IEEE events and mentors early- and mid-career professionals.

Shashank Sharma

Shashank is an Engineering Leader with over 15 years of experience delivering data integration and replication solutions for first-party and third-party databases and SaaS applications for enterprise customers. He leads engineering for AWS Glue Zero-ETL and Amazon AppFlow, building fully managed pipelines that replicate data from sources like Salesforce, SAP, DynamoDB, and Oracle into Amazon Redshift and Apache Iceberg-based data lakes. Shashank advises startups on technology strategy and mentors engineers and technical leaders at various career stages.