AWS Big Data Blog

Best practices for querying Apache Iceberg data with Amazon Redshift

Apache Iceberg is an open table format that helps combine the benefits of using both data warehouse and data lake architectures, giving you choice and flexibility for how you store and access data. See Using Apache Iceberg on AWS for a deeper dive on using AWS Analytics services for managing your Apache Iceberg data. Amazon Redshift supports querying Iceberg tables directly, whether they’re fully managed using Amazon S3 Tables or self-managed in Amazon S3. Understanding best practices for how to architect, store, and query Iceberg tables with Redshift helps you meet your price and performance targets for your analytical workloads.

In this post, we discuss the best practices that you can follow while querying Apache Iceberg data with Amazon Redshift.

1. Follow the table design best practices

Selecting the right data types for Iceberg tables is essential for efficient query performance and data integrity. Match the data types of the columns to the nature of the data they store, rather than using generic or overly broad data types.

Why follow table design best practices?

  • Optimized Storage and Performance: By using the most appropriate data types, you can reduce the amount of storage required for the table and improve query performance. For example, using the DATE data type for date columns instead of a STRING or TIMESTAMP type can reduce the storage footprint and improve the efficiency of date-based operations.
  • Improved Join Performance: The data types used for columns participating in joins can impact query performance. Numeric types (such as INTEGER, BIGINT, and DECIMAL) are generally more efficient for join operations than string-based types (such as VARCHAR and TEXT), because numeric values can be compared and sorted quickly, leading to more efficient hash-based join algorithms.
  • Data Integrity and Consistency: Choosing the correct data types helps with data integrity by enforcing the appropriate constraints and validations. This reduces the risk of data corruption or unexpected behavior, especially when data is ingested from multiple sources.

How to follow table design best practices?

  • Leverage Iceberg Type Mapping: Iceberg has built-in type mapping that translates between different data sources and the Iceberg table’s schema. Understand how Iceberg handles type conversions and use this knowledge to define the most appropriate data types for your use case.
  • Select the smallest possible data type that can accommodate your data. For example, use INT instead of BIGINT if the values fit within the integer range, or SMALLINT if they fit even smaller ranges.
  • Utilize fixed-length data types when data length is consistent. This can help with predictable and faster performance.
  • Choose character types like VARCHAR or TEXT for text, prioritizing VARCHAR with an appropriate length for efficiency. Avoid over-allocating VARCHAR lengths, which can waste space and slow down operations.
  • Match numeric precision to your actual requirements. Using unnecessarily high precision (such as DECIMAL(38,20) instead of DECIMAL(10,2) for currency) demands more storage and processing, leading to slower query execution times for calculations and comparisons.
  • Employ date and time data types (such as DATE and TIMESTAMP) rather than storing dates as text or numbers. This optimizes storage and allows for efficient temporal filtering and operations.
  • Use the BOOLEAN type instead of integers to represent true/false states. This saves space and potentially enhances processing speed.
  • If the column will be used in join operations, favor data types that are typically used for indexing. Integers and date/time types generally allow for faster searching and sorting than larger, less efficient types like VARCHAR(MAX).
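
To make these guidelines concrete, the following is a minimal sketch of an Athena DDL statement for a hypothetical orders table; the table name, column names, and bucket path are illustrative placeholders rather than part of a specific dataset.

CREATE TABLE iceberg_db.orders (
  order_id BIGINT,
  order_status STRING,
  order_date DATE,
  is_gift BOOLEAN,
  order_total DECIMAL(10,2))
LOCATION 's3://amzn-s3-demo-bucket/orders/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

Here, dates and boolean flags use DATE and BOOLEAN rather than strings or integers, and DECIMAL(10,2) keeps the monetary column at only the precision the data needs.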

2. Partition your Apache Iceberg table on columns that are most frequently used in filters

When working with Apache Iceberg tables in conjunction with Amazon Redshift, one of the most effective ways to optimize query performance is to partition your data strategically. The key principle is to partition your Iceberg table based on columns that are most frequently used in query filters. This approach can significantly improve query efficiency and reduce the amount of data scanned, leading to faster query execution and lower costs.

Why does partitioning Iceberg tables matter?

  • Improved Query Performance: When you partition on columns commonly used in WHERE clauses, Amazon Redshift can eliminate irrelevant partitions, reducing the amount of data it needs to scan. For example, if you have a sales table partitioned by date and you run a query to analyze sales data for January 2024, Amazon Redshift will only scan the January 2024 partition instead of the entire table. This partition pruning can dramatically improve query performance—in this scenario, if you have five years of sales data, scanning just one month means examining only 1.67% of the total data, potentially reducing query execution time from minutes to seconds.
  • Reduced Scan Costs: By scanning less data, you can lower the computational resources required and, consequently, the associated costs.
  • Better Data Organization: Logical partitioning helps in organizing data in a way that aligns with common query patterns, making data retrieval more intuitive and efficient.

How to partition Iceberg tables?

  • Analyze your workload to determine which columns are most frequently used in filter conditions. For example, if you always filter your data for the last 6 months, then the date column is a good partition key.
  • Select columns whose cardinality is high enough to be selective, but not so high that you create too many small partitions. Good candidates often include:
    • Date or timestamp columns (such as year, month, or day)
    • Categorical columns with a moderate number of distinct values (such as region or product category)
  • Define Partition Strategy: Use Iceberg’s partitioning capabilities to define your strategy. For example, if you are using Amazon Athena to create a partitioned Iceberg table, you can use the following syntax:
CREATE TABLE [db_name.]table_name (col_name data_type [COMMENT col_comment] [, ...])
[PARTITIONED BY (col_name | transform, ...)]
LOCATION 's3://amzn-s3-demo-bucket/your-folder/'
TBLPROPERTIES ('table_type' = 'ICEBERG' [, property_name=property_value])

Example

CREATE TABLE iceberg_db.my_table (
  id INT,
  date DATE,
  region STRING,
  sales DECIMAL(10,2))
PARTITIONED BY (date, region)
LOCATION 's3://amzn-s3-demo-bucket/your-folder/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
  • Ensure your Redshift queries take advantage of the partitioning scheme by including partition columns in the WHERE clause whenever possible.

Walk-through with a sample use case
Let’s take an example to understand how to pick the best partition key by following best practices. Consider an e-commerce company looking to optimize their sales data analysis using Apache Iceberg tables with Amazon Redshift. The company maintains a table called sales_transactions, which has data for 5 years across four regions (North America, Europe, Asia, and Australia) with five product categories (Electronics, Clothing, Home & Garden, Books, and Toys). The dataset includes key columns such as transaction_id, transaction_date, customer_id, product_id, product_category, region, and sale_amount.

The data science team uses the transaction_date and region columns frequently in filters, while product_category is used less frequently. The transaction_date column has high cardinality (one value per day), region has low cardinality (only 4 distinct values), and product_category has moderate cardinality (5 distinct values).

Based on this analysis, an effective partition strategy would be to partition by month on transaction_date (using Iceberg’s month transform) and by region. This creates a manageable number of partitions while speeding up the most common query patterns. Here’s how we could implement this strategy using Amazon Athena:

CREATE TABLE iceberg_db.sales_transactions (
  transaction_id STRING,
  transaction_date DATE,
  customer_id STRING,
  product_id STRING,
  product_category STRING,
  region STRING,
  sale_amount DECIMAL(10,2))
PARTITIONED BY (month(transaction_date), region)
LOCATION 's3://amzn-s3-demo-bucket/sales-data/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
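
With this layout, dashboard queries that filter on the partition columns let Redshift prune partitions. The following is a minimal sketch of such a query, assuming the table is exposed to Redshift through an external schema named iceberg_db (the schema mapping is an assumption for illustration):

SELECT region, SUM(sale_amount) AS total_sales
FROM iceberg_db.sales_transactions
WHERE transaction_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
  AND region = 'Europe'
GROUP BY region;

Because the filter covers a single month and region, only the matching partitions are scanned instead of the full five years of data.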

3. Optimize by selecting only the necessary columns in your queries

Another best practice for working with Iceberg tables is to only select the columns that are necessary for a given query, and to avoid using the SELECT * syntax.

Why should you select only necessary columns?

  • Improved Query Performance: In analytics workloads, users typically analyze subsets of data, performing large-scale aggregations or trend analyses. To optimize these operations, analytics storage systems and file formats are designed for efficient column-based reading. Examples include columnar open file formats like Apache Parquet and columnar databases such as Amazon Redshift. A key best practice is to select only the required columns in your queries, so the query engine can reduce the amount of data that needs to be processed, scanned, and returned. This can lead to significantly faster query execution times, especially for large tables.
  • Reduced Resource Utilization: Fetching unnecessary columns consumes additional system resources, such as CPU, memory, and network bandwidth. Limiting the columns selected can help optimize resource utilization and improve the overall efficiency of the data processing pipeline.
  • Lower Data Transfer Costs: When querying Iceberg tables stored in cloud storage (such as Amazon S3), the amount of data transferred from the storage service to the query engine can directly impact the data transfer costs. Selecting only the required columns can help minimize these costs.
  • Better Data Locality: Iceberg partitions data based on the values in the partition columns. By selecting only the necessary columns, the query engine can better leverage the partitioning scheme to improve data locality and reduce the amount of data that needs to be scanned.

How to select only necessary columns?

  • Identify the Columns Needed: Carefully analyze the requirements of each query and determine the minimum set of columns required to fulfill the query’s purpose.
  • Use Selective Column Names: In the SELECT clause of your SQL queries, explicitly list the column names you need rather than using SELECT *, as shown in the following example.
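
The following sketch contrasts the two approaches against the sales_transactions table from the earlier walk-through (the iceberg_db external schema name is an assumption):

-- Avoid: reads every column in the table
SELECT * FROM iceberg_db.sales_transactions WHERE region = 'Europe';

-- Prefer: reads only the columns the analysis needs
SELECT transaction_date, sale_amount
FROM iceberg_db.sales_transactions
WHERE region = 'Europe';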

4. Generate AWS Glue Data Catalog column-level statistics

Table statistics play an important role in database systems that utilize Cost-Based Optimizers (CBOs), such as Amazon Redshift. They help the CBO make informed decisions about query execution plans. When a query is submitted to Amazon Redshift, the CBO evaluates multiple possible execution plans and estimates their costs. These cost estimates heavily depend on accurate statistics about the data, including table size (number of rows), column value distributions, number of distinct values in columns, data skew information, and more.

AWS Glue Data Catalog supports generating statistics for data stored in the data lake including for Apache Iceberg. The statistics include metadata about the columns in a table, such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values. These column-level statistics provide valuable metadata that helps optimize query performance and improve cost efficiency when working with Apache Iceberg tables.

Why does generating AWS Glue statistics matter?

  • Amazon Redshift can generate better query plans using column statistics, thereby improving query performance through optimized join orders, better predicate pushdown, and more accurate resource allocation.
  • Costs are optimized: better execution plans lead to reduced data scanning, more efficient resource utilization, and lower overall query costs.

How to generate AWS Glue statistics?

The SageMaker Lakehouse catalog enables you to generate statistics automatically for updated and created tables with a one-time catalog configuration. As new tables are created, the number of distinct values (NDVs) is collected for Iceberg tables. By default, the Data Catalog generates and updates column statistics for all columns in the tables on a weekly basis. This job analyzes 50% of records in the tables to calculate statistics.

  • On the Lake Formation console, choose Catalogs in the navigation pane.
  • Select the catalog that you want to configure, and choose Edit on the Actions menu.
  • Select Enable automatic statistics generation for the tables of the catalog and choose an IAM role. For the required permissions, see Prerequisites for generating column statistics.
  • Choose Submit.

You can override the defaults and customize statistics collection at the table level to meet specific needs. For frequently updated tables, statistics can be refreshed more often than weekly. You can also specify target columns to focus on those most commonly queried, and set what percentage of table records to use when calculating statistics: increase this percentage for tables that need more precise statistics, or decrease it for tables where a smaller sample is sufficient, to optimize costs and statistics generation performance. These table-level settings override the catalog-level settings previously described.

Read the blog Introducing AWS Glue Data Catalog automation for table statistics collection for improved query performance on Amazon Redshift and Amazon Athena for more information.

5. Implement Table Maintenance Strategies for Optimal Performance

Over time, Apache Iceberg tables can accumulate various types of metadata and file artifacts that impact query performance and storage efficiency. Understanding and managing these artifacts is crucial for maintaining optimal performance of your data lake. As you use Iceberg tables, four main types of artifacts accumulate:

  • Small Files: When data is ingested into Iceberg tables, especially through streaming or frequent small batch updates, many small files can accumulate because each write operation typically creates new files rather than appending to existing ones.
  • Deleted Data Artifacts: With merge-on-read, Iceberg handles updates and deletes by writing “delete markers” rather than immediately rewriting the underlying data files. These markers need to be processed during reads to filter out deleted records.
  • Snapshots: Every time you make changes to your table (insert, update, or delete data), Iceberg creates a new snapshot—essentially a point-in-time view of your table. While valuable for maintaining history, these snapshots increase metadata size over time, impacting query planning and execution.
  • Unreferenced Files: These are files that exist in storage but aren’t linked to any current table snapshot. They occur in two main scenarios:
    1. When old snapshots are expired, the files exclusively referenced by those snapshots become unreferenced
    2. When write operations are interrupted or fail midway, creating data files that aren’t properly linked to any snapshot

Why does table maintenance matter?

Regular table maintenance delivers several important benefits:

  • Enhanced Query Performance: Consolidating small files reduces the number of file operations required during queries, while removing excess snapshots and delete markers streamlines metadata processing. These optimizations allow query engines to access and process data more efficiently.
  • Optimized Storage Utilization: Expiring old snapshots and removing unreferenced files frees up valuable storage space, helping you maintain cost-effective storage utilization as your data lake grows.
  • Improved Resource Efficiency: Maintaining well-organized tables with optimized file sizes and clean metadata requires less computational resources for query execution, allowing your analytics workloads to run faster and more efficiently.
  • Better Scalability: Properly maintained tables scale more effectively as data volumes grow, maintaining consistent performance characteristics even as your data lake expands.

How to perform table maintenance?

Three key maintenance operations help optimize Iceberg tables:

  1. Compaction: Combines smaller files into larger ones and merges delete files with data files, resulting in streamlined data access patterns and improved query performance.
  2. Snapshot Expiration: Removes old snapshots that are no longer needed while maintaining a configurable history window.
  3. Unreferenced File Removal: Identifies and removes files that are no longer referenced by any snapshot, reclaiming storage space and reducing the total number of objects the system needs to track.

AWS offers a fully managed Apache Iceberg data lake solution called Amazon S3 Tables that automatically takes care of table maintenance, including:

  • Automatic Compaction: S3 Tables automatically performs compaction by combining multiple smaller objects into fewer, larger objects to improve Apache Iceberg query performance. When combining objects, compaction also applies the effects of row-level deletes in your table. You can manage the compaction process through configurable table-level properties.
    • targetFileSizeMB: Default is 512 MB. Can be configured to a value between 64 MB and 512 MB.

Apache Iceberg offers several compaction strategies, such as binpack, sort, and z-order. By default, Amazon S3 automatically selects the best of these three strategies based on your table’s sort order.

  • Automated Snapshot Management: S3 Tables automatically expires older snapshots based on configurable table-level properties:
    • MinimumSnapshots (1 by default): Minimum number of table snapshots that S3 Tables will retain
    • MaximumSnapshotAge (120 hours by default): This parameter determines the maximum age, in hours, for snapshots to be retained
  • Unreferenced File Removal: Automatically identifies and deletes objects not referenced by any table snapshot, based on configurable bucket-level properties:
    • unreferencedDays (3 days by default): Objects not referenced for this duration are marked as noncurrent
    • nonCurrentDays (10 days by default): Noncurrent objects are deleted after this duration

Note: Deletes of noncurrent objects are permanent with no way to recover these objects.

If you are managing Iceberg tables yourself, you’ll need to implement these maintenance tasks:

Using Athena:
  • Run the OPTIMIZE command using the following syntax:
    OPTIMIZE [database_name.]table_name REWRITE DATA USING BIN_PACK [WHERE predicate];

    This command triggers the compaction process, which utilizes a bin-packing algorithm to group small data files into larger ones. It also merges delete files with existing data files, effectively cleaning up the table and improving its structure.

  • Set the following table properties during Iceberg table creation:
    • vacuum_min_snapshots_to_keep (default 1): minimum number of snapshots to retain
    • vacuum_max_snapshot_age_seconds (default 432000 seconds, or 5 days): maximum age of snapshots to retain
  • Periodically run the VACUUM command to expire old snapshots and remove unreferenced files; this is recommended after operations such as MERGE on Iceberg tables. Syntax: VACUUM [database_name.]target_table. VACUUM performs both snapshot expiration and orphan file removal, as shown in the example below.
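
For example, to maintain the sales_transactions table from the earlier walk-through (a sketch; run these statements from Athena with the appropriate workgroup and permissions):

-- Compact small files; optionally add a WHERE predicate to target specific partitions
OPTIMIZE iceberg_db.sales_transactions REWRITE DATA USING BIN_PACK;

-- Expire old snapshots and remove unreferenced files per the table's vacuum properties
VACUUM iceberg_db.sales_transactions;
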
Using Spark SQL:
  • Schedule regular compaction jobs with Iceberg’s rewrite files action
  • Use expireSnapshots operation to remove old snapshots
  • Run deleteOrphanFiles operation to clean up unreferenced files
  • Establish a maintenance schedule based on your write patterns (hourly, daily, weekly)
  • Run these operations in sequence, typically compaction followed by snapshot expiration and unreferenced file removal (see the sketch after this list)
  • It’s especially important to run these operations after large ingest jobs, heavy delete operations, or overwrite operations
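
The following sketch shows what such a maintenance sequence can look like using Iceberg’s built-in Spark procedures, assuming a Spark session whose Iceberg catalog is registered as glue_catalog; the catalog name, table name, and cutoff timestamp are placeholders:

-- 1. Compact small data files (binpack is the default strategy)
CALL glue_catalog.system.rewrite_data_files(table => 'iceberg_db.sales_transactions');

-- 2. Expire snapshots older than the chosen cutoff, always keeping the most recent one
CALL glue_catalog.system.expire_snapshots(
  table => 'iceberg_db.sales_transactions',
  older_than => TIMESTAMP '2025-01-01 00:00:00.000',
  retain_last => 1);

-- 3. Remove files no longer referenced by any snapshot
CALL glue_catalog.system.remove_orphan_files(table => 'iceberg_db.sales_transactions');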

6. Create incremental materialized views on Apache Iceberg tables in Redshift to improve performance of time-sensitive dashboard queries

Organizations across industries rely on data lake powered dashboards for time-sensitive metrics like sales trends, product performance, regional comparisons, and inventory rates. With underlying Iceberg tables containing billions of records and growing by millions daily, recalculating metrics from scratch during each dashboard refresh creates significant latency and degrades user experience.

The integration between Apache Iceberg and Amazon Redshift enables creating incremental materialized views on Iceberg tables to optimize dashboard query performance. These views enhance efficiency by:

  • Pre-computing and storing complex query results
  • Using incremental maintenance to process only the changes made since the last refresh
  • Reducing compute and storage costs compared to full recalculations

Why do incremental materialized views on Iceberg tables matter?

  • Performance Optimization: Pre-computed materialized views significantly accelerate dashboard queries, especially when accessing large-scale Iceberg tables
  • Cost Efficiency: Incremental maintenance through Amazon Redshift processes only recent changes, avoiding expensive full recomputation cycles
  • Customization: Views can be tailored to specific dashboard requirements, optimizing data access patterns and reducing processing overhead

How to create incremental materialized views?

  • Determine which Iceberg tables are the primary data sources for your time-sensitive dashboard queries.
  • Use the CREATE MATERIALIZED VIEW statement to define the materialized views on the Iceberg tables. Ensure that the materialized view definition includes only the necessary columns and any applicable aggregations or transformations.
  • If your view definition uses only operators that are eligible for incremental refresh, Amazon Redshift automatically creates an incrementally refreshable materialized view. Refer to Limitations for incremental refresh to understand the operations that are not eligible.
  • Regularly refresh the materialized views using the REFRESH MATERIALIZED VIEW command, as shown in the following example.
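
The following is a minimal sketch against the sales_transactions table from the earlier walk-through, assuming it is available to Redshift through an external schema named iceberg_db:

CREATE MATERIALIZED VIEW daily_regional_sales AS
SELECT transaction_date, region, SUM(sale_amount) AS total_sales
FROM iceberg_db.sales_transactions
GROUP BY transaction_date, region;

-- Run on a schedule, or after new data lands, to apply only the recent changes
REFRESH MATERIALIZED VIEW daily_regional_sales;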

7. Create late binding views (LBVs) on Iceberg tables to encapsulate business logic

Amazon Redshift’s support for late binding views on external tables, including Apache Iceberg tables, allows you to encapsulate your business logic within the view definition. This best practice provides several benefits when working with Iceberg tables in Redshift.

Why create LBVs?

  • Centralized Business Logic: By defining the business logic in the view, you can ensure that the transformation, aggregation, and other processing steps are consistently applied across all queries that reference the view. This promotes code reuse and maintainability.
  • Abstraction from Underlying Data: Late binding views decouple the view definition from the underlying Iceberg table structure. This allows you to make changes to the Iceberg table, such as adding or removing columns, without having to update the view definitions that depend on the table.
  • Improved Query Performance: Redshift can optimize the execution of queries against late binding views, leveraging techniques like predicate pushdown and partition pruning to minimize the amount of data that needs to be processed.
  • Enhanced Data Security: By defining access controls and permissions at the view level, you can grant users access to only the data and functionality they require, improving the overall security of your data environment.

How to create LBVs?

  • Identify suitable Apache Iceberg tables: Determine which Iceberg tables are the primary data sources for your business logic and reporting requirements.
  • Create late binding views (LBVs): Use the CREATE VIEW statement to define the late binding views on the external Iceberg tables. Incorporate the necessary transformations, aggregations, and other business logic within the view definition.
    Example (the external schema and table names below are placeholders):

    CREATE VIEW my_iceberg_view AS
    SELECT col1, col2, SUM(col3) AS total_col3
    FROM external_schema.my_iceberg_table
    GROUP BY col1, col2
    WITH NO SCHEMA BINDING;
    
  • Grant View Permissions: Assign the appropriate permissions to the views, granting access only to the users or roles that require the encapsulated business logic, as shown in the following example.
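
A minimal sketch granting read access on the view to a hypothetical Redshift role (the role name is a placeholder):

GRANT SELECT ON my_iceberg_view TO ROLE reporting_role;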

Conclusion

In this post, we covered best practices for using Amazon Redshift to query Apache Iceberg tables, focusing on fundamental design decisions. One key area is table design and data type selection, as this can have the greatest impact on your storage size and query performance. Additionally, using Amazon S3 Tables gives you fully managed tables that automatically handle essential maintenance tasks like compaction, snapshot management, and vacuum operations, allowing you to focus on building your analytical applications.

As you build out your workflows to use Amazon Redshift with Apache Iceberg tables, consider the following best practices to help you achieve your workload goals:

  • Adopt Amazon S3 Tables for new implementations to leverage automated management features
  • Audit existing table designs to identify opportunities for optimization
  • Develop a clear partitioning strategy based on actual query patterns
  • For self-managed Apache Iceberg tables on Amazon S3, implement automated maintenance procedures for statistics generation and compaction


About the authors

Anusha Challa

Anusha is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped hundreds of customers build large-scale analytics solutions in the cloud and on premises. She is passionate about data analytics, data science, and AI.

Mohammed Alkateb

Mohammed is an Engineering Manager at Amazon Redshift. Mohammed has 18 US patents, and he has publications in research and industrial tracks of premier database conferences including EDBT, ICDE, SIGMOD and VLDB. Mohammed holds a PhD in Computer Science from The University of Vermont, and MSc and BSc degrees in Information Systems from Cairo University.

Jonathan Katz

Jonathan is a Core Team member of the open source PostgreSQL project and an active open source contributor.