AWS Partner Network (APN) Blog

Building a Data Lakehouse with Amazon S3 and Dremio on Apache Iceberg Tables

By Alex Merced, Developer Advocate – Dremio
By Sathisan Vannadil, Sr. Partner Solutions Architect – AWS
By Jorge A Lopez, Principal Analytics, Storage – AWS


Recent innovations in open-source technologies, such as Apache Arrow, which enables faster processing, and the Apache Iceberg open table format, which allows for more efficient query planning, mean that the types and scale of workloads you can run directly on your data lake storage are more expansive than ever.

The resulting architecture enables a world where you have the power and performance of a data warehouse with the affordability and flexibility of the data lake. This is referred to as a data lakehouse.

In this post, we discuss how you can efficiently implement a data lakehouse using Amazon Simple Storage Service (Amazon S3) and Dremio on Apache Iceberg.

Apache Iceberg is a high-performance, open table format for large analytical tables, designed to mitigate the challenges that unforeseen data and schema changes create for enterprises. Iceberg enables data teams to quickly, easily, and safely keep up with data and analytics changes, which in turn helps businesses process those changes end-to-end with fast turnaround times.

Dremio is an AWS Data and Analytics Competency Partner and AWS Marketplace Seller whose data lake engine delivers fast query speed and a self-service semantic layer operating directly against Amazon S3 data. Dremio combines an Apache Arrow-based query engine with acceleration technologies to provide 10-100x faster query performance.

Data Lakehouse

Data lakehouse architecture aims to maximize the potential of data lake storage by leveraging open formats. The main benefits are:

  • Lower storage costs: Leveraging open formats allows data to be stored cost-effectively.
  • Reduced compute costs: Reducing the need for extensive extract, transform, load (ETL) processes results in lower compute costs. Additionally, more affordable query processing options enable cost optimization.
  • Tool diversity: Organizations have the freedom to choose from a wide range of tools.
  • Data control: Data remains within your own storage infrastructure.
  • Compute and storage separation: Users can independently scale compute resources without affecting the underlying storage infrastructure, providing greater adaptability to changing analytics requirements.

Components of a Data Lakehouse

One of the advantages of data lakehouse architecture is that it's modular; components are loosely coupled compared to a traditional data warehouse. The components needed for a data lakehouse are:

  • File format: To save data in a format that lends itself to diverse workloads.
  • Storage: Scalable, durable, and secure central repository capable of storing data.
  • Table format: A way for tools to know which data in the storage is a table.
  • Catalog: Repository to track which tables exist.
  • Query engine: A tool that can recognize our catalog and run workloads on the tables it tracks.
  • Governance tool: To define, enforce, and audit data access and usage policies.

For this architecture, we’ll use the following components:

  • File format: Apache Parquet
  • Storage: Amazon S3
  • Table format: Apache Iceberg
  • Catalog: AWS Glue
  • Query engine: Dremio
  • Governance tool: AWS Lake Formation

Apache Parquet

Apache Parquet is a binary columnar format that (1) brings speed and performance and stores more data in less space; (2) structures the data in a columnar layout well suited to analytical workloads; and (3) provides metadata via footers and row groups that allow query engines to scan these files more intelligently.
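
To see this footer and row-group metadata in practice, you can inspect a Parquet file with pyarrow. The following is a minimal sketch; the file name sample.parquet is a placeholder.

    import pyarrow.parquet as pq

    # Open the Parquet file and read its footer metadata
    pf = pq.ParquetFile("sample.parquet")
    print(pf.schema_arrow)                            # column names and types
    print("row groups:", pf.metadata.num_row_groups)

    # Per-column min/max statistics in each row group let engines skip
    # row groups that cannot match a query's filters
    rg = pf.metadata.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(col.path_in_schema, col.statistics)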

Amazon S3

Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance. With cost-effective storage classes and easy-to-use management features, you can optimize costs, organize data, and configure fine-tuned access controls to meet specific business, organizational, and compliance requirements.

Apache Iceberg

Apache Iceberg is a table format designed for data lake storage, providing capabilities similar to a traditional database. It adds a layer of metadata that organizes data files into tables, enabling engines to optimize query execution by identifying relevant files beforehand. This metadata-driven approach enhances query performance and efficiency by reducing the need to open unnecessary data files during processing.


Figure 1 – Architecture of Apache Iceberg.

The architecture of Apache Iceberg enables many of its features:

  • Enables ACID (Atomicity, Consistency, Isolation, Durability) transactions on lake storage.
  • Speeds up query planning and execution with partition pruning and min/max filtering.
  • Supports schema evolution.
  • Supports partition evolution without rewriting the table.
  • Allows time travel to query previous table snapshots.
  • Offers hidden partitioning, making it easier for end users to take advantage of table partitioning.
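
As an illustration, the sketch below exercises schema evolution and time travel with Spark SQL. It assumes a Spark session already configured with an Iceberg catalog named glue_catalog (the same configuration used in the AWS Glue job later in this post); the database, table, and snapshot ID are placeholders.

    from pyspark.sql import SparkSession

    # Assumes the Iceberg Spark runtime is on the classpath and the glue_catalog
    # settings (see the --conf job parameter later in this post) are applied
    spark = SparkSession.builder.appName("iceberg-features-demo").getOrCreate()

    # Schema evolution: add a column without rewriting existing data files
    spark.sql("ALTER TABLE glue_catalog.sampledb.sample_table ADD COLUMN region STRING")

    # Time travel: list snapshots, then query the table as of an earlier snapshot
    spark.sql("SELECT snapshot_id, committed_at FROM glue_catalog.sampledb.sample_table.snapshots").show()
    spark.sql("SELECT * FROM glue_catalog.sampledb.sample_table VERSION AS OF 1234567890123456789").show()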

AWS Glue

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue supports Iceberg in several ways: you can use AWS Glue to perform read and write operations on Iceberg tables in S3, or work with Iceberg tables through the AWS Glue Data Catalog.

Dremio

Dremio is a query engine that leverages the in-memory processing power of Apache Arrow and the smart query planning of Apache Iceberg, top to bottom, to create an easy, fast, and open platform. With Dremio, you can not only query data but also provide an easy-to-use access point for all data users.


Figure 2 – Architecture overview of Dremio.

AWS Lake Formation

AWS Lake Formation enables you to easily create secure data lakes, making data available for wide-ranging analytics. When it comes to Iceberg, AWS Lake Formation lets you define and enforce database, table, and column-level access policies for querying Iceberg tables by providing an authorization and governance layer on the Iceberg data stored in S3.

Building a Data Lakehouse

In this section, we leverage the components described above to build a data lakehouse.

Ingesting Data

As the first step, we'll ingest data from the source, land it in Amazon S3 as Parquet files, and register the data as an Apache Iceberg table in the AWS Glue Data Catalog. AWS Glue Studio simplifies this with built-in Apache Iceberg functionality. Download the dataset and upload it to an S3 bucket.
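
If you prefer to script the upload, the following is a minimal sketch using boto3; the local file name, bucket, and key are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Upload the downloaded dataset to the bucket the Glue job will read from
    s3.upload_file(
        Filename="sample_dataset.csv",
        Bucket="<your-dataset-bucket>",
        Key="raw/sample_dataset.csv",
    )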

Next, we'll create a database (for example, "sampledb") in your AWS Glue Data Catalog. Now, create an ETL job with the AWS Glue visual ETL tool. Head over to the Visual ETL section of AWS Glue and select "Visual with a blank canvas" with the selections shown below.


Figure 3 – Creating ETL job with the AWS Glue visual ETL tool.

For the job parameters, set the following values:

Key: --datalake-formats
Value: iceberg

Key: --conf
Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/ --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

Note: Make sure there are no line breaks in the value you paste for the --conf key.

Make sure to change <your-warehouse-dir> to the S3 location where you want the created tables to be saved. Save the job, head to the Script tab, and edit your script to look similar to the one found here, which does the following (a sketch of the script follows this list):

  • Converts the DynamicFrame from reading the CSV into a DataFrame.
  • Turns the DataFrame into a temporary view.
  • Uses CTAS (CREATE TABLE AS SELECT) to convert the CSV file into an Apache Iceberg table.
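
For reference, here is a minimal sketch of such a script. It assumes the job parameters above are set, a Glue database named "sampledb" exists, and the dataset lives in a placeholder bucket; adjust the paths and names to match your environment.

    import sys
    from awsglue.context import GlueContext
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Read the raw CSV data from S3 into a DynamicFrame, then convert to a DataFrame
    dynamic_frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://<your-dataset-bucket>/raw/"]},
        format="csv",
        format_options={"withHeader": True},
    )
    df = dynamic_frame.toDF()

    # Expose the DataFrame as a temporary view so it can be referenced in SQL
    df.createOrReplaceTempView("raw_csv")

    # CTAS: write the data out as an Apache Iceberg table registered in the
    # AWS Glue Data Catalog (via the glue_catalog configured in the job parameters)
    spark.sql("""
        CREATE TABLE glue_catalog.sampledb.sample_table
        USING iceberg
        AS SELECT * FROM raw_csv
    """)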

Once you run the job, the new table will become part of your AWS Glue Data Catalog.

Governing Your AWS Data Lake

Next, we'll establish table governance using AWS Lake Formation's attribute-based access control (ABAC), where access is based on AWS Identity and Access Management (IAM) role/user attributes and attributes of the resource itself. This involves using "LF-tags" (key/value tags) to implement tag-based access control (TBAC). If a role's tag matches a resource's tag, the tag's rules apply. Benefits include fewer IAM policies and more granular access control to databases, tables, and columns.

To demonstrate, navigate to the AWS Lake Formation dashboard and set up administrators, if needed. Under LF-tags, add a new one. For a domain-based data mesh structure, use a key called “domain” with domain values. Finally, edit your AWS Glue catalog tables’ LF-tags for TBAC.


Figure 4 – AWS Glue Data Catalog tables’ LF-tags for TBAC.

We can now tag IAM users for specific database/table/column access, creating a flexible security system. Our lakehouse setup so far includes Apache Iceberg tables in AWS Glue, Apache Parquet files in Amazon S3, and governance through AWS Lake Formation.
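
The same LF-tag setup can also be scripted. The following is a minimal sketch using boto3; the tag values, database, table name, account ID, and IAM role ARN are hypothetical placeholders.

    import boto3

    lf = boto3.client("lakeformation")

    # Create the "domain" LF-tag with its allowed values
    lf.create_lf_tag(TagKey="domain", TagValues=["marketing", "sales", "finance"])

    # Attach the tag to an Iceberg table registered in the AWS Glue Data Catalog
    lf.add_lf_tags_to_resource(
        Resource={"Table": {"DatabaseName": "sampledb", "Name": "sample_table"}},
        LFTags=[{"TagKey": "domain", "TagValues": ["marketing"]}],
    )

    # Grant SELECT on every resource tagged domain=marketing to an IAM role
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/marketing-analyst"},
        Resource={
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": "domain", "TagValues": ["marketing"]}],
            }
        },
        Permissions=["SELECT"],
    )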

Organizing the Semantic Layer

To set up Dremio, start by creating a Dremio Cloud account. Create a new "Dremio Sonar" project and run the provided AWS CloudFormation template to create the required AWS resources. Once your Sonar project is ready, your dashboard will show only the sample source connected.

Next, connect your data source by clicking "add source." Select AWS Glue and input credentials to add your AWS Glue databases and tables. Although you can query data via Dremio now, you'll likely want to distribute it to multiple users, assigning various access levels and providing appropriate documentation. Dremio can help manage this distribution.

Step 1: Create Spaces

Spaces in Dremio are containers for data organization. We propose a bronze, silver, gold setup. Bronze represents raw data, silver is cleansed and joined data, and gold is data transformed for production. You can create a space for each department, with bronze, silver, and gold folders, housing respective views of tables.

To create a space, click the plus “+” button next to spaces, assign users and roles, and specify access levels. You can add folders in each space. A wiki option is available for documentation, supporting markdown syntax for easy writing.

To generate a view of the raw data for a department, browse to the table in the AWS Glue catalog source and select "Save View as."


Figure 5 – Table view from AWS Glue Data Catalog.

Save the view of the raw data in the bronze folder for our marketing space. Now, those with the right permission levels can work with that space to read and create views from the raw data in bronze for the silver and gold levels.

Step 2: Add Users

Once spaces are established, create users with specific access to Dremio’s data. Controls range from space, folder, and view access to defining access levels (read, write, etc.) and setting row/column access rules using user-defined functions (UDFs).

Add users and roles from the Sonar project settings, assigning them to appropriate spaces, folders, and datasets. Use SQL UDFs for row/column access rules. AWS Lake Formation rules govern data access on the AWS platform, while Dremio's rules control data consumer access through Dremio's platform.

Querying the Data

Dremio enables you to organize, document, and govern data, promoting self-service through its powerful query engine. This engine optimizes performance using Apache Arrow for in-memory processing, Dremio’s columnar cloud cache for performance and cost savings, and data reflections to accelerate queries on multiple views.

You can execute queries via Dremio’s REST API, JDBC/ODBC support, Arrow Flight support, or through the SQL Runner in the web user interface (UI), offering an enhanced SQL integrated development environment (IDE) experience with features like syntax highlighting and dark mode.


Figure 6 – Dremio’s enhanced SQL IDE experience.
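
As one example, the sketch below runs a query over Arrow Flight with pyarrow. The Dremio Cloud endpoint, personal access token, and view path shown here are assumptions to replace with your own values.

    from pyarrow import flight

    TOKEN = "<your-dremio-personal-access-token>"

    # Connect to Dremio's Arrow Flight endpoint and authenticate with a bearer token
    client = flight.FlightClient("grpc+tls://data.dremio.cloud:443")
    options = flight.FlightCallOptions(
        headers=[(b"authorization", f"Bearer {TOKEN}".encode())]
    )

    # Submit the query, then stream the results back as Arrow record batches
    sql = 'SELECT * FROM marketing.bronze."raw_sample_view" LIMIT 10'
    info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
    reader = client.do_get(info.endpoints[0].ticket, options)
    print(reader.read_all().to_pandas())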

Dremio also has seamless integrations with several popular applications to query and visualize the data in your data lakes and relational databases. To boost view performance across multi-source data and transformations, you can use Reflections.

Reflections are a unique acceleration feature in Dremio, similar to indexes in databases. They optimize query performance by organizing and preparing data for the execution engine using techniques like columnarization, compression, and aggregation.

Two types of Reflections exist: Raw, for general query acceleration, and Aggregation, for improving business intelligence-style queries. Invisible to end users, Reflections boost performance without requiring changes to queries or connections. Enabling Reflections on a view or dataset is just a flip of a switch on that dataset's Reflections panel.


Figure 7 – Dremio’s data Reflections.

Conclusion

In this post, we discussed building a scalable data lakehouse with Dremio and Amazon S3 on Apache Iceberg tables. In addition to the lakehouse enabling you to access data across a wide set of tools, it facilitates self-service analytics while having the right controls to satisfy regulatory requirements.

For more information on how to get started using Apache Iceberg for your analytical workloads, refer to Dremio's documentation. You can also learn more about Dremio in AWS Marketplace.



Dremio – AWS Partner Spotlight

Dremio is an AWS Partner whose data lake engine delivers fast query speed and a self-service semantic layer operating directly against Amazon S3 data.

Contact Dremio | Partner Overview | AWS Marketplace