AWS Big Data Blog
Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor
In the first post of this series, we described how AWS Glue for Apache Spark works with Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg tables using the native support for those data lake formats. This native support simplifies reading and writing your data for these data lake frameworks so you can more easily build and maintain your data lakes in a transactionally consistent manner. This feature removes the need to install a separate connector and reduces the configuration steps required to use these frameworks in AWS Glue for Apache Spark jobs.
These data lake frameworks help you store data more efficiently and enable applications to access your data faster. Unlike simpler data file formats such as Apache Parquet, CSV, and JSON, which can store big data, data lake frameworks organize distributed big data files into tabular structures that enable basic constructs of databases on data lakes.
Expanding on the functionality we announced at AWS re:Invent 2022, AWS Glue now natively supports Hudi, Delta Lake, and Iceberg through the AWS Glue Studio visual editor. If you prefer authoring AWS Glue for Apache Spark jobs using a visual tool, you can now choose any of these three data lake frameworks as a source or target through a graphical user interface (GUI) without any custom code.
Even without prior experience using Hudi, Delta Lake, or Iceberg, you can easily achieve typical use cases. In this post, we demonstrate how to ingest data into a Hudi table using the AWS Glue Studio visual editor.
Example scenario
To demonstrate the visual editor experience, this post introduces the Global Historical Climatology Network Daily (GHCN-D) dataset. The data is publicly accessible through an Amazon Simple Storage Service (Amazon S3) bucket. For more information, see the Registry of Open Data on AWS. You can also learn more in Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight.
The Amazon S3 location s3://noaa-ghcn-pds/csv/by_year/ has all the observations from 1763 to the present, organized in CSV files, one file for each year. The following block shows an example of what the records look like:
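The exact sample output is not reproduced here; the rows below are illustrative only. They follow the documented GHCN-D by_year CSV column layout (ID, DATE, ELEMENT, DATA_VALUE, M_FLAG, Q_FLAG, S_FLAG, OBS_TIME), and the data values shown are placeholders rather than actual measurements.
ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
AE000041196,20220101,TAVG,204,H,,S,
AE000041196,20220102,TMAX,250,,,S,
AE000041196,20220102,TAVG,205,H,,S,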
The records have fields including ID, DATE, ELEMENT, and more. Each combination of ID, DATE, and ELEMENT represents a unique record in this dataset. For example, the record with ID as AE000041196, ELEMENT as TAVG, and DATE as 20220101 is unique.
In this tutorial, we assume that the files are updated with new records every day, and we want to store only the latest record per primary key (ID and ELEMENT) to make the latest snapshot data queryable. One typical approach is to INSERT all the historical data and calculate the latest records at query time; however, this introduces additional overhead in every query. When you want to analyze only the latest records, it’s better to do an UPSERT (update and insert) based on the primary key and the DATE field rather than a plain INSERT, in order to avoid duplicates and maintain a single updated row of data.
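To make the trade-off concrete, the following is a sketch (an addition to this walkthrough) contrasting the two approaches in Athena SQL. Here, hudi_native.ghcn_all is a hypothetical append-only table holding every daily record, while hudi_native.ghcn is the upserted Hudi table created later in this post.
-- INSERT-only approach: every query must recompute the latest record per (ID, ELEMENT)
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID, ELEMENT ORDER BY "DATE" DESC) AS rn
  FROM "hudi_native"."ghcn_all"
) t
WHERE rn = 1;
-- UPSERT approach: the Hudi table already holds a single, latest row per (ID, ELEMENT)
SELECT * FROM "hudi_native"."ghcn";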
Prerequisites
To continue this tutorial, you need to create the following AWS resources in advance:
- An AWS Identity and Access Management (IAM) role for your ETL job or notebook as instructed in Set up IAM permissions for AWS Glue Studio
- An S3 bucket for storing data
- An AWS Glue database called hudi_native
Process a Hudi dataset on the AWS Glue Studio visual editor
Let’s author an AWS Glue job to read the daily records for 2022 and write the latest snapshot into a Hudi table in your S3 bucket using UPSERT. Complete the following steps:
- Open AWS Glue Studio.
- Choose Jobs.
- Choose Visual with a source and target.
- For Source and Target, choose Amazon S3, then choose Create.
A new visual job configuration appears. The next step is to configure the data source to read an example dataset:
- Under Visual, choose Data source – S3 bucket.
- Under Node properties, for S3 source type, select S3 location.
- For S3 URL, enter s3://noaa-ghcn-pds/csv/by_year/2022.csv.
The data source is configured.
The next step is to configure the data target to ingest data in Apache Hudi on your S3 bucket:
- Choose Data target – S3 bucket.
- Under Data target properties – S3, for Format, choose Apache Hudi.
- For Hudi Table Name, enter ghcn.
- For Hudi Storage Type, choose Copy on write.
- For Hudi Write Operation, choose Upsert.
- For Hudi Record Key Fields, choose ID.
- For Hudi Precombine Key Field, choose DATE.
- For Compression Type, choose GZIP.
- For S3 Target location, enter s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn/. (Provide your S3 bucket name and prefix.)
To make the sample data easy to discover and queryable from Athena, configure the job to create a table definition in the AWS Glue Data Catalog (a sketch of the corresponding Hudi write options follows these steps):
- For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
- For Database, choose hudi_native.
- For Table name, enter ghcn.
- For Partition keys – optional, choose ELEMENT.
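For reference, the visual settings above correspond roughly to the following Hudi write options. This is a hedged sketch of the equivalent configuration, not the exact script that AWS Glue Studio generates; in particular, Glue Studio may update the Data Catalog through its own catalog integration rather than Hudi Hive sync.
hoodie.table.name = ghcn
hoodie.datasource.write.storage.type = COPY_ON_WRITE
hoodie.datasource.write.operation = upsert
hoodie.datasource.write.recordkey.field = ID
hoodie.datasource.write.precombine.field = DATE
hoodie.datasource.write.partitionpath.field = ELEMENT
hoodie.parquet.compression.codec = gzip
hoodie.datasource.hive_sync.enable = true
hoodie.datasource.hive_sync.database = hudi_native
hoodie.datasource.hive_sync.table = ghcn
hoodie.datasource.hive_sync.partition_fields = ELEMENT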
Your data integration job is now fully authored in the visual editor. Let’s add the one remaining setting, the IAM role, and then run the job:
- Under Job details, for IAM Role, choose your IAM role.
- Choose Save, then choose Run.
- Navigate to the Runs tab to track the job progress and wait for it to complete.
Query the table with Athena
Now that the job has successfully created the Hudi table, you can query the table through different engines, including Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, in addition to AWS Glue for Apache Spark.
To query through Athena, complete the following steps:
- On the Athena console, open the query editor.
- In the query editor, enter the following SQL and choose Run:
SELECT * FROM "hudi_native"."ghcn" limit 10;
The following screenshot shows the query result.
Let’s dive deep into the table to understand how the data is ingested, focusing on the records with ID='AE000041196'.
- Run the following query to focus on these specific example records with ID='AE000041196':
SELECT * FROM "hudi_native"."ghcn" WHERE ID='AE000041196';
The following screenshot shows the query result.
The original source file 2022.csv has historical records for ID='AE000041196' from 20220101 to 20221231; however, the query result shows only four records, one record per ELEMENT, at the latest snapshot of the day (20221230 or 20221231). This is because we used the UPSERT write option when writing data, configuring the ID field as the Hudi record key field, the DATE field as the Hudi precombine field, and the ELEMENT field as the partition key field. When two records have the same key value, Hudi picks the one with the largest value for the precombine field. When the job ingested data, it compared all the values in the DATE field for each pair of ID and ELEMENT, and then picked the record with the largest value in the DATE field.
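As an optional sanity check (an addition to this walkthrough), the following Athena query confirms that exactly one row, carrying the largest DATE value, was retained per ELEMENT for this station; the DATE identifier is quoted to avoid any keyword ambiguity.
SELECT ELEMENT, COUNT(*) AS row_count, MAX("DATE") AS latest_date
FROM "hudi_native"."ghcn"
WHERE ID = 'AE000041196'
GROUP BY ELEMENT;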
These results show that we were able to ingest the latest snapshot from all the 2022 data. Now let’s do an UPSERT of the new 2023 data to overwrite the records in the target Hudi table.
- Go back to the AWS Glue Studio console, modify the source S3 location to s3://noaa-ghcn-pds/csv/by_year/2023.csv, then save and run the job.
- Run the same Athena query from the Athena console.
Now you see that the four records have been updated with the new records in 2023.
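If you want an additional check, the following optional query confirms that the retained rows for this station now carry 2023 dates; the CAST guards against the DATE column having been inferred as a numeric type when the CSV was read.
SELECT ID, ELEMENT, "DATE"
FROM "hudi_native"."ghcn"
WHERE ID = 'AE000041196'
  AND CAST("DATE" AS varchar) LIKE '2023%';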
If more records arrive in the future, this approach continues to work well, upserting new records based on the Hudi record key and Hudi precombine key.
Clean up
Now to the final step, cleaning up the resources:
- Delete the AWS Glue database hudi_native.
- Delete the AWS Glue table ghcn.
- Delete the S3 objects under s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn/.
Conclusion
This post demonstrated how to process Hudi datasets using the AWS Glue Studio visual editor. The visual editor enables you to author jobs that take advantage of data lake formats without requiring deep expertise in them. If you have comments or feedback, please feel free to leave them in the comments.
About the authors
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.
Scott Long is a Front End Engineer on the AWS Glue team. He is responsible for implementing new features in AWS Glue Studio. In his spare time, he enjoys socializing with friends and participating in various outdoor activities.
Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.