Handle UPSERT data operations using open-source Delta Lake and AWS Glue

September 2024: This post was reviewed and updated for accuracy.

Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. Delta Lake framework provides these two capabilities. In this post, we discuss how to handle UPSERTs (updates and inserts) of the operational data using natively integrated Delta Lake with AWS Glue, and query the Delta Lake using Amazon Athena.

We examine a hypothetical insurance organization that issues commercial policies to small- and medium-scale businesses. The insurance prices vary based on several criteria, such as where the business is located, business type, earthquake or flood coverage, and so on. This organization is planning to build a data analytical platform, and the insurance policy data is one of the inputs to this platform. Because the business is growing, hundreds and thousands of new insurance policies are being enrolled and renewed every month. Therefore, all this operational data needs to be sent to Delta Lake in near-real time so that the organization can perform various analytics, and build machine learning (ML) models to serve their customers in a more efficient and cost-effective way.

Solution overview

The data can originate from any source, but typically customers want to bring operational data to data lakes to perform data analytics. One of the solutions is to bring the relational data by using AWS Database Migration Service (AWS DMS). AWS DMS tasks can be configured to copy the full load as well as ongoing changes (CDC). The full load and CDC load can be brought into the raw and curated (Delta Lake) storage layers in the data lake. To keep it simple, in this post we opt out of the data sources and ingestion layer; the assumption is that the data is already copied to the raw bucket in the form of CSV files. An AWS Glue ETL job does the necessary transformation and copies the data to the Delta Lake layer. The Delta Lake layer ensures ACID compliance of the source data.

The following diagram illustrates the solution architecture.

The use case we use in this post is about a commercial insurance company. We use a simple dataset that contains the following columns:

Policy – Policy number, entered as text
Expiry – Date that policy expires
Location – Location type (Urban or Rural)
State – Name of state where property is located
Region – Geographic region where property is located
Insured Value – Property value
Business Type – Business use type for property, such as Farming or Retail
Earthquake – Is earthquake coverage included (Y or N)
Flood – Is flood coverage included (Y or N)

The dataset contains a sample of 100 insurance policies. In the case of a production dataset, it may contain millions of records.

policy_id,expiry_date,location_name,state_code,region_name,insured_value,business_type,earthquake,flood
1000,2026-01-02,Urban,NY,East,1617630,Retail,N,Y
1001,2026-01-02,Urban,NY,East,8678500,Apartment,Y,N
1002,2026-01-02,Rural,WI,Midwest,2052660,Farming,N,N
1003,2026-01-02,Urban,NY,East,17580000,Apartment,Y,Y
1004,2026-01-02,Urban,NY,East,1925000,Hospitality,N,Y
1005,2026-01-04,Urban,IL,Midwest,12934500,Apartment,Y,N
1006,2026-01-05,Urban,WI,Midwest,928300,Office Bldg,N,Y
1007,2026-01-07,Rural,NY,East,2219900,Farming,Y,N
1008,2026-01-07,Urban,NY,East,14100000,Apartment,Y,Y
1009,2026-03-25,Urban,NJ,East,4651680,Apartment,Y,N
1010,2026-03-25,Urban,NY,East,5990067,Apartment,N,Y
1011,2026-03-25,Rural,NY,East,4102500,Farming,N,N
1012,2026-03-25,Urban,NY,East,3400000,Construction,Y,N
1013,2026-03-26,Urban,NY,East,9973900,Apartment,Y,Y
1014,2026-03-27,Urban,NY,East,15480000,Office Bldg,N,Y
1015,2026-03-27,Rural,NY,East,2446600,Farming,N,N
1016,2026-03-27,Urban,VT,Northeast,8861500,Office Bldg,Y,N
1017,2026-03-30,Urban,NH,Northeast,97920,Office Bldg,Y,N
1018,2026-03-30,Urban,NY,East,5150000,Apartment,N,Y
1019,2026-03-31,Rural,WI,Midwest,1451662,Farming,N,Y
1020,2026-03-31,Urban,NJ,East,1761960,Office Bldg,Y,N
1021,2026-03-31,Rural,NY,East,1649105,Farming,N,N
1022,2026-03-31,Urban,NY,East,2329500,Apartment,Y,Y
1023,2026-04-25,Urban,NJ,East,1595500,Office Bldg,N,Y
1024,2026-04-25,Urban,MI,Central,394220,Retail,Y,N
1025,2026-01-02,Urban,NY,East,3456789,Retail,N,Y
1026,2026-01-02,Rural,CA,West,7890123,Farming,Y,N
1027,2026-01-02,Urban,TX,South,2345678,Apartment,N,Y
1028,2026-01-02,Urban,FL,South,9012345,Hospitality,Y,Y
1029,2026-01-02,Rural,OR,West,4567890,Farming,N,N
1030,2026-01-04,Urban,IL,Midwest,7890123,Apartment,Y,N
1031,2026-01-05,Urban,PA,East,2345678,Office Bldg,N,Y
1032,2026-01-07,Rural,WA,West,9012345,Farming,Y,N
1033,2026-01-07,Urban,MA,Northeast,4567890,Apartment,N,Y
1034,2026-03-25,Urban,NJ,East,7890123,Apartment,Y,N
1035,2026-03-25,Rural,CO,West,2345678,Farming,N,Y
1036,2026-03-25,Urban,GA,South,9012345,Apartment,Y,N
1037,2026-03-25,Urban,VA,East,4567890,Construction,N,Y
1038,2026-03-26,Urban,MD,East,7890123,Apartment,Y,Y
1039,2026-03-27,Urban,NC,South,2345678,Office Bldg,N,N
1040,2026-03-27,Rural,MN,Midwest,9012345,Farming,Y,N
1041,2026-03-27,Urban,WI,Midwest,4567890,Office Bldg,N,Y
1042,2026-03-30,Urban,NH,Northeast,7890123,Office Bldg,Y,N
1043,2026-03-30,Urban,CT,Northeast,2345678,Apartment,N,Y
1044,2026-03-31,Rural,IA,Midwest,9012345,Farming,Y,N
1045,2026-03-31,Urban,DE,East,4567890,Office Bldg,N,Y
1046,2026-03-31,Rural,KY,South,7890123,Farming,Y,N
1047,2026-03-31,Urban,OH,Midwest,2345678,Apartment,N,Y
1048,2026-04-25,Urban,MI,Central,9012345,Office Bldg,Y,N
1049,2026-04-25,Urban,IN,Midwest,4567890,Retail,N,Y
1050,2026-01-02,Rural,MT,West,7890123,Farming,Y,N
1051,2026-01-02,Urban,ID,West,2345678,Apartment,N,Y
1052,2026-01-02,Urban,UT,West,9012345,Hospitality,Y,N
1053,2026-01-02,Rural,NM,West,4567890,Farming,N,Y
1054,2026-01-04,Urban,NV,West,7890123,Apartment,Y,N
1055,2026-01-05,Urban,AZ,West,2345678,Office Bldg,N,Y
1056,2026-01-07,Rural,SD,Midwest,9012345,Farming,Y,N
1057,2026-01-07,Urban,ND,Midwest,4567890,Apartment,N,Y
1058,2026-03-25,Urban,NE,Midwest,7890123,Apartment,Y,N
1059,2026-03-25,Rural,KS,Midwest,2345678,Farming,N,Y
1060,2026-03-25,Urban,OK,South,9012345,Apartment,Y,N
1061,2026-03-25,Urban,AR,South,4567890,Construction,N,Y
1062,2026-03-26,Urban,LA,South,7890123,Apartment,Y,N
1063,2026-03-27,Urban,MS,South,2345678,Office Bldg,N,Y
1064,2026-03-27,Rural,AL,South,9012345,Farming,Y,N
1065,2026-03-27,Urban,TN,South,4567890,Office Bldg,N,Y
1066,2026-03-30,Urban,SC,South,7890123,Office Bldg,Y,N
1067,2026-03-30,Urban,NC,South,2345678,Apartment,N,Y
1068,2026-03-31,Rural,WV,East,9012345,Farming,Y,N
1069,2026-03-31,Urban,VA,East,4567890,Office Bldg,N,Y
1070,2026-03-31,Rural,MD,East,7890123,Farming,Y,N
1071,2026-03-31,Urban,DE,East,2345678,Apartment,N,Y
1072,2026-04-25,Urban,PA,East,9012345,Office Bldg,Y,N
1073,2026-04-25,Urban,NJ,East,4567890,Retail,N,Y
1074,2026-01-02,Urban,NY,East,1234567,Retail,Y,N
1075,2026-01-02,Rural,CA,West,7654321,Farming,N,Y
1076,2026-01-02,Urban,TX,South,2345678,Apartment,Y,N
1077,2026-01-02,Urban,FL,South,9876543,Hospitality,N,Y
1078,2026-01-02,Rural,OR,West,4567890,Farming,Y,N
1079,2026-01-04,Urban,IL,Midwest,7654321,Apartment,N,Y
1080,2026-01-05,Urban,PA,East,2345678,Office Bldg,Y,N
1081,2026-01-07,Rural,WA,West,9876543,Farming,N,Y
1082,2026-01-07,Urban,MA,Northeast,4567890,Apartment,Y,N
1083,2026-03-25,Urban,NJ,East,7654321,Apartment,N,Y
1084,2026-03-25,Rural,CO,West,2345678,Farming,Y,N
1085,2026-03-25,Urban,GA,South,9876543,Apartment,N,Y
1086,2026-03-25,Urban,VA,East,4567890,Construction,Y,N
1087,2026-03-26,Urban,MD,East,7654321,Apartment,N,Y
1088,2026-03-27,Urban,NC,South,2345678,Office Bldg,Y,N
1089,2026-03-27,Rural,MN,Midwest,9876543,Farming,N,Y
1090,2026-03-27,Urban,WI,Midwest,4567890,Office Bldg,Y,N
1091,2026-03-30,Urban,NH,Northeast,7654321,Office Bldg,N,Y
1092,2026-03-30,Urban,CT,Northeast,2345678,Apartment,Y,N
1093,2026-03-31,Rural,IA,Midwest,9876543,Farming,N,Y
1094,2026-03-31,Urban,DE,East,4567890,Office Bldg,Y,N
1095,2026-03-31,Rural,KY,South,7654321,Farming,N,Y
1096,2026-03-31,Urban,OH,Midwest,2345678,Apartment,Y,N
1097,2026-04-25,Urban,MI,Central,9876543,Office Bldg,N,Y
1098,2026-04-25,Urban,IN,Midwest,4567890,Retail,Y,N
1099,2026-01-02,Rural,MT,West,7654321,Farming,N,Y

In the following sections, we walk through the steps to perform the Delta Lake UPSERT operations. We use the AWS Management Console to perform all the steps. However, you can also automate these steps using tools like AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), Terraforms, and so on.

Prerequisites

This post is focused towards architects, engineers, developers, and data scientists who build, design, and build analytical solutions on AWS. We expect a basic understanding of the console, AWS Glue, Amazon Simple Storage Service (Amazon S3), and Athena. Additionally, the persona is able to create AWS Identity and Access Management (IAM) policies and roles, create and run AWS Glue jobs and crawlers, and is able work with the Athena query editor.

Set up an S3 bucket for full and CDC load data feeds

To set up your S3 bucket, complete the following steps:

Log in to your AWS account and choose a Region nearest to you.
On the Amazon S3 console, create a new bucket. Make sure the name is unique (for example, delta-lake-cdc-blog-<some random number>).
Create the following folders:
1. $bucket_name/fullload – This folder is used for a one-time full load from the upstream data source
2. $bucket_name/cdcload – This folder is used for copying the upstream data changes
3. $bucket_name/delta – This folder holds the Delta Lake data files
Copy the sample dataset and save it in a file called full-load.csv to your local machine.
Upload the file using the Amazon S3 console into the folder $bucket_name/fullload.

s3 folders

Set up an IAM policy and role

In this section, we create an IAM policy for the S3 bucket access and a role for AWS Glue jobs to run, and also use the same role for querying the Delta Lake using Athena.

On the IAM console, choose Polices in the navigation pane.
Choose Create policy.
Select JSON tab and paste the following policy code. Replace the {bucket_name} you created in the earlier step.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListingOfFolders",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::{bucket_name}"
            ]
        },
        {
            "Sid": "ObjectAccessInBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::{bucket_name}/*"
        }
    ]
}

Name the policy delta-lake-cdc-blog-policy and select Create policy.
On the IAM console, choose Roles in the navigation pane.
Choose Create role.
Select AWS Glue as your trusted entity and choose Next.
Select the policy you just created, and with two additional AWS managed policies:
1. delta-lake-cdc-blog-policy
2. AWSGlueServiceRole
3. CloudWatchFullAccess

Choose Next.
Give the role a name (for example, delta-lake-cdc-blog-role).

IAM role

Set up AWS Glue jobs

In this section, we set up two AWS Glue jobs: one for full load and one for the CDC load. Let’s start with the full load job.

On the AWS Glue console, choose ETL Jobs in the navigation pane. AWS Glue Studio opens in the right side panel.
Select Script editor. In the popup, choose Start fresh and choose Create script.

In the script editor, replace the code with the following code snippet

import sys
from awsglue.utils import getResolvedOptions
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','s3_bucket'])

# Initialize Spark Session with Delta Lake
spark = SparkSession \
.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()

#Define the table schema
schema = StructType() \
      .add("policy_id",IntegerType(),True) \
      .add("expiry_date",DateType(),True) \
      .add("location_name",StringType(),True) \
      .add("state_code",StringType(),True) \
      .add("region_name",StringType(),True) \
      .add("insured_value",IntegerType(),True) \
      .add("business_type",StringType(),True) \
      .add("earthquake_coverage",StringType(),True) \
      .add("flood_coverage",StringType(),True) 

# Read the full load
sdf = spark.read.format("csv").option("header",True).schema(schema).load("s3://"+ args['s3_bucket']+"/fullload/")
sdf.printSchema()

# Write data as DELTA TABLE
sdf.write.format("delta").mode("overwrite").save("s3://"+ args['s3_bucket']+"/delta/insurance/")

Navigate to the Job details tab.
Provide a name for the job (for example, Full-Load-Job).
For IAM Role¸ choose the role delta-lake-cdc-blog-role that you created earlier.
For Worker type¸ choose G 2X.
For Job bookmark, choose Disable.
Set Number of retries to 0.
Under Advanced properties¸ keep the default values.
Under Job parameters:
1. Add the key --s3_bucket with the bucket name you created earlier as the value.
2. Add the key --datalake-formats and give the value delta
Keep the remaining default values and choose Save.

Job details

Now let’s create the CDC load job.

Create a second job called CDC-Load-Job.
Follow the steps on the Job details tab as with the previous job.
Alternatively, you may choose “Clone job” option from the Full-Load-Job, this will carry all the job details from the full load job.

In the script editor, enter the following code snippet for the CDC logic:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import col, expr

# For Delta Lake
from delta.tables import DeltaTable

# @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_bucket'])

# Initialize Spark Session with Delta Lake
spark = SparkSession \
    .builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Read the CDC load
cdc_df = spark.read.csv("s3://" + args['s3_bucket'] + "/cdcload", header=True)
cdc_df.show(5, True)

# Read the full load (latest data) as Delta table
delta_table_path = "s3://" + args['s3_bucket'] + "/delta/insurance/"
delta_table = DeltaTable.forPath(spark, delta_table_path)

# Separate CDC data into inserts, updates, and deletes
inserts_updates_df = cdc_df.filter(col("op_flag").isin("I", "U"))
deletes_df = cdc_df.filter(col("op_flag") == "D")

# UPSERT process
delta_table.alias("prev_df").merge(
    source=inserts_updates_df.alias("cdc_df"),
    condition=expr("prev_df.policy_id = cdc_df.policy_id")
).whenMatchedUpdate(
    condition=col("cdc_df.op_flag") == "U",
    set={
        "expiry_date": col("cdc_df.expiry_date"),
        "location_name": col("cdc_df.location_name"),
        "state_code": col("cdc_df.state_code"),
        "region_name": col("cdc_df.region_name"),
        "insured_value": col("cdc_df.insured_value"),
        "business_type": col("cdc_df.business_type"),
        "earthquake_coverage": col("cdc_df.earthquake"),
        "flood_coverage": col("cdc_df.flood")
    }
).whenNotMatchedInsert(
    condition=col("cdc_df.op_flag") == "I",
    values={
        "policy_id": col("cdc_df.policy_id"),
        "expiry_date": col("cdc_df.expiry_date"),
        "location_name": col("cdc_df.location_name"),
        "state_code": col("cdc_df.state_code"),
        "region_name": col("cdc_df.region_name"),
        "insured_value": col("cdc_df.insured_value"),
        "business_type": col("cdc_df.business_type"),
        "earthquake_coverage": col("cdc_df.earthquake"),
        "flood_coverage": col("cdc_df.flood")
    }
).execute()

# Handle delete operations
for row in deletes_df.collect():
    delta_table.delete(condition=expr(f"policy_id = '{row['policy_id']}'"))

print("CDC processing completed.")

Run the full load job

On the AWS Glue console, open full-load-job and choose Run. The job takes about 2 minutes to complete, and the job run status changes to Succeeded. Go to $bucket_name and open the delta folder, which contains the insurance folder. You can note the Delta Lake files in it. Delta location on S3

Create delta table using Amazon Athena

Amazon Athena now supports querying Delta Lake tables directly, offering improved performance and seamless integration with your data lake architecture. To create a Delta Lake table in Athena, you can use a simplified CREATE EXTERNAL TABLE statement that specifies only the table location and the Delta Lake table type. Athena will automatically infer the schema and other metadata from the Delta Lake transaction log, eliminating the need for manual schema definition.

On the Athena console, open the query editor.

Run the following query to create insurance_policies delta table:

CREATE EXTERNAL TABLE insurance_policies
LOCATION 's3://<<replace-with-your-bucket-name>>/delta/insurance/'
TBLPROPERTIES (
  'table_type'='DELTA'
);

This statement creates an external table named insurance_policies that points to a Delta Lake dataset stored in the specified S3 location. The table_type property is set to DELTA to indicate that this is a Delta Lake table. Once created, you can query this table using standard SQL syntax in Athena, taking advantage of Delta Lake’s performance optimizations and ACID transaction support.

Query the delta table using Athena query editor

In this section, we query the delta_insurance table using Athena. Note that if you’re using Athena for the first time, set the query output folder to store the Athena query results (for example, s3://<your-s3-bucket>/query-output/).

On the Athena console, open the query editor.
Keep the default selections for Data source and Database.
Run the query SELECT * FROM delta_insurance;. This query returns a total of 25 rows, the same as what was in the full load data feed.
For the CDC comparison, run the following query and store the results in a location where you can compare these results later:

SELECT * FROM delta_insurance
WHERE policy_id IN (1101,1000,1001,1102,1003,1103,1005,1104,1007,1105) 
order by policy_id;

The following screenshot shows the Athena query result.

Upload the CDC data feed and run the CDC job

In this section, we update three insurance policies and insert two new policies.

Copy the following insurance policy data and save it locally as cdc-load.csv:

op_flag,policy_id,expiry_date,location_name,state_code,region_name,insured_value,business_type,earthquake,flood
I,1101,2026-01-02,Urban,NY,East,2500000,Retail,N,N
U,1000,2026-01-03,Urban,NY,East,1617630,Retail,N,Y
D,1001,2026-01-02,Urban,NY,East,8678500,Apartment,Y,Y
I,1102,2026-01-02,Rural,WI,Midwest,3000000,Farming,N,N
U,1003,2026-01-03,Urban,NY,East,17580000,Apartment,Y,N
I,1103,2026-01-02,Urban,NY,East,2200000,Hospitality,N,Y
D,1005,2026-01-04,Urban,IL,Midwest,12934500,Apartment,Y,Y
I,1104,2026-01-05,Urban,WI,Midwest,1100000,Office Bldg,N,Y
U,1007,2026-01-08,Rural,NY,East,2219900,Farming,N,Y
I,1105,2026-01-07,Urban,NY,East,16000000,Apartment,N,N

The first column in the CDC feed describes the UPSERT operations. U is for updating an existing record, and I is for inserting a new record and D is for deleting a record. In this CDC feed, there are five new policies, three updates to existing policies and two deletes.

Upload the cdc-load.csv file to the $bucket_name/cdcload/ folder.
On the AWS Glue console, run CDC-Load-Job. This job takes care of updating the Delta Lake accordingly.

The change details are as follows:

1101,1102,1103,1104 and 110002 – New policies added to the table
1000,1003 and 1007 – These policies are updated.
1001 and 1005 policies are removed.

Run the query again:

SELECT * FROM delta_insurance
WHERE policy_id IN (1101,1000,1001,1102,1003,1103,1005,1104,1007,1105)
order by policy_id;

As shown in the following screenshot, the changes in the CDC data feed are reflected in the Athena query results.

Advanced Analytics on Insurance Policies

To gain deeper insights into our insurance portfolio and better understand our risk exposure, we’ll perform some advanced analytics using Amazon Athena. These analyses will help us make data-driven decisions and develop more targeted strategies for risk management and business growth. We’ll focus on two key areas:

1. Risk Exposure Analysis by Region

Understanding our risk exposure across different regions is crucial for effective risk management and pricing strategies. This analysis aggregates the total insured value for each region, breaking it down by earthquake and flood risk. By examining these metrics, we can:

Identify regions with high concentration of insured value
Assess our exposure to specific natural disasters in different areas
Adjust our underwriting policies or reinsurance strategies based on regional risk profiles

Run the following query in Athena query editor and observe the results.

SELECT
  region_name,
  CONCAT('$',
    REGEXP_REPLACE(
      CAST(SUM(insured_value) AS VARCHAR),
      '(\d)(?=(\d{3})+(?!\d))', '$1,'
    )
  ) AS total_insured_value,
  CONCAT('$',
    REGEXP_REPLACE(
      CAST(SUM(CASE WHEN earthquake_coverage = 'Y' THEN insured_value ELSE 0 END) AS VARCHAR),
      '(\d)(?=(\d{3})+(?!\d))', '$1,'
    )
  ) AS earthquake_risk_value,
  CONCAT('$',
    REGEXP_REPLACE(
      CAST(SUM(CASE WHEN flood_coverage = 'Y' THEN insured_value ELSE 0 END) AS VARCHAR),
      '(\d)(?=(\d{3})+(?!\d))', '$1,'
    )
  ) AS flood_risk_value
FROM insurance_policies
GROUP BY region_name;

The results look as follows:

2. Business Type Distribution and Average Policy Value

Analyzing the distribution of policies across different business types and their average insured values provides valuable insights into our customer base and potential market opportunities. This analysis will help us:

Understand which business types are most prevalent in our portfolio
Identify sectors with higher average policy values
Tailor our marketing and product development efforts to target specific business segments
Assess the balance of our portfolio across various industries

These advanced analytics queries demonstrate the power of Athena to extract meaningful insights from our Delta Lake tables. By leveraging these insights, we can make more informed decisions about risk management, pricing, and business strategy.

Run the following query in Athena query editor and observe the results.

SELECT
 business_type,
 COUNT(*) AS policy_count,
 CONCAT('$',
     REGEXP_REPLACE(
      CAST(ROUND(AVG(insured_value), 2) AS VARCHAR),
      '(\d)(?=(\d{3})+(?!\d))', '$1,'
     )
 ) AS avg_insured_value
FROM insurance_policies
GROUP BY business_type
ORDER BY policy_count DESC;

The results look as follows:

Clean up

In this solution, we used all managed services, and there is no cost if AWS Glue jobs aren’t running. However, if you want to clean up the tasks, you can delete the two AWS Glue jobs, AWS Glue table, and S3 bucket.

Conclusion

Organizations are continuously looking at high performance, cost-effective, and scalable analytical solutions to extract the value of their operational data sources in near-real time. The analytical platform should be ready to receive changes in the operational data as soon as they occur. Typical data lake solutions face challenges to handle the changes in source data; the Delta Lake framework can close this gap. This post demonstrated how to build data lakes for UPSERT operations using AWS Glue and native Delta Lake tables, and how to query AWS Glue tables from Athena. You can implement your large scale UPSERT data operations using AWS Glue, Delta Lake and perform analytics using Amazon Athena.

References

About the Authors

Praveen Allam is a Solutions Architect at Amazon Web Services (AWS), specializing in designing scalable, high-performance, and cost-effective enterprise-grade applications. With a passion for building innovative AWS solutions, he focuses on serverless architecture, analytics, and generative AI, helping organizations harness the power of data-driven decision-making.

Vivek Singh is Senior Solutions Architect with the AWS Data Lab team. He helps customers unblock their data journey on the AWS ecosystem. His interest areas are data pipeline automation, data quality and data governance, data lakes, and lake house architectures.

Audit History

Last reviewed and updated in September 2024 by Praveen Allam | Sr. Solutions Architect

AWS Big Data Blog