Upsert into Amazon Redshift using AWS Glue and SneaQL

This is a guest post by Jeremy Winters and Ritu Mishra, Solution Architects at Full 360. In their own words, “Full 360 is a cloud first, cloud native integrator, and true believers in the cloud since inception in 2007, our focus has been on helping customers with their journey into the cloud. Our practice areas – Big Data and Warehousing, Application Modernization, and Cloud Ops/Strategy – represent deep, but focused expertise.”

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. As a company who has been building data warehouse solutions in the cloud for 10 years, we at Full 360 were interested to see how we can leverage AWS Glue in customer solutions. This post details our experience and lessons learned from using AWS Glue for an Amazon Redshift data integration use case.

UI-based ETL Tools

We have been anticipating the release of AWS Glue since it was announced at re:Invent 2016. Many of our customers are looking for an easy to use, UI-based tooling to manage their data transformation pipeline. However in our experience, the complexity of any production pipeline tends to be difficult to unwind, regardless of the technology used to create them. At Full 360, we build cloud-native, script-based applications deployed in containers to handle data integration. We think script-based transformation provides a balance of robustness and flexibility necessary to handle any data problem that comes our way.

AWS Glue caters both to developers who prefer writing scripts and those who want UI-based interactions. It is possible to initially create jobs using the UI, by selecting data source and target. Under the hood, AWS Glue auto-generates the Python code for you, which can be edited if needed, though this isn’t necessary for the majority of use cases.

Of course, you don’t have to rely on the UI at all. You can simply write your own Python scripts, store them in Amazon S3 with any dependent libraries, and import them into AWS Glue. AWS Glue also supports IDE integration with tools such as PyCharm, and interestingly enough, Zeppelin notebooks! These integrations are targeted toward developers who prefer writing Python themselves and want a cleaner dev/test cycle.

Developers who are already in the business of scripting ETL will be excited by the ability to easily deploy Python scripts with AWS Glue, using existing workflows for source control and CI/CD, and have them deployed and executed in a fully managed manner by AWS. The UI experience of AWS Glue works well, but it is good to know that the tool accommodates those who prefer traditional coding. You can also dig into complex edge cases for data handling where the UI doesn’t cut it.

Serverless!

When AWS Lambda was released, we were excited for its potential to host our ETL processes. With Lambda, you are limited to the five-minute maximum for function execution. We resorted to running Docker containers and Amazon ECS to orchestrate many of our customers long running tasks. With this approach, we are still required to manage the underlying infrastructure.

After a closer look at AWS Glue, we realize that it is a full serverless PySpark runtime, accompanied by an Apache Hive metastore compatible catalog-as-a-service. This means that you are not just running a script on a single core, but instead you have the full power of a multi-worker Spark environment available. If you’re not familiar with the Spark framework, it introduces a new paradigm that allows for the processing of distributed, in-memory datasets. Spark has many uses, from data transformation to machine learning.

In AWS Glue, you use PySpark dataframes to transform data before reaching your database. Dataframes manage data in a way similar to relational databases, so the methods are likely to be familiar to most SQL users. Additionally, you can use SQL in your PySpark code to manipulate the data.

AWS Glue also simplifies the management of the runtime environment by providing you with a DPU setting, which allows you to dial up or down the amount of compute resources used to run your job. One DPU is equivalent to 4 vCPU, 16 GB Mem.

Common use case

We can see that most customers would leverage AWS Glue to load one or many files from S3 into Amazon Redshift. To accelerate this process, you can use the crawler, an AWS console-based utility, to discover the schema of your data and store it in the AWS Glue Data Catalog, whether your data sits in a file or a database. We were able to discover the schemas of our source file and target table, then have AWS Glue construct and execute the ETL job. It worked! We successfully loaded the beer drinkers’ dataset, JSON files used in our advanced tuning class, into Amazon Redshift!

Our use case

For our use case, we integrated AWS Glue and Amazon Redshift Spectrum with an open-source project that we initiated called SneaQL. SneaQL is an open source, containerized framework for sneaking interactivity into static SQL. SneaQL provides an extension to ANSI SQL with command tags to provide functionality such as loops, variables, conditional execution, and error handling to your scripts. It allows you to manage your scripts in an AWS CodeCommit Git repo, which then gets deployed and executed in a container.

We use SneaQL for complex data integrations, usually with an ELT model, where the data is loaded into the database, then transformed into fact tables using parameterized SQL. SneaQL enables advanced use cases like partial upsert aggregation of data, where multiple data sources can merge into the same fact table.

We think AWS Glue, Redshift Spectrum, and SneaQL offer a compelling way to build a data lake in S3, with all of your metadata accessible through a variety of tools such as Hive, Presto, Spark, and Redshift Spectrum). Build your aggregation table in Amazon Redshift to drive your dashboards or other high-performance analytics.

In the video below, you see a demonstration of using AWS Glue to convert JSON documents into Parquet for partitioned storage in an S3 data lake. We then access the data from S3 into Amazon Redshift by way of Redshift Spectrum. Nearing the end of the AWS Glue job, we then call AWS boto3 to trigger an Amazon ECS SneaQL task to perform an upsert of the data into our fact table. All the sample artifacts needed for this demonstration are available in the Full360/Sneaql Github repository.

In this scenario, AWS Glue manipulates data outside of a data warehouse and loads it to Amazon Redshift, and SneaQL manipulates data inside Amazon Redshift.

We think that they are a good compliment to each other when doing complex integrations.

Our pipeline works as follows:

New files arrive in S3 in JSON format.
AWS Glue converts the JSON files in Parquet format, stored in another S3 bucket.
At the end of the AWS Glue script, the AWS SDK for Python (Boto) is used to trigger the Amazon ECS task that runs SneaQL.
SneaQL container pulls the secrets file from S3 then decrypts it with biscuit or AWS KMS.
SneaQL clones the appropriate branch from an AWS CodeCommit Git repo.
Data in Amazon Redshift is transformed by SneaQL statements stored in the repo.

The AWS Glue team also supports conditional triggers that would allow us to split the jobs, and make one dependent upon the other.

Final thoughts

Although, as a general rule, we believe in managing schema definitions in source control, we definitely think that the crawler feature has some attractive perks. Besides being handy for ad-hoc jobs, it is also partition-aware. This means that you can perform data movements without worrying about which data has been processed already, which is one less thing for you to manage. We’re interested to see the design patterns that emerge around the use of the crawler.

The freedom of PySpark also allows us to implement advanced use cases. While the boto3 library is available to all AWS Glue jobs, it is also possible to include any libraries and additional JAR files along with your Python script.

We think many customers will find AWS Glue valuable. Those who prefer to code their data processing pipeline will be quick to realize how powerful it is, especially for solving complex data cleansing problems. The learning curve for PySpark is real, so we suggest looking into dataframes as a good entry point. In the long run, the fact that AWS Glue is serverless makes the effort more than worth it.

If you have questions or suggestions, please comment below.