AWS Big Data Blog

Simplify data integration pipeline development using AWS Glue custom blueprints

Organizations spend significant time developing and maintaining data integration pipelines that hydrate data warehouses, data lakes, and lake houses. As data volume increases, data engineering teams struggle to keep up with new requests from business teams. Although these requests may come from different teams, they’re often similar, such as ingesting raw data from a source system into a data lake, partitioning data based on a certain key, writing data from data lakes to a relational database, or assigning default values for empty attributes. To keep up with these requests, data engineers modify pipelines in a development environment, then test and deploy them to a production environment. This redundant code creation process is error-prone and time-consuming.

Data engineers need a way to enable non-data engineers like business analysts, data analysts, and data scientists to operate using self-service methods by abstracting the complexity of pipeline development. In this post, we discuss AWS Glue custom blueprints, which offer a framework for you to build and share reusable AWS Glue workflows.

Introducing AWS Glue custom blueprints

AWS Glue is a serverless data integration service that allows data engineers to develop complex data integration pipelines. In AWS Glue, you can use workflows to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers.

With AWS Glue custom blueprints, data engineers can create a blueprint that abstracts away complex transformations and technical details. Non-data engineers can then use blueprints through a user interface to ingest, transform, and load data instead of waiting for data engineers to develop new pipelines. These users can also take advantage of blueprints developed outside their organization; for example, AWS has developed sample blueprints to transform data.

The following diagram illustrates our architecture with AWS Glue custom blueprints.

The workflow includes the following steps:

  1. The data engineer identifies common data integration patterns and creates a blueprint.
  2. The data engineer shares the blueprint via the source control tool or Amazon Simple Storage Service (Amazon S3).
  3. Non-data engineers can easily register and use the blueprint via a user interface where they provide input.
  4. The blueprint uses these parameters to generate an AWS Glue workflow. You can simply run these workflows to ingest and transform data.

Develop a custom blueprint

Data ingested in a data lake is partitioned in a certain way. Sometimes, data analysts, business analysts, data scientists, and data engineers need to partition data differently based on their query pattern. For instance, a data scientist may want to partition the data by timestamp, whereas a data analyst may want to partition data based on location. The data engineer can create an AWS Glue job that accepts parameters and partitions the data based on these parameters. Then they can package the job as a blueprint to share with other users, who provide the parameters and generate an AWS Glue workflow. Here, we create a blueprint to solve this use case.

To create a custom blueprint, a data engineer has to create three components: a configuration file, a layout file, and AWS Glue job scripts, along with any additional libraries required to create the resources specified in the layout file.

AWS Glue job scripts usually perform the data transformation. In this example, the data engineer creates the job script partitioning.py, which accepts parameters such as the source S3 location, partition keys, partitioned table name, and target S3 location. The job reads data in the source S3 location, writes partitioned data to the target S3 location, and catalogs the partitioned table in the AWS Glue Data Catalog.

The configuration file is a JSON-based file in which the data engineer defines the list of inputs needed to generate the workflow. In this example, the data engineer creates blueprint.cfg, outlining all the inputs needed, such as input data location, partitioned table name, and output data location. AWS Glue uses this file to create a user interface for users to provide values when creating their workflow. The following figure shows how parameters from the configuration file are translated to the user interface.
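To make the shape of such a file concrete, the following Python sketch renders a small blueprint.cfg as JSON. The top-level keys (layoutGenerator, parameterSpec) and the per-parameter fields (type, collection, description, defaultValue) reflect our reading of the AWS Glue blueprint documentation. The module path partitioning.layout.generate_layout, the parameter names, and the descriptions are assumptions for illustration, so check them against the sample repository before relying on them.

```python
import json

# Hypothetical parameter specification for the partitioning example.
# Key names follow the blueprint configuration format as we understand it;
# the parameter names, types, and defaults are illustrative assumptions.
BLUEPRINT_CONFIG = {
    # Assumed "package.module.function" path to the layout generator.
    "layoutGenerator": "partitioning.layout.generate_layout",
    "parameterSpec": {
        "WorkflowName": {
            "type": "String",
            "collection": False,
            "description": "Name of the workflow to create.",
        },
        "InputDataLocation": {
            "type": "S3Uri",
            "collection": False,
            "description": "S3 location of the source data.",
        },
        "DestinationTableName": {
            "type": "String",
            "collection": False,
            "description": "Name of the partitioned table.",
        },
        "OutputDataLocation": {
            "type": "S3Uri",
            "collection": False,
            "description": "S3 location for the partitioned output.",
        },
        "NumberOfWorkers": {
            "type": "Integer",
            "collection": False,
            "description": "Number of workers for the job.",
            "defaultValue": 5,
        },
    },
}


def render_config(config=BLUEPRINT_CONFIG):
    """Return the configuration as a JSON string ready to save as blueprint.cfg."""
    return json.dumps(config, indent=4)
```

Saving the string returned by render_config() as blueprint.cfg gives one entry per input field that the user interface would prompt for.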

The layout file is a Python file that uses the user inputs to create the following:

  • Prerequisite objects such as Data Catalog databases and S3 locations to store ETL scripts or use them as intermediate data locations
  • The AWS Glue workflow

In this example, the developer creates a layout.py file that generates the workflow based on the parameters provided by the user. The layout file includes code that performs the following functions:

  • Creates the AWS Glue database based on inputs provided by the user
  • Creates the AWS Glue script S3 bucket and uploads partitioning.py
  • Creates temporary S3 locations for processing
  • Creates the workflow that first runs the crawler and then the job
  • Based on the input parameters, sets up the workflow schedule
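The ordering logic in the list above can be sketched without any AWS dependency. The following Python example is a library-free stand-in, not the actual layout.py: a real layout file would build its result from the workflow, job, and crawler classes provided by the AWS Glue blueprint libraries. The Step and WorkflowPlan classes, the generate_plan function, and the ScheduleCron parameter are all hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """Hypothetical stand-in for a crawler or job entity in a workflow."""
    name: str
    kind: str  # "crawler" or "job"
    depends_on: List[str] = field(default_factory=list)
    arguments: Dict[str, str] = field(default_factory=dict)


@dataclass
class WorkflowPlan:
    """Hypothetical stand-in for the workflow a layout function returns."""
    name: str
    steps: List[Step]
    schedule: Optional[str] = None  # cron expression, or None for on-demand


def generate_plan(user_params: dict) -> WorkflowPlan:
    """Build a plan in which the crawler runs first and the job runs after it."""
    name = user_params["WorkflowName"]
    crawler = Step(
        name=f"{name}_crawler",
        kind="crawler",
        arguments={"path": user_params["InputDataLocation"]},
    )
    job = Step(
        name=f"{name}_job",
        kind="job",
        depends_on=[crawler.name],  # the job waits for the crawler
        arguments={"output": user_params["OutputDataLocation"]},
    )
    # ScheduleCron is an assumed, optional parameter; without it the
    # workflow is left to be started on demand.
    return WorkflowPlan(
        name=name,
        steps=[crawler, job],
        schedule=user_params.get("ScheduleCron"),
    )
```

The point of the sketch is the shape of the function: user inputs go in, and a description of dependent workflow entities comes out, which is what lets a single blueprint generate many differently parameterized workflows.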

Package a custom blueprint

After you develop a blueprint, you need to package it as a .zip file, which others can use to register the blueprint. It should contain the following files:

  • Configuration file
  • Layout file
  • AWS Glue job scripts and additional libraries as required

You can share the blueprint with others using your choice of source control repository or file storage.
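If you prefer to script the packaging step, a minimal sketch using Python's standard zipfile module might look like the following. The function name package_blueprint is hypothetical, and the sketch assumes all blueprint files live in a single directory whose name should be kept as the top-level folder inside the archive, and that the .zip file is written outside that directory.

```python
import zipfile
from pathlib import Path


def package_blueprint(source_dir, zip_path):
    """Zip every file under source_dir, keeping the directory name as a prefix.

    Returns the list of archive member names that were written.
    """
    source = Path(source_dir).resolve()
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(source.rglob("*")):
            if file.is_file():
                # For example, partitioning/layout.py
                arcname = (Path(source.name) / file.relative_to(source)).as_posix()
                archive.write(file, arcname)
                names.append(arcname)
    return names
```

Calling package_blueprint("partitioning", "partitioning.zip") would collect the configuration file, the layout file, and the job scripts from a partitioning directory into one archive.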

Register the blueprint

To register a blueprint on the AWS Glue console, complete the following steps:

  1. Upload the .zip file to Amazon S3.
  2. On the AWS Glue console, choose Blueprints.
  3. Choose Add blueprint.
  4. Enter the following information:
    1. Blueprint name
    2. Location of .zip archive
    3. Optional description
  5. Choose Add blueprint.

When the blueprint is successfully registered, its status turns to ACTIVE.
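Registration can also be scripted. The following sketch assumes a boto3 AWS Glue client is passed in and uses its create_blueprint and get_blueprint operations; the helper name register_blueprint, the polling interval, and the retry limit are our own assumptions rather than part of any AWS sample.

```python
import time


def register_blueprint(glue, name, zip_location, description=None,
                       poll_seconds=5, max_polls=60):
    """Register a blueprint and wait until it leaves the CREATING status.

    glue is expected to be a boto3 Glue client, for example
    boto3.client("glue"). Returns the final status string, such as
    "ACTIVE" or "FAILED".
    """
    request = {"Name": name, "BlueprintLocation": zip_location}
    if description:
        request["Description"] = description
    glue.create_blueprint(**request)

    for _ in range(max_polls):
        status = glue.get_blueprint(Name=name)["Blueprint"]["Status"]
        if status != "CREATING":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Blueprint {name} is still being created")


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   status = register_blueprint(boto3.client("glue"), "my-blueprint",
#                               "s3://path/to/blueprint/my-blueprint.zip")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.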

You’re now ready to use the blueprint to create an AWS Glue workflow.

Use a blueprint

We have developed a few blueprints to get you started. To use them, you have to download them, create .zip files, and register them as described in the previous section.

Custom Blueprint Name | Description
Crawl S3 locations | Crawl S3 locations and create tables in the AWS Glue Data Catalog
Convert data format to Parquet | Convert S3 files in various formats to Parquet format using Snappy compression
Partition Data | Partition files based on user inputs to optimize data layout in Amazon S3
Copy data to DynamoDB | Copy data from Amazon S3 to Amazon DynamoDB
Compaction | Compact input files into larger files to improve query performance
Encoding | Convert encoding in S3 files

In this post, we show how a data analyst can easily use the Partition Data blueprint. Partitioning data improves query performance by organizing data into parts based on column values such as date, country, or region. This helps restrict the amount of data scanned by a query when filters are specified on the partition columns. You may want to partition data differently, such as by timestamp or other attributes. With the partitioning blueprint, data analysts can easily partition the data without deep knowledge of data engineering.

For this post, we use the Daily Global & U.S. COVID-19 Cases & Testing Data (Enigma Aggregation) dataset available in the AWS COVID-19 data lake as our source. This data contains US and global cases, deaths, and testing data related to COVID-19, organized by country.

The dataset includes two JSON files, and as of this writing the total data size is 215.7 MB. This data is not partitioned and not optimized for best query performance. It’s common to query this kind of historical data by specifying a date range condition in the WHERE clause. To minimize data scan size and achieve optimal performance, we partition this data using the date time field.

You can partition the datasets via nested partitioning or flat partitioning:

  • Flat partitioning – path_to_data/dt=20200918/
  • Nested partitioning – path_to_data/year=2020/month=9/day=18/

In this example, the input data contains the date field, and its value is formatted as 2020-09-18 (YYYY-MM-DD). For flat partitioning, you can simply specify the date field as the partitioning key. Nested partitioning is trickier: the developer needs to extract the year, month, and day from the date, which is hard for non-data engineers to code. This blueprint abstracts this complexity and can generate nested fields (such as year, month, and day) at any granularity from a date time field.
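To illustrate the extraction step that the blueprint hides, here is a minimal Python sketch that derives both partition layouts from a date string. It is not the blueprint's own code: the function names, the list of supported granularities, and the default date format are assumptions made for this example.

```python
from datetime import datetime

# Assumed set of granularities, ordered from coarsest to finest.
GRANULARITIES = ["year", "month", "day", "hour", "minute"]


def nested_partition_path(value, granularity="day", fmt="%Y-%m-%d"):
    """Return a nested partition path such as year=2020/month=9/day=18/."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"Unsupported granularity: {granularity}")
    timestamp = datetime.strptime(value, fmt)
    # Keep every field from year down to the requested granularity.
    fields = GRANULARITIES[: GRANULARITIES.index(granularity) + 1]
    return "".join(f"{name}={getattr(timestamp, name)}/" for name in fields)


def flat_partition_path(value, fmt="%Y-%m-%d"):
    """Return a flat partition path such as dt=20200918/."""
    return f"dt={datetime.strptime(value, fmt):%Y%m%d}/"
```

For the date 2020-09-18, nested_partition_path with day granularity yields year=2020/month=9/day=18/, and flat_partition_path yields dt=20200918/, matching the two layouts shown earlier.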

To use this blueprint, complete the following steps:

  1. Download the files from GitHub with the following code:
    $ git clone https://github.com/awslabs/aws-glue-blueprint-libs.git
    $ cd aws-glue-blueprint-libs/samples/
  2. Compress the blueprint files into a .zip file:
    $ zip partitioning.zip partitioning/*
  3. Upload the .zip file to your S3 bucket:
    $ aws s3 cp partitioning.zip s3://path/to/blueprint/
  4. On the AWS Glue console, choose Blueprints.
  5. Choose Add blueprint.
  6. For Blueprint name, enter partitioning-tutorial.
  7. For ZIP archive location (S3), enter s3://path/to/blueprint/partitioning.zip.
  8. Wait for the blueprint status to show as ACTIVE.
  9. Select your partitioning-tutorial blueprint, and on the Actions menu, choose Create workflow.
  10. Specify the following parameters:
    1. WorkflowName – partitioning
    2. IAMRole – The AWS Identity and Access Management (IAM) role to run the AWS Glue job and crawler
    3. InputDataLocation – s3://covid19-lake/enigma-aggregation/json/global/
    4. DestinationDatabaseName – blueprint_tutorial
    5. DestinationTableName – partitioning_tutorial
    6. OutputDataLocation – s3://path/to/output/data/location/
    7. PartitionKeys – (blank)
    8. TimestampColumnName – date
    9. TimestampColumnGranularity – day
    10. NumberOfWorkers – 5 (the default value)
    11. IAM role – The role that AWS Glue assumes to create the workflow, crawlers, jobs, triggers, and any other resources defined in the layout script. For a suggested policy for the role, see Permissions for Blueprint Roles.
  11. Choose Submit.
  12. Wait for the blueprint run status to change to SUCCEEDED.
  13. In the navigation pane, choose Workflows.
  14. Select partitioning and on the Actions menu, choose Run.
  15. Wait for the workflow run status to show as Completed.
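The console steps for creating the workflow can also be approximated in code. The following sketch assumes a boto3 AWS Glue client and uses its start_blueprint_run and get_blueprint_run operations. The helper name run_blueprint, the polling settings, and the IAMRole placeholder value are assumptions; the remaining parameter values are the ones listed in the steps above.

```python
import json
import time

# Parameter values from the steps above. The IAMRole value is a
# placeholder that you would replace with your own role name.
TUTORIAL_PARAMETERS = {
    "WorkflowName": "partitioning",
    "IAMRole": "<your-glue-role-name>",
    "InputDataLocation": "s3://covid19-lake/enigma-aggregation/json/global/",
    "DestinationDatabaseName": "blueprint_tutorial",
    "DestinationTableName": "partitioning_tutorial",
    "OutputDataLocation": "s3://path/to/output/data/location/",
    "PartitionKeys": "",
    "TimestampColumnName": "date",
    "TimestampColumnGranularity": "day",
    "NumberOfWorkers": 5,
}


def run_blueprint(glue, blueprint_name, parameters, role_arn,
                  poll_seconds=5, max_polls=120):
    """Start a blueprint run and wait until it is no longer RUNNING.

    glue is expected to be a boto3 Glue client. role_arn is the role that
    AWS Glue assumes to create the workflow and its resources. Returns the
    final run state, such as "SUCCEEDED" or "FAILED".
    """
    run_id = glue.start_blueprint_run(
        BlueprintName=blueprint_name,
        Parameters=json.dumps(parameters),  # the API takes a JSON string
        RoleArn=role_arn,
    )["RunId"]

    for _ in range(max_polls):
        run = glue.get_blueprint_run(BlueprintName=blueprint_name, RunId=run_id)
        state = run["BlueprintRun"]["State"]
        if state != "RUNNING":
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Blueprint run {run_id} did not finish in time")
```

Once the run reports SUCCEEDED, the generated workflow can be started from the console as described above, or with the client's start_workflow_run operation.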

You can navigate to the output location on the Amazon S3 console to confirm that the Parquet files have been written under partitioned folders of the form year=yyyy/month=MM/day=dd/.

The blueprint registers two tables:

  • source_partitioning_tutorial – The non-partitioned table that is generated by the AWS Glue crawler as a data source
  • partitioning_tutorial – The new partitioned table in the AWS Glue Data Catalog

You can access both tables using Amazon Athena. Let’s compare the data scan size for both tables to see the benefit of partitioning.

First, run the following query against the non-partitioned source table:

SELECT * FROM "blueprint_tutorial"."source_partitioning_tutorial"
WHERE date='2020-09-18'

The following screenshot shows the query results.

Then, run the same query against the partitioned table:

SELECT * FROM "blueprint_tutorial"."partitioning_tutorial"
WHERE year=2020 AND month=9 AND day=18

The following screenshot shows the query results.

The query on the non-partitioned table scanned 215.67 MB of data. The query on the partitioned table scanned 126.42 KB, which is roughly 1,700 times less data. Because Athena bills by the amount of data scanned, this reduction also lowers query costs.
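As a quick check of that ratio, the following Python snippet divides the two scan sizes. The function name is hypothetical, and because the Athena console does not say here whether it reports decimal or binary units, the sketch computes the factor both ways.

```python
def scan_reduction_factor(full_scan_mb, partitioned_scan_kb, kb_per_mb=1000):
    """Return how many times less data the partitioned query scanned."""
    return full_scan_mb * kb_per_mb / partitioned_scan_kb


# Scan sizes reported for the two queries in this post.
decimal_factor = scan_reduction_factor(215.67, 126.42)                 # 1 MB = 1000 KB
binary_factor = scan_reduction_factor(215.67, 126.42, kb_per_mb=1024)  # 1 MB = 1024 KB
```

Under either convention the factor lands between 1,700 and 1,750, consistent with the figure quoted above.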

Conclusion

In this post, we demonstrated how data engineers can use AWS Glue custom blueprints to simplify data integration pipelines and promote reusability. Non-data engineers such as data scientists, business analysts, and data analysts can ingest and transform data using a rich UI that abstracts the technical details so they can gain faster insights from their data. Our sample templates can get you started using AWS Glue custom blueprints. We highly encourage you to build blueprints and make them available to the AWS Glue community.


About the authors

Noritaka Sekiyama is a big data architect at AWS Glue and Lake Formation. His passion is for implementing artifacts for building data lakes.

Keerthi Chadalavada is a software development engineer at AWS Glue. She is passionate about building fault tolerant and reliable distributed systems at scale.

Shiv Narayanan is Global Business Development Manager for Data Lakes and Analytics solutions at AWS. He works with AWS customers across the globe to strategize, build, develop, and deploy modern data platforms.