Use AWS Glue DataBrew recipes in your AWS Glue Studio visual ETL jobs
AWS Glue Studio is now integrated with AWS Glue DataBrew. AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. DataBrew is a visual data preparation tool that enables you to clean and normalize data without writing any code. The more than 200 transformations it provides are now available for use in AWS Glue Studio visual jobs.
In DataBrew, a recipe is a set of data transformation steps that you can author interactively in its intuitive visual interface. In this post, you’ll see how to build a recipe in DataBrew and then apply it as part of an AWS Glue Studio visual ETL job.
Existing DataBrew users will also benefit from this integration—you can now run your recipes as part of a larger visual workflow with all the other components AWS Glue Studio provides, in addition to being able to use advanced job configuration and the latest AWS Glue engine version.
This integration brings distinct benefits to the existing users of both tools:
- You have a centralized view in AWS Glue Studio of the overall ETL diagram, end to end
- You can interactively define a recipe, seeing values, statistics, and distribution on the DataBrew console, then reuse that tested and versioned processing logic in AWS Glue Studio visual jobs
- You can orchestrate multiple DataBrew recipes in an AWS Glue ETL job or even multiple jobs using AWS Glue workflows
- DataBrew recipes can now use AWS Glue job features such as bookmarks for incremental data processing, automatic retries, auto scaling, and grouping small files for greater efficiency
Solution overview
In our fictitious use case, the requirement is to clean up a synthetic medical claims dataset created for this post, which has some data quality issues introduced on purpose to demonstrate DataBrew’s data preparation capabilities. Then the claims data is ingested into the Data Catalog (so it’s visible to analysts), after enriching it with relevant details about the corresponding medical providers coming from a separate source.
The solution consists of an AWS Glue Studio visual job that reads two CSV files containing claims and providers, respectively. The job applies a recipe to the first one to address the quality issues, selects columns from the second one, joins both datasets, and finally stores the result on Amazon Simple Storage Service (Amazon S3), creating a table in the Data Catalog so the output data can be used by other tools like Amazon Athena.
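Conceptually, the visual job implements a flow similar to the following PySpark sketch. This is only a rough outline: the bucket paths and column names are placeholders, and the DataBrew recipe cleanup is reduced to a single step here.

```python
from pyspark.sql import SparkSession

# Placeholder paths and column names; the real job uses the datasets and the
# DataBrew recipe described in the rest of this post.
spark = SparkSession.builder.appName("claims-etl-sketch").getOrCreate()

claims = spark.read.option("header", True).csv("s3://example-bucket/claims/")
providers = spark.read.option("header", True).csv("s3://example-bucket/providers/")

# 1. Clean the claims (done by the DataBrew recipe in the visual job).
claims_clean = claims.dropna(subset=["claim_amount"])

# 2. Keep only the provider columns that are needed and enrich the claims.
enriched = claims_clean.join(
    providers.select("provider_id", "provider_name", "provider_state"),
    on="provider_id",
    how="inner",
)

# 3. Write the result to Amazon S3 (the visual job also registers it in the Data Catalog).
enriched.write.mode("overwrite").json("s3://example-bucket/output/alabama_claims/")
```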
Create a DataBrew recipe
Start by registering the data store for the claims file. This will allow you to build the recipe in its interactive editor using the actual data so you can evaluate the result of the transformations as you define them.
- Download the claims CSV file using the following link: alabama_claims_data_Jun2023.csv.
- On the DataBrew console, choose Datasets in the navigation pane, then choose Connect new dataset.
- Choose the option File upload.
- For Dataset name, enter Alabama claims.
- For Select a file to upload, choose the file you just downloaded on your computer.
- For Enter S3 destination, enter or browse to a bucket in your account and Region.
- Leave the rest of the options at their defaults (CSV, comma-separated, with header) and complete the dataset creation.
- Choose Project in the navigation pane, then choose Create project.
- For Project name, name it ClaimsCleanup.
- Under Recipe details, for Attached recipe, choose Create new recipe, name it ClaimsCleanup-recipe, and choose the Alabama claims dataset you just created.
- Select a role suitable for DataBrew or create a new one, and complete the project creation.
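If you prefer to script this kind of setup, the dataset registration can also be done with the AWS SDK. The following boto3 sketch is only an illustration; the bucket and key are placeholders for wherever you uploaded the file.

```python
import boto3

databrew = boto3.client("databrew")

# Register the claims file as a DataBrew dataset.
# Bucket and key are placeholders for the S3 location of the uploaded CSV.
databrew.create_dataset(
    Name="Alabama claims",
    Format="CSV",
    FormatOptions={"Csv": {"Delimiter": ",", "HeaderRow": True}},
    Input={
        "S3InputDefinition": {
            "Bucket": "example-bucket",
            "Key": "databrew/input/alabama_claims_data_Jun2023.csv",
        }
    },
)
```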
Creating the project will start an interactive session using a configurable subset of the data. After the session has initialized, you’ll notice that some of the cells have invalid or missing values.
In addition to the missing values in the columns Diagnosis Code, Claim Amount, and Claim Date, some values in the data have some extra characters: Diagnosis Code values are sometimes prefixed with “code ” (space included), and Procedure Code values are sometimes followed by single quotes.
Claim Amount values will likely be used in calculations, so they should be converted to a numeric type, and Claim Date should be converted to a date type.
Now that we identified the data quality issues to address, we need to decide how to deal with each case.
There are multiple ways you can add recipe steps, including using the column context menu, the toolbar on the top, or from the recipe summary. Using the last method, you can search for the indicated step type to replicate the recipe created in this post.
Claim Amount is essential for this use case, so the decision is to remove the rows where it is missing.
- Add the step Remove missing values.
- For Source column, choose Claim Amount.
- Leave the default action Delete rows with missing values and choose Apply to save it.
The view is now updated to reflect the applied step, and the rows with missing amounts are no longer there.
Diagnosis Code can be empty, so this is accepted, but in the case of Claim Date, we want a reasonable estimation. The rows in the data are sorted in chronological order, so you can impute missing dates using the previous valid value from the preceding rows. Assuming every day has claims, the largest error would be assigning a claim to the previous day if the first claim of a day were the one missing the date; for illustration purposes, let’s consider that potential error acceptable.
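For reference, outside DataBrew the same fill-with-last-valid-value logic could be approximated in PySpark with a window function. This is only an illustrative sketch; the path is a placeholder and the file order is approximated with a surrogate row id.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("forward-fill-sketch").getOrCreate()

# Placeholder path for the claims CSV.
claims = spark.read.option("header", True).csv("s3://example-bucket/claims/")
claims = claims.withColumn("_row_id", F.monotonically_increasing_id())

# Carry the last non-null Claim Date forward over all preceding rows.
w = Window.orderBy("_row_id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
claims = claims.withColumn("Claim Date", F.last("Claim Date", ignorenulls=True).over(w))
```

In the recipe, the same result takes two steps: a type conversion followed by the imputation.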
First, convert the column from string to date type.
- Add the step Change type.
- Choose Claim Date as the column and date as the type, then choose Apply.
- Now to do the imputation of missing dates, add the step Fill or impute missing values.
- Select Fill with last valid value as the action and choose Claim Date as the source.
- Choose Preview changes to validate it, then choose Apply to save the step.
So far, your recipe should have three steps, as shown in the following screenshot.
- Next, add the step Remove quotation marks.
- Choose the Procedure Code column and select Leading and trailing quotation marks.
- Preview to verify it has the desired effect and apply the new step.
- Add the step Remove special characters.
- Choose the Claim Amount column and, to be more specific, select Custom special characters and enter $ for Enter custom special characters.
- Add a Change type step on the column Claim Amount and choose double as the type.
- As the last step, to remove the superfluous “code ” prefix, add a Replace value or pattern step.
- Choose the column Diagnosis Code, and for Enter custom value, enter code (with a space at the end).
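Taken together, the recipe now performs a cleanup that could be approximated in pandas as follows. This sketch is only for illustration; the job itself runs the DataBrew recipe, not this code.

```python
import pandas as pd

# Approximate pandas equivalent of the recipe steps defined above.
df = pd.read_csv("alabama_claims_data_Jun2023.csv")

df = df.dropna(subset=["Claim Amount"])                         # Remove missing values
df["Claim Date"] = pd.to_datetime(df["Claim Date"]).ffill()     # Change type + fill with last valid value
df["Procedure Code"] = df["Procedure Code"].str.strip("'")      # Remove quotation marks
df["Claim Amount"] = (
    df["Claim Amount"].astype(str)
    .str.replace("$", "", regex=False)                          # Remove special characters
    .astype(float)                                              # Change type to double
)
df["Diagnosis Code"] = df["Diagnosis Code"].str.replace("code ", "", regex=False)  # Replace value or pattern
```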
Now that you have addressed all data quality issues identified on the sample, publish the project as a recipe.
- Choose Publish in the Recipe pane, enter an optional description, and complete the publication.
Each time you publish, it will create a different version of the recipe. Later, you will be able to choose which version of the recipe to use.
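Publishing and inspecting recipe versions can also be done programmatically. The following boto3 sketch uses the recipe name created earlier and is only an example of the API calls involved.

```python
import boto3

databrew = boto3.client("databrew")

# Publish a new version of the recipe created in the console.
databrew.publish_recipe(Name="ClaimsCleanup-recipe", Description="Cleanup for Alabama claims")

# List the published versions to see which one to reference from AWS Glue Studio.
for version in databrew.list_recipe_versions(Name="ClaimsCleanup-recipe")["Recipes"]:
    print(version["RecipeVersion"], version.get("Description", ""))
```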
Create a visual ETL job in AWS Glue Studio
Next, you create the job that uses the recipe. Complete the following steps:
- On the AWS Glue Studio console, choose Visual ETL in the navigation pane.
- Choose Visual with a blank canvas and create the visual job.
- At the top of the job, replace “Untitled job” with a name of your choice.
- On the Job Details tab, specify a role that the job will use.
This needs to be an AWS Identity and Access Management (IAM) role suitable for AWS Glue, with permissions to Amazon S3 and the AWS Glue Data Catalog. Note that the role used earlier for DataBrew is not usable for running AWS Glue jobs, so it won’t be listed on the IAM Role drop-down menu here.
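If you need to create such a role, the following boto3 sketch shows the general shape. The role name is a placeholder, and you should also attach your own narrowly scoped S3 policy for the buckets the job reads from and writes to.

```python
import json
import boto3

iam = boto3.client("iam")

# Role that AWS Glue jobs can assume (the name is a placeholder).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(
    RoleName="GlueStudioClaimsJobRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# AWS managed policy with the base Glue and Data Catalog permissions;
# add an S3 policy of your own for the specific buckets used by the job.
iam.attach_role_policy(
    RoleName="GlueStudioClaimsJobRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
```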
If you used only DataBrew jobs before, notice that in AWS Glue Studio you can choose performance and cost settings, including worker size, auto scaling, and Flexible Execution, as well as use the latest AWS Glue 4.0 runtime and benefit from the significant performance improvements it brings. For this job, you can use the default settings, but reduce the requested number of workers in the interest of frugality. For this example, two workers will do.
- On the Visual tab, add an S3 source and name it Providers.
- For S3 URL, enter s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv.
- Select the format as CSV and choose Infer schema.
Now the schema is listed on the Output schema tab using the file header.
In this use case, not all columns in the providers dataset are needed, so we can discard the unnecessary ones.
- With the Providers node selected, add a Drop Fields transform (if you didn’t select the parent node, it won’t have one; in that case, assign the node parent manually).
- Select all the fields after Provider Zip Code.
Later, this data will be joined with the claims for the state of Alabama using the provider; however, that second dataset doesn’t have the state specified. We can use our knowledge of the data to optimize the join by filtering down to the data we really need.
- Add a Filter transform as a child of Drop Fields.
- Name it Alabama providers and add a condition that the state must match AL.
- Add the second source (a new S3 source) and name it Alabama claims.
- To get the S3 URL, open DataBrew in a separate browser tab, choose Datasets in the navigation pane, and copy the location shown in the table for Alabama claims (copy the text starting with s3://, not the associated http link). Then, back in the visual job, paste it as the S3 URL; if it is correct, you will see the data fields listed on the Output schema tab.
- Select CSV format and infer the schema like you did with the other source.
- As a child of this source, search in the Add nodes menu for recipe and choose Data Preparation Recipe.
- In this new node’s properties, give it the name Claim cleanup recipe and choose the recipe and version you published before.
- You can review the recipe steps here and use the link to DataBrew to make changes if needed.
- Add a Join node and select both Alabama providers and Claim cleanup recipe as its parents.
- Add a join condition equating the provider ID from both sources.
- As the last step, add an S3 node as a target (note the first one listed when you search is the source; make sure you select the version that is listed as the target).
- In the node configuration, leave the default format JSON and enter an S3 URL on which the job role has permission to write.
In addition, make the data output available as a table in the catalog.
- In the Data Catalog update options section, select the second option Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions, then select a database on which you have permission to create tables.
- Assign alabama_claims as the name and choose Claim Date as the partition key (this is for illustration purposes; a tiny table like this doesn’t really need partitions if further data won’t be added later).
- Now you can save and run the job.
- On the Runs tab, you can keep track of the process and see detailed job metrics using the job ID link.
The job should take a few minutes to complete.
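You can also poll the run status programmatically. In the following boto3 sketch, the job name is a placeholder for whatever you named your visual job.

```python
import boto3

glue = boto3.client("glue")

# Fetch the most recent run of the job (replace the name with yours).
runs = glue.get_job_runs(JobName="claims-visual-etl", MaxResults=1)
print(runs["JobRuns"][0]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED
```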
- When the job is complete, navigate to the Athena console.
- Search for the table alabama_claims in the database you selected and, using the context menu, choose Preview Table, which will run a simple SELECT * SQL statement on the table.
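If you prefer to run the same check outside the console, the following sketch submits a similar query through the Athena API. The database name and the query results location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Preview the output table; the output location is a placeholder bucket
# where Athena is allowed to write query results.
query = athena.start_query_execution(
    QueryString="SELECT * FROM alabama_claims LIMIT 10",
    QueryExecutionContext={"Database": "your_database"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
print(query["QueryExecutionId"])
```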
You can see in the result of the job that the data was cleaned by the DataBrew recipe and enriched by the AWS Glue Studio join.
Apache Spark is the engine that runs the jobs created in AWS Glue Studio. Using the Spark UI on the event logs the job produces, you can view insights about the job plan and run, which can help you understand how your job is performing and identify potential performance bottlenecks. For instance, for this job on a large dataset, you could use it to compare the impact of explicitly filtering the provider state before doing the join, or to identify whether you could benefit from adding an Autobalance transform to improve parallelism.
By default, the job will store the Apache Spark event logs under the path s3://aws-glue-assets-<your account id>-<your region name>/sparkHistoryLogs/. To view the logs, you have to run a Spark History Server using one of the available methods.
Clean up
If you no longer need this solution, you can delete the files generated on Amazon S3, the table created by the job, the DataBrew recipe, and the AWS Glue job.
Conclusion
In this post, we showed how you can use AWS Glue DataBrew to build a recipe using its interactive editor and then use the published recipe as part of an AWS Glue Studio visual ETL job. We included some examples of common tasks that are required when doing data preparation and ingesting data into AWS Glue Data Catalog tables.
This example used a single recipe in the visual job, but it’s possible to use multiple recipes at different points in the ETL process, as well as to reuse the same recipe in multiple jobs.
These AWS Glue solutions allow you to effectively create advanced ETL pipelines that are straightforward to build and maintain, all without writing any code. You can start creating solutions that combine both tools today.
About the authors
Mikhail Smirnov is a Sr. Software Dev Engineer on the AWS Glue team and part of the AWS Glue DataBrew development team. Outside of work, his interests include learning to play guitar and traveling with his family.
Gonzalo Herreros is a Sr. Big Data Architect on the AWS Glue team. Based in Dublin, Ireland, he helps customers succeed with big data solutions based on AWS Glue. In his spare time, he enjoys board games and cycling.