Simplify incoming data ingestion with dynamic parameterized datasets in AWS Glue DataBrew

When data analysts and data scientists prepare data for analysis, they often rely on periodically generated data produced by upstream services, such as labeling datasets from Amazon SageMaker Ground Truth or Cost and Usage Reports from AWS Billing and Cost Management. Alternatively, they can regularly upload such data to Amazon Simple Storage Service (Amazon S3) for further processing. In this post, we demonstrate how you can prepare data for files that are already in your S3 bucket as well as new incoming files using AWS Glue DataBrew, a visual data preparation service that provides over 250 transformations for cleaning and normalizing data. The recently launched dynamic datasets feature of DataBrew allows you to effectively reuse such datasets by using a single recipe in multiple runs of the same job. In this post, I show how to do this and highlight other use cases for the dynamic datasets feature.

Solution overview

In this post, I use DataBrew to create recurring data preparation jobs that can use dynamically updated data from Amazon S3. We can achieve such functionality by defining dynamic DataBrew datasets. Dynamic datasets are parameterized datasets, in which parameters are evaluated at the dataset loading time. DataBrew loads datasets during interactive session initialization so that they can be used to visually construct a recipe based on a sample of data. Datasets are also loaded when jobs run. It can be either a recipe job, in which a recipe is applied to the whole set of data to produce desired outputs, or a profile job that performs analysis of the input data to better understand initial quality and relations between columns. Therefore, the selected files can be different depending on the time of a session and job, hence the term dynamic datasets.

After I define my dynamic dataset, I create a DataBrew project so that I can start an interactive session and define my data preparation recipe. When the recipe is finalized, I create and schedule a recipe job so that future data is also processed as it becomes available in Amazon S3.

Prerequisites

If you’d like to follow along with this post, you need the following prerequisites:

An AWS account
A Kaggle account to download the sample dataset. For this post, I use the dataset Mars2020-Image-Catalogue.

Use case for dynamic datasets

For this use case, I need to clean up raw data that is written into an S3 bucket in CSV format every day, possibly multiple times. These files are metadata about Mars images taken by the Perseverance Rover. Let’s assume that we need to prepare this data to be used in image classification training that runs daily and uses all the data from the past 24 hours.

The following screenshot shows how the sample files look when uploaded into an S3 bucket.

Each file name includes information about the date and time it was created, and I want to have this information available in my output data. Additionally, this folder contains not only CSV files with metadata that I need, but also another CSV file: image-catalogue.csv, which I want to exclude from my processing.

Create a DataBrew dataset

We create a DataBrew dataset for these S3 files to be picked up daily.

On the DataBrew console, choose Datasets.
Choose Connect new dataset.
For Dataset name, enter daily-images-info.
Navigate to the folder with our sample files.

If we only need to process all the files one time, we can select the whole folder and use the S3 location s3://dynamic-datasets/Mars2020-Image-Catalogue/Mars2020-Image-Catalogue. But for our use case, we only need files for the last day, so this won’t work. We need to define the right parameters so that our dataset is connected to the latest files only.

Let’s start with choosing a particular file location and parameterizing it by replacing the changing part that contains date and time with the corresponding parameter.

Select a single file that ends with _metadata.csv.
In the location field, highlight the date and time.
On the context menu that appears, choose Create custom parameter.

This lets you to define a parameter to be used as a replacement for the previously highlighted part of the S3 location.

For Parameter name, enter file_creation_time.
For Type¸ choose Date.

We need to tell DataBrew which date and time format is used in our S3 path. The Predefined format drop-down menu contains a list of popular formats, but because in our case file names use the non-standard underscore (_) to separate between hours, minutes ,and seconds we need to define the right format.

For Date format, select Custom.
Enter yyyy-MM-dd-hh_mm_ss.

The Locale option can be useful if a part of S3 path that is being parameterized contains not only the numeric part of the date but also textual parts like a short or long month name in a language different from English.

The Format preview section allows you to check if the entered date format produces a string in the expected form.

Choose Create.

Our new parameter is added in the S3 location field instead of the particular date and time, and all matching files that contain date and time in their name are selected in the Matching files section. This gives us 107 matching files (at the time of writing this post) within the entire Mars-2020-Image-Catalogue folder.

However, we only need files from the last day, so let’s change our parameter to reflect this requirement.

Choose the ellipsis next to the parameter name in the left pane and choose Edit.

The drop-down list for Time range contains several relative time ranges, such as Past year, Past week, and so on.

Choose Past 24 hours.
Select Add as column.

We want to preserve this information when data from multiple files are combined. This option creates a column named file_creation_date when our dataset is loaded.

Choose Save.

You’re not limited by the time ranges provided in the Time range drop-down. If you need to define your own time range, choose Custom and select your range of dates (From, To, or both). You can an absolute date and time values, as shown in the following screenshot.

Alternatively, you can use a relative date. These dates are relative to when the files are loaded by DataBrew either during interactive session initialization or during a job start.

After we save our parameter, our dataset contains exactly what we need—five files created during the past 24 hours.

For this post, I use a time range in the parameter definition rather than selecting Specify last updated date range. The last updated date range selector filters files by their last modified date attribute in Amazon S3. I include an example of using it later in this post, but it’s not always the same as a date in the file name. With a parameter definition, the date in the file reflects the actual image capture date, whereas the last modified date reflects when I uploaded these files into my S3 bucket.

Choose Create dataset.

Our dataset is now listed on the Datasets page.

Create a project and recipe to clean up the raw data

To create a project and recipe to clean the data, complete the following steps:

On the Datasets page of the DataBrew console, select a dataset.
On the Actions menu, choose Create project with this dataset.
Enter a name for the project (for this post, daily-images-preparation).
You can choose an existing AWS Identity and Access Management (IAM) role or create a new one (for this post, we choose Create new IAM role).
Choose Create project.

Wait for the project session to initialize. When it’s ready, we can see our data describing Mars images captured during the past 24 hours. If you horizontally scroll to the last column in the view, you can see our added column file_creation_date.

This information from the S3 now path is available for further operations as any other column.

Let’s define our recipe now. For simplicity, my recipe only contains two steps:

Delete columns that aren’t needed for my image classification training
Filter out images with missing values in the caption column

To learn more about popular transformations, see 7 most common data preparation transformations in AWS Glue DataBrew and Recipe action reference.

Create and schedule a job

Now that our recipe is ready, we can create a job that applies this recipe to our daily-images-info dataset. This process also doesn’t differ between dynamic and non-dynamic datasets.

In our recipe details pane, choose Create job.
For Job name, enter a name.
Under Job output settings, enter the output location for our processed data.
Expand Associated schedules and choose Create schedule.
For Schedule name¸ enter a name.
For Run frequency, choose At a particular time.
For this post, configure the job to run at 11:59 PM every day.
For Role name, choose Create new role to create a role with access to write the prepared output to Amazon S3.
Choose Create job.

We don’t choose Create and run job because we don’t want to start our data preparation job yet; it runs on a schedule.

When job runs according to the schedule, it produces the prepared output with a daily portion of data, as shown in the following screenshot of my S3 output location.

Other use cases for DataBrew dynamic datasets

In this section, we discuss other use cases in which parameterized dynamic datasets can be useful.

Capture information from Amazon S3 and make it available in your dataset

In the first example of this post, I configured the file_creation_date parameter to be added as a column. Your data could have even more useful information that is available as a part of S3 path rather than as a column in your input file. For example, let’s say we have our input files located across multiple folders:

s3://bucket-name/input/EMEA/France/cases-2021-04-18.json
s3://bucket-name/input/APAC/Thailand/cases-2021-04-16.json
...

While processing such input, we may need to aggregate all these files, but still keep track of which folders and files this data comes from. We can use three parameters to parameterize the S3 location:

s3://bucket-name/input/{region}/{country}/cases-{report-date}.json

We define the first two parameters as the String type, and define report-date as the Date type with the predefined yyyy-MM-dd date format. For all these parameters, we should select Add as column.

Now, in addition to the original columns, three new columns are available in your dataset. You can not only use these columns for filtering and transformations in your recipe steps, but also use them to partition output, so your recipe job file structure resembles your input.

Match multiple files across different folders without using regular expressions syntax

An alternative way to parameterize the S3 location in DataBrew datasets is by using regular expressions like the following: s3://bucket-name/performance/product-<[A-Z]+>/node-<\d{1,4}>-metrics.json.

However, not all users are comfortable writing syntactically correct regular expressions to select necessary files and filter out files that should be excluded. With the custom parameters of dynamic datasets, you can select the right parameter type and apply one of the provided filters. For example, for numeric parameters, you can choose from the options shown in the following screenshot.

For parameters of type String, you can choose among the conditions in the following screenshot.

You can use custom parameters defined in such a manner in the S3 location instead of regular expressions. For example, s3://bucket-name/performance/product-{product_code}node-{node number}metrics.json.

Filter files by the last modified attribute in Amazon S3

In my first example, I showed how to select DataBrew dataset files based on the date that is encapsulated in the S3 file path. In other scenarios, you may prefer to rely on the Last Modified file attribute in Amazon S3 and use it for filtering instead. To do this, select Specify last updated date range and either use one of the predefined relative ranges or specify your own relative or absolute range by choosing Custom.

Limit dataset files by a number of the latest or oldest files in a location

Another use case is when the S3 location or set of matching files contains several files, but you’re only interested in one or several latest or oldest files. In this scenario, select Specify number of files to include under Choose filtered files and specify whether you want the latest or oldest files and how many.

This limit is applied on top of already matching files, whether it’s a folder or parameterized location.

Conclusion

DataBrew is a visual data preparation tool that enables you to clean and normalize data without writing code. The dynamic datasets feature described in this post and the ability to schedule your jobs make it easier to include this data preparation stage in your whole data processing cycle, so that your dataset, recipe, and job are defined one time and used repeatedly for upcoming data.

Additionally, you can employ Amazon EventBridge rules to forward DataBrew output to another service after a recipe job runs . For an example, see Setting up automated data quality workflows and alerts using AWS Glue DataBrew and AWS Lambda. Together with DataBrew job scheduling, this can automate your data preparation tasks.

About the Author

Mikhail Smirnov is a Sr. Software Engineer working on AWS Glue DataBrew. His previous experience includes leading user experience development for Amazon Lex and Amazon Comprehend. When not working, Mikhail likes to improve his guitar-playing skills and spend time with his kids.