Preparing your third-party data from AWS Data Exchange with AWS Glue DataBrew

Companies are accelerating their use of third-party data to enhance their decision-making and deliver more value to their customers. Often you need to cleanse and transform the third-party data before you can join it to internal data and generate insights. This step requires you to write code or SQL. Your organization may have data analysts who have limited experience with these skills and prefer working with visual tools to prepare and visualize the data. In this blog post, I show how your data analysts can use AWS Glue DataBrew, a visual data preparation tool, to prepare your third-party data.

Background concepts

AWS Data Exchange enables you to securely find, subscribe, and use third-party data in the cloud. It has more than 3,500 data products from more than 230 qualified data providers in the AWS Marketplace. For this blog post, I use the Foursquare Places 2021: US Sample product from AWS Data Exchange. This product can be used for retail planning and site selection, analysis of foot traffic data, and generating recommendations for consumer experience. You can learn more about the data in the product visit the Foursquare product page.
AWS Glue DataBrew (DataBrew) helps you clean and normalize data without writing code. This reduces the time it takes to prepare data for analytics and machine learning (ML) by up to 80 percent compared to traditional approaches to data preparation. DataBrew offers over 250 prebuilt transformations to automate data preparation tasks such as merging and joining datasets, filtering anomalies, standardizing formats, and correcting invalid values. These preparation tasks would otherwise require days or weeks of writing hand-coded transformations.
Amazon Location Service is a fully managed service that makes it easy for developers to add location functionality, such as maps, points of interest, geocoding, routing, tracking, and geofencing to their applications.

Use case

For this blog post, I assume you are a data engineer working for a food delivery startup that is expanding to a new market. You have been tasked with providing a dataset of dining locations in Chicago to the planning team so they can best select where to deploy their vehicles. The planning team will also display the point of interest over a map for better visualization.

In this post, I show how to convert a dataset from Foursquare Places 2021: US Sample into a GeoJson format using Glue Databrew for further analysis. The dataset contains latitude- and longitude-based points of interest (POI) covering different categories, such as Dining and Drinking, Arts and Entertainment, Business and Professional Services. Furthermore, I show how to use Amazon Location Service to plot this data over a map.

Prerequisite

Before I dive deep into the solution overview and in order to complete this blog post, you need the following prerequisite:

An Amazon S3 bucket to export your AWS Data Exchange datasets. For more information, see Create a Bucket.

Solution Overview

Here is the step-by-step process:

Subscribe to Foursquare data from AWS Data Exchange.
Create DataBrew datasets with AWS Data Exchange.
Create a DataBrew project.
Use DataBrew to implement a data transformation recipe.
Configure and start DataBrew to write output to your S3 bucket.
(Optional) Display point of interest locations over a map using Amazon Location Service (Amazon Location).

The following diagram shows the overall solution.

ETL Architecture

In this blog post, I cover steps 1 to 5 in detail. For step 6, I give an overview of the steps that enable you to deploy Amazon Location and overlay data.

Step 1: Subscribe to Foursquare data from AWS Data Exchange

To subscribe to a data product, do the following:

Sign in to your AWS account. Navigate to AWS Marketplace, and select the Foursquare Places 2021: US Sample
On the data product page, choose Continue to subscribe.
Under Complete subscription, for Pricing information, choose a product offer and fill the subscription request.
Once the subscription is active, open the subscription. To see one or more revisions, choose the entitled dataset called US_venues_integrated.
To clean the data, go to the Glue DataBrew.

Step 2: Create a DataBrew dataset with AWS Data Exchange

To create the datasets, complete the following steps:

In the DataBrew console Datasets page, choose Connect new. For Dataset name, enter a name. I entered fsq-1.
On the left sidebar, choose AWS Data and then choose US_venues_integrated. Note that you are only able to see datasets from those products that reside in the Region you are using DataBrew console from.
Select part-00000 and then choose Selected file type as
In Column header values, select Treat first row as header.
Under Enter your S3 destination section, choose the S3 Bucket where you want the datasets to be stored. Note that you are only able to see S3 buckets that reside in the Region you are using DataBrew from.
Choose Create dataset.
Create another dataset called fsq-2 by repeating steps 2.1-5 and entering fsq-2 for step 2.1, selecting part-00003 in step 2.3. For this dataset, make sure to select Add default header in step 2.4.

Before proceeding to create a DataBrew project and start applying transformations, you can use DataBrew Profile jobs to evaluate the quality of your data by profiling it to understand data patterns and detect anomalies.

Step 3: Create DataBrew project

To create your DataBrew project, complete the following steps:

On the DataBrew console Projects, choose Create project.
For Project Name, enter fsq-dining-dataset.
Under Select a dataset section, choose the fsq-1. The recipe name is populated automatically.
Expand the Sampling section. For Type, choose First n rows and set Custom size to 5000.
If this is your first time using DataBrew, choose Create New IAM role and specify suffix as fsq. If this is not your first time using DataBrew, you can select an existing IAM role.
Choose Create project.

You see a pop-up informing you that a session is getting initiated. When the interactive session is ready, you see sample rows retrieved by DataBrew. Now you can start applying transformations to consolidate the datasets and clean the data.

Step 4: Use DataBrew to implement the data transformation recipe

A recipe is a series of transformations performed on the data. My recipe to transform the dataset to GeoJSON format contains 12 steps. These can be categorized into two subgroups.

Data cleansing: removes columns and rows that are not needed in my dataset.
Data transformation: converts data into GeoJSON format.

a) Cleanse your data

1. Consolidate two datasets into a single dataset

To consolidate two datasets into a single dataset, do the following in the DataBrew console:

- - On the DataBrew navigation bar, choose the Union.
  - Select the fsq-2 dataset and then choose Next.
  - For specify Union details, choose Map Columns using Column position, then choose Finish.
  - To see a visual representation of the union of the two datasets, on the top right, choose the Lineage icon.

2. Remove the columns that are not relevant

The Foursquare dataset has a large set of columns, and I want to keep only the following columns: latitude, longitude, name, address, popularity, hours_popular, hours, dma, and category_labels. To do so, do the following in the DataBrew console:

- - Select 126 columns, then choose Unselect all.
  - On the search bar, enter and select each of the following columns: latitude, longitude, name, address, popularity, dma, hours_popular, hours, and category_labels.
  - Select Delete unselected columns.

The DataBrew console removes the number of columns and gives a preview of the delete transformation applied.

3. Keep only rows for Chicago city

To filter rows for Chicago, perform the following steps:

- - On the DataBrew navigation bar, choose the Filter icon and then select By Condition.
  - Select Column dma.
  - For the filter condition as Is exactly end, enter Chicago.
  - Choose Apply.

Note that you can now see Chicago-specific rows on the console. Next, you choose only dining-related rows.

4. Keep only rows for the dining category

To filter rows for specific dining categories, perform the following steps:

- - On the DataBrew navigation bar, choose the Filter icon, select By Condition, and then choose Contains.
  - Select the category_labels column.
  - Under Enter custom value, enter Dining.
  - Choose Apply.

5. Remove columns dma and category_label

To remove the dma and category_label columns, do the following:

- - On the DataBrew navigation bar, choose the Column icon and then select Delete.
  - Select Source columns.
  - Choose dma and category_label.
  - Choose Apply.

6. Remove rows with missing values

You must remove the rows where the latitude and longitude are missing, since location is an important attribute for the planning team.

Perform the following steps for rows with missing data for latitude and longitude:

- - On the DataBrew navigation bar, choose the Missing icon.
  - Select column Latitude.
  - For Missing value action, select Delete rows with missing values.
  - Choose Apply.
  - Repeat steps 3.b.6.1–4 for the column Longitude.

You notice that none of the rows is removed; this is because you are working on a sample with both longitude and latitude data present. This step is important, so the final result of the DataBrew job will have only rows where the longitude and latitude are not null to be able to use the point of interest.

Congratulations; your data cleansing process is complete.

b) Transform your data into GeoJSON format

Now that you have cleansed your data, you must complete the transformation recipe by preparing the columns for the GeoJSON format. This format is a nested JSON data structure that has data fields like type and geometry, which are not in this dataset. In this section, you create these fields.

To prepare the columns to be transformed into the GeoJSON format, follow the instructions in the DataBrew console. The following is an example of the GeoJSON format.

{
	"type": "Feature",
	"geometry": {
		"type ": "Point",
		"coordinates": [125.6, 10.1]
	},
	"Properties": {
		"name": "Dinagat islands "
	}
}

1. Create the type column

Before you start nesting columns for the GeoJSON, you need to create a column for the geometry data field. The following steps take you through creating a new column using the DataBrew console.

- - On the navigation bar, in Create, select based on a function.
  - In Select a function, choose REPEAT.
  - In Values using, select Custom Value and enter Point.
  - In Number of times, enter 1.
  - In Destination column, enter type.

DataBrew creates a new Type column with rows filled by Point.

2. Create the coordinates column

The coordinates field is an array made of latitude and longitude. To get the array structure, you use the Nest/Unnest feature. This feature is a transformation that enables you to combine columns into a new column of Array, Struct, or Map data type. Refer to the following coordinates column in the GeoJSON example.

{
	"type": "Feature",
	"geometry": {
		"type ": "Point",
		"coordinates": [125.6, 10.1]
	},
	"Properties": {
		"name": "Dinagat islands "
	}
}

To create the coordinates column, perform the following steps:

- - On the DataBrew navigation bar, choose Nest-Unnest and then choose Nest.
  - In the Source column, select in the following order Longitude and then Latitude.
  - In Nest Values into, select Array.
  - In Destination column, select Replace existing column.
  - In Destination column name, input coordinates and choose Apply.

3. Create the geometry column

The following shows the geometry column, which is a JSON object.

{
	"type": "Feature",
	"geometry": { "type ": "Point", "coordinates": [125.6, 10.1] },
	"Properties": {
		"name": "Dinagat islands "
	}
}

To create this object, use the Struct type of the NEST-UNEST transformation by following these steps:

- - On the DataBrew navigation bar, choose Nest-Unnest and choose Nest.
  - In Source column, select coordinates and type.
  - In Nest Values into, select Struct.
  - In Destination Column, select Replace existing column.
  - In Destination column name, enter geometry and then choose Apply.

After this transformation, you see a new column created called geometry. This is a JSON object, which is a result of the NEST transformation of type and coordinates columns created in steps 4.b.1– 3.

4. Populate type column with Feature as the value

Since your GeoJSON object will have nonspatial properties, you must create a Type column with Feature as value. This is defined as part of the GeoJSON standard. The following example shows the field in the GeoJSON object.

{
	"type": "Feature",
	"geometry": {
		"type ": "Point",
		"coordinates": [125.6, 10.1]
	},
	"Properties": {
		"name": "Dinagat islands"…
	}
}

To create the type column, do the following:

- - In Create, select based on a function.
  - In Select a function, choose REPEAT.
  - In Values using, select Custom Value.
  - In Custom value, enter Feature.
  - In Destination column, enter type and then choose Apply.

5. Create the properties column

This column represents the non-location attribute of a location, the example below shows the structure of the properties column which is a JSON. To create a properties column, perform the following steps:

{
	"type": "Feature",
	"geometry": {
		"type ": "Point",
		"coordinates": [125.6, 10.1]
	},
	"Properties": { "name": "Dinagat islands"… }
}

- - Choose Nest-Unnest and choose Nest.
  - In Source column, select all the columns except type and geometry.
  - In Nest Values into, select Struct.
  - In Destination column, select Replace existing column.
  - In Destination column name, enter properties and choose Apply.

The following screenshot shows an example of what the prepared recipe will look like after you have added all the transformation suggested in the preceding steps. It lists parameters for union, column deletions, row deletions, how to filter values, column creations, and nest values.

DatBrew transformation recipe

The Databrew console now shows the result of applying 12 transformations on the sample of 5,000 rows you selected in step 3.4. Next, you run the recipe containing your 12 transformations on the entire dataset via a job.

Step 5: Configure and start DataBrew job to write output to your S3 bucket

Create a job to apply the recipe you created in step 4 on all the datasets and configure it to output a JSON file. To do that, complete the following steps:

In the DataBrew console top right of the project editor, choose the Create job button.
For Job name, enter fsq-transformation.
Select JSON as the output format. For File type, choose JSON.
For Role name, choose an existing role or create a new one.
Choose Create and run job.
Navigate to the Jobs page and wait for the fsq-transformation job to complete.
Choose the Destination link to navigate to Amazon S3 to access the job output.

Congratulations! You have successfully prepared the data in GeoJSON format. You can now plot and analyze this data using a variety of tools.

Step 6 Overview

a.) Display point of interest over a map using Amazon Location Service

If you are interested in using Amazon Location Service to plot this data, to bootstrap a web application, follow the instructions in the AWS Location Services GitHub repository. To add custom data as an overlay for the map on the front-end web application you just bootstrapped from the repository, follow the adding custom data example. The following image shows a first look at the application with 20 records of transformed data. It is a map of Chicago with a cluster of about 10 blue dots in the city and around 10 blue dots scattered across the suburbs. Each blue dot represents a dining point of interest in Chicago.

PointsOfInterest-over-Amazon-Location-Service

b.) Explore AWS Data Exchange catalog and identify datasets you can use to generate additional insights for your business.

Conclusion

In this post, I showed you how to subscribe to a third-party data provider and transform its data using DataBrew. As a next step, you can now transform your own data, combine it with data from AWS Data exchange, and then analyze it with Amazon Athena orAmazon Quicksight or display it on a map powered by Amazon Location Service.

About the author

author Lotfi is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

AWS Marketplace