Serverless Unsupervised Machine Learning with AWS Glue and Amazon Athena
Have you ever had the need to segment a data set based on some of its attributes? K-means is one of the most common machine learning algorithms used to segment data. The algorithm works by separating data into different groups, called clusters. Each sample is assigned a cluster so that the samples assigned to the same cluster are more similar to each other than to those in other clusters.
In this blog post, I walk through using AWS Glue to take a data set of taxi rides stored on Amazon S3 and apply K-means to separate the data into 100 different clusters based on the ride coordinates. Then, I use Amazon Athena to query the number of rides and the approximate area of each cluster. Finally, I use Amazon Athena to calculate the coordinates of the ten areas with the most rides. Both AWS Glue and Amazon Athena allow you to perform these tasks without the need to provision or manage servers.
Solution overview
I’ll use the New York City Taxi data set from an earlier blog post: Harmonize, Query, and Visualize Data from Various Providers using AWS Glue, Amazon Athena, and Amazon QuickSight. Specifically, I’ll use the table containing the green taxi rides for January 2016.
I’ll show you an AWS Glue job script that uses the Spark machine learning K-means clustering library to segment a data set based on coordinates. The script loads the green taxi data and adds a column indicating the cluster each row is assigned to. It then saves the resulting table to an Amazon S3 bucket (the destination path) in Parquet format, where it can be queried using Amazon Athena.
Let’s consider the problem of grouping the taxi rides into 100 different groups (clusters) based on their pickup locations (given by the pickup_longitude and pickup_latitude columns). To solve this problem, the AWS Glue script reads the input table and then, using the Spark machine learning libraries, runs K-means with the number of clusters set to 100. Results are stored in an Amazon S3 bucket in Parquet format so you can query them using Amazon Athena.
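As a preview of the MLkmeans.py script used in the walkthrough below, here is a minimal sketch of what such a Glue job can look like, assuming a PySpark job and the Spark ML KMeans estimator. The bucket path, database name, and the zero-coordinate filter are illustrative assumptions; the actual script may differ in structure.

```python
# A minimal sketch of this kind of Glue job (not the verbatim MLkmeans.py):
# load the green taxi table from the Data Catalog, run Spark ML K-means on
# the pickup coordinates, and write the clustered rows to S3 as Parquet.
from pyspark.context import SparkContext
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

destination = "s3://YOUR-BUCKET/taxi-clusters/"  # placeholder: where results are stored
namespace = "YOUR-DATABASE"                      # placeholder: Data Catalog database
tablename = "green"                              # table created by the crawler

# Load the green taxi table as a Spark DataFrame
df = glueContext.create_dynamic_frame.from_catalog(
    database=namespace, table_name=tablename).toDF()

# Drop rows without valid coordinates (an assumption), then assemble the
# pickup coordinates into the single vector column that Spark ML expects.
# This assumes the crawler classified the coordinate columns as doubles.
assembler = VectorAssembler(
    inputCols=["pickup_longitude", "pickup_latitude"],
    outputCol="features")
features = assembler.transform(
    df.filter("pickup_longitude != 0 AND pickup_latitude != 0"))

# Fit K-means with 100 clusters; transform() adds the 'prediction' column
# holding the cluster ID assigned to each row
model = KMeans(k=100, seed=1).fit(features)
clustered = model.transform(features).drop("features")

# Save the results in Parquet format so Athena can query them
clustered.write.mode("overwrite").parquet(destination)
```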
Walkthrough
Execute the AWS Glue job
Follow these steps:
- In the AWS Management Console, go to the AWS Glue console. Create a new database for the AWS Glue crawlers (which create table definitions in the Data Catalog) to write tables into.
- Create a new crawler pointing to the S3 location of the taxi data set from the earlier blog post.
- Run the crawler.
Make sure the crawler classifies the green table and that its schema includes the pickup_longitude and pickup_latitude attributes.
- Upload the script file MLkmeans.py into one of your S3 buckets.
- Add a new AWS Glue job. Choose a name and role for the job, select the option of running a job from “An existing script that you provide,” choose the S3 path of the uploaded script, and then choose an S3 path for temporary files. Choose Next twice, and then choose Finish.
- Edit the script: select the job, choose the option to edit it, and then make the following changes:
- Edit the destination variable to the S3 path where you want to store the results.
- Edit the namespace and tablename variables with the database and table name of the green table created by the crawler that ran previously.
- Run the AWS Glue job.
- Verify that the Parquet files get created in the destination path.
- Create a new crawler pointing to the destination path.
- Run the crawler on the destination path to create a new table in the AWS Glue Data Catalog pointing to the newly converted dataset.
Query results using Amazon Athena
After the crawler finishes analyzing the Parquet dataset created by the AWS Glue extract, transform, and load (ETL) job, you should have a table in the Data Catalog containing the original ride columns plus a new prediction column.
The prediction column was added by the K-means algorithm and contains an integer representing the ID of the cluster each row was assigned to.
Let’s look at an example that lists all of the calculated clusters with a query in Amazon Athena.
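A sketch of what this query can look like, assuming each cluster's area is approximated by the bounding box of its pickup coordinates (in squared degrees):

```sql
-- Sketch: count rides per cluster and approximate each cluster's area
-- by the bounding box of its pickup coordinates (an assumption).
SELECT prediction,
       count(*) AS count,
       (max(pickup_latitude) - min(pickup_latitude)) *
       (max(pickup_longitude) - min(pickup_longitude)) AS approximate_cluster_area
FROM RESULTDATABASE.RESULTTABLENAME
GROUP BY prediction
ORDER BY prediction;
```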
Replace RESULTDATABASE.RESULTTABLENAME with the database and table name of your result table before running the query.
The results show how many taxi pickups were made within each geographic region (the count column), as well as the area covered by each region (the approximate_cluster_area column).
Let’s look at another example that lists the 10 clusters with the most activity and calculates the coordinates of their centers.
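A sketch of such a query, assuming each cluster's center is approximated by the mean of its pickup coordinates:

```sql
-- Sketch: the 10 busiest clusters, with each center approximated by
-- the mean of its pickup coordinates (an assumption).
SELECT prediction,
       count(*) AS count,
       avg(pickup_latitude) AS latitude,
       avg(pickup_longitude) AS longitude
FROM RESULTDATABASE.RESULTTABLENAME
GROUP BY prediction
ORDER BY count DESC
LIMIT 10;
```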
Again, replace RESULTDATABASE.RESULTTABLENAME with the database and table name of your result table before running the query.
The results show the ten clusters with the most rides. If you plot those coordinates on a map using the Amazon QuickSight geospatial visualization feature, you can see where the busiest pickup areas are located.
Summary
In this blog post, you learned how to use AWS Glue and Amazon Athena to apply unsupervised machine learning algorithms without launching or managing servers. In the example, we separated a data set of taxi rides into 100 different groups based on the ride coordinates, and then used queries to calculate data such as the number of rides in each group, its approximate area, and the coordinates of its center.
The solution presented in this post can also be applied to other data sets with just a few modifications, so you can use it to address the needs of your own use cases. I look forward to your feedback and suggestions.
Additional Reading
Learn how to build PMML-based applications and generate predictions with AWS.
About the Author
Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improve the value of their solutions when using AWS.