AWS Database Blog

Simplify Amazon DynamoDB data extraction and analysis by using AWS Glue and Amazon Athena

More than 100,000 AWS customers have chosen Amazon DynamoDB for mobile, web, gaming, ad tech, IoT, and many other applications. For example, Duolingo uses DynamoDB to store 31 billion items in tables that reach 24,000 read capacity units per second and 3,300 write capacity units per second.

DynamoDB can address a wide variety of applications and microservices at any scale. Many DynamoDB customers also perform analytics and ad hoc queries on their data stored in DynamoDB tables. This process typically involves copying or archiving data from DynamoDB tables to a data store for big data analytics and querying, such as Amazon S3. You can use AWS Glue and the new DynamoDB crawling and extraction features to simplify the task of moving data to Amazon S3 for analysis.

In this post, we show how to use AWS Glue to crawl a DynamoDB table, extract the data into Amazon S3, and perform analysis using SQL queries on Amazon Athena. These services simplify the task of moving and analyzing data from DynamoDB by using fully managed, scalable, and serverless services on AWS. We also provide a hands-on tutorial and an AWS CloudFormation template that lets you test the workflow in your own AWS account.

Solution overview

The solution in this post uses AWS Glue, Amazon S3, and Athena to crawl, extract, and perform analytics on data from DynamoDB, as shown in the following diagram.

AWS Glue is a fully managed, pay-as-you-go, extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics. AWS Glue automatically discovers and profiles your data via the AWS Glue Data Catalog and AWS Glue crawlers. It recommends and generates ETL code to transform your source data into target schemas. It then runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load your data into its destination. With AWS Glue, you can also set up, orchestrate, and monitor complex data flows.

You can also create AWS Glue ETL jobs to read, transform, and load data from DynamoDB tables into services such as Amazon S3 and Amazon Redshift for downstream analytics.

Architecture diagram of the solution beginning from the Amazon DynamoDB table

Here’s how our example solution architecture works, as illustrated in the preceding diagram:

  1. AWS CloudFormation creates a DynamoDB table in your AWS account and populates it with a subset of the New York City Taxi and Limousine Commission (TLC) Trip Record Data. A crawler in AWS Glue detects the schema from DynamoDB and populates the AWS Glue Data Catalog with the metadata.
  2. An AWS Glue job then extracts the data from the DynamoDB table in Apache Parquet file format and stores it in S3. Parquet is a columnar storage file format available to projects in the Hadoop ecosystem, making queries more efficient in Athena.
  3. AWS Glue ETL jobs store the results in a predefined schema provided by the AWS Glue Data Catalog.
  4. Athena uses the AWS Glue Data Catalog as its external schema store for the newly extracted data in Amazon S3. You can perform analytics on the data by using Athena to query Amazon S3 directly using SQL.

Best practices for large-scale DynamoDB tables

AWS Glue ETL jobs use the concept of concurrent data processing units (DPUs) to describe the compute capacity of a job run. A single DPU provides four virtual central processing units (vCPUs) and 16 GB of memory. The AWS Glue DynamoDB reader also scans the DynamoDB table.

Although the following tips aren’t discussed in depth in this post, you should consider these best practices and scaling features for large-scale DynamoDB tables:

  1. Dial up the number of AWS Glue ETL job DPUs. This number depends on how much data is stored in DynamoDB and how much space in-memory the data will occupy in the AWS Glue ETL Apache Spark execution environment.
  2. Dial up your DynamoDB table’s read capacity units. This number depends on how much data is stored in the DynamoDB table and the time the AWS Glue ETL job needs to complete. For example, suppose that you have provisioned 100 read capacity units for your DynamoDB table. You can then perform 100 reads, or 409,600 bytes, per second. If that table contains 20 GB of data (21,474,836,480 bytes), and your job performs a full table scan using 100 percent of the read capacity units, you can approximate how long the job will take to run:
    21,474,836,480 / 409,600 = 52,429 seconds = 14.56 hours

    In this case, you might decide to increase the read capacity units to optimize the time it takes for the job to complete.

  1. Use AWS Glue PySpark extensions for connecting to DynamoDB. Specify the argument dynamodb.throughput.read.percent, and set it up or down. This setting specifies the read capacity units to use during a job run. By default, it is set to 0.5 or 50 percent. For more information, see Connection Types and Options for ETL in AWS Glue.

Avoid consuming all of the read capacity units of your table from your AWS Glue jobs so as not to impact production applications.

Tutorial

Before you get started with this tutorial, note the following prerequisites and caveats:

  1. AWS account – You must have an AWS account and an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and services listed in the “Solution overview.” Your IAM permissions must also include access to create IAM roles and policies created by the AWS CloudFormation template.
  2. AWS Management Console access – You use the console throughout this tutorial, and we provide screenshots to guide you along the way.
  3. AWS cost and usage – You are responsible for any AWS charges associated with performing this hands-on tutorial in your AWS account. In our tests, it cost approximately $2.00 to complete it, but the amount depends on your AWS Free Tier consumption for Amazon S3, DynamoDB, AWS Lambda, AWS Glue, and Athena. We have done as much as possible to reduce the resources you use in this tutorial. Follow the steps at the end to clean up the AWS resources you create to prevent any excess charges associated with the tutorial.

Launch the AWS CloudFormation template

Sign in to the console, and then choose the following Launch Stack button to create a CloudFormation stack in your AWS account. The button launches an AWS CloudFormation stack in your AWS account that automatically creates AWS resources for you. The stack includes a DynamoDB table loaded with sample data to work with throughout this tutorial.

Launch Stack button

The CloudFormation stack launches in the US East (N. Virginia) Region by default. If you choose to deploy the stack in a different Region, you must launch the template in an AWS Region that supports AWS Glue. See the Region Table for details about AWS Glue service availability by Region. You must choose a Region with AWS Glue service availability, or you will encounter errors as you proceed through this post.

Accept the defaults through each step of the wizard. The wizard took us about 4 minutes to complete, to give you an idea of time investment (not much). Refer to the Outputs tab, or copy and paste the information to a text editor for easy access later.

Console screen showing the outputs after the stack is created

Explore the DynamoDB table created by the AWS CloudFormation template

You can now explore data that was populated in the DynamoDB table that was created by the template. Go to the DynamoDBTableURL shown on the Outputs tab of the CloudFormation stack. The stack loads 50,000 rows into the DynamoDB table to use as sample data in this tutorial. Although this is a relatively small amount of data, it serves the purpose of demonstrating the mechanisms of extracting data from DynamoDB in this tutorial.

Crawl your DynamoDB table

To get started with using DynamoDB in your AWS Glue workflow, you use an AWS Glue crawler to populate metadata about the table in the AWS Glue Data Catalog automatically. A crawler connects to a data store and progresses through a prioritized list of classifiers to extract the schema of your data and other statistics. Finally, the AWS Glue crawler populates the Data Catalog with this metadata.

To create the AWS Glue crawler:

  1. Open the AWS Glue console.
  2. Choose Crawlers in the navigation pane, and then choose Add crawler.
  3. Enter a name for your crawler (for example, dynamodb table crawler) and choose Next.
  4. In the Choose a data store list, choose DynamoDB.
  5. Choose the small folder icon next to the Table name box, and choose the table that DynamoDB created. It should match the TableName shown in the AWS CloudFormation output. Click Choose to close the menu, and then choose Next.
  6. On the Add another data store page, leave the default setting as No, and then choose Next.
  7. On the Choose an IAM role page, click Choose an existing IAM role, and then choose the role that matches the IAMRoleName in the AWS CloudFormation output. Choose Next when you are done.
  8. On the Create a schedule for this crawler page, leave the default setting as Run on demand, and choose Next.
  9. You should now be on the Configure the crawler’s output In the Database list, choose the database that matches the GlueDatabaseName shown in the AWS CloudFormation output. Then choose Next.
  10. You should now see the page shown in the following screenshot. Review the values and choose Finish to create the crawler.

Screenshot of the confirmation screen after adding the AWS Glue crawler

Now, return to the Crawlers page. Choose the crawler that you just created, and then choose Run crawler.

Screenshot showing the name of the crawler just created

In a few minutes, the Status column for this table should change to Ready, and you should see 1 in the Tables added column. At this point, the AWS Glue crawler has automatically populated the Data Catalog with metadata from the DynamoDB table. You can proceed to the next step, or you can take a few minutes to investigate the newly added table by choosing Tables in the navigation pane.

ETL with DynamoDB sources

Now that the crawler has run against the DynamoDB table and created an entry for the table in the Data Catalog, you can create and run ETL jobs with the DynamoDB table as a source. In the following steps, you create an AWS Glue ETL job that extracts data from DynamoDB and stores the data in Apache Parquet file format in the Amazon S3 bucket:

  1. In the AWS Glue console, choose Jobs in the navigation pane, and then choose Add job.
  2. Give the job a name, such as dynamodb extract job.
  3. For the IAM role, choose the IAMRoleName shown in the AWS CloudFormation output.
  4. Choose A new script to be authored by you.
  5. Choose Security configuration, script libraries, and job parameters (optional), and update Concurrent DPUs per job run to 2. Then choose Next.
  6. On the Connections page, choose Next.
  7. Review the Job properties page, and choose Save job and edit script.
  8. Remove any existing code from the editor. Copy the following code and paste it into the editor, updating TYPE_YOUR_DYNAMODB_TABLE_NAME with your DynamoDB TableName from the AWS CloudFormation output. Then choose Save.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "aws-blogs-glue-database098234ytb2", table_name = "TYPE_YOUR_DYNAMODB_TABLE_NAME".lower().replace('-', '_'), transformation_ctx = "datasource0")
    
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dolocationid", "long", "dolocationid", "long"), ("fare_amount", "double", "fare_amount", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("lpep_dropoff_datetime", "string", "lpep_dropoff_datetime", "timestamp"), ("vendorid", "long", "vendorid", "long"), ("lpep_pickup_datetime", "string", "lpep_pickup_datetime", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("tripid", "long", "tripid", "long"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("trip_distance", "double", "trip_distance", "double"), ("trip_type", "long", "trip_type", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("payment_type", "long", "payment_type", "long"), ("total_amount", "double", "total_amount", "double"), ("extra", "double", "extra", "double"), ("tip_amount", "double", "tip_amount", "double"), ("mta_tax", "double", "mta_tax", "double"), ("pulocationid", "long", "pulocationid", "long")], transformation_ctx = "applymapping1")
    
    selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["dolocationid", "fare_amount", "ratecodeid", "lpep_dropoff_datetime", "vendorid", "lpep_pickup_datetime", "passenger_count", "tripid", "tolls_amount", "improvement_surcharge", "trip_distance", "trip_type", "store_and_fwd_flag", "payment_type", "total_amount", "extra", "tip_amount", "mta_tax", "pulocationid"], transformation_ctx = "selectfields2")
    
    resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "aws-blogs-glue-database098234ytb2", table_name = "ddb-target-s3-table", transformation_ctx = "resolvechoice3")
    
    resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_struct", transformation_ctx = "resolvechoice4")
    
    datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice4, database = "aws-blogs-glue-database098234ytb2", table_name = "ddb-target-s3-table", transformation_ctx = "datasink5")
    job.commit()
  1. Finally, choose Run job, and then choose Run job again in the next dialog box.

Your AWS Glue ETL job is now running. ETL jobs run on fully managed Apache Spark execution environments so that you don’t have to worry about provisioning virtual instances or clustered resources to run Spark jobs. You are charged only when the jobs are running.

In our environment, the job took around 17 minutes to complete, but this might vary for you. When your job says Succeeded in the Run status column on the Jobs page, you can proceed to the next section.

Analyze the AWS Glue ETL job results

If you have followed the post to this point, data from DynamoDB should now be extracted and stored in Amazon S3 in a compressed Parquet file format. AWS CloudFormation also created a table entry in AWS Glue for the resulting table data in Amazon S3, so you can analyze that data with Athena using standard SQL. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 by using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing data immediately.

Navigate to the Athena console. In the Database drop-down list, choose the GlueDatabaseName that is shown in the AWS CloudFormation output. You should see one table listed. This is a table of the data stored in Amazon S3 so that you can analyze it using standard SQL.

At this point, you should be on a page similar to the one shown in the following screenshot.

Screenshot of the Athena console showing the database just created

Now, try the following analytical queries against this data. In the Athena query pane, paste the following separate SQL statements. Choose Run query to run each query, and view the output under Results.

Example query 1: Analyze long taxi trips in NYC organized by routes.

SELECT MAX(trip_distance) AS "max trip distance",
       CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4)) as "route"
FROM "aws-blogs-glue-database098234ytb2"."ddb-target-s3-table"
GROUP BY CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4))
ORDER BY "max trip distance" DESC
LIMIT 10;

Example results:

Screenshot of the output from the previous query showing maximum trip distance and routes

Example query 2: Analyze popular taxi trip routes in NYC.

SELECT (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4))) as "route",
       COUNT(tripid) AS "count of trips on route"
FROM "aws-blogs-glue-database098234ytb2"."ddb-target-s3-table"
GROUP BY (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4)))
ORDER BY "count of trips on route" DESC
LIMIT 10;

Example results:

Screenshot of the previous query's output showing routes and count of trips on route

Example query 3: Analyze valuable taxi trip routes in NYC based on dollars per mile.

Do the results make sense? Are there possible outliers or errors in the data? What values should you look at to find anomalies or one-offs? Hint: Do these expensive trips happen often?

SELECT (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4))) as "route", SUM(total_amount) AS "total dollars on route", (SUM(total_amount)/ SUM(trip_distance)) AS "dollars per mile on route", SUM(trip_distance) as "total miles traveled on route"
FROM "aws-blogs-glue-database098234ytb2"."ddb-target-s3-table"
GROUP BY (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4)))
ORDER BY "dollars per mile on route" DESC
LIMIT 50;

Example results:

Results of the previous query showing routes, total dollars on route, dollars per mile on route, and total miles traveled on route

Example query 4: Determine how often trips occur in the sample.

Add the COUNT column that you produced earlier to the query to determine whether trips occur very few times in this sample.

SELECT (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4))) as "route", SUM(total_amount) AS "total dollars on route", (SUM(total_amount)/ SUM(trip_distance)) AS "dollars per mile on route", SUM(trip_distance) as "total miles traveled on route", COUNT(tripid) AS "count of trips on route"
FROM "aws-blogs-glue-database098234ytb2"."ddb-target-s3-table"
GROUP BY (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4)))
ORDER BY "dollars per mile on route" DESC
LIMIT 50;

Example results:

Output of previous query showing the count of trips on each route

Example query 5: Analyze trips that occur more frequently, and filter for trips that occur 15 or more times.

SELECT (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4))) as "route",
       SUM(total_amount) AS "total dollars on route",
       (SUM(total_amount)/ SUM(trip_distance)) AS "dollars per mile on route",
       SUM(trip_distance) as "total miles traveled on route",
       COUNT(tripid) AS "count of trips on route"
FROM "aws-blogs-glue-database098234ytb2"."ddb-target-s3-table"
GROUP BY (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4)))
HAVING COUNT(tripid) >= 15
ORDER BY "dollars per mile on route" DESC
LIMIT 50;

Example results:

Output from previous query showing routes that have occurred more than 15 times

Example query 6: Finally, drill down into the individual trips of the “most valuable” route.

SELECT (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4))) as "route",
       total_amount,
       trip_distance
FROM "aws-blogs-glue-database098234ytb2"."ddb-target-s3-table"
WHERE (CAST(pulocationid as varchar(4)) || ' to ' || CAST(dolocationid as varchar(4))) = '188 to 188';

Example results:

Output of the previous query showing the most valuable routes

To better interpret the results of these example queries, join the results of the preceding queries with the Taxi Zone Lookup Table provided on NYC.gov to learn where exactly these most valuable routes are located in New York. You can upload this comma-separated values (CSV) file to Amazon S3 and use Athena to combine the data to gather more details. You also can look at other metrics, such as the sum of the total charges on the routes, instead of dollars per mile.

Cleanup

After you explore and analyze the data, you should clean up the resources used in this tutorial to prevent any reoccurring AWS costs.

To clean up the resources, navigate to the Amazon S3 console and empty the Amazon S3 bucket that was shown in the AWS CloudFormation output. Then, navigate to the AWS CloudFormation console, and delete the AWS CloudFormation stack.

Conclusion

Extracting data from DynamoDB tables in Amazon S3 is a common practice for complex analytics on large datasets. In this blog post, we showed how to extract DynamoDB data to Amazon S3 for analytics by using AWS Glue. AWS Glue can crawl a DynamoDB table and specify it as a source for AWS Glue ETL jobs. Using the sample PySpark code, you extracted the data from DynamoDB into Amazon S3 in Parquet file format. You then used standard SQL to query the data with Amazon Athena.

You can reuse the structure of this solution for your own DynamoDB analytics architectures. We hope that you found this post useful, and we can’t wait to see how you use DynamoDB and AWS Glue in your architecture.

About the Authors

Photo of LaithLaith Al-Saadoon is a solutions architect with a focus on data analytics at Amazon Web Services. He spends his days obsessing over designing customer architectures to process enormous amounts of data at scale. In his free time, he follows the latest in machine learning and artificial intelligence.

 

 

 

Photo of GhiyathGhiyath Alazzah is a software development engineer for AWS Glue. He likes to tackle problems of scale, specifically in the fields of data analytics and processing. In his spare time, he listens to and plays music.