AWS Database Blog

Implementing bulk CSV ingestion to Amazon DynamoDB

This post reviews what solutions exist today for ingesting data into Amazon DynamoDB. It also presents a streamlined solution for bulk ingestion of CSV files into a DynamoDB table from an Amazon S3 bucket and provides an AWS CloudFormation template of the solution for easy deployment into your AWS account.

Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. Today, hundreds of thousands of AWS customers have chosen to use DynamoDB for mobile, web, gaming, ad tech, IoT, and other applications that need low-latency data access. A popular use case is implementing bulk ingestion of data into DynamoDB. This data is often in CSV format and may already live in Amazon S3. This bulk ingestion is key to expediting migration efforts, alleviating the need to configure ingestion pipeline jobs, reducing the overall cost, and simplifying data ingestion from Amazon S3.

In addition to DynamoDB, this post uses the following AWS services at a 200–300 level to create the solution:

  • Amazon S3
  • AWS Lambda
  • AWS CloudFormation
  • AWS Identity and Access Management (IAM)

Prerequisites

To complete the solution in this post, you need the following:

  • An AWS account.
  • An IAM user with access to DynamoDB, Amazon S3, Lambda, and AWS CloudFormation.

Current solutions for data insertion into DynamoDB

There are several options for ingesting data into Amazon DynamoDB. The following AWS services offer solutions, but each poses a problem when inserting large amounts of data:

  • AWS Management Console – You can manually insert data into a DynamoDB table via the AWS Management Console. Under Items, choose Create item, and add each key-value pair by hand. For bulk ingestion, this is prohibitively time-consuming, so the console is not a viable option.
  • AWS CLI – Another option is using the AWS CLI to load data into a DynamoDB table. However, this process requires that your data is in JSON format. Although JSON data is represented as key-value pairs and is therefore ideal for non-relational data, CSV files are more commonly used for data exchange. Therefore, if you receive bulk data in CSV format, you cannot easily use the AWS CLI for insertion into DynamoDB.
  • AWS Data Pipeline – You can import data from Amazon S3 into DynamoDB using AWS Data Pipeline. However, this solution requires several prerequisite steps to configure Amazon S3, AWS Data Pipeline, and Amazon EMR to read and write data between DynamoDB and Amazon S3. In addition, this solution can become costly because you incur additional costs for using these three underlying AWS services.
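To make the AWS CLI's JSON requirement concrete, the following sketch (an illustration only, not part of the solution template; the `Music` table name and column names are hypothetical) converts CSV rows into the DynamoDB JSON format that `aws dynamodb batch-write-item` expects:

```python
import csv
import io
import json

def csv_to_batch_write_request(csv_text, table_name):
    """Convert CSV text into the request JSON that
    `aws dynamodb batch-write-item --request-items file://items.json` expects.
    Every value is written as a DynamoDB string attribute ("S") for simplicity."""
    rows = csv.DictReader(io.StringIO(csv_text))
    put_requests = [
        {"PutRequest": {"Item": {col: {"S": val} for col, val in row.items()}}}
        for row in rows
    ]
    return {table_name: put_requests}

# Hypothetical two-column CSV:
csv_text = "uuid,title\n1,Song A\n2,Song B\n"
request = csv_to_batch_write_request(csv_text, "Music")
print(json.dumps(request, indent=2))
```

Note that `batch-write-item` accepts at most 25 items per request, so a large CSV file forces you to script the conversion and batching yourself — exactly the overhead the CLI approach imposes.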

Bulk CSV ingestion from Amazon S3 into DynamoDB

There is now a more efficient, streamlined solution for bulk ingestion of CSV files into DynamoDB. Follow the instructions to download the CloudFormation template for this solution from the GitHub repo. The template deploys the following resources:

  • A private S3 bucket configured with an S3 event trigger upon file upload
  • A DynamoDB table configured with on-demand read/write capacity mode
  • A Lambda function with a timeout of 15 minutes, which contains the code to import the CSV data into DynamoDB
  • All associated IAM roles needed for the solution, configured according to the principle of least privilege
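The GitHub repo contains the actual import function; as a rough sketch of how such a Lambda handler might work (the `TABLE_NAME` environment variable, function names, and structure are assumptions, not the repo's code):

```python
import csv
import os

def parse_csv_rows(body_text):
    """Parse CSV text into a list of dicts, one per row, keyed by the
    header names (the first column holds the partition key)."""
    return list(csv.DictReader(body_text.splitlines()))

def lambda_handler(event, context):
    # boto3 ships with the Lambda runtime; imported here so the pure
    # parsing helper above can run without AWS credentials.
    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])

    # The S3 event trigger identifies the uploaded object.
    record = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=record["bucket"]["name"],
                        Key=record["object"]["key"])
    body = obj["Body"].read().decode("utf-8")

    # batch_writer buffers rows into 25-item BatchWriteItem calls
    # and retries unprocessed items automatically.
    with table.batch_writer() as batch:
        for row in parse_csv_rows(body):
            batch.put_item(Item=row)
```

The `batch_writer` context manager is what makes this efficient relative to item-by-item `put_item` calls: it handles batching and retry of unprocessed items for you.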

To ingest the data, complete the following steps:

  1. On the AWS CloudFormation console, choose Create stack.
  2. Choose With new resources (standard).
  3. In the Specify template section, for Template source, choose Upload a template file.
  4. Choose Choose File.
  5. Choose the CloudFormation template file you downloaded previously.
  6. Choose Next.
  7. In the Specify stack details section, for Stack name, enter a name for your stack.
  8. For Parameters, enter values for the following parameters:
    • BucketName – S3 bucket name where you upload your CSV file.
      The bucket name must be lowercase and globally unique, or the stack creation fails.
    • DynamoDBTableName – DynamoDB table name destination for imported data.
    • FileName – CSV file name ending in .csv that you upload to the S3 bucket for insertion into the DynamoDB table.
  9. Choose Next.
  10. Choose Next again.
  11. Select I acknowledge that AWS CloudFormation might create IAM resources.
  12. Choose Create stack.
  13. When the stack creation is complete, navigate to your newly created S3 bucket and upload your CSV file.
    The upload triggers the import of your data into DynamoDB. However, you must make sure that your CSV file adheres to the following requirements:
  • Structure your input data so that the partition key is located in the first column of the CSV file. Make sure that the first column of your CSV file is named uuid. For more information about selecting a partition key according to best practices, see Choosing the Right DynamoDB Partition Key.
  • Confirm that the name of your uploaded CSV file, including the .csv suffix, exactly matches the FileName value that you entered previously.
    For a file with 100,000 rows, the import should take approximately 80 seconds. The Lambda function timeout can accommodate roughly 1 million rows of data; for larger files, break the CSV file into smaller chunks. Additionally, this solution does not guarantee the order in which data is imported into the DynamoDB table. If the execution fails, make sure that your environment variables are set correctly, as specified earlier. You can also check the error messages in the Lambda function console.
  14. On the DynamoDB console, choose Tables.
  15. Choose the table you entered in your CloudFormation template for DynamoDBTableName.
    You can now view your imported data and associated table details.
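As noted above, files much beyond 1 million rows should be broken into smaller chunks before upload. A small local helper along these lines (an illustration; the chunk size and output file naming are arbitrary choices) splits a large CSV while repeating the header row in each chunk:

```python
import csv

def split_csv(in_path, rows_per_chunk=100_000):
    """Split a large CSV into numbered chunk files, each small enough for
    one Lambda invocation, repeating the header row in every chunk."""
    chunk_paths = []
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        writer, out = None, None
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out:
                    out.close()
                path = f"{in_path}.part{i // rows_per_chunk}.csv"
                chunk_paths.append(path)
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)  # keep the header in every chunk
            writer.writerow(row)
        if out:
            out.close()
    return chunk_paths
```

How the chunks interact with the FileName parameter depends on the deployed template; one simple approach is to upload each chunk under the configured name, one at a time, waiting for each import to finish.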

Conclusion

This post discussed the common use case of ingesting large amounts of data into Amazon DynamoDB and reviewed options for ingestion available as of this writing. The post also provided a streamlined, cost-effective solution for bulk ingestion of CSV data into DynamoDB that uses a Lambda function written in Python.

Download the CloudFormation template from the GitHub repo to build and use this solution. This CloudFormation stack launches several resources, including a Lambda function, S3 bucket, DynamoDB table, and several IAM roles. Please keep in mind that you incur charges for using these services as part of launching the template and using this solution. These costs also increase as your input file size grows. To reduce costs, consider selecting provisioned write capacity rather than on-demand (which this post used to speed up the data import) for your DynamoDB table.

You can now easily ingest large datasets into DynamoDB in a more efficient, cost-effective, and straightforward manner. If you have any questions or suggestions, please leave a comment.

About the Authors

Julia Soscia is a Solutions Architect with Amazon Web Services, based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Analytics and Machine Learning.

Anthony Pasquariello is a Solutions Architect with Amazon Web Services, based in New York City. He is focused on providing customers technical consultation during their cloud journey, especially around security best practices. He holds his Master’s degree in electrical & computer engineering from Boston University. Outside of technology, he enjoys writing, being in nature, and reading philosophy.

Nir Ozeri is a Solutions Architect with Amazon Web Services.