Implementing bulk CSV ingestion to Amazon DynamoDB
June 2023: Amazon DynamoDB can now import Amazon S3 data into a new table. DynamoDB import from S3 helps you to bulk import terabytes of data from Amazon S3 into a new DynamoDB table with no code or servers required.
November 2022: This post was reviewed and updated for accuracy.
This post reviews the solutions that exist today for ingesting data into Amazon DynamoDB. It also presents a streamlined solution for bulk ingestion of CSV files into a DynamoDB table from an Amazon S3 bucket and provides an AWS CloudFormation template of the solution for easy deployment into your AWS account. Note that the solution is not designed to write to an existing table unless that table has a matching uuid primary key, in which case any new column names are added and data is written from the last row onward.
Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. Today, hundreds of thousands of AWS customers have chosen to use DynamoDB for mobile, web, gaming, ad tech, IoT, and other applications that need low-latency data access. A popular use case is implementing bulk ingestion of data into DynamoDB. This data is often in CSV format and may already live in Amazon S3. This bulk ingestion is key to expediting migration efforts, alleviating the need to configure ingestion pipeline jobs, reducing the overall cost, and simplifying data ingestion from Amazon S3.
In addition to DynamoDB, this post uses the following AWS services at a 200–300 level to create the solution:
- Amazon S3
- AWS Lambda
- AWS CloudFormation
To complete the solution in this post, you need the following:
- An AWS account.
- An IAM user with access to DynamoDB, Amazon S3, Lambda, and AWS CloudFormation.
Current solutions for data insertion into DynamoDB
There are several options for ingesting data into Amazon DynamoDB. The following AWS services offer solutions, but each poses a problem when inserting large amounts of data:
- AWS Management Console – You can manually insert data into a DynamoDB table through the AWS Management Console. Under Items, choose Create item. You must add each key-value pair by hand, which is too time-consuming for bulk ingestion. Therefore, ingesting data via the console is not a viable option.
- AWS CLI – Another option is using the AWS CLI to load data into a DynamoDB table. However, this process requires that your data is in JSON format. Although JSON data is represented as key-value pairs and is therefore ideal for non-relational data, CSV files are more commonly used for data exchange. Therefore, if you receive bulk data in CSV format, you cannot easily use the AWS CLI for insertion into DynamoDB.
- AWS Data Pipeline – You can import data from Amazon S3 into DynamoDB using AWS Data Pipeline. However, this solution requires several prerequisite steps to configure Amazon S3, AWS Data Pipeline, and Amazon EMR to read and write data between DynamoDB and Amazon S3. In addition, this solution can become costly because you incur additional costs for using these three underlying AWS services.
February 2023 Update: Console access to the AWS Data Pipeline service will be removed on April 30, 2023. On this date, you will no longer be able to access AWS Data Pipeline through the console. You will continue to have access to AWS Data Pipeline through the command line interface and API. Please note that the AWS Data Pipeline service is in maintenance mode and we are not planning to expand the service to new regions. For information about migrating from AWS Data Pipeline, please refer to the AWS Data Pipeline migration documentation.
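To illustrate the extra step the AWS CLI route involves, the following is a minimal sketch of converting CSV rows into the request format accepted by `aws dynamodb batch-write-item --request-items file://...`. It assumes every value can be stored as a string attribute and that the CSV is well formed; the function name `csv_to_batch_requests` and the table name `MyTable` are hypothetical. Each request holds at most 25 items, which is the `BatchWriteItem` limit.

```python
import csv
import io
import json


def csv_to_batch_requests(csv_text, table_name, batch_size=25):
    """Yield request-items dicts, each holding at most batch_size put requests."""
    reader = csv.DictReader(io.StringIO(csv_text))
    batch = []
    for row in reader:
        # Assumption: every value is written as a DynamoDB string ("S") attribute.
        # Empty cells and unnamed extra columns are skipped.
        item = {
            name: {"S": value}
            for name, value in row.items()
            if name is not None and value
        }
        batch.append({"PutRequest": {"Item": item}})
        if len(batch) == batch_size:
            yield {table_name: batch}
            batch = []
    if batch:
        yield {table_name: batch}


sample = "uuid,name\n1,Ada\n2,Grace\n"
for request in csv_to_batch_requests(sample, "MyTable"):
    # Each printed line could be saved to a file and passed to the CLI.
    print(json.dumps(request))
```

Even with a helper like this, each batch needs its own CLI call, which is part of why the CLI is awkward for bulk CSV data.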
Bulk CSV ingestion from Amazon S3 into DynamoDB
There is now a more efficient, streamlined solution for bulk ingestion of CSV files into DynamoDB. To learn more, refer to Amazon DynamoDB can now import Amazon S3 data into a new table.
You can use the solution presented in this post to import CSV data to an existing DynamoDB table. Follow the instructions to download the CloudFormation template for this solution from the GitHub repo. The template deploys the following resources:
- A private S3 bucket configured with an S3 event trigger upon file upload
- A DynamoDB table in on-demand read/write capacity mode
- A Lambda function with a timeout of 15 minutes, which contains the code to import the CSV data into DynamoDB
- All associated IAM roles needed for the solution, configured according to the principle of least privilege
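The way these resources fit together can be sketched as follows. This is not the Lambda code shipped in the template, only an illustration under stated assumptions: the environment variable `TABLE_NAME` and the helper `import_csv` are hypothetical names, and every CSV value is written as a string. The sketch uses the boto3 `Table.batch_writer()` helper, which buffers puts and sends them in batches.

```python
import csv
import io
import os
from urllib.parse import unquote_plus


def import_csv(csv_text, table):
    """Write every CSV row to table and return the number of rows written."""
    reader = csv.DictReader(io.StringIO(csv_text))
    count = 0
    # batch_writer() groups put requests into batches and resends unprocessed items.
    with table.batch_writer() as writer:
        for row in reader:
            writer.put_item(Item=row)
            count += 1
    return count


def lambda_handler(event, context):
    # Imported here so import_csv can be exercised without AWS access.
    import boto3

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded.
    key = unquote_plus(record["object"]["key"])
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    # TABLE_NAME is a hypothetical environment variable for this sketch.
    table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])
    return {"rows_written": import_csv(body.decode("utf-8"), table)}
```

Keeping the row-writing logic in its own function makes it possible to test against a stand-in table object before deploying anything.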
To ingest the data, complete the following steps:
- On the AWS CloudFormation console, choose Create stack.
- Choose With new resources (standard).
- In the Specify template section, for Template source, choose Upload a template file.
- Choose Choose File.
- Choose the CloudFormation template file you downloaded previously.
- Choose Next.
- In the Specify stack details section, for Stack name, enter a name for your stack.
- For Parameters, enter values for the following parameters:
- BucketName – S3 bucket name where you upload your CSV file.
The bucket name must be a lowercase, unique value, or the stack creation fails.
- DynamoDBTableName – DynamoDB table name destination for imported data.
- FileName – CSV file name ending in .csv that you upload to the S3 bucket for insertion into the DynamoDB table.
- Choose Next.
- Choose Next again.
- Select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create Stack.
- When the stack is complete, navigate to your newly created S3 bucket and upload your CSV file.
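If you prefer to script the stack creation, the console steps above can be approximated with the boto3 CloudFormation client. The following is a sketch under assumptions: the helper names are hypothetical, the parameter keys are taken from the list above, and `CAPABILITY_IAM` is assumed to correspond to the IAM acknowledgment check box (a template that names its IAM resources would need `CAPABILITY_NAMED_IAM` instead).

```python
def stack_parameters(bucket_name, table_name, file_name):
    """Build the Parameters list for create_stack, checking the documented constraints."""
    if bucket_name != bucket_name.lower():
        raise ValueError("BucketName must be lowercase")
    if not file_name.endswith(".csv"):
        raise ValueError("FileName must end in .csv")
    return [
        {"ParameterKey": "BucketName", "ParameterValue": bucket_name},
        {"ParameterKey": "DynamoDBTableName", "ParameterValue": table_name},
        {"ParameterKey": "FileName", "ParameterValue": file_name},
    ]


def create_stack(cloudformation, stack_name, template_body, parameters):
    """Create the stack with a boto3 CloudFormation client passed in by the caller."""
    return cloudformation.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=parameters,
        # Assumption: equivalent to acknowledging that IAM resources might be created.
        Capabilities=["CAPABILITY_IAM"],
    )
```

A caller would pass `boto3.client("cloudformation")` as the first argument to `create_stack`, along with the contents of the downloaded template file.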
The upload triggers the import of your data into DynamoDB. However, you must make sure that your CSV file adheres to the following requirements:
- Structure your input data so that the partition key is located in the first column of the CSV file. Make sure that the first column of your CSV file is named uuid. For more information about selecting a partition key according to best practices, see Choosing the Right DynamoDB Partition Key.
- Confirm that your CSV file name, including the .csv suffix, exactly matches the file name that you entered previously.
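Both requirements can be checked locally before uploading. The following sketch uses a hypothetical helper, `check_csv`, that returns a list of problems. It assumes the file is plain UTF-8 text without a byte order mark.

```python
import csv
import io


def check_csv(csv_text, uploaded_name, expected_name):
    """Return a list of problems; an empty list means the file looks importable."""
    problems = []
    if not expected_name.endswith(".csv"):
        problems.append("expected file name must end in .csv")
    if uploaded_name != expected_name:
        problems.append(
            f"file name {uploaded_name!r} does not match {expected_name!r}"
        )
    # Read only the header row; an empty file yields an empty header.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    if not header or header[0] != "uuid":
        problems.append("first column must be named uuid")
    return problems
```

For example, `check_csv(open("data.csv").read(), "data.csv", "data.csv")` would return an empty list for a file whose first header cell is `uuid`.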
For a 100,000-row file, this execution should take approximately 80 seconds. The Lambda function timeout can accommodate about 1 million rows of data; however, you should break up the CSV file into smaller chunks. Additionally, this solution does not guarantee the order of data imported into the DynamoDB table. If the execution fails, make sure that you have created and set your environment variables correctly, as specified earlier. You can also check the error messages in the Lambda function console.
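The 1 million row figure follows from the numbers above: 100,000 rows in 80 seconds is 1,250 rows per second, and a 15-minute timeout is 900 seconds, which gives roughly 1,125,000 rows. The following sketch reproduces that estimate and shows one possible way to split a large file into smaller chunks that each repeat the header row. The function names are hypothetical, and the throughput is only the approximate figure quoted in this post.

```python
import csv
import io

ROWS_PER_SECOND = 100_000 / 80  # approximate throughput quoted above
TIMEOUT_SECONDS = 15 * 60       # Lambda timeout deployed by the template


def max_rows_per_invocation():
    """Estimate how many rows fit in one invocation at the quoted throughput."""
    return int(ROWS_PER_SECOND * TIMEOUT_SECONDS)


def _to_csv(header, rows):
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


def split_csv(csv_text, rows_per_chunk):
    """Yield CSV strings that repeat the header and hold at most rows_per_chunk rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to split
    rows = []
    for row in reader:
        rows.append(row)
        if len(rows) == rows_per_chunk:
            yield _to_csv(header, rows)
            rows = []
    if rows:
        yield _to_csv(header, rows)


print(max_rows_per_invocation())  # 1125000
```

Because the uploaded file name must match the FileName parameter, each chunk would presumably need to be uploaded under that name in turn.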
- On the DynamoDB console, choose Tables.
- Choose the table you entered in your CloudFormation template for
You can now view your imported data and associated table details.
This post discussed the common use case of ingesting large amounts of data into Amazon DynamoDB and reviewed options for ingestion available as of this writing. The post also provided a streamlined, cost-effective solution for bulk ingestion of CSV data into DynamoDB that uses a Lambda function written in Python.
Download the CloudFormation template from the GitHub repo to build and use this solution. This CloudFormation stack launches several resources, including a Lambda function, S3 bucket, DynamoDB table, and several IAM roles. Please keep in mind that you incur charges for using these services as part of launching the template and using this solution. These costs also increase as your input file size grows. To reduce costs, consider selecting provisioned write capacity rather than on-demand (which this post used to speed up the data import) for your DynamoDB table.
About the Authors
Julia Soscia is a Solutions Architect with Amazon Web Services, based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Analytics and Machine Learning.
Anthony Pasquariello is a Solutions Architect with Amazon Web Services, based in New York City. He is focused on providing customers technical consultation during their cloud journey, especially around security best practices. He holds his Master’s degree in electrical & computer engineering from Boston University. Outside of technology, he enjoys writing, being in nature, and reading philosophy.
Nir Ozeri is a Solutions Architect with Amazon Web Services.