AWS Database Blog

Migrate Delimited Files from Amazon S3 to an Amazon DynamoDB NoSQL Table Using AWS Database Migration Service and AWS CloudFormation

Recently, AWS Database Migration Services (AWS DMS) added support for using Amazon S3 as a source for your database migration. This new support means that you can now load data in comma-separated value (CSV) format from S3 into any supported target, whether or not the target has native S3 support.

In most cases, when you are migrating to a new database, you have access to your source database and can use the database directly as a source. Sometimes, however, you might not have access to the source directly. In other cases, the source is really old, or possibly unsupported. In these cases, if you can export the data in CSV format, you can still migrate, or replatform, your data.

In this blog post, we show you how to use Amazon S3 as a source and push a file with 4 million rows of data into Amazon DynamoDB using AWS DMS. We use AWS CloudFormation to create all of the resources, and initiate the task. You can find all of the code samples used today in this repository.

Before you begin working with DMS, you need at least two AWS Identity and Access Management (IAM) service roles with sufficient permissions to access the resources in your account. These are the dms-cloudwatch-logs-role for pushing logs to Amazon CloudWatch and the dms-vpc-role for use by the DMS service. For more information on these, see Creating the IAM Roles to Use with the AWS CLI and AWS DMS API in the DMS documentation.

When you are working with S3 as a source, check that the S3 bucket is in the same AWS Region as the replication instance you are using to migrate. In addition, the AWS account you are using for the migration must have read access to the source bucket. Finally, the role assigned to the user account creating the migration task must have S3 permissions for GetObject and ListBucket.

When working with a DynamoDB database as a target for AWS DMS, make sure that your IAM role allows AWS DMS to assume and grant access to the DynamoDB tables that are being migrated into. If you are using a separate role, make sure that you allow the DMS service to perform AssumeRole. The user account creating the migration task must be able to perform the DynamoDB actions PutItem, CreateTable, DescribeTable, DeleteTable, DeleteItem, and ListTables.

For the examples in this blog post, we use a role with the S3, DynamoDB, and DMS permissions listed preceding, as shown in the following image.

Test data
You can work through the examples in this post easily using test data readily available on the web. IMDB datasets are available for noncommercial or personal use from a requester pays S3 bucket. When you use your account and access keys to download a file from this bucket, your account is charged for the data transfer and request costs.

As you work through all of the examples in this post, you can use the IMDB file title.basics.tsv.gz downloaded from imdb-datasets/documents/v1/current/. The repository includes a short shell script that downloads the file and pushes it to the S3 bucket of your choice. Simply change the BUCKETKEY variable to your bucket and the AWSPROFILE variable to your profile name (possibly default). As an alternative, modify the code samples in the repository to work with files of your choice. The following image shows the shell script that downloads the file from IMDB.

For the examples in this blog post, you work with one of the IMDB files, title.basics.csv. The file contains about 4 million rows and is roughly 367.6 MB. You load it directly from your S3 bucket into a target DynamoDB table, which you can precreate. By precreating the table, you can provision higher than the default write throughput for faster loading of the data.

S3 as a source
To use S3 as a source for DMS, the source data files must be in CSV format. To load the file title.basics.csv from your S3 bucket, you need to provide a few things to DMS. These are a JSON mapping for the table, the bucket name, and a role with sufficient permissions to access that bucket.

JSON mapping
In the table mapping JSON file, for the first property, you need to tell DMS how many tables you are loading. In this example, you load only one table, so enter the value of 1.

The second property is Tables, and it contains an array of table definitions that matches that counter. To load your single table title_basics, tell DMS your table name, the path to the table from the S3 bucket, the name of your table owner, and how many columns in the table. You also provide a description of those columns. The following figure shows the sample JSON with the TableColumns property collapsed.

In the TableColumns property, you define each of the columns that correspond to your .csv file. In the example, there are nine string type columns of varying length that make up the file and ultimately the table. There are common properties among the columns that include name, type, and length.

For the column that you want to use as the partition key, you specify two additional properties indicating that the column required is nullable (ColumnNullable) and that it serves as the partition key (ColumnIsPK). The following screenshot shows title_basics column definitions.

Creating the S3 endpoint by using the console
If you are going to configure your S3 endpoint from the console, navigate to Database Migration Services, Endpoints, and choose the blue Create endpoint button. You can then enter the following information:

  • The source
  • A name for your endpoint
  • S3 as the source engine
  • The Amazon Resource Name (ARN) of the service role with access to the S3 bucket
  • The S3 bucket name
  • The JSON table structure that you defined

The following shows a completed console form example.

The title.basics.csv file is actually a tab-delimited file. For DMS to interpret it correctly, you need to open the Advanced property section and enter the extra connection attributes value of csvDelimiter=/t into the box. For more information about advanced properties, see Extra Connection Attributes for S3 as a Source for AWS DMS in the DMS documentation. The extra connection attributes box is shown following.

Creating the S3 endpoint by using a template
To use CloudFormation to create the same resource that we just created from the console preceding, you enter the information into a Resource section of your template using the type AWS::DMS::Endpoint. In the template, you don’t have to include a name for your endpoint. Instead, you can have it inherit the name from the stack resource label. You can read more about CloudFormation in Getting Started with AWS CloudFormation: Learn Template Basics in the CloudFormation documentation.

In your template, you have to provide all of the other information that is required in the console, including the table definition, service role ARN, bucket name, and CSV delimiter. You enter the table definition as escaped JSON in a single line. The following screenshot shows a CloudFormation template with an S3 endpoint.

DynamoDB as a target
When AWS DMS creates tables on an Amazon DynamoDB target endpoint, it sets several Amazon DynamoDB parameter values. The cost for the table creation depends on the amount of data and the number of tables to be migrated.

When AWS DMS sets Amazon DynamoDB parameter values for a migration task, the default value for the Read Capacity Units (RCU) parameter is set to 200.

The Write Capacity Units (WCU) parameter value is also set, but its value depends on several other settings. The default value for the WCU parameter is 200. To read more about how DMS calculates WCU and how to influence it, see Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service in the DMS documentation.

Creating the DynamoDB endpoint by using the console
If you are going to configure your S3 endpoint from the console, navigate to Database Migration Services, Endpoints, and choose the blue Create endpoint button. You can then enter the following information:

  • A name for your endpoint
  • The target
  • DynamoDB as the target engine
  • The ARN of the service role with permissions for DynamoDB

The following shows a completed console form example.

Creating the DynamoDB endpoint by using a template
Just as with the S3 endpoint example, to use CloudFormation you enter the information into another Resource section of your template using the type AWS::DMS::Endpoint. Your endpoint name inherits its name from the stack resource label.

You need to provide the service role ARN, the engine name as DYNAMODB, and the endpoint type of target. An example is shown following.

CloudFormation orchestration
AWS CloudFormation makes deploying a set of Amazon Web Services (AWS) resources as simple as submitting a template. A template is a simple text file that describes a stack, a collection of AWS resources that you want to deploy together as a group. You use the template to define all the AWS resources you want in your stack.

For this example, in addition to the endpoints that you created, you also need a replication instance and a migration task. CloudFormation takes the template and creates all of the resources specified in the right order and starts the migration task.

In its simplest form, after you’ve completed the prerequisites and uploaded your files, a migration is essentially four steps:

  1. Create a replication instance
  2. Create a source endpoint
  3. Create a target endpoint
  4. Create and run a replication task

These four pieces are what your CloudFormation template contains in the Resources section:

  1. DMSReplicationInstance
  2. DMSEndpointS3
  3. DMSEndpointDynamoDB
  4. DMSTaskMigration

The following screenshot shows the CloudFormation template structure.

Creating a replication instance by using a template
To specify a replication instance resource in a CloudFormation template, use the type AWS::DMS::ReplicationInstance. You specify how much allocated storage that you want the replication instance to have, how big you want the instance to be, and whether you want the instance to be publicly accessible. An example is shown following.

You also need to indicate the security group that you want the replication instance to use. For more information, see Setting Up a Network for Database Migration in the DMS documentation.

Creating a replication task by using a template
The migration task is where all of your work is done. The task takes the replication instance and the two endpoints and uses those resources to tell DMS how to move your data and what settings to use.

In the following image, the replicationInstanceArn, SourceEndpointArn, and TargetEndpointArn have Ref: values that point to the other Resource entries within the template. CloudFormation detects these names and when the resources are instantiated, uses the resulting ARN values to build the DMS task.

The replication task settings, and the table mappings, are pulled in by using an escaped JSON string. You can find both the formatted raw JSON strings and the escaped versions in the repository with the rest of the code samples for you to review. There’s also a simple Python script that takes the formatted JSON files and transform them into escaped one-liners for pulling into CloudFormation templates. If you want to make changes to the JSON settings, you might find it easier to change them in the formatted file and then transform them.

Working with parameters and metadata by using a template
Let’s add just a quick word about parameters used in the CloudFormation template example. Both the parameter section and the metadata section are optional in a CloudFormation template. However, parameters can help significantly simplify items that commonly change in a template. Metadata can simplify how a template is visually laid out on the screen after upload.

CloudFormation template parameters are used in this example to identify the values that change when you run this template in your account. Items like the security group for the replication instance, the S3 bucket name, and the service role ARN to use for accessing resources change, because they are specific to your account. Additionally, providing a list of replication instance sizes and a default value helps make it easier to see at a glance what choices are available.

An example of CloudFormation parameters is shown following.

CloudFormation metadata information is used in this template to help organize the information on the console screen for your user. An example is shown following.

Creating a stack
Creating a stack from your CloudFormation template is as easy as uploading your template to S3 through the console, filling in a few values, and then choosing Create.

Start by opening the CloudFormation console and choosing the blue Create stack button. Next, choose Upload a template to S3, and choose the file cloudformation-dms-migration-s3-dynamodb.yaml from your local drive. Your template loads on the screen and looks similar to the following image.

When you’ve completed the form with the proper values for instance size, security group, S3 bucket name, and service role ARN, you can then choose Next, Next, and Create to launch your stack. The stack instantiates your resources, starts your task, and migrates your data. Your stack indicates when all resources have been launched both overall and in the Events panel, shown following.

Once your stack indicates that creation is complete, you can observe your task execution either in the DMS console or the CloudWatch console. The DMS console is shown following.

A completed load in the CloudWatch logs for the DMS task shows that more than 4 million rows were successfully loaded. The CloudWatch task completion message is shown following.

Now that you have a CloudFormation template for loading data using DMS from S3, you can extend this example to load your own data. Add multiple tables to your ExternalTableDefinition and let DMS load multiple tables at one time. Change the target endpoint from DynamoDB to Amazon Aurora with PostgreSQL compatibility, or to Amazon Redshift or another DMS target type, simply by swapping out the resources in your template. Using S3 as a source for DMS, you can get delimited data from just about anywhere and push quickly it to any number of target engines.

About the Author

Wendy Neu has worked as a Data Architect with Amazon since January 2015. Prior to joining Amazon, she worked as a consultant in Cincinnati, OH helping customers integrate and manage their data from different unrelated data sources.