AWS Storage Blog

Transferring Amazon S3 data from AWS Regions to AWS Regions in China

AWS customers with data located in multiple AWS Regions often ask about moving files from AWS Regions outside of China to the AWS China (Beijing) Region and the AWS China (Ningxia) Region to localize data within China for compliance, data center operations, and data storage requirements. To best serve customers in China and comply with China’s laws and regulations, AWS has collaborated with Chinese partners that hold the proper telecom licenses to deliver cloud services. The AWS Region in Beijing, which became generally available to Chinese customers in 2016, is operated by Beijing Sinnet Technology Co. Ltd. (Sinnet), and the AWS Region in Ningxia, which launched in 2017, is operated by Ningxia Western Cloud Data Technology Co., Ltd. (NWCD).

Outside of China, to move data between two AWS Regions, you can use Amazon S3 Replication, a feature of Amazon S3 that automatically and asynchronously replicates data to a different bucket. Because AWS China Regions are operated separately from other AWS Regions, with account credentials unique to AWS China accounts, Amazon S3 Replication is not available between AWS China Regions and AWS Regions outside of China.

In this blog post, we cover one solution to move Amazon S3 objects from buckets located in AWS Regions outside of China to buckets located in AWS China Regions.

Overview and solution tutorial

To move data from an AWS Region outside of China to one in China, you can use the step-by-step guide provided here to create a solution using AWS services. This solution is designed to let you transfer thousands of large Amazon S3 objects from buckets in an AWS Region outside of China to buckets in an AWS China Region. This solution is one-directional, and cannot be used to move data from AWS China Regions to any other AWS Region.

In this approach, you set up Amazon S3 event notifications with Amazon Simple Queue Service (Amazon SQS). For each new file created in the source Amazon S3 bucket, Amazon SQS queues the file information. A Python worker cluster, running on Amazon EC2 instances in an Auto Scaling group, pulls file information from Amazon SQS, splits each file into parts, and uploads the parts in parallel threads to S3 buckets in an AWS China Region. The worker code also records file details in Amazon DynamoDB so they can be analyzed later.
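As a minimal sketch of this event wiring (assuming a hypothetical queue named s3-migration-queue whose access policy already allows S3 to publish to it, and placeholder bucket, Region, and account values), the notification configuration could be set with boto3 like this:

# Minimal sketch: send "object created" events from the source bucket
# to an SQS queue. The queue ARN, account ID, Region, and bucket name
# are placeholders, not values taken from the solution code.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.put_bucket_notification_configuration(
    Bucket="your_global_bucket_1",
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": "arn:aws:sqs:us-east-1:111122223333:s3-migration-queue",
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)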

AWS services and features involved:

  • Amazon S3
  • Amazon SQS
  • Amazon EC2 with an Auto Scaling group
  • Amazon DynamoDB
  • AWS Systems Manager (SSM) Parameter Store

The following diagram represents the architecture that we walk through in this post:

Diagram representing the architecture involved with this solution (transferring data into AWS China Regions), including S3, EC2, DynamoDB, SQS, and AWS SSM

To optimize the workload, deploy the worker cluster in the same AWS Region as the source S3 bucket. You can also manage the security information and IAM credentials for the AWS China Region in the AWS Systems Manager (SSM) Parameter Store.

We have shared the code in the AWS Samples GitHub library. Download the code from the Amazon S3 resumable upload repository.

Here are the steps and components of this solution:

  1. Each new file or file update in the source bucket is queued in Amazon SQS, and the status and attributes of each job are recorded in Amazon DynamoDB. A Python jobsender, which runs on a single-node EC2 instance, periodically compares the existing files in the source bucket with the file table in DynamoDB, and adds any discrepancies it finds to the queue.
  2. The worker cluster, running on Amazon EC2, processes the jobs on the Amazon SQS queue (a simplified sketch follows this list).
  3. The worker cluster uses the multipart upload feature of Amazon S3 to transfer files to the destination S3 buckets in an AWS China Region. This cluster can scale up and down based on the Auto Scaling group settings.
  4. You can configure source and destination S3 bucket details in the configuration file shared with the solution. The connection between AWS Regions outside of China and AWS China Regions can be over the public internet or through AWS Direct Connect.
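To make the worker’s role concrete, here is a heavily simplified, single-threaded sketch of steps 2 and 3, assuming a hypothetical queue named s3-migration-queue, a placeholder destination bucket, and China credentials stored in SSM as described later in this post; the actual solution adds multipart threading, resumable transfers, retries, and DynamoDB bookkeeping:

# Simplified worker loop: pull S3 event messages from SQS and copy each
# object to the destination bucket in an AWS China Region. Queue, bucket,
# and Region names are placeholders.
import json
import boto3

src_session = boto3.Session(region_name="us-east-1")   # source Region
sqs = src_session.client("sqs")
ssm = src_session.client("ssm")
s3_src = src_session.client("s3")

# Load the AWS China credentials from the SecureString parameter
cred = json.loads(ssm.get_parameter(
    Name="s3_migration_credentials", WithDecryption=True
)["Parameter"]["Value"])

# A separate session is required because AWS China accounts use
# independent credentials
s3_des = boto3.Session(
    aws_access_key_id=cred["aws_access_key_id"],
    aws_secret_access_key=cred["aws_secret_access_key"],
    region_name=cred["region"],
).client("s3")

queue_url = sqs.get_queue_url(QueueName="s3-migration-queue")["QueueUrl"]

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        record = json.loads(msg["Body"])["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # upload_fileobj streams the object and performs a managed
        # multipart upload for large files
        body = s3_src.get_object(Bucket=bucket, Key=key)["Body"]
        s3_des.upload_fileobj(body, "your_china_bucket_1", key)
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])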

Solution deployment

Download the solution from the GitHub repository. This code uses the AWS CDK, a software development framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation.

Prerequisites to deploy the sample code:

  • Install the AWS Command Line Interface (AWS CLI).
  • If you don’t have an AWS China account, you must request one. To register for an AWS China account, you must have a business license or other equivalent license registered in China.
  • All AWS CDK applications require Node.js 10.3 or later, even when your app is written in Python, Java, or C#. You may download a compatible version for your platform at nodejs.org. We recommend the current LTS version (as of this writing, the latest is the 12.x release).
  • After installing Node.js, install the AWS CDK toolkit:
    • npm install -g aws-cdk
  • Test the installation by using:
    • cdk --version
  • If you are new to the AWS CDK and Python, read the documentation on working with AWS CDK in Python.

Configuration:

Before deploying the code, you must update the configuration files and AWS SSM Parameter Store so that your code can execute as expected. Here are the steps:

1: Create an AWS SSM Parameter Store parameter with the following details, as shown in the following example screenshot.

  • Name: s3_migration_credentials
  • Type: SecureString
  • Tier: Standard
  • KMS key source: My current account/alias/aws/ssm
  • Value: IAM access and secret keys for your AWS China account.
{
  "aws_access_key_id": "your_aws_access_key_id",
  "aws_secret_access_key": "your_aws_secret_access_key",
  "region": "cn-northwest-1 or other"
}

Create an AWS SSM Parameter Store parameter with Name, Type, Tier, KMS key source, and Value
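If you prefer to create this parameter programmatically rather than through the console, a minimal boto3 sketch equivalent to the steps above could look like this (the credential values are placeholders):

# Minimal sketch: store AWS China credentials as a SecureString
# parameter, encrypted with the default aws/ssm KMS key
import json
import boto3

ssm = boto3.client("ssm")
ssm.put_parameter(
    Name="s3_migration_credentials",
    Type="SecureString",
    Tier="Standard",
    Value=json.dumps({
        "aws_access_key_id": "your_aws_access_key_id",
        "aws_secret_access_key": "your_aws_secret_access_key",
        "region": "cn-northwest-1",
    }),
)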

2: Edit the app.py file in the code you have downloaded and provide your source and destination bucket mappings.

[{
    "src_bucket": "your_global_bucket_1",
    "src_prefix": "your_prefix",
    "des_bucket": "your_china_bucket_1",
    "des_prefix": "prefix_1"
},{
    "src_bucket": "your_global_bucket_2",
    "src_prefix": "your_prefix",
    "des_bucket": "your_china_bucket_2",
    "des_prefix": "prefix_2"
}]

These mappings are stored in the AWS SSM Parameter Store, which you can update later.

3: (Optional) You can modify the default config file, “./code/s3_migration_cluster_config.ini”, based on your requirements. For example, you can change the Amazon S3 storage class, or modify the retry and logging levels. We recommend reviewing this file to see the different configuration options.

Some other configurable options are:

  • You can modify the instance type and size of your worker cluster nodes. The default EC2 instance type for the worker cluster is c5.large. You can change the node type in the cdk_ec2_stack.py file:
# Adjust ec2 type here based on your file processing load
worker_type = "c5.large"
jobsender_type = "t3.micro"
  • This architecture sends notification emails. Update the recipient email address (alarm_your_email@email.com in the following example) in the file cdk_ec2_stack.py:
# Set up your alarm email
alarm_email = "alarm_your_email@email.com"
  • The solution creates a VPC with CIDR 10.10.0.0/16. You can modify your VPC settings in the file cdk_vpc_Stack.py.

4: Build and deploy the AWS CDK application. Refer to the AWS CDK Developer Guide for instructions on deploying CDK solutions.

cdk synth
cdk deploy

Test your data flow

You can upload sample objects to your source S3 bucket to test the solution. Files uploaded to the source S3 bucket should start to appear in the destination S3 bucket within minutes, depending on object volume and size. When we tested this solution with five objects on a single-node c5.large instance (configured as 5 files x 30 threads in this test), throughput reached up to 800 Mbps. Your results may vary depending on network speed; transfer performance can be affected by current network traffic and the network environment.

Test your data flow - when we tested this solution using 5 objects, on a single node, throughput reached up to 800 Mbps
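As a minimal sketch, you can seed the source bucket with a few test objects using a short script such as the following; the bucket and prefix names are placeholders and should match your app.py mapping:

# Minimal sketch: upload a handful of dummy objects to trigger the
# transfer pipeline. Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
for i in range(5):
    s3.put_object(
        Bucket="your_global_bucket_1",
        Key=f"your_prefix/test-object-{i}.bin",
        Body=b"0" * (1024 * 1024),  # 1 MiB of dummy data
    )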

In another test, we increased the number of objects to 916, and the Auto Scaling group added nine Amazon EC2 instances (c5.large) to transfer 1.2 TB (916 files) in one hour, reaching 7.2 Gbps of throughput. You can also connect the DynamoDB data to Amazon QuickSight to analyze your file transfer details.

In another test, we increased the number of objects to over 900, and the auto scaling group added 9 EC2 instances
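To spot-check the transfer records before connecting QuickSight, you can also scan the DynamoDB table directly; note that the actual table name is generated by the CDK stack, so the name below is only a placeholder:

# Minimal sketch: print a few transfer records. "s3_migration_file_list"
# is a placeholder; look up the real table name in the DynamoDB console
# or the CloudFormation stack resources.
import boto3

table = boto3.resource("dynamodb").Table("s3_migration_file_list")
for item in table.scan(Limit=10)["Items"]:
    print(item)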

Network performance

This solution enables TCP Bottleneck Bandwidth and Round-trip propagation time (TCP BBR), a congestion-based congestion control algorithm, on the Amazon EC2 instances to improve network performance. It also uses the public network. Transfer rates can be affected by many factors, including your network conditions, the link sections when routing to an overseas Region, and the various telecom carriers involved at home and abroad.

Furthermore, we recommend using AWS Direct Connect from AWS Regions outside of China to those in China by contacting AWS Direct Connect Partners such as Wangsu and China Mobile. Please consult with a partner for specific plans, contracts, quotes, and delivery timelines. AWS China Regions in Beijing and Ningxia are not connected to the AWS global backbone and infrastructure. To reduce potential packet loss and lower latency between AWS China Regions and AWS Regions outside of China, Chinese ISPs provide internet route optimization, and ISPs such as China Telecom also provide value-added solutions for further optimization of internet access.

To help customers connect to VPCs in China and other Regions, Chinese ISPs like China Mobile and China Telecom provide dedicated lines via AWS Direct Connect. With China Mobile, for example, a hosted connection can be set up in one week. Customers must sign a contract directly with the ISP, similar to the process in any other Region. Last but not least, customers must comply with Chinese laws when determining schemes for data transfer and localization. To learn more about getting started with AWS services in the AWS China (Beijing) Region and AWS China (Ningxia) Region, read this blog post.

Cleaning up

After testing, you should delete this solution and any example resources you deployed if you no longer need them, to avoid incurring unwanted charges. Because this solution is provisioned through AWS CloudFormation, you can remove the deployed resources by deleting the CloudFormation stacks; refer to the AWS CloudFormation documentation for details.

Summary

Using Amazon EC2, Amazon SQS, and Amazon DynamoDB, you can move Amazon S3 objects from AWS Regions to AWS China Regions. You can also enhance the transfer speed when uploading objects from an AWS Region outside of China to an AWS China Region by using Amazon S3 multipart upload and the TCP Bottleneck Bandwidth and Round-trip propagation time (TCP BBR) congestion control algorithm. China is an important country for global companies, and if your business or operation is expanding in China, this solution can help you move your data or files to an AWS China Region. To get started with AWS services in China, please refer to the blog post Getting Started with AWS Services in AWS China (Beijing) Region and AWS China (Ningxia) Region.

Thanks for reading this blog post! If you have any comments or questions, please leave them in the comments section.

Ashwini Rudra

Ashwini Rudra is an AWS Solutions Architect. He has more than 13 years of experience architecting Windows workloads in on-premises and cloud environments. He is also an AI/ML enthusiast. He helps AWS customers, including major sports leagues, define their cloud-first digital innovation strategy.

Zhuobin Huang

Zhuobin is an AWS Senior Solutions Architect from Guangzhou, China. He has 20 years of experience architecting and designing mobile and enterprise applications. Before joining AWS, Zhuobin served as CTO for a mobile startup. His interests are serverless applications and machine learning.