How can I improve the transfer performance of the sync command for Amazon S3?
Last updated: 2020-07-09
I'm using the AWS Command Line Interface (AWS CLI) sync command to transfer data on Amazon Simple Storage Service (Amazon S3). However, the transfer is taking a long time to complete. How can I improve the performance of a transfer using the sync command?
Try the following approaches for improving the transfer time when you run the sync command:
Note: The sync command compares the source and destination buckets to determine which source files don't exist in the destination bucket, or which source files were modified when compared to the files in the destination bucket. Then, the sync command copies the new or updated source files to the destination bucket. The number of objects in the source and destination bucket can impact the time it takes for the sync command to complete the process. It's important to understand how the transfer size can impact the duration of the sync, as well as the cost that you can incur from the requests to Amazon S3 that are associated with the operation.
Run multiple instances of the AWS CLI
To copy a large amount of data, you can run multiple instances of the AWS CLI to perform separate sync operations in parallel. For example, you can run parallel sync operations for different prefixes:
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/folder1 s3://destination-AWSDOC-EXAMPLE-BUCKET/folder1
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/folder2 s3://destination-AWSDOC-EXAMPLE-BUCKET/folder2
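The per-prefix commands can be started from a single shell as background jobs. The following is a minimal sketch, not an official AWS script. It assumes the example bucket names used in this article, and sync_prefixes_in_parallel is a hypothetical helper name introduced only for illustration.

```shell
# Placeholder bucket names taken from the examples in this article.
SRC_BUCKET="source-AWSDOC-EXAMPLE-BUCKET"
DST_BUCKET="destination-AWSDOC-EXAMPLE-BUCKET"

# Hypothetical helper: starts one "aws s3 sync" per prefix in the background,
# then waits for every job and returns 1 if any of them failed.
sync_prefixes_in_parallel() {
  sync_pids=""
  sync_failed=0
  for prefix in "$@"; do
    aws s3 sync "s3://${SRC_BUCKET}/${prefix}" "s3://${DST_BUCKET}/${prefix}" &
    sync_pids="${sync_pids} $!"
  done
  # Intentionally unquoted so that the list splits into one PID per word.
  for pid in ${sync_pids}; do
    wait "${pid}" || sync_failed=1
  done
  return "${sync_failed}"
}

# Example invocation (uncomment to run against real buckets):
# sync_prefixes_in_parallel folder1 folder2
```

Waiting on each process ID separately, rather than calling wait with no arguments, lets the helper report a failure if any one of the sync operations exits with an error.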
Or, you can run parallel sync operations for separate exclude and include filters. For example, the following operations separate the files to sync by the key names that begin with the numbers 0 through 4, and the key names that begin with the numbers 5 through 9:
Note: Even when you use exclude and include filters, the sync command still reviews all the files in the source bucket to determine the source files that should be copied to the destination bucket. This means that if you have multiple sync operations that target different key name prefixes, then each sync operation reviews all the source files. However, because of the exclude and include filters, only the files that are included in the filters are copied to the destination bucket.
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/ s3://destination-AWSDOC-EXAMPLE-BUCKET/ --exclude "*" --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET/ s3://destination-AWSDOC-EXAMPLE-BUCKET/ --exclude "*" --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"
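Writing out a long run of --include filters by hand is error-prone, so the filter list can be generated instead. The following sketch assumes the same example bucket names. sync_by_leading_chars is a hypothetical helper, not an AWS CLI feature. It only assembles the same --exclude and --include arguments shown in the commands above.

```shell
# Placeholder bucket names taken from the examples in this article.
SRC_BUCKET="source-AWSDOC-EXAMPLE-BUCKET"
DST_BUCKET="destination-AWSDOC-EXAMPLE-BUCKET"

# Hypothetical helper: syncs only the keys whose names begin with one of the
# characters passed as arguments.
sync_by_leading_chars() {
  char_count=$#
  # Append the exclude-everything filter after the characters, then swap each
  # character for its own --include "<char>*" filter.
  set -- "$@" --exclude "*"
  while [ "${char_count}" -gt 0 ]; do
    leading_char=$1
    shift
    set -- "$@" --include "${leading_char}*"
    char_count=$((char_count - 1))
  done
  aws s3 sync "s3://${SRC_BUCKET}/" "s3://${DST_BUCKET}/" "$@"
}

# Example invocation (uncomment to run both halves in parallel):
# sync_by_leading_chars 0 1 2 3 4 &
# sync_by_leading_chars 5 6 7 8 9 &
# wait
```

The filters are kept as separate, quoted arguments so that the shell never expands the * characters against local file names.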
For more information on optimizing the performance of your workload, see Best practices design patterns: Optimizing Amazon S3 performance.
Modify the AWS CLI configuration value for max_concurrent_requests
To potentially improve performance, you can modify the value of max_concurrent_requests. This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10, and you can increase it to a higher value. However, note the following:
- Running more threads consumes more resources on your machine. Make sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
- Too many concurrent requests can overwhelm a system, which might cause connection timeouts or slow the responsiveness of the system. To avoid timeout issues from the AWS CLI, you can try setting the --cli-read-timeout value or the --cli-connect-timeout value to 0. A value of 0 makes the socket read or connect operation block indefinitely instead of timing out.
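As a rough sketch of how these settings can be applied from the command line, the following helpers wrap the aws configure set command and the two global timeout options. The helper names are hypothetical. The value 20 in the usage comment is only an illustration, not a recommendation, and the setting is written to the default profile.

```shell
# Hypothetical helper: stores s3.max_concurrent_requests for the default
# profile in the AWS CLI config file. $1 is the number of concurrent requests.
set_max_concurrent_requests() {
  aws configure set default.s3.max_concurrent_requests "$1"
}

# Hypothetical helper: runs a sync with both timeouts set to 0, so that the
# socket connect and read operations block instead of timing out.
# $1 is the source URI and $2 is the destination URI.
sync_without_timeouts() {
  aws s3 sync "$1" "$2" --cli-read-timeout 0 --cli-connect-timeout 0
}

# Example invocation (uncomment to run; 20 is an illustrative value only):
# set_max_concurrent_requests 20
# sync_without_timeouts s3://source-AWSDOC-EXAMPLE-BUCKET/ s3://destination-AWSDOC-EXAMPLE-BUCKET/
```

Because a timeout of 0 never expires, a stalled connection can leave the command waiting indefinitely. Consider raising the concurrency in small steps while you watch CPU, memory, and network usage on the machine.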
If you're using Amazon Elastic Compute Cloud (Amazon EC2), check the instance configuration
If you're using an EC2 instance to run the sync operation, consider the following:
- Review the instance type that you're using. Instance types that are larger than m3.xlarge can provide better results, because they have high bandwidth and Amazon Elastic Block Store (Amazon EBS)-optimized networks.
- If the instance is in a different AWS Region than the bucket, then consider using an instance in the same Region. Reducing the geographical distance between the instance and the bucket can reduce latency.
- If the instance is in the same Region as the source bucket, then consider setting up an Amazon Virtual Private Cloud (Amazon VPC) endpoint for Amazon S3. VPC endpoints can help improve overall performance.
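The Region and endpoint checks above can be scripted with the AWS CLI. This is a sketch under stated assumptions. bucket_region and create_s3_gateway_endpoint are hypothetical helper names, and the VPC ID, route table ID, and Region are values that you supply yourself. It also assumes that a gateway endpoint is the endpoint type that you want.

```shell
# Hypothetical helper: prints the Region of the bucket named in $1.
# get-bucket-location returns a null LocationConstraint for buckets in
# us-east-1, which --output text prints as "None".
bucket_region() {
  region=$(aws s3api get-bucket-location --bucket "$1" \
    --query LocationConstraint --output text) || return 1
  if [ "${region}" = "None" ]; then
    region="us-east-1"
  fi
  printf '%s\n' "${region}"
}

# Hypothetical helper: creates a gateway VPC endpoint for Amazon S3.
# $1 is the VPC ID, $2 is the route table ID, and $3 is the Region.
create_s3_gateway_endpoint() {
  aws ec2 create-vpc-endpoint --vpc-endpoint-type Gateway \
    --vpc-id "$1" --route-table-ids "$2" \
    --service-name "com.amazonaws.$3.s3"
}

# Example invocation (uncomment and replace the placeholder IDs):
# bucket_region source-AWSDOC-EXAMPLE-BUCKET
# create_s3_gateway_endpoint <vpc-id> <route-table-id> <region>
```

Comparing the output of bucket_region with the Region of the instance shows whether the sync traffic crosses Regions.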