How can I improve the transfer performance of the sync command for Amazon S3?
Last updated: 2020-01-14
I'm using the AWS Command Line Interface (AWS CLI) sync command to transfer data on Amazon Simple Storage Service (Amazon S3). However, the transfer is taking a long time to complete. How can I improve the performance of a transfer using the sync command?
Try the following approaches for improving the transfer time when you run the sync command:
Run multiple instances of the AWS CLI
To copy a large amount of data, you can run multiple instances of the AWS CLI to perform separate sync operations in parallel. For example, you can run parallel sync operations for different prefixes:
aws s3 sync s3://source-awsexamplebucket/folder1 s3://destination-awsexamplebucket/folder1
aws s3 sync s3://source-awsexamplebucket/folder2 s3://destination-awsexamplebucket/folder2
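As a rough sketch of how the per-prefix operations might be launched from a single script, the following starts one background sync per prefix and then waits for all of them. The sync_prefixes function and the DRY_RUN switch are hypothetical helpers added for illustration. With DRY_RUN=1 (the default in this sketch), the script only prints the commands that it would run.

```shell
#!/bin/sh
# Sketch only: bucket names are the placeholders used in this article.
SRC="s3://source-awsexamplebucket"
DST="s3://destination-awsexamplebucket"

# Assumption for illustration: DRY_RUN=1 prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: start one sync per prefix, then wait for every job.
sync_prefixes() {
  pids=""
  status=0
  for prefix in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "aws s3 sync $SRC/$prefix $DST/$prefix"
    else
      # "&" puts the sync in the background so the next one starts immediately
      aws s3 sync "$SRC/$prefix" "$DST/$prefix" &
      pids="$pids $!"
    fi
  done
  # Wait for each background job and remember whether any of them failed
  for pid in $pids; do
    wait "$pid" || status=1
  done
  return "$status"
}

sync_prefixes folder1 folder2
```

To run the transfers for real, set DRY_RUN=0 in the environment before you start the script. The function returns a nonzero status if any of the sync operations fails.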
Or, you can run parallel sync operations for separate exclude and include filters. For example, the following operations separate the files to sync by the key names that begin with the numbers 0 through 4, and the key names that begin with the numbers 5 through 9:
aws s3 sync s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --exclude "*" --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"
aws s3 sync s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --exclude "*" --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"
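The filter lists above can also be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical sync_by_leading_chars function that builds the --exclude and --include arguments from a list of leading characters. It reuses the hypothetical DRY_RUN switch, so by default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: bucket names are the placeholders used in this article.
SRC="s3://source-awsexamplebucket"
DST="s3://destination-awsexamplebucket"

# Assumption for illustration: DRY_RUN=1 prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: sync only the keys that begin with one of the given characters.
sync_by_leading_chars() {
  chars="$*"
  # Rebuild the argument list as: --exclude "*" --include "<c>*" ...
  set -- --exclude "*"
  for c in $chars; do
    set -- "$@" --include "${c}*"
  done
  if [ "$DRY_RUN" = "1" ]; then
    # The printed line is for inspection only; it omits the shell quoting
    echo "aws s3 sync $SRC/ $DST/ $*"
  else
    aws s3 sync "$SRC/" "$DST/" "$@"
  fi
}

# Run both halves of the key space in parallel, then wait for both to finish
sync_by_leading_chars 0 1 2 3 4 &
sync_by_leading_chars 5 6 7 8 9 &
wait
```

Keys that begin with a character outside the listed sets aren't copied by either operation, so choose the sets to cover every key name in the bucket.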
For more information on optimizing the performance of your workload, see Best Practices Design Patterns: Optimizing Amazon S3 Performance.
Modify the AWS CLI configuration value for max_concurrent_requests
To potentially improve performance, you can modify the value of max_concurrent_requests. This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10, and you can increase it to a higher value. However, note the following:
- Running more threads consumes more resources on your machine. You must be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
- Too many concurrent requests can overwhelm a system, which might cause connection timeouts or slow the responsiveness of the system. To avoid timeout issues from the AWS CLI, you can try setting the --cli-read-timeout value or the --cli-connect-timeout value to 0. A value of 0 turns off the corresponding timeout, so the socket read or connect operation waits indefinitely.
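As an illustration of the settings described above, the following sketch raises max_concurrent_requests for the default profile and then runs a sync with both timeouts turned off. The value 20 is an arbitrary example, not a recommendation, and the run helper and DRY_RUN switch are hypothetical. With DRY_RUN=1 (the default in this sketch), the commands are printed rather than run.

```shell
#!/bin/sh
# Assumption for illustration: DRY_RUN=1 prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example value only; tune it to the resources available on your machine.
MAX_REQUESTS=20

# Store the setting for the default profile in the AWS CLI configuration file
run aws configure set default.s3.max_concurrent_requests "$MAX_REQUESTS"

# A value of 0 turns off the read and connect timeouts for this sync
run aws s3 sync s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ \
  --cli-read-timeout 0 --cli-connect-timeout 0
```

Increasing the value in small steps and timing a representative transfer after each change is one way to find the point where more concurrency stops helping.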
If you're using Amazon Elastic Compute Cloud (Amazon EC2), check the instance configuration
If you're using an EC2 instance to run the sync operation, consider the following:
- Review the instance type that you're using. Instance types that are larger than m3.xlarge can provide better results, because they have high bandwidth and Amazon Elastic Block Store (Amazon EBS)-optimized networks.
- If the instance is in a different AWS Region than the bucket, then consider using an instance in the same Region. Reducing the geographical distance between the instance and the bucket can reduce latency.
- If the instance is in the same Region as the source bucket, then consider setting up an Amazon Virtual Private Cloud (Amazon VPC) endpoint for Amazon S3. VPC endpoints can help improve overall performance.
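The Region and endpoint checks above can be sketched as follows. The instance Region, VPC ID, and route table ID are hypothetical placeholders, as are the normalize_bucket_region and run helpers and the DRY_RUN switch. The sketch assumes that the bucket's Region was already looked up with aws s3api get-bucket-location --query LocationConstraint --output text, which reports None for buckets in us-east-1.

```shell
#!/bin/sh
# Assumption for illustration: DRY_RUN=1 prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: map the "None" result for us-east-1 buckets to a Region name.
normalize_bucket_region() {
  case "$1" in
    None|"") echo "us-east-1" ;;
    *) echo "$1" ;;
  esac
}

# Hypothetical placeholder values; replace them with your own.
INSTANCE_REGION="us-east-1"
VPC_ID="vpc-0123456789abcdef0"
ROUTE_TABLE_ID="rtb-0123456789abcdef0"
# Example of the raw text returned by get-bucket-location for the source bucket
BUCKET_REGION_RAW="None"

bucket_region="$(normalize_bucket_region "$BUCKET_REGION_RAW")"

if [ "$bucket_region" = "$INSTANCE_REGION" ]; then
  # Same Region: create a gateway endpoint for Amazon S3 in the instance's VPC
  run aws ec2 create-vpc-endpoint --vpc-id "$VPC_ID" \
    --service-name "com.amazonaws.$INSTANCE_REGION.s3" \
    --route-table-ids "$ROUTE_TABLE_ID"
else
  echo "Bucket is in $bucket_region, instance is in $INSTANCE_REGION: consider an instance in $bucket_region"
fi
```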