AWS Storage Blog
Copying objects greater than 5 GB with Amazon S3 Batch Operations
Update (3/4/2022): Added support for Glacier Instant Retrieval storage class.
Update (4/19/2022): Included the copy destination prefix parameter in the AWS CloudFormation template.
Update (10/26/2022): Added performance guidance and best practices, and included a template optimized for copying objects restored from archive to a different storage class.
A large number of customers store their data in Amazon S3, and some of these customers scale to millions or billions of individual objects. Amazon S3 customers need an easy, reliable, and scalable way to perform bulk operations on these large datasets – with objects ranging in size from a few kilobytes up to 5 GB. Amazon S3 Batch Operations lets you manage billions of objects at scale with just a few clicks in the Amazon S3 console or a single API request. A few of the supported operations include copying, replacing tags, replacing access control, and invoking AWS Lambda functions.
As of this writing, the copy operation supports individual objects up to 5 GB in size. Because customers store objects of all sizes in Amazon S3, you may at times need to copy objects larger than 5 GB.
In this blog, I cover a solution that gives you the ability to copy objects larger than 5 GB using the S3 Batch Operations Invoke AWS Lambda operation.
Not all objects larger than 5 GB can be copied within the current Lambda function 15-minute timeout limit, especially across AWS Regions. The solution has been tested successfully, under ideal conditions, with single objects up to 2 TB in size within the same Region and up to 1 TB in size between two Regions (eu-west-2 and us-east-1).
Components and walkthrough
The following is a summary of the components of this solution:
- An S3 bucket with the uploaded manifest file listing the objects to be copied. Alternatively, if you enable S3 Inventory on the production source bucket, you can initiate S3 Batch Operations jobs from the S3 Inventory configuration page.
- A Lambda function running code and an AWS SDK to perform the copy. I use the Python Boto3 SDK in this example.
- A modified Boto3 SDK default configuration to increase performance, ensuring that large objects are copied as fast as possible before the Lambda function timeout limit is reached. I modified both the botocore configuration settings and the built-in high-level Boto3 transfer utility configuration. The botocore max_pool_connections parameter specifies the maximum number of connections to allow in a pool. By default it is set to 10; however, I want to use as many concurrent connections as possible to handle the copy, because the more concurrent connections the SDK uses, the faster the object is copied. The value set here must be at least equal to the next configuration setting, max_concurrency, since max_concurrency relies on the underlying connection pool. The transfer utility max_concurrency setting determines the number of concurrent threads used to perform the operation. These two increased settings speed up the transfer so it completes within the Lambda timeout limit. I have also customized the number of SDK retries using the max_retries parameter for added resiliency. A minimal sketch of this configuration is shown after this list.
- An IAM role attached to the Lambda function with the appropriate permissions to perform the copy operation. If the objects are encrypted with a customer managed KMS key, the IAM role must be granted the required access to the KMS key.
- Source and destination S3 buckets.
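The following is a minimal sketch of the SDK tuning described above, assuming the same values the CloudFormation template sets by default; the bucket and object names are placeholders for illustration only and are not the solution's actual code.

import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

# botocore settings: widen the connection pool and increase retries
# (940 connections and 100 retries mirror the template defaults described later).
client_config = Config(
    max_pool_connections=940,
    retries={"max_attempts": 100, "mode": "standard"},
)
s3 = boto3.client("s3", config=client_config)

# Transfer utility settings: multipart copy using many concurrent threads
# and a 16 MB part size (note that the part size affects the resulting ETag).
transfer_config = TransferConfig(
    max_concurrency=940,
    multipart_chunksize=16 * 1024 * 1024,
)

# Placeholder names, for illustration only.
copy_source = {"Bucket": "SOURCE_BUCKET", "Key": "large-object.bin"}
s3.copy(copy_source, "DESTINATION_BUCKET", "large-object.bin", Config=transfer_config)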
Prerequisites
The instructions in this post assume that you have the necessary account permissions, in addition to working knowledge of IAM roles, administering Lambda functions, and managing S3 buckets in your AWS account. You also need the following resources:
- An Amazon S3 bucket.
- An existing manifest CSV file (or S3 Inventory configured on the source S3 bucket).
Summary of the steps
- Deploy the AWS CloudFormation template provided in the “Solution walkthrough” section to create the Lambda function and the associated IAM role.
- Specify the destination S3 bucket name in the function environment variable. You can optionally modify the copy and SDK configuration parameters.
- Start the S3 Batch Operations job from the source S3 bucket using the Inventory configuration tab or via the S3 Batch Operations console page. Then, select either an S3 Inventory JSON or a CSV manifest file, and follow the wizard.
- Monitor the job progress in the Batch Operations console to confirm it is successful, then check the destination S3 bucket to confirm the object has been copied successfully.
Solution walkthrough
You can deploy the CloudFormation template to get started quickly; the template is available here. For migration between storage classes, for example from Glacier Flexible Retrieval (GFR) to Glacier Instant Retrieval after you have performed the initial restore from Glacier, you can download a template here that adds a parameter to specify and restrict the source storage class.
- Navigate to the AWS CloudFormation console.
- Choose Create Stack (with new resources). In the Prerequisite section, accept the default option Template is ready.
- In the Specify Template section, select Upload a template file, choose Choose file, and then select the previously downloaded CloudFormation template. Then, choose Next.
- Specify the Stack name, the copy destination bucket name, the CSV manifest/S3 Inventory bucket name, the Batch Operations job report bucket name, and whether to enable or disable CopyMetadata and CopyTagging (for copying object metadata and tags). You can also choose the StorageClass of the copy. The stack parameters can be modified later by simply updating the stack, for example if you want to change the destination S3 bucket.
- The Lambda function and IAM roles will be created automatically.
- The template contains some predefined values that apply to the Lambda function's Boto3 SDK code, mainly: max_concurrency: 940, max_retries: 100, max_pool_connections: 940, and multipart_chunksize: 16777216. You can optionally modify these SDK parameters as required. The value of 940 for max_pool_connections and max_concurrency is chosen to be high while leaving some allowance to avoid hitting the Lambda function's unmodifiable quota of 1,024 execution processes/threads and file descriptors. The max retries value can be increased, however a value of 100 is sufficient for most cases. The multipart chunksize of 16777216 (16 MB) can be increased or decreased as needed to match your unique requirements. Note that the chunk size affects the resulting ETag value of copied objects. A sketch of how a handler might consume these values follows these steps. Choose Next to proceed.
- At the Configure stack options page, choose Next to proceed. On the next page, scroll down, accept the acknowledgement, and choose Create stack.
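The Lambda function code itself is provisioned by the template. Purely as a hedged illustration of the pieces described above, the following sketch shows how a handler for the Invoke AWS Lambda operation could read the stack parameters from environment variables and report a per-task result back to S3 Batch Operations. The environment variable names shown are illustrative assumptions, not the template's exact names.

import os
import urllib.parse

import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3", config=Config(max_pool_connections=940,
                                       retries={"max_attempts": 100}))
transfer_config = TransferConfig(max_concurrency=940,
                                 multipart_chunksize=16 * 1024 * 1024)

def lambda_handler(event, context):
    # S3 Batch Operations passes one task per invocation.
    task = event["tasks"][0]
    src_bucket = task["s3BucketArn"].split(":::")[-1]
    src_key = urllib.parse.unquote_plus(task["s3Key"])

    try:
        s3.copy(
            {"Bucket": src_bucket, "Key": src_key},
            os.environ["DESTINATION_BUCKET"],          # assumed variable name
            src_key,
            ExtraArgs={"StorageClass": os.environ.get("STORAGE_CLASS", "STANDARD")},
            Config=transfer_config,
        )
        result_code, result_string = "Succeeded", "Copy complete"
    except Exception as exc:
        # Returning "TemporaryFailure" instead would ask Batch Operations to retry the task.
        result_code, result_string = "PermanentFailure", str(exc)

    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": [{
            "taskId": task["taskId"],
            "resultCode": result_code,
            "resultString": result_string,
        }],
    }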
Cross-account scenario
If the destination S3 bucket is in another AWS account, then you must also apply a resource-level bucket policy to the destination bucket with the required permissions. See the following S3 bucket policy example with the minimum required permissions:
{
  "Version": "2012-10-17",
  "Id": "Policy1541018284691",
  "Statement": [
    {
      "Sid": "Allow Cross Account Copy",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:root"
      },
      "Action": [
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:PutObjectTagging"
      ],
      "Resource": "arn:aws:s3:::DESTINATION_BUCKET/*"
    }
  ]
}
Where “1234567890” in the bucket policy Principal is the AWS account ID of the source account. You can optionally set Object Ownership on the destination account's bucket to Bucket owner preferred to ensure that the destination account owns the copied objects.
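If you prefer to script these destination-account settings, the following is a hedged Boto3 sketch run from the destination account; the policy document mirrors the example above, and the bucket name is a placeholder.

import json
import boto3

s3 = boto3.client("s3")
bucket = "DESTINATION_BUCKET"  # placeholder

# Attach the cross-account bucket policy shown above.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Allow Cross Account Copy",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::1234567890:root"},
        "Action": ["s3:PutObject", "s3:PutObjectAcl", "s3:PutObjectTagging"],
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

# Optionally prefer bucket-owner ownership for objects written by the source account.
s3.put_bucket_ownership_controls(
    Bucket=bucket,
    OwnershipControls={"Rules": [{"ObjectOwnership": "BucketOwnerPreferred"}]},
)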
Starting the Batch Operations job
Now that you have successfully created the Lambda function and associated IAM role, it’s time to start the S3 Batch Operations job.
- Go to the Amazon S3 console.
- From the navigation pane, choose Batch Operations and choose Create job. Alternatively, you can start a job by choosing your source S3 bucket, going to the Management tab, scrolling down to the Inventory configurations section, selecting the inventory, and choosing Create job from manifest.
- For Manifest format, select CSV or S3 inventory report and enter the S3 location of the manifest. Then, choose Next.
- For Operation type, select Invoke AWS Lambda function.
- For Lambda function, select the Lambda function that was automatically created. To locate the function, start typing the stack name; this filters the displayed function names. The function name is in the format “StackName-S3BatchCopyLambdafunction-.” For example, if you specified your CloudFormation stack name as “MyS3batch”, then the S3 Batch Lambda function name will be “MyS3batch-S3BatchCopyLambdafunction-.” Then, choose Next.
- For Path to completion report destination, enter the S3 bucket location where you want the report to be delivered.
S3 Batch Operations requires permissions to be able to run successfully, so we need an IAM role granting it the required permissions:
- In the Permissions section, use the IAM role that was automatically created for the S3 Batch Operations service. To locate the IAM role, start typing the stack name to filter the displayed role names. The role name is in the format “StackName-S3BatchOperationsService-.” For example, if you specified your CloudFormation stack name as “MyS3batch”, then the S3 Batch Operations service IAM role name will be “MyS3batch-S3BatchOperationsService-.” Select the role, and then choose Next.
- On the Review page, review the details of the job. Then, choose Create job.
- After you create the job, the job's status changes from New to Preparing, and then to Awaiting your confirmation. To run the job, choose the Job ID to open the job details, choose Run job, and then choose Run job again to start the job.
- After the job completes, check the completion report to confirm that all the objects have been successfully copied. You can also check the destination S3 bucket.
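If you prefer to start the job programmatically rather than through the console, the following is a hedged Boto3 sketch of an equivalent s3control create_job call; the account ID, ARNs, manifest ETag, and report prefix are placeholders you would replace with your own values.

import boto3

s3control = boto3.client("s3control")

response = s3control.create_job(
    AccountId="111122223333",  # placeholder account ID
    ConfirmationRequired=True,
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/MyS3batch-S3BatchOperationsService-EXAMPLE",
    Operation={
        "LambdaInvoke": {
            "FunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:MyS3batch-S3BatchCopyLambdafunction-EXAMPLE"
        }
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::manifest-bucket/manifest.csv",
            "ETag": "EXAMPLE-MANIFEST-ETAG",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::report-bucket",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "batch-reports",
        "ReportScope": "AllTasks",
    },
)
print("Created job:", response["JobId"])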
Troubleshooting and guidance
The solution depends on the availability and performance of multiple underlying AWS services, including Amazon S3, AWS Lambda, and IAM.
Amazon S3 Batch Operations is an at-least-once execution engine, which means it performs at least one invocation per key in the provided manifest. In rare cases, there might be more than one invocation per key, for example when there are service-related throttles or when the customer-provided Invoke Lambda operation function code returns a temporary failure response to Batch Operations.
In our example, the values of the SDK configuration settings max_pool_connections and max_concurrency are set to 940. Note that due to the increased request rates, you might experience throttling from Amazon S3 during the copy operation. Excessive throttling can lead to longer-running tasks and possible task failures.
As a best practice, we recommend applying a lifecycle rule to expire incomplete multipart uploads in your S3 bucket, which can be left behind by tasks that fail due to Lambda function timeouts.
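The following is a hedged Boto3 sketch of such a lifecycle rule; the bucket name and the seven-day window are placeholders you would adjust to your needs.

import boto3

s3 = boto3.client("s3")

# Abort and clean up multipart uploads that never complete, for example when a
# Lambda timeout interrupts a copy. Bucket name and 7-day window are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="DESTINATION_BUCKET",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)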
To address performance issues, refer to the S3 performance guidelines. The following are some quick tips:
- Consider enabling S3 request metrics to track and monitor request rates and the number of 5XX errors on your S3 bucket.
- Consider reducing your request rate by lowering max_concurrency and max_pool_connections, updating the CloudFormation stack parameters before starting the job. For example, if you are copying objects within the same bucket or the same Region, you can set the SDK max_concurrency, max_pool_connections, and retry values to 60, 60, and 30 respectively.
- Job tasks that fail with “Task timed out after 9xxx seconds” indicate a Lambda function timeout. Possible reasons include S3 throttling causing the function to keep retrying the task until it times out, or an object that is too large to copy within the Lambda timeout limit. Adjust the SDK configurations as needed to meet your unique requirements.
- S3 Batch Operations will use all available Lambda concurrency, up to 1,000 concurrent executions. If you need to reserve some concurrency for other Lambda functions, you can reduce the concurrency used by the copy function by setting its reserved concurrency to a value less than 1,000 (see the sketch after these tips).
- If slow performance, excessive throttling, or other issues persist, contact AWS Support with the error message and the S3 Request ID and Extended Request ID from the S3 Batch Operations failure report or the function's CloudWatch Logs. You can also get S3 Request IDs by querying S3 server access logs or AWS CloudTrail logs, if enabled.
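As a hedged illustration of the reserved-concurrency suggestion above, the function name and value below are placeholders you would replace with your own.

import boto3

lambda_client = boto3.client("lambda")

# Cap the copy function at 500 concurrent executions, leaving the rest of the
# account's concurrency for other functions.
lambda_client.put_function_concurrency(
    FunctionName="MyS3batch-S3BatchCopyLambdafunction-EXAMPLE",
    ReservedConcurrentExecutions=500,
)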
For very large workloads (millions of objects or terabytes of data) or critical workloads with tight deadlines, consider contacting your AWS account team before starting the copy or migration process.
Cleaning up
There are costs associated with using this solution including S3 requests and Lambda function invocation costs.
As an optional step, remember to clean up the resources used for this setup if they are no longer required. To remove the resources, go to the AWS CloudFormation console, select the stack, and then choose Delete.
Conclusion
Amazon S3 customers often store objects of all sizes in their S3 buckets, ranging from a few kilobytes to hundreds of gigabytes. Customers often need to copy objects larger than 5 GB as part of their workflows, for business or compliance requirements. In this blog post, I demonstrated performing bulk operations on objects stored in S3 using S3 Batch Operations. I also covered copying objects larger than 5 GB between S3 buckets, within and across AWS accounts, using the S3 Batch Operations Invoke AWS Lambda job type. To do this, I created AWS resources, including a Lambda function and IAM roles. The Lambda function runs code with an optimized Boto3 SDK configuration that copies S3 objects using multiple concurrent threads when invoked by Amazon S3 Batch Operations.
Thanks for reading this blog post on copying objects greater than 5 GB in size using Amazon S3 Batch Operations. If you have any comments or questions, don’t hesitate to post them in the comments section.