AWS Storage Blog
Accelerate Amazon S3 throughput with the AWS Common Runtime
Data is at the center of every machine learning pipeline. Whether pre-training foundation models (FMs), fine-tuning FMs with business-specific data, or serving inference queries, every step of the machine learning lifecycle needs low-cost, high-performance data storage to keep compute resources busy and performing useful work. Customers use Amazon Simple Storage Service (Amazon S3) to store training data and model checkpoints because of its elasticity, performance to scale to multiple terabits per second, and storage classes like S3 Intelligent-Tiering to automatically save on storage cost when access patterns change.
Today, we announced new updates to the AWS Command Line Interface (AWS CLI) and the AWS SDK for Python (Boto3) that automatically accelerate data transfer to and from Amazon S3, making them even better foundations for your machine learning pipeline. The AWS CLI and Boto3 now integrate with the AWS Common Runtime (CRT) S3 client, which is designed and built specifically to deliver high-throughput data transfer to and from Amazon S3. This integration is now enabled by default on Amazon EC2 Trn1, P4d, and P5 instance types, and can be enabled as an opt-in on other instance types.
What is the AWS Common Runtime?
Customers love Amazon S3 for its simple REST APIs that can be accessed by any HTTP client. However, to get the best performance for data-intensive applications, customers should implement performance best practices including request parallelization, timeouts, retries, and backoff. Several years ago, we noticed that we were reimplementing these patterns in each of our AWS SDKs, and that customers were also having to implement them in their own applications. We wanted to make it easy to access S3’s elastic performance from any application, without needing to reimplement these common design patterns.
To get this portable performance, we built the AWS Common Runtime (CRT). The CRT is a collection of open-source libraries written in C that implement common functionality for interacting with AWS services, including a high-performance HTTP client and an encryption library. The CRT libraries work together to provide a fast, reliable client experience for AWS services. For Amazon S3, the CRT includes a native S3 client that implements automatic request parallelization, request timeouts and retries, and connection reuse and management to avoid overloading the network interface. For example, to download a single large object from S3, the CRT client automatically downloads multiple byte ranges in parallel, increasing throughput and saturating the network interface of many EC2 instances.
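The byte-range technique described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the CRT's implementation: the function names, the part size, and the worker count are hypothetical, and `fetch_range` stands in for whatever issues a ranged GET request (for example, an HTTP request with a `Range: bytes=start-end` header).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_byte_ranges(object_size: int, part_size: int) -> List[Tuple[int, int]]:
    """Split [0, object_size) into inclusive (start, end) byte ranges."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        # HTTP Range headers use inclusive end offsets.
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


def download_in_parallel(
    fetch_range: Callable[[int, int], bytes],
    object_size: int,
    part_size: int,
    max_workers: int = 8,  # hypothetical value; the CRT picks its own
) -> bytes:
    """Fetch every range concurrently and reassemble the parts in order."""
    ranges = split_byte_ranges(object_size, part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so parts stay in sequence.
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)
```

A real client would also need per-request timeouts, retries, and connection management, which is the work the CRT takes on for you.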
The CRT is already available in production for multiple AWS SDKs, including Java and C++, and was previously available as an experimental option in the AWS CLI. It’s also the foundation of our open-source file client, Mountpoint for Amazon S3. Today, we’re making the CRT generally available in the AWS CLI and Boto3 on the Trn1, P4d, and P5 EC2 instance types, which have large CPUs and network interfaces that benefit most from these performance design patterns. For other instance types, you can opt in to use the CRT in your Boto3 applications or with the AWS CLI, and get automatic performance improvements in many cases.
Performance improvements for ML pipelines
To demonstrate the potential performance improvements that you can achieve with the AWS Common Runtime, we collected four benchmark datasets representative of the steps of the ML lifecycle:
- Caltech-256: The entire Caltech 256 image dataset, containing 30,607 small image files averaging 40 kB in size, for a total dataset size of 1.1 GB.
- Caltech-256-WebDataset: The same Caltech 256 image dataset, but this time stored using the WebDataset format, which collects multiple images together into 100 MB “shard” objects. Sharded datasets can often achieve higher performance when using Amazon S3 for ML training.
- C4-en: The English-language subset of the C4 dataset based on the Common Crawl corpus, containing 1,024 files of 320 MB each.
- Checkpoint: A single 7.6 GB PyTorch checkpoint file, representative of a sharded checkpoint of a large ML model.
We used the AWS CLI to upload and download each of these datasets from a trn1n.32xlarge EC2 instance, both without and with the AWS CRT enabled. These are the results:
The CRT delivers speedups of 2x to 6x across these benchmarks, without any additional work other than updating to the latest release of the AWS CLI. We saw similar performance improvements for a Python application using Boto3 with the CRT enabled.
Getting started with the CRT in your application
To use the CRT in the AWS CLI, first install the latest version of the AWS CLI. This is also a great opportunity to update to version 2 of the AWS CLI if you haven’t already, as the CRT integration is only available in version 2. When running on a Trn1, P4d, or P5 EC2 instance, that’s all you need to do: the CRT is enabled by default when using CLI commands like aws s3 sync. On other instance types, you can enable the CRT by running the following command:
aws configure set s3.preferred_transfer_client crt
To use the CRT in Boto3, first make sure that you install Boto3 with the additional crt feature. For example, when installing with pip, run:
pip install "boto3[crt]"
On Trn1, P4d, and P5 instances, when Boto3 is installed with the crt feature, it will automatically use the CRT for upload_file and download_file calls. For example, to upload a file to S3 using the CRT:
import boto3

s3 = boto3.client('s3')
s3.upload_file('/tmp/hello.txt', 'doc-example-bucket', 'hello.txt')
On other instance types, you can use the s3transfer package to access the CRT with Boto3, although we don’t yet consider this package stable and it may change in the future.
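Because the CRT path depends on an optional dependency, it can be useful to confirm that the dependency is importable before expecting CRT-accelerated transfers. The check below is a minimal sketch that assumes the crt feature installs a Python module named `awscrt`; the helper name `crt_available` is hypothetical, and a successful import does not by itself guarantee that Boto3 will select the CRT on your instance type.

```python
import importlib.util


def crt_available() -> bool:
    """Return True if the awscrt module (assumed to be provided by the
    crt feature) can be found in the current Python environment."""
    return importlib.util.find_spec("awscrt") is not None


# Report the result without importing the module itself.
print("awscrt importable:", crt_available())
```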
Performance tuning
The CRT implements automatic performance optimizations for applications using S3, and the default settings will provide speedups in many circumstances. These defaults automatically configure the CRT based on the specifics of the instance type it is running on, including CPU topology, amount of memory, and the number and layout of Elastic Network Adapter (ENA) interfaces. Based on these details, the CRT chooses a parallelization strategy for S3 requests, including the number of parallel connections, the size of each request, and the number of requests per S3 IP address.
In some circumstances, you might want to override these defaults, for example to limit the amount of network bandwidth used by CRT transfers. When using the AWS CLI’s CRT integration, you can override the defaults by configuring the target_bandwidth parameter. For example, to limit transfers to 5 Gigabits per second, run:
aws configure set s3.target_bandwidth 5Gb/s
This configuration override is not yet available for Boto3, but will be exposed in a future release.
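When choosing a target bandwidth, a rough estimate of how long a transfer would take at that rate can help. The sketch below uses the 7.6 GB checkpoint from the benchmarks and the 5 Gb/s example above. It assumes decimal units (1 GB = 10^9 bytes, 1 Gb/s = 10^9 bits per second) and ignores request overhead, so the result is a lower bound rather than a prediction; the function name is hypothetical.

```python
def min_transfer_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Lower bound on transfer time: total bits divided by bits per second."""
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes * 8 / bandwidth_bits_per_s


CHECKPOINT_BYTES = 7.6e9   # the 7.6 GB checkpoint, in decimal units (assumption)
TARGET_BANDWIDTH = 5e9     # 5 Gb/s, matching the example setting above

# 7.6e9 bytes * 8 bits / 5e9 bits per second is about 12.16 seconds.
print(f"{min_transfer_seconds(CHECKPOINT_BYTES, TARGET_BANDWIDTH):.2f} s")
```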
Caveats and opting out
While this first release of the CRT for the CLI and Boto3 will improve performance for many ML applications, there are three caveats to be aware of.
Multi-process execution
The CRT achieves high-throughput data transfer by making S3 requests in parallel across multiple threads. This is a great fit for applications that only use one S3 client at a time, because these threads can spread across the vCPUs of your instance. However, if you use multiple processes that each create their own S3 client, these threads can contend with each other and reduce S3 performance. These multiple clients may also contend with each other for network bandwidth, creating congestion that reduces performance.
In this first release, the CRT integrations in the AWS CLI and Boto3 automatically detect when multiple processes are creating CRT-based S3 clients, and fall back to their non-CRT-based S3 clients in these cases. This fallback reduces the risk of contention on your system by ensuring at most one CRT client exists, but the other clients may see worse performance as a result. This limitation only affects multiple S3 clients. A single S3 client can be shared by many threads in the same process, or by many S3 transfers within the same AWS CLI invocation.
There are two common ways that your applications might end up with multiple processes creating their own S3 client. First, if you run multiple invocations of the AWS CLI at the same time, each CLI process has its own S3 client. For example, if you have previously used tricks like running the AWS CLI under the parallel or xargs -P utilities to improve performance, you will have multiple AWS CLI processes, each with their own S3 client. With the new CRT integration, you should prefer to use only one CLI process and let the CLI manage the parallelism of your transfer for you. Second, if you use Boto3 with ML frameworks like PyTorch, you may end up with multiple worker processes for data loading (for example, the num_workers argument to PyTorch’s DataLoader).
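Since a single S3 client can be shared by many threads in the same process, one way to stay on the CRT path is to keep a single client and fan work out across threads rather than processes. The following is a minimal sketch of that pattern. The helper `upload_all`, the thread count, and the file list are hypothetical, and `boto3` is imported lazily inside `main` so the helper can be exercised with any object that provides an `upload_file(filename, bucket, key)` method.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Tuple


def upload_all(client, bucket: str, files: Iterable[Tuple[str, str]],
               max_threads: int = 8) -> int:
    """Upload (local_path, key) pairs through ONE shared client.

    Every thread calls the same client object, so only a single S3
    client exists in the process. Returns the number of files uploaded.
    """
    pairs = list(files)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(client.upload_file, path, bucket, key)
                   for path, key in pairs]
        for future in futures:
            future.result()  # re-raise any upload error in the caller
    return len(pairs)


def main() -> None:
    import boto3  # lazy import: only needed when talking to S3 for real

    client = boto3.client("s3")  # create once, share across all threads
    upload_all(client, "doc-example-bucket", [("/tmp/hello.txt", "hello.txt")])
```

Calling `main()` requires AWS credentials and an existing bucket. Whether threads are a good substitute for worker processes depends on your framework and workload.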
Multi-region and cross-region access
The CRT integration in the AWS CLI and Boto3 does not currently support automatic region detection for S3 buckets. This means that if your application is accessing an S3 bucket in a different region from the one your instance is running in, you will need to manually specify the target region. You can do this using the --region argument in the AWS CLI, or by setting the AWS_REGION environment variable for both the AWS CLI and Boto3. For Boto3, because the region is configured at client creation time, this limitation also means that a single S3 client can only access buckets from a single region. If you need to access buckets from multiple regions, you will need to create multiple clients.
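One way to manage one client per region is a small cache keyed by region name, so each client is created once and reused. This is a sketch under stated assumptions: the class `RegionalS3Clients` and its `for_region` method are hypothetical names, and the default factory simply calls `boto3.client('s3', region_name=...)`.

```python
import threading
from typing import Any, Callable, Dict


def _default_factory(region: str) -> Any:
    import boto3  # lazy import so the cache itself has no hard dependency

    return boto3.client("s3", region_name=region)


class RegionalS3Clients:
    """Create at most one S3 client per region and reuse it afterwards."""

    def __init__(self, factory: Callable[[str], Any] = _default_factory):
        self._factory = factory
        self._clients: Dict[str, Any] = {}
        self._lock = threading.Lock()  # guard creation across threads

    def for_region(self, region: str) -> Any:
        with self._lock:
            if region not in self._clients:
                self._clients[region] = self._factory(region)
            return self._clients[region]
```

For example, `clients.for_region('us-west-2').download_file(...)` and `clients.for_region('eu-west-1').download_file(...)` would each go through a client configured for the bucket's own region.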
Transfer configuration
The CRT integration in Boto3 does not support the TransferConfig API for configuring the client at a per-transfer level. Instead, the CRT will automatically configure the client to maximize network bandwidth, and will share that bandwidth across all concurrent S3 requests in the same process.
Opting out of the CRT
You can opt out of the CRT if you need to work around any of these limitations. To disable the CRT integration for the AWS CLI, run:
aws configure set s3.preferred_transfer_client classic
Similarly, to disable the CRT S3 integration for Boto3, set preferred_transfer_client to classic in a TransferConfig, and pass that configuration to your Boto3 transfer calls:
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(preferred_transfer_client='classic')
client = boto3.client('s3', region_name='us-west-2')
client.upload_file('/tmp/file', Bucket='doc-example-bucket',
                   Key='test_file', Config=config)
Conclusion and future improvements
Amazon S3’s elasticity and high performance make it a great place to store ML training data and model checkpoints. With today’s improvements to the AWS CLI and Boto3, it’s even easier to optimize performance when accessing S3 in your ML pipeline, helping you complete jobs faster and reduce costs. In the future, we will enable the AWS Common Runtime on more instance types by default, and expose finer-grained tuning knobs to help you further optimize performance for your workloads. The AWS CLI, Boto3, and the AWS Common Runtime are all open-source projects, and we always welcome your feedback on their respective GitHub repositories.