AWS Machine Learning Blog
Announcing the Amazon S3 plugin for PyTorch
November 2023: On 11/22/2023, AWS announced the Amazon S3 Connector for PyTorch, a new connector that delivers high throughput for PyTorch training jobs that access data in Amazon S3. We recommend customers use the new connector for PyTorch training jobs that read and write data in Amazon S3. The Amazon S3 Connector for PyTorch delivers a new implementation of PyTorch’s dataset primitive that you can use to load training data from Amazon S3. It also includes a checkpointing interface to save and load checkpoints directly to Amazon S3, without first saving to local storage and writing custom code to upload to Amazon S3. To learn more, read the What’s New post and the landing page in the GitHub repository.
The Amazon S3 plugin for PyTorch is an open-source library built for use with the PyTorch deep learning framework to stream data from Amazon Simple Storage Service (Amazon S3). With this feature available in PyTorch Deep Learning Containers, you can use data from S3 buckets directly with the PyTorch dataset and dataloader APIs without first downloading it to local storage.
What is the Amazon S3 plugin for PyTorch?
The Amazon S3 plugin for PyTorch is designed to be a high-performance PyTorch dataset library that efficiently accesses data stored in S3 buckets. It provides streaming access to data of any size and therefore eliminates the need to provision local storage capacity. The library is designed to use the high throughput offered by Amazon S3 with minimal latency.
It also provides a way to transfer data from Amazon S3 in parallel when needed to get maximum performance without worrying about thread safety or multiple connections to Amazon S3. You can also stream data from .zip or .tar archives and shuffle the dataset within shards or across the shards as required. The Amazon S3 plugin for PyTorch works seamlessly with existing PyTorch code bases because S3Dataset and S3IterableDataset provided by this plugin are implementations of PyTorch’s internal Dataset and IterableDataset interfaces, so you don’t need to change the existing code to make it work with Amazon S3.
The library itself is file format agnostic and presents objects in Amazon S3 as a binary buffer (blob). You can apply any additional transformations on the data received from Amazon S3. You can also easily extend S3Dataset or S3IterableDataset to consume data from Amazon S3 and perform data processing as needed.
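Because each object arrives as raw bytes, decoding is left to the caller. The following is a minimal sketch of one such transformation, assuming the blob happens to be a .tar archive. The helper names `iter_tar_members` and `make_demo_tar` are hypothetical and not part of the plugin, and the archive is built in memory so the example runs without access to Amazon S3.

```python
import io
import tarfile


def iter_tar_members(blob: bytes):
    """Yield (member_name, member_bytes) for each regular file in a tar blob."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


def make_demo_tar(files: dict) -> bytes:
    """Build a small tar archive in memory (a stand-in for an S3 object)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Hypothetical shard holding one label file and one (fake) image file.
blob = make_demo_tar({"0001.cls": b"7", "0001.jpg": b"\xff\xd8fake-jpeg"})
samples = list(iter_tar_members(blob))
```

The same pattern applies to other formats: wrap the bytes in `io.BytesIO` and hand them to whichever parser matches the file type.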
Benefits of using the Amazon S3 plugin for PyTorch
The plugin offers the following benefits:
- Support for both map-style and iterable-style dataset interfaces – PyTorch supports two different types of datasets. The Amazon S3 plugin for PyTorch also provides the flexibility to use either map-style or iterable-style dataset interfaces based on your needs:
  - Map-style dataset – Represents a map from indexes or keys to data samples. It provides random access capabilities.
  - Iterable-style dataset – Represents an iterable over data samples. This type of dataset is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data.
- Support for various data formats – Training data can be in a variety of different formats, such as CSV, Parquet, and JPEG. This plugin is file format agnostic and presents objects in Amazon S3 as a binary buffer (blob). You can apply any additional transformations on the data received from Amazon S3.
- Support for shuffling – In deep learning, you may need to shuffle data across shards and within shards to reduce variance. This plugin provides a way to shuffle data in-memory within shards using ShuffleDataset or across shards by providing the input parameter shuffle_urls while extending S3IterableDataset.
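In-memory shuffling of a stream is commonly done with a bounded buffer. The sketch below illustrates that general technique in plain Python; it is not the plugin's ShuffleDataset implementation, and the function name `shuffle_buffer` and its `seed` parameter are assumptions made for illustration.

```python
import random


def shuffle_buffer(samples, buffer_size, seed=None):
    """Yield items from `samples` in randomized order using a bounded buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            # Move a randomly chosen element to the end, then emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Input exhausted: emit the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(10), buffer_size=4, seed=0))
```

A larger buffer mixes samples from more distant positions in the stream at the cost of more memory; a buffer of size 1 leaves the order unchanged.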
Building blocks
The Amazon S3 plugin for PyTorch lets you use data from Amazon S3 natively in PyTorch without adding complexity to your code. To achieve this, it relies heavily on the AWS SDK. AWS provides high-level utilities for managing transfers to and from Amazon S3 through the AWS SDK. This plugin uses standard TransferManager APIs from the AWS_SDK_CPP package underneath to communicate with Amazon S3. These APIs make extensive use of Amazon S3 multipart download capabilities to achieve enhanced throughput and reliability, and are also thread safe.
When dealing with large content sizes and high bandwidth, this can significantly increase throughput. TransferManager is also responsible for managing resources such as connections and threads, and hides the complexity of transferring files behind simple APIs.
To use TransferManager, the plugin has C++ APIs underneath for the following actions:
- Validating access to S3 buckets
- Parsing S3 paths
- Checking file existence
- Getting file sizes
- Listing files
- Reading files
To provide easy access to PyTorch users, the plugin uses Pybind11 to wrap the preceding C++ functions and make them available to be used as PyTorch dataset constructs.
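To make one of the listed actions concrete, here is a rough Python sketch of S3 path parsing. The plugin performs this step in C++; the function `parse_s3_path` below is a hypothetical illustration of the idea, not the plugin's code, and the bucket and key are placeholders.

```python
def parse_s3_path(path: str):
    """Split 's3://bucket/key' into (bucket, key).

    Raises ValueError when the scheme or the bucket name is missing.
    """
    prefix = "s3://"
    if not path.startswith(prefix):
        raise ValueError(f"not an S3 path: {path!r}")
    bucket, _, key = path[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {path!r}")
    return bucket, key


# Placeholder path; a bare bucket yields an empty key.
bucket, key = parse_s3_path("s3://my-bucket/train/shard-000.tar")
```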
The Amazon S3 plugin for PyTorch is available to use through pre-configured PyTorch Docker images, or directly from the GitHub repository.
Configuration
Before reading data from the S3 bucket, you need to provide the bucket Region parameter AWS_REGION. By default, a Regional endpoint is used for Amazon S3, with the Region controlled by AWS_REGION. If AWS_REGION isn’t specified, us-west-2 is used by default. You can specify it either by running export AWS_REGION=us-east-1 or through code with os.environ['AWS_REGION'] = 'us-east-1'.
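The fallback rule described above can be sketched as follows. The plugin resolves the Region internally; `resolve_region` is a hypothetical helper that only mirrors the documented behavior so you can check which Region would apply.

```python
import os

DEFAULT_REGION = "us-west-2"  # default stated in the text above


def resolve_region(env=None) -> str:
    """Return AWS_REGION from the given mapping, or the documented default."""
    env = os.environ if env is None else env
    return env.get("AWS_REGION") or DEFAULT_REGION


# Set the Region in code before any dataset is constructed.
os.environ["AWS_REGION"] = "us-east-1"
region = resolve_region()
```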
To read objects in a bucket that isn’t publicly accessible, you must provide AWS credentials through one of the following methods:
- Install and configure the AWS Command Line Interface (AWS CLI) with aws configure
- Set credentials in the AWS credentials profile file on the local system, located at ~/.aws/credentials on Linux, macOS, or Unix
- Set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
- If you’re using this library on an Amazon Elastic Compute Cloud (Amazon EC2) instance, specify an AWS Identity and Access Management (IAM) role and then give the EC2 instance access to that role
Use the library
Getting started with this library is easy, as we demonstrate in the following example.
First, log in to Amazon Elastic Container Registry (Amazon ECR):
You can use the following commands to run the container. You must use nvidia-docker for GPU images.
Use the map-style dataset
If each object in Amazon S3 contains a single training sample, then you can use the map-style dataset (S3Dataset). To partition data across nodes and to shuffle data, you can use this dataset with the PyTorch distributed sampler. Additionally, you can apply preprocessing to the data in Amazon S3 by extending the S3Dataset class. The following example code uses map-style S3Dataset for image datasets:
Please replace the S3 paths with your actual path. This same code is available in the amazon-s3-plugin-for-pytorch GitHub repo. You can run this example with the following code:
Use the iterable-style dataset
If each object in Amazon S3 contains multiple training samples (such as archive files containing multiple small files), we recommend using the iterable-style dataset implementation (S3IterableDataset).
Consider using a .tar file for image classification. You can load it easily by writing a custom Python generator function using the iterator returned by S3IterableDataset. (To create shards from a file dataset, refer to the following GitHub repo.) See the following code:
Please replace the S3 paths with your actual path. You can easily use this dataset with DataLoader for parallel data loading and preprocessing:
We can shuffle the sequence of fetching shards by setting shuffle_urls=True and calling the set_epoch method at the beginning of every epoch:
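As a conceptual sketch of shard-order shuffling, the snippet below reorders a list of shard URLs once per epoch. It assumes the epoch number is used as the random seed, which is a common convention but not confirmed here as the plugin's mechanism; `shard_order` is a hypothetical function and the URLs are placeholders.

```python
import random


def shard_order(urls, epoch, shuffle_urls=True):
    """Return the order in which shards are visited for a given epoch."""
    order = list(urls)
    if shuffle_urls:
        # Seeding with the epoch keeps the order reproducible and identical
        # on every worker that is given the same epoch number.
        random.Random(epoch).shuffle(order)
    return order


# Placeholder shard URLs.
shards = [f"s3://path/to/shard-{i:03d}.tar" for i in range(8)]
epoch_0 = shard_order(shards, epoch=0)
epoch_1 = shard_order(shards, epoch=1)
```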
The preceding code only shuffles the sequence of shards; the individual training samples within the shards are fetched in the same order. To shuffle the order of training samples across shards, use ShuffleDataset. ShuffleDataset maintains a buffer of data samples read from multiple shards and returns a random sample from it. The count of samples to be buffered is specified by buffer_size. To use ShuffleDataset, update the preceding example as follows:
This same code is available in the amazon-s3-plugin-for-pytorch GitHub repo. You can run this example with the following code:
Conclusion
In this post, we showed you how to use S3Dataset and S3IterableDataset to stream data directly from S3 buckets and perform training with PyTorch. We demonstrated this solution for a computer vision dataset, but you can apply the same methods to other use cases when the dataset is text files, such as natural language processing.
Laying the foundation to access datasets while training can be critical for many enterprises that are looking to eliminate storing data locally and still get the desired performance. With availability of the Amazon S3 plugin for PyTorch, you can now stream data from S3 buckets and perform the large-scale data processing needed for training in PyTorch.
The Amazon S3 plugin for PyTorch was designed for ease of use and flexibility with PyTorch.
To learn more on how to use this package, we recommend starting with our example use cases.
As we further develop and extend the Amazon S3 plugin for PyTorch, we welcome community participation through questions, requests, and contributions. Head over to the aws/amazon-s3-plugin-for-pytorch GitHub repository to get started!
About the Authors
Roshani Nagmote is a Software Developer for AWS Deep Learning. She focuses on building distributed Deep Learning systems and innovative tools to make Deep Learning accessible for all. In her spare time, she enjoys hiking and exploring new places, and she is a huge dog lover.
Rajesh Parangi Sharabhalingappa is a Senior Software Engineer at AWS Deep Learning. He works on platforms and libraries to make deep learning training easier for customers. Outside of work, he enjoys cycling.
Khaled ElGalaind is the engineering manager for AWS Deep Engine Benchmarking, focusing on performance improvements for Amazon Machine Learning customers. Khaled is passionate about democratizing deep learning. Outside of work, he enjoys volunteering with the Boy Scouts, BBQ, and hiking in Yosemite.
Aditya Bindal is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to train deep learning models on AWS. In his spare time, he enjoys spending time with his daughter, playing tennis, reading historical fiction, and traveling.