Speed up training on Amazon SageMaker using Amazon FSx for Lustre and Amazon EFS file systems
April 2021 – The Amazon FSx section of this post has been updated to cover changes introduced to mount point names with scratch_2 and persistent_1 deployment options.
Amazon SageMaker provides a fully managed service for data science and machine learning workflows. One of the most important capabilities of Amazon SageMaker is its ability to run fully managed training jobs to train machine learning models.
Now, you can speed up your training job runs by training machine learning models from data stored in Amazon FSx for Lustre or Amazon Elastic File System (EFS). Amazon FSx for Lustre provides a high-performance file system natively integrated with Amazon Simple Storage Service (S3) and optimized for workloads such as machine learning, analytics, and high performance computing. Amazon EFS provides a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources.
Training machine learning models requires providing the training datasets to the training job. Until now, when using Amazon S3 as the training datasource in File input mode, all training data had to be downloaded from Amazon S3 to the EBS volumes attached to the training instances at the start of the training job. A distributed file system such as Amazon FSx for Lustre or EFS can speed up machine learning training by eliminating the need for this download step.
In this blog post, we go over the benefits of training your models using a file system, provide information to help you choose a file system, and show you how to get started.
Choosing a file system for training models on Amazon SageMaker
When considering whether you should train your machine learning models from a file system, the first thing to consider is: where does your training data reside now?
If your training data is already in Amazon S3, and your needs do not dictate faster training times for your training jobs, you can get started with Amazon SageMaker with no need for data movement. However, if you need faster startup and training times, we recommend that you take advantage of the Amazon FSx for Lustre file system, which is natively integrated with Amazon S3.
Amazon FSx for Lustre speeds up your training jobs by serving your Amazon S3 data to Amazon SageMaker at high speeds. The first time you run a training job, Amazon FSx for Lustre automatically copies data from Amazon S3 and makes it available to Amazon SageMaker. Additionally, the same Amazon FSx file system can be used for subsequent iterations of training jobs on Amazon SageMaker, preventing repeated downloads of common Amazon S3 objects. Because of this, Amazon FSx has the most benefit to training jobs that have training sets in Amazon S3, and in workflows where training jobs must be run several times using different training algorithms or parameters to see which gives the best result.
If your training data is already in an Amazon EFS file system, we recommend choosing Amazon EFS as the file system data source. This choice has the benefit of directly launching your training jobs from the data in Amazon EFS with no data movement required, resulting in faster training start times. This is often the case in environments where data scientists have home directories in Amazon EFS and are quickly iterating on their models by bringing in new data, sharing data with colleagues, and experimenting with which fields or labels to include. For example, a data scientist can use a Jupyter notebook to do initial cleansing on a training set, launch a training job from SageMaker, then use their notebook to drop a column and re-launch the training job, comparing the resulting models to see which works better.
Getting started with Amazon FSx for training on Amazon SageMaker
- Note your training data Amazon S3 bucket and path.
- Create an Amazon FSx file system with the desired size. Expand the Data repository integration. Choose Amazon S3 for Data repository type and specify the Import bucket and Import prefix corresponding to your Amazon S3 training data.
- Once created, note your file system id and mount name.
- Now, go to the Amazon SageMaker console and open the Training jobs page to create the training job, associate VPC subnets, security groups, and provide the file system as the data source for training.
- Create your training job:
- Provide the ARN for the IAM role with the required access control and permissions policy. Refer to AmazonSageMakerFullAccess for details.
- Specify a VPC that your training jobs and file system have access to. Also, verify that your security groups allow Lustre traffic over port 988 to control access to the training dataset stored in the file system. For more details, refer to Getting started with Amazon FSx.
- Choose file system as the data source and properly reference your file system id, type, and directory path. Note, the directory path begins with the mount name of the file system.
- Launch your training job.
Getting started with Amazon EFS for training on Amazon SageMaker
- Put your training data in its own directory in Amazon EFS.
- Now go to the Amazon SageMaker console and open the Training jobs page to create the training job, associate VPC subnets, security groups, and provide the file system as the data source for training.
- Create your training job:
- Provide the IAM role ARN for the IAM role with the required access control and permissions policy
- Specify a VPC that your training jobs and file system have access to. Also, verify that your security groups allow NFS traffic over port 2049 to control access to the training dataset stored in the file system.
- Choose file system as the data source and properly reference your file system id, path, and format.
- Launch your training job.
After your training job completes, you can view the status history of the training job to observe the faster download time when using a file system data source.
With the addition of Amazon FSx for Lustre and Amazon EFS as data sources for training machine learning models in Amazon SageMaker, you now have greater flexibility to choose a data source that is suited to your use case. In this blog post, we used a file system data source to train machine learning models, resulting in faster training start times by eliminating the data download step.
Go here to start training machine learning models yourself on Amazon SageMaker or refer to our sample notebook to train a linear learner model using a file system data source to learn more.
About the Authors
Vidhi Kastuar is a Sr. Product Manager for Amazon SageMaker, focusing on making machine learning and artificial intelligence simple, easy to use and scalable for all users and businesses. Prior to AWS, Vidhi was Director of Product Management at Veritas Technologies. For fun outside work, Vidhi loves to sketch and paint, work as a career coach, and spend time with his family and friends.
Will Ochandarena is a Principal Product Manager on the Amazon Elastic File System team, focusing on helping customers use EFS to modernize their application architectures. Prior to AWS, Will was Senior Director of Product Management at MapR.
Tushar Saxena is a Principal Product Manager at Amazon, with the mission to grow AWS’ file storage business. Prior to Amazon, he led telecom infrastructure business units at two companies, and played a central role in launching Verizon’s fiber broadband service. He started his career as a researcher at GE R&D and BBN, working in computer vision, Internet networks, and video streaming.