Posted On: Aug 27, 2019

Amazon SageMaker now supports Amazon Elastic File System (Amazon EFS) and Amazon FSx for Lustre file systems as data sources for training machine learning models. Amazon FSx for Lustre is a high-performance file system optimized for workloads such as machine learning, analytics, and high performance computing. Amazon EFS provides a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources. Support for these file systems makes training models on Amazon SageMaker faster and simpler. A file system data source reduces start-up time by eliminating the data download step of the training process, and the training job itself can run faster by taking advantage of the file system's performance and throughput.

Until today, when using the File input mode, Amazon SageMaker transparently downloaded the full training set from Amazon S3 to local file storage at the start of a training job. Now, with Amazon FSx for Lustre, customers can accelerate their File mode training jobs by avoiding the initial Amazon S3 download. When an Amazon FSx for Lustre file system is linked to an Amazon S3 bucket, it automatically copies objects from Amazon S3 to the file system the first time they are accessed. The same FSx file system can also be shared across multiple SageMaker training jobs, so common objects are not downloaded repeatedly.

Also until today, customers could only use Amazon SageMaker with training sets stored on Amazon S3. Now, customers can also use training sets that are stored on Amazon EFS. Amazon SageMaker interacts directly with Amazon EFS, eliminating the need to copy data sets from Amazon EFS to Amazon S3 for use with Amazon SageMaker.
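As a rough illustration of what a file system data source looks like in a training job request, the sketch below builds one input channel entry in the shape used by the `InputDataConfig` field of the SageMaker `CreateTrainingJob` API. The helper name `file_system_channel`, the file system IDs, and the directory paths are hypothetical and chosen only for this example; it is a minimal sketch, not a complete training job definition.

```python
from typing import Any, Dict

# Values the CreateTrainingJob API accepts for these two fields.
_FILE_SYSTEM_TYPES = {"EFS", "FSxLustre"}
_ACCESS_MODES = {"ro", "rw"}


def file_system_channel(
    channel_name: str,
    file_system_id: str,
    file_system_type: str,
    directory_path: str,
    access_mode: str = "ro",
) -> Dict[str, Any]:
    """Build one InputDataConfig entry backed by EFS or FSx for Lustre.

    This helper is hypothetical; it only assembles a plain dictionary.
    """
    if file_system_type not in _FILE_SYSTEM_TYPES:
        raise ValueError(f"unsupported file system type: {file_system_type!r}")
    if access_mode not in _ACCESS_MODES:
        raise ValueError(f"unsupported access mode: {access_mode!r}")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "DirectoryPath": directory_path,
                "FileSystemAccessMode": access_mode,
            }
        },
    }


# Hypothetical IDs and paths, for illustration only.
efs_channel = file_system_channel(
    "train", "fs-0123456789abcdef0", "EFS", "/datasets/train"
)
fsx_channel = file_system_channel(
    "train", "fs-0fedcba9876543210", "FSxLustre", "/fsx/datasets/train"
)
```

A dictionary like `efs_channel` would typically be passed in the `InputDataConfig` list of a `create_training_job` call. The training job would also need a VPC configuration (subnets and security groups) that allows the training instances to reach the file system; those settings are omitted here.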

Most Amazon SageMaker built-in machine learning algorithms support Amazon EFS and Amazon FSx for Lustre as input data sources. This feature is available in all AWS Regions where the respective file systems are available. For details on Region availability, see the AWS Region table.

Visit the documentation for more information and read the blog post for how to use the feature.