AWS Storage Blog

New enhancements for moving data between Amazon FSx for Lustre and Amazon S3

Introduction to Amazon FSx for Lustre

Amazon FSx for Lustre is a high-performance file system optimized for workloads such as machine learning, high-performance computing, video processing, financial modeling, electronic design automation, and analytics. Amazon FSx works natively with Amazon S3, making it easy for you to process cloud datasets with high-performance file systems. When linked to an S3 bucket, an FSx for Lustre file system transparently presents S3 objects as files and allows you to write results back to S3.

Customers also often need to control access to sensitive data on their file systems, process special files such as symbolic links, and maintain these controls and special files as they back up and restore data to long-term repositories such as Amazon S3. Amazon FSx provides file system commands such as hsm_archive to export new or changed files from Amazon FSx to S3, but these commands do not copy file permissions and do not provide a way to monitor or cancel the transfer.

In this blog post, we introduce the Data Repository Tasks application programming interface (API), a new AWS API that allows you to easily export files from Amazon FSx to S3. You can use the new API to initiate, monitor, and cancel the writing of new or changed files to S3. Since it’s an AWS-native API, you can use it to orchestrate data export tasks from cloud-native workflows such as Lambda-based serverless applications.

In addition to transferring file data and file permissions, this API also allows you to transfer symbolic links, file ownership metadata, and file time stamps to S3. The API minimizes transfer time by only copying files and directories whose contents or permissions have changed. File permissions and other file metadata are stored in S3 in the same format used by AWS DataSync and AWS Storage Gateway. This provides you a consistent mechanism to manage files and data in AWS.

When you create an Amazon FSx for Lustre file system backed by S3, the files in Amazon FSx assume the file permissions, ownership, and time stamps stored in S3. Additionally, FSx has quadrupled the speed at which file metadata is imported from S3, allowing you to launch S3-backed FSx file systems up to four times faster.

These enhancements to Amazon FSx for Lustre and the Data Repository Tasks API make it easier and more cost-effective to process S3 data at high speed for a broad set of workloads: workloads that process large numbers of small files, such as streaming ticker data and financial transactions in the financial services industry, and workloads that require access controls on sensitive data, such as DNA sequencing files in the genomics industry.

Amazon FSx for Lustre data processing workloads

Getting started with POSIX metadata and Data Repository Tasks on FSx for Lustre

In the example below, we will use Amazon FSx for Lustre to run a permissions-sensitive genomics workload. The process includes spinning up an FSx for Lustre file system linked to an S3 bucket containing our dataset, analyzing the dataset within our file system, and finally writing results back to S3.

Our S3 bucket s3://dataset-01 contains hundreds of compressed FASTQ objects comprising nucleotide sequence data, which were uploaded to S3 from an on-premises data center via AWS DataSync. Note that POSIX metadata for files is preserved in S3 objects when they are uploaded by AWS DataSync, AWS Storage Gateway, or Amazon FSx for Lustre.

S3 Bucket with hundreds of compressed FASTQ objects

Now let’s create an Amazon FSx for Lustre file system linked to this S3 bucket. Amazon FSx imports the objects in our S3 bucket as files, and “lazy loads” the file contents from S3 when we first access a file. Note that we can specify any path in our S3 bucket using the ImportPath field, and Amazon FSx for Lustre recursively imports all files from that path in a highly parallel fashion.

$ aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 3600 \
  --subnet-ids subnet-0a2b78705852896b8 \
  --lustre-configuration ImportPath=s3://dataset-01,ExportPath=s3://dataset-01  

{
    "FileSystem": {
        "FileSystemId": "fs-00a70d77ae2252abc",
        "Lifecycle": "CREATING",
        "DNSName": "fs-00a70d77ae2252abc.fsx.us-east-1.amazonaws.com",
        ...        
    }
}

Upon mounting the file system (see the Amazon FSx for Lustre user guide for mounting instructions), we see that all of our POSIX metadata has been preserved from our on-premises environment: POSIX permissions, UID, and GID are intact for files, directories, and symbolic links. File system administrators can rest easy knowing that only the users and groups specified by POSIX permissions have access to sensitive data.

$ ls -lhR
.:
total 73K
drw--w---- 2 algo1 algo1 33K Dec 17 23:04 output
dr-------- 2 algo1 algo1 41K Dec 17 22:49 sequences

./output:
total 0

./sequences:
total 19K
-r-------- 1 scrub algo1 1.0G Dec 17 20:36 G0-1711dd8c-ec7a-4b2c-9010-7585cc0dd9e8.fastq.gz
-r-------- 1 scrub algo1 1.0G Dec 17 20:35 G0-187e2993-9a86-427e-989b-09e768bd9ca8.fastq.gz
-r-------- 1 scrub algo1 1.0G Dec 17 20:36 G0-6f390113-f3a8-4ebc-88a5-3cb0ebc93c5b.fastq.gz
-r-------- 1 scrub algo1 1.0G Dec 17 20:36 G0-712d1f44-dd4b-41be-95e5-930311011131.fastq.gz
...

With our S3 objects imported into our Lustre file system, we can lazy load the files we need simply by reading them. After a file is lazy loaded, its contents are fully copied from S3 onto the Amazon FSx for Lustre file system, where they can be accessed with extremely low latency. Now that our data is available for fast access, we can run our highly parallel workload to obtain key insights. The workload produced the following files:

$ ls -lhR output
output:
total 512G
-rw------- 1 algo1 algo1 512M Dec 18 15:47 results_00.csv
-rw------- 1 algo1 algo1 512M Dec 18 15:46 results_01.csv
-rw------- 1 algo1 algo1 512M Dec 18 15:47 results_02.csv
-rw------- 1 algo1 algo1 512M Dec 18 15:47 results_03.csv
...

To export these files back to our S3 bucket, we can use Data Repository Tasks. Data Repository Tasks represent bulk operations between your Amazon FSx for Lustre file system and its linked S3 bucket; one such operation exports changed file system contents back to the bucket. We use the create-data-repository-task API below to create a Data Repository Task that exports only the output directory on our file system:

$ aws fsx create-data-repository-task \
  --file-system-id fs-00a70d77ae2252abc \
  --type EXPORT_TO_REPOSITORY \
  --paths output \
  --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://dataset-01/reports
  
{
    "DataRepositoryTask": {
        "TaskId": "task-08048701430a981b7",
        "Lifecycle": "PENDING",
        "Type": "EXPORT_TO_REPOSITORY",
        "CreationTime": 1576685114.413,
        "ResourceARN": "arn:aws:fsx:us-east-1:123456789012:task/task-08048701430a981b7",
        "Tags": [],
        "FileSystemId": "fs-00a70d77ae2252abc",
        "Paths": ["output"],
        "Report": {
            "Enabled": true,
            "Path": "s3://dataset-01/reports",
            "Format": "REPORT_CSV_20191124",
            "Scope": "FAILED_FILES_ONLY"
        }
    }
}

Note that this API can be invoked from any AWS workflow including Lambda-based serverless applications, allowing you to orchestrate data transfer tasks between FSx and S3 from cloud-native workflows.
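As an illustration of what such orchestration might look like, here is a minimal sketch of a Lambda-style handler that starts an export task using the boto3 SDK. The file system ID, paths, and report location are placeholders, and the FSx client is passed in as a parameter so the handler can be exercised locally with a stub rather than a live AWS account:

```python
# Minimal sketch: start an EXPORT_TO_REPOSITORY Data Repository Task.
# All identifiers below are placeholders; the FSx client is injected so the
# handler can be tested locally with a stub (no AWS credentials required).

def start_export_task(fsx_client, file_system_id, paths, report_path):
    """Create an export Data Repository Task and return its task ID."""
    response = fsx_client.create_data_repository_task(
        FileSystemId=file_system_id,
        Type="EXPORT_TO_REPOSITORY",
        Paths=paths,
        Report={
            "Enabled": True,
            "Scope": "FAILED_FILES_ONLY",
            "Format": "REPORT_CSV_20191124",
            "Path": report_path,
        },
    )
    return response["DataRepositoryTask"]["TaskId"]

def lambda_handler(event, context, fsx_client=None):
    # In a real Lambda, fsx_client would default to boto3.client("fsx").
    if fsx_client is None:
        import boto3  # imported lazily so the sketch runs without AWS access
        fsx_client = boto3.client("fsx")
    task_id = start_export_task(
        fsx_client,
        file_system_id=event["FileSystemId"],
        paths=event.get("Paths", ["output"]),
        report_path=event["ReportPath"],
    )
    return {"TaskId": task_id}

# Local demonstration with a stub client (no AWS calls are made):
class _StubFsx:
    def create_data_repository_task(self, **kwargs):
        return {"DataRepositoryTask": {"TaskId": "task-08048701430a981b7",
                                       "Lifecycle": "PENDING"}}

demo = lambda_handler(
    {"FileSystemId": "fs-00a70d77ae2252abc",
     "ReportPath": "s3://dataset-01/reports"},
    None,
    fsx_client=_StubFsx(),
)
print(demo["TaskId"])
```

With Scope set to FAILED_FILES_ONLY, the completion report written to the Path prefix lists only files that failed to export, which keeps the report small for large transfers.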

To see the status of a Data Repository Task, we can describe it using the describe-data-repository-task API. In our case, the Data Repository Task has already succeeded in exporting our results back to S3:

$ aws fsx describe-data-repository-tasks

{
    "DataRepositoryTasks": [
        {
            "TaskId": "task-08048701430a981b7",
            "Lifecycle": "SUCCEEDED",
            "Type": "EXPORT_TO_REPOSITORY",
            "CreationTime": 1576685114.413,
            "StartTime": 1576685127.896,
            "ResourceARN": "arn:aws:fsx:us-east-1:123456789012:task/task-08048701430a981b7",
            "Tags": [],
            "FileSystemId": "fs-00a70d77ae2252abc",
            "Paths": ["output"],
            "Status": {
                "TotalCount": 1000,
                "SucceededCount": 1000,
                "FailedCount": 0,
                "LastUpdatedTime": 1576685795.701
            },
            "EndTime": 1576685795.701,
            "Report": {
                "Enabled": true,
                "Path": "s3://dataset-01/reports",
                "Format": "REPORT_CSV_20191124",
                "Scope": "FAILED_FILES_ONLY"
            }
        }
    ]
}    
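For workflows that need to block until an export finishes, a simple polling loop over the describe call is enough. The sketch below is illustrative: the FSx client is injected so the loop can be exercised with a stub, and the task ID is a placeholder. A task still in the PENDING or EXECUTING state can also be stopped with the cancel-data-repository-task API.

```python
import time

# Lifecycle states after which a Data Repository Task will not change again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}

def wait_for_task(fsx_client, task_id, poll_seconds=30):
    """Poll describe_data_repository_tasks until the task reaches a terminal state."""
    while True:
        response = fsx_client.describe_data_repository_tasks(TaskIds=[task_id])
        task = response["DataRepositoryTasks"][0]
        if task["Lifecycle"] in TERMINAL_STATES:
            return task
        time.sleep(poll_seconds)

# Local demonstration with a stub that reports EXECUTING once, then SUCCEEDED:
class _StubFsx:
    def __init__(self):
        self._states = iter(["EXECUTING", "SUCCEEDED"])
    def describe_data_repository_tasks(self, **kwargs):
        return {"DataRepositoryTasks": [
            {"TaskId": kwargs["TaskIds"][0], "Lifecycle": next(self._states)}
        ]}

final = wait_for_task(_StubFsx(), "task-08048701430a981b7", poll_seconds=0)
print(final["Lifecycle"])
```

The describe call also returns the Status counts shown above, so the same loop can surface progress (SucceededCount out of TotalCount) while the task is EXECUTING.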

Note that Amazon FSx for Lustre maintains our POSIX permissions when exporting files back to S3, meaning our access controls are preserved if we later use these objects in another Amazon FSx for Lustre file system, or with AWS DataSync or AWS Storage Gateway.

As demonstrated above, Amazon FSx for Lustre provides a seamless experience moving cloud datasets to and from Amazon S3. Customers can use Amazon FSx APIs to create a file system linked to their S3 bucket, and after running their workload, export data back to their S3 bucket.

Cleaning up

To clean up your Amazon FSx for Lustre file system, you can run the following command:

$ aws fsx delete-file-system \
  --file-system-id fs-00a70d77ae2252abc

This ensures that the Amazon FSx file system created while following along with this post does not incur additional charges. Note that deleting the file system does not delete the objects in your linked S3 bucket, including any exported results, which continue to be billed at standard S3 rates.

Summary

Amazon FSx for Lustre’s native integration with Amazon S3 allows customers to easily and quickly process datasets stored on S3. The enhancements outlined here further simplify the processing of datasets stored on S3 by providing:

  1. The ability to transfer POSIX metadata between your Amazon FSx for Lustre file system and S3, allowing you to maintain POSIX permissions, ownership, and time stamps between Amazon FSx and S3. See the FSx for Lustre user guide to learn more.
  2. A new family of APIs that let you initiate, monitor, and cancel the transfer of data between your Amazon FSx for Lustre file system and its linked S3 bucket. See the FSx for Lustre API reference documentation to learn more.
  3. A quadrupling of the speed at which file metadata is imported from S3 into your Amazon FSx for Lustre file system, allowing you to launch S3-backed Amazon FSx file systems up to four times faster.
Tushar Saxena

Tushar Saxena is a Principal Product Manager at Amazon, with the mission to grow AWS’ file storage business. Prior to Amazon, he led telecom infrastructure business units at two companies, and played a central role in launching Verizon’s fiber broadband service. He started his career as a researcher at GE R&D and BBN, working in computer vision, Internet networks, and video streaming.

Justin Kennedy

Justin Kennedy is a Software Development Engineer on the FSx for Lustre team. He focuses primarily on FSx for Lustre's integration with Amazon S3. Outside of work, Justin loves reading and listening to music.