Automatically import objects from Amazon S3 into Amazon FSx for Lustre

Many enterprises and other AWS customers store their datasets and build their data lakes on Amazon S3. Because Amazon FSx for Lustre is deeply integrated with S3, customers can create a new FSx for Lustre file system linked to their S3 bucket in minutes. When linked to an S3 data repository, an FSx for Lustre file system transparently presents S3 objects as files. This enables customers to process data faster using a high-performance, POSIX-compliant, fully managed file system. After processing is complete, customers can write results back to the linked S3 data repository using the data repository task API.

Until last Thursday, July 23, FSx for Lustre imported files and directories from your linked Amazon S3 bucket only once, during file system creation. Now, with our new import preferences setting, you can configure your FSx for Lustre file system to automatically update its contents as new objects are added to, or existing objects are updated in, your S3 bucket. Customers no longer need to create a new file system, perform manual copy operations, or deploy custom workflows to keep their FSx for Lustre file system updated with new or changed contents in the linked S3 data repository. This setting is only available on FSx for Lustre file systems created after 3:00 pm EDT, July 23, 2020.

In this blog post, I show you how to configure and update the new import preferences setting. The aim is to enable you to process fresh data as it is loaded into S3, without manually copying files or creating a new file system and restarting your workload.

Example use cases and industries

Customers use FSx for Lustre for a wide spectrum of use cases. High-performance file systems are well suited to media processing and transcoding, electronic design automation (EDA), autonomous vehicles, big data and financial analytics, machine learning, and high performance computing (HPC).

FSx for Lustre makes it easy and cost effective to launch and run the world’s most popular high-performance file system. FSx for Lustre was built to quickly and cost-effectively process the fastest-growing datasets in the world. Lustre is the most widely used file system among the 500 fastest computers in the world. FSx for Lustre provides submillisecond latencies, hundreds of gigabytes per second of throughput, and millions of IOPS, enabling compute-intensive workloads to complete faster.

Let’s review some industries that can benefit from this feature:

Autonomous vehicles – customers regularly upload large amounts of sensor data to Amazon S3 from vehicles when they return to garages after periodic test drives.
Oil and gas – customers upload hydrophone sensor data to Amazon S3 from seismic vessels when they return from periodic acquisition runs across oceans.
Financial services – customers continuously upload large amounts of financial transaction data (point-of-sale transactions from around the world, stock and bond time-series ticker data from hundreds of global stock exchanges, and so on) for analysis and trading.
Life sciences – customers can upload new DNA data from thousands of individuals into Amazon S3 daily.
Media and entertainment – customers who render and transcode videos continuously upload new videos to Amazon S3 for processing.

These industries, and many others, can use the import preferences setting to import new data to FSx for Lustre as it is deposited in Amazon S3. By doing so, their workloads can analyze data as it becomes available, without needing to pause, spin down, and re-create an FSx for Lustre file system.

Overview of import preferences

You can configure your FSx for Lustre file system to automatically update its contents as new objects are added, or as existing objects are updated in your S3 bucket. The import preferences setting allows you to:

Import objects that are added to my bucket: (Default) FSx for Lustre automatically imports any new objects added to the linked Amazon S3 data repository that do not currently exist in the file system. FSx for Lustre does not import updates to existing S3 objects in the file system. FSx for Lustre does not delete files from the file system that are deleted from the linked data repository.
Import objects that are added to or changed in my bucket: FSx for Lustre automatically imports any new objects added to the linked S3 data repository and any existing objects that are changed after file system creation, or after you set the import preferences to this option. FSx for Lustre does not delete files from the file system that are deleted from the linked data repository.
Do not import any objects: FSx for Lustre only imports files from the linked data repository when the file system is created. FSx for Lustre does not import any new or changed objects after file system creation.
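The three options can be summarized as a small lookup. The following is a minimal Python sketch of the behavior described above, not FSx code: the function name `is_imported` and the event labels "added", "changed", and "deleted" are illustrative assumptions. The policy names NONE, NEW, and NEW_CHANGED are the AutoImportPolicy values that the AWS CLI uses for these three options.

```python
# Illustrative model of the three import preferences; this is not FSx code.
IMPORTED_EVENTS = {
    "NONE": set(),                        # Do not import any objects
    "NEW": {"added"},                     # Import objects that are added to my bucket
    "NEW_CHANGED": {"added", "changed"},  # ...added to or changed in my bucket
}


def is_imported(policy: str, event: str) -> bool:
    """Return True if an S3 event of this kind is reflected in the file system."""
    if policy not in IMPORTED_EVENTS:
        raise ValueError(f"unknown AutoImportPolicy: {policy}")
    return event in IMPORTED_EVENTS[policy]


# Print the full decision table, one row per policy.
for policy in ("NONE", "NEW", "NEW_CHANGED"):
    row = {event: is_imported(policy, event) for event in ("added", "changed", "deleted")}
    print(policy, row)
```

Note that "deleted" is never imported under any policy, which matches the statement that FSx for Lustre does not delete files that are deleted from the linked data repository.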

When FSx for Lustre imports new objects from the linked Amazon S3 bucket, it only downloads the names, prefixes, and permissions (that is, metadata) of those objects and makes them visible as new files and directories in the file system. If an object does not include metadata, FSx for Lustre uses default permissions of root:root and 755. If, during an import, a changed object in the data repository no longer contains its metadata, FSx for Lustre keeps the file’s current metadata values rather than applying the default permissions. The contents of S3 objects are loaded into the file system when first accessed by your application. Subsequent reads of these files are served directly out of the file system with low, consistent latencies. This “lazy-load” behavior for new and changed files is identical to that of new FSx for Lustre file system creation.
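The permission rules in the preceding paragraph amount to a three-step fallback. Here is a minimal Python sketch under stated assumptions: the function name `resolve_permissions` and the (owner, group, mode) tuple are hypothetical and only illustrate the order of precedence described above.

```python
from typing import Optional, Tuple

Perms = Tuple[str, str, int]  # (owner, group, mode); a hypothetical representation

DEFAULT_PERMS: Perms = ("root", "root", 0o755)


def resolve_permissions(object_perms: Optional[Perms],
                        current_perms: Optional[Perms] = None) -> Perms:
    """Pick the permissions a file ends up with after an import.

    object_perms:  metadata carried by the S3 object, or None if it has none.
    current_perms: permissions of the file already in the file system,
                   or None if the object is new to the file system.
    """
    if object_perms is not None:
        return object_perms      # metadata on the object wins
    if current_perms is not None:
        return current_perms     # changed object lost its metadata: keep current values
    return DEFAULT_PERMS         # new object without metadata: root:root and 755


owner, group, mode = resolve_permissions(None)
print(owner, group, oct(mode))
```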

You can set import preferences when you create a new S3-linked FSx for Lustre file system using the Amazon FSx API, the AWS Command Line Interface (AWS CLI), or the AWS Management Console. You can update the import preferences for an existing file system using the AWS Management Console, the update-file-system AWS CLI command, or the UpdateFileSystem API.

You can monitor the Lifecycle state of the data repository configuration on your file system using the AWS Management Console, the describe-file-systems AWS CLI command, or the DescribeFileSystems API to determine when import preferences have been successfully enabled, or whether an error was encountered.

The following are some considerations when setting import preferences to Import objects that are added to my bucket or Import objects that are added to or changed in my bucket:

The FSx for Lustre file system and its linked Amazon S3 data repository should be located in the same AWS Region.
The linked Amazon S3 data repository should not be in the AWS Public Dataset program. If you must use a public dataset to import objects into FSx for Lustre, the import preferences should be set to Do not import any objects.
The data repository should be in the AVAILABLE Lifecycle state.
If you update an existing file system’s import preferences to Import objects that are added to my bucket or Import objects that are added to or changed in my bucket, only files added or changed in the linked data repository after you update the import preferences are reflected in your file system. When FSx for Lustre imports a file that has changed on the linked data repository, it overwrites the local file with the imported version, even if the file is write-locked.
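The first three considerations above can be turned into a simple pre-flight check. The following Python sketch is only an illustration: the function name `auto_import_blockers` and its parameters are assumptions, and you would have to supply the inputs yourself (for example, from your own inventory of buckets and file systems).

```python
from typing import List


def auto_import_blockers(fs_region: str, bucket_region: str,
                         is_public_dataset: bool, repo_lifecycle: str) -> List[str]:
    """Return the reasons, if any, not to enable an automatic import option."""
    problems = []
    if fs_region != bucket_region:
        problems.append("file system and bucket are in different AWS Regions")
    if is_public_dataset:
        problems.append("bucket is in the AWS Public Dataset program; "
                        "use Do not import any objects instead")
    if repo_lifecycle != "AVAILABLE":
        problems.append(f"data repository Lifecycle is {repo_lifecycle}, not AVAILABLE")
    return problems


# An empty list means none of the three considerations is violated.
print(auto_import_blockers("us-east-2", "us-east-2", False, "AVAILABLE"))
```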

New objects, and new and changed objects, going into Amazon S3 and then automatically imported into Amazon FSx for Lustre

To learn more about the conditions under which FSx for Lustre cannot automatically import new or changed files, see the Amazon FSx for Lustre documentation.

FSx for Lustre can import thousands of new Amazon S3 objects per second. Larger FSx for Lustre file systems import at a higher rate. Note that files are not guaranteed to appear in FSx for Lustre in the order in which they are added to S3. In most cases, enabling the import preferences has no impact on the performance of FSx for Lustre. With that setting active, you can still enjoy submillisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS.

During periods when you’re performing reads and writes on FSx for Lustre at a very high rate, it may take longer to import objects into FSx for Lustre. The same is true when you are adding objects to Amazon S3 at a very high rate.

Getting started with import preferences

First, create a new FSx for Lustre file system linked to an S3 bucket called autoimport-demo while setting the import preference on the file system to Import objects that are added to my bucket. When using the AWS CLI, you must set the AutoImportPolicy to NEW as shown here:

$ aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 1200 \
    --subnet-ids subnet-123a456b \
    --lustre-configuration DeploymentType=SCRATCH_2,AutoImportPolicy=NEW,ImportPath=s3://autoimport-demo,ExportPath=s3://autoimport-demo

You can see that the file system and its data repository are both in the CREATING Lifecycle state, as shown in the following output:

{
    "FileSystem": {
        "OwnerId": "012345678910",
        "CreationTime": 1595000829.048,
        "FileSystemId": "fs-0f15953eb70fe22a6",
        "FileSystemType": "LUSTRE",
        "Lifecycle": "CREATING",
        "StorageCapacity": 1200,
        "StorageType": "SSD",
        "VpcId": "vpc-a12b34c5",
        "SubnetIds": [
            "subnet-123a456b"
        ],
        "DNSName": "fs-0f15953eb70fe22a6.fsx.us-east-2.aws.internal",
        "ResourceARN": "arn:aws:fsx:us-east-2:012345678910:file-system/fs-0f15953eb70fe22a6",
        "Tags": [],
        "LustreConfiguration": {
            "WeeklyMaintenanceStartTime": "2:09:30",
            "DataRepositoryConfiguration": {
                "Lifecycle": "CREATING",
                "ImportPath": "s3://autoimport-demo",
                "ExportPath": "s3://autoimport-demo",
                "ImportedFileChunkSize": 1024,
                "AutoImportPolicy": "NEW"
            },
            "DeploymentType": "SCRATCH_2",
            "MountName": "ifsh7bmv"
        }
    }
}
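If you script file system creation, the create-file-system command shown earlier can be assembled as an argument list. This is a minimal Python sketch: the function name `build_create_command` and its parameters are illustrative assumptions, while the flags and values are the ones used in this post. The function only builds the argument list and does not call AWS.

```python
from typing import List

VALID_POLICIES = ("NONE", "NEW", "NEW_CHANGED")


def build_create_command(bucket: str, subnet_id: str, policy: str = "NEW",
                         storage_capacity_gib: int = 1200) -> List[str]:
    """Assemble the `aws fsx create-file-system` arguments used in this post."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"AutoImportPolicy must be one of {VALID_POLICIES}")
    # The shorthand value is a single comma-separated string of key=value pairs.
    lustre_config = ",".join([
        "DeploymentType=SCRATCH_2",
        f"AutoImportPolicy={policy}",
        f"ImportPath=s3://{bucket}",
        f"ExportPath=s3://{bucket}",
    ])
    return [
        "aws", "fsx", "create-file-system",
        "--file-system-type", "LUSTRE",
        "--storage-capacity", str(storage_capacity_gib),
        "--subnet-ids", subnet_id,
        "--lustre-configuration", lustre_config,
    ]


print(" ".join(build_create_command("autoimport-demo", "subnet-123a456b")))
```

To actually run the command, the returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where the AWS CLI is installed and configured.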

You can monitor the Lifecycle state of your file system using the following AWS CLI command:

$ aws fsx describe-file-systems --file-system-id fs-0f15953eb70fe22a6

When the Lifecycle state of the data repository becomes AVAILABLE, the file system is ready for use with the import preferences enabled.

{
    "FileSystems": [
        {
            "OwnerId": "012345678910",
            "CreationTime": 1595001225.485,
            "FileSystemId": "fs-0f15953eb70fe22a6",
            "FileSystemType": "LUSTRE",
            "Lifecycle": "AVAILABLE",
            "StorageCapacity": 1200,
            "StorageType": "SSD",
            "VpcId": "vpc-a12b34c5",
            "SubnetIds": [
                "subnet-123a456b"
            ],
            "NetworkInterfaceIds": [
                "eni-01ab234c56789de01",
                "eni-02ab234c56789de02"
            ],
            "DNSName": "fs-0f15953eb70fe22a6.fsx.us-east-2.aws.internal",
            "ResourceARN": "arn:aws:fsx:us-east-2:012345678910:file-system/fs-0f15953eb70fe22a6",
            "Tags": [],
            "LustreConfiguration": {
                "WeeklyMaintenanceStartTime": "2:09:30",
                "DataRepositoryConfiguration": {
                    "Lifecycle": "AVAILABLE",
                    "ImportPath": "s3://autoimport-demo",
                    "ExportPath": "s3://autoimport-demo",
                    "ImportedFileChunkSize": 1024,
                    "AutoImportPolicy": "NEW"
                },
                "DeploymentType": "SCRATCH_2",
                "MountName": "ifsh7bmv"
            }
        }
    ]
}
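When you capture this output in a script, the fields of interest are the two Lifecycle values and the AutoImportPolicy. The following Python sketch parses the JSON text of a describe-file-systems response; the function name `repository_status` is an assumption, and `SAMPLE` is a trimmed copy of the output above that keeps only the fields the function reads.

```python
import json

# Trimmed copy of the describe-file-systems output shown above.
SAMPLE = """
{
  "FileSystems": [
    {
      "FileSystemId": "fs-0f15953eb70fe22a6",
      "Lifecycle": "AVAILABLE",
      "LustreConfiguration": {
        "DataRepositoryConfiguration": {
          "Lifecycle": "AVAILABLE",
          "AutoImportPolicy": "NEW"
        }
      }
    }
  ]
}
"""


def repository_status(describe_output: str, file_system_id: str) -> dict:
    """Extract the file system and data repository states for one file system."""
    document = json.loads(describe_output)
    for fs in document["FileSystems"]:
        if fs["FileSystemId"] == file_system_id:
            repo = fs["LustreConfiguration"]["DataRepositoryConfiguration"]
            return {
                "file_system": fs["Lifecycle"],
                "data_repository": repo["Lifecycle"],
                "policy": repo["AutoImportPolicy"],
            }
    raise KeyError(f"{file_system_id} not found in output")


print(repository_status(SAMPLE, "fs-0f15953eb70fe22a6"))
```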

You can also use the AWS Management Console to configure the import preferences under the Data repository Import/Export settings at the time of new file system creation.

Once the file system is created, you can install the Lustre client and mount the file system on your compute instances as outlined in our documentation.

Upon mounting the new file system on a compute instance, you can see that all the POSIX metadata for the objects in Amazon S3 has been preserved, ensuring secure access to your data. If the object does not include metadata, then FSx for Lustre uses default permissions for the files.

$ sudo mount -t lustre -o noatime,flock fs-0f15953eb70fe22a6.fsx.us-east-2.aws.internal@tcp:/ifsh7bmv /fsx

$ ls -lt /fsx
total 68
drwxrwxr-x 2 autoimport autoimport 33280 Jul 17 15:54 autoimport-dir01
drwxrwxr-x 2 autoimport autoimport 33280 Jul 17 15:54 autoimport-dir02
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:16 file-05
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:16 file-04
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:15 file-03
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:15 file-02
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:15 file-01

Next, let’s upload two new files, “file-06” and “file-07,” to the S3 bucket.

Upon checking the file system, we can see that FSx for Lustre has automatically imported these new objects into the file system. FSx for Lustre typically imports new objects within seconds of being added to S3, but can sometimes take a minute or longer.

$ ls -lt /fsx
total 69
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 16:22 file-07
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 16:22 file-06
drwxrwxr-x 2 autoimport autoimport 33280 Jul 17 15:54 autoimport-dir01
drwxrwxr-x 2 autoimport autoimport 33280 Jul 17 15:54 autoimport-dir02
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:16 file-05
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:16 file-04
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:15 file-03
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:15 file-02
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:15 file-01

Updating the import preferences

Now, let’s modify the import preferences to Import objects that are added to or changed in my bucket using the update-file-system AWS CLI command and setting the AutoImportPolicy to NEW_CHANGED:

$ aws fsx update-file-system --file-system-id fs-0f15953eb70fe22a6 --lustre-configuration AutoImportPolicy=NEW_CHANGED

You can see that the Lifecycle state of the data repository is UPDATING, while the file system itself remains AVAILABLE. The data repository’s Lifecycle state changes to AVAILABLE once the update is completed.

{
    "FileSystem": {
        "OwnerId": "012345678910",
        "CreationTime": 1595001225.485,
        "FileSystemId": "fs-0f15953eb70fe22a6",
        "FileSystemType": "LUSTRE",
        "Lifecycle": "AVAILABLE",
        "StorageCapacity": 1200,
        "StorageType": "SSD",
        "VpcId": "vpc-a12b34c5",
        "SubnetIds": [
            "subnet-123a456b"
        ],
        "NetworkInterfaceIds": [
            "eni-01ab234c56789de01",
            "eni-02ab234c56789de02"
        ],
        "DNSName": "fs-0f15953eb70fe22a6.fsx.us-east-2.aws.internal",
        "ResourceARN": "arn:aws:fsx:us-east-2:012345678910:file-system/fs-0f15953eb70fe22a6",
        "Tags": [],
        "LustreConfiguration": {
            "WeeklyMaintenanceStartTime": "2:09:30",
            "DataRepositoryConfiguration": {
                "Lifecycle": "UPDATING",
                "ImportPath": "s3://autoimport-demo",
                "ExportPath": "s3://autoimport-demo",
                "ImportedFileChunkSize": 1024,
                "AutoImportPolicy": "NEW_CHANGED"
            },
            "DeploymentType": "SCRATCH_2",
            "MountName": "ifsh7bmv"
        }
    }
}

You can monitor the Lifecycle state of the file system using the describe-file-systems AWS CLI command.
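In automation, you would typically poll rather than check by hand. The following is a minimal Python sketch of such a loop, under stated assumptions: `wait_for_data_repository` is a hypothetical helper, and `describe` is any zero-argument callable you supply that returns a response shaped like the describe-file-systems output in this post (for example, a wrapper around the AWS CLI or an SDK call, not shown here). Treating every state other than CREATING, UPDATING, and AVAILABLE as an error is also an assumption.

```python
import time
from typing import Callable

IN_PROGRESS = ("CREATING", "UPDATING")


def wait_for_data_repository(describe: Callable[[], dict],
                             timeout_s: float = 600.0,
                             interval_s: float = 15.0,
                             sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the data repository Lifecycle is AVAILABLE, then return it."""
    deadline = time.monotonic() + timeout_s
    while True:
        fs = describe()["FileSystems"][0]
        state = fs["LustreConfiguration"]["DataRepositoryConfiguration"]["Lifecycle"]
        if state == "AVAILABLE":
            return state
        if state not in IN_PROGRESS:
            # Assumption: any other state means the update did not succeed.
            raise RuntimeError(f"data repository entered state {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {state} after {timeout_s} seconds")
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the loop easy to exercise with canned responses before pointing it at a real file system.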

You can perform the same operation from the FSx for Lustre console by selecting the file system, then choosing the Actions drop-down menu, and finally choosing Update import preference as shown in the following screenshot:

Next, select Import objects that are added to or changed in my bucket from the three choices shown and choose Update.

Before we make changes to a few existing objects in our Amazon S3 bucket, let’s review their contents and permissions:

The file “file-01” has the following contents:

$ cat file-01
I am file number 01

The file “file-02” has permissions set to 664 as shown here:

$ ls -ltr file-02
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:15 file-02

I updated the contents of “file-01,” changed the permissions on “file-02” to 755, and uploaded these changed objects to our S3 bucket. I also deleted “file-03.”

The following is an updated view of objects in the S3 bucket:

As you can see in the following output, FSx for Lustre automatically updated the file system with the two changed objects “file-01” and “file-02.” The file system update happened within a few seconds of updating the import preferences:

$ ls -lt /fsx
total 69
-rw-rw-r-- 1 autoimport autoimport    77 Jul 17 16:50 file-01
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 16:22 file-07
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 16:22 file-06
drwxrwxr-x 2 autoimport autoimport 33280 Jul 17 15:54 autoimport-dir01
drwxrwxr-x 2 autoimport autoimport 33280 Jul 17 15:54 autoimport-dir02
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:16 file-05
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:16 file-04
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:15 file-03
-rwxr-xr-x 1 autoimport autoimport    20 Jul 17 14:15 file-02

$ cat /fsx/file-01
I am file number 01
I have been updated for the Automatic Import policy demo

You can see the updated contents of “file-01” and the updated permissions for “file-02.” Notice also that “file-03,” which was deleted from our S3 bucket, is still present on the FSx for Lustre file system. Objects deleted from your data repository are not automatically deleted from your FSx for Lustre file system.
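Because deletions are not propagated, you may want to find files that exist in the file system but no longer exist in the bucket. A minimal Python sketch, assuming you have already collected both name lists yourself (the function name `files_missing_from_bucket` is hypothetical, and directories are left out for brevity):

```python
from typing import Iterable, List


def files_missing_from_bucket(fs_files: Iterable[str],
                              s3_keys: Iterable[str]) -> List[str]:
    """Return names present in the file system but absent from the bucket."""
    return sorted(set(fs_files) - set(s3_keys))


# File names from this walkthrough; "file-03" was deleted from the bucket.
fs_files = ["file-01", "file-02", "file-03", "file-04",
            "file-05", "file-06", "file-07"]
s3_keys = ["file-01", "file-02", "file-04",
           "file-05", "file-06", "file-07"]

print(files_missing_from_bucket(fs_files, s3_keys))  # prints ['file-03']
```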

Disabling the import preferences

Now, let’s modify the import preferences to Do not import any objects by setting the AutoImportPolicy to NONE using the update-file-system AWS CLI command:

$ aws fsx update-file-system --file-system-id fs-0f15953eb70fe22a6 --lustre-configuration AutoImportPolicy=NONE

Output from the preceding command is shown here:

{
    "FileSystem": {
        "OwnerId": "012345678910",
        "CreationTime": 1595001225.485,
        "FileSystemId": "fs-0f15953eb70fe22a6",
        "FileSystemType": "LUSTRE",
        "Lifecycle": "AVAILABLE",
        "StorageCapacity": 1200,
        "StorageType": "SSD",
        "VpcId": "vpc-a12b34c5",
        "SubnetIds": [
            "subnet-123a456b"
        ],
        "NetworkInterfaceIds": [
            "eni-01ab234c56789de01",
            "eni-02ab234c56789de02"
        ],
        "DNSName": "fs-0f15953eb70fe22a6.fsx.us-east-2.aws.internal",
        "ResourceARN": "arn:aws:fsx:us-east-2:012345678910:file-system/fs-0f15953eb70fe22a6",
        "Tags": [],
        "LustreConfiguration": {
            "WeeklyMaintenanceStartTime": "2:09:30",
            "DataRepositoryConfiguration": {
                "Lifecycle": "UPDATING",
                "ImportPath": "s3://autoimport-demo",
                "ExportPath": "s3://autoimport-demo",
                "ImportedFileChunkSize": 1024,
                "AutoImportPolicy": "NONE"
            },
            "DeploymentType": "SCRATCH_2",
            "MountName": "ifsh7bmv"
        }
    }
}

After the Lifecycle state of the data repository became AVAILABLE, I uploaded two new files (“file-08” and “file-09”) and a directory (“autoimport-dir03”) to our S3 bucket.

Because the import preferences setting is now NONE, these files are not automatically imported into the FSx for Lustre file system.

$ ls -lt /fsx
total 69
-rw-rw-r-- 1 autoimport autoimport    77 Jul 17 16:50 file-01
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 16:22 file-07
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 16:22 file-06
drwxrwxr-x 2 autoimport autoimport 33280 Jul 17 15:54 autoimport-dir01
drwxrwxr-x 2 autoimport autoimport 33280 Jul 17 15:54 autoimport-dir02
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:16 file-05
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:16 file-04
-rw-rw-r-- 1 autoimport autoimport    20 Jul 17 14:15 file-03
-rwxr-xr-x 1 autoimport autoimport    20 Jul 17 14:15 file-02

Next, let’s update the import preferences to Import objects that are added to or changed in my bucket. Then, upload a new file (“file-10”) to our Amazon S3 bucket after the import preferences are updated.

Upon checking the FSx for Lustre file system, we can now see “file-10” imported into the FSx for Lustre file system. However, files “file-08,” “file-09,” and the directory “autoimport-dir03” have not been imported. If you update an existing file system’s import preferences to Import objects that are added to my bucket or Import objects that are added to or changed in my bucket, only files added or changed in the linked data repository after you update the import preferences are reflected in your file system. This is the reason the files “file-08,” “file-09,” and the directory “autoimport-dir03” were not imported into the file system. This is an important consideration when you set the import preferences to Do not import any objects.
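The walkthrough can be replayed as a timeline to show why only uploads made while an automatic import option is active appear in the file system. This Python sketch is a simplified model, not FSx behavior in full: the function name `replay` and the event tuples are assumptions, and only object additions are modeled.

```python
from typing import Iterable, List, Tuple

AUTO_IMPORT_POLICIES = ("NEW", "NEW_CHANGED")


def replay(timeline: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the uploads that get imported, given policy changes over time."""
    policy = "NONE"
    imported = []
    for kind, value in timeline:
        if kind == "policy":
            policy = value                 # preference updated at this point
        elif kind == "added" and policy in AUTO_IMPORT_POLICIES:
            imported.append(value)         # only imported if a policy is active now
    return imported


# The sequence of steps taken in this post.
timeline = [
    ("policy", "NEW"),
    ("added", "file-06"), ("added", "file-07"),
    ("policy", "NEW_CHANGED"),
    ("policy", "NONE"),
    ("added", "file-08"), ("added", "file-09"), ("added", "autoimport-dir03/"),
    ("policy", "NEW_CHANGED"),
    ("added", "file-10"),
]

print(replay(timeline))  # prints ['file-06', 'file-07', 'file-10']
```

The uploads made while the policy was NONE never appear, even after the policy is switched back, which mirrors the result shown above.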

When you switch the import preferences from Do not import any objects, it takes a few minutes for the updated preference to go into effect. You can check the status of the import preferences update using the describe-file-systems AWS CLI command.

Summary

In this blog, I introduced you to the import preferences setting. This great new feature in Amazon FSx for Lustre enables you to automatically import new or changed objects from a linked Amazon S3 data repository into your file system. This enables you to maintain an up-to-date view of your linked S3 data repository in your FSx for Lustre file system.

I looked at the three settings available with import preferences and covered how to configure the import preferences when creating a new FSx for Lustre file system. Next, I covered how to update the import preferences on an existing file system. I also showed you how the different import preferences settings work when new objects are added, or existing objects are changed, in your linked S3 data repository.

Customers can use the import preferences setting to import new data to FSx for Lustre as it is deposited in Amazon S3. This enables their workloads to analyze data as it becomes available, without needing to pause, spin down, and re-create an FSx for Lustre file system. Customers no longer need to perform manual copy operations or deploy custom workflows to keep their FSx for Lustre file system updated with new or changed contents in the linked S3 data repository. The import preferences setting makes it simple and easy for customers to automatically import new and changed files into the FSx for Lustre file system.

Thank you for reading this blog post. Please leave a comment if you have any questions or feedback.

AWS Storage Blog