Use Amazon FSx for Lustre to share Amazon S3 data across accounts

As enterprises evolve their cloud governance practices, multiple teams working in separate accounts may need to share data. One team may oversee an enterprise data lake in one account, while a data science team develops a high-performance computing (HPC) use case in another account. Customers want to take advantage of low-cost object storage and be able to quickly consume this data from a high-performance file system to support HPC use cases without creating additional copies of the data.

Amazon FSx for Lustre has become a critical building block for customers accelerating machine learning (ML) and HPC use cases on AWS. It offers a fully POSIX-compliant, high-performance file system that delivers sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS. It also integrates natively with Amazon Simple Storage Service (Amazon S3), giving cloud practitioners file system access to their S3 datasets while keeping colder datasets in cost-efficient object storage.

In this blog post, we walk through integrating an Amazon FSx for Lustre file system with an Amazon S3 data lake, where the Amazon FSx file system and the Amazon S3 bucket reside in different AWS accounts in the same AWS Region. This solution helps you scale your AWS environment by allowing data to be shared from a centralized enterprise data lake to specialized team accounts that consume the data for ML and HPC use cases.

Solution overview

The solution addresses two permissions issues. The first is authorizing Amazon FSx for Lustre to read from an Amazon S3 bucket in another account for the initial load. The second is authorizing the file system to set up bucket event notifications (the s3:PutBucketNotification permission in the bucket policy) so that ongoing changes are replicated and the data stays in sync.

Solution architecture

Prerequisites

To deploy the solution described in this blog, you will need the following:

Two (2) AWS accounts. You can create an AWS account if you do not already have two accounts available.

Implement the solution

The following sections walk through linking an Amazon FSx for Lustre file system in ACCOUNT-A to an Amazon S3 bucket in ACCOUNT-B through a data repository association (DRA).

Step 1: Create Amazon FSx file system
Step 2: Create source S3 bucket
Step 3: Create data repository association
Step 4: Configure the S3 bucket policy

Step 1: Create Amazon FSx file system

In ACCOUNT-A, confirm you are in the US East (N. Virginia) Region and navigate to the Amazon FSx console.

Confirm you are in US East (N. Virginia)

1. Click Create file system. On the next screen, you will be presented with the different types of Amazon FSx file systems. Select Amazon FSx for Lustre and then click Next.

Select File System Type

2. Enter the File system name and Storage capacity, and set Data compression type to LZ4 to enable compression, as shown in the following image.

Enter file system name

3. In the Network & security section, choose the Virtual Private Cloud (VPC), VPC Security Groups, and Subnet for the new file system.

Network and security

The selected security group must allow inbound access for Amazon FSx for Lustre traffic (TCP ports 988, 1018-1023) to enable Amazon EC2 instances in the same VPC to mount the Amazon FSx file system. For more information, see the documentation on file system access control with Amazon VPC in the FSx for Lustre User Guide.
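For readers who prefer to script the security group rules, the following is a minimal sketch using boto3. It assumes that the client instances use their own security group and that the rules are added to the file system's security group; the function names and the security group IDs passed in are hypothetical.

```python
def lustre_ingress_permissions(client_sg_id):
    """Build IpPermissions entries for the Lustre ports named above
    (TCP 988 and TCP 1018-1023), allowing traffic from one security group."""
    port_ranges = [(988, 988), (1018, 1023)]
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": start,
            "ToPort": end,
            "UserIdGroupPairs": [{"GroupId": client_sg_id}],
        }
        for start, end in port_ranges
    ]


def allow_lustre_traffic(fsx_sg_id, client_sg_id):
    """Add the inbound rules to the file system's security group (ACCOUNT-A)."""
    import boto3  # imported lazily so the builder above has no dependencies

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.authorize_security_group_ingress(
        GroupId=fsx_sg_id,
        IpPermissions=lustre_ingress_permissions(client_sg_id),
    )
```

If the file system and the EC2 instances share a single security group, the same ID can be passed for both arguments.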

Amazon FSx does not support backups on file systems linked to an Amazon S3 bucket, so we need to disable backups for our new file system.

4. Under the Backup and maintenance section, choose Disabled and then click Next.

Backup and maintenance

5. Review the options for accuracy and click Create file system. The file system takes a few minutes to initialize. When it is ready, its status shows Available.
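The console steps above could also be scripted. The following is a minimal sketch with boto3 that mirrors the choices made in this post (LZ4 compression, backups disabled). The deployment type, throughput tier, and storage capacity are assumptions that the post does not specify, and the function names are hypothetical.

```python
def lustre_file_system_request(name, subnet_id, security_group_ids, storage_gib=1200):
    """Build keyword arguments for the FSx create_file_system call."""
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": storage_gib,  # assumption: smallest persistent SSD size
        "SubnetIds": [subnet_id],
        "SecurityGroupIds": list(security_group_ids),
        "Tags": [{"Key": "Name", "Value": name}],
        "LustreConfiguration": {
            "DeploymentType": "PERSISTENT_2",   # assumption, not stated in the post
            "PerUnitStorageThroughput": 125,    # assumption, MB/s per TiB
            "DataCompressionType": "LZ4",       # matches step 2
            "AutomaticBackupRetentionDays": 0,  # 0 disables automatic backups (step 4)
        },
    }


def create_and_wait(request, poll_seconds=30):
    """Create the file system in ACCOUNT-A and poll until it leaves CREATING."""
    import time
    import boto3  # imported lazily so the builder above has no dependencies

    fsx = boto3.client("fsx", region_name="us-east-1")
    fs_id = fsx.create_file_system(**request)["FileSystem"]["FileSystemId"]
    while True:
        fs = fsx.describe_file_systems(FileSystemIds=[fs_id])["FileSystems"][0]
        if fs["Lifecycle"] != "CREATING":
            return fs_id, fs["Lifecycle"]  # "AVAILABLE" on success
        time.sleep(poll_seconds)
```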

Step 2: Create source S3 bucket

Create an Amazon S3 bucket in ACCOUNT-B. You can find the detailed instructions for creating a bucket in the Amazon S3 User Guide. In our example, we choose the US East (N. Virginia) Region and name the bucket “new-lustre-file-system.” After we create the data repository association in the next section, we will return to update the bucket policy.
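As a scripted alternative, the bucket can be created with boto3. This is a minimal sketch that assumes credentials for ACCOUNT-B are active; the function names are hypothetical. The builder omits the location constraint for US East (N. Virginia), because that Region is the default and must not be passed as a constraint.

```python
def create_bucket_request(bucket, region="us-east-1"):
    """Build keyword arguments for the S3 create_bucket call."""
    request = {"Bucket": bucket}
    if region != "us-east-1":
        # Every Region other than us-east-1 needs an explicit LocationConstraint.
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


def create_source_bucket(bucket, region="us-east-1"):
    """Create the bucket in ACCOUNT-B."""
    import boto3  # imported lazily so the builder above has no dependencies

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_request(bucket, region))
```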

Step 3: Create data repository association

Now we will create a data repository association (DRA) to link the Amazon FSx for Lustre file system to our Amazon S3 bucket.
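The association that the console steps in this section create can also be sketched with boto3. The sketch below assumes the file system path /ns1 used later in this post, and it enables automatic import and export for new, changed, and deleted objects; the function names and the optional prefix argument are hypothetical.

```python
def dra_request(file_system_id, bucket, file_system_path="/ns1", prefix=""):
    """Build keyword arguments for create_data_repository_association."""
    events = ["NEW", "CHANGED", "DELETED"]
    # Link the whole bucket by default, or only a prefix if one is given.
    repository_path = f"s3://{bucket}/{prefix}".rstrip("/")
    return {
        "FileSystemId": file_system_id,
        "FileSystemPath": file_system_path,
        "DataRepositoryPath": repository_path,
        "BatchImportMetaDataOnCreate": True,  # load existing object metadata once
        "S3": {
            "AutoImportPolicy": {"Events": events},  # S3 changes -> file system
            "AutoExportPolicy": {"Events": events},  # file system changes -> S3
        },
    }


def create_dra(file_system_id, bucket):
    """Create the association from ACCOUNT-A and return its ID."""
    import boto3  # imported lazily so the builder above has no dependencies

    fsx = boto3.client("fsx", region_name="us-east-1")
    response = fsx.create_data_repository_association(
        **dra_request(file_system_id, bucket)
    )
    return response["Association"]["AssociationId"]
```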

1. In ACCOUNT-A, navigate to the Amazon FSx console and select the file system we created. Select the Data repository tab and then choose Create data repository association.

Data Repository Association

2. Enter the File system path and the path to the Amazon S3 bucket. Note that for our example we used the entire bucket, but we could instead restrict the DRA to a specific prefix.

Data repository association information

3. Click Create. It will take a few minutes to initialize before the status shows Available.

File system "available"

4. When the DRA was created, Amazon FSx also created a service-linked role for Amazon S3 access. Navigate to the AWS Identity and Access Management (IAM) console and search for the service-linked role created for our new file system.

Identity, Access, Management

5. Find the Amazon Resource Name (ARN) for the Amazon FSx for Lustre service-linked role and save this somewhere. We’ll need it for the bucket policy in the next section.

Amazon Resource Name
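The role lookup can also be done programmatically. The following minimal sketch assumes the role name follows the AWSServiceRoleForFSxS3Access_fs-XXXXXXX pattern shown in the bucket policy in the next section; the function names are hypothetical.

```python
ROLE_PREFIX = "AWSServiceRoleForFSxS3Access_"


def find_fsx_s3_role_arn(roles, file_system_id):
    """Return the ARN of the role tied to one file system, or None if absent."""
    wanted = ROLE_PREFIX + file_system_id
    for role in roles:
        if role["RoleName"] == wanted:
            return role["Arn"]
    return None


def list_service_linked_roles():
    """List all service-linked roles in ACCOUNT-A."""
    import boto3  # imported lazily so the lookup above has no dependencies

    iam = boto3.client("iam")
    roles = []
    paginator = iam.get_paginator("list_roles")
    for page in paginator.paginate(PathPrefix="/aws-service-role/"):
        roles.extend(page["Roles"])
    return roles
```

Calling find_fsx_s3_role_arn(list_service_linked_roles(), file_system_id) with your own file system ID should return the ARN to save for the bucket policy.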

Step 4: Configure the S3 bucket policy

With the ARN from the previous section we’ll apply a bucket policy to our Amazon S3 bucket.

1. In ACCOUNT-B, navigate to the Amazon S3 console and choose the bucket you created. Click the Permissions tab and, in the Bucket policy section, choose Edit. Replace the current policy with the policy below.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Example permissions",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::ACCOUNT-A:role/aws-service-role/…/AWSServiceRoleForFSxS3Access_fs-XXXXXXX"
            },
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:Get*",
                "s3:List*",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::new-lustre-file-system",
                "arn:aws:s3:::new-lustre-file-system/*"
            ]
        }
    ]
}

2. Replace the value for the AWS principal with the ARN of the service-linked role you saved in Step 3.

3. Replace “new-lustre-file-system” with your own bucket name. Select Save changes.
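To avoid editing the JSON by hand, the same policy can be generated and applied with boto3. This is a minimal sketch that assumes credentials for ACCOUNT-B are active; the function names are hypothetical, and the role ARN is the one you saved in Step 3.

```python
import json


def cross_account_bucket_policy(role_arn, bucket):
    """Render the bucket policy from this section for a given role and bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Example permissions",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:DeleteObject",
                    "s3:PutObject",
                    "s3:Get*",
                    "s3:List*",
                    "s3:PutBucketNotification",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # bucket-level actions
                    f"arn:aws:s3:::{bucket}/*",  # object-level actions
                ],
            }
        ],
    }


def apply_bucket_policy(role_arn, bucket):
    """Replace the bucket policy in ACCOUNT-B with the rendered policy."""
    import boto3  # imported lazily so the builder above has no dependencies

    s3 = boto3.client("s3")
    policy = cross_account_bucket_policy(role_arn, bucket)
    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```

Note that put_bucket_policy overwrites any existing policy, which matches the "replace the current policy" instruction in step 1.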

If you want Amazon FSx to encrypt data when writing to your S3 bucket, you need to set the default encryption on your S3 bucket to either SSE-S3 or SSE-KMS. For more information, refer to the documentation on working with server-side encrypted Amazon S3 buckets.
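Default encryption can be set from the console or, as sketched below, with boto3. The sketch assumes credentials for ACCOUNT-B are active; the function names and the optional KMS key argument are hypothetical.

```python
def default_encryption_config(kms_key_id=None):
    """Build a default-encryption configuration: SSE-S3 unless a KMS key is given."""
    if kms_key_id is None:
        rule = {"SSEAlgorithm": "AES256"}  # SSE-S3
    else:
        rule = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}  # SSE-KMS
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": rule}]}


def set_default_encryption(bucket, kms_key_id=None):
    """Apply the configuration to the bucket in ACCOUNT-B."""
    import boto3  # imported lazily so the builder above has no dependencies

    s3 = boto3.client("s3")
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=default_encryption_config(kms_key_id),
    )
```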

Testing the solution

Now we have an Amazon FSx for Lustre file system that is syncing with an Amazon S3 bucket in a different AWS account.

Step 1. Create the Amazon EC2 instance

To test the syncing, we need an Amazon EC2 instance so we can mount the file system.

In ACCOUNT-A, navigate to the Amazon EC2 console. Launch an instance using an Amazon Linux 2 AMI in the same VPC as your Amazon FSx file system. For instructions on how to launch an instance, refer to the documentation on launching your instance.

Step 2. Mount the file system

Connect to your Linux instance using one of several methods described in the documentation.

From the terminal window, mount the Amazon FSx file system. You can find instructions for how to mount your file system from the Amazon FSx console. Select the file system and then choose Attach.
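The mount command shown under Attach is built from the file system's DNS name and mount name. The following minimal sketch assembles a command of that shape; the mount options are an assumption, so prefer the exact command the console displays for your file system. It also assumes the Lustre client is already installed on the instance and that the mount point /fsx exists. The function name and the sample values in the usage note are hypothetical.

```python
import shlex


def lustre_mount_command(dns_name, mount_name, mount_point="/fsx"):
    """Return the argv list for mounting an FSx for Lustre file system."""
    return [
        "sudo", "mount",
        "-t", "lustre",
        "-o", "relatime,flock",  # assumption: use the options shown under Attach
        f"{dns_name}@tcp:/{mount_name}",
        mount_point,
    ]


def as_shell(command):
    """Render the argv list as a single line to paste into the terminal."""
    return shlex.join(command)
```

For example, as_shell(lustre_mount_command("fs-example.fsx.us-east-1.amazonaws.com", "abcdefgh")) renders one line that can be compared against the command in the console.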

Step 3. Create test files

After successfully mounting the file system, create a test file in the mounted directory /fsx/ns1/. We’ll call the file “file1.txt.”

test file in mounted directory

Switch to ACCOUNT-B, and check the Amazon S3 bucket you created. You should find file1.txt.

S3 bucket

Now upload another file directly to your Amazon S3 bucket. Let’s call it “file2.txt.”

Go back to the EC2 terminal and type ls -l. You should see file2.txt in /fsx/ns1/.

EC2 Terminal

You can repeat the testing process with delete and update.
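When repeating these checks, it helps to know which S3 key a given file maps to. The following minimal sketch computes that mapping, assuming the mount point /fsx and file system path /ns1 used in this post, with the DRA linked to the whole bucket unless a prefix is given. The function names are hypothetical.

```python
from pathlib import PurePosixPath


def s3_key_for(local_path, mount_point="/fsx", file_system_path="/ns1", prefix=""):
    """Map a file under the DRA directory to its expected S3 object key."""
    dra_root = PurePosixPath(mount_point) / file_system_path.lstrip("/")
    # relative_to raises ValueError if the file is outside the DRA directory.
    relative = PurePosixPath(local_path).relative_to(dra_root)
    return prefix + relative.as_posix()


def object_exists(bucket, key):
    """Check from ACCOUNT-B whether the exported object has arrived in S3."""
    import boto3  # imported lazily so the mapping above has no dependencies
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as error:
        if error.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        raise
```

Export and import are asynchronous, so a check made immediately after a change may need to be retried after a short wait.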

Cleaning up

Now that we tested the solution, execute the following four steps to delete the provisioned resources to avoid incurring unnecessary charges.

Terminate the Amazon EC2 instance you used to mount and test the file system.
Delete the Amazon FSx for Lustre file system you created in ACCOUNT-A.
Delete the sample data and the Amazon S3 bucket you created in ACCOUNT-B.
Delete the IAM service-linked role that was created to provide Amazon S3 access to the Amazon FSx for Lustre file system.
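The four cleanup steps can be sketched as an ordered plan of boto3 calls. This is a minimal sketch under several assumptions: the ec2, fsx, and iam clients use ACCOUNT-A credentials and the s3 client uses ACCOUNT-B credentials, the bucket has already been emptied (delete_bucket fails otherwise), and the role name follows the AWSServiceRoleForFSxS3Access_fs-XXXXXXX pattern. The function names are hypothetical.

```python
def cleanup_plan(instance_id, file_system_id, bucket):
    """Return the cleanup steps as (service, operation, kwargs) in order."""
    role_name = f"AWSServiceRoleForFSxS3Access_{file_system_id}"
    return [
        ("ec2", "terminate_instances", {"InstanceIds": [instance_id]}),
        ("fsx", "delete_file_system", {"FileSystemId": file_system_id}),
        ("s3", "delete_bucket", {"Bucket": bucket}),  # bucket must be empty
        ("iam", "delete_service_linked_role", {"RoleName": role_name}),
    ]


def run_plan(plan, clients):
    """Execute each step with the matching client from a {service: client} dict."""
    for service, operation, kwargs in plan:
        getattr(clients[service], operation)(**kwargs)
```

File system deletion is asynchronous, so the role deletion may not succeed until the file system is fully gone; retrying that last step after a few minutes is a reasonable approach.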

Conclusion

Amazon FSx for Lustre’s native integration with S3 provides a proven, easy-to-deploy solution that combines the high performance of a scale-out Lustre file system with the benefits of a data lake built on Amazon S3. In this post, we demonstrated how to keep an Amazon FSx file system in sync with changes made to source data in an Amazon S3 bucket in a different AWS account. This solution helps enterprises scale their AWS environment by allowing data to be shared from a centralized enterprise data lake to specialized team accounts that consume the data for ML and HPC use cases.

Do you have other challenges serving data from an enterprise data lake to ML and HPC teams? Let us know in the comments if this approach improves your delivery times!

AWS Storage Blog