AWS Storage Blog

Optimizing Amazon FSx for Lustre storage consumption using automatic data tiering with Amazon S3

Managing high-performance file storage can be a significant operational and cost challenge for many organizations, especially those running compute-intensive workloads such as high-performance computing (HPC) or data analytics. This is particularly true for organizations with existing data lakes on Amazon S3 who need POSIX-compliant, high-performance file system access. Amazon FSx for Lustre provides a scalable, high-performance file system purpose-built for these workloads that need fast access to large datasets.

The Data Repository Association (DRA) feature of FSx for Lustre allows you to automatically sync data between your FSx for Lustre file system and an Amazon S3 data repository. This enables you to access S3 datasets through a POSIX-compliant file system interface. You can optimize file system storage capacity by running a data repository release task, which evicts file contents from the FSx for Lustre file system once they’ve been synced to Amazon S3 and haven’t been accessed within a specified time period.

In this post, we demonstrate how you can tier data between FSx for Lustre and Amazon S3 by implementing an automated file release mechanism. This approach allows users to reclaim storage capacity on their file systems consumed by colder datasets so that it can be made available for hotter datasets. We also cover how to monitor the file system capacity and trigger notifications when available storage falls below a critical threshold.

Solution overview

The proposed solution uses Amazon FSx for Lustre as a high-performance tier in front of Amazon S3, which enables efficient storage consumption through automated file release, as shown in Figure 1. The key components of this architecture are:

  1. FSx for Lustre file system: The FSx for Lustre file system serves as the primary high-performance storage for frequently accessed data. It provides low-latency, high-throughput access to files.
  2. DRA: The DRA feature of FSx for Lustre allows for automatic synchronization of data between the FSx for Lustre file system and an Amazon S3 data repository.
  3. DRA release task: The DRA release task is a crucial component that allows the eviction of file contents from the FSx for Lustre file system once they have been synced to Amazon S3 and haven’t been accessed within a user-specified duration known as Duration Since Last Access (DSLA). This allows you to reclaim available capacity on the FSx for Lustre file system.
  4. EventBridge scheduler: The Amazon EventBridge scheduler periodically (for example, once a day) triggers the DRA release task to make sure that the file system capacity is optimized by evicting cold data.
  5. Capacity monitoring and alerting: The solution includes monitoring of the FSx for Lustre file system capacity using the Amazon CloudWatch available storage metric. When available storage falls below a configurable threshold (for example, 25% of total capacity), a CloudWatch alarm is triggered and published to an Amazon Simple Notification Service (Amazon SNS) topic that can be used to notify administrators.
  6. DRA emergency release task: In addition to the scheduled DRA release tasks, the solution includes an “emergency” DRA release task (with a DSLA set to 0) that is triggered by an AWS Lambda function when the available storage falls below a critical threshold (for example, 15% of total capacity). The scheduled daily release task uses a longer DSLA (for example, 2 days) so that only truly cold data is released. The emergency release, with its 0-day DSLA, is a more aggressive attempt to reclaim space: it releases any files that are not currently in use and have been successfully exported to S3. However, its effectiveness may be limited in certain edge cases discussed further below.
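The difference between the scheduled and the emergency release tasks comes down to the DSLA value passed to the release request. The following is a minimal Python sketch of that idea, not the code shipped in the repository. It assumes the boto3 FSx client's `create_data_repository_task` call; the file system ID and the `dra` path are hypothetical placeholders.

```python
from typing import Any, Dict, List


def build_release_task_request(
    file_system_id: str, paths: List[str], dsla_days: int
) -> Dict[str, Any]:
    """Build keyword arguments for the FSx CreateDataRepositoryTask API."""
    if dsla_days < 0:
        raise ValueError("dsla_days must be zero or positive")
    return {
        "FileSystemId": file_system_id,
        "Type": "RELEASE_DATA_FROM_FILESYSTEM",
        # Paths on the file system whose exported files may be released.
        "Paths": paths,
        # A completion report is optional; it is disabled in this sketch.
        "Report": {"Enabled": False},
        "ReleaseConfiguration": {
            "DurationSinceLastAccess": {"Unit": "DAYS", "Value": dsla_days}
        },
    }


def submit_release_task(fsx_client: Any, request: Dict[str, Any]) -> str:
    """Submit the request with a boto3 FSx client and return the task ID."""
    response = fsx_client.create_data_repository_task(**request)
    return response["DataRepositoryTask"]["TaskId"]


# Hypothetical identifiers, used for illustration only.
FILE_SYSTEM_ID = "fs-0123456789abcdef0"
DRA_PATHS = ["dra"]

# Scheduled task: release only files untouched for at least 2 days.
scheduled_request = build_release_task_request(FILE_SYSTEM_ID, DRA_PATHS, dsla_days=2)

# Emergency task: a DSLA of 0 makes every exported, idle file eligible.
emergency_request = build_release_task_request(FILE_SYSTEM_ID, DRA_PATHS, dsla_days=0)
```

In a Lambda function, `submit_release_task(boto3.client("fsx"), emergency_request)` would submit the emergency task. Keeping the request builder separate from the client call makes the DSLA logic easy to test without AWS access.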


Figure 1: Amazon FSx for Lustre as a high-performance tier in front of Amazon S3

In this architecture, data is primarily stored and accessed from the high-performance FSx for Lustre file system, with the DRA feature automatically synchronizing data to Amazon S3. The EventBridge scheduler periodically triggers DRA release tasks to evict cold data from FSx for Lustre, and capacity monitoring alerts administrators when available storage falls below configured thresholds. In addition, the monitoring solution can trigger emergency DRA release tasks to make sure that ongoing workloads can continue writing to the file system.
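The two-threshold behavior described above can be sketched as a small decision function. This is an illustrative sketch rather than the deployed alarm logic: it assumes the free-capacity value arrives in bytes (as the AWS/FSx `FreeDataStorageCapacity` metric is understood to report it), and it uses the example thresholds of 25% and 15% from the overview. The function and parameter names are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def classify_capacity(
    free_bytes: float,
    total_capacity_gib: int,
    notify_pct: float = 0.25,
    emergency_pct: float = 0.15,
) -> str:
    """Return the action implied by the current free-storage fraction."""
    if total_capacity_gib <= 0:
        raise ValueError("total_capacity_gib must be positive")
    if not 0 <= emergency_pct <= notify_pct <= 1:
        raise ValueError("expected 0 <= emergency_pct <= notify_pct <= 1")

    free_fraction = free_bytes / (total_capacity_gib * GIB)
    if free_fraction < emergency_pct:
        # Critical: trigger the emergency release task (DSLA of 0).
        return "emergency_release"
    if free_fraction < notify_pct:
        # Low: notify administrators through the SNS topic.
        return "notify"
    return "ok"


# Example with a hypothetical 1,200 GiB file system.
print(classify_capacity(400 * GIB, 1200))  # about 33% free
print(classify_capacity(240 * GIB, 1200))  # 20% free
print(classify_capacity(120 * GIB, 1200))  # 10% free
```

For the 1,200 GiB example, the notification threshold corresponds to 300 GiB of free space and the emergency threshold to 180 GiB.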

This solution is well-suited for workloads that primarily deal with large files. It effectively manages the export and release of large files, and it works well for write-intensive workloads with predictable access patterns and for archiving data from completed projects. Workloads with numerous small files or frequent large-scale directory operations may experience longer export times, which can delay the release of space on the FSx for Lustre file system.

Prerequisites

Before starting, you need the following prerequisites:

  1. Terraform: Have the Terraform CLI installed on your local machine or remote development environment such as AWS Cloud9.
  2. AWS credentials: Terraform needs access to your AWS credentials to create and manage resources. Configure your AWS credentials as environment variables or use the AWS credentials file (~/.aws/credentials).

Walkthrough

We are using Terraform to automate the provisioning and configuration of the required AWS resources. The key components deployed by the Terraform stack are already explained in detail in the Solution overview section. To deploy the solution using the provided Terraform configuration, follow these steps:

  1. Clone the GitHub repository to your local machine.
  2. Navigate to the Terraform directory within the repository.
  3. Review the variables.tf file and adjust the variable values according to your requirements. The following are a few important variables:
    • sns_topic_email: The email address that receives SNS notifications. Set this to your own email address.
    • alarm_storage_pct_threshold_for_sns_notifications: The file system’s available storage threshold below which email notifications are sent (default: 0.25).
    • alarm_storage_pct_threshold_for_dra_emergency_release: The file system’s available storage threshold below which an emergency release is triggered (default: 0.15).
    • duration_since_last_access_value: This sets the DSLA threshold for the daily release task which impacts what files can be released from the file system (default: 2 days).
  4. Initialize the Terraform working directory by running terraform init.
  5. Review the execution plan by running terraform plan.
  6. If the execution plan looks correct, then apply the changes by running terraform apply -auto-approve.
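As an alternative to editing variables.tf directly, the variable values from step 3 can be supplied through a terraform.tfvars.json file, which Terraform loads automatically from the working directory. The following Python sketch builds and sanity-checks such a file's contents. The variable names come from this post; the example values, the email address, the assumption that the DSLA variable is a plain number of days, and the rule that the emergency threshold should not exceed the notification threshold are illustrative assumptions, not repository behavior.

```python
import json
from typing import Any, Dict


def build_tfvars(
    email: str, notify_pct: float, emergency_pct: float, dsla_days: int
) -> Dict[str, Any]:
    """Validate the inputs and return a dict of Terraform variable values."""
    if "@" not in email:
        raise ValueError("email does not look like an email address")
    for name, pct in (("notify_pct", notify_pct), ("emergency_pct", emergency_pct)):
        if not 0 < pct < 1:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    if emergency_pct > notify_pct:
        # Assumed sanity rule: the emergency release should fire at a lower
        # free-storage level than the notification.
        raise ValueError("emergency_pct should not exceed notify_pct")
    if dsla_days < 0:
        raise ValueError("dsla_days must be zero or positive")
    return {
        "sns_topic_email": email,
        "alarm_storage_pct_threshold_for_sns_notifications": notify_pct,
        "alarm_storage_pct_threshold_for_dra_emergency_release": emergency_pct,
        "duration_since_last_access_value": dsla_days,
    }


# Hypothetical values that follow the overview's example thresholds.
tfvars = build_tfvars("admin@example.com", 0.25, 0.15, 2)
tfvars_json = json.dumps(tfvars, indent=2)
print(tfvars_json)
```

Saving the printed JSON as terraform.tfvars.json in the Terraform directory before running terraform plan would apply these values, provided the variable types in variables.tf match the assumptions above.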

Once you apply, Terraform provisions and configures the required resources based on the provided configuration files. When deployment is complete, you should have a new Amazon S3 bucket, an FSx for Lustre file system with automatic data synchronization to the Amazon S3 data repository, scheduled DRA release tasks, capacity monitoring, and notifications set up. To start using the file system, you can launch an Amazon Elastic Compute Cloud (Amazon EC2) instance and mount the created FSx for Lustre file system on it. Data written to the /<fsx_mount>/dra directory maps to the DRA prefix on the S3 bucket.
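After a release task has run, one way to check whether a given file's contents were released is the Lustre client command lfs hsm_state, run on the instance where the file system is mounted. The Python sketch below parses one line of that command's output. The sample lines and the exact output format are assumptions based on typical Lustre output, and the file paths are hypothetical.

```python
import re
from typing import Set

# Assumed line format: "<path>: (<hex mask>) <flags>, archive_id:<n>"
_HSM_LINE = re.compile(r"^(?P<path>.+): \((?P<mask>0x[0-9a-fA-F]+)\)(?P<flags>[^,]*)")


def parse_hsm_flags(line: str) -> Set[str]:
    """Return the set of HSM flags reported for one file."""
    match = _HSM_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized hsm_state line: {line!r}")
    return set(match.group("flags").split())


def is_released(line: str) -> bool:
    """True when the file is exported to S3 and its contents were released."""
    return "released" in parse_hsm_flags(line)


# Illustrative sample lines (assumed format, hypothetical paths).
released_line = "/mnt/fsx/dra/results.dat: (0x0000000d) released exists archived, archive_id:1"
resident_line = "/mnt/fsx/dra/active.dat: (0x00000009) exists archived, archive_id:1"

print(is_released(released_line))
print(is_released(resident_line))
```

A released file keeps its metadata on the file system, and its contents are loaded again from Amazon S3 when the file is next read.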

Cleaning up

When you are done with your testing, run terraform destroy to clean up the deployed resources and avoid incurring charges on your AWS account.

Conclusion

In this post, we demonstrated how to optimize Amazon FSx for Lustre storage consumption for users with existing data lakes on Amazon S3. By using the Data Repository Association (DRA) feature and implementing an automated file release mechanism, organizations can effectively tier their data between FSx for Lustre and Amazon S3. This approach enables businesses to use the FSx for Lustre high-performance capabilities for frequently accessed files while maintaining cost-effective storage for colder data in their existing Amazon S3 data lake. We achieved this by using key AWS services such as FSx for Lustre, Amazon S3, Amazon CloudWatch, Amazon EventBridge, and AWS Lambda, all deployed through Terraform.

Thank you for reading this post. If you have any comments or questions, please leave them in the comments section.

Wassim Benhallam

Wassim Benhallam is a Sr. Cloud Application Architect with Amazon Web Services based in Houston, Texas. He helps customers architect and build scalable and highly available cloud solutions to achieve their business outcomes. His areas of interest include serverless and event-driven architectures, HPC workloads, and AI-based web applications.

Qaisar Dar

Qaisar Dar is a Sr. Engagement Manager at AWS, leading strategic cloud initiatives for enterprise clients. He specializes in high-performance computing (HPC), cost optimization, and infrastructure modernization. With deep expertise in AWS services, he helps organizations achieve scalability and efficiency. Qaisar collaborates with cross-functional teams to deliver tailored cloud solutions. His focus is on driving business transformation through innovative and cost-effective cloud strategies.

Vikrant Telkar

Vikrant Telkar is a Sr. DevOps Engineer with Amazon Web Services based in Dallas, Texas. He helps customers design and implement DevOps technologies. His areas of interest include Elastic Container Services and HPC workloads.