AWS Storage Blog

Unlock higher performance for file system workloads with scalable metadata performance on Amazon FSx for Lustre

Imagine a company like a movie studio, one that works with enormous volumes of video files, scripts, and animation assets. They store these files on a high-performance file system such as Amazon FSx for Lustre, a fully managed shared storage built on the world’s most popular high-performance file system. Each file has metadata, such as POSIX information. As the studio’s projects grow, so does the number of files and directories. When they need to search for files, access them, or even list what’s in a directory, the system must quickly retrieve and manage this metadata. However, in a traditional file system, as the number of files increases, especially into the millions or billions, metadata operations can slow down significantly. This slowdown can create bottlenecks, causing delays in retrieving files and hindering the team’s productivity, which is crucial when working under tight deadlines.

FSx for Lustre is designed for applications that need fast storage that can keep CPUs and GPUs running at maximum capacity, reaching TB/s throughput and millions of IOPS. Supporting concurrent access to the same file or directory from thousands of compute instances, while delivering consistent, sub-millisecond latencies for file operations, is an enabler for many workloads. FSx for Lustre uses object storage servers, as shown in Figure 1, to distribute files across multiple nodes, so each read or write operation is parallelized across the cluster to balance storage capacity and throughput. FSx for Lustre uses a dedicated metadata server to support metadata operations.

Figure 1: FSx for Lustre architecture before scalable metadata feature

File system metadata performance determines the number of files and directories that a file system can create, list, read, and delete per second. Metadata-intensive workloads often involve the creation, processing, and manipulation of vast numbers of small files. FSx for Lustre automatically provides a default metadata performance level based on the amount of storage provisioned. Before the release of scalable metadata, users who wanted more metadata operations than the default had to create a larger file system, or even split data over multiple file systems. The release of scalable metadata allows users to increase available metadata IOPS up to 15x, independent of the storage provisioned. This means that even the most metadata-intensive workloads are accommodated. Furthermore, this feature can be enabled when the file system is created, or later, to increase the number of IOPS dedicated to metadata operations for a specific file system, as shown in Figure 2.

Figure 2: FSx for Lustre architecture after scalable metadata feature

Workloads and use cases

This new feature streamlines data management and improves efficiency for metadata-intensive use cases, such as machine learning (ML), Electronic Design Automation (EDA), and financial analytics risk simulations. These workloads, such as foundational design in EDA or researchers renaming datasets during project creation, place significant stress on file systems due to their metadata-intensive nature. We can test the limits of these operations using MDTest.

Metadata performance

Deploying FSx for Lustre file systems for certain workloads involves understanding how metadata IOPS play a critical role in optimizing performance. In Automatic mode, FSx for Lustre streamlines the process by automatically allocating metadata IOPS based on the storage capacity of your file system, as shown in Figure 3a. This approach correlates the number of IOPS with the size of your storage, delivering the right level of performance for the majority of workloads without the need for manual adjustments. For example, a file system with 1,200 GiB of storage automatically receives 1,500 metadata IOPS, while larger file systems scale up to 12,000 metadata IOPS for capacities exceeding 12,000 GiB.

On the other hand, User-provisioned mode allows for more granular control. You can manually specify the exact number of metadata IOPS needed, independent of storage capacity. This mode is beneficial for workloads that demand specific IOPS requirements beyond what automatic provisioning offers.

Figure 3a: File system metadata performance

As shown in Figure 3b, FSx for Lustre categorizes metadata operations, such as file creates, deletes, and directory manipulations, each with its own rate of operations per second. This differs from file system IOPS, which can scale into the millions. For example, file creates or opens occur at a rate of two operations per second per provisioned IOPS, while file deletes operate at one operation per second. Aligning the number of provisioned IOPS with the specific types of operations that your workload demands makes sure that FSx for Lustre handles metadata efficiently. This optimization not only enhances overall file system performance, but also supports scalability as your storage and operational needs grow.

Figure 3b: File system metadata performance

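As a back-of-the-envelope sketch (the 12,000 IOPS figure here is only an example value, and the per-IOPS rates are the ones described above), you can estimate the operation rates a given metadata IOPS level supports:

# Sketch: rough metadata rates for an example provisioning of 12,000 metadata
# IOPS, using the rates described above (2 creates/opens per provisioned IOPS,
# 1 delete per provisioned IOPS).
PROVISIONED_METADATA_IOPS=12000
echo "File creates/opens: $(( PROVISIONED_METADATA_IOPS * 2 )) ops/s"
echo "File deletes:       $(( PROVISIONED_METADATA_IOPS * 1 )) ops/s"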

Prerequisites

For this walkthrough, you should have the following prerequisites:

How to configure metadata performance

When creating a new FSx for Lustre file system, there is a new option to set metadata performance to Automatic, as shown in Figure 4a, where the metadata IOPS are defined automatically based on the capacity of the file system (12,000 metadata IOPS per 24 TiB of storage). Alternatively, it can be set to User-provisioned, as shown in Figure 4b, where users can specify the number of metadata IOPS: 1,500, 3,000, 6,000, or increments of 12,000 up to 192,000 metadata IOPS. Users can create FSx for Lustre file systems with the ability to increase metadata performance using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the AWS Software Development Kits (SDKs).

Figure 4a: Automatic metadata configuration

Figure 4b: User-provisioned metadata configuration

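Figures 4a and 4b show the console experience. The following is a minimal AWS CLI sketch (the subnet, security group, and file system IDs are placeholders, and it assumes a Persistent 2 deployment type with the metadata configuration parameters available at the time of writing). A similar MetadataConfiguration block can be passed to update-file-system later to raise the metadata IOPS on an existing file system:

# Sketch: create a Persistent 2 file system with user-provisioned metadata IOPS.
# The subnet and security group IDs are placeholders.
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 12000 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --lustre-configuration 'DeploymentType=PERSISTENT_2,PerUnitStorageThroughput=250,MetadataConfiguration={Mode=USER_PROVISIONED,Iops=192000}'

# Sketch: raise the metadata IOPS on an existing file system (placeholder ID).
aws fsx update-file-system \
  --file-system-id fs-0123456789abcdef0 \
  --lustre-configuration 'MetadataConfiguration={Mode=USER_PROVISIONED,Iops=192000}'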

Running the tests

In this section, our goal is to compare metadata operation performance between standard and scaled metadata configurations across various workload scenarios. We also provide instructions to replicate these tests in your AWS account.

Figure 6 shows an example of a file system created in the Console. This particular file system is configured with 12 TiB of storage, delivering 1,500 MB/s of throughput (at 125 MB/s/TiB), and 192,000 metadata IOPS. This configuration is tailored for applications with low throughput requirements but intense metadata operations.

Figure 6: File System creation with 192,000 metadata IOPS

For our test scenario, we created two sets of larger file systems: the first set is 12 TiB in size with no scaled metadata, at the P250 and P1000 throughput configurations, and the second set is 12 TiB with metadata scaled to 192,000 IOPS, also at the P250 and P1000 throughput configurations.

Using AWS ParallelCluster, we stood up a cluster to run this HPC benchmark. It consists of 200 c5.9xlarge Amazon Elastic Compute Cloud (Amazon EC2) client instances with the following configuration (a minimal cluster-launch sketch follows the list):

Instance type: c5.9xlarge
Operating system: Amazon Linux 2
Linux kernel: 5.10.219-208.866.amzn2.x86_64
MPI version: Open MPI 4.1.6
Lustre version: 2.12.8_198_gde6dd89_dirty
ENA driver: 2.12.0g
Ranks for job: 800
ParallelCluster version: 3.10.1
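
As a minimal sketch, a cluster like this can be launched with the ParallelCluster CLI once you have a cluster configuration file describing the head node, the 200-instance compute queue, and the FSx for Lustre mount. The cluster name and configuration file name below are placeholders:

# Sketch: create the benchmark cluster from an existing ParallelCluster
# configuration file. The file is assumed to define a Slurm scheduler, a
# c5.9xlarge compute resource with 200 instances, and the FSx for Lustre
# file system mounted at /fsx.
pcluster create-cluster \
  --cluster-name mdtest-benchmark \
  --cluster-configuration cluster-config.yaml

# Check creation status until the cluster reports CREATE_COMPLETE.
pcluster describe-cluster --cluster-name mdtest-benchmark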

Lustre client-side parameters

# Allow up to 32 concurrent RPCs per object storage target (OST)
sudo lctl set_param osc.*OST*.max_rpcs_in_flight=32
# Allow up to 64 concurrent metadata RPCs to the metadata target (MDT)
sudo lctl set_param mdc.*.max_rpcs_in_flight=64
# Allow up to 50 concurrent modifying metadata RPCs (creates, deletes, renames)
sudo lctl set_param mdc.*.max_mod_rpcs_in_flight=50
# Limit dirty (unwritten) client cache to 64 MB per OST
sudo lctl set_param osc.*.max_dirty_mb=64

For more information on Lustre performance settings, see the FSx for Lustre performance guide and the online Lustre documentation.

As a testing tool, in this scenario we use the open source MDTest to simulate metadata I/O.

Installing and running MDTest

The following commands were run with the OpenMPI libraries and compilers in the PATH and LD_LIBRARY_PATH for the user:

$ git clone https://github.com/hpc/ior.git
$ cd ior
$ ./bootstrap
$ ./configure --prefix=/opt/parallelcluster/shared
$ make all
$ make install
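
These steps assume the OpenMPI compiler wrappers and libraries are already on your PATH and LD_LIBRARY_PATH, as noted previously. If they are not, one way to expose them on the head node is sketched below; the module name and install path are assumptions, so adjust them to match your environment:

# Sketch: put OpenMPI on the path before building and running MDTest.
# ParallelCluster images commonly expose MPI through environment modules;
# the explicit exports are a fallback for an OpenMPI install under
# /opt/amazon/openmpi (adjust if yours lives elsewhere).
$ module load openmpi
$ export PATH=/opt/amazon/openmpi/bin:$PATH
$ export LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH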

At this point, we chose to run an interactive job to launch MDTest, in case we want to troubleshoot or actively monitor the tests.

$ srun --nodes=200 --ntasks-per-node=32 --time=12:00:00 --pty bash -i

When the EC2 instances are available, we create a subdirectory and set its ownership so that our UID has full access for the tests, for example:


$ sudo mkdir -p /fsx/mdTestFiles
$ sudo chown 1000:1000 /fsx/mdTestFiles

Then, we execute the job, which iterates three times and gives us the mean results for the workload:

$ mpirun --mca plm_rsh_num_concurrent 800 --mca routed_radix 800 --mca routed direct --map-by node --mca btl_tcp_if_include eth0 -np 800 /opt/parallelcluster/shared/bin/mdtest -F -v -n 4000 -i 3 -u -d /fsx/mdTestFiles

MDTest options explanation:

-F: Perform tests on files only
-v: Verbose output
-n 4000: Every rank will create/stat/read/remove 4,000 files
-i 3: Number of iterations the test will run
-u: Unique working directory for each rank
-d: Directory used as the root of the job

Figure 7 shows example output from the first job run on a 12 TiB, single-MDT file system:

Figure 7: Example output of MDTest run

Here we break down what we observe in the preceding figure from this initial test of a single-MDT, 12 TiB FSx for Lustre file system. File creation represents a file-per-process event, which can be interpreted as a small-write workload. From a stat perspective, this can be a costly operation for Lustre and represents how a tree walk or a large "ls -l" may behave. File read represents a read workload, and file removal is the total purging of files from a file system. Each event pressures the metadata servers differently.

Figure 8 shows a comprehensive chart of multiple runs on different file system types with the same number of clients and the same capacity levels.

Figure 8: Table of MDTest output

The preceding table shows the performance differences between file systems with 12,000 and 192,000 metadata IOPS, respectively. The variance in performance is associated with the operation type, and we observe improvements across all tests with higher metadata IOPS. Depending on the workload requirements, there are clear methods to drive higher metadata IOPS for FSx for Lustre.

Monitoring file system metadata metrics

When your file system is up and running, you can monitor the metadata performance using CloudWatch metrics, and scale up the performance as needed to accommodate growing workload requirements. Monitoring FSx for Lustre is crucial for maintaining system stability and promptly addressing any issues that may arise. CloudWatch serves as a central tool for this task, continuously collecting and processing real-time metrics from FSx for Lustre. Configuring CloudWatch alarms allows administrators to receive immediate notifications through Amazon Simple Notification Service (Amazon SNS) whenever anomalies or performance thresholds are breached, thereby enabling intervention to mitigate potential issues.
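
As a minimal sketch of such an alarm (the file system ID, SNS topic ARN, and threshold value are placeholders chosen for illustration), you could notify a topic when file create operations remain high:

# Sketch: notify an SNS topic when the number of file create operations in a
# one-minute period stays above a chosen threshold for five consecutive minutes.
# The file system ID, topic ARN, and threshold are placeholders; set the
# threshold relative to the create rate your provisioned metadata IOPS support.
aws cloudwatch put-metric-alarm \
  --alarm-name fsx-lustre-high-file-creates \
  --namespace AWS/FSx \
  --metric-name FileCreateOperations \
  --dimensions Name=FileSystemId,Value=fs-0123456789abcdef0 \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 1000000 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:fsx-metadata-alerts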

FSx for Lustre automatically sends metric data to CloudWatch at one-minute intervals by default, and these metrics are reported in raw bytes. For a deeper dive into CloudWatch and its capabilities, consult the CloudWatch user guide.

FSx for Lustre publishes its metrics into the AWS/FSx namespace in CloudWatch. With the new scalable metadata feature, new metrics, as shown in Figures 9a, 9b, and 9c, are available to check details about metadata operations. The new metrics are DiskReadOperations, DiskWriteOperations, FileCreateOperations, FileOpenOperations, FileDeleteOperations, StatOperations, and RenameOperations. These metrics provide more granular data, and you can refine them using the FileSystemId dimension (for a specific file system) and the StorageTargetId dimension (for a specific metadata target, or MDT).

Figure 9a: New file system metadata metric on CloudWatch dashboard

Figure 9b: New file system metadata metric on CloudWatch dashboard

Figure 9c: New file system metadata metric on CloudWatch dashboard
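
You can also retrieve these metrics programmatically. The following is a minimal sketch (the file system ID is a placeholder) that pulls the last hour of FileCreateOperations for a specific file system at one-minute resolution:

# Sketch: sum of file create operations per minute over the last hour for one
# file system. Add Name=StorageTargetId,Value=<MDT ID> to the dimensions to
# narrow the query to a specific metadata target.
aws cloudwatch get-metric-statistics \
  --namespace AWS/FSx \
  --metric-name FileCreateOperations \
  --dimensions Name=FileSystemId,Value=fs-0123456789abcdef0 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Sum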

Choose the following link to launch an AWS CloudFormation template specific to FSx for Lustre scalable metadata:

Launch stack

Costs

There are costs associated with testing this solution. The solution runs EC2 instances, FSx for Lustre file systems, and CloudWatch dashboards.

The pricing details are available on the FSx for Lustre, EC2 instance pricing, and CloudWatch pages.

Cleaning up

Back up all necessary performance runs for future review.

Delete the FSx for Lustre file system and associated compute resources to end the billing for that project. Delete the dashboard through the CloudFormation console to remove it from your account.

Manage your ParallelCluster as needed, and delete it if it is no longer necessary to save on costs.

Conclusion

The impact of this new capability is far-reaching, particularly for organizations running demanding AI/ML and HPC workloads. These workloads often involve the creation, processing, and manipulation of vast numbers of small files, placing significant strain on metadata performance. The ability to scale metadata performance on Amazon FSx for Lustre allows users to consolidate workloads on a single file system: they can run workloads at greater scale without needing to split them across file systems, adapt to changing performance requirements by easily increasing metadata performance, and optimize resource usage by decoupling metadata performance from storage capacity.

Tom McDonald

Tom McDonald is a Senior Workload Storage Specialist at AWS. Starting with an Atari 400 and re-programming tapes, Tom began a long interest in increasing performance on any storage service. With 20 years of experience in the Upstream Energy domain, file systems and High-Performance Computing, Tom is passionate about enabling others through community and guidance.