AWS Storage Blog

Optimizing MMAP workloads on Amazon FSx for Lustre file systems

One of the primary benefits of using memory mapping (MMAP) in applications is that it saves memory on the client, allows data to be shared among multiple threads and processes, and reduces file system overhead for the application. Users are constantly looking for ways to improve application performance, and this often means diving deep into their workloads’ storage profiles to understand what the workloads are attempting to do. When applications leverage memory mapping, understanding IO characteristics becomes more difficult because the applications read directly from memory, bypassing system-level tracing. Storage performance profiling becomes challenging, as the majority of IO shifts to client-side, in-memory operations.

Amazon FSx for Lustre meets most users’ workload needs with default configurations. However, with MMAP applications, client-side tuning adjustments can increase performance once the IO profile is understood. With these client-side changes, we demonstrate performance increases of up to 27%.

In this post, we examine how to increase the throughput performance of FSx for Lustre for MMAP workloads. The analysis leverages profiling tools built into the Linux operating system to identify opportunities to tune the Lustre client and increase the workload’s performance. The tuning is performed without any service-side changes. This increase in performance reduces the time to results and provides savings in workload costs by reducing Amazon Elastic Compute Cloud (Amazon EC2) and FSx for Lustre runtime.

Example

MMAP is used in traditional workloads such as database and High Performance Computing applications, and is leveraged by modern frameworks such as Apache Arrow and scientific computing languages such as Julia. In the following scenario, the user’s application maps large packet capture (PCAP) files, ranging from 1GiB to 30GiB, into memory with MMAP, performs in-memory analytics on the data, and then continues to the next PCAP file in the sequence. The user wanted to maximize performance to best meet the time-to-market demands of their industry. In production, as seen in Figure 1, thousands of c6i.8xlarge EC2 instances perform large block, high throughput, 100% read operations against the data store, pushing the performance limits.

Amazon FSx for Lustre architecture

Figure 1: Amazon FSx for Lustre architecture

Understanding FSx for Lustre (FSxL) from the client perspective

To understand how mmap functions with FSx for Lustre file systems, we need to look into the internals of the FSxL client-server relationship and mmap calls. FSxL has a design principle that is distinct from protocols such as NFS; this design enables scale-out and is crucial for understanding how FSxL operates efficiently with mmap.

From a storage client perspective, the “Logical Metadata Volume” module is an abstraction layer for the Metadata Client (MDC) module. The Logical Object Volume (LOV) module is an abstraction layer for the Object Storage Client (OSC) module. There is also a Lustre Network (LNET) module that is vital for successful communications, but it is outside the scope of this post.

MMAP for a file starts with a Remote Procedure Call (RPC) to the Metadata Server (MDS). The MDS returns metadata information about the file request and file object structures to the MDC. Note that these are not Amazon Simple Storage Service (Amazon S3) objects, but FSxL objects on OST servers. Then, the OSC requests the file’s data structure to be returned to the client, based on any FSxL striping implemented.

Let’s take a look at the core components in Figure 2. First, there are two Object Storage Servers (OSSs), the storage servers associated with the scale-out architecture of FSx for Lustre. Attached to each OSS are two Object Storage Targets (OSTs), which are the storage devices where the user data is stored. In this example the user data is stored in four FSxL objects across two OSSs and four OSTs. The objects are read as stripe 0, stripe 1, stripe 2, and stripe 3 in a round-robin fashion across the two OSSs. Both the metadata and object servers run locking services, called “Lustre Distributed Lock Managers” (LDLMs), which ensure coherency for the file.

This locking overhead is typically negligible during large data transfers, as it is a trivial part of the network conversation. The common stripe sizes of 1MB and 4MB were originally designed for larger files, and can be unfavorable for smaller files and small file operations.

General Lustre client communication

Figure 2: General Lustre client communication
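You can inspect how a particular file is striped across OSTs with the lfs utility on the client. The following is a minimal sketch; the mount point and paths are hypothetical, and the setstripe line only changes the default layout for new files created in that directory.

user$ lfs getstripe /mnt/fsx/data/file01                 # shows stripe count, stripe size, and the OSTs holding each object
user$ sudo lfs setstripe -S 4M -c -1 /mnt/fsx/data/new   # hypothetical example: 4MiB stripes across all available OSTs for new files in this directory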

Workload profiling approach

To best understand the workload, the common pattern is to work backward from the storage to the application. As seen in Figure 3, we analyze the usage metrics in Amazon CloudWatch: DataReadBytes, DataWriteBytes, DataReadOperations, DataWriteOperations, and MetadataOperations. This example shows the default workload achieving a throughput of 6.7GB/sec and 6,500 IOPS.

CloudWatch charts FSx for Lustre

Figure 3: CloudWatch charts FSx for Lustre
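The same metrics shown in Figure 3 can also be pulled with the AWS CLI, which is convenient for scripting the analysis. This is a minimal sketch; the file system ID, Region, and time window are placeholders to replace with your own values.

user$ aws cloudwatch get-metric-statistics \
    --namespace AWS/FSx \
    --metric-name DataReadBytes \
    --dimensions Name=FileSystemId,Value=fs-0123456789abcdef0 \
    --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z \
    --period 60 --statistics Sum \
    --region us-east-1

Dividing the Sum of DataReadBytes by the period (60 seconds here) gives the average read throughput in bytes per second for each datapoint.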

From an application perspective, one approach is to use the Linux utility strace on the running process to understand what the Linux operating system is requesting during a single test. See the following example strace command:

strace -fff -p <PID> -o /path/to/output/directory

As represented in Figure 4, the strace output shows that the large files are being mapped into memory with the Linux mmap() function call. The strace output alone is inadequate to understand the application profile, as the IO occurring in the mapped memory bypasses the Linux kernel IO subsystem and reads directly from the memory addresses. To see these results, execute the following command, where <PID> is the running process:

user$ strace -f -p <PID>

Strace example output

Figure 4: strace example output

After determining that mmap was being called by the application, the approach switched to analyzing the OST RPC page information. This is important to understand whether the IOs are small or large (>1MiB). Lustre, as of version 2.12, reports ‘pages per rpc’ in units of the operating system’s page size. The Linux kernel’s BUFSIZ was determined to be 8KiB.
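Two quick checks help put this in context: the Lustre client version (since the ‘pages per rpc’ reporting described above applies to 2.12 and later) and the operating system page size that the histogram buckets are based on. Both can typically be read directly on the client, as in this short sketch:

user$ sudo lctl get_param version   # reports the Lustre client version, for example 2.12.x
user$ getconf PAGE_SIZE             # operating system page size in bytes, typically 4096 on x86_64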

The OSC provides a histogram report, called rpc_stats, for every OST it accesses. On current versions of the Lustre client, this is available under /sys/fs/lustre.
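The histograms can be read conveniently with lctl; the wildcard matches every OSC device on the client. Writing to rpc_stats resets the counters, which is useful for capturing a clean histogram for a single test run (reset behavior assumed from the Lustre Operations Manual).

user$ sudo lctl get_param 'osc.*.rpc_stats'    # per-OST histogram of pages per RPC, RPCs in flight, and offsets
user$ sudo lctl set_param 'osc.*.rpc_stats=0'  # clears the counters before the next test run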

Figure 5 shows the “pages per rpc” histogram, which expresses the size of the RPC calls as a number of operating system pages. This is an important clue as to what’s happening. Based on the following histogram, we can see that 93% of the read RPC requests are 16KiB or less, 79% are even smaller than 8KiB, and that the latency impact of these small reads is adversely affecting read performance.

RPC stats Histogram

Figure 5: RPC stats histogram

To gain insights into the access type, analyzing slabtop’s output is essential, although it requires some understanding of the Linux kernel’s operations. Slabtop, a Linux utility, displays real-time kernel slab cache information, which lets the user better understand what is specifically occurring. See the man page for slabtop(1) for more information.
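For example, a one-shot snapshot of the largest kernel caches can be captured as follows (a minimal sketch; the head filter simply trims the output to the top entries):

user$ sudo slabtop --once --sort=c | head -n 20   # one-time snapshot of slab caches, sorted by cache size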

Two common functions used in the Linux kernel for memory allocation are kmalloc() and vmalloc(). The slabtop output in Figure 6 clearly identifies that the application’s IO path is using kmalloc(). This is the normal method of allocating memory for objects smaller than the Linux kernel’s page size, reinforcing the previous histogram analysis. This information helped form the initial hypothesis that increasing the client-side read-ahead size would reduce the overall number of IO calls to the Lustre file system.

slabtop output example

Figure 6: slabtop output example

Tuning the Lustre file system

After identifying the IO profile, we dive into the Lustre Operations Manual. It became evident that the application performs small IOs against the large file loaded into memory. In addition, the application runs 16 individual processes on a single host reading the same file. This raised the question of whether pre-loading the mapped file would have a positive performance impact by caching data for the various processes reading it.

We explored the following three client-side parameters in depth:

  1. max_read_ahead_mb – controls the amount of data read ahead on all files, but is only triggered by sequential reads.
  2. max_read_ahead_per_file_mb – controls the amount of data that should be prefetched by a client, or process, when sequential reads are detected.
  3. max_read_ahead_whole_mb – controls the maximum size of a file in MiB that is read in its entirety upon access, regardless of the size of the read call.

Knowing that this application is using kmalloc, we suspect that sequential reads are occurring and that max_read_ahead_mb would positively impact performance. The next step is to run a few commands to understand the current default configuration:

user$ echo "Getting the max_read_ahead values"
user$ sudo lctl get_param 'llite.*.max_read_ahead_mb'
user$ sudo lctl get_param 'llite.*.max_read_ahead_per_file_mb'
user$ sudo lctl get_param 'llite.*.max_read_ahead_whole_mb'

The preceding commands returned a default of 64MB for the read-ahead settings. Since the user has a large file with multiple processes reading it, we hypothesized that increasing the read-ahead buffer on the client(s) to 1024MB would provide a performance boost. We tuned the read-ahead buffer of the Lustre file system by running the following commands. Note that these settings are not persistent across reboots, so the recommendation is to apply them from a boot executable or prologue script.

user$ echo "Setting the max_read_ahead values"
user$ sudo lctl set_param 'llite.*.max_read_ahead_mb=1024'
user$ sudo lctl set_param 'llite.*.max_read_ahead_per_file_mb=64'
user$ sudo lctl set_param 'llite.*.max_read_ahead_whole_mb=64'

These client-side changes, as seen in Figure 7, resulted in a 27% performance increase, from 1201 seconds with the default settings to 877 seconds after setting max_read_ahead_mb to 1024MiB. If users are running applications on FSx for Lustre that MMAP large files, then they should consider increasing this client-side setting. There is no cost associated with this performance boost, resulting in a reduction in time-to-results with a scalable solution.


Figure 7: Performance Deltas

To back out these changes, run the following commands to revert to the default of 64MB:

user$ echo "Reverting the max_read_ahead values back to the default of 64MB"
user$ sudo lctl set_param 'llite.*.max_read_ahead_mb=64'
user$ sudo lctl set_param 'llite.*.max_read_ahead_per_file_mb=64'
user$ sudo lctl set_param 'llite.*.max_read_ahead_whole_mb=64'

Rebooting the Linux EC2 instance also reverts the settings to the defaults, as these are not persistent across client reboots.
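Since the settings do not survive a reboot, one approach is to reapply them from a small boot or scheduler prologue script. The following is a minimal sketch; the script path and the way it is invoked (for example from EC2 user data, rc.local, or a job prologue) are assumptions to adapt to your environment.

#!/bin/bash
# Hypothetical boot/prologue script: reapply the client-side read-ahead tuning,
# because lctl set_param changes are not persistent across reboots.
lctl set_param 'llite.*.max_read_ahead_mb=1024'
lctl set_param 'llite.*.max_read_ahead_per_file_mb=64'
lctl set_param 'llite.*.max_read_ahead_whole_mb=64'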

Conclusion

In this post, we demonstrated how to identify a performance bottleneck caused by MMAP IO and how to improve performance to fully optimize compute resources. This no-cost performance tuning technique for Amazon FSx for Lustre and Amazon EC2 enables these applications at scale. The Amazon FSx for Lustre performance documentation provides additional, workload-dependent techniques for gaining further performance increases.

By using standard operating system tools, such as strace and slabtop, you can dive deep into your application profiles and make data-driven decisions to optimize your services and increase workload performance.

Amazon FSx for Lustre is a scale-out High Performance Computing (HPC) file system that is available in most AWS Regions. If you have any comments or questions, don’t hesitate to reach out or leave them in the comments section.

Tom McDonald

Tom McDonald is a Senior Workload Storage Specialist at AWS. Starting with an Atari 400 and re-programming tapes, Tom began a long interest in increasing performance on any storage service. With 20 years of experience in the Upstream Energy domain, file systems and High-Performance Computing, Tom is passionate about enabling others through community and guidance.