Scaling a read-intensive, low-latency file system to 10M+ IOPs

Introduction

Many shared file systems are used in supporting read-intensive applications. These applications typically exploit copies of datasets whose authoritative copy resides somewhere else. One application of a read-heavy application is financial backtesting. For small datasets, in-memory databases and caching techniques can yield impressive results. However, low latency flash-based scalable shared file systems can provide both massive IOPs and bandwidth. They’re also easy to adopt because of their use of a file-level abstraction.

In this post, I’ll share how to easily create and scale a shared, distributed POSIX compatible file system that performs at local NVMe speeds for files opened read-only. For this configuration, I’ll be using i3en.24xlarge Amazon EC2 instances on AWS. Figure 1 shows the basic architecture, in which there are N file servers acting as file system clients and application servers, as well as one or more remote clients that can access the file system, but which don’t have high-speed local access.

Figure 1: High level architecture of the file system.

In general, this architecture is designed for any type of workflow where some amount of data is pushed to the file system via a single writer and with N clients reading and analyzing the data – eventually producing output that’s highly concentrated thus not requiring a lot of file system writes. In this example, I’ll assume a shared file system size of 54TB, which is the total NVMe capacity of a i3en.24xlarge.

Capacity can be scaled simply by adding row(s) of additional i3en.24xlarge instances that will act as storage expansion servers. We do this by configuring the expansion servers as NVMe over fabric (NVMe-oF) targets, and the the file server instances as initiators. This expansion is shown in figure 1 as the capacity expansion row.

NVMe-oF has been included in mainline Linux kernels for over 5 years. This enables a total shared file system size that is a multiple of 54 TB, with each row of expansion instances adding an additional 54TB of capacity, but not aggregate performance. If the expansion instances are provisioned within the same cluster placement group, the additional “hop” to the expansion NVMe-oF averages less than 200 microseconds, courtesy of Elastic Fabric Adapter (EFA), our high-speed, low-latency network adapter.

While we are on the topic of the goodness of shared file-level abstraction, AWS offers a robust managed Lustre implementation called Amazon FSx for Lustre suitable for the vast majority of high performance workloads. If your application needs a more balanced Read/Write profile, FSx for Lustre is a great choice. For those of you that have an extreme read-only distribution use case like ours, read on for more details on how to roll your own filesystem.

Read-Mostly Distributed File System

For ultra-low latency read access, each file server instance has a local XFS file system, striped via LVM (the Logical Volume Manager) across all 8 local NVMe devices, and optionally concatenated with NVMe-oF drives presented from the capacity row. This arrangement allows the addition of capacity and growth of the shared file system, without disrupting the overall file system structure.

The local file systems that exist on all file servers are all local copies of a shared POSIX-compatible FUSE (Filesystem in USErspace) file system. This is based on open-source GLUSTER, and any number of clients (including the servers themselves) can directly read and write to it through the FUSE mount. When a FUSE client writes to the file system, each write is replicated to all the server nodes, and can be seen immediately on each server and by each client.

Because the file system is designed to scale reads, in a low (or no) write environment, the write performance penalty due to FUSE abstraction and GLUSTER replication to the N underlying file systems isn’t a concern. In practice, the write speed is limited to a few hundred MBytes/s through the remote FUSE client, and reads from a remote client are in the range of a few GBytes/s. However, when accessing files on the file server via the read-only (non-FUSE) mount, read access occurs at PCI backplane speeds. Because each file server is reading directly from a local copy, the aggregate read performance scales perfectly linearly with the number of file servers.

Read-intensive applications and jobs can run directly on the file server, opening the files for read on the local read-only mount point, and writing output files through the FUSE enabled mount point. Both mount are to the same file system directory structure, which are constantly in sync. Non-consistency across the local file systems is avoided by mounting the local copies read-only, and using the shared file system’s consistency for all write and metadata operations. Thus, the full power of a shared file system is at the application’s disposal, but operating at the speed of localized NVMe reads.

Method

To create a read-mostly distributed file system, I provisioned 13 x Amazon EC2 i3en.24xlarge file servers using Red Hat Enterprise Linux 8, but in general, any common Linux distribution can be used. Following the GLUSTER installation guide, I built a single XFS GLUSTER “brick” file system on each server, which was striped using LVM, with a width of 1024 KB. Once all 13 peers (GLUSTER parlance for file server) had been added to the cluster, a volume of type “replica” was created that is automatically replicated to all 13 peers. This configuration leverages GLUSTER to create a shared file system construct that duplicates the entire file system directory and contents to all cluster members. Once created, we start the volume and mount the GLUSTER file system on all peers to an appropriate mount point, in this case, /sharedfs. This is our shared read/write file system.

You can also provision and create as many independent clients as you may need. I created just one, installing GLUSTER, and FUSE mounting /sharedfs on it as well. I used this instance to populate test files, by running fio on the empty file system to generate 48 x 10 GByte files.

To provide local high speed read access on each peer we bind and re-mount the local XFS file system, which the GLUSTER shared file system is replicating to. On each file system server (peer) node, we perform this two-step mount.

First, bind the existing XFS mount to a new directory, like /sharedfsRO. Depending on what naming convention you used in creating your GLUSTER backing store file system, it may mean running something like

mount –bind /data/brick1/brick /sharedfsRO

The second step involves remounting the read/write XFS file system we just bound with a command like

mount -o remount,bind,ro /data/brick1/brick /sharedfsRO

as a read-only file system. You’ll need to do this for each file system server that you want fast reads on.

Now you have access to a local file system for fast local reads mounted at /sharedfsRO. If your application opens a file in this path, all accesses will be at full local speed. Meanwhile, you application can open files for read/write access via the /sharedfs path, if the application needs to write some output after processing. These files will, of course, be automatically replicated to all other file servers. Although it’s possible to use the local read/write XFS mount (/data/brick1/brick), it’s not advisable since any metadata update or writes are all local to the server, and thus will diverge the file system. This should be avoided to maintain file system consistency.

Performance Tests

For benchmarking, a 13 node wide shared file system cluster using i3en.24xlarge instances, each with 54.4 TB of fast-attached NVMe storage was configured. An XFS file system was created on each node and the cluster replicates a copy of each file to each and every i3en performance row instance making the total file system size around 54TB. No capacity expansion row was utilized for benchmarking.

A client-only instance using the FUSE shared file system mount only (/SharedFS) was used to populate the shared filesystem with 48 x 10 GiByte files for testing with FIO.

Local Client Based Results (8k Reads)

Across all 13 nodes, we achieved an aggregate 10,257,000 8k IOPs, with an average latency of 171.34 µsec and an average standard deviation of 95.29 µsec. As expected, this is near-linear read scalability. An example fio run from one of the 13 client/servers that ran in parallel looks like this:

#fio test-local.sh
8k-iops: (g=0): rw=randread, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=8
...
fio-3.19
Starting 16 processes
Jobs: 16 (f=16): [r(16)][100.0%][r=7129MiB/s][r=912k IOPS][eta 00m:00s]
8k-iops: (groupid=0, jobs=16): err= 0: pid=10931: Tue Sep 21 15:07:24 2021
  read: IOPS=912k, BW=7127MiB/s (7473MB/s)(209GiB/30002msec)
    slat (nsec): min=2000, max=202000, avg=3666.69, stdev=1650.87
    clat (usec): min=30, max=10259, avg=135.96, stdev=35.63
     lat (usec): min=77, max=10262, avg=139.72, stdev=35.65
    clat percentiles (usec):
     |  1.00th=[   95],  5.00th=[  101], 10.00th=[  104], 20.00th=[  111],
     | 30.00th=[  117], 40.00th=[  122], 50.00th=[  127], 60.00th=[  133],
     | 70.00th=[  141], 80.00th=[  155], 90.00th=[  184], 95.00th=[  206],
     | 99.00th=[  265], 99.50th=[  289], 99.90th=[  347], 99.95th=[  371],
     | 99.99th=[  429]
   bw (  MiB/s): min= 6985, max= 7198, per=100.00%, avg=7135.42, stdev= 2.97, samples=944
   iops        : min=894145, max=921448, avg=913333.75, stdev=379.67, samples=944
  lat (usec)   : 50=0.01%, 100=3.91%, 250=94.59%, 500=1.50%, 750=0.01%
  lat (msec)   : 10=0.01%, 20=0.01%
  cpu          : usr=11.40%, sys=23.13%, ctx=16323057, majf=0, minf=1453
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=27369714,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: bw=7127MiB/s (7473MB/s), 7127MiB/s-7127MiB/s (7473MB/s-7473MB/s), io=209GiB (224GB), run=30002 -30002msec

Summary

This is a novel method to enable Amazon EC2 instances to access a shared file system via local NVMe, linearly speeding up read-bound distributed applications. Each localized file system can easily handle up to 2 million random-read IOPs, depending on block size, or an average of 800K 8k reads at under 200 microseconds, and thus a payload of over 10 GiBytes per second in large-block reads. In a single 13-node cluster, we saw over 10 Million 8k reads at over 90 GiBytes per second. We maintained file system consistency between local file systems by replicating every write and metadata operation to every file system server. This is a file system that’s ideal for read-intensive workloads that have very low write requirements – a common need in many HPC, batch and deep learning environments. We hope this provokes your imagination. We’d love to know what you use it for.

AWS HPC Blog