AWS Storage Blog
Persistent-50 high-performance file systems pack a punch!
Like many things in life, the more choices you have, the harder it is to make a decision. While running workloads in the cloud does make some decisions easier, it doesn’t eliminate them altogether, and there are still choices you must make on how to best architect for the cloud. These decisions now center more on which AWS service, resource, or configuration is best for you. Now that Amazon FSx for Lustre has persistent file systems, many of you may wonder which per unit throughput setting is best suited for your workload. In this blog post, I share how FSx for Lustre persistent-50 file systems perform. I use IOR, a benchmarking application commonly used to evaluate the performance of parallel file systems, to test different file system components using read and write operations. In doing so, I hope to show how persistent-50 file systems can achieve extremely high throughput, well above the advertised per unit throughput value of these file systems.
If you’re not familiar with persistent FSx for Lustre file systems, I encourage you to read this recent blog post written by one of my colleagues at AWS.
How high performance is achieved
When creating an FSx for Lustre file system – scratch or persistent – you choose the total storage capacity of the file system. The throughput a file system is able to support is proportional to its storage capacity. For scratch deployment-type file systems, the per unit throughput is 200 MB/s per TiB of storage capacity – easy, no decision to make. For persistent deployment-type file systems, you have a choice of 50 MB/s (persistent-50), 100 MB/s (persistent-100), or 200 MB/s (persistent-200) per TiB of storage capacity. Each of these per unit throughput options has a different price point, giving you the flexibility to select a throughput level that best aligns with your budget and performance needs.
There are three components that contribute to a file system’s overall throughput: network throughput, in-memory cache, and disk throughput. The performance section of the Amazon FSx for Lustre user guide has a great table that shows the disk, in-memory cache, and network throughput for all deployment-type file systems. The following (Table 1) is a subset of that table, showing throughput and caching information for only persistent file systems.
| Deployment type | Network throughput, baseline (MB/s per TiB of storage provisioned) | Network throughput, variable (MB/s per TiB of storage provisioned) | Memory for caching (GiB per TiB of storage provisioned) | Disk throughput, baseline (MB/s per TiB of storage provisioned) | Disk throughput, burst (MB/s per TiB of storage provisioned) |
| --- | --- | --- | --- | --- | --- |
| PERSISTENT-50 | 250 | Up to 1300 | 2.2 | 50 | Up to 240 |
| PERSISTENT-100 | 500 | Up to 1300 | 4.4 | 100 | Up to 240 |
| PERSISTENT-200 | 750 | Up to 1300 | 8.8 | 200 | Up to 240 |
Table 1 – FSx for Lustre persistent file system performance
When you select the per unit throughput of a file system, you’re really selecting the baseline disk throughput available to that file system. This is typically the slowest-performing component of a file system. Burst disk throughput, in-memory cache, and the baseline and variable network performance of the file system allow it to operate at substantially higher throughput rates than the baseline disk throughput. This gives you access to much more throughput than the per unit throughput you actually chose.
File-based workloads are typically spiky, driving high levels of throughput for short periods but lower levels of throughput for longer periods. These types of workloads fit well within the burst model of FSx for Lustre. If your workload is more consistent, select a persistent per unit throughput that aligns with your needs, but remember you still have burst throughput available if you need it. You never know when something will drive throughput levels above the norm.
How I tested
Let me show you how a persistent-50 file system bursts above its baseline.
First, I create a 2.4 TiB persistent-50 file system. Based on the values in Table 1, the throughput and in-memory cache available to these persistent file systems is in the following table (Table 2).
| 2.4-TiB storage capacity | Network throughput, baseline (MB/s) | Network throughput, variable (MB/s) | In-memory cache (GiB) | Disk throughput, baseline (MB/s) | Disk throughput, burst (MB/s) |
| --- | --- | --- | --- | --- | --- |
| PERSISTENT-50 | 586 | 3047 | 5.28 | 117 | 563 |
| PERSISTENT-100 | 1172 | 3047 | 10.56 | 234 | 563 |
| PERSISTENT-200 | 1758 | 3047 | 21.12 | 469 | 563 |
Table 2 – FSx for Lustre persistent file system performance of a 2.4-TiB file system
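For reference, a 2.4 TiB persistent-50 file system like the one used in these tests can be created with a single AWS CLI call. The following is a minimal sketch; the subnet and security group IDs are placeholders, and the storage capacity is specified in GiB (2,400 GiB).

# create a 2.4 TiB (2,400 GiB) persistent-50 FSx for Lustre file system
# the subnet and security group IDs below are placeholders
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 2400 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --lustre-configuration DeploymentType=PERSISTENT_1,PerUnitStorageThroughput=50 \
  --tags Key=Name,Value=persistent-50-test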
Second, I launch four m5n.8xlarge instances using the latest Amazon Linux 2 AMI. I purposely select this instance type because of its non-variable network performance. I don’t want the variable network performance of smaller Amazon EC2 instances to affect my three long-running tests; I need consistent network performance from my EC2 instances to the file system.
Below is an example of my user data script. It installs the latest AWS CLI, the Lustre client, IOR, and a few supporting packages, and mounts the 2.4 TiB persistent-50 file system at /fsx. This script does not change the stripe count or stripe size of the file system, so all my tests use the default Lustre stripe configuration – a stripe count of 1 and a stripe size of 1,048,576 bytes.
#cloud-config
repo_update: true
repo_upgrade: all
runcmd:
- curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
- unzip awscli-bundle.zip
- ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
- export PATH=/usr/local/bin:$PATH
- amazon-linux-extras install -y epel lustre2.10
- yum groupinstall -y "Development Tools"
- yum install -y fpart parallel tree nload git libaio-devel openmpi openmpi-devel
- cd /home/ec2-user
- git clone https://github.com/hpc/ior.git
- export PATH=$PATH:/usr/lib64/openmpi/bin
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib/
- cd ior
- ./bootstrap
- ./configure
- make
- sudo cp src/ior /usr/local/bin
- cd /home/ec2-user
- filesystem_id=
- mount_point=/fsx
- availability_zone=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
- region=${availability_zone:0:-1}
- mount_name=$(aws fsx describe-file-systems --file-system-ids ${filesystem_id} --query 'FileSystems[*].LustreConfiguration.MountName' --output text --region ${region})
- mkdir -p ${mount_point}
- echo "${filesystem_id}.fsx.${region}.amazonaws.com:/${mount_name} ${mount_point} lustre defaults,noatime,flock,_netdev 0 0" >> /etc/fstab
- mount -a -t lustre
- chown ec2-user:ec2-user ${mount_point}
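With that user data saved to a file (for example, user-data.yaml), the four m5n.8xlarge instances can be launched with the AWS CLI. The following is a minimal sketch; the AMI, key pair, subnet, and security group values are placeholders, and the instances should be launched in the same VPC as the file system.

# launch four m5n.8xlarge instances with the user data script above
# the AMI, key pair, subnet, and security group values are placeholders
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type m5n.8xlarge \
  --count 4 \
  --key-name my-key-pair \
  --subnet-id subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --user-data file://user-data.yaml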
Third, from all four instances I run an IOR script that writes 512 GiB of data continuously to the persistent-50 file system in parallel. The following is an example of the IOR write script. I bypass the cache on the client EC2 instances and use eight threads per instance.
# IOR - write test
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
mpirun --npernode 8 --oversubscribe ior --posix.odirect -t 1m -b 1m -s 16384 -g -v -w -i 100 -F -k -D 0 -o /fsx/ior-${instance_id}.bin
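To start the write test on all four instances at roughly the same time, one option is a simple SSH loop from a workstation. This is just a sketch; the host names and the ior-write.sh script (containing the two lines above) are hypothetical.

# start the IOR write test on all four clients at once
# host names and script name below are hypothetical
for host in client1 client2 client3 client4; do
  ssh ec2-user@${host} 'bash /home/ec2-user/ior-write.sh' &
done
wait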
The following screenshot of an Amazon CloudWatch graph shows the total throughput of the file system. For a short period of time, the IOR write test uses the burst and variable throughput of the file system and the overall throughput is a consistent 604 MB/s. For the remainder of the test, I get 149 MB/s from a persistent-50 file system that has a baseline disk throughput of 117 MB/s. Refer back to Table 2 to see how these results compare to the theoretical numbers.
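If you want the raw numbers behind a graph like this, the same data is available from the CloudWatch API. The following sketch sums the DataWriteBytes metric for the file system in 60-second periods; dividing each sum by 60 gives bytes per second (and dividing again by 1,000,000 gives approximately MB/s). The file system ID, time window, and Region are placeholders.

# file system write throughput: sum of DataWriteBytes per 60-second period
# divide each Sum by 60 to get bytes per second
# file system ID, time window, and Region below are placeholders
aws cloudwatch get-metric-statistics \
  --namespace AWS/FSx \
  --metric-name DataWriteBytes \
  --dimensions Name=FileSystemId,Value=fs-0123456789abcdef0 \
  --statistics Sum \
  --period 60 \
  --start-time 2020-04-01T00:00:00Z \
  --end-time 2020-04-01T01:00:00Z \
  --region us-east-1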
Fourth, instead of waiting for the burst credits to be replenished, I create a new 2.4 TiB persistent-50 file system that is identical to the one I just tested and mount it on all four EC2 instances. This only takes about five minutes so I’m up and running for my next test in no time. From one instance I run an IOR script that creates eight 8-GiB files totaling 64 GiB of data. Then, from all four instances I run an IOR script that reads from these files continuously using eight threads per instance. The following is an example of the IOR script.
# IOR - write test from one instance
mpirun --npernode 8 --oversubscribe ior --posix.odirect -t 1m -b 1m -s 8192 -g -v -w -i 1 -F -k -D 0 -o /fsx/ior.bin
# IOR - read test from all instances
mpirun --npernode 8 --oversubscribe ior --posix.odirect -t 1m -b 1m -s 8192 -g -r -i 10000 -F -k -D 60 -o /fsx/ior.bin
The following graph shows the total throughput of this read test. Again, for a short period of time, the IOR read test uses the burst and variable throughput of the file system and the throughput is a consistent 638 MB/s. For the remainder of the test, I get 153 MB/s from a persistent-50 file system that has a baseline disk throughput of 117 MB/s. Refer back to Table 2 to see how these results compare to the theoretical numbers.
Fifth, once again I create a new 2.4-TiB file system so I start out with full burst capabilities. I want to test just the network performance of the file system, so I use IOR to create a dataset that can fully reside in the in-memory cache. From one EC2 instance I create eight 675-MiB files totaling 5.28 GiB of data. Then, from all four instances I run an IOR script that reads from these files continuously using eight threads per instance. All my IOR scripts, including this one, use the --posix.odirect flag to bypass the local cache of the EC2 instance, so all of my I/O requests come over the network from the file system. The following is an example of the IOR script.
# IOR - write test from one instance
mpirun --npernode 8 --oversubscribe ior --posix.odirect -t 1m -b 1m -s 675 -g -v -w -i 1 -F -k -D 0 -o /fsx/ior.bin
# IOR - read test from all instances
mpirun --npernode 8 --oversubscribe ior --posix.odirect -t 1m -b 1m -g -r -i 2000000 -F -k -D 0 -z -o /fsx/ior.bin
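Before kicking off the read loop, it’s worth a quick sanity check that the dataset really is small enough to fit within the roughly 5.28 GiB of in-memory cache. One simple way to do that is shown below; the exact file names depend on IOR’s file-per-process naming.

# confirm the read dataset fits within the ~5.28 GiB in-memory cache
du -sh /fsx
ls -lh /fsx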
The results are amazing.
For a short period of time the file system delivered a consistent 3170 MB/s of variable network throughput followed by a steady 625 MB/s of baseline network throughput.
Your workloads are likely much larger than the memory designated for caching. However, portions of your workload should be able to take advantage of this in-memory cache and the high variable network throughput of these file systems.
What are the results?
Don’t let the small number of 50 MB/s per TiB of storage capacity fool you: these persistent-50 file systems pack a punch. The results of these tests show I was able to burst the file system 5.15, 5.44, and 27.05 times faster than the file system’s per unit throughput during my write, read, and in-memory cache read tests, respectively. Even the non-burst numbers came in substantially higher: 27.34%, 30.77%, and 434.19% greater than the baseline per unit throughput for my write, read, and in-memory cache read tests.
Because file system throughput is proportional to the storage capacity of a file system, I typically recommend customers size their file systems based on the greater of two values: the total storage capacity you need, accounting for expected growth, or the capacity needed to deliver your required throughput at the different per unit throughput values. Test your workload in a POC environment to see if you can run using a persistent-50 file system. Remember, the throughput you actually get, based on your I/O patterns, could be substantially greater than the throughput you select when creating the file system.
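Here’s a quick sketch of that math, using a hypothetical throughput target; it shows how much storage capacity you’d need to provision at each per unit throughput setting to reach that target as baseline disk throughput.

# storage capacity (TiB) needed to reach a target baseline disk throughput
# at each per unit throughput setting (the target value below is hypothetical)
target_mbs=1000
for per_unit in 50 100 200; do
  echo "persistent-${per_unit}: $(echo "scale=1; ${target_mbs}/${per_unit}" | bc) TiB"
done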
Final thoughts
Next time you’re creating an FSx for Lustre file system and you’re moving your pointer back and forth over the throughput per unit of storage radio buttons, do the math and test using a persistent-50 file system. You’ll be surprised.
To watch a video of me running these tests, watch the Amazon FSx for Lustre: Persistent Storage Overview video on the FSx Lustre Resources web page. To learn more about Amazon FSx for Lustre persistent file systems, visit the Amazon FSx for Lustre site and user guide.
Thanks for reading this blog post, please leave any questions or comments in the comments section!