AWS HPC Blog

Build and deploy a 1 TB/s file system in under an hour

High-throughput shared file systems are an essential part of any HPC or AI environment. Whether you want to train large language models (LLMs), find effective new drugs, or lead the search for new sources of energy, shared file systems are responsible for feeding compute and storing the hard-won results.

If you manage or use an on-premises environment, you know the complexity, value, and cost of providing a high-throughput persistent data repository. You probably also know it typically takes weeks or months to plan and provision a large Lustre environment from scratch.

In this post, I’ll show how to build a persistent shared file system capable of more than a terabyte per second of throughput on AWS in under an hour.

Background

The dynamic nature of Amazon FSx for Lustre lets your organization harness the massive scale of AWS compute, network, and storage for your HPC and AI applications while paying only for actual usage. If your organization is trying to reduce both the time and the cost of reaching a solution, FSx for Lustre combined with Amazon EC2 accelerated computing instances gets you off to a great start without waiting for on-premises resources.

Live at the AWS booth at Supercomputing 2023 (SC23), I demonstrated a 1.1 PiB FSx for Lustre file system capable of over a terabyte per second, built from a standard AWS account in the US East (N. Virginia) Region.

Before building a file system this large, you need to increase your account quotas for FSx for Lustre; there is no charge for maintaining an increased quota. Most importantly, I did not have to architect or pre-provision storage, storage servers, or interconnect, nor select a particular Lustre configuration to meet my performance goals.
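If you prefer to handle the quota step programmatically, here is a minimal sketch using the Service Quotas API with boto3. The Region, the specific quota to raise, and the desired value are assumptions for illustration only; check the Service Quotas console for the exact quota your build needs.

```python
import boto3

# Sketch only: the Region, quota code, and desired value are placeholders.
quotas = boto3.client("service-quotas", region_name="us-east-1")

# List the quotas that apply to Amazon FSx to find the one you need to raise.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="fsx"):
    for quota in page["Quotas"]:
        print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

# Then request an increase for the quota code you identified above.
# quotas.request_service_quota_increase(
#     ServiceCode="fsx",
#     QuotaCode="L-XXXXXXXX",   # placeholder quota code
#     DesiredValue=1200000,     # e.g. total SSD storage capacity, in GiB
# )
```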

As shown in figure 1, after answering a few questions, the FSx service started creating a high-throughput, persistent, SSD-backed, POSIX-compliant file system.

Figure 1 – Screenshot showing creation of a persistent, 1 TB/s FSx for Lustre file system in progress
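The same build can be scripted instead of using the console. The sketch below uses the boto3 FSx API to show the general shape of the request; the subnet, security group, capacity, and throughput values are placeholders I chose for illustration, not the exact settings used in the SC23 demo.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

# Sketch only: IDs and sizing below are placeholders, not the SC23 settings.
# Aggregate throughput scales with StorageCapacity x PerUnitStorageThroughput,
# so a ~1 PiB file system at 1,000 MB/s per TiB lands in the ~1 TB/s range.
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=12000,                       # GiB; grow toward ~1 PiB for 1 TB/s
    SubnetIds=["subnet-0123456789abcdef0"],      # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder
    LustreConfiguration={
        "DeploymentType": "PERSISTENT_2",        # persistent, SSD-backed
        "PerUnitStorageThroughput": 1000,        # MB/s per TiB of storage
    },
)
print(response["FileSystem"]["FileSystemId"])
```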

By removing the heavy lift of deployment, you not only save an enormous amount of time: the human effort previously spent planning, evaluating, procuring, racking, building, configuring, and managing hardware can be redirected to tasks that more directly impact time to solution.

How long does it take for the file system to become ready?

When you create an FSx for Lustre file system, you can create it empty, or you can import existing metadata from data repositories. Data repositories are simply Amazon S3 buckets or S3 prefixes.

When I created my 1.1 PiB file system at SC23, the build completed in far less than one hour, including importing the metadata. You can specify one data repository association (DRA) at build time, and you can add up to seven more (as needed) after the initial build. Of course, the time to load a DRA depends directly on the file system's throughput level and the volume of metadata to import.
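A DRA can also be created (or added later) through the API. This is a minimal sketch; the file system ID, the path inside the file system, and the bucket name are placeholders.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

# Sketch only: the file system ID and S3 bucket below are placeholders.
fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",
    FileSystemPath="/input",                      # path inside the file system
    DataRepositoryPath="s3://my-example-bucket/input/",
    BatchImportMetaDataOnCreate=True,             # import metadata during the build
    S3={
        "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    },
)
```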

What sort of environments does FSx for Lustre enable? At SC23, I used a cluster of over 120 Amazon EC2 i3en.24xlarge storage-optimized instances, each with a 100 Gb/s Elastic Fabric Adapter (EFA) for network connectivity.

We recently introduced HPC-optimized instances that support EFA bandwidth of up to 300 Gb/s (the Hpc7a instance). You can find out more about them on our HPC Tech Shorts channel on YouTube. If you need to scale your compute for heavy-duty GPU workloads, FSx for Lustre can also connect to Amazon EC2 UltraClusters.

The particular instances you choose for any given workload should be tailored to your application's requirements, to keep your costs low and your speed of execution high.

Now, once our build and metadata load are complete, let's move on to testing our new file system.
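Before running any benchmark, each cluster node needs to mount the file system. Here is a minimal sketch, assuming a placeholder file system ID and a /fsx mount point, that looks up the file system's DNS name and Lustre mount name and issues the standard Lustre mount; it also assumes the Lustre client is already installed on the node.

```python
import subprocess
import boto3

# Sketch only: the file system ID and the /fsx mount point are placeholders.
fsx = boto3.client("fsx", region_name="us-east-1")
fs = fsx.describe_file_systems(
    FileSystemIds=["fs-0123456789abcdef0"]
)["FileSystems"][0]

dns_name = fs["DNSName"]
mount_name = fs["LustreConfiguration"]["MountName"]

# Equivalent to: sudo mount -t lustre -o relatime,flock <dns>@tcp:/<mountname> /fsx
subprocess.run(
    ["sudo", "mount", "-t", "lustre", "-o", "relatime,flock",
     f"{dns_name}@tcp:/{mount_name}", "/fsx"],
    check=True,
)
```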

How fast can this go?

Figure 2 – Screenshot of a single cluster node achieving 9.25 GB/s of combined file system read/write throughput via fio

In figure 2 above, you can see the standard file benchmark fio being used with a 1 MB IO size to drive both read and write throughput from a single node, consuming most of the instance's 100 Gb/s Elastic Fabric Adapter (EFA) bandwidth. Note that 9.25 GB/s, close to line speed, was achieved with absolutely no tuning of block size or network settings. The SC23 demonstration was not designed to show best-case benchmarked performance, but to demonstrate the non-cached, sustained performance available with default settings and no tuning. Stay tuned to this channel for more posts explaining how to optimize FSx for Lustre performance using popular Lustre benchmarking tools.
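For reference, here is a hedged sketch of this kind of fio run, launched from Python so it stays consistent with the other examples in this post. The job count, file sizes, and /fsx mount point are my assumptions, not the exact SC23 job definition.

```python
import subprocess

# Sketch only: job count, sizes, and the /fsx mount point are assumptions.
# --rw=rw issues mixed sequential reads and writes with a 1 MB block size.
subprocess.run(
    [
        "fio",
        "--name=fsx-throughput",
        "--directory=/fsx/benchmark",
        "--rw=rw",               # mixed read/write
        "--bs=1M",               # 1 MB IO size, as described above
        "--size=16G",            # per-job file size (assumption)
        "--numjobs=16",          # parallel jobs per node (assumption)
        "--direct=1",            # bypass the page cache for non-cached results
        "--ioengine=libaio",
        "--iodepth=16",
        "--group_reporting",
    ],
    check=True,
)
```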

In the background of figure 2, you can see the Amazon CloudWatch dashboard I created with a few clicks to monitor file system performance metrics. This chart shows only one node performing IO.
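If you'd rather pull the same numbers programmatically than build a dashboard, here is a minimal sketch (the file system ID is a placeholder) that reads the FSx DataReadBytes and DataWriteBytes metrics from CloudWatch and converts them to throughput.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
fs_id = "fs-0123456789abcdef0"   # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(minutes=10)

for metric in ("DataReadBytes", "DataWriteBytes"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/FSx",
        MetricName=metric,
        Dimensions=[{"Name": "FileSystemId", "Value": fs_id}],
        StartTime=start,
        EndTime=end,
        Period=60,                # one-minute buckets
        Statistics=["Sum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        gb_per_s = point["Sum"] / 60 / 1e9   # bytes per minute -> GB/s
        print(f'{point["Timestamp"]:%H:%M} {metric}: {gb_per_s:.2f} GB/s')
```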

Now let's look at aggregate file system throughput, as shown in figure 3.

Figure 3 – Screenshot of the aggregate 1 TB/s of file system read/write throughput via fio

Figure 3 shows that our 120-node cluster achieved over 1 TB/s of FSx for Lustre throughput, which is the maximum throughput we configured the file system for when we launched it. These are impressive results for a file system you can stand up or stand down on demand and pay for only as you go.

You no longer need to amortize your storage costs over years to justify running a project: you can have 1 TB/s for just the hours or days you need high-speed, file-level access. During off-peak times, you can stand down your FSx for Lustre file system and still access the data, at lower speed, in the underlying repositories via the S3 object protocol. This lets you stage, de-stage, or browse your content, and run FSx for Lustre only when you need it for application performance.
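As one hedged example of standing the file system down, a single API call removes it while the data already exported to the associated S3 data repositories stays put. The file system ID below is a placeholder.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

# Sketch only: the file system ID is a placeholder. Make sure everything you
# need has been exported to the associated S3 data repositories first; the
# file system's own storage is released when it is deleted.
fsx.delete_file_system(FileSystemId="fs-0123456789abcdef0")
```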

Conclusion

Developing and deploying HPC and AI workloads is complex and difficult enough without fighting the headwinds of equipment availability, power and cooling constraints, datacenter build-out, and vendor interoperability. By leveraging FSx for Lustre, your organization can cloud-burst HPC or AI workloads to AWS, or easily migrate them there entirely.

But if your organization is just setting out on its HPC or AI development journey, you can start today and shave months or years off your project. If you want to stand on the shoulders of others, you can use cluster and file system stacks created by experts who have done this before. You can launch customizable templates from our HPC Recipes Library, all built to be interoperable and simple to use. Reach out to us at ask-hpc@amazon.com if you have ideas for how we can make it easier for you.