This Guidance demonstrates how to design computational fluid dynamics (CFD) workloads on AWS. Large-scale aircraft simulations often require a large amount of compute for a short period. This Guidance moves CFD workloads to the cloud, where you can spin up thousands of compute cores and terminate them once a workload is complete, giving you access to valuable compute resources instantly without the expense and delay of procuring servers. In the architecture, AWS ParallelCluster takes care of the undifferentiated heavy lifting involved in setting up an HPC cluster (including setting up Slurm), configures autoscaling, mounts file systems, and tracks costs through tags.
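As a sketch of what ParallelCluster manages for you, a minimal cluster configuration might look like the following. This is a hypothetical example, not the Guidance's actual configuration; the region, subnet ID, key name, and instance counts are placeholders.

```yaml
# Hypothetical AWS ParallelCluster 3.x configuration sketch.
Region: us-east-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder
  Ssh:
    KeyName: my-key                      # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: cfd
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: hpc6a
          InstanceType: hpc6a.48xlarge
          MinCount: 0                    # scale to zero between jobs
          MaxCount: 32                   # burst capacity; placeholder
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0     # placeholder
Tags:
  - Key: project
    Value: cfd-guidance                  # used for cost tracking
```

From a file like this, ParallelCluster provisions the head node, the Slurm scheduler, and the autoscaling compute fleet, and propagates the tags for cost tracking.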

Architecture Diagram

Download the architecture diagram PDF 

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a solution created with Well-Architected best practices in mind. To be fully Well-Architected, follow as many of these best practices as possible.

  • This Guidance is designed to help you respond to incidents and deploy your own changes. The high performance computing (HPC) clusters can be deployed in another Availability Zone (AZ) or Region with minimal changes, provided the data is stored in Amazon S3 and the cluster's configuration is kept in a YAML file as infrastructure as code (IaC).

    Changes to the infrastructure of this Guidance should be made in the HPC cluster’s YAML configuration file. This file should be version controlled, and changes should be reviewed before deployment. This process can be extended into a continuous integration and continuous deployment (CI/CD) pipeline.
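    As an illustration, a pipeline step could validate a reviewed configuration change with a dry run before applying it. The snippet below is a hypothetical GitHub Actions-style step; the cluster name is a placeholder, and it assumes a runner with the pcluster CLI installed and AWS credentials configured.

    ```yaml
    # Hypothetical CI step: validate the cluster configuration change.
    steps:
      - name: Dry-run the cluster update
        run: |
          pcluster update-cluster \
            --cluster-name cfd-cluster \
            --cluster-configuration cluster.yaml \
            --dryrun true
    ```

    A dry run surfaces validation errors without modifying the running cluster, so a failed check can block the merge.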

    Read the Operational Excellence whitepaper 
  • CFD simulations are often controlled under regulations such as the International Traffic in Arms Regulations (ITAR) and FedRAMP. Therefore, we recommend running this Guidance in AWS GovCloud (US).

    By default, AWS ParallelCluster allows SSH connections on port 22 to the HeadNode. We recommend disabling all outside access and connecting through AWS Systems Manager Session Manager (SSM).
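    For example, the HeadNode section of the cluster configuration can restrict inbound SSH to an internal CIDR range, with interactive access going through Session Manager instead. The key name, CIDR, and instance ID below are placeholders.

    ```yaml
    # Sketch: restrict inbound SSH on the head node to an internal range.
    HeadNode:
      Ssh:
        KeyName: my-key              # placeholder
        AllowedIps: 10.0.0.0/16      # internal CIDR only; placeholder
    ```

    With SSH locked down, you can still open a shell on the head node through SSM, for example: `aws ssm start-session --target i-0123456789abcdef0`.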

    We also recommend following the recommendations found in DoD-Compliant Implementations on AWS. It gives details about deploying Impact Level 4 (IL4), IL5, ITAR, and export-controlled workloads on AWS GovCloud (US), and IL6 workloads in the AWS Secret Region.

    File system data is protected using standard Amazon Elastic Block Store (Amazon EBS) encryption. In addition, FSx for Lustre data is encrypted at rest and in transit. Restricting file system access to just the HPC cluster prevents inadvertent data access.

    Files are accessed through standard Portable Operating System Interface for Unix (POSIX) file permissions. Access can be granted and revoked through user and group permissions.
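    The sketch below is plain Python, not specific to this Guidance; it shows the effect of group-based POSIX permissions on a shared results directory, where members of the owning group have access and other users have none.

```python
import os
import stat
import tempfile

# Create a results directory readable and writable by owner and group only.
results_dir = os.path.join(tempfile.mkdtemp(), "cfd-results")
os.makedirs(results_dir)
os.chmod(results_dir, 0o770)  # rwx for owner and group, nothing for others

mode = stat.S_IMODE(os.stat(results_dir).st_mode)
print(oct(mode))  # -> 0o770: users outside the group have no access
```

    Granting or revoking membership in the owning group is then all it takes to grant or revoke access to the data.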

    Read the Security whitepaper 
  • A few key features provide high availability. These include selecting On-Demand Instances for compute nodes and selecting a fault-tolerant file system such as FSx for Lustre to prevent job failures. FSx for Lustre provides two file system deployment options, scratch and persistent, and supports two persistent deployment types, Persistent_1 and Persistent_2.

    By defining clusters as IaC and backing data up to Amazon S3, the workload can easily be spun up in another Availability Zone or Region should a fault occur. We also recommend using On-Demand Instances for tightly coupled CFD jobs and using FSx for Lustre Persistent_2 instead of scratch. These choices help prevent avoidable outages and allow for replication in the case of unavoidable ones.

    Tightly coupled CFD simulations are inherently sensitive to instance failures and networking blips. By using the On-Demand pricing model and FSx for Lustre Persistent_2, you can reduce the risk of unexpected interruptions.

    AWS ParallelCluster keeps logs in Amazon CloudWatch by default. It also monitors for idle instances and automatically shuts them down. Because clusters are defined as IaC, they can be replicated in another Availability Zone. Configure alarms around parameters such as network utilization, disk usage, and spend. These alarms can then alert you to abnormal behavior so you can address it before an outage occurs.
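    As one example, a CloudFormation fragment like the following would alarm on sustained head node network traffic. The instance ID, threshold, and SNS topic are placeholders; adjust them to your workload's baseline.

    ```yaml
    # Hypothetical CloudFormation fragment: alert on sustained network traffic.
    HeadNodeNetworkAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmDescription: Head node network out is unusually high
        Namespace: AWS/EC2
        MetricName: NetworkOut
        Dimensions:
          - Name: InstanceId
            Value: i-0123456789abcdef0   # placeholder head node ID
        Statistic: Average
        Period: 300
        EvaluationPeriods: 3
        Threshold: 5000000000            # bytes per period; placeholder
        ComparisonOperator: GreaterThanThreshold
        AlarmActions:
          - arn:aws:sns:us-east-1:111122223333:ops-alerts   # placeholder
    ```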

    Data in FSx for Lustre is backed up to Amazon S3 using a Data Repository Association (DRA). Should the file system fail, another can be created that references the same Amazon S3 bucket.
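    In the cluster configuration, the persistent file system looks roughly like the following sketch; the capacity and throughput values are placeholders. For PERSISTENT_2 file systems, the DRA itself is created separately, for example with `aws fsx create-data-repository-association`.

    ```yaml
    # Sketch: persistent FSx for Lustre shared storage, backed by S3 via a DRA.
    SharedStorage:
      - Name: fsx
        StorageType: FsxLustre
        MountDir: /fsx
        FsxLustreSettings:
          StorageCapacity: 4800            # GiB; placeholder
          DeploymentType: PERSISTENT_2
          PerUnitStorageThroughput: 250    # MB/s per TiB; placeholder
    ```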

    Read the Reliability whitepaper 
  • Using FSx for Lustre with HPC instances and an Elastic Fabric Adapter (EFA) provides the optimal performance for tightly coupled CFD simulations.

    The compute nodes should be located in the same Availability Zone as the FSx for Lustre file system and HeadNode. This lowers latency between the file system and instances and avoids inter-Availability Zone data transfer charges.
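    In practice, this means pointing the head node and every compute queue at the same subnet (and therefore the same Availability Zone) as the file system, with a placement group for the compute nodes. The subnet ID below is a placeholder.

    ```yaml
    # Sketch: keep head node, compute nodes, and FSx for Lustre in one AZ.
    HeadNode:
      Networking:
        SubnetId: subnet-0123456789abcdef0    # same subnet/AZ as the file system
    Scheduling:
      SlurmQueues:
        - Name: cfd
          Networking:
            SubnetIds:
              - subnet-0123456789abcdef0      # same subnet as the head node
            PlacementGroup:
              Enabled: true                   # cluster placement for low latency
    ```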

    Read the Performance Efficiency whitepaper 
  • For CFD, savings are realized in the engineering effort, such as research and development (R&D) costs and the cost of materials incurred in running physical tests.

    There are additional savings in running simulations instead of physical tests. When all the engineering time and material supplies are considered, a CFD simulation in the cloud typically costs less than the equivalent physical test.

    Data is kept in the cloud by using the remote desktop visualization software NICE DCV. This not only reduces data transfer costs, it also shortens the time between meshing and solving because data stays on the same file system.

    Tightly coupled simulations like CFD are especially sensitive to instance termination, such as a Spot Instance reclamation. For this reason, we recommend On-Demand as the purchasing model. You can reduce cost by using an HPC instance such as hpc6a.48xlarge, which is less expensive than the equivalent non-HPC instance.

    AWS ParallelCluster dynamically scales up the compute nodes only when jobs are running. When those jobs complete, the instances scale down, ensuring no idle resources are left running. This also allows much larger bursts than an on-premises cluster provides. R&D demand typically fluctuates, with heavy compute use at key intervals such as the end of a project milestone. The cloud lets you obtain this capacity when it’s needed and avoid paying for it when it’s not.
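    The scale-to-zero behavior comes from the queue's compute resource settings and Slurm's scale-down timer, roughly as in this sketch (instance counts are placeholders):

    ```yaml
    # Sketch: compute nodes scale from zero and terminate after idling.
    Scheduling:
      Scheduler: slurm
      SlurmSettings:
        ScaledownIdletime: 10        # minutes of idle before nodes terminate
      SlurmQueues:
        - Name: cfd
          CapacityType: ONDEMAND
          ComputeResources:
            - Name: hpc6a
              InstanceType: hpc6a.48xlarge
              MinCount: 0            # nothing runs between jobs
              MaxCount: 64           # burst capacity; placeholder
    ```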

    Read the Cost Optimization whitepaper 
  • Instances are spun up only when needed, then terminated when no longer needed. You can review Slurm accounting logs to see the CPU utilization of the resources you requested. If you aren’t using the compute resources efficiently, you can reduce the amount requested in the next job.
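    As a sketch, CPU efficiency can be computed from two fields Slurm accounting reports (for example, via `sacct --format=JobID,TotalCPU,CPUTime`): TotalCPU, the CPU time actually consumed, and CPUTime, the cores allocated multiplied by elapsed time. The sample values below are made up for illustration.

```python
def parse_slurm_time(value: str) -> float:
    """Convert a Slurm time string like '1-02:03:04' or '02:03:04' to seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (float(x) for x in value.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds

def cpu_efficiency(total_cpu: str, cpu_time: str) -> float:
    """Fraction of the allocated CPU time that was actually used."""
    return parse_slurm_time(total_cpu) / parse_slurm_time(cpu_time)

# Hypothetical job: 12 hours of CPU consumed out of 16 hours allocated.
print(round(cpu_efficiency("12:00:00", "16:00:00"), 2))  # -> 0.75
```

    A low ratio suggests the next job can request fewer cores or less wall time.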

    FSx for Lustre provides a caching layer between Amazon S3 and the HPC cluster, which allows low-latency data access from the compute nodes. When a job is complete, results are stored in Amazon S3, which can be configured with S3 Intelligent-Tiering for inexpensive storage of cold data.
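    For the results bucket, a lifecycle rule such as this CloudFormation fragment moves aging results into S3 Intelligent-Tiering; the bucket name and transition window are placeholders.

    ```yaml
    # Hypothetical fragment: transition cold CFD results to Intelligent-Tiering.
    ResultsBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: cfd-results-bucket        # placeholder
        LifecycleConfiguration:
          Rules:
            - Id: tier-cold-results
              Status: Enabled
              Transitions:
                - StorageClass: INTELLIGENT_TIERING
                  TransitionInDays: 30        # placeholder
    ```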

    By matching instances to the requested cores and memory, Slurm efficiently scales up only the required number of instances and can pack multiple jobs onto the same instance.

    Read the Sustainability whitepaper 

Implementation Resources

A detailed guide is provided for you to experiment with in your AWS account. It examines each stage of the Guidance, including deployment, usage, and cleanup.

The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.


The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.