The AWS Cloud provides a broad range of scalable, flexible infrastructure services that you can select to match your workloads and tasks. This gives you the ability to choose the most appropriate mix of resources for your specific applications. Cloud computing makes it easy to experiment with infrastructure components and architecture design. The services listed below as HPC solution components are a good starting point for setting up and managing your HPC cluster. However, we recommend testing various instance types, EBS volume types, and deployment methods to find the best performance at the lowest cost.
Data Management & Data Transfer
Running HPC applications in the cloud starts with moving the required data into the cloud. AWS Snowball and AWS Snowmobile are data transport solutions that transfer large amounts of data into and out of the AWS Cloud on devices designed to be secure. Using Snowball addresses common challenges with large-scale data transfers, including high network costs, long transfer times, and security concerns. AWS DataSync is a data transfer service that makes it easy for you to automate moving data between on-premises storage and Amazon S3 or Amazon Elastic File System (Amazon EFS). DataSync automatically handles many of the tasks related to data transfers that can slow down migrations or burden your IT operations, including running your own instances, handling encryption, managing scripts, network optimization, and data integrity validation. AWS Direct Connect is a cloud service that makes it easy to establish a dedicated network connection from your premises to AWS. Using AWS Direct Connect, you can establish private connectivity between AWS and your data center, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections.
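To see when long transfer times become the deciding factor between a network transfer and a shipped device, a rough estimate is often enough. The sketch below is illustrative only: the function name and the default 80% link utilization are assumptions, not AWS figures, and it ignores protocol overhead and device shipping time.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Estimate the days needed to move `data_tb` terabytes over a network link.

    `utilization` is an assumed fraction of the nominal link speed that the
    transfer actually achieves; it is a placeholder, not a measured value.
    """
    if data_tb < 0 or link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("data_tb must be >= 0, link_gbps > 0, 0 < utilization <= 1")
    bits = data_tb * 1e12 * 8                       # decimal terabytes to bits
    effective_bps = link_gbps * 1e9 * utilization   # usable bits per second
    return bits / effective_bps / 86_400            # seconds to days


if __name__ == "__main__":
    # Compare the same 100 TB data set over three hypothetical link speeds.
    for gbps in (0.1, 1.0, 10.0):
        print(f"100 TB over {gbps:g} Gbps: {transfer_days(100, gbps):.1f} days")
```

Under these assumptions, 100 TB over a 1 Gbps link takes roughly 11.6 days, which is the kind of figure that makes a physical transfer device or a dedicated connection worth evaluating.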
Compute & Networking
The AWS HPC solution lets you choose from a variety of compute instance types that can be configured to suit your needs, including the latest Intel® Xeon® processors, 2nd gen AMD EPYC processors, Arm-based AWS Graviton2 processors, NVIDIA GPU-based instances, and field programmable gate array (FPGA)-powered instances.
Compute intensive: C5n instances feature the Intel Xeon Platinum 8000 series (Skylake-SP) processor with a sustained all-core Turbo CPU clock speed of up to 3.5 GHz. C5n instances provide up to 100 Gbps of network bandwidth and up to 14 Gbps of dedicated bandwidth to Amazon EBS. Many customers will find that C5n instances meet the core requirements for their HPC workloads. These instances are designed to address many common HPC workloads, from computational fluid dynamics (CFD), computer-aided engineering (CAE), and materials science to reservoir simulation. Applications that demand high levels of inter-instance communication, such as Message Passing Interface (MPI) based workloads, can benefit from the high inter-instance network bandwidth offered by the C5n. For compute-intensive applications that don’t need the high networking speeds and can run on AMD processors, C5a instances with 2nd Gen AMD EPYC processors are a cost-effective solution.
Processor architecture agnostic: Customers who are able to build their applications for the Arm architecture can benefit from the price performance offered by Arm-based AWS Graviton2 processors. The Graviton2-based C6g and C6gn instances are optimized for compute-intensive workloads. These may include popular open source software such as numerical weather forecasting models and fire dynamics simulations.
Accelerated computing: Those running neural network training, inference, or CUDA workloads should consider the P4d instance with the latest NVIDIA A100 GPUs. The P4d provides the highest performance for machine learning (ML) training and HPC applications such as natural language processing, object detection and classification, seismic analysis, and molecular dynamics. Applications for workloads such as genomics and financial risk models can benefit from the FPGAs offered by the F1 instance. Customers running certain CUDA applications or rendering workloads can benefit from the NVIDIA T4 GPUs offered on the G4, or on the G4dn for applications that require high networking throughput.
Single threaded performance: Some workloads, such as electronic design automation, select computational fluid dynamics workflows, or other applications where core-based licensing is the primary driver of cost, benefit from instances that provide higher processor clock speeds to improve single-threaded performance. The Z1d offers a 4.0 GHz clock speed and a memory/vCPU ratio of 8 (8 GiB of memory per vCPU), making it a good fit for HPC applications that require high single-threaded performance and high memory usage. The M5zn, with its 4.5 GHz clock speed and memory/vCPU ratio of 4, is a good fit for applications such as finite element analysis using implicit methods, which need high network bandwidth but don't need the high memory offered by the Z1d.
Networking: Amazon EC2 instances support enhanced networking, which allows them to achieve higher bandwidth and lower inter-instance latency compared to traditional virtualization methods. Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables you to run HPC applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system (OS) bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling HPC applications. AWS also offers placement groups for tightly coupled HPC applications that require low-latency networking. Amazon Virtual Private Cloud (Amazon VPC) provides IP connectivity between compute instances and storage components.
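The selection guidance above can be turned into a simple shortlist filter. The sketch below is a hedged illustration, not an AWS tool: the `InstanceFamily` type and `shortlist` function are hypothetical names, and the table holds only the clock speeds and memory/vCPU ratios quoted in this section (memory taken here as GiB per vCPU, and left as `None` where the text gives no figure). Real selection should also weigh price, network bandwidth, and benchmark results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceFamily:
    name: str
    clock_ghz: float                    # clock speed quoted in the text
    mem_per_vcpu: Optional[float]       # memory/vCPU ratio, None if not quoted


# Figures copied from the descriptions above; nothing else is assumed.
FAMILIES = [
    InstanceFamily("C5n", 3.5, None),
    InstanceFamily("Z1d", 4.0, 8.0),
    InstanceFamily("M5zn", 4.5, 4.0),
]


def shortlist(
    families: List[InstanceFamily],
    min_clock_ghz: float = 0.0,
    min_mem_per_vcpu: Optional[float] = None,
) -> List[str]:
    """Return family names meeting the minimums, fastest clock first."""
    picks = []
    for fam in families:
        if fam.clock_ghz < min_clock_ghz:
            continue
        if min_mem_per_vcpu is not None:
            # A family with no quoted ratio cannot satisfy a memory minimum.
            if fam.mem_per_vcpu is None or fam.mem_per_vcpu < min_mem_per_vcpu:
                continue
        picks.append(fam)
    picks.sort(key=lambda f: f.clock_ghz, reverse=True)
    return [f.name for f in picks]


if __name__ == "__main__":
    print(shortlist(FAMILIES, min_clock_ghz=4.0, min_mem_per_vcpu=8))
    print(shortlist(FAMILIES, min_clock_ghz=4.0, min_mem_per_vcpu=4))
```

A memory-heavy, single-threaded workload narrows to the Z1d, while relaxing the memory requirement brings in the higher-clocked M5zn as well.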
Storage
Storage options and storage costs are critical factors when considering an HPC solution. AWS offers flexible object, block, or file storage for your transient and permanent storage requirements. Amazon Elastic Block Store (Amazon EBS) provides persistent block storage volumes for use with Amazon EC2. With Provisioned IOPS volumes, you can allocate storage of the size and performance you need and attach the volumes to your EC2 instances. Amazon Simple Storage Service (Amazon S3) is designed to store and access any type of data over the internet and can hold HPC input and output data long term, without the need for a future data migration project. Amazon FSx for Lustre is a high-performance file storage service designed for demanding HPC workloads and can be used with Amazon EC2 in the AWS Cloud. Amazon FSx for Lustre works natively with Amazon S3, making it easy for you to process cloud data sets with high-performance file systems. When linked to an S3 bucket, an FSx for Lustre file system transparently presents S3 objects as files and allows you to write results back to S3. You can also use FSx for Lustre as a standalone high-performance file system to burst your workloads from on-premises to the cloud. By copying on-premises data to an FSx for Lustre file system, you can make that data available for fast processing by compute instances running on AWS. Amazon Elastic File System (Amazon EFS) provides simple, scalable file storage for use with Amazon EC2 instances in the AWS Cloud.
Automation and Orchestration
Automating the job submission process and scheduling submitted jobs according to predetermined policies and priorities are essential for efficient use of the underlying HPC infrastructure. AWS Batch lets you run hundreds to thousands of batch computing jobs by dynamically provisioning the right type and quantity of compute resources based on the job requirements. AWS ParallelCluster is a fully supported and maintained open source cluster management tool that makes it easy for scientists, researchers, and IT administrators to deploy and manage HPC clusters in the AWS Cloud. NICE EnginFrame is a web portal designed to provide efficient access to HPC-enabled infrastructure using a standard browser. EnginFrame provides a user-friendly environment for HPC job submission, job control, and job monitoring.
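As an illustration of automated job submission, the sketch below builds the keyword arguments for the AWS Batch `submit_job` API call as an array job, so one request fans out into many child jobs. The helper names, the queue and job definition names, and the `MESH` environment variable are hypothetical placeholders. The 2 to 10,000 array-size check reflects the documented limits as understood at the time of writing and should be verified against the current AWS Batch documentation. The `submit` function requires the third-party boto3 package and valid AWS credentials, so it is defined but not called here.

```python
from typing import Any, Dict, Optional


def build_array_job_request(
    job_name: str,
    job_queue: str,
    job_definition: str,
    array_size: int,
    environment: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the AWS Batch submit_job call (array job)."""
    if not 2 <= array_size <= 10_000:
        # Assumed service limits; confirm against current AWS Batch docs.
        raise ValueError("array_size must be between 2 and 10,000")
    request: Dict[str, Any] = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": array_size},
    }
    if environment:
        # Environment variables are passed as a list of name/value pairs.
        request["containerOverrides"] = {
            "environment": [
                {"name": key, "value": value}
                for key, value in sorted(environment.items())
            ]
        }
    return request


def submit(request: Dict[str, Any]) -> str:
    """Send the request to AWS Batch and return the job ID."""
    import boto3  # third-party; imported lazily so the builder works without it

    return boto3.client("batch").submit_job(**request)["jobId"]


if __name__ == "__main__":
    # Hypothetical names; replace with your own queue and job definition.
    print(build_array_job_request(
        "cfd-sweep", "hpc-queue", "solver:1", 100, {"MESH": "fine"}
    ))
```

Keeping request construction separate from the API call makes the submission logic easy to test without touching AWS.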
Operations & Management
Monitoring the infrastructure and avoiding cost overruns are two of the most important capabilities that can help HPC system administrators efficiently manage your organization’s HPC needs. Amazon CloudWatch is a monitoring and management service built for developers, system operators, site reliability engineers (SRE), and IT managers. CloudWatch provides you with data and actionable insights to monitor your applications, understand and respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. AWS Budgets gives you the ability to set custom budgets that alert you when your costs or usage exceed (or are forecasted to exceed) your budgeted amount.
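The two alert conditions described for budgets, actual spend over the limit and forecasted spend over the limit, can be sketched as follows. This is a simplified illustration only: the function name and status strings are hypothetical, and the straight-line forecast is an assumption made for the example, not the forecasting method AWS Budgets uses.

```python
def budget_status(
    spend_to_date: float,
    days_elapsed: int,
    days_in_period: int,
    budget: float,
) -> str:
    """Classify spend against a budget using a naive linear forecast."""
    if not 0 < days_elapsed <= days_in_period:
        raise ValueError("days_elapsed must be in 1..days_in_period")
    # Assumption: spend continues at the average daily rate seen so far.
    forecast = spend_to_date / days_elapsed * days_in_period
    if spend_to_date > budget:
        return "ACTUAL_EXCEEDED"
    if forecast > budget:
        return "FORECAST_EXCEEDED"
    return "OK"


if __name__ == "__main__":
    # 600 spent after 10 of 30 days projects to 1800 against a 1000 budget.
    print(budget_status(600.0, 10, 30, 1000.0))
```

Checking the forecast as well as the actual figure gives administrators time to scale a cluster down before the budget is actually exceeded.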
Visualization
The ability to visualize results of engineering simulations without having to move massive amounts of data to and from the cloud is an important aspect of the HPC stack. Remote visualization can significantly shorten turnaround times for engineering design. NICE DCV enables you to remotely access 2D/3D interactive applications over a standard network. In addition, Amazon AppStream 2.0 is a fully managed application streaming service that can securely deliver application sessions to a browser on any computer or workstation.
Security and Compliance
Security management and regulatory compliance are other important aspects of running HPC in the cloud. AWS offers multiple security-related services and quick-launch templates to simplify the process of creating an HPC cluster and implementing best practices in data security and regulatory compliance. The AWS infrastructure puts strong safeguards in place to help protect customer privacy. All data is stored in highly secure AWS data centers. AWS Identity and Access Management (IAM) provides a robust solution for managing users, roles, and groups that have rights to access specific data sources. Organizations can issue users and systems individual identities and credentials, or provision them with temporary access credentials using the AWS Security Token Service (AWS STS). AWS manages dozens of compliance programs in its infrastructure. This means that segments of your compliance have already been completed. AWS infrastructure is compliant with many relevant industry regulations such as HIPAA, FISMA, FedRAMP, PCI, ISO 27001, SOC 1, and others.
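To make the idea of granting rights to specific data sources concrete, the sketch below generates an IAM policy document that allows read-only access to a single S3 bucket holding HPC input data. The function name and the bucket name `example-hpc-input` are hypothetical. The policy follows the standard IAM JSON structure but is a minimal starting point, not a reviewed security baseline.

```python
import json


def read_only_bucket_policy(bucket: str) -> dict:
    """Return an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name, used only for illustration.
    print(json.dumps(read_only_bucket_policy("example-hpc-input"), indent=2))
```

A policy like this could be attached to the role that compute nodes assume, so that jobs can read input data without holding long-lived credentials.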