AWS HPC Blog

Using Spot Instances with AWS ParallelCluster and Amazon FSx for Lustre

Processing large amounts of complex data often requires a mix of different Amazon EC2 instance types. These computations also benefit from shared, high-performance, scalable storage like Amazon FSx for Lustre. One way to reduce the cost of your analysis is to use Amazon EC2 Spot Instances, which can lower EC2 costs by up to 90% compared to On-Demand Instance pricing. This post guides you through the creation of a fault-tolerant cluster using AWS ParallelCluster. We explain how to configure ParallelCluster to automatically unmount the Amazon FSx for Lustre filesystem and resubmit interrupted jobs to the queue when a Spot interruption event occurs.
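
To make the mechanics concrete: EC2 announces an upcoming Spot interruption through a two-minute notice in the instance metadata service, which a watcher process on a compute node can poll and react to. The sketch below is a simplified illustration of that idea, not ParallelCluster's actual implementation; it assumes a Slurm compute node whose Slurm node name matches its hostname, with the FSx for Lustre filesystem mounted at /fsx, and requeues jobs with scontrol.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Spot interruption watcher for a Slurm compute node.

Assumptions (not taken from the post): the node runs Slurm, its Slurm node
name matches its hostname, the FSx for Lustre filesystem is mounted at /fsx,
and jobs are requeued with scontrol. ParallelCluster's own handling differs;
this only illustrates the idea.
"""
import subprocess
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254"


def imds_token():
    # IMDSv2: obtain a short-lived session token before reading metadata.
    req = urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def interruption_pending(token):
    # The spot/instance-action document only exists (HTTP 200) once a
    # two-minute interruption notice has been issued; otherwise it 404s.
    req = urllib.request.Request(
        f"{IMDS}/latest/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2):
            return True
    except urllib.error.URLError:
        return False


def drain_node():
    # Requeue every job currently running on this node, then unmount /fsx
    # so the filesystem is released cleanly before the instance is reclaimed.
    hostname = subprocess.run(["hostname", "-s"], capture_output=True,
                              text=True, check=True).stdout.strip()
    jobs = subprocess.run(["squeue", "-h", "-w", hostname, "-o", "%A"],
                          capture_output=True, text=True).stdout.split()
    for job_id in jobs:
        subprocess.run(["scontrol", "requeue", job_id], check=False)
    subprocess.run(["umount", "-l", "/fsx"], check=False)


if __name__ == "__main__":
    while True:
        token = imds_token()
        if interruption_pending(token):
            break
        time.sleep(5)
    drain_node()
```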

Running Windows HPC Workloads using HPC Pack in AWS

This blog post shows you how to deploy an HPC cluster for Windows workloads. We have provided an AWS CloudFormation template that automates the deployment of an HPC Pack 2019 Windows cluster. This will help you get started quickly with Windows-based HPC workloads, while leveraging highly scalable, resilient, and secure AWS infrastructure. As an example, we show how to run a sample parametric sweep for EnergyPlus, an open source energy simulation tool maintained by the U.S. Department of Energy’s Building Technologies Office.

Accelerating drug discovery with Amazon EC2 Spot Instances

We have been working with a team of researchers at the Max Planck Institute, helping them adopt the AWS Cloud for drug research applications in the pharmaceutical industry. In this post, we’ll focus on how the team at Max Planck obtained thousands of EC2 Spot Instances spread across multiple AWS Regions to run their compute-intensive simulations in a cost-effective manner, and how their solution will be enhanced further using the new Spot Placement Score API.
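
For context on that last point, the Spot Placement Score API is exposed through the standard AWS SDKs. The boto3 sketch below shows the shape of a call; the instance types, target capacity, and Regions are placeholders, not the values the Max Planck team uses.

```python
import boto3

# Ask EC2 how likely a Spot request of this shape is to be fulfilled in each
# candidate Region. Instance types, capacity, and Regions are illustrative.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.get_spot_placement_scores(
    InstanceTypes=["c5.24xlarge", "c5a.24xlarge", "m5.24xlarge"],
    TargetCapacity=50000,
    TargetCapacityUnitType="vcpu",
    SingleAvailabilityZone=False,
    RegionNames=["us-east-1", "us-east-2", "eu-west-1", "ap-northeast-1"],
)

# Scores range from 1 (unlikely to get the capacity) to 10 (very likely);
# sort so the most promising Regions come first.
for score in sorted(response["SpotPlacementScores"],
                    key=lambda s: s["Score"], reverse=True):
    print(f'{score["Region"]}: {score["Score"]}')
```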

Introducing AWS HPC Connector for NICE EnginFrame

Today we’re introducing AWS HPC Connector, a new feature in NICE EnginFrame that allows customers to leverage managed HPC resources on AWS. With this release, EnginFrame provides a unified interface for administrators to make hybrid HPC resources available to their users both on-premises and within AWS. In this post, we’ll provide some context around EnginFrame’s typical use cases, and show how you can use AWS HPC Connector to stand up HPC compute resources on AWS.

How we enabled uncompressed live video with CDI over EFA

We’re going to take you into the world of broadcast video, and explain how it led to us announcing today the general availability of EFA on smaller instance sizes. For a range of applications, this is going to save customers a lot of money because they no longer need to use the biggest instances in each instance family to get HPC-style network performance. But the story of how we got there involves our Elastic Fabric Adapter (EFA), some difficult problems presented to us by customers in the entertainment industry, and an invention called the Cloud Digital Interface (CDI). And it started not very far from Hollywood.

Benchmarking the NVIDIA Clara Parabricks germline pipeline on AWS

This blog provides an overview of NVIDIA’s Clara Parabricks along with a guide on how to use Parabricks from the AWS Marketplace. It focuses on germline analysis for whole genome and whole exome applications using GPU-accelerated bwa-mem and GATK’s HaplotypeCaller.

Running a 3.2M vCPU HPC Workload on AWS with YellowDog

OMass Therapeutics, a biotechnology company identifying medicines against highly validated target ecosystems, used YellowDog on AWS to analyze and screen 337 million compounds in 7 hours, a task which would have taken two months using an on-premises HPC cluster. YellowDog, based in Bristol in the UK, ran the drug discovery application on an extremely large, multi-Region cluster in AWS with the AWS ‘pay-as-you-go’ pricing model. It provided a central, unified interface to monitor and manage AWS Region selection, compute provisioning, job allocation, and execution. The entire workload completed in 65 minutes, enabling scientists to start work on analysis the same day, significantly accelerating the drug discovery process. In this post, we’ll discuss the AWS and YellowDog services we deployed, and the mechanisms used to scale to 3.2M vCPUs in 33 minutes, using multiple EC2 instance types across multiple Regions, while running at a 95% utilization rate.

Coming soon: dedicated HPC instances and hybrid functionality

This year, we’ve launched a lot of new capabilities for HPC customers, making AWS the best place for the length and breadth of their workflows. EFA went mainstream and is now available in sixteen instance families for fast fabric capabilities for scaling MPI and NCCL codes. We’ve written deep-dive studies to explore and explain the optimizations that will drive your workloads faster in the cloud than elsewhere. We released a major new version of AWS ParallelCluster with its own API for controlling the cluster lifecycle. AWS Batch became deeply integrated into AWS Step Functions and now supports fair-share scheduling, with multiple levers to control the experience. Today we’re signaling the arrival of a new HPC-dedicated instance family – the Hpc6a – and an enhanced EnginFrame that will bring the best of the cloud and on-premises together in a single interface.

How to manage HPC jobs using a serverless API

HPC systems are traditionally accessed through a Command Line Interface (CLI), where users submit and manage their computational jobs. Depending on their experience and sophistication, the CLI can be daunting for users not accustomed to using it. Fortunately, the cloud offers many other options for users to submit and manage their computational jobs. In this blog post, we cover how to create a serverless API to interact with an HPC system in the cloud built with AWS ParallelCluster.
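
As a rough sketch of the pattern (the full walkthrough is in the post), an AWS Lambda function behind Amazon API Gateway can relay a job submission to the cluster head node using AWS Systems Manager Run Command. The head node instance ID, request format, and script path below are illustrative assumptions, not the post's exact implementation.

```python
import json
import os

import boto3

ssm = boto3.client("ssm")

# Instance ID of the ParallelCluster head node; supplied through the Lambda
# environment in this sketch (hypothetical variable name).
HEAD_NODE_ID = os.environ["HEAD_NODE_INSTANCE_ID"]


def handler(event, context):
    """Submit a Slurm job on the head node via SSM Run Command.

    Expects a JSON body like {"script": "/shared/job.sh", "nodes": 2}.
    The script path and parameters are illustrative placeholders.
    """
    body = json.loads(event.get("body") or "{}")
    script = body.get("script", "/shared/job.sh")
    nodes = int(body.get("nodes", 1))

    # Run sbatch as a shell command on the head node; AWS-RunShellScript is
    # a standard SSM command document.
    command = ssm.send_command(
        InstanceIds=[HEAD_NODE_ID],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [f"sbatch -N {nodes} {script}"]},
    )

    return {
        "statusCode": 202,
        "body": json.dumps({
            "commandId": command["Command"]["CommandId"],
            "message": "job submitted to the scheduler",
        }),
    }
```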