Welcome to the AWS HPC Blog

This post is written by Deepak Singh, Vice President of Compute Services.

At AWS, we love working with customers to solve their toughest challenges. High performance computing (HPC) is one of those challenges, one that pushes against the boundaries of AWS performance at scale. HPC is also a personal interest of mine, as I came to know cloud computing while trying to solve problems in computational chemistry. If your system can perform well for a variety of compute- and data-intensive workloads like computational fluid dynamics, weather modeling, machine learning, and genomics, you can handle most computational problems.

The first product I was involved with at AWS was the launch of the first generation of compute-optimized instances, the CC1. We started with the goal of providing customers with an “HPC cluster” that would be on par with what they could build on-premises. To deliver this, we needed to let customers provision clusters with non-blocking networking and the ability to leverage CPU topology, two things you couldn’t do on AWS at the time. To ship the CC1 instances, we had to design a new network and ship an instance where customers knew which of the host’s processors they were running on. We exposed CPU topology and made certain processor flags available to customers for the first time. Within a couple of years, all of EC2 ran on this new network architecture and all instances exposed CPU architecture and topology (with the exception of T instances).
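For readers curious what “exposing CPU topology” looks like from inside an instance, here is a minimal sketch (our illustration, not something from the original launch) that reads the socket and core layout a Linux guest sees through standard sysfs paths. Tools like lscpu and hwloc report the same information.

```python
# Minimal sketch: print the socket/core placement of each logical CPU
# as exposed to a Linux guest via standard sysfs paths.
import glob

def read(path):
    with open(path) as f:
        return f.read().strip()

cpus = glob.glob("/sys/devices/system/cpu/cpu[0-9]*")
for cpu in sorted(cpus, key=lambda p: int(p.rsplit("cpu", 1)[-1])):
    topo = f"{cpu}/topology"
    name = cpu.rsplit("/", 1)[-1]
    # physical_package_id identifies the socket; core_id the physical
    # core within it, so hyperthread siblings share both values.
    print(f"{name}: socket {read(topo + '/physical_package_id')}, "
          f"core {read(topo + '/core_id')}")
```

This is exactly the kind of placement information HPC codes use to pin processes and threads, and it is what CC1-era instances made visible to guests for the first time.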

Making our network faster, with higher throughput and more consistent latency, to enable tightly coupled parallel jobs benefits all of our customers. Over time we introduced more vertically scaled instances, such as instances with multiple high-end GPUs, terabytes (and terabytes) of memory, and local NVMe storage, that work equally well for traditional HPC applications and for other domains like machine learning and data warehouse queries. We also think that introducing new technologies, like the Nitro System or the AWS Graviton2 Arm-based processor, can radically change how our customers leverage AWS for HPC workloads. As a concrete example, Nitro opened the door for us to build Elastic Fabric Adapter (EFA), our high-performance interconnect, which in turn unlocked scaling for many HPC codes (including some that surprised us). That better scaling with EFA led to UltraClusters of P4d instances that can tackle extremely large machine learning problems. As you can see, addressing challenges for HPC creates a flywheel that improves all of our services, which in turn allows us to tackle even more challenges in HPC.
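To make the EFA point concrete, here is a minimal sketch of the kind of tightly coupled communication pattern whose scaling hinges on interconnect latency. It assumes mpi4py, which the post does not mention; any MPI code follows the same shape, and an MPI library built against libfabric can typically use EFA without code changes.

```python
# Minimal sketch of a tightly coupled MPI pattern (assumes mpi4py).
from mpi4py import MPI

comm = MPI.COMM_WORLD      # all ranks in the job
rank = comm.Get_rank()     # this process's rank
size = comm.Get_size()     # total number of ranks

# A collective like allreduce synchronizes every rank on every call,
# so at scale the interconnect's latency dominates time-to-solution.
total = comm.allreduce(rank, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, sum of ranks = {total}")
```

Launched with, for example, `mpirun -n 4 python allreduce_demo.py`, this runs anywhere MPI does; on EFA-enabled instance types the same code benefits from the faster fabric transparently.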

You can get a sense of the progress AWS has made over the years in tackling HPC workloads from the “HPC on AWS” session at re:Invent 2020, given by Ian Colle, our General Manager for HPC services.

We hear from our customers that they want more guidance on how to think about HPC in the cloud. Specifically, they ask us how they can integrate their on-premises systems and applications with AWS. They have questions about best practices for hybrid architectures, enabling workflow portability, and whether cloud-native technologies like serverless can work alongside or augment current HPC systems. There are also more philosophical topics up for debate in the HPC community: Is it best to optimize for human productivity or for resource efficiency, and how do you balance the two? What is the shared responsibility across users, developers, and operators of HPC infrastructure when environments span on-premises data centers and the cloud? These are some of the big questions we plan to share our perspective on in future posts.

We will use this blog, in addition to our HPC Tech Shorts video series and upcoming HPC-focused workshops, to highlight what we think are the best practices for HPC on AWS, and to showcase interesting solutions from the broader community of customers and partners, and from the amazing open-source groups we have the privilege of working with.

So welcome! Stay tuned for some exciting posts, and let us know what you think.

Angel Pizarro

Angel is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high-throughput life science domains.

Deepak Singh

Deepak has been with Amazon Web Services since 2008, and currently leads container services, Linux, and HPC. Prior to starting the container service team, Deepak led product management for the Amazon EC2 instance platform.