AWS HPC Blog

Tag: HPC

Figure 1. Architecture of Slurm and user workflows, demonstrating two methods of interacting with Slurm. In the first method, the user accesses the head node via SSH and runs Slurm commands such as sinfo, squeue, sbatch, and scontrol. In the second method, the user issues REST API calls over HTTP to slurmrestd.

Using the Slurm REST API to integrate with distributed architectures on AWS

The Slurm Workload Manager by SchedMD is a popular HPC scheduler and is supported by AWS ParallelCluster, an elastic HPC cluster management service offered by AWS. Traditional HPC workflows involve logging into a head node and running shell commands to submit jobs to a scheduler and check job status. Modern distributed systems often use representational […]
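To give a flavor of the second interaction method in Figure 1, here's a minimal sketch of submitting and listing jobs through slurmrestd with Python's requests library. The endpoint version (v0.0.36), port, partition name, and token value are placeholders for illustration; check the slurmrestd version on your cluster for the exact payload schema.

```python
import requests

# Assumptions for illustration: slurmrestd listens on the head node at
# port 8082, and a JWT was already obtained (e.g. via `scontrol token`).
SLURMRESTD_URL = "http://head-node:8082"
USER = "ec2-user"
TOKEN = "eyJhbGciOi..."  # placeholder JWT

# Identity headers that slurmrestd expects on every request.
headers = {
    "X-SLURM-USER-NAME": USER,
    "X-SLURM-USER-TOKEN": TOKEN,
}

# Submit a job: a batch script plus a job description, mirroring sbatch options.
payload = {
    "script": "#!/bin/bash\nsrun hostname",
    "job": {
        "name": "rest-api-test",
        "partition": "queue1",  # hypothetical partition name
        "current_working_directory": "/home/ec2-user",
        "environment": {"PATH": "/usr/bin:/bin"},
    },
}

resp = requests.post(
    f"{SLURMRESTD_URL}/slurm/v0.0.36/job/submit",  # version depends on your Slurm
    headers=headers,
    json=payload,
)
resp.raise_for_status()
print("Submitted job:", resp.json().get("job_id"))

# List jobs: the REST analogue of squeue.
jobs = requests.get(f"{SLURMRESTD_URL}/slurm/v0.0.36/jobs", headers=headers)
print(jobs.json())
```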

Deep dive into the AWS ParallelCluster 3 configuration file

In September, we announced the release of AWS ParallelCluster 3, a major release with lots of changes and new features. To help you get started migrating your clusters, we provided the Moving from AWS ParallelCluster 2.x to 3.x guide. We know moving versions can be quite an undertaking, so we're augmenting that official documentation with additional color and context on a few key areas. In this blog post, we'll focus on the configuration file format changes in ParallelCluster 3, and how they map back to the equivalent configuration sections in ParallelCluster 2.
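To give a sense of the change before you read the full post: ParallelCluster 2 used an INI-style file, while ParallelCluster 3 uses YAML. The sketch below is a deliberately minimal illustration of that mapping using Python's configparser and PyYAML; it covers only a handful of keys and is not a complete converter, so treat it as an approximation and rely on the migration guide for the authoritative schema.

```python
import configparser  # ParallelCluster 2.x used an INI-style config
import yaml          # ParallelCluster 3 configs are YAML (pip install pyyaml)

PC2_CONFIG = """
[cluster default]
key_name = my-key
base_os = alinux2
scheduler = slurm
master_instance_type = c5.xlarge
compute_instance_type = c5.xlarge
max_queue_size = 10
"""

pc2 = configparser.ConfigParser()
pc2.read_string(PC2_CONFIG)
cluster = pc2["cluster default"]

# Rough mapping of a few PC2 keys onto the PC3 YAML structure.
# The queue and compute resource names are made up for illustration.
pc3 = {
    "Image": {"Os": cluster["base_os"]},
    "HeadNode": {
        "InstanceType": cluster["master_instance_type"],
        "Ssh": {"KeyName": cluster["key_name"]},
    },
    "Scheduling": {
        "Scheduler": cluster["scheduler"],
        "SlurmQueues": [{
            "Name": "queue1",
            "ComputeResources": [{
                "Name": "compute",
                "InstanceType": cluster["compute_instance_type"],
                "MaxCount": int(cluster["max_queue_size"]),
            }],
        }],
    },
}

print(yaml.safe_dump(pc3, sort_keys=False))
```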

Figure 1: High-level architecture of the file system.

Scaling a read-intensive, low-latency file system to 10M+ IOPS

Many shared file systems are used to support read-intensive applications, like financial backtesting. These applications typically work on copies of datasets whose authoritative version resides somewhere else. For small datasets, in-memory databases and caching techniques can yield impressive results. However, low-latency, flash-based, scalable shared file systems can provide both massive IOPS and bandwidth. They're also easy to adopt because they use a file-level abstraction. In this post, I'll share how to easily create and scale a shared, distributed, POSIX-compatible file system that performs at local NVMe speeds for files opened read-only.

Running 20k simulations in 3 days to accelerate early stage drug discovery with AWS Batch

In this blog post, we'll describe an ensemble run of 20,000 simulations to accelerate the drug discovery process, while also optimizing for run time and cost. We used two popular open-source packages — GROMACS, which performs molecular dynamics simulations, and pmx, a free-energy calculation package from the Computational Biomolecular Dynamics Group at the Max Planck Institute in Germany.
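An ensemble like this maps naturally onto AWS Batch array jobs, where each child job runs one simulation. Here's a minimal boto3 sketch under assumed names (the job queue, job definition, and manifest-driven indexing are placeholders; the pipeline in the post involves GROMACS- and pmx-specific staging not shown here). Note that a single Batch array job is capped at 10,000 children, so 20,000 simulations need at least two submissions.

```python
import boto3

batch = boto3.client("batch")

# Placeholder names: the job queue and job definition are assumed to exist,
# backed by a container image that bundles GROMACS and pmx.
JOB_QUEUE = "md-spot-queue"       # hypothetical
JOB_DEFINITION = "gromacs-pmx:1"  # hypothetical
TOTAL_SIMULATIONS = 20_000
ARRAY_LIMIT = 10_000  # AWS Batch caps one array job at 10,000 children

for chunk, offset in enumerate(range(0, TOTAL_SIMULATIONS, ARRAY_LIMIT)):
    size = min(ARRAY_LIMIT, TOTAL_SIMULATIONS - offset)
    response = batch.submit_job(
        jobName=f"fep-ensemble-{chunk}",
        jobQueue=JOB_QUEUE,
        jobDefinition=JOB_DEFINITION,
        arrayProperties={"size": size},  # one child job per simulation
        containerOverrides={
            # Each child combines AWS_BATCH_JOB_ARRAY_INDEX (0..size-1)
            # with this offset to pick its ligand/replica from a manifest.
            "environment": [{"name": "SIM_OFFSET", "value": str(offset)}],
        },
    )
    print("Submitted array job:", response["jobId"])
```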

The Convergent Evolution of Grid Computing in Financial Services

The Financial Services industry makes significant use of high performance computing (HPC), but it tends to be in the form of loosely coupled, embarrassingly parallel workloads to support risk modeling. The infrastructure tends to scale out to meet ever-increasing demand as the analyses look at more, and finer-grained, data. At AWS, we've helped many customers tackle these scaling challenges and are noticing some common themes. In this post, we describe how HPC teams are thinking about how they deliver compute capacity today, and highlight how we see the solutions converging in the future.

Putting bitrates into perspective

Recently, we talked about the advances NICE DCV has made to push pixels from cloud-hosted desktops or applications over the internet even more efficiently than before. Since we published that post on this blog channel, we've been asked by several customers whether all this efficient pixel-pushing could lead to outbound data charges moving up on their AWS bill. We decided to try it on your behalf, and share the details with you in this post. The bottom line? The charges are unlikely to be significant unless you're doing intensive streaming (such as gaming), and other cost optimizations (like EC2 Instance Savings Plans) will likely have more impact on your bill.

Figure 4: Relative price-to-performance ratio ($USD/ns) while scaling the simulation across single- and multi-GPU instances, compared to CPU (EFA-enabled) price-to-performance as the baseline.

Running GROMACS on GPU instances: multi-node price-performance

This three-part series of posts covers the price-performance characteristics of running GROMACS on Amazon Elastic Compute Cloud (Amazon EC2) GPU instances. Part 1 covered some background on GROMACS and how it utilizes GPUs for acceleration. Part 2 covered the price performance of GROMACS on a particular GPU instance family running on a single instance. […]

Figure 4: Performance scaling as a function of CPU core count while the number of GPUs remains constant.

Running GROMACS on GPU instances: single-node price-performance

This three-part series of posts covers the price-performance characteristics of running GROMACS on Amazon Elastic Compute Cloud (Amazon EC2) GPU instances. Part 1 covered some background on GROMACS and how it utilizes GPUs for acceleration. This post (Part 2) covers the price performance of GROMACS on a particular GPU instance family running on a […]
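As a concrete reminder of what a GPU run looks like at the command level, here's a minimal sketch that launches mdrun with explicit offload flags from Python. The input file name and thread counts are placeholders to illustrate the flags, not the tuned settings from the benchmarks in this series.

```python
import subprocess

# Placeholder input: a prepared portable run file (.tpr).
tpr_file = "benchmark.tpr"

# mdrun flags: -nb/-pme/-bonded select GPU offload for those interaction
# types; -ntmpi/-ntomp set thread-MPI ranks and OpenMP threads per rank.
# The counts below are illustrative, not tuned values.
cmd = [
    "gmx", "mdrun",
    "-s", tpr_file,
    "-nb", "gpu",      # non-bonded interactions on the GPU
    "-pme", "gpu",     # PME long-range electrostatics on the GPU
    "-bonded", "gpu",  # bonded forces on the GPU
    "-ntmpi", "1",     # one thread-MPI rank
    "-ntomp", "8",     # eight OpenMP threads
]
subprocess.run(cmd, check=True)
```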

Figure 2: Work distribution across CPU and GPU for a single simulation timestep.

Running GROMACS on GPU instances

Comparing the performance of real applications across different Amazon Elastic Compute Cloud (Amazon EC2) instance types is the best way we've found to identify optimal configurations for HPC applications here at AWS. Previously, we wrote about price-performance optimizations for GROMACS that showed how the GROMACS molecular dynamics simulation runs on single instances, and how it […]

AWS Batch Dos and Don’ts: Best Practices in a Nutshell

AWS Batch is a service that enables scientists and engineers to run computational workloads at virtually any scale without requiring them to manage a complex architecture. In this blog post, we share a set of best practices and practical guidance drawn from our experience working with customers on running and optimizing their computational workloads. Readers will learn how to optimize their costs with Amazon EC2 Spot on AWS Batch, how to troubleshoot their architecture should an issue arise, and how to tune their architecture and container layout to run at scale.
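As one concrete instance of the Spot guidance, a managed compute environment can combine several instance types with the SPOT_CAPACITY_OPTIMIZED allocation strategy so AWS Batch can choose the deepest capacity pools. The sketch below uses placeholder subnets, security groups, and IAM ARNs; substitute resources from your own account.

```python
import boto3

batch = boto3.client("batch")

# Placeholder ARNs/IDs: replace with resources from your own account.
batch.create_compute_environment(
    computeEnvironmentName="spot-ce",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",
        # Pick pools by spare-capacity depth to reduce interruptions.
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,  # scale to zero when idle
        "maxvCpus": 4096,
        # Diversify instance types so Batch can shop across Spot pools.
        "instanceTypes": ["c5.4xlarge", "c5a.4xlarge", "m5.4xlarge"],
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```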