AWS HPC Blog
Category: AWS ParallelCluster
Introducing GPU health checks in AWS ParallelCluster 3.6
AWS ParallelCluster 3.6.0 can now detect GPU failures in HPC and AI/ML tasks. Health checks run at the start of Slurm jobs and if they fail, the job is requeued on another instance. This can increase reliability and prevent wasted spend.
Elastic visualization queues with NICE DCV in AWS ParallelCluster
In this blog post we’ll show you how to create an elastic pool of visualization nodes, by combining AWS ParallelCluster with NICE DCV in a novel way.
Checkpointing HPC applications using the Spot Instance two-minute notification from Amazon EC2
In this post we show you how to create an HPC cluster and capture the two-minute warning notification from Amazon EC2 Spot, then use it to trigger a checkpoint reactively.
Install optimized software with Spack configs for AWS ParallelCluster
Today, we’re announcing the availability of Spack configs for AWS ParallelCluster. You can use these configurations to install optimized HPC applications quickly and easily on your AWS-powered HPC clusters.
Deploying Open OnDemand with AWS ParallelCluster
In this post, we describe an integration of Open OnDemand with AWS ParallelCluster so admins can provide web-based access to HPC resources beyond what they have at their site, by using the AWS cloud to add new capabilities and extend capacity.
Multiple Availability Zones now supported in AWS ParallelCluster 3.4
In AWS ParallelCluster 3.4, you can now build HPC clusters that span multiple Amazon EC2 Availability Zones. In this post, we describe how the new feature works, how to use it, and some implications for cluster design that it raises.
Leveraging Slurm Accounting in AWS ParallelCluster
Slurm accounting adds flexibility, transparency, and control to operating an HPC cluster. AWS ParallelCluster 3.3.0 can now automatically configure Slurm accounting, whether you are using your own database or Amazon Aurora.
DCV in 2022: a year in review
In this post we recap the most significant features released in DCV during 2022 that delighted our customers. Of course, we’re still not done, so expect more in 2023.
Launch self-supervised training jobs in the cloud with AWS ParallelCluster
In this post we describe the process to launch large, self-supervised training jobs using AWS ParallelCluster and Facebook’s Vision Self-Supervised Learning (VISSL) library.
Support for Instance Allocation Flexibility in AWS ParallelCluster 3.3
AWS ParallelCluster 3.3.0 now lets you define a list of Amazon EC2 instance types for resourcing a compute queue. This gives you more flexibility to optimize the cost and total time to solution of your HPC jobs, especially when capacity is limited or you’re using Spot Instances.