ML | AWS HPC Blog

Enhancing ML workflows with AWS ParallelCluster and Amazon EC2 Capacity Blocks for ML

No more guessing if GPU capacity will be available when you launch ML jobs! EC2 Capacity Blocks for ML let you lock in GPU reservations so you can start tasks on time. Learn how to integrate Capacity Blocks into AWS ParallelCluster to optimize your workflow in our latest technical blog post.

EFA: how fixing one thing, led to an improvement for … everyone

Today, we’re diving deep into the open-source frameworks that move MPI messages around, and showing you how work we did in the Open MPI and libfabric communities led to an improvement for EFA users – and everyone else, too.

Conceptual design using generative AI and CFD simulations on AWS

In this post, we’ll show how generative AI, combined with conventional physics-based CFD, can create a rapid design process to explore new design concepts in automotive and aerospace from just a single image.

How Amazon’s Search M5 team optimizes compute resources and cost with fair-share scheduling on AWS Batch

In this post, we share how Amazon Search optimizes their use of accelerated compute resources using AWS Batch fair-share scheduling to schedule distributed deep learning workloads.

How computer vision is enabling a circular economy

In this post, we show how Reezocar uses computer vision to change the way they detect damage and price used vehicles for re-sale in secondary markets. This reduces landfill and helps achieve the goals of the circular economy.

Improving NFL player health using machine learning with AWS Batch

In this post, we’ll show you how the NFL used AWS to scale their ML workloads and produce the first comprehensive dataset of helmet impacts across multiple NFL seasons. They reduced manual labor by 90%, and the results beat human labelers in accuracy by 12%!

How to make digital technologies for the circular economy work for your business

In this post, we discuss the benefits of digital technology for the circular economy, and show how businesses can implement these technologies to get the most out of them for the wellbeing of everyone.

Streamlining distributed ML workflow orchestration using Covalent with AWS Batch

Complicated multi-step workflows can be challenging to deploy, especially when using a variety of high-compute resources. Covalent is an open-source orchestration tool that streamlines the deployment of distributed workloads on AWS resources. In this post, we outline key concepts in Covalent and develop a machine learning workflow for AWS Batch in just a handful of steps.

Introducing GPU health checks in AWS ParallelCluster 3.6

AWS ParallelCluster 3.6.0 can now detect GPU failures in HPC and AI/ML tasks. Health checks run at the start of Slurm jobs, and if a check fails, the job is requeued on another instance. This can increase reliability and prevent wasted spend.

Second generation EFA: improving HPC and ML application performance in the cloud

Since launch, EFA has seen continuous improvements in performance. In this post, we talk about our 2nd generation of EFA, which takes another step in improving Machine Learning and High Performance Computing in the Cloud.
