Machine Learning | AWS HPC Blog

Gang scheduling pods on Amazon EKS using AWS Batch multi-node processing jobs

AWS Batch multi-node parallel jobs can now run on Amazon EKS, providing gang scheduling of pods across nodes for large-scale distributed computing such as ML model training.

Improve HPC workloads on AWS for environmental sustainability

Need to cut your carbon footprint without sacrificing productivity? Migrating HPC workloads to the cloud allowed Baker Hughes to reduce emissions by 99%! Get tips for optimizing compute, storage, and networking so you can do better.

Simulating autonomous mining operations using Robotec.ai on AWS

Big changes are underway in mining. See how the Boliden Group simulates fleets of autonomous trucks using AWS Batch to improve safety and efficiency.

Call for participation: HPC tutorial series from the HPCIC

Interested in getting hands-on experience with cutting-edge HPC tools? Check out this blog post on an upcoming virtual training series from @LLNL and @AWSCloud. Learn emerging technologies from the experts this August.

Securing HPC on AWS: implementing STIGs in AWS ParallelCluster

Want to accelerate creating compliant Amazon EC2 images? Learn how HPC users can leverage cloud-native methods for applying STIG security standards.

Large scale training with NVIDIA NeMo Megatron on AWS ParallelCluster using P5 instances

Launching distributed GPT training? See how AWS ParallelCluster sets up a fast shared filesystem, SSH keys, host files, and more between nodes. Our guide has the details for creating a Slurm-managed cluster to train NeMo Megatron at scale.

Building an AI simulation assistant with agentic workflows

Simulations provide critical insights, but running them takes specialized people, which can slow everyone down. We show how a Simulation Assistant can use LLMs and agents to start these workflows via chat so you can get results sooner.

Using machine learning to drive faster automotive design cycles

Aerospace and automotive companies are speeding up their product design using AI. In this post we’ll discuss how they’re using machine learning to shift design cycles from hours to seconds using surrogate models.

Accelerate drug discovery with NVIDIA BioNeMo Framework on Amazon EKS

This post was contributed by Doruk Ozturk and Ankur Srivastava at AWS, and Neel Patel at NVIDIA. Drug discovery is a long and expensive process. Pharmaceutical companies must sift through thousands of compound possibilities to find potential new drugs to treat diseases. This process takes multiple years and costs billions of dollars, with the […]

Optimizing MPI application performance on hpc7a by effectively using both EFA devices

Get the inside scoop on optimizing your MPI apps and configuration for AWS’s powerful new Hpc7a instances. Dual rail gives these instances huge networking potential at 300 Gb/s, if properly used. This post provides benchmarks, sample configs, and real speedup numbers to help you maximize network performance. Whether you run weather simulations, CFD, or other HPC workloads, you’ll find practical tips for your codes.
