AWS HPC Blog

Customizing your HPC environment: building AMIs for AWS Parallel Computing Service

(It's pronounced "Ah-mee".)

In the world of HPC, one size rarely fits all. Each organization, each project, and often each researcher has unique requirements for their computing environment. This is where the flexibility of AWS Parallel Computing Service (PCS) shines, particularly in its approach to Amazon Machine Images (AMIs). When we designed PCS, we made a deliberate choice not to have one “golden image” that customers needed to adapt to.

Custom AMIs are a powerful tool for tailoring your HPC environment in AWS Parallel Computing Service. By carefully considering your requirements, leveraging automation tools, and following best practices, you can create AMIs that perfectly suit your computational needs while maintaining security and efficiency.

In this post, we’ll dive into why custom AMIs are crucial for PCS, what your options are for creating them, and why mastering this process is key to optimizing your HPC workflows.

Why custom AMIs matter in PCS

PCS takes a “bring your own AMI” approach, and for good reason. HPC environments are incredibly diverse: just a small number of variables can lead to a veritable Cambrian explosion of outcomes.

Operating system preferences – while one team might prefer Ubuntu for its similarity to their data science desktop environments, another may lean towards Red Hat or Rocky Linux for their enterprise compatibility. Other teams might want to start with existing domain-specific AMIs, such as the AWS Deep Learning AMIs or the plethora of community AMIs in the Amazon EC2 AMI catalog.

Domain-specific software – genomics researchers, aerospace engineers, and climate modelers all need different software stacks. Custom AMIs allow each group to have their preferred tools ready to go.

Performance optimization – different processor architectures (x86, Arm64, or GPUs) need specific optimizations. Custom AMIs enable you to fine-tune your environment for your chosen hardware.

Compliance and security – many organizations have specific security requirements or compliance needs that they must bake into their computing environments from the start. They also may need to rapidly update their environments when a bug or vulnerability is encountered.

By building your own AMIs, you can ensure that your PCS clusters are precisely tailored to your workloads, user base, and compliance needs, from the moment you launch an instance.

Building your custom AMI: a high-level overview

Creating a custom AMI for PCS involves several key steps:

First, you start with a base AMI, usually opting for either a preferred operating system or maybe your organization’s “golden AMI” as a starting point.

Next, you install core components like recent OS patches, kernel drivers for high-performance networking, and file system clients – EFA and Lustre are probably at the top of that list. In our team, we’re big proponents of using Amazon EFS for user home directories (it’s fast and incredibly scalable), so install the EFS utilities, too.
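As a sketch, here’s roughly what that step looks like on an Amazon Linux 2023 instance – package names and installer versions differ on other distributions, so treat this as illustrative rather than definitive:

```shell
# Illustrative install of the core HPC clients on Amazon Linux 2023.

# EFA driver and libfabric, from the AWS-published installer bundle
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
sudo ./aws-efa-installer/efa_installer.sh -y

# Lustre client, for mounting Amazon FSx for Lustre file systems
sudo dnf install -y lustre-client

# EFS utilities, for mounting Amazon EFS home directories
sudo dnf install -y amazon-efs-utils
```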

Next up, there’s an agent for PCS which allows a node to “call home” to the PCS control plane and join the cluster. This has to be installed on the AMI, as well as the PCS Slurm package.
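The PCS user guide documents the exact installer locations for each Region; the sketch below only shows the shape of that step, with placeholder paths you’d replace with the documented ones:

```shell
# Placeholder paths - consult the PCS documentation for the real
# installer bucket and object names for your Region.
REGION=us-east-1
aws s3 cp "s3://aws-pcs-repo-${REGION}/PLACEHOLDER/pcs-agent-installer.tar.gz" .
tar -xf pcs-agent-installer.tar.gz && sudo ./pcs-agent-installer/install.sh

aws s3 cp "s3://aws-pcs-repo-${REGION}/PLACEHOLDER/pcs-slurm-installer.tar.gz" .
tar -xf pcs-slurm-installer.tar.gz && sudo ./pcs-slurm-installer/install.sh
```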

There is more software that works with AWS services that you’ll likely want to install once you’re comfortable with how it works and what it offers. We depend every day on AWS Systems Manager (SSM) and Amazon CloudWatch, and think you should have a close look at them. SSM makes instances far more manageable, and can make your job of achieving a safe security posture a lot simpler. CloudWatch sends logs and metrics that help us debug our setup when something doesn’t work. There are installable agents for both SSM and CloudWatch.
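Installing the two agents is usually a one-liner on Amazon Linux (on Ubuntu the SSM agent ships as a snap instead, so adjust accordingly); this is a sketch, not a complete CloudWatch configuration:

```shell
# Amazon Linux: both agents are available as packages.
sudo dnf install -y amazon-ssm-agent amazon-cloudwatch-agent
sudo systemctl enable amazon-ssm-agent

# The CloudWatch agent needs a JSON config describing which logs and
# metrics to ship; write that file before enabling the service.
sudo systemctl enable amazon-cloudwatch-agent
```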

Now that you’ve mostly settled your baseline OS environment, it’s time to think about how your HPC software stack is deployed. This is where your AMI becomes truly custom. You have several options:

  1. Hosting your applications on a shared filesystem
  2. Enabling use of containers, via tools like Podman or Apptainer
  3. Manually installing applications and libraries in the AMI
  4. Using a package manager like Spack to install complex software environments

This part can get complicated quickly, especially if you’re building an AMI that needs to support users with diverse workloads or work styles. The good news is that these options aren’t mutually exclusive: you can use more than one of them in the same AMI if that works best for you.
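As one concrete illustration, option 2 might mean baking a container runtime into the AMI itself. On a RHEL-family distribution like Rocky Linux, that could look like the following (Apptainer is packaged in EPEL; other distributions differ):

```shell
# Rocky Linux / RHEL-family sketch: install Apptainer from EPEL.
sudo dnf install -y epel-release
sudo dnf install -y apptainer

# Users can then run containerized applications on the cluster, e.g.:
#   apptainer run docker://rockylinux:9 cat /etc/os-release
```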

Once you’ve installed the required system software and your preferred set of applications, there’s not much left to do but a) test the applications to make sure they work as you expect; and b) ask Amazon EC2 to create an AMI from your running prototype instance.
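That last step is a single API call; for example, with the AWS CLI (the instance ID here is a placeholder):

```shell
# Create an AMI from the prototype instance. By default, EC2 reboots the
# instance first so the snapshot is file-system consistent.
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "my-pcs-ami-$(date +%Y%m%d)" \
  --description "PCS-ready: EFA, Lustre and EFS clients, Slurm"
```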

The power of Spack in custom AMIs

One tool that deserves special mention is Spack, an open-source package manager for supercomputers. Spack can significantly simplify the process of installing complex HPC software stacks. That’s because:

  • It manages dependencies (including compilers) automatically, resolving version conflicts
  • It allows installation of multiple versions of the same software
  • It supports optimized builds for specific CPU architectures
  • It enables the creation of reproducible software environments

In practice, this means you can easily switch between entire collections of software (Spack environments) or between versions of specific software (e.g. OpenFOAM 8 versus OpenFOAM 10) with simple shell commands. If you’re familiar with Environment Modules, Lmod, or Conda, Spack’s command line interface will be quite recognizable. By incorporating Spack into your custom AMI, you can more easily maintain a flexible, powerful HPC environment that can evolve with your needs.
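To make that concrete, here’s a hedged sketch of what the workflow can look like – the OpenFOAM package name and versions are illustrative, so check `spack list` and `spack versions` for what’s actually available:

```shell
# Bootstrap Spack itself
git clone --depth=1 https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh

# Install two versions of the same application side by side
spack install openfoam-org@8
spack install openfoam-org@10

# Switch between them with one command, much like a module system
spack load openfoam-org@10
```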

Automating AMI creation: the next step in maturity

Manually installing software onto an instance and then creating an AMI from it is a great starting point, but automation becomes crucial as your HPC environment grows and evolves. You want to avoid monitoring a shell session and waiting around for an AMI to build, in favor of more productive work. There are some approaches you may want to consider.

Infrastructure as Code (IaC) – Tools like AWS CloudFormation and HashiCorp Terraform allow you to define your AMI-building process in code. By using a version control system like Git, you can manage changes more deliberately and audit them when needed. You don’t have to automate everything all at once. For instance, you could start by automating just the distribution of AMIs to different AWS Regions or accounts, even if the build process remains manual.

Packer – HashiCorp’s Packer is a tool that automates the creation of machine images. It allows you to build images from a single configuration file, providing a consistent approach to creating AMIs across different environments. Packer offers the ability to run scripts, install software, and validate images as part of the build process.
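A minimal Packer template for this job might look something like the following – the AMI name filter, instance type, and script name are all assumptions you’d adapt to your own environment:

```hcl
source "amazon-ebs" "pcs" {
  region        = "us-east-1"
  instance_type = "c6i.4xlarge"
  ssh_username  = "ec2-user"
  ami_name      = "my-pcs-ami-{{timestamp}}"

  source_ami_filter {
    filters     = { name = "al2023-ami-*-x86_64" }
    owners      = ["amazon"]
    most_recent = true
  }
}

build {
  sources = ["source.amazon-ebs.pcs"]

  provisioner "shell" {
    # Your install steps from earlier in this post, gathered into a script
    script = "install-hpc-stack.sh"
  }
}
```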

EC2 Image Builder – EC2 Image Builder is a fully managed AWS service that automates the creation, maintenance, validation, and distribution of AMIs. It provides pre-built pipelines for image creation and integrates with AWS services to simplify tasks like patching and AMI distribution. It supports image validation and security hardening, helping ensure that AMIs are built according to your defined specifications.
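In Image Builder, the basic building block is a component, defined in a small YAML document. Here’s a sketch (the component name and commands are illustrative):

```yaml
name: hpc-baseline
description: Install file system clients for a PCS-ready AMI
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: InstallClients
        action: ExecuteBash
        inputs:
          commands:
            - dnf install -y lustre-client amazon-efs-utils
```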

Both Packer and EC2 Image Builder automate AMI creation, but they differ in their approaches. Packer allows fine-grained customization during the build process, while EC2 Image Builder provides a service-driven pipeline with AWS integrations for maintaining and distributing AMIs.

Automation doesn’t just save time and improve reproducibility; it also helps reduce human error, which is especially important when considering application performance and security posture. People are prone to making mistakes during repetitive tasks, and those errors can introduce regressions and vulnerabilities. By casting your infrastructure designs as code, you help ensure that critical processes are handled consistently, without relying on human attention to detail.

Why it’s important that you conquer custom AMIs

Understanding and implementing custom AMIs for PCS is more than just a technical exercise – it’s a strategic choice that will save you (and your users) a lot of effort later.

With custom-tailored AMIs, your researchers can start working immediately, without wasting time setting up compilers, libraries, and apps. You can build environments to match what they’ve been using elsewhere, but with a whole lot of elasticity built in. This improves everyone’s productivity, not least that of your researchers – they’ll see it directly in their time to result.

Custom AMIs also ensure that all nodes in your cluster have identical environments, which is crucial for reproducible scientific results. They also open up the possibility of having reproducible results between clusters, which is one of the great challenges of our time.

Defining your AMIs (along with the processes to test and distribute them) as code artifacts means that when you add a new capability, improve performance, or fix something that wasn’t working, the solution is auditable and documented. That’s helpful for long-term maintenance and scientific reproducibility.

Another benefit of adopting IaC for building AMIs is that you can define the entire HPC environment this way – from identity to filesystems to instances to the codes that run on the system. Changing aspects of the HPC environment then becomes a DevOps exercise, rather than a series of manually coordinated tasks.

Finally, your HPC needs will inevitably evolve. New instance types, with improved processor architectures may come along; new applications (and versions) will be released; and, of course, software bugs and security improvements will come to light. Being able to create new AMIs (or even entire clusters) via automated pipelines means your HPC infrastructure can evolve just as rapidly.

Conclusion

Custom AMIs are a foundation upon which you can build efficient, effective clusters on AWS Parallel Computing Service. By investing time in creating well-crafted, automated AMI building processes, you’re not just solving today’s HPC challenges – you’re building a flexible, scalable foundation for tomorrow’s scientific breakthroughs.

Whether you’re just starting with manual AMI creation or working towards fully automated pipelines for your images, remember: in the world of HPC, the better tailored your environment, the faster you can move from raw data to groundbreaking insights.

As you can see, we think this is an important topic. So, keep your eyes open for new HPC Tech Shorts, as well as HPC Recipes, to help you get started.

Matt Vaughn

Matt Vaughn is a Principal Developer Advocate for HPC and scientific computing. He has a background in life sciences and building user-friendly HPC and cloud systems for long-tail users. When not in front of his laptop, he’s drawing, reading, traveling the world, or playing with the nearest dog.

Brendan Bouffler

Brendan Bouffler is the head of the Developer Relations in HPC Engineering at AWS. He’s been responsible for designing and building hundreds of HPC systems in all kinds of environments, and joined AWS when it became clear to him that cloud would become the exceptional tool the global research & engineering community needed to bring on the discoveries that would change the world for us all. He holds a degree in Physics and an interest in testing several of its laws as they apply to bicycles. This has frequently resulted in hospitalization.