AWS HPC Blog

How Aionics accelerates chemical formulation and discovery with AWS Parallel Computing Service

This post was contributed by Mohamed K. Elshazly, PhD, Kareem Abdol-Hamid, Sam Bydlon, PhD, Aarabhi Achanta, and Mark Azadpour

The decarbonization of our modern economy depends on solving a defining scientific challenge: developing batteries that are both safe and high performing. From electrical grids to vehicles and aviation, these energy storage devices must provide power stability, range, reliability, and safety—attributes that often involve challenging tradeoffs. Nowhere is this balancing act more critical than in safety-sensitive applications like aviation, where the flammable nature of industry-standard battery electrolytes creates a significant barrier to broader electrification.

Aionics is a Palo Alto-based company founded in 2020 that combines artificial intelligence (AI), quantum simulation, and proprietary data to develop custom, high-performance materials for mission-critical applications. Working with leading companies in automotive, aerospace, defense, and energy, Aionics designs drop-in chemical solutions that improve performance, safety, and sustainability.

In this post, we detail how Aionics designs and deploys flexible, performant HPC infrastructure for computational chemistry with AWS Parallel Computing Service (AWS PCS). We’ll emphasize two cases that illustrate this: (1) plane-wave Density Functional Theory (DFT) simulations with MPI parallelization and (2) GPU-accelerated molecular dynamics with Machine-Learned Interatomic Potentials (MLIPs).

Brute force doesn’t work here

Formulating a non-flammable electrolyte without compromising performance is akin to finding a needle in a haystack. To approach this challenge, researchers draw from commercial catalogs, collections of chemical compounds available for purchase from suppliers. Even a small catalog of 50 million molecules, a negligible fraction of all possible compounds, would lead to more than a quadrillion (10^15) two-molecule formulation candidates. The combinatorial explosion becomes even more dramatic with typical electrolyte formulations, which often contain 4 or 5 different molecules.
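
To see the scale of that number, count the unordered pairs that can be drawn from a 50-million-molecule catalog:

50,000,000 × 49,999,999 / 2 ≈ 1.25 × 10^15 two-molecule candidates

Each additional component multiplies the count by another factor of roughly ten million, which is why 4- and 5-component formulations push the search space far beyond anything that can be enumerated.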

Brute forcing this space is not a viable option. Instead, digital formulation must be targeted through a variety of methods: mechanistic understanding, simulations, experiments, cheminformatics descriptors, and AI models. Such a diverse toolbox requires an equally flexible HPC infrastructure that can produce the most data points per resource-hour at the lowest cost.

Since 2022, Aionics has been running HPC clusters on AWS for development and production workloads, including clusters built with AWS ParallelCluster, an open-source cluster management tool. In August 2025, Aionics migrated to AWS Parallel Computing Service (AWS PCS), a managed service that simplifies cluster operations with managed updates and built-in observability features. The migration lets their team work in a familiar Slurm environment while focusing on research instead of infrastructure maintenance.

Design and deployment

A complete HPC infrastructure for Aionics’ computational chemistry workloads needs to support jobs that exploit GPU acceleration (e.g., PyTorch models, atomic-orbital DFT with GPU4PySCF) as well as those that typically benefit from CPU parallelism (e.g., plane-wave DFT with Quantum ESPRESSO, embarrassingly parallel descriptor generation). To meet these requirements, Aionics deployed two separate AWS PCS clusters:

  1. CPU cluster: This cluster has a login node and two Slurm queues that dispatch jobs to Amazon Elastic Compute Cloud (Amazon EC2) compute nodes designed for tightly coupled HPC workloads. To benchmark performance and understand cost-efficiency tradeoffs, one queue uses hpc6a.48xlarge instances, while the other uses the latest generation hpc7a.48xlarge instances.
  2. GPU cluster: This cluster has a login node and two Slurm queues that dispatch jobs to compute nodes for different workload scales. One queue uses g6e.16xlarge instances (2 NVIDIA L40S GPUs) for smaller workloads, while the other uses g6e.48xlarge instances (8 NVIDIA L40S GPUs) for larger workloads.

Figure 1 shows the general architecture for a single AWS PCS cluster. Both clusters follow this same pattern. Each cluster includes a static login node (a small, persistent Amazon EC2 instance) where users submit jobs. AWS PCS manages the Slurm scheduler in a service-owned account, which coordinates job scheduling across compute nodes in your AWS account. These compute nodes automatically scale up when jobs are submitted and scale down when idle. All instances connect to shared storage via Amazon Elastic File System (Amazon EFS) mounted to the /home directory.

Figure 1: General architecture diagram of a single AWS PCS cluster. Both the CPU and GPU clusters follow this architecture. Users access the cluster through a login node. The managed Slurm scheduler coordinates jobs across Amazon EC2 compute instances that automatically scale based on workload demand. All nodes share a common file system via Amazon EFS.
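
In day-to-day use, this architecture presents itself as an ordinary Slurm environment. A minimal session from the login node might look like the following; the partition and script names here are placeholders rather than Aionics’ actual configuration:

# List the cluster's partitions (queues) and their node states
sinfo

# Submit a batch job; AWS PCS launches EC2 compute nodes to satisfy the
# request and scales them back down once the queue drains
sbatch --partition=hpc7a my_job.sbatch

# Check job status; inputs and outputs live on the shared EFS-backed /home
squeue --me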

Deployment involves setting up standard AWS infrastructure (VPC, security groups, IAM roles), creating a PCS-compatible AMI and Amazon EFS volume, configuring launch templates, and building the cluster with its queues. The AWS PCS documentation provides detailed guidance for each step.
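
To give a flavor of that last step, the sketch below shows the general shape of the AWS CLI calls involved. Treat it as an approximate outline: the names, IDs, and Slurm version are placeholders, a compute node group must be created (with create-compute-node-group) between the two calls shown, and the AWS PCS documentation remains the authoritative reference for the exact arguments.

# Create a cluster with a managed Slurm controller (placeholder IDs throughout)
aws pcs create-cluster \
    --cluster-name demo-cpu \
    --scheduler type=SLURM,version=24.05 \
    --size SMALL \
    --networking subnetIds=subnet-0123456789abcdef0,securityGroupIds=sg-0123456789abcdef0

# After creating a compute node group for each instance type, attach each
# group to its own Slurm queue
aws pcs create-queue \
    --cluster-identifier demo-cpu \
    --queue-name hpc7a \
    --compute-node-group-configurations computeNodeGroupId=EXAMPLE-NODE-GROUP-ID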

Benchmarks and performance insights

Aionics configured each cluster with multiple queues, each backed by a single instance type. This homogeneous queue design follows HPC best practices for tightly coupled workloads and provides flexibility to optimize for different objectives. Computational chemistry workloads often require balancing competing priorities: some jobs prioritize faster completion time, while others prioritize lower cost. Having separate queues for different instance types allows users to select the most appropriate resources for each workload.

This section focuses on benchmarks from the CPU cluster. As an example, let us consider a performance benchmark for plane-wave DFT using Quantum ESPRESSO on the two HPC CPU queues. Figure 2 shows the wall time and cost (based on August 2025 on-demand pricing) for a static PWscf simulation of a 178-atom surface + molecule system containing four atomic species. Both hpc6a.48xlarge and hpc7a.48xlarge instances have 96 cores, providing an apples-to-apples comparison of the two generations. Note that while the hpc7a family also offers a larger hpc7a.96xlarge instance with 192 cores, this benchmark focuses on the 96-core comparison.

Figure 2: Wall time and cost (based on August 2025 on-demand pricing) for a static PWscf simulation of a 178-atom, 4-atomic species surface + molecule system. Both instance types have 96 cores, providing an equal-core comparison of the two generations. The figure highlights the wall time/cost tradeoff between the two compute instance types available in the HPC CPU cluster.

Figure 2 demonstrates the performance and cost tradeoffs between the two instance generations:

  1. The hpc7a.48xlarge instance delivers 40% faster performance per node compared to hpc6a.48xlarge at 40% higher cost, reflecting the improved per-core performance of the latest generation.
  2. Both instance types show similar strong scaling behavior (1.6x speedup from 1 to 2 nodes), which is excellent for DFT workloads that contain serial bottlenecks in the SCF loop.
  3. The hpc7a.48xlarge instance provides 768 GiB of memory compared to 384 GiB on hpc6a.48xlarge (both with 96 cores), offering 8 GiB per core versus 4 GiB per core. This additional memory capacity allows larger systems to run on a single node, improving job throughput.

The takeaway is that the two queues are complementary. The hpc6a.48xlarge queue is the more cost-effective option for smaller jobs or when cost is the primary constraint. The hpc7a.48xlarge queue has a clear advantage when systems are large enough to be memory-bound or when a batch of simulation jobs is more wall-time-bound than cost-bound.
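
As a concrete illustration, a Slurm batch script for a run like the one benchmarked above might look like the sketch below. The partition names and input file are placeholders standing in for Aionics’ actual queue names and production inputs:

#!/bin/bash
#SBATCH --job-name=pwscf-surface
#SBATCH --partition=hpc7a          # or hpc6a when cost matters more than wall time
#SBATCH --nodes=2                  # the benchmark above used 1 and 2 nodes
#SBATCH --ntasks-per-node=96       # both instance types expose 96 cores
#SBATCH --exclusive

# Load Quantum ESPRESSO together with its MPI and math library dependencies
module load QuantumESPRESSO/7.2-foss-2023a

# Run the plane-wave SCF calculation across all MPI ranks
srun pw.x -in surface_molecule.scf.in > surface_molecule.scf.out

Because command-line options override the directives in the script, the same file can be sent to the other queue with sbatch --partition=hpc6a, without editing it.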

For the GPU cluster, having separate queues for different instance sizes provides cost optimization opportunities. Consider a molecular dynamics trajectory consisting of 60,000 inference steps using Orbital Materials’ OMol25 conservative infinite-neighbour MLIP. For a batch of 100 small-to-medium molecule systems with periodic boundary conditions, such a trajectory requires an average of 45 GPU-hrs per system. The total cost to run one trajectory to completion averages $340 per system on the smaller instance queue (g6e.16xlarge with 2 GPUs per instance), while the same trajectory costs only $169 per system on the larger instance queue (g6e.48xlarge with 8 GPUs per instance). The larger instances provide better cost efficiency per GPU-hour, making them more economical for large-scale workloads despite their higher hourly price.
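
A hedged sketch of how one such batch might be driven on the GPU cluster is shown below, using a Slurm job array with one task per molecular system. The partition name, the GPU count per task, and the run_md.py driver are illustrative assumptions, not Aionics’ actual setup:

#!/bin/bash
#SBATCH --job-name=mlip-md
#SBATCH --partition=g6e-48xl       # hypothetical name for the 8-GPU instance queue
#SBATCH --array=0-99               # one array task per system in the batch of 100
#SBATCH --gpus=1                   # assumes the nodes expose their GPUs to Slurm
#SBATCH --cpus-per-task=8

# Hypothetical Python environment on the shared /home with the MLIP installed
source ~/envs/mlip/bin/activate

# Hypothetical driver: runs one 60,000-step MD trajectory for the system
# selected by the array task ID
python run_md.py --system-index ${SLURM_ARRAY_TASK_ID} --steps 60000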

Recommendations for managing the software stack

Computational chemistry requires managing complex software packages written in Fortran and C++, in addition to Python packages like ASE and PyTorch. The PCS-compatible Ubuntu 22.04 AMI includes Environment Modules, which simplifies dependency management for compiled packages.

One effective approach is using EasyBuild, an HPC package management framework, to compile and manage software modules. For example, compiling Quantum ESPRESSO as a module requires a single command:

eb QuantumESPRESSO-7.2-foss-2023a.eb --robot --modules-tool EnvironmentModules --module-syntax Tcl

This command compiles Quantum ESPRESSO against a free and open source (FOSS) toolchain that includes OpenMPI, FFTW, ScaLAPACK, and ELPA. Once compiled, loading the module is straightforward:

module load QuantumESPRESSO/7.2-foss-2023a

This automatically loads all of its dependencies, including the FOSS toolchain components, making the software immediately available. This approach provides several benefits: optimized performance through proper compilation flags, automated dependency management, and the flexibility to maintain multiple versions of the same package compiled against different toolchains without environment conflicts.
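
Environment Modules also makes it easy to see which builds are installed and what a loaded module pulled in, which helps when several builds coexist. For example (the output naturally depends on what has been built locally):

# List every installed build of Quantum ESPRESSO (for example, different toolchains)
module avail QuantumESPRESSO

# Show everything currently loaded, including toolchain dependencies
module list

# Unload everything before switching to a module built against another toolchain
module purge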

Conclusion and outlook

Computational chemistry requires a diverse stack of methods and software packages, from AI models to computationally demanding simulations. AWS Parallel Computing Service (AWS PCS) provides the flexibility and performance needed to build HPC infrastructure that meets these varied requirements. As demonstrated by Aionics’ deployment, organizations can optimize for different objectives—whether prioritizing cost efficiency or faster time to results—by configuring multiple queues with different instance types. Aionics’ successful migration to AWS PCS has enabled their team to focus on advancing battery electrolyte research rather than managing infrastructure, while maintaining the flexibility to scale resources based on workload demands.

Looking ahead, computational chemistry workloads increasingly rely on large-scale GPU-accelerated ML training and inference—from training machine-learned interatomic potentials to running high-throughput screening campaigns. For organizations planning these resource-intensive campaigns, the recent integration of Amazon EC2 Capacity Blocks for ML in AWS PCS enables reserving GPU instances up to 8 weeks in advance with guaranteed capacity and discounted rates. This capability addresses a key planning challenge: ensuring GPU availability for time-sensitive research milestones without maintaining long-term reservations during idle periods. Combined with PCS’s managed infrastructure and flexible resource allocation, this positions AWS PCS as a comprehensive platform for accelerating computational chemistry research and development.

Moe Elshazly

Mohamed (Moe) Elshazly is a Senior Computational Materials R&D Engineer and Computational Research Lead at Aionics, Inc. He holds a PhD in Electrical & Computer Engineering from the University of Toronto. His research interests cover the intersecting fields of energy materials, quantum chemistry, cheminformatics, data science, and high-performance scientific computing.

Aarabhi Achanta

Aarabhi Achanta is a Specialist Solutions Architect at AWS, focusing on Advanced Compute and Emerging Technologies. She holds a triple degree in physics, astrophysics, and applied mathematics from UC Berkeley. Her aerospace industry background includes expertise in physics simulation development and manufacturing operations optimization. In her free time, she loves to dance, go backpacking, and read!

Mark Azadpour

Mark Azadpour is a Senior GTM Specialist for AWS Batch. He is focused on driving the AWS Batch go-to-market strategy and initiatives and advocates the use of containers in HPC on behalf of customers. He has worked in the Enterprise space for more than 18 years.

Kareem Abdol-Hamid

Kareem is a Senior Accelerated Compute Specialist for Startups. As an Accelerated Compute specialist, Kareem experiences novel challenges every day involving generative AI, High Performance Compute, and massively scaled workloads. In his free time, he plays piano and competes in the video game Street Fighter.

Sam Bydlon

Dr. Sam Bydlon is a Specialist Solutions Architect with the Advanced Compute, Emerging Technologies team at AWS. Sam received his Ph.D. in Geophysics from Stanford University in 2018 and has 12 years of experience developing and utilizing simulations in research science and financial services. In his spare time, Sam enjoys camping, waterfalls, and camping near waterfalls.