AWS HPC Blog

Create a Slurm cluster for semiconductor design with AWS ParallelCluster

Chips are fingernail-sized pieces of silicon that power all your electronic devices like cell phones, computers, and TVs. Chip designers use electronic design automation (EDA) tools to draw billions of microscopic switches and connect them with wires that are smaller than the finest hair.

EDA engineers measure device sizes in nanometers (1×10⁻⁹ m, or one-billionth of a meter) and time in picoseconds (1×10⁻¹² s, or one-trillionth of a second). They analyze their designs to ensure that they meet power, performance, and area (PPA) goals and that they meet rigorous design rules so that silicon foundries can manufacture them. The designers send the completed drawings to a foundry that manufactures them in automated factories that reliably create structures at atomic scale. The entire design and manufacturing process requires vast amounts of high-performance storage and high-performance computing (HPC) to run millions of EDA jobs for large teams of engineers.

Today, we’re going to show you how to create a cluster on AWS, using AWS ParallelCluster, that’s purpose-built for this environment, and ready for the kinds of workloads that EDA users work with every day.

A world of complex constraints

EDA workflows impose complex requirements on the compute cluster. A workflow may consist of hundreds of different licensed EDA tools with very different requirements. The tool licenses are typically much more expensive than the infrastructure they run on so the scheduler must ensure that jobs only run when a license and the required compute resources are available.

Large teams must share these critical resources so it’s also essential that the scheduler can enforce a fair-share allocation policy to prevent some users from monopolizing resources at the expense of others. The EDA jobs themselves have widely differing requirements, so the compute cluster must support a very diverse set of Amazon EC2 instance sizes and families from compute-optimized to high-memory variations.

A day in the life

Figure 1 shows the basic workflow for designing a chip. The front-end and back-end tasks have quite distinct requirements for compute and storage. An advantage of AWS is the diversity of compute and storage solutions available so design teams can tailor their infrastructure for the needs of each step in the workflow.

The process is also extremely iterative. If engineers find a problem in the late stages of the project, the fix may require architectural changes that require the team to rerun all the steps of the workflow. This is where design teams can get into capacity crunches – and risk catastrophic schedule delays if they can’t access enough compute and storage capacity.

Figure 1 – Chip design workflow diagram. Architects create and edit the architecture. Designers create and edit the design and create RTL and circuit diagrams. Design verification engineers run simulations to verify that the design works and gather coverage information to verify completeness of tests. Back end design engineers synthesize RTL to create netlists. Then they do physical layout that converts the netlists to layout and GDSII. Then they run physical verification and power and signal analysis to make sure the layout meets power, performance, and area requirements and that the GDSII meets all design rules. After verification is complete, the developers tape out the GDSII to a silicon foundry for manufacturing.

At the beginning of a project, compute usage is typically low and sporadic. Usage peaks around project milestones and at the end of a project when it runs hot for several months – and is usually in the critical path for project completion. Figure 2 shows the number of jobs running on different instance types in an EDA cluster over a typical 24-hour window. Utilization would likely be higher and more sustained near the end of a project.

Notice the variability and the diverse mix of instance types that the cluster uses, ranging from high-frequency general-purpose m5zn to memory-optimized and high-memory r6i and x2iezn instances. The scalability of AWS ensures that jobs can run on the instance types that are ideal for them, without infrastructure capacity constraints.

Figure 2 – The EDA job profile is highly variable. This graph shows peaks and troughs of usage throughout a 24-hour window. Engineers run jobs that use different instance types as required by the job requirements.

Enter Slurm and AWS ParallelCluster

Design teams have traditionally used commercial schedulers, but are increasingly showing interest in Slurm. Slurm is open source, free to use, and meets all the requirements for running EDA workloads. AWS ParallelCluster is an AWS-supported open-source cluster management tool that allows Slurm to automatically scale fleets of compute instances up and down on AWS. It supports high job throughput and job capacity for even the most demanding workloads. It also meets the high security demands of the semiconductor industry because it doesn't require any internet access to function.

Starting with version 3.7.0, AWS ParallelCluster has all the features required to easily configure and deploy an EDA Slurm cluster that takes full advantage of the scalability of AWS to design the most advanced chips. It offers Slurm accounting, which allows administrators to configure license sharing and enables fair-share allocation for cluster users. It can schedule jobs based on the number of cores and the amount of memory that a job requires, too. ParallelCluster also increased the number of instance types that it can support in a cluster. It added support for Red Hat Enterprise Linux (RHEL) 8, which all EDA tools will require starting in 2024. It added support for custom instance type weighting so that Slurm can schedule the lowest-cost instance type that meets a job's specific requirements. And finally, ParallelCluster added a Python management API that you can use in a Lambda layer to completely automate the deployment and updating of your cluster.
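
For example, once the accounting database is connected, fair-share allocations are typically managed with Slurm's sacctmgr command on the head node. This is a minimal sketch, assuming a hypothetical chip-design account and user:

$ sacctmgr add account chip-design Description="Chip design team"   # hypothetical account
$ sacctmgr add user alice Account=chip-design Fairshare=10          # hypothetical user and share value
$ sacctmgr show assoc format=Account,User,Fairshare                 # verify the associations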

Putting this all together

The aws-eda-slurm-cluster repository on GitHub uses all these new ParallelCluster features to quickly and easily create an EDA-specific ParallelCluster in the VPC of your choosing. The cluster supports CentOS 7, RHEL 7 and 8, x86_64 and arm64 architectures, On-Demand and Spot Instances, heterogeneous queues, and up to 50 instance types. You can easily configure EDA license counts and fair-share allocations using simple configuration files. By default, it selects the instance types that are best for EDA workloads. The cluster sets compute node weights based on the cost of instance types so that Slurm chooses the lowest-cost node that meets job requirements.

In addition to the normal ParallelCluster partitions, it defines a batch and an interactive partition that each contain all the compute nodes. The batch partition is the default, and the interactive partition is identical except that it has a much higher priority. If you need a job to run quickly, for example to debug a simulation failure, and the batch partition is full, you can jump to the head of the line by submitting your job to the interactive queue, and Slurm will schedule it using the next available license and compute node.
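
For example (with hypothetical job script names), a long regression can go to the default batch partition while a time-critical debug job goes to the interactive partition:

$ sbatch -p batch regression.sh          # waits its turn in the default partition
$ sbatch -p interactive debug_sim.sh     # scheduled ahead of pending batch jobs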

ParallelCluster and Slurm typically expect you to ssh to a login node or the head node to use the cluster, but semiconductor engineers expect to be able to use the cluster from a shell on their virtual desktops. With aws-eda-slurm-cluster, you can configure submitter hosts so that they can directly access one or more clusters. The cluster configures submitter hosts as Slurm login nodes and creates modulefiles you load to set up the shell environment to use the cluster. It also supports submitters that have a different OS or CPU architecture than the cluster.

Deployment

The aws-eda-slurm-cluster GitHub page documents the simple deployment process. The solution uses the AWS Cloud Development Kit (CDK): a custom CDK application reads a configuration file and deploys an AWS CloudFormation stack that creates and configures the cluster in less than 30 minutes.

The CloudFormation stack creates and configures your customized ParallelCluster. When you no longer need the cluster, you simply delete the CloudFormation stack and CloudFormation deletes the cluster for you. If you need to update the cluster, then update the configuration file and rerun the CDK application.

The cluster uses a YAML configuration file with a different schema than ParallelCluster's configuration file. The following basic configuration will use ParallelCluster to create a RHEL 8 Slurm cluster configured for EDA workloads. It has a few prerequisites: an AWS VPC, a subnet, an EC2 key pair, and a Slurm accounting database stack whose name you include in the configuration. If you don't already have the VPC, subnet, and key pair, the AWS documentation shows how to create them, and the ParallelCluster documentation shows how to create the Slurm accounting database stack. The Licenses section allows you to configure one or more software licenses that Slurm will track to make sure that jobs don't use more than the number of configured licenses.

StackName: eda-pc-3-7-2-rhel8-x86-config
Region: <region>
SshKeyPair: <ec2-key-pair>
VpcId: vpc-xxxx
SubnetId: subnet-xxxx
slurm:
  ClusterName: eda-pc-3-7-2-rhel8-x86
  MungeKeySecret: /slurm/munge_key
  ParallelClusterConfig:
    Version: '3.7.2'
    Image:
      Os: 'rhel8'
    Architecture: 'x86_64'
    Database:
      DatabaseStackName: <parallel-cluster-database-stack>
  SlurmCtl: {}
  InstanceConfig:
    NodeCounts:
      DefaultMaxCount: 10
Licenses:
  <license-name>:
    Count: 10
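
Once the cluster is up, jobs request the tracked licenses at submission time with Slurm's standard --licenses (-L) option. A minimal sketch, keeping the placeholder license name from the configuration above and a hypothetical job script:

$ sbatch -L <license-name>:1 run_simulation.sh    # job starts only when a license is free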

Deployment is as simple as executing the following commands from the root of the git repository.

$ source setup.sh
$ ./install.sh --config-file <config-filename> --cdk-cmd create

This creates a CloudFormation stack named eda-pc-3-7-2-rhel8-x86-config, which in turn creates and configures a ParallelCluster cluster called eda-pc-3-7-2-rhel8-x86. If you need to update the configuration, for example to add a custom compute node AMI, just edit the configuration file and run the following commands:

$ source setup.sh
$ ./install.sh --config-file <config-filename> --cdk-cmd update
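
If you want to watch the update from the command line, one option (using the standard AWS CLI rather than anything specific to the repository) is to query the stack status:

$ aws cloudformation describe-stacks \
    --stack-name eda-pc-3-7-2-rhel8-x86-config \
    --query 'Stacks[0].StackStatus' --output text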

You can use the cluster by connecting to the head node or a login node. If you would like the convenience of accessing the cluster directly from a submitter host, the output of the eda-pc-3-7-2-rhel8-x86-config stack includes commands for mounting the cluster's NFS export on submitter hosts and configuring them to use the cluster. After you configure the submitter host, you can easily use the cluster: simply load the provided modulefile and run Slurm commands, as in the following example, which opens an interactive bash shell on a compute node with 1 GB of memory and 1 CPU core for, at most, an hour.

$ module load eda-pc-3-7-2-rhel8-x86
$ srun -p interactive --mem 1G -c 1 --time 1:00:00 --pty /bin/bash

The modulefile sets up the environment for the Slurm cluster and configures Slurm defaults for the path, number of cores, default amount of memory, job timeout, and more. Users must override these defaults to get more than minimal cluster resources: by default, their jobs get only 1 core, 100 MB of memory, and a time limit of 1 hour.
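
A batch job that needs more than the minimal defaults simply overrides them at submission time. A sketch with a hypothetical synthesis script:

$ sbatch -p batch -c 8 --mem 64G --time 8:00:00 run_synthesis.sh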

Another nice feature of the cluster is that it creates custom ParallelCluster AMI build configuration files. You can use them to create AMIs with all the packages typically required by EDA tools. You can find the build configuration files in the repo at source/resources/parallel-cluster/config/<parallel-cluster-version>/<ClusterName> or on the head node or submitter host at /opt/slurm/<ClusterName>/config/build-files. The GitHub page documents the process for building custom EDA AMIs.
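
As a sketch of one way to consume a build file, you can pass it to the standard pcluster build-image command; the image ID and build file name here are hypothetical placeholders, and the GitHub page describes the full, supported process:

$ pcluster build-image --image-id eda-rhel8-x86-64 \
    --image-configuration /opt/slurm/<ClusterName>/config/build-files/<build-file>.yml \
    --region <region>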

Conclusion

Using AWS ParallelCluster and aws-eda-slurm-cluster, you can easily configure and deploy a Slurm cluster on AWS that runs your most demanding EDA workloads and takes advantage of the performance and scalability of AWS to meet your project needs.

Contact your AWS account team and schedule a meeting with our semiconductor industry specialists for more information and help getting your EDA workloads running on AWS.

Allan Carter

Allan Carter is a Principal Specialist Solutions Architect supporting the Hi-tech, Electronics, and Semiconductor industries. He has more than 36 years of experience in chip development and helps customers use AWS to optimize their development processes. His hobbies over the years have included running, racing motorcycles, and breeding and showing nationally ranked Miniature Pinschers.