Scaling EDA Workloads using Scale-Out Computing on AWS

Introduction

Semiconductor and electronics companies using electronic design automation (EDA) applications can significantly accelerate their product development lifecycle and time to market by taking advantage of the near infinite compute, storage, and other resources available on AWS. In this workload enablement blog post, I provide architectural and system-level guidance to build out an environment capable of scaling EDA applications to 30,000 cores or more.

EDA workloads often require a compute cluster, a process scheduler that orchestrates job distribution to compute nodes, and a high-performance shared file system. The shared file system is typically required to sustain throughput requirements that vary anywhere from 500 MB/sec to 10 GB/sec depending on the EDA workload use case, design size, and total number of cores. Other important components of the EDA infrastructure stack include license management, remote desktops and visualizations, as well as user management including identity and access controls, budgeting, and monitoring.

Service and solution overview

To help me quickly launch an EDA environment on AWS, I use the official AWS Solution Scale-Out Computing on AWS. This Solution leverages many AWS services, including Amazon Elastic Compute Cloud (EC2 Spot and EC2 On-Demand Instances), Amazon Simple Storage Service (S3), Amazon Elastic File System (EFS), and Amazon FSx for Lustre.

Scale-Out Computing on AWS
Scale-Out Computing on AWS is an official AWS Solution that helps customers more easily deploy and operate a multiuser environment for computationally intensive workflows. The Solution deploys a cluster, provides automated cluster provisioning orchestration, and features a large selection of compute resources, a fast network backbone, nearly unlimited storage, and budget and cost management directly integrated with AWS.

Amazon Elastic Compute Cloud (Amazon EC2)
Amazon EC2 is a web service that provides secure, resizable compute capacity in the cloud. Amazon EC2 offers the broadest and deepest choice of instances, built on the latest compute, storage, and networking technologies and engineered for high performance and security. When running fault-tolerant EDA workloads, Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. AWS also offers Amazon EC2 Reserved Instances and Savings Plans, which provide a significant discount compared to On-Demand Instance prices in exchange for a usage commitment over a 1- or 3-year term.

Amazon Simple Storage Service (Amazon S3)
For persistent data, such as libraries, tools, and design specifications, EDA workflows can leverage Amazon S3. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. Amazon S3 is designed for 99.999999999% (11 9’s) of durability, and stores data for millions of applications for companies all around the world. Traffic between Amazon EC2 and Amazon S3 can leverage up to 25 Gbps of bandwidth, and Amazon S3 also supports Cross-Region Replication and data tiering (see Amazon S3 Features and Amazon S3 FAQs for more info).

Amazon Elastic File System (Amazon EFS)
Amazon EFS provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources. Amazon EFS is the file system used for home directories and automation scripts when building the environment on AWS. It is built to scale on demand to petabytes without disrupting applications, growing and shrinking automatically as you add and remove files, eliminating the need to provision and manage capacity to accommodate growth.

Amazon FSx for Lustre
If your EDA tools require a high-performance shared file system, FSx for Lustre is a fully managed, high-performance file system, optimized for high performance computing (HPC) and EDA workloads. FSx for Lustre provides seamless integration with Amazon S3 making it easy and cost effective to persistently store the data in Amazon S3 and present it with a high-performance POSIX file system that can be mounted on compute instances for data processing with sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS.

Solution Deployment

For deployment, I go through the following steps:

Launch and configure an EDA capable infrastructure stack using Scale-Out Computing on AWS
Customize an Amazon Machine Image (AMI)
Set up a license server, with the option to set up multiple license servers
Create an Amazon S3 bucket to upload the test cases and set up a shared file system
Connect to a NICE DCV remote desktop session
Install the EDA application

Step 1: Launch and configure infrastructure using Scale-Out Computing on AWS
Scale-Out Computing on AWS can be deployed using the 1-click installer from the official AWS Solutions page https://aws.amazon.com/solutions/scale-out-computing-on-aws, which starts the deployment in AWS CloudFormation. The Solution is highly customizable, and core functionality is enabled with a scheduler running on an EC2 instance, which then leverages AWS CloudFormation and Amazon EC2 Auto Scaling to automatically provision resources necessary to execute user tasks such as scale-out compute jobs and remote visualization sessions using NICE DCV.

On the AWS CloudFormation parameter screen, you need to provide the following items:

The CloudFormation stack takes about 20 minutes to deploy, and the scheduler node then needs an additional 20 minutes to complete its configuration. Once the configuration is complete, you should be able to access the Scale-Out Computing on AWS web interface using the HTTPS link under CloudFormation Stack -> Outputs -> WebUserInterface.
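
The same link can also be read from the command line. The following is a minimal sketch, assuming the AWS CLI is installed and configured with permission to describe the stack; the function name soca_web_url and the stack name passed to it are illustrative, and only the WebUserInterface output key comes from the Solution itself.

```shell
# Print the web interface URL from the CloudFormation stack outputs.
# Usage: soca_web_url <stack_name>   (the stack name is whatever you chose at deployment)
soca_web_url() {
    aws cloudformation describe-stacks \
        --stack-name "$1" \
        --query "Stacks[0].Outputs[?OutputKey=='WebUserInterface'].OutputValue" \
        --output text
}
```

For example, soca_web_url my-soca-stack prints the HTTPS link once the stack has finished deploying.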

Detailed instructions for deploying Scale-Out Computing on AWS Solution with screenshots are available at: https://awslabs.github.io/scale-out-computing-on-aws/install-soca-cluster/

Step 2: Customize the Amazon Machine Image (AMI)

For a large cluster, it is recommended to start with a modern operating system image from AWS Marketplace then customize it per the application requirements. In this example, I started with a CentOS 7.7 base AMI and customized it by installing system packages needed for Scale-Out Computing on AWS solution to reduce compute node launch time. The steps to customize an AMI for the Solution are detailed here: https://awslabs.github.io/scale-out-computing-on-aws/tutorials/reduce-compute-node-launch-time-with-custom-ami/

However, before creating the custom AMI, I also needed to install system packages required for the EDA application. Here is an example of packages required for a certain workload:

# yum install -y \
	vim vim-X11 xterm compat-db47 glibc glibc.i686 openssl098e \
	compat-expat1.i686 dstat epel-release motif libXp \
	libXaw libICE.i686 libpng.i686 libXau.i686 libuuid.i686 libSM.i686 \
	libxcb.i686 plotutils libXext.i686 libXt.i686 libXmu.i686 \
	libXp.i686 libXrender.i686 bzip2-libs.i686 freetype.i686 \
	fontconfig.i686 libXft.i686 libjpeg-turbo.i686 motif.i686 \
	apr.i686 libdb libdb.i686 libdb-utils apr-util.i686 \
	qt apr-util gnuplot

You’ll want to check that all system libraries needed for your application are preinstalled on the AMI. You can check by going to the binaries directory of the application, then running this command:

# ldd <application_path>/linux64/bin/<binary_name> | grep 'not found'

The command should return a list of any library dependencies that are not found on the operating system library paths. Then you can use this command to identify the name of the package that provides the missing library and install it:

# yum whatprovides */libname.*
# yum install -y packagename
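
When an application ships many binaries, the two commands above can be chained. The sketch below is one way to do that, assuming a CentOS 7 host with yum available; the function names missing_libs and suggest_packages are illustrative and not part of any tool.

```shell
# Read `ldd` output on stdin and print each unresolved library name once.
missing_libs() {
    awk '/=> not found/ { print $1 }' | sort -u
}

# For each missing library of a binary, ask yum which package provides it.
# Usage: suggest_packages <path_to_binary>
suggest_packages() {
    ldd "$1" | missing_libs | while read -r lib; do
        echo "== $lib"
        yum -q whatprovides "*/$lib"
    done
}
```

Running suggest_packages against each binary gives a list of candidate packages to review before installing them into the AMI.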

Next, I needed to turn off SELinux by editing /etc/selinux/config and changing SELINUX=enforcing to SELINUX=disabled.
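
If you prefer to script this edit while building the image, a minimal sketch follows. It assumes GNU sed (as shipped with CentOS 7) and root privileges; the function name disable_selinux is illustrative.

```shell
# Rewrite only the SELINUX= line and keep a .bak copy of the original file.
# Usage: disable_selinux <config_file>   (normally /etc/selinux/config)
disable_selinux() {
    sed -i.bak 's/^SELINUX=enforcing$/SELINUX=disabled/' "$1"
}
```

The change in the configuration file takes effect at the next reboot.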

Next, it is also recommended to increase system limits, especially if the application deals with a large number of files, which is common for many EDA applications. This can be accomplished by editing the following files and adding the subsequent entries:

/etc/sysctl.conf:
	net.core.somaxconn=65535
	net.ipv4.tcp_max_syn_backlog=163840
	net.core.rmem_default=31457280
	net.core.rmem_max=67108864
	net.core.wmem_default=31457280
	net.core.wmem_max=67108864
	fs.file-max=1048576
	fs.nr_open=1048576

/etc/security/limits.conf:
	*		hard 	memlock 	unlimited
	*		soft 	memlock 	unlimited
	*		soft 	nproc 	 	3061780
	*		hard 	nproc 		3061780
	*		soft	sigpending	3061780
	*		hard	sigpending	3061780
	*		soft	nofile		1048576
	*		hard	nofile		1048576

/opt/pbs/lib/init.d/limits.pbs_mom:
	ulimit -l unlimited
	ulimit -u 3061780
	ulimit -i 3061780
	ulimit -n 1048576
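
Because the image may be rebuilt several times, it can help to apply the sysctl entries idempotently instead of appending them by hand. The helper below is a sketch under that assumption; set_sysctl is an illustrative name, and it relies on GNU sed and grep.

```shell
# Set key=value in a sysctl-style file, replacing an existing entry if present.
# Usage: set_sysctl <file> <key> <value>
set_sysctl() {
    # Escape the dots in the key so they match literally in the patterns below.
    key_re=$(printf '%s' "$2" | sed 's/[.]/\\./g')
    if grep -q "^[[:space:]]*${key_re}[[:space:]]*=" "$1" 2>/dev/null; then
        sed -i.bak "s|^[[:space:]]*${key_re}[[:space:]]*=.*|$2=$3|" "$1"
    else
        printf '%s=%s\n' "$2" "$3" >> "$1"
    fi
}
```

For example, set_sysctl /etc/sysctl.conf fs.file-max 1048576 updates or adds that entry, after which sysctl -p (run as root) reloads /etc/sysctl.conf.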

Next, I installed the Amazon FSx for Lustre client in the base image. You can follow the steps at: https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html for CentOS 7.7. Because the kernel was already updated and the instance rebooted in the first step of this section (while installing the system packages needed for Scale-Out Computing on AWS to reduce compute node launch time), there is no need to restrict the aws-fsx.repo to stay on version 7.7.

Now that I have completed all of the AMI customizations, I can create the new AMI to be used when launching instances. I do this by going to the Amazon EC2 console, selecting the instance I have customized, clicking on Actions -> Image -> Create Image, and noting the AMI ID.
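
The AMI can also be created with the AWS CLI instead of the console. This is a minimal sketch, assuming the CLI is configured with permission to create images; the function name create_custom_ami and the AMI name you pass to it are illustrative.

```shell
# Create an AMI from a customized instance and print the new AMI ID.
# Usage: create_custom_ami <instance_id> <ami_name>
create_custom_ami() {
    aws ec2 create-image \
        --instance-id "$1" \
        --name "$2" \
        --query ImageId \
        --output text
}
```

The image takes a few minutes to become usable; aws ec2 wait image-available --image-ids <ami_id> blocks until it is ready.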

Finally, I edited /apps/soca/<CLUSTER_ID>/cluster_manager/settings/queue_mapping.yml to update the AMI ID as indicated below. This enables my compute cluster to use this image for all compute and desktop nodes.

queue_type:
  compute:
    queues: ["high", "normal", "low"]
    instance_ami: "<YOUR_AMI_ID>" # <- Add your new AMI ID
    base_os: "centos7"
    root_size: "10" # <- Add the size corresponding to your AMI
    instance_type: ...
...
  desktop:
    queues: ["desktop"]
    instance_ami: "<YOUR_AMI_ID>" # <- Add your new AMI ID
    base_os: "centos7"
    root_size: "10" # <- Add the size corresponding to your AMI
    instance_type: ...
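
YAML does not allow tab characters for indentation, and a stray tab is an easy mistake to make when editing this file in a terminal. The sketch below is a quick check you could run after editing; yaml_tab_lines is an illustrative name.

```shell
# Print the line number of every line whose indentation contains a tab.
# Usage: yaml_tab_lines <file>
yaml_tab_lines() {
    awk '/^ *\t/ { print NR }' "$1"
}
```

No output means the indentation is free of tabs; any line numbers printed point at lines to re-indent with spaces.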

Step 3: Set up a license server, with the option to set up multiple

It is possible to use on-premises license server(s) when running EDA workloads in AWS, but this requires a stable network connection between the on-premises network and the Amazon VPC. This can be established using an AWS Site-to-Site VPN connection or a dedicated network connection using AWS Direct Connect. However, for large-scale testing, lower latency between the compute nodes and the license server(s) is preferred, so I decided to deploy a license server in my Amazon VPC.

For the license server, I provisioned a c5.2xlarge instance running CentOS 7.7 in the private subnet corresponding to the same Availability Zone deployed by the Scale-Out Computing on AWS Solution for the scheduler instance.

You should check with your EDA vendor on their specific requirements for setting up a license server on AWS. At the time of writing this blog, I’m aware of two different requirements from EDA vendors, so I’ll cover both.

An EDA vendor might require that you use an Elastic Network Interface (ENI) attached to the EC2 instance to generate a license file. You must create the network interface in the same subnet then attach it to the EC2 instance. Step by step details about creating and attaching an elastic network interface are available at: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#working-with-enis
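
For reference, the console steps in that guide have AWS CLI equivalents. The sketch below assumes the CLI is configured with the needed EC2 permissions; the function names, the description string, and the device index of 1 (the first interface after the primary one) are illustrative choices.

```shell
# Create a network interface in the license server's subnet and print its ID.
# Usage: create_license_eni <subnet_id> <security_group_id>
create_license_eni() {
    aws ec2 create-network-interface \
        --subnet-id "$1" \
        --groups "$2" \
        --description "License server ENI" \
        --query NetworkInterface.NetworkInterfaceId \
        --output text
}

# Attach the interface to the license server as a secondary interface.
# Usage: attach_license_eni <eni_id> <instance_id>
attach_license_eni() {
    aws ec2 attach-network-interface \
        --network-interface-id "$1" \
        --instance-id "$2" \
        --device-index 1
}
```

Which property of the interface the vendor uses as the host ID is vendor specific, so confirm that detail with your EDA vendor before requesting the license file.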

Alternatively, an EDA vendor might require that you request a license file based on the virtual machine universal unique identifier (VM UUID), which translates to an EC2 instance ID. The format is VM_UUID=<AWS EC2 Instance ID>, for example, VM_UUID=i-0af01c0123456789a. You might want to double check the minimum version of the license daemon that supports license generation based on VM UUID.
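
To avoid typos when submitting the license request, the host ID string can be generated from the instance ID. This is a small sketch; vm_uuid_hostid is an illustrative name, and the validation assumes instance IDs of the form i- followed by 8 or 17 hexadecimal characters.

```shell
# Print the VM_UUID host ID string for an EC2 instance ID.
# Usage: vm_uuid_hostid <instance_id>
vm_uuid_hostid() {
    # Accept both the 8- and 17-character hexadecimal instance ID forms.
    if printf '%s\n' "$1" | grep -Eq '^i-([0-9a-f]{8}|[0-9a-f]{17})$'; then
        printf 'VM_UUID=%s\n' "$1"
    else
        echo "not an EC2 instance ID: $1" >&2
        return 1
    fi
}
```

For example, vm_uuid_hostid i-0af01c0123456789a prints VM_UUID=i-0af01c0123456789a. The instance ID itself is shown in the Amazon EC2 console for the license server instance.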

For both cases, it is recommended to attach the Security Group named ComputeNodeSecurityGroup created by Scale-Out Computing on AWS to the license EC2 instance, as it opens all TCP traffic between instances that have the same security group attached.

Now I log in to my license server instance, then configure and launch the licensing software. The following system packages must be installed on the license server, as they are required for FlexLM:

# yum install -y \
	vim glibc.i686 \
	glibc-devel.i686 \
	redhat-lsb

Next, I needed to turn off SELinux by editing /etc/selinux/config and changing SELINUX=enforcing to SELINUX=disabled.

Next, you need to install the license application provided by the EDA vendor and follow the instructions to identify the host ID to generate the license.

After receiving a license file, I edited /etc/rc.local to enable the license daemon to start automatically whenever I start the license server instance by adding the following line:

/etc/rc.local:
	su centos -c "/vendor/version/linux64/bin/lmgrd -c /vendor/license/license.lic \
	-l /vendor/logs/license.log -reuseaddr"

Then I ran these commands to disable the firewall daemon and enable execution of /etc/rc.local upon instance reboot:

systemctl stop firewalld
systemctl disable firewalld
chmod +x /etc/rc.d/rc.local
systemctl enable rc-local
systemctl start rc-local

Next, I increased some system limits, especially because I’m planning to do a large scale activity with more than 30,000 cores, which requires additional file handles and network connections. This can be accomplished by editing the following files and adding the subsequent entries:

/etc/sysctl.conf:
	net.core.somaxconn=65535
	net.ipv4.tcp_max_syn_backlog=163840
	net.ipv4.tcp_keepalive_time=300
	net.ipv4.tcp_keepalive_intvl=60 
	net.ipv4.tcp_keepalive_probes=5
	net.ipv4.tcp_retries2=3
	net.core.rmem_default=10485760
	net.core.rmem_max=10485760
	fs.file-max=1048576
	fs.nr_open=1048576


/etc/security/limits.conf:
	*		soft	nofile		1048576
	*		hard	nofile		1048576

Finally, I reboot the license server instance so that all changes are applied.
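
After the reboot, it is worth confirming that the license daemon came back up and is serving features. The sketch below assumes the vendor's FlexLM utilities include lmutil and that it is on the PATH (it usually sits next to lmgrd); the function name license_status is illustrative, and the port must match the one in your license file.

```shell
# Query a FlexLM license server and list the status of all features.
# Usage: license_status <port> <host>
license_status() {
    lmutil lmstat -a -c "$1@$2"
}
```

Running this from a compute node, and not only on the license server itself, also confirms that the security group allows the connection.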

Note: If the license server will be serving more than 10,000 license features, EDA vendors recommend splitting the licenses across multiple license servers, so you might need to deploy another license server by repeating Step 3 on another EC2 instance.

Step 4: Create an Amazon S3 bucket to upload the test cases and set up a shared file system

I create an S3 bucket to serve as our persistent storage, then upload the EDA application software and the required test case files to the bucket. From there, I go through the steps for setting up an FSx for Lustre file system. Although beyond the scope of this blog post, there are alternate file system options for running EDA tools on AWS. For more information, please reach out to your Solutions Architect.

Before creating an S3 bucket, you must install the AWS CLI. To install the AWS CLI, follow the steps shown here: Installing the AWS CLI.

Next, configure the AWS CLI by providing credentials that allow you to create an S3 bucket and upload data to the bucket.

$ aws configure
AWS Access Key ID [None]: 
AWS Secret Access Key [None]: 
Default region name [None]: 
Default output format [None]:

Step 4.1: Create an S3 bucket

Next, verify that AWS CLI is working by creating an S3 bucket:

$ aws s3 mb s3://app-testcases-<unique_id>
make_bucket: app-testcases-<unique_id>

Note: As an example, I created an S3 bucket called app-testcases-<unique_id>. The <unique_id> is required to make sure that the S3 bucket name is globally unique.
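
Bucket creation also fails if the name breaks the S3 naming rules. The sketch below checks only the basic rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit) and does not cover every rule in the S3 documentation; is_valid_bucket_name is an illustrative name.

```shell
# Return success if the name passes the basic S3 bucket naming rules.
# Usage: is_valid_bucket_name <bucket_name>
is_valid_bucket_name() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}
```

For example, is_valid_bucket_name app-testcases-a1b2c3 succeeds, while a name containing uppercase letters or underscores is rejected.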

Next, assuming you’re using a laptop or a server outside AWS that already has the EDA application installation files, and the required test cases, then you can easily upload them to the S3 bucket as follows:

$ aws s3 cp Installer.bin s3://app-testcases-<unique_id>/
$ aws s3 cp App-version-common.ext s3://app-testcases-<unique_id>/
$ aws s3 cp App-version-linux64.ext s3://app-testcases-<unique_id>/
$ aws s3 cp design-testcases.tar.gz s3://app-testcases-<unique_id>/

Note: I’m using .bin and .ext as generic file extensions for the vendor installation files.

If the machine you’re using doesn’t have public internet access, you can create an EC2 instance in AWS, then use it to download the software from the vendor’s FTP server.

Step 4.2: Create Amazon FSx for Lustre file system

There are a few considerations that you must be aware of before creating an Amazon FSx for Lustre file system. The throughput capacity of an FSx for Lustre file system depends on its storage capacity. In this deployment, I’ll be using the “Scratch 2” Amazon FSx deployment type, which drives 200 MB/sec/TiB. My test cases require an aggregate throughput of 1.5 GB/sec, which works out to 1,500 / 200 = 7.5 TiB; because Scratch 2 capacity is provisioned in 2.4 TiB increments, I need to create a 9.6 TiB file system or larger. If you’re not sure about the aggregate throughput requirement of your test cases, it is recommended to start with a reasonably large FSx for Lustre file system, experiment with an initial number of EC2 instances, collect the aggregate throughput requirements, and then choose the proper file system size for the number of instances required at full scale. To support workloads with high metadata operations, it is recommended to have an even larger file system.
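
The sizing arithmetic can be captured in a small helper. The sketch below assumes the 200 MB/sec/TiB figure quoted above and that valid Scratch 2 capacities are 1.2 TiB, 2.4 TiB, and multiples of 2.4 TiB; check the current FSx for Lustre documentation before relying on either number. The function name min_scratch2_tib is illustrative.

```shell
# Print the smallest Scratch 2 capacity (in TiB) that meets a throughput target.
# Usage: min_scratch2_tib <aggregate_throughput_in_MB_per_sec>
min_scratch2_tib() {
    # Work in tenths of a TiB to stay in integer arithmetic:
    # 200 MB/sec per TiB is 20 MB/sec per tenth of a TiB (round up).
    tenths=$(( ($1 + 19) / 20 ))
    if [ "$tenths" -le 12 ]; then
        tenths=12                                  # smallest size, 1.2 TiB
    else
        tenths=$(( (tenths + 23) / 24 * 24 ))      # round up to a multiple of 2.4 TiB
    fi
    printf '%d.%d\n' $(( tenths / 10 )) $(( tenths % 10 ))
}
```

For the 1.5 GB/sec requirement above, min_scratch2_tib 1500 prints 9.6. This covers throughput only; metadata-heavy workloads may still call for a larger file system.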

Using the AWS Management Console, under Services go to Amazon FSx -> Create File System. Select “Amazon FSx for Lustre”. On the next screen, provide a name for the file system, select Scratch, and keep Scratch 2 selected. For Storage capacity, type 45.6 and notice that Throughput Capacity will be calculated as 8906 MB/s.

In the Network & security section,

under VPC select soca-<stack name>-VPC,
under VPC Security Groups, select the security group that corresponds to soca-<stack-name>-ComputeNodeSG,
under Subnet, select the subnet that corresponds to soca-<stack-name>-Private1

In the Data repository integration section, change Data repository type to: Amazon S3, and type s3://app-testcases-<unique_id> under Import Bucket, and select “The same prefix that you imported from (replace existing objects with updated ones)” under Export prefix. You can provide any tags then click Next to review the settings, then click on “Create file system”.

The file system should be ready within 5 minutes, and the AWS Management Console will indicate the status once the file system is created and ready for use.
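
The status and DNS name can also be read with the AWS CLI, which is convenient when scripting the next step. This is a minimal sketch, assuming the CLI is configured with permission to describe FSx file systems; the function names are illustrative.

```shell
# Print the lifecycle state of a file system (AVAILABLE once it is ready).
# Usage: fsx_lifecycle <file_system_id>
fsx_lifecycle() {
    aws fsx describe-file-systems \
        --file-system-ids "$1" \
        --query 'FileSystems[0].Lifecycle' \
        --output text
}

# Print the DNS name of a file system.
# Usage: fsx_dns_name <file_system_id>
fsx_dns_name() {
    aws fsx describe-file-systems \
        --file-system-ids "$1" \
        --query 'FileSystems[0].DNSName' \
        --output text
}
```

Once fsx_lifecycle reports AVAILABLE, the value printed by fsx_dns_name is the DNSName needed for the scheduler configuration.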

Finally, log in to the scheduler instance (if you are not logged in already) and edit /apps/soca/<CLUSTER_ID>/cluster_manager/settings/queue_mapping.yml to include the Amazon FSx for Lustre DNSName using the fsx_lustre option under the compute and desktop queues:

fsx_lustre: "fs-xxxxxxxxxxxxxxxxx.fsx.<region>.amazonaws.com"



queue_type:
  compute:
    queues: ["queue1", "queue2", "queue3"] 
    fsx_lustre: "fs-xxxxxxxxxxxxxxxxx.fsx.<region>.amazonaws.com" # <- Add your Amazon FSx for Lustre DNSName 

  desktop:
    queues: ["desktop"]
    fsx_lustre: "fs-xxxxxxxxxxxxxxxxx.fsx.<region>.amazonaws.com" # <- Add your Amazon FSx for Lustre DNSName

Step 5: Connecting to a remote desktop session using NICE DCV
Log in to the Scale-Out Computing on AWS web user interface using the username/password used in Step 1, then click on “Graphical Access” on the left sidebar. Under Your Session #1, you can select the session validity and the size of the virtual machine in terms of CPUs and memory, then click on “Launch my Session #1”. A new “desktop” job is sent to the queue, which creates a new instance based on the specified requirements.

You will see an informational message asking you to wait up to 20 minutes before being able to access your remote desktop. You can check the status of the desktop job by clicking on “My Job Queue” on the left sidebar. Once the session is ready, the information message on “Graphical Access” -> “Your session #1” will be updated with the connection information.

You can access the session directly from your browser or download the NICE DCV native client for Mac / Linux / Windows and access your session through the native client.

More details for Graphical Access are available at: https://awslabs.github.io/scale-out-computing-on-aws/access-soca-cluster/#graphical-access-using-dcv

Step 6: Installing the EDA application
After you’ve logged into the remote desktop session, you should be able to see the files you uploaded to the Amazon S3 bucket automatically visible on the FSx for Lustre file system, which is mounted under /fsx.

$ cd /fsx
$ ls
Installer.bin App-version-common.ext App-version-linux64.ext design-testcases.tar.gz
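
If /fsx appears empty, it is worth confirming that the Lustre file system is actually mounted before looking elsewhere. The sketch below reads the kernel mount table; is_lustre_mounted is an illustrative name, and the optional second argument exists only so the check can be pointed at a different mounts file.

```shell
# Return success if the given mount point is a mounted Lustre file system.
# Usage: is_lustre_mounted <mount_point> [mounts_file]
is_lustre_mounted() {
    awk -v mp="$1" '$2 == mp && $3 == "lustre" { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}
```

For example, is_lustre_mounted /fsx && echo mounted prints mounted only when the file system is attached to the session's instance.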

The next step is to set up the EDA vendor installer application:

$ chmod 755 Installer.bin
$ ./Installer.bin

When prompted for an installation directory, type /fsx/<vendor_name>/installer then wait until the installation completes.

The next step is to use the vendor installer application to install the EDA application:

$ /fsx/<vendor_name>/installer/installer

The installer will prompt for the path to the source directory containing the downloaded EFT file(s); enter /fsx, which contains App-version-*.ext. The installer will then prompt for the full path where the EDA vendor application should be installed, so type /fsx/<vendor_name>. The installer verifies the integrity of the *.ext files and proceeds with the installation.

Once the installation completes, I edited ~/.bash_profile to add the following:

export EDATOOL_HOME=/fsx/<vendor_name>/<app_name>/<version>
source $EDATOOL_HOME/setup.sh
export LM_LICENSE_FILE=27020@ip-a-b-c-d 	# <- Modify to point to the license
                                              # server(s) created in step 3
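
If you deployed more than one license server in Step 3, FlexLM on Linux accepts a colon-separated list of port@host entries in LM_LICENSE_FILE. The helper below is a small sketch for building that value; license_path is an illustrative name, and the host names in the example are placeholders.

```shell
# Build a colon-separated port@host list for LM_LICENSE_FILE.
# Usage: license_path <port> <host>...
license_path() {
    port=$1
    shift
    result=""
    for host in "$@"; do
        # Prefix a colon only when result already holds an entry.
        result="${result:+$result:}$port@$host"
    done
    printf '%s\n' "$result"
}
```

For example, export LM_LICENSE_FILE=$(license_path 27020 ip-10-0-1-10 ip-10-0-1-11) sets the variable to 27020@ip-10-0-1-10:27020@ip-10-0-1-11.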

Conclusion
In this blog post, I’ve used Scale-Out Computing on AWS to set up an EDA environment capable of running the entire semiconductor design workflow. In an upcoming post, I’ll use this infrastructure to scale up an example EDA workload to more than 30,000 cores. Stay tuned!
