Building Elastic HPC Clusters for EDA with Altair on AWS

Introduction

Chip design is not only compute and memory intensive; it also requires varying amounts of resources across the Integrated Circuit (IC) design phases: some frontend tools are single threaded and CPU bound, while backend tools rely on high performance storage and large memory. With a fixed-size on-premises compute farm, jobs end up waiting in queue for either a license or the right-size compute node.

Example Key Performance Indicators (KPIs) for Electronic Design Automation (EDA) workflows impacted by a fixed-sized cluster are:

License utilization: EDA tool licenses can be one of the most expensive line items, so better license utilization reduces costs while accelerating time to market for new products;
Engineering productivity: Hiring and training engineers is time consuming and expensive. Reducing job wait time (queue length) and job run time increases engineering productivity, so engineers deliver products to market faster;
Infrastructure cost: Reducing infrastructure costs frees up resources to drive innovation and introduce new products to market.

Companies often run multiple IC design projects concurrently, complicating resource planning. Undersizing the cluster can result in slower time to market, while oversizing leads to inefficient use of resources. Many chip design companies use Altair Accelerator, and can leverage its integration with AWS to build an elastic cluster that grows and shrinks automatically as needed. This elastic cluster also allows a flexible choice of compute types, optimizing all three KPIs.

In this blog I’ll review best practices to optimize EDA using AWS services, and the AWS partner Altair Engineering’s Altair Accelerator.

Right-size your compute cluster

Amazon Elastic Compute Cloud (EC2) offers resizable compute capacity, and customers only pay for the time they use the computing resources. This enables chip design companies to scale their compute resources at any time, adding resources when more jobs need to run, and removing them when they are no longer required. Altair Accelerator is a common scheduler for EDA workloads, and can manage cluster scaling using its Rapid Scaling feature, going beyond job scheduling to compute resource scheduling and optimization.

Altair Rapid Scaling triggers EC2 resource provisioning when jobs wait in queue longer than a set time period (e.g., 5 minutes). This reduces job wait times and improves engineering productivity, while also reducing the risk of a license being available without a compute resource to run it (license underutilization). Similarly, when jobs finish and instances are idle for a set time period (e.g., 1 minute), Rapid Scaling terminates them to optimize cost. The chart below shows the number of compute instances, with the baseline number of instances continuously running, and additional compute instances added or terminated depending on the number of jobs waiting in the queue. Altair can correctly size the cluster at each moment because it manages the job queues while having visibility into available licenses.
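To make the two thresholds concrete, here is a minimal Python sketch of a scale-up and scale-down decision. It is not Altair's implementation or API: the names QueuedJob, Instance, and plan_scaling are hypothetical, and the rule that idle instances absorb overdue jobs before new ones are launched is a simplifying assumption of mine. The defaults mirror the example values above (5 minutes of queue wait, 1 minute of idle time).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QueuedJob:
    job_id: str
    wait_seconds: float  # time the job has spent waiting in the queue


@dataclass
class Instance:
    instance_id: str
    idle_seconds: float        # 0 means the instance is busy
    is_baseline: bool = False  # baseline instances are never terminated


def plan_scaling(
    queue: List[QueuedJob],
    instances: List[Instance],
    max_wait_s: float = 300.0,  # assumed threshold: 5 minutes in queue
    max_idle_s: float = 60.0,   # assumed threshold: 1 minute idle
) -> Tuple[int, List[str]]:
    """Return (instances to launch, instance ids to terminate)."""
    overdue = [j for j in queue if j.wait_seconds > max_wait_s]
    idle = [i for i in instances if i.idle_seconds > 0]

    # Scale up: idle instances can take overdue jobs first (assumption),
    # so only launch capacity for the jobs that are left over.
    to_launch = max(0, len(overdue) - len(idle))
    if overdue:
        return to_launch, []

    # Scale down: terminate non-baseline instances idle past the threshold.
    to_terminate = [
        i.instance_id
        for i in idle
        if not i.is_baseline and i.idle_seconds > max_idle_s
    ]
    return 0, to_terminate


# One job has waited 7 minutes and the only instance is busy: launch one more.
print(plan_scaling(
    [QueuedJob("sim-1", 420), QueuedJob("sim-2", 30)],
    [Instance("i-base", 0, is_baseline=True)],
))
```

A real scheduler would also account for license availability and instance startup time before launching anything; this sketch only captures the two timers.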

When running EDA workloads on EC2 instances, Altair Accelerator can assign each task to the compute that will minimize its run time and license utilization. Jobs requiring faster CPUs can benefit from 4.5GHz CPUs in the m5zn and x2iezn instances, while jobs requiring higher memory can utilize the r6i instances with up to 1024GB RAM, the x2g and other high-memory instances to accelerate their final signoff. Historically, EDA teams have optimized compute infrastructure only through CPU frequency and RAM/core ratio, but with EC2 offering Intel, AMD, and Arm-based AWS Graviton3 processors, you can select the optimal configuration for each job, balancing cost and performance.

Choosing the right EC2 instance type for each EDA tool and testing with new instance types as they become available is a good practice to improve license utilization, reduce turnaround times further and allow more jobs to run before tapeout, resulting in higher quality designs and faster time to market.
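As an illustration of matching jobs to compute, the following Python sketch picks the cheapest option that meets a job's minimum CPU frequency and memory. Everything in it is an assumption for the example: the catalog entries are placeholders rather than real EC2 instance types, and the frequencies, memory sizes, and hourly prices are invented, not AWS specifications or pricing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceOption:
    name: str
    cpu_ghz: float
    memory_gb: int
    hourly_usd: float


# Hypothetical catalog: placeholder names and made-up numbers.
CATALOG: List[InstanceOption] = [
    InstanceOption("general", cpu_ghz=3.5, memory_gb=256, hourly_usd=1.0),
    InstanceOption("fast-cpu", cpu_ghz=4.5, memory_gb=192, hourly_usd=1.3),
    InstanceOption("high-mem", cpu_ghz=3.5, memory_gb=1024, hourly_usd=2.0),
]


def pick_instance(
    min_ghz: float,
    min_memory_gb: int,
    catalog: List[InstanceOption] = CATALOG,
) -> Optional[InstanceOption]:
    """Return the cheapest option meeting both requirements, or None."""
    candidates = [
        o for o in catalog
        if o.cpu_ghz >= min_ghz and o.memory_gb >= min_memory_gb
    ]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


# A single-threaded, CPU-bound frontend job vs. a memory-hungry signoff job.
print(pick_instance(min_ghz=4.0, min_memory_gb=64))
print(pick_instance(min_ghz=3.0, min_memory_gb=512))
```

Picking by hourly price alone is a deliberate simplification. Because a faster instance also shortens license checkout time, the cheapest instance per hour is not always the cheapest per job, which is why testing each tool on candidate instance types matters.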

Optimize for cost

EDA licenses cost more than the compute resources, so cutting job run times reduces license use time, and improves the total cost of developing your next chip. To optimize costs further, Altair Accelerator enables customers to manage access to licenses across on-premises and AWS, reducing unused licenses.

When comparing costs, it’s important to look at Total Cost of Ownership (TCO), including EDA licenses, compute, engineering time and time to market. For example, a compute instance with a faster CPU may cost 30% more and deliver a 15% shorter job run time for Design Verification. Since infrastructure costs increased more than job performance, this may appear wasteful. However, looking at the TCO, the shorter license usage means you can either run more jobs or buy fewer licenses. In either case, the TCO is better, because licenses cost more than the compute they run on. Additionally, with the faster turnaround time, increased engineering productivity, and the potential for faster time to market, you should be able to estimate the decrease in TCO while running on AWS.
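The arithmetic behind this example can be sketched in a few lines of Python. The 30% price increase and 15% shorter run time come from the example above; the hourly rates (1 USD for compute, 10 USD for the license) and the 10-hour job are hypothetical numbers chosen only to make the calculation concrete.

```python
def job_cost(hours: float, compute_usd_per_hour: float,
             license_usd_per_hour: float) -> float:
    """Cost of one job: compute and license are both held for its duration."""
    return hours * (compute_usd_per_hour + license_usd_per_hour)


# Hypothetical inputs (not real prices).
BASE_HOURS = 10.0
COMPUTE_RATE = 1.0    # USD per hour, baseline instance
LICENSE_RATE = 10.0   # USD per hour, amortized license cost

baseline = job_cost(BASE_HOURS, COMPUTE_RATE, LICENSE_RATE)
# Faster instance: 30% higher compute price, 15% shorter run time.
faster = job_cost(BASE_HOURS * 0.85, COMPUTE_RATE * 1.30, LICENSE_RATE)

print(f"baseline: {baseline:.2f} USD")   # 110.00
print(f"faster:   {faster:.2f} USD")     # 96.05
print(f"saving:   {100 * (baseline - faster) / baseline:.1f}%")
```

Under this simple model the faster instance wins whenever 0.85 × (1.3c + l) < c + l, which simplifies to l > 0.7c: the license only needs to cost more than 70% of the baseline compute rate per hour for the more expensive instance to lower the cost per job. Engineering time and time to market are not modeled here and would tilt the result further.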

Altair Accelerator can leverage Amazon EC2 Spot Instances and Amazon EC2 Reserved Instances to optimize costs further. Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud at up to a 90% discount compared to On-Demand prices; however, they carry the risk of interruption with a 2-minute notice (for more details, see Spot Instance interruptions). Spot Instances fit fault-tolerant EDA workloads, and Altair Accelerator can relaunch any interrupted job on a new Spot Instance. EDA workflows can also leverage Reserved Instances, which provide up to a 72% discount compared to On-Demand prices and an optional capacity reservation in specific Availability Zones.
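The relaunch behavior can be pictured as a retry loop around the job. The Python sketch below is a generic illustration, not how Altair Accelerator implements it: SpotInterrupted, run_with_relaunch, and the runner callback are hypothetical names, and the cap of three attempts is an arbitrary assumption.

```python
from typing import Callable, Tuple


class SpotInterrupted(Exception):
    """Raised by the (hypothetical) runner when the instance is reclaimed."""


def run_with_relaunch(
    job_id: str,
    run_on_new_instance: Callable[[str], str],
    max_attempts: int = 3,  # assumed limit before giving up on Spot
) -> Tuple[int, str]:
    """Run a job, relaunching it on a fresh instance after each interruption.

    Returns (attempts used, job result).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, run_on_new_instance(job_id)
        except SpotInterrupted:
            # The job is fault tolerant, so it can simply start again.
            continue
    raise RuntimeError(
        f"{job_id} was interrupted {max_attempts} times; "
        "consider On-Demand capacity for this job"
    )


def make_flaky_runner(interruptions: int) -> Callable[[str], str]:
    """Simulated runner that is interrupted a fixed number of times."""
    remaining = {"n": interruptions}

    def runner(job_id: str) -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise SpotInterrupted(job_id)
        return f"{job_id}: done"

    return runner


print(run_with_relaunch("regress-42", make_flaky_runner(2)))
```

This pattern suits short, restartable jobs such as regression tests. Long-running jobs without checkpoints lose all progress on each interruption, so they are usually better placed on On-Demand or Reserved capacity.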

Altair Accelerator offers additional cost controls, from its License First Scheduling (provisioning resources only when the required license is available) to per-project budgets that keep individual teams from overspending. For further cost control, you can also use AWS Budgets, which lets you set custom budgets to track your cost and usage.
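These two controls amount to a gate in front of provisioning: check the license first, then the project budget. The Python sketch below shows that ordering under stated assumptions. The function name, its parameters, and the idea of an estimated per-job cost are all hypothetical and do not reflect Altair Accelerator's actual interfaces.

```python
from typing import Tuple


def should_provision(
    free_licenses: int,
    project_spend_usd: float,
    project_budget_usd: float,
    estimated_job_cost_usd: float,
) -> Tuple[bool, str]:
    """Decide whether to provision compute for a queued job."""
    # License first: compute without a license would sit idle and cost money.
    if free_licenses < 1:
        return False, "no license available"
    # Budget second: keep the project within its allocation.
    if project_spend_usd + estimated_job_cost_usd > project_budget_usd:
        return False, "project budget exceeded"
    return True, "ok"


print(should_provision(0, 100.0, 1000.0, 50.0))
print(should_provision(2, 980.0, 1000.0, 50.0))
print(should_provision(2, 100.0, 1000.0, 50.0))
```

Checking the license before anything else means a job that cannot start never triggers spend, which is the point of provisioning only when the required license is available.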

Optimize for engineering utilization

Engineering is one of the largest costs for chip design companies, and can be the hardest workflow to scale. Optimizing for the engineers’ time improves test coverage, resulting in higher quality products. It can also be used to cut time to market.

Altair Accelerator adapts the cluster size dynamically to optimize test run times, enabling engineers to start analyzing test results and submitting new jobs at a faster pace. It also offers engineers a self-service dashboard to locate the root cause of job queue times. Analyzing why jobs are not running allows engineers to quickly resolve issues and focus on time to market. One example is the case study for Annapurna Labs (an Amazon company), which cut HPC costs by at least 50%. Annapurna Labs also reduced turnaround times, resulting in higher engineering productivity and accelerating time to market.

Conclusion

Engineers are builders, seeking to innovate and build the next exciting product. When a lack of compute resources or a missing software license delays the engineers, your time to market slips. Optimizing license utilization, engineering productivity and cost requires an elastic cluster that scales up and down quickly. Altair Accelerator provides EDA customers this elasticity through Rapid Scaling and is accessible through the AWS Marketplace. To learn more about EDA on AWS, read the Semiconductor Design on AWS white paper, and for more technical details, see the Run Semiconductor Design Workflows on AWS implementation guide.

AWS for Industries