Slash the costs of semiconductor design by up to 80% using Amazon EC2 Spot Instances

The challenges of semiconductor design in the cloud

In semiconductor design, the cornerstone of all modern electronics, precision meets the cutting edge of technology. Designing the chips that power the devices of our daily lives is an intricate process, one which requires billions of calculations, simulations, and verifications—all within parameters measured in nanometers and picoseconds and using vast amounts of computing power.

Given these constraints, companies face several significant challenges when implementing electronic design automation (EDA) solutions, both in the cloud and on premises:

Cost management: Migrating to the cloud can cause sticker shock due to high initial costs. However, cost calculations for traditional on-premises data centers often omit expenses beyond the computing hardware itself. With a cloud service, rent, heating, cooling, staffing, and maintenance are all bundled into the pricing. Thus, in contrast with on-premises solutions, cloud solutions transform capital expenditure into operational expenditure, allowing companies to access the latest hardware and maintain cutting-edge performance without additional capital investment.
Hybrid cloud strategy: Many companies adopt a hybrid approach, using cloud resources for burst computing to manage costs effectively. This strategy, while cost efficient, introduces complexity in resource allocation and in maintaining operational consistency across on-premises and cloud environments.
Choosing cloud providers and services: For success with EDA workloads, selecting the right cloud provider and optimal instances is crucial. Companies must evaluate providers based on performance, cost efficiency, and suitability for specific EDA requirements.
Network performance and integrity: Maintaining robust network performance during high-load operations and migrations is essential. Techniques such as secondary elastic network interfaces (ENIs) help mitigate potential disruptions.

Despite these challenges, there is significant potential for cost savings with cloud-based EDA. Key services from Amazon Web Services (AWS) that are particularly relevant to cloud-based EDA include Amazon Elastic Compute Cloud (Amazon EC2), secure and resizable compute capacity to support virtually any workload; AWS Batch, batch processing for machine learning model training, simulation, and analysis at any scale; Amazon Elastic File System (Amazon EFS), which companies use to share file data without provisioning storage; and Amazon Simple Storage Service (Amazon S3), object storage built to retrieve any amount of data from anywhere. These services offer scalable computing resources, efficient job scheduling, and high-performance shared storage, which are essential for managing the computational demands of EDA processes.

Semiconductor design workflows consist of compute-intensive workloads, particularly in backend processes such as place and route (P&R), physical verification, and tapeout. These workloads consume vast amounts of high-performance computing (HPC) resources, which can quickly drive up costs. Managing these costs while maintaining performance and scalability is critical to keeping chip development within budget, especially when using cloud-based EDA tools.

Amazon EC2 Spot Instances (Spot Instances), which let companies use spare EC2 capacity in the AWS Cloud, offer a strategic way to reduce these escalating costs by up to 80 percent. Nevertheless, Spot Instances come with their own challenges, such as availability management and interruption. Exostellar addresses these challenges by optimizing the deployment and administration of Spot Instances. The result is reliable performance similar to that of Amazon EC2 On-Demand Instances (On-Demand Instances), which let companies pay for compute capacity by the hour or second with no long-term commitments, but at a fraction of the cost and without compromising on computational needs.

Cost optimization for semiconductor workloads with Amazon EC2 Spot Instances

Spot Instances provide substantial cost savings, offering unused EC2 capacity at discounts up to 90 percent compared with On-Demand Instances. Nevertheless, while Spot Instances are suitable for various workloads that can tolerate interruption, they are less ideal for long-running, compute-intensive tasks such as semiconductor design, due to their unpredictability. For example, Spot Instances can be interrupted by AWS with a 2-minute warning when capacity is needed elsewhere, which poses a risk for tasks requiring sustained, uninterrupted computation.
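
To make the 2-minute warning concrete, the following minimal Python sketch parses a Spot interruption notice and computes how much time remains before the action takes effect. On a running instance, the notice is published by the EC2 instance metadata service under the path shown in the code. The sample payload, its timestamp, and the function name are illustrative assumptions, not values taken from the tests described in this post.

```python
import json
from datetime import datetime, timezone

# Path on the EC2 instance metadata service where an interruption notice
# appears once AWS has scheduled the instance for interruption.
SPOT_ACTION_PATH = "/latest/meta-data/spot/instance-action"


def seconds_until_interruption(notice_json: str, now: datetime) -> float:
    """Return the seconds remaining before the action time in a notice."""
    notice = json.loads(notice_json)
    # The notice carries a UTC timestamp such as "2024-01-01T12:02:00Z".
    action_time = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (action_time - now).total_seconds()


# Illustrative payload (hypothetical timestamp) and a fixed "current" time.
sample_notice = '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'
current_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample_notice, current_time)
print(remaining)  # 120.0
```

Any checkpoint or migration has to complete within that window, which is why long-running EDA jobs need additional tooling to run safely on Spot Instances.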

Semiconductor design involves extensive simulation and computation that cannot accommodate unexpected interruptions, making Spot Instances suboptimal for workloads that require longer processing times. Exostellar, which is architected to be fault tolerant, preempts the termination of Spot Instances with its artificial intelligence (AI) advisory service, which live-migrates workloads to other available Spot Instances while preserving the network IP address. When Spot Instances are not available, Exostellar’s solution live-migrates the workload back to On-Demand Instances or to instances covered by Savings Plans, a flexible pricing model that provides savings on AWS usage. Having these fallbacks in place enhances the reliability of using Spot Instances.
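
The fallback order described above can be sketched as a small decision function. This is a hypothetical illustration of the idea only (try Spot first, then fall back to On-Demand), not Exostellar’s actual placement logic. The function name, the capacity dictionary, and the instance sizes are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def choose_migration_target(
    spot_capacity: Dict[str, int], preferred_types: List[str]
) -> Tuple[str, str]:
    """Return (purchase_option, instance_type) for the next placement.

    spot_capacity is a hypothetical view of how many Spot Instances of
    each type are currently obtainable.
    """
    if not preferred_types:
        raise ValueError("at least one instance type is required")
    for instance_type in preferred_types:
        # Prefer the first acceptable type that still has Spot capacity.
        if spot_capacity.get(instance_type, 0) > 0:
            return ("spot", instance_type)
    # No Spot capacity for any type: fall back to On-Demand.
    return ("on-demand", preferred_types[0])


candidate_types = ["m6i.2xlarge", "c6i.2xlarge", "r6i.2xlarge"]
print(choose_migration_target({"m6i.2xlarge": 0, "r6i.2xlarge": 4}, candidate_types))
# ('spot', 'r6i.2xlarge')
print(choose_migration_target({}, candidate_types))
# ('on-demand', 'm6i.2xlarge')
```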

Exostellar’s innovative approach empowers semiconductor companies to utilize the cost savings of Spot Instances while safeguarding continuous, reliable operations. This solution not only reduces cloud compute costs but also maintains the high-performance standards essential for effective and efficient semiconductor design workflows.

Cost simulations

This blog post details the results of testing an enterprise-grade, compute-intensive EDA tool used in semiconductor design. We explored a variety of cloud service configurations, including On-Demand Instances, Spot Instances, and Savings Plans, both with and without the integration of Exostellar, as illustrated in Figure 1. Our aim was to determine the most cost-effective cloud solution for running semiconductor workflows.

AWS facilitated a comprehensive, large-scale system test, carried out by a third party and designed to evaluate Exostellar’s ability to scale and improve cloud resource utilization for large customer deployments using Spot Instances. This method has been shown to offer savings of up to 80 percent on Amazon EC2 costs. Furthermore, this effort underscores the shared commitment of AWS and Exostellar to boost the efficiency and cost-effectiveness of EDA workflows.

Our cost simulation covered a period of over 21 months and compared various configurations of On-Demand Instances, Spot Instances, and Savings Plans options. The results showed that Exostellar’s optimization, with 80 percent utilization of Spot Instances, could save hundreds of thousands of dollars without the necessity of upfront commitments or long-term contracts, as depicted in Figure 1.

Figure 1. Cost simulation outcomes from extensive testing of an enterprise-grade EDA tool on AWS, with and without Exostellar
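
For readers who want to reason about this kind of comparison themselves, the sketch below shows how a blended compute cost can be estimated when part of the fleet runs on Spot Instances. The On-Demand rate and the Spot discount used here are hypothetical placeholders, not the prices or results behind Figure 1. Only the 80 percent Spot utilization figure comes from the simulation described above.

```python
def blended_hourly_cost(
    on_demand_rate: float, spot_discount: float, spot_fraction: float
) -> float:
    """Average hourly cost when spot_fraction of the hours run on Spot."""
    spot_rate = on_demand_rate * (1.0 - spot_discount)
    return spot_fraction * spot_rate + (1.0 - spot_fraction) * on_demand_rate


def savings_vs_on_demand(
    on_demand_rate: float, spot_discount: float, spot_fraction: float
) -> float:
    """Fraction saved relative to running everything On-Demand."""
    blended = blended_hourly_cost(on_demand_rate, spot_discount, spot_fraction)
    return 1.0 - blended / on_demand_rate


# Hypothetical inputs: $1.00 per hour On-Demand and a 60 percent Spot discount.
# The 80 percent Spot utilization matches the simulation described above.
savings = savings_vs_on_demand(
    on_demand_rate=1.00, spot_discount=0.60, spot_fraction=0.80
)
print(f"{savings:.0%}")  # 48%
```

In this simple model, the saving is the product of the Spot share and the Spot discount, so it ignores migration overhead, Savings Plans commitments, and price variation over time.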

Method of testing

The test bed prepared by AWS used an enterprise-grade, compute-intensive EDA tool to simulate a complex environment typical of semiconductor design workflows. With a workflow optimized for eight cores and 32 GB of RAM, the EDA tool ran jobs sequentially under the high-performance computing (HPC) scheduler SLURM. Integrating SLURM with Exostellar’s solution involved the following steps:

Figure 2. The process of integrating and optimizing Exostellar with SLURM

Integrate the Exostellar control plane and auto-scaling plugin for the existing SLURM node.
The plugin initiates scaling of the SLURM compute cluster by sending requests to the Exostellar control plane.
Exostellar provisions nested virtual machines (VMs) that join SLURM and run jobs, just as an end user would expect from a regular Amazon EC2 VM.
These nested VMs are visible in the job queue and support all existing SLURM functionalities.
Exostellar employs live migration during a job run, dynamically adjusting the SLURM compute nodes to the spot market, optimizing costs without network connection interruptions or job terminations.
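
As a rough illustration of the scaling step in the list above, the sketch below estimates how many additional compute nodes an auto-scaling plugin might request for a queue of pending jobs. It is a hypothetical example, not the Exostellar plugin’s logic or API. The function name and the node shapes are assumptions, and only the per-job footprint of eight cores and 32 GB comes from the test setup.

```python
import math

# Per-job footprint from the test setup: eight cores and 32 GB of RAM.
JOB_CORES = 8
JOB_MEMORY_GB = 32


def nodes_to_request(
    pending_jobs: int,
    idle_nodes: int,
    node_cores: int = 8,
    node_memory_gb: int = 32,
) -> int:
    """Estimate how many extra nodes are needed to start all pending jobs."""
    # A node can host as many jobs as its scarcer resource allows.
    jobs_per_node = min(node_cores // JOB_CORES, node_memory_gb // JOB_MEMORY_GB)
    if jobs_per_node < 1:
        raise ValueError("node shape is too small for a single job")
    nodes_needed = math.ceil(pending_jobs / jobs_per_node)
    # Idle nodes already in the cluster absorb part of the demand.
    return max(0, nodes_needed - idle_nodes)


print(nodes_to_request(pending_jobs=10, idle_nodes=3))  # 7
print(nodes_to_request(pending_jobs=5, idle_nodes=0, node_cores=16, node_memory_gb=64))  # 3
```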

The implementation of Exostellar’s nested virtualization, depicted in Figure 2, integrates seamlessly with SLURM, which has all the functionality of load sharing facility (LSF) software. This integration enhances dynamic resource management, simplifies VM migrations, boosts the scalability of VMs, and improves overall workload performance. Importantly, this setup facilitates the efficient use of Spot Instances, specifically optimizing the usage of C6i, M6i, and R6i instances, without affecting performance.

Achieving unparalleled cost savings and reliability on AWS for EDA

The results of the large-scale testing demonstrate unparalleled efficiency and reliability in cloud infrastructure management:

Cost efficiency: The testing showed a 32 percent cost savings as compared with traditional job allocation strategies on cloud resources. This significant reduction, achieved without up-front payment or commitment, underscores the financial viability and superiority of Exostellar’s Infrastructure Optimizer.
Robust stress testing: With workloads migrated every 110 seconds, for a total of 1,990 migrations across 427 jobs, the system incurred only 7 percent additional overhead, with no failures or changes in workload behavior. This result attests to the solution’s reliability and transparency.
Seamless integration: The solution significantly reduced onboarding friction, facilitating a smooth transition for EDA tools onto the cloud platform, thus streamlining the semiconductor design process.

Impact of the Infrastructure Optimizer on live migration

The worst-case scenario analysis involved 427 runs on a 32 GB machine, encompassing a total of 1,990 migrations. Despite the high frequency of migrations, the impact on runtime was minimal, with an average increase of only 40 seconds per run, translating to an added 8.6 seconds per migration. This efficiency underscores the scalability of Infrastructure Optimizer virtualization, which is designed to adapt in a linear manner to varying memory sizes and usage patterns. Beyond runtime efficiency, the tests also explored the broader implications of this technology on licensing and network performance. The use of a secondary ENI to conceal migrations from licensing mechanisms and the strategic management of additional IP addresses meant that migration throughput remained unaffected by production loads, maintaining the integrity and reliability of network operations.
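
The per-migration figure follows directly from the reported totals, as the short calculation below shows. The last line is a derived estimate rather than a reported result: it assumes that the 7 percent overhead quoted earlier and the 40-second average increase refer to the same set of runs.

```python
RUNS = 427
MIGRATIONS = 1990
ADDED_SECONDS_PER_RUN = 40
OVERHEAD_FRACTION = 0.07  # the 7 percent overhead reported for the stress test

# Average number of live migrations each run experienced.
migrations_per_run = MIGRATIONS / RUNS
# Average runtime added by a single migration.
added_seconds_per_migration = ADDED_SECONDS_PER_RUN / migrations_per_run
# Derived estimate only: the baseline runtime implied by the two figures.
implied_baseline_seconds = ADDED_SECONDS_PER_RUN / OVERHEAD_FRACTION

print(f"{migrations_per_run:.2f} migrations per run")           # 4.66
print(f"{added_seconds_per_migration:.1f} s per migration")      # 8.6
print(f"{implied_baseline_seconds:.0f} s implied baseline run")  # 571
```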

Conclusion

The large-scale testing that AWS facilitated to validate Exostellar’s capabilities yielded results that show the potential for substantial advances in cloud infrastructure optimization for semiconductor design. In addition, Exostellar was recognized as a Gartner Cool Vendor for enabling efficient cloud operations. Collaborating with Arm and AWS, Exostellar optimized Arm’s cloud infrastructure using its AI-powered Infrastructure Optimizer, which autonomously migrates EDA workloads to the optimal instance type, ensuring scalability, cost efficiency, and operational stability. This enables semiconductor companies to maintain reliable and performant cloud environments for EDA.

Through comprehensive testing and strategic innovations, such as the use of secondary ENIs and the Infrastructure Optimizer, the pairing of Exostellar’s solution with the AWS Cloud meets the stringent demands of EDA processes. The ability to sustain strong network performance and manage licensing efficiently during high-load operations and migrations highlights the practical benefits of this method.

As semiconductor manufacturers increasingly turn to cloud solutions, the integration of AWS services and Exostellar’s solution is an example of how cloud infrastructure can be effectively tailored to meet specific industry needs while offering significant cost benefits. This approach is enhancing the operational capabilities of semiconductor design, affirming the value of cloud technology in this highly specialized field.

AWS for Industries