Executive Summary
For over 30 years, NVIDIA has been building the advanced graphics and compute technologies that power everything from autonomous vehicles, data centers, and gaming to the generative AI revolution. Each NVIDIA GPU and AI accelerator is the result of years of chip design that relies on advanced electronic design automation (EDA) workflows and massive compute, storage, and network resources. EDA tools automate the design, verification, and testing of complex integrated circuits, running demanding concurrent workloads that can span tens of thousands of processor cores and access hundreds of billions of files. Traditionally, these workloads have been supported by on-premises infrastructure in global data centers.
When the COVID-19 pandemic strained supply chains, slowed the building of new data centers, and impacted project timelines, NVIDIA needed additional capacity to test its complex workflows. The company determined that the best way to add that capacity was to scale using the cloud. NVIDIA used Amazon Web Services (AWS) to supplement its on-premises environment with a hybrid solution, gaining access to a wide range of resources, including memory- and performance-optimized instances, to help it scale and test its future chipsets.
About NVIDIA
Founded in 1993, NVIDIA builds advanced graphics, compute, and AI technologies. Its products power work worldwide in fields such as data science, AI research, and high-performance computing.
Opportunity | Using AWS to extend infrastructure for NVIDIA
NVIDIA, an AWS Partner, had a chip design environment that relied entirely on on-premises data centers. Although this setup offered control and efficiency, significant supply chain risks had the potential to disrupt NVIDIA’s EDA capacity build program. To mitigate this risk, the company developed a cloud-based optionality strategy that would provide additional compute capacity if needed.
Before this project, moving EDA compute to the cloud was impractical because the total cost of ownership for cloud EDA was estimated to be as much as 10 times higher than on premises when using on-demand public pricing. In addition, many EDA workflows relied on specific storage features that cloud offerings did not fully provide. Storage from NetApp, a leading enterprise storage vendor and AWS Partner, was integral to these workflows. At the same time, supply chain shortages made adding physical infrastructure challenging, and global work patterns pushed NVIDIA's design workloads higher than ever. The company needed a way to scale without building new data centers, refactoring applications, or breaking engineering workflows for critical projects.
Solution | Extending design capacity with a hybrid infrastructure
Rather than replacing its existing infrastructure, the company created a hybrid approach. NVIDIA used Amazon Elastic Compute Cloud (Amazon EC2) for secure and resizable compute capacity and Amazon FSx for NetApp ONTAP (FSxN) for fully managed shared storage built on NetApp's popular ONTAP file system. "We got into this as an opportunity to scale our compute into more places beyond what we could build on premises," says Sharon Clay, vice president of GPU engineering at NVIDIA. Adopting Amazon EC2 and FSxN helped teams scale large simulation workloads without completely rewriting design workflows.

The goal wasn't to move everything to the cloud but to add flexibility where it mattered most. NVIDIA kept compilation and sensitive workflows in its data centers while running large simulation jobs in the cloud. This hybrid model lets design teams scale in the cloud for the most compute-intensive stages while still relying on their existing investments for regular production.

Storage was the central challenge, so NVIDIA selected FSxN to preserve caching and other NetApp capabilities that are essential to EDA workflows. The team discovered that some workflow modifications were necessary to achieve optimal performance: certain workflows required inline FlexCache support, and to maintain high performance, engineers redirected file output to a local write-shunt filer on AWS. Because a write-shunt filer on FSxN could support only up to 5,000 parallel jobs, each cluster used five target filers. NVIDIA conducted months of testing to identify bottlenecks and drive optimizations that increased throughput.
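To make the capacity math concrete: at 5,000 parallel jobs per write-shunt filer, five target filers imply clusters running on the order of 25,000 parallel jobs. The Python sketch below illustrates that sizing and one plausible way to spread jobs across filers; the function names and the modulo sharding policy are illustrative assumptions, since the case study doesn't describe how jobs were actually distributed.

```python
import math

# Per-filer ceiling from NVIDIA's testing: one FSxN write-shunt
# filer sustained up to 5,000 parallel jobs.
MAX_JOBS_PER_FILER = 5_000

def filers_needed(parallel_jobs: int) -> int:
    """Number of write-shunt filers a cluster needs for a given job count."""
    return math.ceil(parallel_jobs / MAX_JOBS_PER_FILER)

def assign_filer(job_id: int, num_filers: int) -> int:
    """Hypothetical placement policy: a simple modulo shard that spreads
    jobs evenly so each filer stays under its parallel-job ceiling."""
    return job_id % num_filers

# A cluster of 25,000 parallel jobs lands on five target filers,
# matching the configuration described above.
print(filers_needed(25_000))                        # -> 5
print(assign_filer(12_345, filers_needed(25_000)))  # -> 0
```

Any even-spread policy would do here; the point is that the per-filer parallel-job ceiling, not raw storage capacity, set the number of target filers per cluster.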
AWS provided custom instance configurations and special-purpose Amazon EC2 instances that were tuned for NVIDIA’s EDA workloads. NVIDIA also ran full-scale tests that validated compute and storage performance. “When we needed credits, test hardware, executive support, or extra capacity, the AWS team was there to help,” says Anoop Jayadevan, vice president of infrastructure and enterprise security at NVIDIA. The result was a validated hybrid architecture. The company can scale compute elastically using Amazon EC2 and preserve storage compatibility using FSxN while minimizing disruption to engineers. “Our innovative approach helped us identify what infrastructure we needed to get the most out of what was available at AWS,” says Bill Steinmetz, distinguished engineer and manager of ASIC infrastructure at NVIDIA.
Outcome | Making capacity gains with a long-term vision
NVIDIA's design teams now regularly run 15–20 types of EDA design flows simultaneously in the cloud, which helps them respond dynamically to peaks in demand without investing in overprovisioned storage or waiting for new infrastructure to come online. "The hybrid approach gave critical teams the extra capacity they needed without moving storage," says Steinmetz. Storage throughput improvements delivered tangible gains: engineers now have more flexibility to run workloads across AWS Availability Zones, reducing wait times and improving resource usage for large jobs at a global scale. The company's progress demonstrates that EDA on AWS is massively scalable, practical, and productive.

Looking ahead, NVIDIA plans to keep optimizing cost and performance to become even more efficient in the cloud. The company is also exploring GPU acceleration for EDA workloads, extending the same innovations that transformed AI and graphics into the chip design process and showing what's possible for the wider semiconductor industry. "The cloud can be an outstanding player alongside on-premises systems," says Clay.