Western Digital Performs Cloud-Scale Simulation Using AWS HPC and Amazon EC2 Spot Instances
Western Digital is a leading data infrastructure company that provides hard-disk drives (HDDs), solid state drives (SSDs), fabrics, and storage platforms. In the competitive HDD market, the company's R&D team constantly seeks to innovate and meet the demands of hyperscale data centers. One goal of its research is to create the highest-capacity drives with the best total cost of ownership (TCO) and get those drives to market sooner. When creating HDDs, Western Digital looks for ways to pack ever more data into the same form factor. The company investigates thousands of combinations of materials, rotational speeds, and operational characteristics to find the one that delivers the highest data density and the fastest, most reliable read-write times.
“Our ability to model and simulate more and more complex combinations allows us to produce only drives that will succeed. That knowledge ultimately benefits our customers and, in turn, their customers,” says David Hinz, senior director of global engineering services and cloud computing for Western Digital.
Testing the Outer Limits of Cloud HPC
To test the limits of cloud HPC, Western Digital built a cloud-scale, high-performance computing (HPC) cluster with Amazon Web Services (AWS) and used it to simulate elements of new head designs for its next-generation HDDs. “Executing HPC simulations on AWS infrastructure allows Western Digital to produce a higher quality product and achieve a faster time-to-market,” says Hinz.
For years, Western Digital has been running large virtual CPU (vCPU) clusters powered by Amazon Elastic Compute Cloud (Amazon EC2) using Amazon EC2 Spot Instances, escalating from 8,000 to 16,000 to 32,000 vCPUs. An advantage of running an HPC simulation in the cloud is the ability to flex to use an extreme scale of infrastructure without upfront investment or long-term commitment.
Always looking to accelerate time-to-market, the R&D team wanted to significantly expand its available compute resources. A team of IT and R&D leads decided to enlarge the available cluster to one million vCPUs so that designs for future HDDs could be finalized on a shorter development schedule. The million-vCPU run modeled a large number of head designs to discover whether adding energy, as heat or microwaves, could shrink the area needed to store each bit of information. Successful simulations would mean that more storage could fit on an HDD with the same physical form factor. “The goal is to deliver higher density products to store more information and consume less power within data centers, so everyone benefits,” says Hinz.
Performing Weeks of Work in Hours, All at a Discount
As part of its simulation environment, Western Digital used Univa Grid Engine and AWS Batch. AWS Batch can efficiently run hundreds of thousands of batch computing jobs on AWS. The tool dynamically provisions resources based on the requirements of the batch jobs submitted. “Our million-core run helped us get stronger and more efficient in our use of AWS Batch,” notes Hinz. “We’ve found this product to offer a powerful capability in natively supporting large-scale, high-performance computing.”
Western Digital used AWS Batch to help plan, schedule, and execute its HPC workload on Amazon EC2 Spot Instances, which let users purchase unused Amazon EC2 capacity at up to a 90 percent discount compared to Amazon EC2 On-Demand Instances. Amazon EC2 Spot Instances provided significant cost savings to the company on this large job. “Only by using Amazon EC2 Spot Instances could we work out the economics for such a large computational workload,” says Hinz.
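The case study does not describe how Western Digital packaged its tasks for AWS Batch. As an illustration only, the sketch below shows one way a large task count could be split into AWS Batch array jobs, each of which can hold at most 10,000 child jobs. The function name, queue name, job definition name, and the TASK_OFFSET environment variable are hypothetical. The retry setting is also an assumption: it asks AWS Batch to attempt a failed job again, which is a common choice when running on Spot capacity.

```python
from typing import Any

# AWS Batch allows between 2 and 10,000 child jobs per array job.
MAX_ARRAY_SIZE = 10_000


def plan_array_jobs(
    total_tasks: int,
    job_queue: str,
    job_definition: str,
    name_prefix: str = "sim",
    attempts: int = 3,
) -> list[dict[str, Any]]:
    """Build keyword arguments for one AWS Batch submit_job call per chunk.

    All names passed in are placeholders chosen by the caller; nothing here
    reflects Western Digital's actual configuration.
    """
    requests: list[dict[str, Any]] = []
    start = 0
    while start < total_tasks:
        size = min(MAX_ARRAY_SIZE, total_tasks - start)
        request: dict[str, Any] = {
            "jobName": f"{name_prefix}-{len(requests):04d}",
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            # Assumed policy: let AWS Batch re-run a job that fails,
            # for example after its Spot Instance is reclaimed.
            "retryStrategy": {"attempts": attempts},
            # Hypothetical variable: a worker could add this offset to
            # AWS_BATCH_JOB_ARRAY_INDEX to find its global task number.
            "containerOverrides": {
                "environment": [{"name": "TASK_OFFSET", "value": str(start)}]
            },
        }
        # An array job needs at least 2 children; a lone leftover task
        # is planned as an ordinary single job instead.
        if size >= 2:
            request["arrayProperties"] = {"size": size}
        requests.append(request)
        start += size
    return requests


plan = plan_array_jobs(2_500_000, "example-spot-queue", "example-head-sim")
print(len(plan))  # 250
```

Each dictionary could then be passed to the boto3 Batch client, for example `boto3.client("batch").submit_job(**request)`, given valid credentials and an existing job queue and job definition.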
The data that powered this HPC simulation was read from Amazon Simple Storage Service (Amazon S3), which allows the storage and retrieval of any amount of data from anywhere. Amazon S3 scaled seamlessly during the simulation run, supporting the fast rate of data access without the need for additional tuning. Run on AWS infrastructure, the simulation performed 2.5 million tasks using 1 million vCPUs. When using Amazon EC2 Spot Instances for any of the HPC cluster sizes (8,000/16,000/32,000/1,000,000 vCPUs), the total cost to successfully complete the simulation was half the cost of running the simulation on an on-premises cluster. This run used a combination of C-series, R-series, and M-series Amazon EC2 instances and spanned six Availability Zones in a US East Region.
During the simulation, 1.5 percent of instances terminated and were then automatically replaced. The vast majority, 98.5 percent, ran continuously. “This was only possible because of AWS,” says Hinz. “We could not have replicated this on premises. The compute capacity this simulation required shrank 20 calendar days of work down to 8 hours. That’s a compute footprint that Western Digital does not have the resources to build out, maintain, and use on a regular basis.”
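The figures quoted above can be turned into a few back-of-the-envelope numbers. The short sketch below is not from the case study; it assumes that "20 calendar days" means 20 full 24-hour days and that all 1 million vCPUs were available for the full 8 hours, so the results are rough upper bounds rather than measured values.

```python
# Figures taken from the case study.
TASKS = 2_500_000
VCPUS = 1_000_000
CLOUD_HOURS = 8
BASELINE_DAYS = 20


def speedup(baseline_days: float, cloud_hours: float) -> float:
    """Wall-clock speedup, assuming 24 working hours per calendar day."""
    return baseline_days * 24 / cloud_hours


def vcpu_hours(vcpus: int, hours: float) -> float:
    """Total compute budget if every vCPU ran for the whole window."""
    return vcpus * hours


budget = vcpu_hours(VCPUS, CLOUD_HOURS)

print(speedup(BASELINE_DAYS, CLOUD_HOURS))  # 60.0
print(budget)                               # 8000000
print(budget / TASKS)                       # 3.2 (max average vCPU-hours per task)
```

Under these assumptions, the run corresponds to roughly a 60-fold reduction in wall-clock time and a ceiling of about 3.2 vCPU-hours per task on average.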
Using HPC Experiments to Deliver Better Products Faster
Hinz compares such far-reaching experiments to lunar or Mars missions, in which satellites or spacecraft collect and send data continuously. Then it’s the job of humans to analyze the data and determine what to do next. The same is true for these large simulation runs, which deliver data that is mined for information for months after the actual simulations are complete, providing a steady stream of insights that lead to product improvements. “Collecting data in a timely fashion allows us to figure out what should be our next set of experiments to run,” says Hinz. “For our teams, executing HPC experiments is about time-to-product. The faster we complete our research, the sooner teams can complete engineering and design, and the sooner products can get to market.”
A longtime customer of AWS, Western Digital sees more opportunities for future collaboration as AWS develops new products and expands the capabilities of existing ones. Hinz says, “We’re always asking if AWS products can help solve a business problem that we’re encountering. I expect our continued relationship to center on the computing power and structured solutions provided by AWS that allow Western Digital to keep innovating, experimenting, and delivering the best possible products.”
About Western Digital
Western Digital is a data infrastructure company that provides hard-disk drives (HDDs), solid state drives (SSDs), fabrics, and storage platforms. As part of the competitive HDD market, the company’s R&D teams constantly search for ways to innovate, create better drives, and go to market faster.
Benefits of AWS
- HPC simulation performs 2.5 million tasks, using 1 million vCPUs, in only 8 hours
- HPC simulation tests products efficiently, enabling faster time-to-market
- Cloud HPC compute power shrinks workload processing time from weeks to hours
AWS Services Used
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.
Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud. Spot Instances are available at up to a 90% discount compared to On-Demand prices.
AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.