The 1918 Spanish Influenza pandemic stands as one of the deadliest flu pandemics in human history, killing 50 million and infecting an estimated 500 million worldwide. Many experts believe it’s not a question of if, but when, an epidemic of even greater magnitude will occur. The flu is but one example of an infectious disease with the power to cause catastrophic human and economic losses across a global population.

While it may be impossible to prevent future epidemics, analyzing trends and events of the past while controlling for the conditions of the present can assist in modeling and mitigating risk. Metabiota sits at the intersection of epidemiology and data science. The company helps its clients conduct risk analysis and assess the probability of human and financial losses caused by a potential epidemic through its comprehensive infectious disease platform.

“You can’t necessarily eliminate pandemics or epidemics, but you can minimize their impact,” says Mike Gahan, senior data scientist at Metabiota.

Metabiota uses hundreds of different data sources to build a range of models that run tens of millions of plausible event simulations. Each simulation represents a scientifically plausible, hypothetical scenario that accounts for variables that can affect the severity of epidemics. “With each client, our goal is to put ourselves in the client's shoes and think about what they’re trying to accomplish,” says Cathine Lam, data scientist at Metabiota. “For example, how are governments hoping to use insights from our data to understand and mitigate risk to better address citizens’ needs in the face of an epidemic? How are insurance companies looking to leverage insights from our data to design a comprehensive insurance policy that mitigates economic losses and covers appropriate pathogens?”  

To develop the company’s infectious disease model library, which includes a 1-million-year stochastic event catalog, Metabiota and its university partners need to test the different infectious disease data catalogs they build by running probabilistic simulation models at massive scale. For example, testing the influenza catalog requires running tens of millions of simulations in a week's time. Metabiota and its university partners also develop proprietary source code for their models that is not shared between organizations.

Metabiota initially sought to manage its custom code creation and simulation modeling while simultaneously managing and monitoring the High Performance Computing (HPC) environment running on AWS and used for simulation testing by key collaborators. Managing the underlying infrastructure meant Metabiota’s data science team had less time to focus on research and model development. Furthermore, pinpointing the source of errors encountered while running millions of simulations became a drain on both Metabiota’s time and its resources.

“In the past, scaling issues occurred, which cost substantial amounts of time and money,” says Gahan. “We realized we needed a more organized and robust way to approach this process. Then entered Rescale.”

Rescale, an AWS Partner Network (APN) Advanced Technology Partner, has one key mission: to help organizations seamlessly run compute-intensive workloads of any size and scale by harnessing the power of cloud computing and the enterprise readiness of ScaleX, Rescale’s HPC software as a service (SaaS) platform that integrates with AWS. “What we are ultimately seeking to do is make HPC resources as accessible and enterprise-ready as possible,” says Matt McKee, director of Americas sales at Rescale. Building integration of the Rescale platform with AWS was, according to the Rescale team, a no-brainer. “We use somewhere around 30 different AWS services to deliver the full SaaS solution on AWS. AWS has made it very easy for us to integrate with its platform and utilize its services to deliver value to our clients.”

Metabiota approached Rescale to understand how the Rescale platform could address its particular challenges. Rescale evaluated Metabiota’s requirements, taking into consideration the team’s need for data access on the platform to be strictly governed according to the different organizations involved. “To test out the solution, we gave Rescale our code, one of our university partners gave Rescale their code, and we attempted to do simulation testing for one of our virus catalogs,” explains Gahan. “We ran a few thousand test simulations and then scaled up to millions of test simulations running on Rescale. We were able to integrate proprietary code bases seamlessly, quickly identify errors within the scaling code, track each error down, and correct each error using Rescale much more quickly than we were able to run simulations on our own.”

Metabiota’s environment is unique because of the collaborative nature of the work: the company and its key collaborators run the simulations together at scale on Rescale. Most of the scientific coding work is done in C++ and R.

“We have to start with all of these different parameters and characteristics that define an outbreak, such as how quickly it spreads from person-to-person, where the outbreak starts, and how long individuals stay infected. We have hundreds of thousands of different combinations of these characteristics that we run. Each combination is run one at a time and is typically a combination of C++ and R,” says Gahan. As the reproducibility of its environments is crucial for government and insurance clients, each run is wrapped in a Docker image.

“Metabiota’s collaborator is the user running the job on Rescale. The organization has its buckets in Amazon Simple Storage Service (Amazon S3), and those buckets have all of the input files for their simulation,” says Ryan Kaneshiro, chief architect at Rescale. “The input files are grouped by run, so there’s a prefix for all of the input files that are part of a particular run. When the company submits a job to Rescale, they provide us with a .csv file that has a list of the runs they want to execute. They also specify the instance type they want to run on and the number of those instance types. Rescale uses Amazon Elastic Compute Cloud (Amazon EC2) Compute Optimized Instances, powered by Intel® Xeon® processors, and Amazon Simple Queue Service (Amazon SQS) to run these jobs. The rows within the .csv file Metabiota’s collaborator provides us are batched into chunks and become messages on an SQS queue. We then provision many Amazon EC2 instances that connect to that SQS queue and pull down the run information,” says Kaneshiro.
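
The batching step Kaneshiro describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Rescale's implementation: the column name `run_prefix`, the chunk size, and the JSON message layout are all hypothetical, and the hand-off to Amazon SQS is left out so the sketch runs with only the Python standard library.

```python
import csv
import io
import json
from typing import Iterator


def batch_runs(csv_text: str, chunk_size: int = 3) -> Iterator[str]:
    """Yield JSON message bodies, each holding up to chunk_size CSV rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield json.dumps({"runs": chunk})
            chunk = []
    if chunk:  # emit the final, possibly smaller, chunk
        yield json.dumps({"runs": chunk})


# Hypothetical run list: one row per run, identified by its S3 key prefix.
RUN_LIST = "run_prefix\nruns/0001/\nruns/0002/\nruns/0003/\nruns/0004/\n"

messages = list(batch_runs(RUN_LIST, chunk_size=3))
for body in messages:
    print(body)
```

Each yielded string would become the body of one queue message. Grouping several runs per message means a worker makes fewer queue requests, at the cost of coarser retries if a message fails partway through.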

The Rescale platform then downloads the input files from the collaborator’s Amazon S3 bucket, runs the simulation, and, when the simulation is finished, uploads the output to Metabiota’s Amazon S3 bucket for further downstream processing. “Using Rescale is critical for us to be able to do this in a matter of a week or a few days, rather than a few months,” says Gahan. Metabiota’s pandemic influenza event catalog required 18 million simulations, which took over 90,000 compute hours, created over 11.4 billion I/O requests, and produced 100 TB of uncompressed data. Rescale also helps Metabiota take advantage of Amazon EC2 Spot Instances to run jobs at off-peak times for optimal cost savings.
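
To put the catalog figures in per-simulation terms, the short calculation below divides each total by the 18 million simulations. It is only a back-of-the-envelope sketch: the compute-hour and I/O totals are quoted as "over" a threshold, so those averages are lower bounds, and treating 1 TB as 10^12 bytes is an assumption.

```python
SIMULATIONS = 18_000_000
COMPUTE_HOURS = 90_000             # quoted as "over", so a lower bound
IO_REQUESTS = 11_400_000_000       # quoted as "over", so a lower bound
OUTPUT_BYTES = 100 * 10**12        # 100 TB, assuming decimal terabytes

# Average compute time, I/O requests, and output size per simulation.
seconds_per_sim = COMPUTE_HOURS * 3600 / SIMULATIONS
io_per_sim = IO_REQUESTS / SIMULATIONS
megabytes_per_sim = OUTPUT_BYTES / SIMULATIONS / 10**6

print(f"{seconds_per_sim:.0f} s, {io_per_sim:.0f} I/O requests, "
      f"{megabytes_per_sim:.2f} MB per simulation")
```

On these numbers, an average simulation takes at least about 18 seconds of compute, issues roughly 630 I/O requests, and writes around 5.6 MB of uncompressed output.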

By using Rescale on AWS, the Metabiota data science team and their collaborators can focus on their data and simulation modeling rather than infrastructure procurement and management. “I don’t believe we’d be able to take on a project like this without Rescale and AWS,” says Gahan. “The ability to experiment and to create something that may not work initially, but to be able to try again until we get it right without enormous cost consequences is an enormous benefit for us. Using Rescale on AWS makes it so much easier for us to test and fail, test and fail, and eventually succeed to build a robust solution. It’s liberating.” The Rescale platform tracks failed runs and provides an API method to retrieve a .csv file listing them. This makes it easy for Metabiota’s collaborators to re-execute failed runs on a different instance type with more memory.
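
The retry workflow for failed runs might look something like the sketch below. Everything specific here is hypothetical: the `run_prefix` column, the instance type names, and the shape of the resubmitted job are assumptions for illustration, and the call that would fetch the failed-run file from the Rescale API is not shown because the source does not describe it.

```python
import csv
import io

# Hypothetical instance type names ordered by memory; not from the source.
NEXT_LARGER = {"small-mem": "medium-mem", "medium-mem": "large-mem"}


def build_retry_job(failed_csv: str, current_type: str) -> dict:
    """Build a new job spec that reruns the failed runs on a larger instance type."""
    reader = csv.DictReader(io.StringIO(failed_csv))
    runs = [row["run_prefix"] for row in reader]
    # Fall back to the current type if no larger one is known.
    return {
        "instance_type": NEXT_LARGER.get(current_type, current_type),
        "runs": runs,
    }


# Example failed-run file, in the same assumed layout as the original run list.
FAILED = "run_prefix\nruns/0002/\nruns/0004/\n"
retry_job = build_retry_job(FAILED, current_type="small-mem")
print(retry_job)
```

Because the failed-run list has the same layout as the original run list in this sketch, the retry can be submitted through the same path as a first attempt, with only the instance type changed.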

Metabiota has also driven significant cost savings by using Amazon EC2 Spot Instances to run simulations at various times of the day. “Using Amazon EC2 Spot Instances has been a source of tremendous savings for us,” explains Gahan. Metabiota has been able to save upwards of 60 to 70 percent of its compute costs by taking advantage of Amazon EC2 Spot Instances.

“Today, we can focus more of our time on building robust models and allow Rescale to use a cluster of AWS Instances to scale our production up to tens of millions of simulations,” says Kierste Miller, data scientist at Metabiota. “One of the biggest advantages we have in our market is our speed and the agile methods we use. As an organization focused on optimizing the user experience, we're able to be more client-focused and modify our methods in order to suit our clients' timelines better. That's where our compute power is critical. We wouldn't have been able to create multiple solutions as quickly as we did without high performance computing.”

Rescale, an AWS Partner Network (APN) Advanced Technology Partner, is a global leader for HPC in the cloud, helping companies worldwide innovate and perform groundbreaking research and development faster at a lower cost. Rescale platform solutions integrate with AWS to transform traditional fixed IT resources into flexible resources running on AWS. Rescale offers hundreds of turnkey software applications on the platform which are instantly cloud-enabled for the enterprise.

For more information, contact Rescale through its listing on the APN Partner Solution Finder or on its website.
