AWS HPC Blog

Category: Life Sciences

Benchmarking NVIDIA Clara Parabricks Somatic Variant Calling Pipeline on AWS

Somatic variants are genetic alterations that are not inherited but acquired during one’s lifespan, for example those that are present in cancer tumors. In this post, we will demonstrate how to perform somatic variant calling from matched tumor and normal genome sequence data, as well as from tumor-only whole genome and whole exome datasets, using an NVIDIA GPU-accelerated Parabricks pipeline, and compare the results with baseline CPU-based workflows.

Read More
Figure 2: Identification of redun jobs and grouping them into Array Jobs to run on AWS Batch. (Top) redun represents the workflow as an Expression Graph (top-left), and identifies reductions (red boxes) that are ready to be executed. The redun Scheduler creates a redun Job (J1, J2, J3) for each reduction and dispatches those jobs to Executors based on the task-specific configuration. The Batch Executor allows jobs to accumulate for up to three seconds (default) in order to identify compatible jobs for grouping into an Array Job, which is then submitted to AWS Batch (top-right). (Bottom) As jobs complete in AWS Batch, their success (green) or failure (red) is propagated back to the Executors and the Scheduler, and the results are eventually substituted back into the Expression Graph (bottom-left).

Data Science workflows at insitro: how redun uses the advanced service features from AWS Batch and AWS Glue

Matt Rasmussen, VP of Software Engineering at insitro, expands on his first post on redun, insitro’s data science tool for bioinformatics, to describe how redun makes use of advanced AWS features. Specifically, Matt describes how AWS Batch’s Array Jobs are used to support workflows with large fan-out, and how AWS Glue’s DynamicFrame is used to run computationally heterogeneous workflows with different back-end needs such as Spark, all in the same workflow definition.
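The grouping step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not redun's implementation: the `Job` fields and the compatibility rule (same container image, vCPUs, and memory) are assumptions made for illustration, and the three-second accumulation window is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job description; these fields are illustrative only.
    name: str
    image: str
    vcpus: int
    memory_gb: int


def group_into_array_jobs(pending):
    """Group pending jobs that share the same resource configuration.

    Each returned list stands in for one array job submission.
    """
    groups = defaultdict(list)
    for job in pending:
        # Assumed compatibility key: jobs with identical image and resources.
        key = (job.image, job.vcpus, job.memory_gb)
        groups[key].append(job)
    return list(groups.values())


pending = [
    Job("J1", "aligner:1.0", 4, 16),
    Job("J2", "aligner:1.0", 4, 16),
    Job("J3", "caller:2.1", 8, 32),
]
array_jobs = group_into_array_jobs(pending)
```

Here J1 and J2 share a configuration and would be submitted together, while J3 would be submitted on its own.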

Read More
Figure 1: Evaluating a sequence alignment workflow using graph reduction. In redun, workflows are represented as an Expression Graph (left), which contains concrete value nodes (grey) and Expression nodes (blue). The redun scheduler identifies tasks that are ready to execute by finding subtrees that can be reduced (red boxes), substituting task results back into the Expression Graph (red arrows). The scheduler continues to find reductions until the Expression Graph reduces to a single concrete value (grey box, far right). If any reduction has been done before (determined by comparing an Expression's hash), the redun scheduler can replay the reduction from a central cache and skip task re-execution.

Data Science workflows at insitro: using redun on AWS Batch

Matt Rasmussen, VP of Software Engineering at insitro, describes their recently released, open-source data science framework, redun, which allows data scientists to define complex scientific workflows that scale from their laptop to large-scale distributed runs on serverless platforms like AWS Batch and AWS Glue. In this post, Matt shows how redun lends itself to bioinformatics workflows, which typically involve wrapping Unix-based programs that require file staging to and from object storage. In the next blog post, Matt describes how redun scales to large and heterogeneous workflows by leveraging AWS Batch features such as Array Jobs and AWS Glue features such as Glue DynamicFrame.
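The graph-reduction idea in the Figure 1 caption can be sketched in a few lines of Python. This is a toy recursive evaluator, not redun's scheduler or API: the `Expr` class, the hashing scheme, and the `align` and `merge` tasks are all hypothetical names introduced for illustration.

```python
import hashlib


class Expr:
    """A deferred task call: a function plus arguments (values or other Exprs)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args


def expr_hash(func, args):
    # Assumed hashing scheme: task name plus the repr of its concrete arguments.
    payload = repr((func.__name__, args)).encode()
    return hashlib.sha256(payload).hexdigest()


def evaluate(node, cache, calls):
    """Reduce an expression tree to a concrete value, reusing cached results."""
    if not isinstance(node, Expr):
        return node  # already a concrete value
    # Reduce child expressions first so the arguments are concrete.
    args = tuple(evaluate(a, cache, calls) for a in node.args)
    key = expr_hash(node.func, args)
    if key not in cache:
        calls.append(node.func.__name__)  # record that the task really ran
        cache[key] = node.func(*args)
    return cache[key]


def align(reads, reference):
    return f"aligned({reads},{reference})"


def merge(a, b):
    return f"merged({a},{b})"


workflow = Expr(merge, Expr(align, "r1.fq", "ref.fa"), Expr(align, "r2.fq", "ref.fa"))
cache, calls = {}, []
first = evaluate(workflow, cache, calls)   # runs align twice, then merge
second = evaluate(workflow, cache, calls)  # replayed entirely from the cache
```

On the second evaluation every hash is already in `cache`, so no task function is called again, which mirrors the replay behavior the caption describes.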

Read More

Running a 3.2M vCPU HPC Workload on AWS with YellowDog

OMass Therapeutics, a biotechnology company identifying medicines against highly validated target ecosystems, used YellowDog on AWS to analyze and screen 337 million compounds in 7 hours, a task that would have taken two months using an on-premises HPC cluster. YellowDog, based in Bristol in the UK, ran the drug discovery application on an extremely large, multi-region cluster in AWS with the AWS ‘pay-as-you-go’ pricing model. It provided a central, unified interface to monitor and manage AWS Region selection, compute provisioning, job allocation and execution. The entire workload completed in 65 minutes, enabling scientists to start work on analysis the same day, significantly accelerating the drug discovery process. In this post, we’ll discuss the AWS and YellowDog services we deployed, and the mechanisms used to scale to 3.2M vCPUs using multiple EC2 instance types across multiple regions in 33 minutes, running at a 95% utilization rate.

Read More

Virtual Screening of Novel Active Drug Compounds on AWS with Orion®

Computer-aided drug discovery (CADD) has been a key player in lowering the cost and speeding up the timeline for drug development. CADD uses high performance computing (HPC) resources to virtually screen databases with billions of molecules. It can speed up the search for potential drug molecules, and filter out molecules and compounds that are unsuitable. OpenEye Scientific developed Orion®, a cloud-based molecular design platform for CADD. Orion provides computational chemists with virtually unlimited HPC resources, along with data visualization, collaboration, and workflow management tools that help them perform calculations more efficiently. In this post, we describe the Orion architecture on AWS, and its capabilities to address the challenges in drug development.

Read More

GROMACS price-performance optimizations on AWS

Molecular dynamics (MD) is a simulation method for analyzing the movement and tracing the trajectories of atoms and molecules as the dynamics of a system evolve over time. MD simulations are used across various domains, such as materials science, biochemistry, and biophysics, and are typically used in two broad ways to study a system. The importance of […]

Read More