AWS for Industries

How Takeda Pharmaceuticals Accelerates Large-Scale Bioinformatics with AWS HealthOmics

In the rapidly evolving field of bioinformatics, transitioning from on-prem high-performance computing (HPC) environments, like Slurm clusters, to cloud-based solutions, like AWS HealthOmics, presents opportunities to reinvent for scale, speed, and cost. Jeff Lee is Associate Director of R&D Cloud Architecture at Takeda Pharmaceuticals, a patient-focused, R&D-driven biopharmaceutical organization founded in 1781. In this blog post, we show how, at Takeda, we migrated custom Nextflow pipelines from on-prem Slurm compute clusters to HealthOmics in two weeks. This reduced the analysis time for 20,000 RNAseq samples from six weeks to two days, allowing us to produce more science while reducing costs by 70%.

Opportunity

Takeda R&D’s north star is to accelerate clinical research in biomarkers that can be used to improve patient outcomes. The Computational Oncology team at Takeda had an initiative to analyze more than 20,000 RNA sequencing samples using Nextflow, to enable biomarker research for the diagnosis and treatment of cancer. However, this undertaking would have required us either to increase our on-prem compute capacity by 10X or to outsource the analysis, which would have been too expensive. Both options would have delayed analysis results by up to several weeks and carried high per-sample costs. We needed a scalable, cost-effective solution that enables data provenance and reduces analysis time. We collaborated closely with the Computational Oncology team to understand their workload, and decided to use HealthOmics.

HealthOmics, a fully managed service which helps customers store and generate insights from bioinformatics data to improve health and accelerate scientific discoveries, provided us an option that satisfied needs for scale, cost, speed, tooling compatibility, and data provenance.

First, HealthOmics enables reproducibility for each sample analysis. Analysis workflows are given unique IDs, and their runtime configurations are immutable and recorded in Amazon CloudWatch Logs. This is a critical requirement for genomic research and facilitates qualifying HealthOmics for GxP workloads at Takeda in the future.

Second, HealthOmics allows Takeda scientists to execute analysis pipelines autonomously and on-demand, with scale capable of processing thousands of samples concurrently, eliminating HPC queuing and capacity planning. We went from processing 300 RNAseq samples per day to 10,000 samples per day, and could do even more if needed. This increased throughput and accessibility put critical data in scientists’ hands within 24 hours instead of weeks.

Additionally, HealthOmics allows Takeda to add metadata tags to the sequencing results. This is critical for being able to search for data across multiple studies and therapeutic areas, thus eliminating data silos within studies. HealthOmics makes automated data tagging easy. This enables us to join and view data based on multiple factors, leading to more insights and discoveries. It also sets us up for downstream ML model training.

Finally, we achieved over 70% cost savings per sample by using compute metrics logging in HealthOmics to optimize compute utilization in analysis workflows. Takeda’s research teams can easily forecast costs per sample with the straightforward cost calculation that HealthOmics provides.

“This also increases our net analysis capacity – for the same amount of budget our scientists can analyze more samples than previously planned and have a clearer view of their cost runway.” – Jeff Lee, Associate Director of R&D Cloud Architecture, Takeda Pharmaceuticals

Solution

We leveraged HealthOmics workflows as a robust environment for executing RNAseq analysis using workflow languages like Nextflow at scale.

We started by migrating our custom, Nextflow-based, RNAseq analysis workflow to run on HealthOmics. The workflow definition (scripts that use Nextflow DSL2) describes steps that use containerized tooling to convert raw sequence data to gene expression values. Container images used by the workflow were already stored in private image repositories in Amazon Elastic Container Registry (ECR). Using the workflow definition, we created a private workflow in HealthOmics.
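As a rough illustration of this step, the following Python sketch packages a workflow definition into a ZIP archive and assembles the arguments for the `create_workflow` call of the boto3 `omics` client. The file names, workflow name, and parameter names are hypothetical placeholders, not Takeda's actual workflow, and the helper functions are ours.

```python
import io
import zipfile


def zip_workflow_definition(files: dict[str, str]) -> bytes:
    """Pack workflow source files (relative path -> text) into an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for relative_path, text in files.items():
            archive.writestr(relative_path, text)
    return buffer.getvalue()


def build_create_workflow_request(name, definition_zip, main="main.nf",
                                  parameter_template=None):
    """Assemble keyword arguments for the omics client's create_workflow call."""
    request = {
        "name": name,
        "engine": "NEXTFLOW",
        "definitionZip": definition_zip,
        "main": main,  # entry-point script inside the ZIP
    }
    if parameter_template:
        request["parameterTemplate"] = parameter_template
    return request


# Hypothetical workflow content and parameters, for illustration only.
definition = zip_workflow_definition({"main.nf": "workflow { }\n"})
request = build_create_workflow_request(
    name="rnaseq-example",
    definition_zip=definition,
    parameter_template={
        "samplesheet": {"description": "S3 URI of the sample sheet", "optional": False},
        "reference": {"description": "S3 URI of the reference bundle", "optional": False},
    },
)
```

With AWS credentials configured, the request could then be submitted with `boto3.client("omics").create_workflow(**request)`. Keeping the request assembly separate from the API call makes it easy to check offline.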

Next, we created workflow runs to test the workflow and process samples. A run is a single invocation of a workflow, with a unique identifier. Each run generates several Amazon CloudWatch Logs log streams: one for the overall run invocation (e.g. the parameters and input data used), one for the Nextflow engine, and one for each workflow task. Together these facilitate reproducibility and data provenance, and streamline development.
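A minimal sketch of how one run per sample might be requested with the boto3 `omics` client's `start_run` call is shown below. The workflow ID, role ARN, bucket names, parameter names, and the run naming scheme are all hypothetical placeholders.

```python
import uuid


def build_start_run_request(workflow_id, role_arn, sample_id, parameters, output_uri):
    """Assemble keyword arguments for the omics client's start_run call."""
    return {
        "workflowId": workflow_id,
        "workflowType": "PRIVATE",
        "roleArn": role_arn,              # IAM role the run uses to reach data
        "name": f"rnaseq-{sample_id}",    # hypothetical naming convention
        "parameters": parameters,
        "outputUri": output_uri,
        "requestId": str(uuid.uuid4()),   # idempotency token
    }


# Hypothetical identifiers and locations, for illustration only.
samples = {
    "S001": "s3://example-input-bucket/S001/",
    "S002": "s3://example-input-bucket/S002/",
}
requests = [
    build_start_run_request(
        workflow_id="1234567",
        role_arn="arn:aws:iam::111122223333:role/ExampleOmicsRunRole",
        sample_id=sample_id,
        parameters={"sample_id": sample_id, "reads_prefix": prefix},
        output_uri="s3://example-output-bucket/rnaseq/",
    )
    for sample_id, prefix in samples.items()
]
```

Each request could then be submitted with `boto3.client("omics").start_run(**request)`. The response includes the run's unique identifier, which is what ties the outputs back to the logged parameters.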

Figure 1: Architecture for running workflows with HealthOmics. At a high level, it consists of input data in S3 locations, container images in Amazon ECR Private, a HealthOmics workflow, a HealthOmics workflow run, and an output data S3 location.

In total, it took us about two weeks to migrate Takeda’s RNAseq workflow and scale up processing for 20,000 RNAseq samples. Specifically, we spent 2 days making minor modifications to the workflow definition (which we’ll discuss in the next section), and the rest of the time ramping up the number of samples processed concurrently.

We also optimized compute resources for the runs, using the data provided by HealthOmics (more on this in the next section). To start, we were already saving ~50% of our per sample costs by running on HealthOmics, compared to on-prem and out-sourced analysis. We were able to reduce per sample costs further by an additional 20% for a total cost savings of >70% by inspecting run metrics and using HealthOmics’ rightsizing recommendations.

We ultimately were able to process ~15,000 samples in one day, demonstrating our desired target capacity of 10,000 samples per day (plus some buffer) and enabling analysis of 20,000 samples in two days.

Key considerations for design and development

Tooling and data governance
Each HealthOmics workflow run executes in its own segregated network environment with no access to the public Internet. This is one of the ways HealthOmics helped us meet our high security bar.

Access to data is granted through an IAM role specified at run invocation, and data can be read from either in-region S3 buckets or HealthOmics storage. Workflow tasks that used hard-coded file paths for reference and input data were refactored to use parameterized cloud storage URIs. Moving to HealthOmics unlocked previously unseen scale, so we decided to migrate to cloud storage for improved access, governance, and durability. This also encourages least-privilege data access, maintaining known versions of reference data for results immutability, and keeping source data close to where compute will analyze it.
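One way to catch leftover hard-coded paths before a run is submitted is to check that every data-location parameter is a cloud storage URI. The sketch below is our own illustration, not a HealthOmics feature. The parameter names are hypothetical, and we assume that S3 locations use the `s3://` scheme and HealthOmics storage locations use the `omics://` scheme.

```python
from urllib.parse import urlparse

# Assumption: S3 objects and HealthOmics storage are addressed with these schemes.
ALLOWED_SCHEMES = {"s3", "omics"}


def find_invalid_data_uris(parameters, data_keys):
    """Return {parameter name: problem} for data parameters that are not cloud URIs."""
    problems = {}
    for key in data_keys:
        value = parameters.get(key)
        if not isinstance(value, str):
            problems[key] = "missing or not a string"
            continue
        parsed = urlparse(value)
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
            problems[key] = f"expected an s3:// or omics:// URI, got {value!r}"
    return problems


# Hypothetical run parameters: the reference still points at a cluster file path.
run_parameters = {
    "reads": "s3://example-input-bucket/S001/S001_R1.fastq.gz",
    "reference": "/shared/reference/genome.fa",
}
problems = find_invalid_data_uris(run_parameters, ["reads", "reference"])
```

Here `problems` flags only the `reference` parameter, which is the kind of on-prem file path that had to be replaced during the migration.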

HealthOmics also requires using container images from ECR Private. This enables best practices for tooling access and security through capabilities like per repository permissions policies, image tag immutability, and image vulnerability scanning.

Reproducibility
HealthOmics workflows allow you to specify the parameters that are required for each run. The parameter values used for each run invocation are, in turn, captured by HealthOmics in CloudWatch Logs as a “manifest” log stream. Parameterizing a workflow both minimizes overall development effort and makes the workflow reusable for multiple use cases, e.g. enabling one RNAseq workflow to process data from multiple model organisms. Since invocation parameters are recorded, you can generate thorough data analysis audit trails and exactly reproduce a prior analysis as needed.

Debugging, testing, and optimizing
If a run fails, the HealthOmics Console and AWS CLI commands make it simple to identify the reason: you can list the tasks of a run, check their completion status, and find the associated logs in CloudWatch Logs.
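The same information is available programmatically. The sketch below assumes the boto3 `omics` client and its `list_run_tasks` and `get_run_task` calls. The helper name and the shape of the returned records are ours, and the function takes the client as an argument so it can be exercised with a stub.

```python
def list_failed_tasks(omics_client, run_id):
    """Collect name, status message, and log stream for each failed task of a run."""
    failed = []
    kwargs = {"id": run_id, "status": "FAILED"}
    while True:
        page = omics_client.list_run_tasks(**kwargs)
        for task in page.get("items", []):
            detail = omics_client.get_run_task(id=run_id, taskId=task["taskId"])
            failed.append({
                "taskId": task["taskId"],
                "name": task.get("name"),
                "statusMessage": detail.get("statusMessage"),
                "logStream": detail.get("logStream"),
            })
        token = page.get("nextToken")
        if not token:
            break
        kwargs["startingToken"] = token  # continue with the next page
    return failed
```

With credentials configured, this could be called as `list_failed_tasks(boto3.client("omics"), run_id)`, where `run_id` is the identifier returned when the run was started.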

For debugging, it is good to have comprehensive STDOUT logging within the tools used by workflow tasks. STDOUT and STDERR streams are captured per task to CloudWatch log streams, enabling real-time monitoring. Workflow tasks that use Bash commands should enable exiting on error (e.g. set -e) to avoid silent failures, and Bash tracing (e.g. set -x), which prints the materialized commands that each task executes. Additionally, custom metadata can be added by echoing key-value pairs to STDOUT. This data can later be collected by running CloudWatch Logs Insights queries on the logs generated by the run.
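To illustrate the key-value idea, here is a small Python sketch that formats metadata lines for a task to print and later recovers them from captured log messages. The `METADATA key=value` line format is a hypothetical convention of ours, not something HealthOmics defines.

```python
import re

# Hypothetical convention: a task prints lines such as "METADATA sample_id=S001".
METADATA_LINE = re.compile(r"^METADATA (?P<key>[A-Za-z_][A-Za-z0-9_.]*)=(?P<value>.*)$")


def format_metadata(key, value):
    """Build a metadata line for a task to write to STDOUT."""
    return f"METADATA {key}={value}"


def extract_metadata(log_messages):
    """Collect key-value pairs from log messages, ignoring all other output."""
    metadata = {}
    for message in log_messages:
        match = METADATA_LINE.match(message.strip())
        if match:
            metadata[match.group("key")] = match.group("value")
    return metadata


# Example messages as they might appear in a task log stream with tracing enabled.
messages = [
    "+ echo METADATA sample_id=S001",   # trace output from set -x, skipped
    format_metadata("sample_id", "S001"),
    "METADATA read_pairs=1200",
    "alignment finished",
]
metadata = extract_metadata(messages)
```

Anchoring the pattern at the start of the line means the traced copy of each echo command, which Bash prefixes with `+`, is not counted twice.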

Verifying that a workflow works correctly can be challenging. Workflows can be complex, consisting of multiple serial or parallel tasks. A best practice is to test with small “happy path” datasets, which help you to iterate quickly and ensure that all the primary logic is in place. You can also have test data for multiple key cases to catch potential edge cases. HealthOmics makes it easy to run multiple test cases at once, and the logs and information provided by HealthOmics APIs can be used to verify data flow through a workflow.

Importantly, the “manifest” log streams that HealthOmics generates for each run include compute utilization metrics for each task. This recently released feature helps you right-size tasks easily, maximizing cost savings and optimizing workflow performance.
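The right-sizing logic can be sketched roughly as follows. The record fields used here (`cpus_max`, `memory_max_gib`, and so on) are hypothetical stand-ins for whatever utilization metrics you extract from the manifest, and the 20% headroom is an arbitrary example value, not a HealthOmics recommendation.

```python
import math


def recommend_resources(task, headroom=0.2):
    """Suggest CPU and memory requests from observed peak usage plus headroom."""
    cpus = max(1, math.ceil(task["cpus_max"] * (1 + headroom)))
    memory_gib = max(1, math.ceil(task["memory_max_gib"] * (1 + headroom)))
    return {
        "name": task["name"],
        "cpus": cpus,
        "memory_gib": memory_gib,
        # True when the current reservation exceeds the suggestion on either axis.
        "over_provisioned": (
            task["cpus_reserved"] > cpus or task["memory_reserved_gib"] > memory_gib
        ),
    }


# Hypothetical metrics for one task: 16 vCPUs and 64 GiB reserved, far less used.
observed = {
    "name": "align",
    "cpus_reserved": 16,
    "cpus_max": 3.1,
    "memory_reserved_gib": 64,
    "memory_max_gib": 9.5,
}
recommendation = recommend_resources(observed)
```

In this example the task would be flagged as over-provisioned, with a suggested request of 4 vCPUs and 12 GiB. Any suggestion like this should be validated on representative samples before being applied across thousands of runs.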

What’s Next for Takeda

We are actively scaling HealthOmics across our other therapeutic areas. Efforts are underway to enhance CI/CD integration with HealthOmics, improve metric capture and observability, streamline development and optimization, and validate HealthOmics for GxP workloads.

Conclusion

By migrating our genomics pipelines to HealthOmics, we were able to significantly transform and accelerate our cancer biomarker research. Through a close collaboration with the AWS team and the Computational Oncology team at Takeda, we piloted this project with an existing Nextflow RNA sequencing pipeline, and completed the migration in under two weeks. As a result, we eliminated several key constraints, such as scale, time, and cost. We are now able to process over 30X more samples per day, get this data into scientists’ hands within 24 hours instead of the previous 5-6 weeks, and spend 70% less than before. Additionally, the data that we generate is fully reproducible and searchable across many features, eliminating data silos. The impact of this work will continue to grow as more workloads are migrated to HealthOmics across Takeda R&D.

We’re excited to see what HealthOmics will continue to do for our organization.

Jeff Lee

Jeffrey Lee is an Associate Director of Cloud Architecture at Takeda, specializing in AWS Architecture and high-performance computing for genomic platforms. With over a decade of experience in BioTech, Higher Education, and Retail, he expertly aligns complex technical solutions with strategic business goals to boost organizational efficiency and scalability. Jeff is also an influential speaker, contributing to the industry dialogue on future cloud technologies and best practices.

Lena Rozov

Lena Rozov is a Senior Solutions Architect at AWS, specializing in Life Sciences and Genomics. She has an M.S. in Bioinformatics and over 20 years of application development and architecture experience in the Life Sciences industry. She loves to help her customers dissect complex problems to achieve their business goals. Outside of work, she enjoys traveling and cooking.

Lee Pang

Lee is a Principal Bioinformatics Architect with the Health AI services team at AWS. He has a PhD in Bioengineering and over a decade of hands-on experience as a practicing research scientist and software engineer in bioinformatics, computational systems biology, and data science developing tools ranging from high throughput pipelines for *omics data processing to compliant software for clinical data capture and analysis.

Sidharth Rampally

Sid is a Customer Solutions Manager at AWS driving GenAI acceleration for Life Sciences customers. He writes about topics relevant to his customers, focusing on data engineering and machine learning. In his spare time, Sid enjoys walking his dog in Central Park and playing hockey.