AWS HPC Blog

How AstraZeneca improved their genomics processing to be 60% faster, 70% more cost-effective

This post was contributed by Manu Pillai (AWS), Marissa E Powers PhD (AWS), Sean O’Dell (AstraZeneca), Gabriel Hernandez (AstraZeneca), Heejoon Jo (Illumina), Shyamal Mehtalia (Illumina), Joe Warren (AWS), and Natalia Jimenez PhD (AWS)

Genomic research demands massive computational power to process and analyze DNA sequences that unlock insights into human health and disease. For pharmaceutical companies like AstraZeneca, processing millions of genomes efficiently and sustainably is crucial to accelerating the discovery and development of future therapies that help people live longer, healthier lives.

As cloud infrastructure evolves, organizations face the challenge of migrating to newer, more powerful instances while maintaining research integrity and managing costs. With Amazon Elastic Compute Cloud (Amazon EC2) F1 instances approaching end-of-service in 2025, organizations that rely on FPGAs to process genetic data need a clear path forward that doesn’t compromise their critical workloads.

In this post, we’ll show you how AstraZeneca’s Centre for Genomics Research successfully transitioned to Amazon EC2 F2 instances, achieving substantial performance improvements while significantly reducing costs.

The Challenge: Future-Proofing Genomic Infrastructure

The Centre for Genomics Research (CGR) implements AstraZeneca’s Genomics Initiative, aiming to analyze up to two million genomes by 2026. It integrates large-scale genomic and clinical data to uncover novel drug targets, understand disease mechanisms, and inform all stages of the AstraZeneca drug discovery pipeline with human genetic insight. CGR’s work applies multi-omics approaches (genomics, transcriptomics, proteomics, and metabolomics) at population scale, building one of the largest and most diverse datasets globally.

The CGR Data Platforms team relies heavily on Illumina DRAGEN®[1] (Dynamic Read Analysis for GENomics) workloads for processing Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) data. With F1 instances approaching end-of-service, they needed to evaluate F2 instances as the next-generation solution while ensuring no disruption to their research operations.

“The performance improvements we’ve seen with F2 instances are transformative for our genomic processing capabilities,” said Sean O’Dell from AstraZeneca’s Centre for Genomics Research. “This collaboration with AWS and Illumina provides us with a clear path forward to continue to enable scientific research more efficiently and cost-effectively.”

The Solution: Comprehensive F1 to F2 Migration Testing

AWS, AstraZeneca, and Illumina collaborated to conduct extensive testing comparing F1 and F2 instance performance across multiple genomic workloads and AWS Regions.

Testing Methodology

The team designed a comprehensive evaluation using:

  • Sample Analysis: 11 genomic samples (5 WES and 6 WGS) representing typical research workloads
  • Architecture: AWS Batch with separate compute queues for F1 and F2 instances
  • Illumina DRAGEN Versions: v3.7.8 for WGS processing, v4.3.6 for WES processing
  • Multi-Region Validation: Testing across three AWS Regions to ensure global scalability
  • Performance Metrics: Runtime comparison and cost analysis for identical workloads
  • Validation: A formal method for validating genomic equivalency

The architecture leveraged the Nextflow workflow system with AWS Batch (a fully managed batch computing service) to orchestrate DRAGEN pipeline execution, enabling consistent testing conditions across both instance types.

Figure 1 – This diagram shows the cloud testing environment used to compare F1 and F2 instance performance for genomic processing. Users submit DRAGEN workflows through Nextflow, which orchestrates separate AWS Batch compute environments for Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) samples on F2.6xlarge instances. The standardized architecture enabled precise performance comparisons by maintaining identical workflow orchestration while only changing the underlying compute infrastructure, ensuring the 60% performance improvements and 70% cost reductions could be accurately attributed to F2 instances.

Results: Dramatic Performance and Cost Improvements

The testing revealed significant improvements when migrating from F1 to F2 instances:

Key Benefits

  • 60% faster processing: Significant reduction in genomic analysis runtime
  • 70% cost reduction: Substantial savings on compute costs
  • Sustainability improvement: Reduced energy use and associated carbon emissions by completing workloads more efficiently
  • Enhanced scalability: Improved performance across multiple AWS Regions
  • Maintained accuracy: No compromise in research data integrity

Figure 2 – This chart compares F1.4xlarge and F2.6xlarge instance performance for six Whole Genome Sequencing samples, showing both runtime (in minutes) and cost per run (in dollars). F2 instances consistently deliver up to 60% faster processing times and 70% lower costs across all samples, demonstrating the substantial performance improvements and cost savings organizations can achieve when upgrading from F1 to F2 instances for genomic workloads. On Demand pricing for f1.4xlarge in eu-west-1 and f2.6xlarge in eu-west-2 as of 9/16/2025.

Figure 3 – This chart compares F1.4xlarge and F2.6xlarge instance performance for five Whole Exome Sequencing samples, showing both runtime (in minutes) and cost per run (in dollars). F2 instances consistently deliver up to 63% faster processing times and 74% lower costs across all samples, demonstrating the substantial performance improvements and cost savings organizations can achieve when upgrading from F1 to F2 instances for exome sequencing workloads. On Demand pricing for f1.4xlarge in eu-west-1 and f2.6xlarge in eu-west-2 as of 9/16/2025.
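The cost-per-run and percentage figures in the charts above come down to simple arithmetic. The following is a minimal sketch of that arithmetic, assuming the cost of a run is just its runtime multiplied by the hourly On-Demand price (it ignores storage, data transfer, and instance start-up time). The function names are illustrative and are not part of the original study.

```shell
# Cost of one run in dollars: runtime (minutes) x hourly On-Demand price.
# Simplifying assumption: compute time is the only cost that is counted.
cost_per_run() {
    awk -v minutes="$1" -v hourly="$2" \
        'BEGIN { printf "%.2f\n", minutes / 60 * hourly }'
}

# Percentage reduction from a baseline value (e.g., F1) to a new value (e.g., F2).
# Works for runtimes and for costs alike.
percent_reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}
```

With purely illustrative inputs (not measured values), `percent_reduction 100 40` prints `60.0`. Because cost depends on both runtime and hourly price, the cost reduction can exceed the runtime reduction when the newer instance also has a lower hourly rate.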

Validating Equivalency

To assess Illumina DRAGEN output equivalency across AWS F1 and F2 instance families, we evaluated the output for five samples processed on each instance type and compared outputs at three levels: pipeline metrics, command provenance, and variant calls. For like-for-like runs, we expect identical results across instances. In practice, every value and variant should match exactly, with differences limited to non-deterministic metadata such as timestamps, file paths, or sample labels.

Command Provenance: To verify that the pipelines were invoked identically, we extracted the DRAGEN command-lines from the VCF headers and compared them using bash tools. The commands and flags matched exactly between F1 and F2 runs, ensuring the analysis was conducted with the same configuration and parameters.
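One way such a provenance check could be scripted is sketched below. It assumes the DRAGEN invocation is recorded in VCF header lines that begin with `##DRAGENCommandLine`; that header key and the function names are assumptions for illustration, so check them against your own VCF headers before relying on this.

```shell
# Stream a VCF to stdout, decompressing when the file name ends in .gz.
read_vcf() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Print the command-line header record(s), stopping at the first data record
# so large VCFs are not read in full.
# Assumption: the invocation is stored in "##DRAGENCommandLine" header lines.
extract_dragen_cmdline() {
    read_vcf "$1" | awk '/^##DRAGENCommandLine/ { print } !/^#/ { exit }'
}

# Succeed only if both VCFs carry identical, non-empty command-line records.
same_dragen_cmdline() {
    local a b
    a=$(extract_dragen_cmdline "$1")
    b=$(extract_dragen_cmdline "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (file names follow the bcftools example later in this post):
#   same_dragen_cmdline sample1_f1.vcf.gz sample1_f2.vcf.gz && echo "commands match"
```

Requiring a non-empty match guards against a false positive when neither file contains the expected header line.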

Metrics Comparison: We compared DRAGEN-generated metrics using standard bash utilities (e.g., diff) across the following files: gc_metrics, mapping_metrics, roh_metrics, sv_metrics, and vc_metrics. Across all five samples, the metrics were identical between F1 and F2 outputs except for the values we expect to vary (e.g., names, date/timestamp). This confirms that alignment, variant calling, and summary statistics are reproducible across both instance types.
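A comparison like this can be sketched with grep and diff: drop the lines that are expected to vary, then diff what remains. The function name and the ignore pattern in the usage comment are hypothetical, and the right pattern depends on which fields legitimately differ in your own metrics files.

```shell
# Diff two metrics files after removing lines that match an "expected to vary"
# extended regular expression. Returns diff's exit status (0 = identical).
# The pattern must be non-empty: an empty pattern would filter out every line.
metrics_diff() {
    local ignore="$1" f1="$2" f2="$3" workdir status=0
    workdir=$(mktemp -d) || return 2
    grep -E -v -- "$ignore" "$f1" > "$workdir/f1" || true
    grep -E -v -- "$ignore" "$f2" > "$workdir/f2" || true
    diff "$workdir/f1" "$workdir/f2" || status=$?
    rm -rf "$workdir"
    return "$status"
}

# Usage (hypothetical file names and pattern):
#   metrics_diff '^RUN TIME' f1/sample1.mapping_metrics.csv f2/sample1.mapping_metrics.csv
```

Keeping the ignore pattern as narrow as possible matters here, because a broad pattern can hide a real difference in a metric whose name happens to match.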

VCF Concordance: We evaluated variant-level concordance using bcftools isec, which performs set operations on VCF files by treating each file as a set of variant records. It computes intersections (variants present in both files) and differences (variants unique to one file), making it a practical way to quantify concordance between outputs. In our workflow, we ran bcftools isec on paired VCFs (F1 vs F2) for each sample. The tool writes out separate files corresponding to the following:

  • Variants unique to the first VCF (e.g., 0000.vcf)
  • Variants unique to the second VCF (e.g., 0001.vcf)
  • Variants common to both VCFs (the intersection, written as the shared records from the first file in 0002.vcf and from the second file in 0003.vcf)

We then counted variants in each output to measure overlap and differences, excluding header lines that begin with # or ##. Under full equivalence, the unique outputs should be empty, and the intersection files should contain all variants, so intersection record counts match the originals.

# Run bcftools isec

$> bcftools isec -p results sample1_f1.vcf.gz sample1_f2.vcf.gz

# Count variant records (exclude header lines)

$> grep -c -v "^#" results/*.vcf
0000.vcf:0
0001.vcf:0
0002.vcf:299131
0003.vcf:299131

Across all five samples, bcftools isec showed that the intersection contained all variants, and the unique sets were empty, indicating no variant-level differences between F1 and F2 outputs. In other words, the VCFs were fully concordant.
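For more than a handful of samples, the manual count can be wrapped in a small check. The sketch below assumes the directory layout that bcftools isec -p produces for two input files (0000.vcf through 0003.vcf); the function names and the results_sample1 directory are illustrative.

```shell
# Count variant records in a VCF, excluding header lines that begin with "#".
# grep exits non-zero when the count is 0, but still prints the count.
count_records() {
    grep -c -v '^#' "$1" || true
}

# Succeed only if an isec output directory shows full concordance:
# no records private to either input and a non-empty, equal-sized intersection.
is_concordant() {
    local dir="$1" only_a only_b shared_a shared_b
    only_a=$(count_records "$dir/0000.vcf")
    only_b=$(count_records "$dir/0001.vcf")
    shared_a=$(count_records "$dir/0002.vcf")
    shared_b=$(count_records "$dir/0003.vcf")
    echo "only_first=$only_a only_second=$only_b shared=$shared_a"
    [ "$only_a" -eq 0 ] && [ "$only_b" -eq 0 ] &&
        [ "$shared_a" -gt 0 ] && [ "$shared_a" -eq "$shared_b" ]
}

# Usage (bcftools isec needs bgzip-compressed, indexed inputs):
#   bcftools isec -p results_sample1 sample1_f1.vcf.gz sample1_f2.vcf.gz
#   is_concordant results_sample1 || echo "sample1: outputs differ"
```

Requiring a non-empty intersection avoids reporting concordance when both inputs are empty or a path is wrong.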

In conclusion, under identical DRAGEN configurations, Amazon EC2 F1 and F2 instances produce equivalent outputs at both the metrics and variant levels. Metrics differed only in expected, non-deterministic metadata (timestamps and sample labels), and bcftools isec showed that variant intersections contained all calls with no unique differences. These results support running DRAGEN genomic pipelines interchangeably on F1 and F2.

The Impact: Accelerating Genomic Discovery

This successful migration provides AstraZeneca with advantages that extend far beyond a simple infrastructure upgrade. The immediate benefits include establishing a clear, validated migration path from F1 to F2 instances, ensuring business continuity, and significantly reducing both cost and CO2 emissions for large-scale genomic workloads.

In the longer term, this transition gives AstraZeneca’s Centre for Genomics Research future-ready infrastructure capable of supporting its rapidly growing genomic datasets, which will soon exceed 2 million human genomes.

Conclusion

The collaboration between AWS, AstraZeneca, and Illumina demonstrates how cloud infrastructure evolution can drive significant efficiency improvements in genomic research capabilities. The 60% performance improvement, 70% cost reduction, and improved sustainability achieved through F2 instance migration make a compelling case for organizations looking to optimize their genomic processing workloads.

For genomics teams currently using F1 instances, this validation study offers confidence in migrating to F2 instances while maintaining research integrity and achieving substantial operational benefits.

To learn more about AWS F2 instances for genomics workloads, visit Genomics on AWS. For DRAGEN implementation guidance, explore the Illumina DRAGEN product information page.

[1] DRAGEN® is a registered trademark of Illumina Inc.

Manu Pillai

Manu Pillai is a Sr. Specialist Solutions Architect at AWS, focused on high-performance computing and cloud solutions for life sciences organizations. With over 12 years of experience in the life sciences industry, he brings deep domain expertise to help organizations leverage cloud technologies to advance scientific discovery. He enjoys exploring new destinations with his family when he's not diving into the latest tech innovations.

Gabriel Hernandez

Gabriel Hernandez is a genomics and bioinformatics scientist at AstraZeneca’s Centre for Genomics Research (CGR). He solves problems by combining computational and biological expertise to find answers efficiently. His work centers on understanding genomic data to improve how it is processed and analyzed.

Heejoon Jo

Heejoon Jo is a Staff Software Engineer at Illumina Inc., working within the Software Solutions Architecture department. He focuses on customer collaboration and innovation, with a strong emphasis on DRAGEN. Heejoon leads pipeline development and maintenance for DRAGEN and provides technical support to sales and commercial teams. He brings extensive experience in cancer genomics and the application of statistical methods. Heejoon holds a master’s degree in Biostatistics from the University of North Carolina at Chapel Hill.

Joe Warren

Joe Warren is a Global Account Manager at AWS, specializing in Data, AI & ML solutions for life sciences. For more than four years, he has driven scientific workload transformation across AstraZeneca's value chain, delivering cloud-native innovations that accelerate research outcomes. He has been based in NYC since relocating from London in 2023.

Natalia Jimenez

Natalia Jimenez is a life sciences high performance computing specialist at AWS. She has a PhD in Molecular Biology and over 25 years of experience in the research and tech sectors. She is originally from Spain but lives in Cambridge, UK and enjoys socializing, cycling, and gardening.

Marissa Powers

Marissa Powers is a specialist solutions architect at AWS focused on high performance computing and life sciences. She has a PhD in computational neuroscience and enjoys working with researchers and scientists to accelerate their drug discovery workloads. She lives in Boston with her family and is a big fan of winter sports and being outdoors.

Sean O’Dell

Sean O’Dell is an experienced practitioner in cloud and genomics data, specializing in designing, building, and operating highly scalable cloud solutions for genomics and life sciences. With a mission to harness technology to drive innovation, Sean leads a team of software developers, engineers, and architects at AstraZeneca’s Centre for Genomics Research (CGR). His primary goal is to empower scientists with actionable insights derived from vast datasets.

Shyamal Mehtalia

Shyamal Mehtalia is the Director of Technical Product Management at Illumina, where, in close collaboration with development teams and customers, he defines the features and roadmap for Illumina’s flagship secondary analysis solution, DRAGEN. Shyamal joined Illumina in May 2018 as part of the acquisition of Edico Genome, the company that invented the DRAGEN platform, where he served as Director of Operations for three years. Prior to this, he held a leadership role developing firmware in the mobile wireless space. Shyamal holds a master’s degree in electrical engineering from Texas A&M University.