AWS HPC Blog
How AstraZeneca improved their genomics processing to be 60% faster, 70% more cost-effective
This post was contributed by Manu Pillai (AWS), Marissa E Powers PhD (AWS), Sean O’Dell (AstraZeneca), Gabriel Hernandez (AstraZeneca), Heejoon Jo (Illumina), Shyamal Mehtalia (Illumina), Joe Warren (AWS), and Natalia Jimenez PhD (AWS)
Genomic research demands massive computational power to process and analyze DNA sequences that unlock insights into human health and disease. For pharmaceutical companies like AstraZeneca, processing millions of genomes efficiently and sustainably is a crucial factor to accelerate discovery and development of future therapies to help people to live longer, healthier lives.
As cloud infrastructure evolves, organizations face the challenge of migrating to newer, more powerful instances while maintaining research integrity and managing costs. When Amazon Elastic Compute Cloud (Amazon EC2) F1 instances approach end-of-service in 2025, organisations relying on FPGA to process genetic data need a clear path forward that doesn’t compromise their critical workloads.
In this post, we’ll show you how AstraZeneca’s Centre for Genomics Research successfully transitioned to Amazon EC2 F2 instances, achieving substantial performance improvements while significantly reducing costs.
The Challenge: Future-Proofing Genomic Infrastructure
The Centre for Genomics Research (CGR) implements AstraZeneca’s Genomics Initiative, aiming to analyze up to two million genomes by 2026. It integrates large-scale genomic and clinical data to uncover novel drug targets, understand disease mechanisms, and inform all stages of the AstraZeneca drug discovery pipeline with human genetic insight. CGR’s work leverages multi-omics approaches—genomics, transcriptomics, proteomics, and metabolomics—at population scale, making it one of the largest and most diverse datasets globally.
The CGR Data Platforms team relies heavily on Illumina DRAGEN®[1] (Dynamic Read Analysis for GENomics) workloads for processing Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) data. With F1 instances approaching end-of-service, they needed to evaluate F2 instances as the next-generation solution while ensuring no disruption to their research operations.
“The performance improvements we’ve seen with F2 instances are transformative for our genomic processing capabilities,” said Sean O’Dell from AstraZeneca’s Centre for Genomics Research. “This collaboration with AWS and Illumina provides us with a clear path forward to continue to enable scientific research more efficiently and cost-effectively.”
The Solution: Comprehensive F1 to F2 Migration Testing
AWS, AstraZeneca, and Illumina collaborated to conduct extensive testing comparing F1 and F2 instance performance across multiple genomic workloads and AWS Regions.
Testing Methodology
The team designed a comprehensive evaluation using:
- Sample Analysis: 11 genomic samples (5 WES and 6 WGS) representing typical research workloads
- Architecture: AWS Batch with separate compute queues for F1 and F2 instances
- Illumina DRAGEN Versions: v3.7.8 for WGS processing, v4.3.6 for WES processing
- Multi-Region Validation: Testing across three AWS regions to ensure global scalability
- Performance Metrics: Runtime comparison and cost analysis for identical workloads
- Validation: A formal method for validating genomic equivalency
The architecture leveraged the Nextflow workflow system with AWS Batch (a fully managed batch computing service) to orchestrate DRAGEN pipeline execution, enabling consistent testing conditions across both instance types.
Figure 1 – This diagram shows the cloud testing environment used to compare F1 and F2 instance performance for genomic processing. Users submit DRAGEN workflows through Nextflow, which orchestrates separate AWS Batch compute environments for Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) samples on F2.6xlarge instances. The standardized architecture enabled precise performance comparisons by maintaining identical workflow orchestration while only changing the underlying compute infrastructure, ensuring the 60% performance improvements and 70% cost reductions could be accurately attributed to F2 instances.
Results: Dramatic Performance and Cost Improvements
The testing revealed significant improvements when migrating from F1 to F2 instances:
Key Benefits
- 60% faster processing: Significant reduction in genomic analysis runtime
- 70% cost reduction: Substantial savings on compute costs
- Sustainability improvement: Reduced energy use and associated carbon emissions by completing workloads more efficiently
- Enhanced scalability: Improved performance across multiple AWS regions
- Maintained accuracy: No compromise in research data integrity
Figure 2 – This chart compares F1.4xlarge and F2.6xlarge instance performance for six Whole Genome Sequencing samples, showing both runtime (in minutes) and cost per run (in dollars). F2 instances consistently deliver up to 60% faster processing times and 70% lower costs across all samples, demonstrating the substantial performance improvements and cost savings organizations can achieve when upgrading from F1 to F2 instances for genomic workloads. On Demand pricing for f1.4xlarge in eu-west-1 and f2.6xlarge in eu-west-2 as of 9/16/2025.
Figure 3 – This chart compares F1.4xlarge and F2.6xlarge instance performance for five Whole Exome Sequencing samples, showing both runtime (in minutes) and cost per run (in dollars). F2 instances consistently deliver up to 63% faster processing times and 74% lower costs across all samples, demonstrating the substantial performance improvements and cost savings organizations can achieve when upgrading from F1 to F2 instances for exome sequencing workloads. On Demand pricing for f1.4xlarge in eu-west-1 and f2.6xlarge in eu-west-2 as of 9/16/2025.
Validating Equivalency
To assess Illumina DRAGEN output equivalency across AWS F1 and F2 instance families, we evaluated the output for five samples processed on each instance type and compared outputs at three levels: pipeline metrics, command provenance, and variant calls. For like-for-like runs, we expect identical results across instances. In practice, every value and variant should match exactly, with differences limited to non-deterministic metadata such as timestamps, file paths, or sample labels.
Command Provenance: To verify that the pipelines were invoked identically, we extracted the DRAGEN command-lines from the VCF headers and compared them using bash tools. The commands and flags matched exactly between F1 and F2 runs, ensuring the analysis was conducted with the same configuration and parameters.
Metrics Comparison: We compared DRAGEN-generated metrics using standard bash utilities (e.g., diff) across the following files: gc_metrics, mapping_metrics, roh_metrics, sv_metrics, and vc_metrics. Across all five samples, the metrics were identical between F1 and F2 outputs except for the values we expect to vary (e.g., names, date/timestamp). This confirms that alignment, variant calling, and summary statistics are reproducible across both instance types.
VCF Concordance: We evaluated variant-level concordance using bcftools isec, which performs set operations on VCF files by treating each file as a set of variant records. It computes intersections (variants present in both files) and differences (variants unique to one file), making it a practical way to quantify concordance between outputs. In our workflow, we ran bcftools isec on paired VCFs (F1 vs F2) for each sample. The tool writes out separate files corresponding to the following:
- Variants unique to the first VCF (e.g., 0000.vcf)
- Variants unique to the second VCF (e.g., 0001.vcf)
- Variants common to both VCFs (the intersection, e.g., 0002.vcf, and 0003.vcf)
We then counted variants in each output to measure overlap and differences, excluding header lines that begin with # or ##. Under full equivalence, the unique outputs should be empty, and the intersection files should contain all variants, so intersection record counts match the originals.
# Run bcftools isec
$> bcftools isec -p results sample1_f1.vcf.gz sample1_f2.vcf.gz
# Count variant records (exclude header lines)
$> grep -c -v "^#" results/*.vcf
0000.vcf:0
0001.vcf:0
0002.vcf:299131
0003.vcf:299131
Across all five samples, bcftools isec showed that the intersection contained all variants, and the unique sets were empty, indicating no variant-level differences between F1 and F2 outputs. In other words, the VCFs were fully concordant.
In conclusion, under identical DRAGEN configurations, Amazon EC2 F1 and F2 instances produce equivalent outputs at both the metrics and variant levels. Metrics differed only in expected, non-deterministic metadata (timestamps and sample labels), and bcftools isec showed that variant intersections contained all calls with no unique differences. These results support running DRAGEN genomic pipelines interchangeably on F1 and F2.
The Impact: Accelerating Genomic Discovery
This successful migration provides AstraZeneca with transformative advantages that extend far beyond simple infrastructure upgrades. The immediate benefits include establishing a clear, validated migration path from F1 to F2 instances, ensuring business continuity and achieving significant cost and CO2 emissions optimization for large-scale genomic workloads.
The long-term value of this transition positions AstraZeneca’s Centre for Genomics Research with future-ready infrastructure capable of supporting their rapidly growing genomic datasets, which will soon exceed 2 million human genomes.
Conclusion
The collaboration between AWS, AstraZeneca, and Illumina demonstrates how cloud infrastructure evolution can drive significant efficiency improvements in genomic research capabilities. The 60% performance improvement, 70% cost reduction, as well as improved sustainability achieved through F2 instance migration provides a compelling case for organizations looking to optimize their genomic processing workloads.
For genomics teams currently using F1 instances, this validation study offers confidence in migrating to F2 instances while maintaining research integrity and achieving substantial operational benefits.
To learn more about AWS F2 instances for genomics workloads, visit Genomics on AWS. For DRAGEN implementation guidance, explore the Illumina DRAGEN product information page.
[1] DRAGEN® is a registered trademark of Illumina Inc.