Benchmarking the Oxford Nanopore Technologies basecallers on AWS

This blog post was contributed by Guilherme Coppini, Bioinformatician, and Javier Quilez, Associate Director – Bioinformatics, at G42 Healthcare; Chris Seymour, Vice President of Advanced Platform Development at Oxford Nanopore; and Doruk Ozturk, Senior Solutions Architect, Container Technologies, Michael Mueller, Senior Solutions Architect, Genomics, and Stefan Dittforth, Senior Solutions Architect, Healthcare, at AWS.

[Update 2023-11-20]: The source code for the automated deployment of the architecture described in section “Architecture” is now available as open source on GitHub.

Oxford Nanopore sequencers enable direct, real-time analysis of long DNA or RNA fragments. They work by monitoring changes to an electrical current as nucleic acids are passed through a protein nanopore. The resulting signal is decoded into the specific DNA or RNA sequence by compute-intensive algorithms called basecallers. This blog post presents benchmarking results for two of those Oxford Nanopore basecallers, Guppy and Dorado, on AWS. The benchmarking project was a collaboration between G42 Healthcare, Oxford Nanopore Technologies, and AWS.

We ran Guppy and Dorado on 20 different Amazon Elastic Compute Cloud (Amazon EC2) instance types with GPU accelerators. The top performance was achieved on a p4d.24xlarge instance, which delivered 490 million samples/second with Dorado and 250 million samples/second with Guppy. A sample is one measurement of the current flowing through the nanopore. Typically, the current signal is sampled at 10 times the speed at which bases pass through the nanopore. For example, at a rate of 400 bases per second (bps) passing through the nanopore, the sampling rate is 4,000 samples per second. The Dorado basecaller outperformed Guppy by a factor of 3.8x when performing methylation calling with the 5-hydroxymethylcytosine group (5hmCG). Our cost evaluations revealed that the g5.xlarge instance delivers the lowest cost for basecalling a whole human genome (WHG) with Guppy.
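
The relationship between base rate, sampling rate and basecaller throughput can be sketched in a few lines of shell. The function names are illustrative, the factor of 10 samples per base is the typical value quoted above, and the 96 gigabases for a 30x human genome is the figure used in the cost tables later in this post. The runtime is a back-of-envelope estimate under these assumptions, not a measured result.

```shell
# Typical ratio of signal samples to bases (assumption: 10x, as stated above).
SAMPLES_PER_BASE=10

# Sampling rate (samples/s) for a given translocation speed (bases/s).
sampling_rate() {
    echo $(( $1 * SAMPLES_PER_BASE ))
}

# Rough runtime in hours to basecall a number of gigabases
# at a given throughput in million samples/s.
estimate_hours() {
    awk -v gb="$1" -v msps="$2" -v spb="$SAMPLES_PER_BASE" \
        'BEGIN { printf "%.2f\n", (gb * 1e9 * spb) / (msps * 1e6) / 3600 }'
}

sampling_rate 400        # prints 4000
estimate_hours 96 490    # prints 0.54
```

At 490 million samples/second, the estimate comes out at roughly half an hour for a 30x genome, which is in line with the sub-hour runtime reported for the p4d.24xlarge.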

Background

In recent years, significant technological advancements in Oxford Nanopore’s sequencers have made them a viable tool for sequencing DNA and RNA molecules. Compared to other sequencing technologies, Oxford Nanopore offers several benefits: much longer read lengths (up to hundreds of kilobases), information on methylation status, and real-time sequencing. Longer reads allow for characterization of relevant genetic features, such as structural variants, that are relatively inaccessible to short-read technologies. Methylation is a biochemical modification within the DNA that turns genes on or off. On other platforms, detecting methylation requires additional sequencing. Real-time sequencing reduces the turnaround time in clinical applications.

Basecalling is the conversion of the raw electrical signal generated by the sequencer, stored in FAST5 files, into character-based sequences and associated metadata (for example, sample identifier and sequence quality) stored in the standard FASTQ format. Basecalling Oxford Nanopore data is computationally intensive and requires GPUs. Therefore, enabling and optimizing the basecalling step is critical to deploying bioinformatic pipelines for the analysis of Oxford Nanopore data in the cloud.

G42 Healthcare offers Oxford Nanopore sequencing as part of its Omics as a Service, which also includes the short-read technologies Illumina and MGI. G42 Healthcare has built one of the world’s most advanced high-throughput sequencing laboratories in Abu Dhabi, United Arab Emirates, as part of the Emirati Genome Programme. Today G42 Healthcare has the capacity to deliver approximately 50,000 whole genome sequences (WGS) per month across all three technologies.

These services address the demand from healthcare regulators, research organizations and industry for ‘omics sequencing and analysis at population scale. G42 Healthcare’s offerings enable customers to quickly scale up large genome programs without the capital-intensive and time-consuming setup of genome laboratories. Furthermore, G42 Healthcare delivers complete population health programs, from outreach and sample collection to biobanking and sequencing, as well as bioinformatics, interpretation, knowledge transfer and personnel training.

Performance results

The results presented below are based on the CliveOME 5mC dataset, which Oxford Nanopore makes available as an open dataset. The data represents one human whole genome sequence from one subject sequenced at 30x coverage. The dataset comprises 584 FAST5 files with a total data volume of 745 GiB.

We successfully ran the Guppy and Dorado basecallers on 20 different Amazon EC2 instance types. Dorado delivers significantly higher performance than Guppy. Dorado shows its strength when performing methylation calling with the 5-methylcytosine group (5mCG) and 5-hydroxymethylcytosine group (5hmCG). On the most powerful EC2 instance type tested, the p4d.24xlarge, we see a 3.8x performance increase when performing methylation calling for 5hmCG.

The performance gains observed with the Dorado basecaller result from a new software architecture that makes better use of GPUs. Whereas Guppy requires CPUs for calling methylated bases, Dorado uses GPUs for this process. Because Dorado requires little to no additional runtime when calling methylated bases, Oxford Nanopore Technologies enables adding base modifications to standard sequencing at no extra cost: no specific library preparation is required upstream of sequencing, and there is no computational overhead downstream during basecalling. Figure 1 and Figure 2 show the results of each tool across a set of EC2 instances for throughput (samples per second) and total runtime, respectively. Shortly before publication of this blog post, Oxford Nanopore released version 0.3.0 of Dorado. This version achieves significant performance gains on Ampere-based NVIDIA A100 GPUs. We included Dorado v0.3.0 for the p4d.24xlarge instance type in the benchmark results below.

Figure 1 – Comparison of the performance of the Oxford Nanopore basecallers Guppy vs. Dorado. The performance is expressed in samples/s. Higher values indicate higher performance. Dorado clearly outperforms Guppy across all instance types. Basecalling with methylated bases benefits from modern GPUs such as the NVIDIA A100. Tested EC2 instance types are listed to the left and ordered by performance for basecalling without methylation calling. The three facets of the plot present, from left to right: basecalling without calling for methylated bases, basecalling including methylated bases with 5mCG, basecalling including methylated bases with 5mCG and 5hmCG.

Figure 2 – Caller time in hours for basecalling the CliveOME 5mC dataset. The term “caller time” is interchangeable with “runtime”. The highest performing instance, the p4d.24xlarge, completed the basecalling process in less than an hour with the Dorado basecaller. With Dorado there is little to no increase in runtime when methylation calling is included. The Guppy basecaller requires significantly longer runtimes when performing methylation calling.

Runtime and computing cost for Oxford Nanopore basecalling

The computing cost per gigabase of DNA sequence shows that high-performance, multi-GPU instance types are not always required to perform basecalling at low cost. The first five ranks in the tables below include small instance types that are optimized for cost-effective machine learning inference: a g5.xlarge instance can perform basecalling at a cost equivalent to that of a p4d.24xlarge instance running Dorado v0.3.0 without modification calling.

Table 1 – Runtime and cost for basecalling the CliveOME 5mC dataset with Dorado across different EC2 instance types and calling without and with methylated bases per gigabase of DNA sequence and for a whole human genome (WHG) at 30X coverage (96 gigabases). The instance types are ranked by cost per WHG without modification calling (column with red border). Lowest cost ranked first. Numbers are for On-Demand pricing in the us-west-2 AWS Region. Cost effective basecalling is possible with smaller instance types such as the g5.xlarge, g5.2xlarge and g4dn.xlarge. *WHG = whole human genome at 30X coverage.

Table 2 – Runtime and cost for basecalling the CliveOME 5mC dataset with Guppy. With Guppy the lowest cost is achieved with smaller instance types such as the g5.xlarge, g5.2xlarge, g5.12xlarge and g4dn.xlarge. These instance types rank before the most performant instance type, the p4d.24xlarge. Fields with “n/a” indicate test runs that failed. The causes of these failures could not be established before publication of this blog post and are still being investigated.
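
The cost columns in these tables relate runtime, hourly instance price and the amount of sequence called. A minimal sketch of that arithmetic follows, assuming cost scales linearly with runtime. The function names are illustrative, and the runtime and price in the example call are hypothetical placeholders, not values from the tables. The 96 gigabases correspond to the whole human genome at 30X coverage used in Table 1.

```shell
WHG_GIGABASES=96   # whole human genome at 30X coverage, as in Table 1

# Cost in USD per gigabase: runtime (h) * price (USD/h) / gigabases called.
cost_per_gigabase() {
    awk -v h="$1" -v price="$2" -v gb="$3" \
        'BEGIN { printf "%.4f\n", h * price / gb }'
}

# Scale a per-gigabase cost to a whole human genome.
cost_per_whg() {
    awk -v c="$1" -v gb="$WHG_GIGABASES" 'BEGIN { printf "%.2f\n", c * gb }'
}

# Hypothetical example: 20 h on an instance priced at 1.00 USD/h for 96 gigabases.
per_gb=$(cost_per_gigabase 20 1.00 96)
echo "$per_gb"            # prints 0.2083
cost_per_whg "$per_gb"    # prints 20.00
```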

Of course, the runtime for individual EC2 instances is much longer when basecalling is performed sequentially with smaller instance types. However, with the architecture presented in the “Architecture” section basecalling jobs can be executed as parallel batch jobs. The AWS Batch service allows horizontal scaling of basecalling to hundreds or thousands of instances. The same throughput as a p4d.24xlarge instance can be achieved with parallel execution across smaller instances, but at a lower cost.

For example, when parallelizing basecalling with Dorado and 5mCG calling across 25 g5.xlarge instances, basecalling of a whole human genome (30x coverage) can be performed with a runtime similar to a single p4d.24xlarge at a 12% lower cost ($21.28 vs. $24.14 for WHG with 5mCG calling). The ability to run Oxford Nanopore basecallers on ubiquitous smaller EC2 instance types, together with more fine-grained control of load balancing jobs across GPUs, means that further cost savings can be achieved by using Amazon EC2 Spot Instances.
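
The 12% figure can be reproduced from the two WHG costs quoted above. The sketch below also shows how the total is spread across instances when a job is split evenly, an idealized assumption that ignores scheduling and start-up overhead. The function names are illustrative.

```shell
# Percentage saved when the cost drops from the first argument to the second.
percent_saving() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

# Cost attributed to each instance when a total is split evenly.
cost_per_instance() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.2f\n", total / n }'
}

percent_saving 24.14 21.28    # prints 11.8, roughly 12%
cost_per_instance 21.28 25    # prints 0.85
```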

Test procedure

This section provides a summary of the basecaller versions, their parameters and the dataset used in the benchmarking tests.

Initial benchmarking tests were conducted using guppy_basecaller version 6.4.8. At the end of 2022, Oxford Nanopore released dorado, the successor to guppy_basecaller. The benchmarking tests were conducted with dorado version 0.2.4, and with v0.3.0 for the p4d.24xlarge instance type with NVIDIA A100 GPUs. With dorado, Oxford Nanopore achieved significant performance improvements. Most notably, dorado uses GPUs for methylation calling.

The CliveOME 5mC dataset was used as the test dataset. Oxford Nanopore makes this dataset available as an open dataset. The dataset comprises 584 FAST5 files with a total data volume of 745 GiB. The FAST5 files were converted to the POD5 file format, which is the recommended format for optimal performance with the dorado basecaller. The guppy_basecaller delivers identical performance whether running on files in FAST5 or POD5 format.

The CliveOME 5mC dataset was generated by sequencing DNA samples through R10.4.1 nanopores. Therefore, basecalling was performed with the corresponding R10.4.1 high accuracy models. Performance was evaluated for basecalling without methylation calling, with 5mCG methylation calling, and with 5mCG_5hmCG methylation calling.

Below are the basecaller commands and their parameters as they were executed in the benchmarking tests.

guppy_basecaller without methylation calling:

guppy_basecaller \
    --compress_fastq \
    --input_path /fsx/pod5-all-files/ \
    --save_path /fsx/out/ \
    --input_file_list /fsx/pod5-file-lists/${file_list}  \
    --config dna_r10.4.1_e8.2_400bps_hac.cfg \
    --bam_out \
    --index \
    --device cuda:all:100% \
    --records_per_fastq 0 \
    --progress_stats_frequency 600 \
    --recursive \
    --num_base_mod_threads ${num_base_mod_threads} \
    --num_callers 16 \
    --gpu_runners_per_device 8 \
    --chunks_per_runner 2048

guppy_basecaller with methylation calling 5mCG (only difference in parameters shown):

guppy_basecaller \
    ...
    --config dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_hac.cfg \
    ...

guppy_basecaller with methylation calling 5mCG_5hmCG (only difference in parameters shown):

guppy_basecaller \
    ...
    --config dna_r10.4_e8.1_modbases_5hmc_5mc_cg_hac.cfg \
    ...

dorado without methylation calling:

dorado basecaller \
    /usr/local/Dorado/models/dna_r10.4.1_e8.2_400bps_hac@v3.5.2 \
    ${file_list}/ \
    --verbose | \
    samtools view --threads 8 -O BAM -o /fsx/out/${job_id}/calls.bam

dorado with methylation calling 5mCG (only added parameters shown):

dorado basecaller \
    ...
    --modified-bases 5mCG | \
    ...

dorado with methylation calling 5mCG_5hmCG (only added parameters shown):

dorado basecaller \
    ...
    --modified-bases 5mCG_5hmCG | \
    ...
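
The guppy_basecaller commands above read their inputs from a per-job file list (--input_file_list). How those lists were generated is not described in this post. One possible way to distribute the POD5 files round-robin across a number of jobs is sketched below, assuming one file name per line. The function name, the list naming scheme and the example paths are assumptions for illustration.

```shell
# Write the names of all *.pod5 files in a directory into N list files
# (list_000.txt, list_001.txt, ...), assigning files round-robin.
make_file_lists() {
    pod5_dir=$1
    list_dir=$2
    num_lists=$3
    mkdir -p "$list_dir"
    rm -f "$list_dir"/list_*.txt    # start from empty lists
    ( cd "$pod5_dir" && for f in *.pod5; do
          [ -e "$f" ] && printf '%s\n' "$f"
      done ) |
    awk -v n="$num_lists" -v dir="$list_dir" '{
        out = sprintf("%s/list_%03d.txt", dir, (NR - 1) % n)
        print >> out    # append, so each list accumulates its files
        close(out)      # keep the number of open files small
    }'
}

# Example (not executed here):
#   make_file_lists /fsx/pod5-all-files /fsx/pod5-file-lists 25
```

Round-robin assignment keeps the lists close to equal in length, which helps when the jobs run in parallel on identical instances.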

Architecture

The architecture developed for this project provides completely unattended automation for deployment, data collection and environment tear down. This automation made it possible to run a large number of experiments in parallel to tune parameters, compare different basecallers and assess the performance impact of methylation calling.

Figure 3 – Key components of the benchmarking architecture are the AWS Batch and Amazon FSx for Lustre services. Other services utilized for automated deployment are the AWS Cloud Development Kit (CDK) and Amazon EC2 Image Builder. Benchmarking jobs were created with Python. Results were written to a DynamoDB table and evaluated using Amazon SageMaker Notebooks.

Customers may reuse all or part of the architecture as guidance for deploying genomics workloads in the AWS Cloud. The core services utilized are AWS Batch and Amazon FSx for Lustre. AWS Batch is a service to run batch computing workloads on the AWS Cloud. We used separate AWS Batch job queues and compute environments for each instance type to isolate and orchestrate benchmarking experiments. For example, for the g5.2xlarge instance, we created a job queue named “g5-2xlarge-queue” and a compute environment named “g5-2xlarge-ce”. Amazon FSx for Lustre provides a fully managed, high-performance Lustre file system. Lustre is used for workloads where speed matters, such as machine learning, high performance computing (HPC), video processing, and financial modeling. To get high performance for reads and writes, we used Amazon FSx for Lustre backed by Amazon S3 buckets.
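
The per-instance-type naming scheme described above can be derived mechanically from the instance type. The sketch below only builds the names and prints an AWS CLI command for review rather than executing it. The helper names, the job name prefix and the job definition name (basecaller-job-def) are hypothetical.

```shell
# Derive AWS Batch resource names from an EC2 instance type,
# e.g. g5.2xlarge -> g5-2xlarge-queue / g5-2xlarge-ce.
queue_name() {
    printf '%s-queue\n' "$(printf '%s' "$1" | tr '.' '-')"
}

compute_env_name() {
    printf '%s-ce\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Print (do not run) a submit command for one benchmarking job.
# "basecaller-job-def" is a hypothetical job definition name.
print_submit_command() {
    printf 'aws batch submit-job --job-name %s --job-queue %s --job-definition %s\n' \
        "bench-$(printf '%s' "$1" | tr '.' '-')" \
        "$(queue_name "$1")" \
        "basecaller-job-def"
}

queue_name g5.2xlarge          # prints g5-2xlarge-queue
compute_env_name g5.2xlarge    # prints g5-2xlarge-ce
print_submit_command g5.2xlarge
```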

We used Amazon Elastic Container Registry (Amazon ECR) to host container images. We also used EC2 Image Builder to simplify the build, test and deployment of container images. This improved iteration times, as the container image definitions changed frequently at the beginning of the project.

We used the AWS Cloud Development Kit (CDK) to orchestrate the infrastructure deployment. CDK is an infrastructure as code solution that makes it possible to create and tear down infrastructure rapidly. As a result, we were able to rebuild the environment quickly after design changes. Furthermore, CDK reduced cost, as we could tear down the environment when it was not in use between test phases.

The CliveOME 5mC dataset is automatically downloaded as part of the environment build by an EC2 instance (the “downloader”) and placed in the S3 bucket that backs the Amazon FSx for Lustre file system. Once the download is completed, the instance is terminated by CDK to avoid the cost of an idle EC2 instance.

Conclusion

In this blog post we demonstrated the successful execution of the Oxford Nanopore basecallers Guppy and Dorado on 20 different Amazon EC2 GPU instance types. The new software architecture of the Dorado basecaller delivers significantly higher performance than the previous basecaller, Guppy. For example, we see a 3.8x performance increase when performing methylation calling with 5hmCG.

Our estimates for computing cost demonstrate that customers should not focus only on high-powered EC2 instances but should also evaluate smaller instance types. Longer runtimes can be compensated for by running multiple basecalling jobs in parallel on many EC2 instances.

Customers should consider an architecture for basecalling that allows them to choose the right compute environment depending on requirements for basecalling time and cost. A setting that requires processing a small number of samples in a short time will benefit from high-performance instances such as the p4d.24xlarge, which can process one WGS in less than an hour. On the other hand, large population-scale genome research projects will benefit from the cost effectiveness of smaller instance types. Further cost savings can be realized by using Spot Instances.

Customers with research projects in the area of DNA methylation should consider Dorado for their genomics pipelines. With Dorado, methylation calling information can be extracted with only a small increase in cost (g5.xlarge: 2% increase for 5mCG and 9% for 5mCG_5hmCG) compared to basecalling without methylation calling.

The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.

AWS HPC Blog