AWS for Industries
Enhanced Genomic Data Storage and Workflows with MGI, Sentieon, and AWS
The genomic research landscape is evolving rapidly, demanding robust and scalable solutions for data storage and analysis. MGI, Sentieon, and Amazon Web Services (AWS) have collaborated on an integrated solution to meet this demand. Researchers and clinicians can now pair MGI’s sequencing technologies with AWS HealthOmics and Sentieon’s workflows for efficient data storage, retrieval, and organization, and for high-accuracy genomic analysis at scale.
This post presents the solution architecture, workflow options, and implementation details for high-throughput genomic analysis in research laboratories.
High-Throughput Genomic Data Analysis at the European MGI Headquarters in Berlin
High-throughput sequencing facilities face significant hurdles in managing massive data volumes and analysis workflows. Traditional approaches using local storage and computing infrastructure struggle with the scale of data generation, often resulting in processing bottlenecks, high maintenance costs, and complex data security requirements. Labs need a solution that can seamlessly handle data storage, provide scalable computing resources, and maintain security compliance while offering predictable costs.
MGI-tech, a leader in biotechnology innovation, develops and delivers next-generation sequencing (NGS) and laboratory automation technologies. Its flagship DNBSEQ™ platforms offer high-accuracy, high-throughput genetic analysis solutions for precision medicine, agriculture, and healthcare applications.
The European MGI Headquarters in Berlin houses three high-throughput T7 sequencers that together generate up to 63 terabases (Tb) of genomic data per week. The data is securely transmitted to AWS and ingested into the AWS HealthOmics Sequence Store, a scalable and cost-optimized storage solution for genomic data. Within AWS HealthOmics, Sentieon’s high-performance private workflows are executed for fast and precise data analysis.
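To put the weekly volume in perspective, the short calculation below combines the three-sequencer figure with the per-instrument output of up to 21 Tb per week quoted in the architecture overview. The genome size and the 30x coverage target are illustrative assumptions, not figures from this post, so the genome count is only a rough upper bound.

```python
# Figures from the post: three DNBSEQ-T7 instruments, up to 21 Tb each per week.
SEQUENCERS = 3
TB_PER_SEQUENCER_PER_WEEK = 21  # terabases

# Assumptions for illustration only (not from the post):
HUMAN_GENOME_GB = 3.1  # approximate human genome size in gigabases
TARGET_COVERAGE = 30   # a commonly used depth for germline whole genomes


def weekly_output_tb(sequencers: int, tb_each: float) -> float:
    """Peak combined output of all instruments, in terabases per week."""
    return sequencers * tb_each


def genomes_per_week(total_tb: float, genome_gb: float, coverage: float) -> int:
    """Whole genomes that fit into the weekly output at the given coverage."""
    gb_per_genome = genome_gb * coverage
    return int(total_tb * 1000 // gb_per_genome)


total_tb = weekly_output_tb(SEQUENCERS, TB_PER_SEQUENCER_PER_WEEK)
genomes = genomes_per_week(total_tb, HUMAN_GENOME_GB, TARGET_COVERAGE)
print(f"Peak weekly output: {total_tb} Tb, about {genomes} genomes at {TARGET_COVERAGE}x")
```

Under these assumptions the peak output corresponds to several hundred 30x human genomes per week, which is the scale the storage and workflow layers need to absorb.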
This integrated, high-throughput solution empowers researchers and clinicians at the European MGI Headquarters to derive valuable insights from complex genomic data for critical applications. These include germline disease diagnosis, cancer research, drug development, and large-scale population genomics studies.
AWS HealthOmics and Sentieon
AWS HealthOmics is a secure, scalable service for genomic data storage, processing, and analysis. The HealthOmics Sequence Store offers a cost-effective way to store large volumes of sequencing data and makes that data readily accessible for downstream analysis through integration with HealthOmics workflows. To protect sensitive customer data at rest, HealthOmics provides encryption by default using a service-owned AWS Key Management Service (AWS KMS) key. Customer managed keys are also supported.
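The two encryption options can be illustrated with a minimal sketch that builds the request for the HealthOmics CreateSequenceStore API, as exposed by the boto3 `omics` client. The store name and key ARN are placeholders, and the helper function names are hypothetical; this is not the configuration used at the MGI site.

```python
from typing import Optional


def build_sequence_store_request(name: str, kms_key_arn: Optional[str] = None) -> dict:
    """Build keyword arguments for the HealthOmics create_sequence_store call."""
    request = {"name": name}
    if kms_key_arn is not None:
        # Customer managed key: pass its ARN in the sseConfig block.
        request["sseConfig"] = {"type": "KMS", "keyArn": kms_key_arn}
    # Without sseConfig, the default service-owned key is used.
    return request


def create_sequence_store(request: dict) -> str:
    """Send the request and return the new sequence store ID."""
    import boto3  # imported lazily so the builder above runs without the AWS SDK

    omics = boto3.client("omics")
    response = omics.create_sequence_store(**request)
    return response["id"]


# Placeholder values for illustration only.
default_request = build_sequence_store_request("example-t7-reads")
cmk_request = build_sequence_store_request(
    "example-t7-reads",
    kms_key_arn="arn:aws:kms:eu-central-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(default_request)
print(cmk_request)
```

Keeping the request builder separate from the API call makes it possible to review the encryption settings before any resource is created.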
Sentieon, an AWS partner since 2014, has developed highly optimized bioinformatics algorithms for fast and accurate genomic data processing. With HealthOmics workflows, Sentieon’s DNAscope and TNseq pipelines can be run either as pre-defined Ready2Run workflows or as customizable private workflows.
Architecture Overview
Figure 1 – The workflow architecture implemented at the European MGI Headquarters in Berlin.
Figure 1 shows how:
- The MGI DNBSEQ-T7 sequencer generates up to 21 Tb of data per week in CAL format (a binary file format produced by the MGI sequencer basecall software).
- MGI ZTRON Lite Pro converts CAL files into FASTQ files and enables data delivery.
- FASTQ files are securely transferred to Amazon Simple Storage Service (Amazon S3) on AWS.
- The files are imported into the AWS HealthOmics Sequence Store. Alternatively, a FASTQ file can be uploaded directly into the HealthOmics Sequence Store from local storage through HealthOmics Transfer Manager.
- Sentieon’s genomics software, such as the DNAscope and TNseq pipelines, runs on HealthOmics workflows either as Ready2Run workflows or as customizable private workflows.
- At the end of the workflow run, AWS HealthOmics transfers the resulting BAM and VCF files to an S3 bucket.
- The results are accessed by users such as clinical researchers.
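The import step in the list above (FASTQ files in Amazon S3 to the HealthOmics Sequence Store) can be sketched with the boto3 `omics` client's `start_read_set_import_job` call. This is a minimal sketch under stated assumptions: the bucket, sample and subject identifiers, sequence store ID, and IAM role ARN are placeholders, and the helper function names are hypothetical.

```python
from typing import List, Optional


def build_read_set_source(sample_id: str, subject_id: str,
                          fastq_r1: str, fastq_r2: Optional[str] = None) -> dict:
    """Describe one (optionally paired-end) FASTQ read set stored in Amazon S3."""
    uris = [fastq_r1] + ([fastq_r2] if fastq_r2 else [])
    for uri in uris:
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")
    source_files = {"source1": fastq_r1}
    if fastq_r2:
        source_files["source2"] = fastq_r2  # second file of a paired-end run
    return {
        "sourceFiles": source_files,
        "sourceFileType": "FASTQ",
        "sampleId": sample_id,
        "subjectId": subject_id,
    }


def build_import_request(sequence_store_id: str, role_arn: str,
                         sources: List[dict]) -> dict:
    """Build keyword arguments for start_read_set_import_job."""
    return {"sequenceStoreId": sequence_store_id, "roleArn": role_arn, "sources": sources}


def start_import(request: dict) -> str:
    """Submit the import job and return its ID."""
    import boto3  # imported lazily so the builders above run without the AWS SDK

    omics = boto3.client("omics")
    return omics.start_read_set_import_job(**request)["id"]


# Placeholder values for illustration only.
source = build_read_set_source(
    "sample-001", "subject-001",
    "s3://example-bucket/run1/sample-001_R1.fastq.gz",
    "s3://example-bucket/run1/sample-001_R2.fastq.gz",
)
request = build_import_request(
    "1234567890", "arn:aws:iam::111122223333:role/ExampleImportRole", [source]
)
print(request)
```

The role passed as `roleArn` needs read access to the source bucket. Several read sets can be batched into one job by adding more entries to `sources`.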
Ready2Run Workflows: Fast and Easy Setup
For researchers looking for a quick, straightforward solution, Sentieon’s Ready2Run workflows provide pre-built, optimized pipelines that can be deployed with just a few clicks or a simple API call and run at fixed costs with predictable runtime. These workflows are designed to handle various genomic tasks, including:
- DNAscope for germline variant calling, optimized for the MGI sequencing platform.
- TNseq for somatic variant calling, which matches the accuracy of GATK’s Mutect2 while offering faster runtimes for MGI data.
With AWS HealthOmics workflows, researchers can select from Sentieon’s Ready2Run pipelines, which cover different analysis needs, reference genomes, and sequencing platforms. These workflows are priced on a per-run basis, offering predictable costs and the scalability needed for high-throughput genomic projects.
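The "simple API call" mentioned above can be sketched with the boto3 `omics` client's `start_run` call and `workflowType` set to `READY2RUN`. The workflow ID, role ARN, output location, and especially the input parameter names are placeholders chosen for illustration. The real parameter names are defined by each Ready2Run workflow and should be taken from its parameter template.

```python
def build_ready2run_request(workflow_id: str, role_arn: str, output_uri: str,
                            parameters: dict, name: str = "") -> dict:
    """Build keyword arguments for the HealthOmics start_run call."""
    if not output_uri.startswith("s3://"):
        raise ValueError("output_uri must be an s3:// location")
    request = {
        "workflowId": workflow_id,
        "workflowType": "READY2RUN",  # "PRIVATE" would select a private workflow
        "roleArn": role_arn,
        "outputUri": output_uri,      # where result files such as BAM and VCF are written
        "parameters": parameters,
    }
    if name:
        request["name"] = name
    return request


def start_run(request: dict) -> str:
    """Submit the run and return its ID."""
    import boto3  # imported lazily so the builder above runs without the AWS SDK

    omics = boto3.client("omics")
    return omics.start_run(**request)["id"]


# Placeholder values; the parameter names below are hypothetical.
run_request = build_ready2run_request(
    workflow_id="0000000",
    role_arn="arn:aws:iam::111122223333:role/ExampleWorkflowRole",
    output_uri="s3://example-bucket/results/sample-001/",
    parameters={
        "fastq_r1": "s3://example-bucket/run1/sample-001_R1.fastq.gz",
        "fastq_r2": "s3://example-bucket/run1/sample-001_R2.fastq.gz",
    },
    name="sample-001-germline",
)
print(run_request)
```

Because Ready2Run workflows are fixed pipelines, the caller supplies only inputs and an output location. Compute and storage sizing are handled by the service.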
Private Workflows: Full Customization for Advanced Users
For teams requiring more control or customization, Sentieon also supports running pipelines as private workflows on AWS HealthOmics. These workflows offer complete flexibility, allowing researchers to customize and optimize workflows for their specific needs.
Running Sentieon private workflows allows users to:
- Build their own Sentieon container images for AWS.
- Customize parameters, reference genomes, and analysis outputs.
- Take advantage of AWS HealthOmics infrastructure while maintaining control over pipeline configurations.
To get started with private workflows, users can follow the detailed setup instructions available on the Sentieon GitHub repository. This repository includes everything needed to build and deploy customized workflows, including container images, setup scripts, and workflow examples for both WDL and Nextflow engines. For instructions on enabling the Sentieon license server for private workflows, see: Requesting Sentieon licenses for private workflows.
Conclusion
The collaboration between MGI, Sentieon, and AWS delivers a robust solution for genomics labs that use MGI sequencing technology. It offers both streamlined and customizable workflows that integrate with AWS infrastructure. Whether running Sentieon Ready2Run workflows or setting up private workflows, researchers can use AWS HealthOmics to store and analyze genomic data at scale in a secure and compliant environment, enabling faster, more accurate discoveries.
Built-in security features, including AWS KMS encryption, help protect data, while the fixed-price Ready2Run workflows provide predictable costs and automated resource management.
The solution also helps eliminate the processing bottlenecks typically associated with high-throughput sequencing operations.
To learn more about implementing this integrated solution for your genomics research or clinical applications, consult the AWS HealthOmics webpage and the Sentieon GitHub repository, or contact an AWS Representative for a personalized consultation on how this technology can accelerate your genomic analysis workflows.
Acknowledgements
We wish to thank Bioinformatics Scientist Dr. Liene Astica, Field Bioinformatics Scientist Ying Zhan, and Field Application Scientist Ongeziwe Mbhele from MGI-tech for their contributions to this blog post.