AWS for Industries

Enhanced Genomic Data Storage and Workflows with MGI, Sentieon, and AWS

The genomic research landscape is evolving rapidly, demanding robust and scalable solutions for data storage and analysis. MGI, Sentieon, and Amazon Web Services (AWS) have collaborated on an integrated solution that addresses these demands. Researchers and clinicians can now combine MGI’s sequencing technologies with AWS HealthOmics and Sentieon’s workflows for efficient data storage, retrieval, and organization, and for high-accuracy genomic analysis at scale.

This post presents the solution architecture, workflow options, and implementation details of this collaboration for high-throughput genomic analysis in research laboratories.

High-Throughput Genomic Data Analysis at the European MGI Headquarters in Berlin

High-throughput sequencing facilities face significant hurdles in managing massive data volumes and analysis workflows. Traditional approaches using local storage and computing infrastructure struggle with the scale of data generation, often resulting in processing bottlenecks, high maintenance costs, and complex data security requirements. Labs need a solution that can seamlessly handle data storage, provide scalable computing resources, and maintain security compliance while offering predictable costs.

MGI-tech, a leader in biotechnology innovation, develops and delivers next-generation sequencing (NGS) and laboratory automation technologies. Its flagship DNBSEQ™ platforms offer high-accuracy, high-throughput genetic analysis solutions for precision medicine, agriculture, and healthcare applications.

The European MGI Headquarters in Berlin houses three high-throughput DNBSEQ-T7 sequencers that together generate up to 63 terabases (Tb) of genomic data per week. This data is securely transmitted to AWS and ingested into the AWS HealthOmics Sequence Store, a scalable and cost-optimized storage solution for genomic data. Within AWS HealthOmics, Sentieon’s high-performance private workflows are then run for fast and precise data analysis.
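As a quick consistency check on these figures, the short sketch below multiplies the per-instrument output given in the architecture overview (up to 21 Tb per week) by the three instruments on site. The genome estimate at the end is only illustrative: the genome size of 3.1 gigabases and the 30x coverage are assumptions added here, not numbers from MGI.

```python
# Figures quoted in this post: three DNBSEQ-T7 instruments, up to 21 Tb each per week.
SEQUENCERS = 3
TB_PER_SEQUENCER_PER_WEEK = 21  # terabases, upper bound


def weekly_output_tb(sequencers: int, tb_per_sequencer: float) -> float:
    """Upper-bound weekly output of the site in terabases."""
    return sequencers * tb_per_sequencer


def genomes_per_week(total_tb: float, genome_gb: float = 3.1,
                     coverage: float = 30.0) -> int:
    """Rough number of whole genomes that fit into total_tb.

    genome_gb and coverage are illustrative assumptions, not values from the post.
    """
    bases_per_genome = genome_gb * 1e9 * coverage
    return int(total_tb * 1e12 // bases_per_genome)


total = weekly_output_tb(SEQUENCERS, TB_PER_SEQUENCER_PER_WEEK)
print(total, "Tb per week (upper bound)")
print(genomes_per_week(total), "genomes per week under the assumed size and coverage")
```

Volumes of this order are the reason the rest of the pipeline relies on managed storage and compute rather than local infrastructure.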

This integrated, high-throughput solution empowers researchers and clinicians at the European MGI Headquarters to derive valuable insights from complex genomic data for critical applications. These include germline disease diagnosis, cancer research, drug development, and large-scale population genomics studies.

AWS HealthOmics and Sentieon

AWS HealthOmics is a secure, scalable service for genomic data storage, processing and analysis. The HealthOmics Sequence Store offers a cost-effective way to store large volumes of sequencing data, making it readily accessible for downstream analysis through seamless integration with HealthOmics workflows. To protect sensitive customer data at rest, HealthOmics provides encryption by default using a service-owned AWS Key Management Service (AWS KMS) key. Customer managed keys are also supported.
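To illustrate the two encryption options, here is a minimal sketch of how a sequence store request could be assembled for the AWS SDK for Python (boto3). The helper names, the store name, and the key ARN are hypothetical. The sketch assumes the `create_sequence_store` operation of the boto3 `omics` client with its `sseConfig` parameter, so check the current SDK reference before relying on it.

```python
from typing import Optional


def sequence_store_request(name: str, kms_key_arn: Optional[str] = None) -> dict:
    """Build keyword arguments for the HealthOmics CreateSequenceStore call."""
    request = {"name": name}
    if kms_key_arn is not None:
        # Customer managed key. Leaving sseConfig out keeps the default
        # encryption with the service-owned key described above.
        request["sseConfig"] = {"type": "KMS", "keyArn": kms_key_arn}
    return request


def create_sequence_store(request: dict) -> str:
    """Send the request and return the new store ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    omics = boto3.client("omics")
    return omics.create_sequence_store(**request)["id"]


# Hypothetical values, for illustration only.
print(sequence_store_request("t7-berlin-reads"))
print(sequence_store_request(
    "t7-berlin-reads",
    kms_key_arn="arn:aws:kms:eu-central-1:111122223333:key/EXAMPLE",
))
```

Keeping the request builder separate from the network call makes the encryption choice easy to review and test before anything is created in an account.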

Sentieon, an AWS partner since 2014, has developed highly optimized bioinformatics algorithms for fast and accurate genomic data processing. With HealthOmics workflows, Sentieon’s DNAscope and TNseq pipelines can be run either as pre-defined Ready2Run workflows or as customizable private workflows.

Architecture Overview

Figure 1 – The workflow architecture implemented at the European MGI Headquarters in Berlin.

Figure 1 shows how:

  1. The MGI DNBSEQ-T7 sequencer generates up to 21 Tb of data per week in CAL format (a binary file format produced by the MGI sequencer basecall software).
  2. MGI ZTRON Lite Pro converts CAL files into FASTQ files and enables data delivery.
  3. FASTQ files are securely transferred to Amazon Simple Storage Service (Amazon S3) on AWS.
  4. The files are imported into the AWS HealthOmics Sequence Store. Alternatively, a FASTQ file can be uploaded directly into the HealthOmics Sequence Store from local storage through HealthOmics Transfer Manager.
  5. Sentieon’s genomics pipelines, such as DNAscope and TNseq, are run with HealthOmics workflows, either as Ready2Run workflows or as customizable private workflows.
  6. At the end of the workflow run, AWS HealthOmics transfers the resulting BAM and VCF files to an S3 bucket.
  7. The results are accessed by users such as clinical researchers.
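Steps 3 and 4 can be sketched in Python as follows. This is a minimal example under stated assumptions, not the configuration used in Berlin: the bucket, sample identifiers, store ID, and role ARN are placeholders, and the request shape assumes the `start_read_set_import_job` operation of the boto3 `omics` client.

```python
from typing import Optional


def fastq_source(sample_id: str, subject_id: str, r1_uri: str,
                 r2_uri: Optional[str] = None) -> dict:
    """Describe one FASTQ read set (single- or paired-end) stored in Amazon S3."""
    files = {"source1": r1_uri}
    if r2_uri is not None:
        files["source2"] = r2_uri
    for uri in files.values():
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an S3 URI, got {uri!r}")
    return {
        "sourceFiles": files,
        "sourceFileType": "FASTQ",
        "sampleId": sample_id,
        "subjectId": subject_id,
    }


def import_request(sequence_store_id: str, role_arn: str, sources: list) -> dict:
    """Build keyword arguments for a read set import job."""
    return {
        "sequenceStoreId": sequence_store_id,
        "roleArn": role_arn,
        "sources": sources,
    }


def start_import(request: dict) -> str:
    """Submit the import job and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builders work without the SDK

    omics = boto3.client("omics")
    return omics.start_read_set_import_job(**request)["id"]


# Placeholder identifiers, for illustration only.
source = fastq_source(
    sample_id="sample-001",
    subject_id="subject-001",
    r1_uri="s3://example-bucket/run1/sample-001_1.fq.gz",
    r2_uri="s3://example-bucket/run1/sample-001_2.fq.gz",
)
print(import_request("1234567890", "arn:aws:iam::111122223333:role/ExampleImportRole", [source]))
```

The IAM role passed in `roleArn` needs read access to the source bucket. For uploads from local storage, the HealthOmics Transfer Manager mentioned in step 4 is the alternative route.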

Ready2Run Workflows: Fast and Easy Setup

For researchers looking for a quick, straightforward solution, Sentieon’s Ready2Run workflows provide pre-built, optimized pipelines that can be deployed with just a few clicks or a simple API call and run at fixed costs with predictable runtime. These workflows are designed to handle various genomic tasks, including:

  • DNAscope for germline variant calling, optimized for the MGI sequencing platform.
  • TNseq for somatic variant calling, which matches the accuracy of GATK’s Mutect2 while offering faster runtimes for MGI data.

With AWS HealthOmics workflows, researchers can select from Sentieon’s Ready2Run pipelines, which cover different analysis needs, reference genomes, and sequencing platforms. These workflows are priced on a per-run basis, offering predictable costs and the scalability needed for high-throughput genomic projects.
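The "simple API call" could look roughly like the sketch below, which assumes the `list_workflows` and `start_run` operations of the boto3 `omics` client. The workflow ID, role ARN, bucket, and parameter names are hypothetical. Each Ready2Run workflow defines its own parameters, so take the real names from the workflow's details in HealthOmics.

```python
def pick_workflows(workflows: list, keyword: str) -> list:
    """Return the workflows whose name contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [w for w in workflows if needle in w.get("name", "").lower()]


def ready2run_request(workflow_id: str, role_arn: str, output_uri: str,
                      parameters: dict, run_name: str) -> dict:
    """Build keyword arguments for starting a Ready2Run workflow run."""
    return {
        "workflowId": workflow_id,
        "workflowType": "READY2RUN",
        "roleArn": role_arn,
        "outputUri": output_uri,
        "parameters": parameters,
        "name": run_name,
    }


def list_ready2run() -> list:
    """Fetch the first page of Ready2Run workflows (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers work without the SDK

    omics = boto3.client("omics")
    return omics.list_workflows(type="READY2RUN")["items"]


def start_run(request: dict) -> str:
    """Start the run and return its ID (needs AWS credentials)."""
    import boto3

    omics = boto3.client("omics")
    return omics.start_run(**request)["id"]


# Offline illustration with a made-up workflow listing and parameters.
catalog = [
    {"id": "1111111", "name": "Example Sentieon germline workflow"},
    {"id": "2222222", "name": "Some other workflow"},
]
chosen = pick_workflows(catalog, "sentieon")[0]
print(ready2run_request(
    workflow_id=chosen["id"],
    role_arn="arn:aws:iam::111122223333:role/ExampleRunRole",
    output_uri="s3://example-bucket/results/",
    parameters={"example_input": "s3://example-bucket/run1/sample-001_1.fq.gz"},
    run_name="sample-001-germline",
))
```

Because `list_ready2run` returns only the first page, a complete listing would need to follow the pagination token or use the SDK's paginator.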

Private Workflows: Full Customization for Advanced Users

For teams requiring more control or customization, Sentieon also supports running pipelines as private workflows on AWS HealthOmics. These workflows offer complete flexibility, allowing researchers to customize and optimize workflows for their specific needs.

Running Sentieon pipelines as private workflows allows users to:

  • Build their own Sentieon container images for AWS.
  • Customize parameters, reference genomes, and analysis outputs.
  • Take advantage of AWS HealthOmics infrastructure while maintaining control over pipeline configurations.

To get started with private workflows, users can follow the detailed setup instructions available on the Sentieon GitHub repository. This repository includes everything needed to build and deploy customized workflows, including container images, setup scripts, and workflow examples for both WDL and Nextflow engines. For instructions on enabling the Sentieon license server for private workflows, read: Requesting Sentieon licenses for private workflows.
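For orientation, the sketch below shows how a workflow definition could be packaged and registered as a private workflow, assuming the `create_workflow` operation of the boto3 `omics` client. It is not a substitute for the scripts in the Sentieon repository: the file name, the one-line WDL placeholder, and the parameter name are hypothetical, and the engine check is limited to the two engines mentioned above.

```python
import io
import zipfile

SUPPORTED_ENGINES = {"WDL", "NEXTFLOW"}  # the two engines mentioned in this post


def zip_definition(files: dict) -> bytes:
    """Pack workflow source files (path -> text) into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, text in files.items():
            archive.writestr(path, text)
    return buffer.getvalue()


def private_workflow_request(name: str, engine: str, main: str,
                             files: dict, parameter_template: dict) -> dict:
    """Build keyword arguments for registering a private workflow."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    if main not in files:
        raise ValueError(f"main file {main!r} is not among the workflow files")
    return {
        "name": name,
        "engine": engine,
        "main": main,
        "definitionZip": zip_definition(files),
        "parameterTemplate": parameter_template,
    }


def create_private_workflow(request: dict) -> str:
    """Register the workflow and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builders work without the SDK

    omics = boto3.client("omics")
    return omics.create_workflow(**request)["id"]


# Placeholder definition and parameter, for illustration only.
request = private_workflow_request(
    name="example-private-workflow",
    engine="WDL",
    main="main.wdl",
    files={"main.wdl": "version 1.0\n"},
    parameter_template={
        "example_input": {"description": "Placeholder input", "optional": False},
    },
)
print(request["engine"], len(request["definitionZip"]), "bytes in the archive")
```

Once the workflow is registered, a run is started in the same way as for Ready2Run workflows, with the workflow type set to `PRIVATE`.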

Conclusion

The collaboration between MGI, Sentieon, and AWS delivers a robust solution for genomics labs that use MGI sequencing technology. It offers both streamlined and customizable workflows that integrate with AWS infrastructure. Whether running Sentieon Ready2Run workflows or setting up private workflows, researchers can use AWS HealthOmics to store and analyze genomic data at scale in a secure and compliant environment, which enables faster, more accurate discoveries.

Built-in security features, including AWS KMS encryption, help protect data, while the fixed-price Ready2Run workflows provide predictable costs and automated resource management. The solution also helps eliminate the processing bottlenecks typically associated with high-throughput sequencing operations.

To learn more about implementing this integrated solution for your genomics research or clinical applications, consult the AWS HealthOmics webpage and the Sentieon GitHub repository. For a personalized consultation on how this technology can accelerate your genomic analysis workflows, contact an AWS Representative.

Acknowledgements

We wish to thank Bioinformatics Scientist Dr. Liene Astica, Field Bioinformatics Scientist Ying Zhan, and Field Application Scientist Ongeziwe Mbhele from MGI-tech for their contributions to this blog post.

Dr. Tomasz Zemojtel

Dr. Tomasz Zemojtel is a Healthcare Business Development Manager in the EMEA region at AWS. With a background in healthcare and technology, he specializes in helping healthcare organizations and research-clinical institutions leverage cloud services to drive innovation. Tomasz is passionate about unlocking the potential of large clinical datasets through cloud computing to advance personalized therapies and integrated healthcare solutions. Outside of work, he enjoys jogging, hiking in the mountains, and spending time at the seaside.

Christopher Gatsch

Christopher Gatsch is a Field Bioinformatics Scientist at MGI, focused on optimizing client workflows and improving the BIT product experience. His passions are reading books, outdoor activities, and travelling to different countries.

Dr. Manuel Delpero

Dr. Manuel Delpero is the Associate Manager of the Application Scientist department at MGI, specializing in optimizing client workflows and bridging the gap between bioinformatics and commercial strategy. With a PhD in molecular genetics and bioinformatics, he has extensive industry experience in both technical and business aspects of biotechnology. His passions include strength training, traveling, and exploring new business opportunities.

Zhan Ying

Zhan Ying is an IT specialist with a background in Artificial Intelligence, providing comprehensive BIT solutions. He is responsible for BIT-related IT support, including machine integration, user lab networking, and post-sales troubleshooting.

Dr. Michael Mueller

Dr. Michael Mueller is a Senior Genomics Solutions Architect at AWS. He is passionate about helping customers to combine the power of genomics and cloud computing to improve healthcare. He likes exploring the big city jungle of London as much as the outdoors of the UK and Europe.