2024

Centre for Cellular and Molecular Biology Achieves 98% Time Savings for Data Analysis on AWS

Learn how the Centre for Cellular and Molecular Biology uses Amazon S3 for storage and backup of critical genomic data, utilizing other AWS services to accelerate analysis and facilitate collaboration with other research organizations.

Overview

The Centre for Cellular and Molecular Biology (CCMB) is a premier research institute under the Ministry of Science and Technology, Government of India. To ensure reliable backup and offsite storage of critical genomic data, CCMB migrated targeted workloads to Amazon Web Services (AWS).

The institute is using Amazon S3 and Amazon EBS for scalable storage, AWS Batch for faster data processing, and the Registry of Open Data on AWS to utilize key public datasets. CCMB can perform research nearly twice as fast on AWS without the need to download and consolidate datasets or worry about storage redundancy and configuration.


About the Centre for Cellular and Molecular Biology (CCMB)

Reporting to India’s Ministry of Science and Technology, the Centre for Cellular and Molecular Biology (CCMB) was established in 1977 to conduct high-quality research and training in frontier areas of modern biology. CCMB is a leader in human genomics research, playing a pivotal role in national-scale projects such as GenomeIndia, the India Breast Cancer Genome Atlas, INSACOG, and the Indian Tuberculosis Genomic Surveillance Consortium (InTGS).

Opportunity | Turning to the Cloud to Back Up Critical Research

The Centre for Cellular and Molecular Biology (CCMB) is a premier research organization and a pioneer in human genomics research in India. The institute is a primary partner in initiatives including the GenomeIndia project, which aims to construct a comprehensive catalog of genetic variations for Indian populations by sequencing 10,000 genomes from various individuals across the country.

During the pandemic, while pursuing time-sensitive virus-related research for the government, CCMB experienced a major failure in a local backup server. Because all of the institute’s data was stored on this server, the failure caused long delays in retrieving critical data that the government was using to make daily policy decisions. CCMB had already been considering the cloud as an offsite backup option for its on-premises infrastructure, and this failure prompted the institute to accelerate cloud adoption.

With the move to cloud, CCMB also hoped for faster connections to speed up data retrieval and processing. The institute utilizes terabytes of data from multiple public and private databases, across institutes and even countries, and typically needs to download data from these sources before beginning analysis. On a good day, with a strong connection and no network downtime, CCMB could download about 1 TB of data per day. It thus needed to plan research projects days or weeks in advance to ensure data was ready on demand.
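To make the planning burden concrete, the short Python sketch below estimates how long a download would take at the best-case rate of about 1 TB per day quoted above. The dataset sizes (83 TB of raw data and roughly 110 TB for the India Breast Cancer Genome Atlas) are the figures cited elsewhere in this story. The function name and the assumption of a constant, uninterrupted rate are illustrative only and do not describe CCMB's actual tooling.

```python
import math


def days_to_download(dataset_tb: float, tb_per_day: float = 1.0) -> int:
    """Return the whole number of days needed to download a dataset.

    Assumes a constant sustained rate with no network downtime, which the
    story describes as the best case ("on a good day").
    """
    if dataset_tb < 0:
        raise ValueError("dataset_tb must not be negative")
    if tb_per_day <= 0:
        raise ValueError("tb_per_day must be positive")
    # Round up: a partially used day still has to be planned for.
    return math.ceil(dataset_tb / tb_per_day)


if __name__ == "__main__":
    # Sizes taken from the story: 83 TB of raw data, about 110 TB for the
    # India Breast Cancer Genome Atlas.
    for size_tb in (83, 110):
        print(f"{size_tb} TB at 1 TB/day: {days_to_download(size_tb)} days")
```

At that rate, even a single large dataset would take months to transfer, which is why research had to be scheduled so far ahead.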

Solution | Migrating Data and Custom Pipelines for Faster Analysis

CCMB issued a request for proposal (RFP) to public cloud providers, outlining its strict data residency requirements and the need to stream data directly from its Illumina sequencing platform to cloud-based storage. Ultimately, the institute decided on Amazon Web Services (AWS), largely because Illumina has a direct connect option with Amazon Simple Storage Service (Amazon S3). CCMB currently uses Amazon S3 for redundant storage of over 83 TB of raw data, as well as data from primary and secondary analysis.

Another important factor in choosing AWS was the ability to reference multiple genomic databases from the Registry of Open Data on AWS. Dr. Divya Tej Sowpati, scientist at CCMB, elaborates, “The open data portal on AWS is much more extensive than other cloud providers. This helps us to work with viewable data sets without having to worry about long download times.”

In addition to the storage and backup use case, CCMB has started adopting compute infrastructure on AWS to improve collaboration. “Working on the cloud has been particularly helpful when we want to do simultaneous analysis, with everyone working on different aspects of the same data,” says Tej Sowpati.

To run its custom pipelines for post-analysis data more efficiently, CCMB introduced AWS Batch. In many of its analytics workflows, the institute needs to process a hundred or more independent files at once, and AWS Batch automates the efficient allocation of compute resources. CCMB migrated its bioinformatics pipelines for secondary analysis, which make up the bulk of its workflows, to Amazon Genomics CLI, the precursor to AWS HealthOmics.
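One common way to fan out work over many independent files in AWS Batch is an array job, in which a single submission spawns one child job per file. The story does not say whether CCMB uses array jobs, so the Python sketch below is only an illustration of the pattern. The queue name, job definition name, manifest location, and the MANIFEST_URI variable are hypothetical placeholders. The request fields (jobName, jobQueue, jobDefinition, arrayProperties, containerOverrides) are those of the AWS Batch SubmitJob API, and AWS_BATCH_JOB_ARRAY_INDEX is the environment variable Batch sets in each child job.

```python
from typing import Any, Mapping, Sequence


def build_array_job_request(
    file_count: int,
    job_queue: str,
    job_definition: str,
    manifest_uri: str,
    job_name: str = "per-file-analysis",  # hypothetical job name
) -> dict[str, Any]:
    """Build SubmitJob parameters for one array job with one child per file."""
    # AWS Batch accepts array sizes from 2 to 10,000.
    if not 2 <= file_count <= 10_000:
        raise ValueError("array size must be between 2 and 10,000")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": file_count},
        "containerOverrides": {
            # MANIFEST_URI is a hypothetical variable telling each child
            # where to find the list of input files.
            "environment": [{"name": "MANIFEST_URI", "value": manifest_uri}]
        },
    }


def select_input(inputs: Sequence[str], env: Mapping[str, str]) -> str:
    """Inside a child job, pick the input file matching this child's index."""
    index = int(env["AWS_BATCH_JOB_ARRAY_INDEX"])
    return inputs[index]


if __name__ == "__main__":
    request = build_array_job_request(
        file_count=120,
        job_queue="example-queue",            # hypothetical
        job_definition="example-definition",  # hypothetical
        manifest_uri="s3://example-bucket/manifest.txt",  # hypothetical
    )
    print(request)
    # With credentials configured, the request could be submitted with:
    #   import boto3
    #   boto3.client("batch").submit_job(**request)
```

Building the request separately from submitting it keeps the sketch runnable without AWS credentials and makes the parameters easy to inspect before any compute is launched.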

In addition, CCMB increasingly relies on GPU instances to train its machine learning (ML) models. Demand for on-premises GPU servers in India is high, and lead times for GPU-based analyses can extend six months or more. On the cloud, however, CCMB can instantly access the scalable GPU power it needs. When using GPU instances on Amazon Elastic Compute Cloud (Amazon EC2), CCMB links to high-performance storage on Amazon Elastic Block Store (Amazon EBS). 

From the start of the RFP process, leaders at CCMB noted the extensive support they received from AWS and AWS Partner Locuz. Locuz streamlined the administrative aspects of cloud implementation for CCMB, adjusting the billing structure to the institute’s preferences and ensuring the technical specifications were correct during the RFP and implementation stages. In addition, the AWS Genomics team supported CCMB with general cloud operations and provided training sessions on specific topics as needed.

Outcome | Benefiting from a Central Data Store with No Downloads Required

By centralizing its data on AWS, CCMB now benefits from reliable storage with zero downtime, while saving significant time and costs. With public data sets and other private institutions storing data on AWS, CCMB no longer needs to download files before analysis—eliminating redundant data consolidation expenses. As a result, Tej Sowpati estimates that CCMB has accelerated its analysis processes by up to 98 percent.

“Previously, we would often find ourselves still downloading the data for a review on the day it was due. But with all the data on AWS, we can perform targeted analysis very quickly with local access to open data sets,” Tej Sowpati says. Similarly, the ability to run standardized workflows on large data sets through Amazon Genomics CLI leads to reduced overhead on resource configuration and allocation. 

Currently, CCMB is working with AWS on the India Breast Cancer Genome Atlas, a project dedicated to analyzing 300 genomic pairs from breast cancer patients, which amounts to about 110 TB of data. The institute is also collaborating with AWS on training AI models to recognize DNA modifications from nanopore sequencing data, relying on GPU instances launched in other AWS Regions for the heavy data processing requirements associated with the work. More importantly, Tej Sowpati emphasizes how the built-in security guardrails on AWS support the confidential nature of CCMB’s work. “Security is one thing we don’t have to worry about because I’m assured that with the high standards inherent with AWS, we get best-in-class data security.”
