2024

The Institut Pasteur and AWS are analysing the world's DNA, using a public database

Institut Pasteur, a leading French virology research centre, processed 20 petabytes of DNA data in a record 30 hours, using AWS Batch on a cluster of 2.18 million AWS Graviton cores.

30 hours

Reduced the wall-clock time for 30 million vCPU hours of computation to 30 hours, with 2.18 million vCPUs mobilised at peak

20-petabyte

First exhaustive use of a 20-petabyte DNA database

AWS expertise

Provision of AWS technical resources and support expertise

Overview

To date, less than 0.01% of existing viruses have been identified, and among the countless species still unknown may lie the culprit behind a future pandemic. Following the Covid-19 crisis, and to make future threats easier to identify, a research project at the Institut Pasteur called "IndexThePlanet" set about analysing and mapping the DNA of the entire living world, using a public database. Processing some 20 petabytes of data, however, required infrastructure on a matching scale. This is the purpose of the partnership with Amazon Web Services (AWS), which has provided the researchers with a cluster of more than 2 million vCPUs to carry out this massive task.

Opportunity

To date, only 0.01% of existing viruses have been identified, and their exact number is still unknown. Among these countless as yet unknown species may lie the culprit behind a future pandemic. Following the Covid-19 crisis, a research team at the Institut Pasteur set about analysing and mapping the DNA of all living organisms to help identify future threats.

Processing 20 petabytes of DNA data required infrastructure on a matching scale. To put this in perspective, this is roughly equivalent to all the data YouTube hosted during its first decade. This is the purpose of the partnership with Amazon Web Services (AWS), which has provided the researchers with a cluster of 2.18 million vCPUs to carry out this massive task.

"The IndexThePlanet project is actually the sequel to an initial research project carried out jointly with an international team, the Serratus project, which led to the identification of new species of coronavirus and other RNA viruses," points out Rayan Chikhi, a bio-computing researcher at the Institut Pasteur. "It enabled us to map ten times as many species as before, with a total of around 3 petabytes of data analysed. Encouraged by this initial success, we decided to take things a step further by broadening the spectrum to include all viruses present on Earth, i.e., by analysing the DNA of all known living organisms. This naturally represents a considerable challenge in terms of computing power, since this time we had to process a volume of data more than six times greater than that of the Serratus project."

"AWS has mobilised considerable resources, which have reached 2.18 million vCPUs at peak for Graviton instances."

"We reckon it would have taken a desktop computer nearly 30 million hours, or 3,400 years, to carry out such a calculation."

Solution

Developing a DNA Search Engine

For this research, the teams at the Institut Pasteur had access to a global database that AWS stores and makes available to the scientific community through its Registry of Open Data programme. This database contains sequencing data for all living species on Earth. However interesting this data may be scientifically, it is still unstructured, making it extremely tedious to explore. The IndexThePlanet project therefore has two stages: first, a "global analysis" of the database to make it readable and usable; second, a search engine that can quickly and efficiently navigate the resulting index. This search engine should be operational by 2026.

"To really understand what is at stake in our work, we need to think of this database as a sort of gigantic library, but one in which all the pages of all the books have been scattered. The challenge for IndexThePlanet is to restore coherence to this data by methodically classifying all the DNA fragments in order to reconstruct them on the scale of a living being, but also taking account of its environment. This is a major undertaking, which should ultimately benefit the entire biological research community," adds the researcher.
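The two stages described above (index first, then search) can be illustrated with a toy sketch. The article does not say how the IndexThePlanet index is built, so the k-mer inverted index below is an assumption: it is a common technique for sequence search, not the project's confirmed method. The sample names, sequences, and the value of `K` are hypothetical.

```python
from collections import defaultdict

K = 4  # toy k-mer length (hypothetical); real tools use much larger values


def kmers(sequence: str, k: int = K):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(samples: dict[str, str]) -> dict[str, set[str]]:
    """Stage 1: map each k-mer to the set of sample IDs that contain it."""
    index = defaultdict(set)
    for sample_id, sequence in samples.items():
        for kmer in kmers(sequence):
            index[kmer].add(sample_id)
    return index


def search(index: dict[str, set[str]], query: str) -> dict[str, float]:
    """Stage 2: score each sample by the fraction of query k-mers it shares."""
    query_kmers = set(kmers(query))
    hits = defaultdict(int)
    for kmer in query_kmers:
        for sample_id in index.get(kmer, ()):
            hits[sample_id] += 1
    return {sid: count / len(query_kmers) for sid, count in hits.items()}


# Hypothetical toy data standing in for sequencing samples.
samples = {
    "sample_A": "ACGTACGTGA",
    "sample_B": "TTTTGGGGCC",
}
index = build_index(samples)
scores = search(index, "ACGTGA")
print(scores)
```

Building the index is the expensive step that is done once; each later query only touches the entries for its own k-mers, which is why a search engine on top of a prebuilt index can respond quickly.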

2.18 Million vCPUs Mobilised

The Institut Pasteur consequently turned to AWS to set up infrastructure capable of handling this massive processing job. "Preparing the operations took almost a year, ultimately resulting in a calculation batch lasting just 30 hours," smiles Rayan Chikhi. "But what a batch! During processing, AWS mobilised considerable resources, which reached 2.18 million vCPUs at peak for the AWS Graviton instances. As a comparison, we reckon that it would have taken a desktop computer nearly 30 million hours, or 3,400 years, to carry out such a calculation."
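The figures quoted above can be cross-checked with a few lines of arithmetic. This is a rough sketch that assumes the "30 million hours" are single-core hours and that all of that work ran within the 30-hour window; the variable names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

total_vcpu_hours = 30_000_000  # compute time quoted in the article
wall_clock_hours = 30          # duration of the batch
peak_vcpus = 2_180_000         # peak cluster size

# Years one core would need working alone (assumes single-core hours).
single_core_years = total_vcpu_hours / HOURS_PER_YEAR

# Average number of vCPUs busy over the run, if the work was spread evenly.
average_vcpus = total_vcpu_hours / wall_clock_hours

print(f"Single-core equivalent: {single_core_years:,.0f} years")
print(f"Average concurrency: {average_vcpus:,.0f} vCPUs")
print(f"Average as share of peak: {average_vcpus / peak_vcpus:.0%}")
```

Under these assumptions the single-core equivalent comes to about 3,425 years, consistent with the rounded 3,400 quoted, and the average concurrency of 1 million vCPUs is a little under half of the 2.18 million peak.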

AWS Technical Support

"To provide the best support for the Institut Pasteur's teams, we called on all the resources available to us," explains Dorian Schaal from Amazon Web Services, who supported the researcher throughout the project. This included scheduling the calculations over the weekend, when capacity is less in demand, because the job required a significant proportion of the available resources. He continued: "The success of this project is something our teams are very proud of, and will help enhance the Open Data database that AWS is making available free of charge to the global scientific community."

Facilitating Tomorrow's Processing

The IndexThePlanet project has resulted in the creation of two datasets: a complete one of 2.2 petabytes and a more compact one of around 400 terabytes, which will serve as the basis for the future genomic search engine. It will provide accurate information on all the viruses and bacteria in the global database. Rayan Chikhi nevertheless tempers this success: "This database is still highly incomplete in terms of terrestrial variety and, despite its success, this research project will only make it possible to increase the share of known viruses from 0.01% to 0.1%. But the progress remains considerable in terms of current knowledge." Ultimately, IndexThePlanet could serve as the basis for a system dedicated to global monitoring of the emergence of pandemics. From the moment a strain is discovered in a hospital, it could be compared with all the genetic material on the planet, saving precious time in the search for treatments and vaccines and potentially saving tens of thousands of lives.

Architecture Diagram

About Institut Pasteur

Founded by Louis Pasteur in 1887, the Institut Pasteur is a world-renowned French biomedical research centre conducting cutting-edge scientific research on infectious diseases and public health.

AWS Services Used

Amazon EC2 Spot Instances

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud and are available at up to a 90% discount compared to On-Demand prices.

AWS Graviton

AWS Graviton is a family of processors designed to deliver the best price performance for your cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2). Choose the AWS Graviton-based instance that best meets your needs.

AWS Batch

AWS Batch is a fully managed batch computing service that plans, schedules, and runs your containerized batch ML, simulation, and analytics workloads across the full range of AWS compute offerings, such as Amazon ECS, Amazon EKS, AWS Fargate, and Spot or On-Demand Instances.

Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance.
