AWS Pioneers Project

European innovation, told by those who built it

Institut Pasteur building the search engine for all of Earth’s life

During the global Covid-19 pandemic, scientists set out on what Rayan Chikhi, researcher and group leader from the Institut Pasteur, calls “a wildly ambitious project”.

They wanted to identify not just new species of coronavirus but also other RNA (ribonucleic acid) viruses, in a huge piece of research that became known as the Serratus Project. Once the pandemic ended, that work gave scientists at the Institut Pasteur another idea.

“We realised that using the same techniques, we could help not just the virology community but the entirety of biology. We could analyse all of life sequencing data, focusing not just on viruses but on micro-organisms, humans, insects, animals. There’s a large amount of genetic data but it’s not searchable. This aims to make it searchable,” explains Chikhi.

Backed by the European Union, the project was suitably named IndexThePlanet.

Meet Rayan Chikhi

Researcher, Institut Pasteur

Two stages

“To date, only 0.01% of existing viruses have been identified, and lurking in the unknowns may lie the culprit for a future pandemic.”

To map the DNA of all living organisms is a huge job – the volume of data involved is equivalent to everything uploaded to YouTube during its first decade, and five times greater than that processed for Serratus.

AWS had already helped Serratus, storing nearly 20 petabytes of data in a public and open repository of collected DNA samples called The Sequence Read Archive (SRA). Every time a scientist collects leaf litter from the Amazon rainforest or elephant seal dung from the Antarctic, it goes in the public domain through the SRA database.

IndexThePlanet has two stages: the first is to make the open DNA data already stored on SRA more readable and usable, and the second is to turn it into a giant DNA search engine, which is available today at https://logan-search.org. AWS provided “extensive assistance, technical, administrative, and financial support,” says Chikhi.

“This bold project was enabled by two things. One, the immense amount of data, millions of gigabytes which was recently moved to AWS. We then needed the resources to access this data, which became available via two large grants, one given by the European Research Council, the other by AWS.”

This project was among the early initiatives where AWS supported biological research at such a large scale, involving immense computational power.

“The amount of data that needed to be processed was 20 petabytes in size. Bear in mind that a single computer can hold a terabyte of data. If we had a single desktop computer to do this analysis, it would have required about 3,400 years to do it. But because we had 70,000 computers to use, totaling two million processors, we could do the analysis in 30 hours.”
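The figures in that quote can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-envelope reading, not a description of how the analysis was actually run: it assumes decimal units (1 petabyte = 1e15 bytes), a 365.25-day year, and that the single desktop is the serial baseline. The function names are illustrative.

```python
# Back-of-envelope check of the figures quoted above.
# Assumptions (not from the article): decimal units (1 PB = 1e15 bytes),
# a 365.25-day year, and the single desktop as the serial baseline.

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def speedup(serial_years: float, parallel_hours: float) -> float:
    """Ratio of the single-desktop runtime to the cluster runtime."""
    return serial_years * HOURS_PER_YEAR / parallel_hours


def cluster_stats(data_pb=20, serial_years=3400, machines=70_000,
                  processors=2_000_000, parallel_hours=30):
    """Derive rough per-machine and aggregate figures from the quoted numbers."""
    data_bytes = data_pb * 1e15
    seconds = parallel_hours * 3600
    return {
        "speedup": speedup(serial_years, parallel_hours),
        "processors_per_machine": processors / machines,
        "aggregate_gb_per_s": data_bytes / seconds / 1e9,
    }


if __name__ == "__main__":
    for name, value in cluster_stats().items():
        print(f"{name}: {value:,.1f}")
```

On these assumptions, going from 3,400 years to 30 hours is a speedup of roughly a million, the fleet averages about 29 processors per machine, and the aggregate read rate works out to around 185 gigabytes per second.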

Pioneering project

It was, says Chikhi, “uncharted territory”.

“This project ushers in a new era where genetic data, cloud technology, and later AI, are revolutionizing biology and global public health.”

“I am pretty confident that few large-scale computer initiatives of this magnitude have been carried out worldwide in biology.”

One of the metrics of success will be the discovery of new viruses, but the database is also a way to preserve the genetic heritage of Earth and make it accessible to all biology labs. The project has been immense in scale and ambition, but none of it would have been possible without the partnership with AWS, whose vast computing power is, says Chikhi, “transforming biology”.

So far there are two datasets: a complete one of 2.2 petabytes and a more compact one of around 400 terabytes, which will serve as the basis for the future genomic search engine. Ultimately, IndexThePlanet could serve as the basis for a system dedicated to global monitoring of the emergence of pandemics. From the moment a strain is discovered in a hospital, it could be compared with all the genetic material on the planet, saving precious time in the search for treatments and vaccines and potentially saving tens of thousands of lives.

“Projects that process immense amounts of data can be done outside of traditional biology labs. This open data will also be an incredible resource for all AI researchers worldwide. In some ways it is the new age of biology,” says Chikhi.

