AWS Public Sector Blog

Computational Genomics on AWS Lambda with the Genome Institute of Singapore

Propelled by current global efforts, such as personalized medicine and cancer research, genomics has become an essential tool for modern medicine and biology.

Computational genomics deals with the interpretation of genomics data, which is growing exponentially due to quickly evolving next-generation sequencing technologies. Input data for a single sample can easily reach hundreds of gigabytes, and studies now require the analysis of thousands of such samples. A typical analysis requires running complex workflows with substantial disk-space, input/output, CPU, and memory demands, usually executed in expensive high-performance computing (HPC) environments.

By leveraging the inherent data-parallel nature of computational genomics, AWS Lambda functions can be used to analyze large amounts of genomics data efficiently.

At our upcoming AWS Public Sector Summit in Singapore, the Genome Institute of Singapore (GIS) will explore the use of pure serverless architectures for analyzing big genomics data. Read what Andreas Wilm of the Genome Institute of Singapore has to say about using AWS Lambda for computational genomics.

Serverless platforms such as AWS Lambda offer an inexpensive execution environment (originally meant for running microservices) that is limited in disk space, memory, processing power, and runtime (a maximum of five minutes). As such, AWS Lambda doesn’t seem an intuitive fit for big data processing in genomics. However, by exploiting the data-parallel nature of genomics datasets, we can run analyses at scale and cost-efficiently with AWS Lambda. Here, we showcase two applications:

  1. We established a proof of concept by porting a typically resource-hungry analysis step to AWS Lambda: variant calling on a human genome, a process that can take hours in multi-core environments. Our starting point was a preprocessed genome stored on Amazon Simple Storage Service (Amazon S3), containing roughly 100 billion data points. Using only AWS Lambda functions, we were able to complete the analysis in less than 15 minutes at a cost of roughly 10 US cents (assuming the user had already exhausted the 1 million free AWS Lambda requests, which are renewed every month).
  2. Furthermore, we implemented an entire analysis workflow for bacterial samples on AWS Lambda. The analysis is started by simply uploading the input files, and subsequent steps are automatically triggered and run in a highly parallel fashion.
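
To make the data-parallel idea behind both applications concrete, here is a minimal Python sketch. It is not the GIS implementation. It only illustrates one plausible pattern: an upload to Amazon S3 triggers a function, which splits the genome into small regions and invokes one asynchronous AWS Lambda call per region. The function names, the payload fields, the contig lengths, and the chunk size are all assumptions made for illustration. The Lambda client is passed in as a parameter (for example, one created with `boto3.client("lambda")`), and the event layout follows the standard S3 notification format.

```python
import json
from typing import Dict, Iterator, List, Tuple
from urllib.parse import unquote_plus

# A region is (contig, start, end), 0-based and half-open. Hypothetical convention.
Region = Tuple[str, int, int]


def split_into_regions(contig_lengths: Dict[str, int], chunk_size: int) -> Iterator[Region]:
    """Yield fixed-size regions that together cover every contig exactly once."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            yield (contig, start, min(start + chunk_size, length))


def build_payloads(bucket: str, key: str,
                   contig_lengths: Dict[str, int], chunk_size: int) -> List[dict]:
    """Build one small, independent work item per region (field names are assumed)."""
    return [
        {"bucket": bucket, "key": key, "contig": contig, "start": start, "end": end}
        for contig, start, end in split_into_regions(contig_lengths, chunk_size)
    ]


def parse_s3_event(event: dict) -> Tuple[str, str]:
    """Extract bucket and object key from the first record of an S3 notification."""
    record = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 event notifications.
    return record["bucket"]["name"], unquote_plus(record["object"]["key"])


def fan_out(payloads: List[dict], function_name: str, lambda_client) -> int:
    """Invoke the worker function once per payload, asynchronously.

    `lambda_client` is assumed to behave like boto3.client("lambda");
    `function_name` is a hypothetical worker function.
    """
    for payload in payloads:
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous: do not wait for the result
            Payload=json.dumps(payload).encode("utf-8"),
        )
    return len(payloads)
```

Each worker would then read only its own region from Amazon S3, which is what keeps every invocation within the disk, memory, and runtime limits mentioned above. The chunk size is the main tuning knob in this sketch: smaller regions mean shorter but more numerous invocations.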

Learn more about how GIS uses AWS Lambda at the AWS Public Sector Summit in Singapore.