AWS Government, Education, & Nonprofits Blog

How the University of British Columbia uses the cloud to reduce sunflower genomic processing time and research costs with a data lake

The botany department at the University of British Columbia (UBC) and the UBC Data Science Institute are working together to research the evolution and genetic makeup of sunflowers – a critical crop in addressing global food security.

UBC professor Dr. Loren Rieseberg says, “Sunflowers are challenging plants to work with, in part because their genome is large (3.6 billion base pairs), exceeding that of humans by 300 million base pairs, but also because the genomes of different sunflower lines differ in gene content and genome structure. A gene present in one line is frequently missing in another or found in a different place in the genome. Despite its challenging genome, sunflowers are important as a food security crop due to their hardiness and ability to survive extreme heat, which is expected to become more frequent with the climate crisis. We want to provide genomic resources, new seeds, and research that will help make sunflowers more environmentally resilient. We use genomic tools to find the particular alleles that can be helpful in making sunflower cultivars well-suited for particular regions of the world.”

Genomic testing requires large amounts of storage and processing power, which is why the UBC team turned to the cloud.

Moving from high performance computing (HPC) to the Amazon Web Services (AWS) Cloud

UBC migrated its sunflower genomics research pipeline from a 2,048-core SGI mainframe to an Amazon Simple Storage Service (Amazon S3)-based data lake. Dr. Jean-Sébastien Légaré, a postdoctoral fellow at the UBC Data Science Institute, is building a framework for experimental reproducibility. Dr. Légaré says that before migrating to AWS, the team faced functional challenges while conducting research. “We have analyses we need to run, and they require intricate orchestration of the jobs’ replication and distribution of the processes across many servers. We had significant pain points in getting our jobs to run reliably and timely.”

The team was faced with long run times, which stalled analysis and writing of research papers. “Around 12 percent of the jobs were timing out and failing for reasons out of our control. We would have to restart, and we couldn’t really recover from those. The last time we ran the pipeline, we had upgraded the software, and it took about 40 core years to run. We submitted jobs – sometimes 500 at a time – and it could be two weeks before anything even started to run. It was very time consuming. People waited for the analyses downstream and we were unable to provide reliable estimates on time and cost to run the pipeline,” says Dr. Légaré.
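To put the figures in that quote in perspective, here is a rough back-of-the-envelope sketch in Python. It converts the quoted 40 core years into core hours and into an idealized wall-clock time on the 2,048-core machine mentioned earlier. It assumes perfect parallelism and exclusive use of every core, which the team almost certainly did not have on a shared system, so the result is a lower bound rather than a measurement. Applying the 12 percent failure rate to a 500-job batch is likewise only an illustration.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the article.
# Assumptions (not from the source): perfect parallel scaling and exclusive
# access to all cores; real queue waits and failures make things far slower.

HOURS_PER_YEAR = 365.25 * 24  # average year length in hours


def core_years_to_core_hours(core_years: float) -> float:
    """Convert core years into core hours."""
    return core_years * HOURS_PER_YEAR


def ideal_wall_clock_days(core_years: float, cores: int) -> float:
    """Best-case elapsed days if the work spreads evenly over `cores`."""
    return core_years_to_core_hours(core_years) / cores / 24


def expected_failures(jobs: int, failure_rate: float) -> float:
    """Expected number of failed jobs in a batch at a given failure rate."""
    return jobs * failure_rate


if __name__ == "__main__":
    print(f"{core_years_to_core_hours(40):,.0f} core hours")
    print(f"{ideal_wall_clock_days(40, 2048):.1f} days at best on 2,048 cores")
    print(f"{expected_failures(500, 0.12):.0f} failed jobs expected out of 500")
```

Even in this best case the pipeline needs about a week of the entire machine, which helps explain why two-week queue waits and unrecoverable failures were so costly.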

The team now uses Amazon S3, AWS Batch, Amazon Elastic Container Registry, Amazon Elastic Container Service, Amazon FSx, and AWS Lambda, along with Amazon CloudWatch and Amazon EventBridge for monitoring and usage reporting. With these services, Dr. Légaré says, building a framework for reproducibility of scientific experiments is now a simpler task.
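The article does not describe how the team configures these services. As a minimal sketch of how a large batch of similar jobs might be expressed for AWS Batch, the Python below builds the request parameters for a single array job with automatic retries and a per-attempt timeout. The queue name, job definition name, and default values are hypothetical placeholders; only the parameter names (`jobName`, `jobQueue`, `jobDefinition`, `arrayProperties`, `retryStrategy`, `timeout`) come from the AWS Batch `SubmitJob` API.

```python
# Sketch only: builds the keyword arguments for an AWS Batch SubmitJob call.
# All names and default values here are hypothetical, not the UBC setup.


def build_array_job_request(
    job_name: str,
    job_queue: str,
    job_definition: str,
    num_tasks: int,
    max_attempts: int = 3,
    timeout_seconds: int = 6 * 3600,
) -> dict:
    """Describe one array job that fans out into `num_tasks` child jobs."""
    if not 2 <= num_tasks <= 10_000:
        raise ValueError("AWS Batch array size must be between 2 and 10,000")
    if not 1 <= max_attempts <= 10:
        raise ValueError("AWS Batch retry attempts must be between 1 and 10")
    if timeout_seconds < 60:
        raise ValueError("AWS Batch timeouts must be at least 60 seconds")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Each child job reads its index from AWS_BATCH_JOB_ARRAY_INDEX.
        "arrayProperties": {"size": num_tasks},
        # Failed attempts are retried automatically up to this limit.
        "retryStrategy": {"attempts": max_attempts},
        # Attempts that run longer than this are terminated.
        "timeout": {"attemptDurationSeconds": timeout_seconds},
    }


if __name__ == "__main__":
    request = build_array_job_request(
        job_name="sunflower-variant-calling",   # hypothetical
        job_queue="genomics-queue",             # hypothetical
        job_definition="variant-caller:1",      # hypothetical
        num_tasks=500,
    )
    print(request)
```

With the AWS SDK for Python installed and credentials configured, a dictionary like this could be passed as `boto3.client("batch").submit_job(**request)`. The retry strategy is the relevant design choice here: a failed attempt is rerun by the service rather than requiring a manual restart.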

“Before, working with files was difficult, but everything in AWS has a URL and an associated ID. It’s easier for me to manipulate the data. Not having to worry about where the files are and having it all in one data lake is helpful. The AWS Cloud has changed the way we can run these experiments. Everything appears to be logically in the same location, ready to access.” Dr. Légaré says there are now over 100 terabytes of data in the lab’s data lake.

With AWS, the team improved insight into their own research. “We can specify what data we want, and predict how long it will take to compute, based on similar datasets. A major benefit is that we can translate these resource requirements into a precise cost equation. We can adjust our parameters and redo our query to see how long a new job will take. Accuracy in prediction can only be achieved with a reliable compute platform – and one that can handle jobs of any size,” says Dr. Légaré. “Trying out new genomics tools and filtering parameters used to be a constant ordeal that would require weeks of turnaround time. Now we can spin up hundreds of jobs on-demand within minutes.”
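The article does not give the team's actual cost equation. The Python sketch below shows one simple form such an estimate could take: scale the vCPU hours of a similar past run by input size, then multiply by a price per vCPU hour. The linear scaling, the `PastRun` record, and every number in the example are assumptions made for illustration, not UBC data or AWS pricing.

```python
# Illustrative cost estimate: predict compute from a similar past run, then
# price it. Linear scaling with input size is an assumption; real genomics
# tools may scale differently, and all figures below are made up.
from dataclasses import dataclass


@dataclass(frozen=True)
class PastRun:
    """A completed run used as the reference point (hypothetical record)."""
    input_gb: float
    vcpus: int
    wall_hours: float


def predict_vcpu_hours(reference: PastRun, new_input_gb: float) -> float:
    """Scale the reference run's vCPU hours linearly by input size."""
    vcpu_hours_per_gb = reference.vcpus * reference.wall_hours / reference.input_gb
    return vcpu_hours_per_gb * new_input_gb


def estimate_cost(vcpu_hours: float, price_per_vcpu_hour: float) -> float:
    """Compute cost as vCPU hours times a flat hourly price per vCPU."""
    return vcpu_hours * price_per_vcpu_hour


if __name__ == "__main__":
    reference = PastRun(input_gb=100, vcpus=16, wall_hours=5)  # made-up run
    vcpu_hours = predict_vcpu_hours(reference, new_input_gb=250)
    cost = estimate_cost(vcpu_hours, price_per_vcpu_hour=0.05)  # made-up price
    print(f"{vcpu_hours:.0f} vCPU hours, estimated cost ${cost:.2f}")
```

Changing a filtering parameter or the input size and rerunning the estimate mirrors the "adjust our parameters and redo our query" workflow described in the quote. As the quote notes, the prediction is only as good as the reliability of the platform the reference run was measured on.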

With a shorter time to science and lower costs, the UBC researchers will continue their inquiry into how sunflowers’ alleles will work in different cultivated backgrounds around the world.

Learn more about healthcare and life sciences on AWS and genomics in the cloud on AWS. Read more healthcare stories on the AWS Public Sector Blog.