CSIRO Prepares for the Era of “Mega-Biobank” Analytics on AWS


To unlock the causes of complex diseases, the global scientific community needs big data. In recent years, this reality has driven the development of massive databases and biobanks, with rapidly expanding cohorts and more data points than ever before. But understanding individual nuances, such as gene-to-gene interactions, at population scale increasingly overwhelms existing bioinformatics systems.

For Denis Bauer, PhD, group lead at Australia’s national science agency, the Commonwealth Scientific and Industrial Research Organisation (CSIRO), the solution is not simply more compute power; it is taking advantage of innovative cloud-based technologies made possible on Amazon Web Services (AWS).

“In the era of mega-biobanks, we need to rethink how we analyze the data. The vast amount of genomic data is just too big and too complex to be moved around. Instead, we can harness AWS to bring the analysis to the data, ensuring security without compromising scalability or speed,” says Bauer, who is also an AWS Hero.

With this new paradigm in mind, CSIRO’s national digital health research program, the Australian e-Health Research Centre, developed VariantSpark on AWS to provide a scalable, secure platform to support today’s research and future growth. 

AWS Healthcare & Life Sciences Virtual Symposium 2021: CSIRO

Developing for the Future with Linear Growth

VariantSpark is a machine learning library specifically designed for the analysis of genomic data, supporting gene detection for polygenic diseases. In 2019, VariantSpark was released on the AWS Marketplace—an industry first—to provide widespread access to the genomics community. The availability of VariantSpark on AWS Marketplace enables researchers to take the analytical software to the data, not the other way around, resulting in a more secure data environment.  

“When we talk about population-scale datasets, where genomic data sequencing might be accessible to every child, there's a physical limitation to handling that kind of database. You need to apply an economy of scale, and building on AWS helps us perform now and scale for the future,” says Bauer. 

To process and analyze the vast amounts of data, VariantSpark uses Amazon EMR, a managed cluster platform that simplifies the running of big data frameworks, including Apache Spark. It also uses Amazon Elastic Compute Cloud (Amazon EC2) for secure, resizable compute capacity in the cloud. Importantly, VariantSpark scales linearly, not exponentially, as data volumes grow. With today’s datasets, its analysis times are 3.6 times faster than other big data solutions. As the field moves towards the analysis of one trillion data points, Bauer calculates VariantSpark will run in 15 hours what other technologies would need 100,000 years to compute.
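To put the projected gap in perspective, the two figures Bauer cites can be converted to the same unit and compared. The short calculation below is only an illustration of that arithmetic; it assumes an average year of 365.25 days and does not model how either system actually scales.

```python
# Compare the two runtimes quoted for a one-trillion-data-point analysis.
# Assumption: one year = 365.25 days (average Julian year).
HOURS_PER_YEAR = 365.25 * 24

variantspark_hours = 15                    # figure quoted for VariantSpark
other_hours = 100_000 * HOURS_PER_YEAR     # 100,000 years expressed in hours

# How many times longer the other technologies would take.
speedup = other_hours / variantspark_hours

if __name__ == "__main__":
    print(f"Roughly {speedup:,.0f} times faster")
```

On these numbers the difference is on the order of tens of millions of times, which is the kind of gap that separates linear from exponential growth at this scale.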

To protect the genomic information, Amazon EMR brings in encrypted data directly from Amazon Simple Storage Service (Amazon S3). Decryption occurs exclusively on the individual compute nodes using the AWS Key Management Service (AWS KMS).

“VariantSpark went through the full process of security and systems hardening for AWS in order to get to the marketplace,” explains Bauer. “That’s really valuable, as clients can be confident that the products in the AWS Marketplace adhere to international security standards for dealing with personal health data.”

Going Serverless on AWS

The need for scalable solutions becomes increasingly pressing for applications that require the data to be shared globally. The Global Alliance for Genomics and Health (GA4GH) Beacon is one of the most widely adopted genomic data exchange protocols. According to Bauer, Beacon remains hugely valuable but relies on data being served from a centralized database, so costs increase dramatically as variant numbers and cohort sizes grow.

To maintain access and interoperability at an increased scale, CSIRO developed Serverless Beacon, or sBeacon. Built on AWS Lambda, it draws on the breadth and depth of AWS serverless solutions to support the seamless scaling of compute resources. sBeacon also supports parallelization, another critical component when managing massive datasets.

“For some applications, we do need to have those truly massive Amazon EC2 instances and Amazon EMR clusters where everything is about one entity and you need to parallelize in a traditional way. But if the aim is to deliver something cost-effective and in real time, I think serverless is the solution, as a sort of cost-effective high-performance computing. You can apply a similar parallelization idea to the serverless system,” noted Bauer.
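The parallelization idea Bauer describes can be pictured as a fan-out: a query is sent to many small workers at once, each searching only its own slice of the data, and the answers are combined. The sketch below is a local illustration only, not sBeacon’s actual implementation. Threads stand in for serverless function invocations, and the shard contents and the names `SHARDS`, `query_shard`, and `beacon_query` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "shards" of (chromosome, position, ref, alt) variants.
# In a real deployment each shard would be a slice of variant data in storage,
# searched by its own serverless function invocation.
SHARDS = [
    {("1", 1000, "A", "G"), ("1", 2000, "C", "T")},
    {("2", 1500, "G", "A")},
    {("X", 3000, "T", "C"), ("1", 5000, "G", "C")},
]


def query_shard(shard, variant):
    """Stand-in for one function invocation: is the variant in this shard?"""
    return variant in shard


def beacon_query(variant, shards=SHARDS):
    """Fan the query out to every shard in parallel, then combine the answers."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        hits = list(pool.map(lambda shard: query_shard(shard, variant), shards))
    # A Beacon-style answer reveals only whether the variant was observed.
    return {"exists": any(hits), "shards_searched": len(hits)}


if __name__ == "__main__":
    print(beacon_query(("2", 1500, "G", "A")))
    print(beacon_query(("3", 42, "A", "T")))
```

Because each worker sees only one shard, adding data means adding shards and workers rather than growing a single central database, which is the property the serverless design relies on.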

Bauer added that serverless technology unlocks a range of other cloud computing benefits, including the flexibility that comes from standardized analytics components. In addition, because it is built on AWS serverless technology, sBeacon doesn’t require the data to be consolidated into one massive database. This unlocks the growth potential needed for the large-scale datasets of the future, which may involve as many as 30 quadrillion data points.

“sBeacon allows individual researchers to share information with the world, without having to hand over their full dataset. It also allows dynamic patient consent to be handled more seamlessly as data does not need to be replicated in a database,” explains Bauer. 

Both VariantSpark and sBeacon provide value in a range of applications, from genomic surveillance of SARS-CoV-2 to the search for emerging antimicrobial resistance (AMR) markers. As more data become available, the utility of these frameworks will grow, and their cloud-based modularity will make them building blocks for third-party industry and research applications.

Learn More

See how AWS is working with other genomics organizations to drive discovery and accelerate innovations.


Founded in 1916, the Commonwealth Scientific and Industrial Research Organisation (CSIRO) is an Australian government agency that works to industrialize technological inventions, including new tools to advance genomic data access and analysis.  

Benefits of AWS

  • Delivers analysis times 3.6x faster than other big data solutions using Amazon EMR
  • Unlocks growth potential needed for large-scale datasets (up to 30 quadrillion data points)
  • Democratizes access to a novel tool using AWS Marketplace
  • Lowers cost of genomic data analysis using serverless technology

AWS Services Used

Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.

Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

AWS Lambda

AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. 

Amazon EMR

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
