DNAnexus & Amazon Web Services (AWS) Power Technology Behind UK Biobank Research Analysis Platform

Executive Summary

Researchers from around the world needed to be able to securely access the UK Biobank, a petabyte-sized biomedical database and research resource. AWS partner DNAnexus leveraged Amazon S3 and Amazon EC2 to build and operate a scalable platform that enables approved users to securely view and analyze “soft copies” of the files in a virtual environment. This ensured the security of health data and democratized access for researchers who lack their own storage and analysis infrastructure.

Understanding different factors

To understand and treat complex illnesses such as type 2 diabetes, cancer, and Alzheimer’s disease, scientists need to understand the relationship between genetic, environmental, and lifestyle factors over time. Longitudinal data of this nature is extremely difficult to amass, which is why the global scientific community stands to benefit greatly from a collaborative, large-scale biomedical dataset and research resource known as UK Biobank.

According to a 2019 study on dementia, with data from 196,383 UK Biobank participants, following a healthy lifestyle may reduce the risk of dementia, regardless of the genetic risk. The results showed that interventions could offset the genetic risk for dementia. A 2018 study on 472,000 UK Biobank participants between the ages of 40 and 69 concluded that smoking, diabetes, and high blood pressure increase the risk of heart attack more in women than in men. In women, high blood pressure was associated with an 80 percent higher risk than in men. Among type 1 diabetes patients, women’s risk of heart attack was almost three times higher than men’s, while in type 2 diabetes patients, women’s risk was 47 percent higher.

Between 2006 and 2010, UK Biobank recruited 500,000 volunteers from across the United Kingdom. Each provided detailed lifestyle information and physical measures, as well as blood, urine, and saliva samples to be stored for future analysis. UK Biobank set up ongoing data collection, coupled with the integration of electronic health records, that has generated tens of thousands of data points for each participant. Full genotyping data was added in 2017, and whole-genome sequencing data from all 500,000 participants will be made publicly available in early 2023 (the sequencing component was recently completed). UK Biobank anticipates its database will exceed 40 petabytes of data by 2025.

The collective aim of this wide-scale data collection is to help approved researchers from around the world better understand, prevent, and treat a wide range of diseases. But a dataset of this size and complexity creates an unprecedented data management challenge. That’s where DNAnexus comes in. A long-term AWS Life Sciences Competency Partner, DNAnexus was founded in 2009 with a mission to help scientific researchers securely access, analyze, and operationalize complex biomedical data. Its scalable platform fosters collaboration and enables users to analyze multiple data types together, including genomic and clinical data. This is a crucial feature for researchers working to decipher complex diseases.

“The key challenge was bringing the data together in a single place so that researchers could analyze millions of metrics across the breadth of data types including genetics, lifestyle, and imaging, all without data replication,” said Asha Collins, general manager of Biobanks at DNAnexus. “Just as importantly, we had to address how we could provide the necessary compute and data storage to enable researchers to really work with this massive dataset with ease.”

In 2020, DNAnexus and AWS began a three-year collaboration with UK Biobank to democratize access to the data. Together, they replaced costly and time-intensive data downloads with an innovative cloud-based Research Analysis Platform (RAP) that enables researchers to access and analyze the entire UK Biobank database securely from anywhere in the world. Along with the initial development, UK Biobank understood that success hinged on the platform’s ability to manage increasing quantities of data and provide analysis tools in a centralized environment.

Sharing “soft copies”

Researchers initially accessed UK Biobank files via custom data delivery systems, which packaged the early tabular data for researchers to download and analyze in their own environments. But as more data became available and a wider pool of researchers requested access, the individual approach became untenable. By late 2021, more than 28,000 academic and industry scientists from more than 90 countries had been approved to access the UK Biobank database and research resource.

“We’re now getting to this scale where it’s just not efficient or cost-effective for all of these groups to maintain multiple copies of data all around the world,” said Mark Effingham, deputy CEO at UK Biobank. “We needed to take a different approach, where we could bring our approved researchers to an environment where they can use the data.”

DNAnexus created a secure alternative that reduced the infrastructure and cost burden placed on UK Biobank’s users. A single version of the data is stored using Amazon Simple Storage Service (Amazon S3), a scalable cloud-based infrastructure that can support and keep pace with UK Biobank’s continued growth.

The platform intelligently provides the data to the researchers, minimizing data duplication. Researchers do not have direct access to these files. Instead, they operate through a virtual environment that provides “soft copies” of the data subsets that they’re approved to access.
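
The idea of a “soft copy” can be illustrated with a minimal sketch, assuming a simple catalog that maps data fields to stored objects. The class, function, field names, and storage keys below are all hypothetical and are not taken from the DNAnexus platform; the point is only that a per-researcher view can hold references to a single stored copy rather than duplicating it.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class StoredObject:
    """Reference to the single stored copy of a file (hypothetical layout)."""
    key: str          # assumed storage key, for illustration only
    size_bytes: int


def build_view(catalog: Dict[str, StoredObject],
               approved_fields: FrozenSet[str]) -> Dict[str, StoredObject]:
    """Return references to the approved objects only; no data is copied."""
    return {name: obj for name, obj in catalog.items() if name in approved_fields}


# Hypothetical catalog with one stored copy of each data type.
catalog = {
    "genotypes": StoredObject(key="bulk/genotypes.bin", size_bytes=10**12),
    "imaging": StoredObject(key="bulk/imaging.bin", size_bytes=5 * 10**12),
    "lifestyle": StoredObject(key="tabular/lifestyle.csv", size_bytes=10**9),
}

# A researcher approved for two of the three fields sees only those two.
view = build_view(catalog, frozenset({"genotypes", "lifestyle"}))
print(sorted(view))
```

Because the view stores references, adding another researcher adds a small mapping rather than another copy of the underlying files.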

The collaboration also leverages Amazon Elastic Compute Cloud (Amazon EC2), a service that provides secure, resizable compute capacity in the cloud. Using Amazon EC2, DNAnexus delivers a flexible, scalable platform where researchers are only charged when they run analyses. The platform can also leverage Amazon EC2 Spot Instances, which are available at up to a 90 percent discount compared to On-Demand pricing, so even the very largest jobs can be run economically.
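
As a rough illustration of the pricing arithmetic, the sketch below estimates the compute cost of a job that is billed only while it runs. The function name, the job size, and the hourly rate are assumptions chosen for the example; only the “up to 90 percent” ceiling comes from the text above, and actual Spot discounts vary.

```python
def estimated_job_cost(instance_hours: float,
                       on_demand_rate: float,
                       spot_discount: float = 0.0) -> float:
    """Estimate compute cost for a job billed only while it runs.

    spot_discount is the fraction saved relative to On-Demand pricing:
    0.0 means On-Demand, and 0.9 is the maximum discount quoted in the text.
    """
    if not 0.0 <= spot_discount <= 0.9:
        raise ValueError("spot_discount must be between 0.0 and 0.9")
    return round(instance_hours * on_demand_rate * (1.0 - spot_discount), 2)


# Hypothetical job: 1,000 instance-hours at an assumed rate of $2.00 per hour.
on_demand = estimated_job_cost(1000, 2.00)
best_case_spot = estimated_job_cost(1000, 2.00, spot_discount=0.9)

print(f"On-Demand: ${on_demand:,.2f}")
print(f"Spot at the maximum quoted discount: ${best_case_spot:,.2f}")
```

Under these assumed numbers, the same job costs one tenth as much at the maximum quoted discount, which is why very large analyses can remain affordable.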

“Working with DNAnexus and AWS on this platform creates an area where researchers can not only engage and run their own data analyses, but they can also cost-effectively use scalable cloud infrastructure, compute, and storage to actually support those analyses wherever they’re working from,” said Effingham. “We are proud to provide a research platform that maximizes the value of the data and democratizes access for all researchers around the world.”

Secure access through pseudonymization

Sharing insights into half a million participants with linked health records is challenging from a data privacy perspective. To protect this data—while preserving the value of the many interconnected biomedical data points—DNAnexus developed a system of pseudonymization.

“It enables us to keep one copy of the data behind the scenes, which realizes significant cost savings,” said Collins. “That data is appropriately pseudonymized and ‘soft copied’ into a virtual area where they see exactly the files and tabular fields that they've been approved for, with appropriate changes in the file names.”

UK Biobank’s enhanced security measures require that every researcher receive a slightly different copy of the data: participant IDs are pseudonymized separately for each researcher. Because those IDs are embedded in both the file names and the file contents, DNAnexus built out pseudonymization support that covers both. Using the “soft copies” described above, along with secure download mechanisms, the platform met these challenging requirements for thousands of researchers without duplicating any of the data.
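
One common way to produce stable, per-researcher pseudonyms is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library as an assumed stand-in; the source does not say which technique DNAnexus uses. The function names, key values, ID format, and file name are all hypothetical.

```python
import hashlib
import hmac
from typing import Tuple


def pseudonymize_id(participant_id: str, researcher_key: bytes) -> str:
    """Map a participant ID to a stable pseudonym specific to one researcher.

    The same ID and key always give the same pseudonym; a different key
    gives an unrelated pseudonym, so copies cannot be joined across researchers.
    """
    digest = hmac.new(researcher_key, participant_id.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return "P" + digest[:16]  # truncated for readability (illustrative choice)


def pseudonymize_file(file_name: str, content: str, participant_id: str,
                      researcher_key: bytes) -> Tuple[str, str]:
    """Replace the participant ID in both the file name and the content."""
    pseudonym = pseudonymize_id(participant_id, researcher_key)
    return (file_name.replace(participant_id, pseudonym),
            content.replace(participant_id, pseudonym))


# Hypothetical per-researcher secrets and a hypothetical participant file.
key_a = b"secret-for-researcher-a"
key_b = b"secret-for-researcher-b"
original_name = "1234567_genotype.txt"
original_body = "participant=1234567\n"

name_a, body_a = pseudonymize_file(original_name, original_body, "1234567", key_a)
name_b, body_b = pseudonymize_file(original_name, original_body, "1234567", key_b)

print(name_a)
print(name_b)
```

In this sketch each researcher would see a different file name and ID for the same participant, while the original file is stored only once. A production system would also need key management and on-the-fly rewriting of large files, which are outside the scope of this example.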

DNAnexus developed this functionality to address the increasing need for platforms that can mediate secure access to multi-omics population datasets, which continue to grow.

The UK Biobank database has already proven to be a powerful resource for the global research community, powering new scientific discoveries that could improve public health. The Research Analysis Platform has the potential to increase the speed and scale of scientific discoveries and democratize access, enabling approved researchers to bring their own analyses to the data from anywhere in the world to advance understanding of human disease. In addition, the RAP resolves the complexity associated with integrating and harmonizing genomics and clinical data. It also facilitates greater collaboration between researchers by enabling users to analyze multiple data types and work on the same research project within the cloud-based platform. This success will likely fuel further growth, reinforcing UK Biobank’s choice to collaborate with partners like DNAnexus and AWS that are known for their scalable, agile solutions.

About the Customer

UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants. The database is regularly augmented with additional data and is globally accessible to approved researchers undertaking vital research into the most common and life-threatening diseases. It is a major contributor to the advancement of modern medicine and treatment and has enabled several scientific discoveries that improve human health.

About DNAnexus

DNAnexus has established a secure, trusted cloud platform for accessing, analyzing, and translating the world’s biomedical data—powering a scientific community that generates life-changing breakthroughs in healthcare and life sciences.

Published May 2022