The Center for Translational Data Science at the University of Chicago

The University of Chicago Manages Biomedical Data Resources at Scale on AWS

2021

At the University of Chicago, building and managing large-scale data resources to facilitate research in biology, medicine, healthcare, and the environment is a complex undertaking. However, using the cloud infrastructure of Amazon Web Services (AWS), the university’s Center for Translational Data Science can continually innovate with data and support biomedical research for partners such as the National Institutes of Health.

Translational data science applies data science principles to solve scientific problems and benefit society. Robert L. Grossman is a professor of medicine and computer science and the director of the Center for Translational Data Science at the University of Chicago. “We are interested in examining and understanding the data, including how it can be used to improve health outcomes and how we can analyze it with machine learning and artificial intelligence to make discoveries,” says Grossman.

To improve data workspaces and power data analysis, the Center for Translational Data Science relies on AWS to help it store and share biomedical datasets reliably and securely at scale. Using AWS services for functions that include storage, compute, and data management, Grossman and the team deliver secure, durable data for biomedical and genomics research.


“The data services developed by the center are now called Data Commons Framework Services and are hosted on AWS. They make all the data findable, accessible, interoperable, and reusable in accordance with the FAIR data principles.”

Robert L. Grossman
Professor of Medicine and Computer Science and Director of the Center for Translational Data Science, University of Chicago

Building Data Commons at Scale

The Center for Translational Data Science at the University of Chicago builds cloud-based environments to manage, analyze, and share data, and it builds artificial intelligence/machine learning models to extract insights. The environments include data commons, which support sharing and analyzing datasets, and data ecosystems, which link multiple data commons, data repositories, workspaces, and other computing environments through a common set of software services. The framework supports the BloodPAC Data Commons, operated by the Blood Profiling Atlas in Cancer (BloodPAC) Consortium, which is used by more than 50 universities, companies, and government agencies from the consortium to study liquid biopsy data and its role in detecting and treating cancer.

The Center for Translational Data Science has been using AWS services to support its data environments since 2011. Between 2010 and 2014, the center developed a secure cloud-based environment for working with sensitive petabyte-scale biomedical data. Then, between 2014 and 2016, the center developed a second-generation environment with increased data size, number of users, and required functionality. In 2016, the center began developing its Gen3 software, which has now been used to build more than 15 new data commons. “As data size and the number of datasets grew, it became simpler to use cloud-based environments such as AWS to make the data available and to analyze it,” says Grossman.

Operating Data Environments Reliably on AWS

To set up and operate data commons reliably and to innovate in data analysis, the Center for Translational Data Science relies on Amazon Simple Storage Service (Amazon S3), an object storage service that offers industry-leading scalability, data availability, security, and performance. For compute, the center uses Amazon Elastic Compute Cloud (Amazon EC2), which provides secure, resizable compute capacity in the cloud.

For database management, the center uses Amazon Relational Database Service (Amazon RDS), which makes it simple to set up, operate, and scale a relational database in the cloud. The center also relies on AWS for networking and utility services and uses Amazon Elastic Kubernetes Service (Amazon EKS), a managed container service for running and scaling Kubernetes applications in the cloud or on premises.

To simplify working with petabytes of data in its commons, the Center for Translational Data Science developed data services so that data and the associated metadata could be managed with persistent digital identifiers that don’t change over time and don’t refer to the actual physical location of the data in the cloud. “The data services developed by the center are now called Data Commons Framework Services and are hosted on AWS,” says Grossman. “They make all the data findable, accessible, interoperable, and reusable in accordance with the FAIR data principles.”
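The idea behind such identifier services can be sketched in a few lines: a registry mints a persistent GUID for each data object and maps it to one or more physical locations, so data can migrate between buckets without breaking references. This is a minimal, hypothetical illustration of the pattern, not the actual Data Commons Framework Services or Gen3 API; all names here are invented for the sketch.

```python
import uuid

class IdentifierRegistry:
    """Hypothetical sketch: map persistent GUIDs to mutable physical locations."""

    def __init__(self):
        self._records = {}

    def register(self, urls, checksum, size):
        """Mint a persistent GUID for a data object and record its metadata."""
        guid = str(uuid.uuid4())
        self._records[guid] = {"urls": list(urls), "checksum": checksum, "size": size}
        return guid

    def resolve(self, guid):
        """Return the current physical locations; the GUID itself never changes."""
        return self._records[guid]["urls"]

    def relocate(self, guid, new_urls):
        """Update locations (e.g. after migrating buckets) without a new GUID."""
        self._records[guid]["urls"] = list(new_urls)

registry = IdentifierRegistry()
guid = registry.register(
    urls=["s3://commons-bucket-a/sample.bam"],
    checksum="md5:0123abcd",
    size=1_048_576,
)
# Data moves to a new bucket; the identifier stays stable.
registry.relocate(guid, ["s3://commons-bucket-b/sample.bam"])
print(registry.resolve(guid))
```

Because downstream analyses cite only the GUID, storage can be reorganized freely, which is what makes the data findable and accessible over time in the FAIR sense.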

Supporting Data Access and Analysis

In April 2020, the center set up the Pandemic Response Commons using support from AWS Professional Services, a global team that helps users realize desired business outcomes using the cloud. The Pandemic Response Commons is designed to help regions aggregate data related to the COVID-19 pandemic, and it provides flexible data workspaces that accommodate different pay models and security requirements. In particular, the Pandemic Response Commons can host clinical data required by COVID-19 researchers.

Using AWS infrastructure, Grossman’s team can scale compute and gradually transition from hardware to software infrastructure. “In the Gen3 environment, we have more than 10 PB of publicly accessible data,” says Grossman. “The unit of compute has gone from measuring server racks to thinking about the virtual data centers we get on AWS. So we have the elasticity that supports scale.”

By building the center’s data infrastructure on AWS, Grossman and the team can work to optimize the processing of large genomic workflows. “We run over 10,000 bioinformatics workflows per month,” Grossman says. “So we’re extremely grateful for the scalability and robust functionality of AWS.”

Facilitating Innovation in Research and Data Methods

Using the security, durability, and elasticity offered by AWS services, the Center for Translational Data Science at the University of Chicago aims to grow from more than 15 data commons to more than 100 in a few years. The goal is to make the commons simpler to build so that they can be delivered as a service, helping both the center and other researchers to focus on research and discovery.

Grossman’s team is also working to innovate data methods. “Along with building more commons and improving the workspaces, I’m also interested in improving the methodology with which we look at large-scale data,” Grossman says. “In addition to the completely data-driven techniques that we use in deep learning today, we hope to develop techniques that integrate existing knowledge about the phenomena we’re studying, including constraints imposed by the biology.”


About the University of Chicago

The University of Chicago is an urban research university founded in 1890 and located in Chicago, Illinois. Its community of scholars works to challenge conventional thinking and generate new insights for the benefit of present and future generations.

Benefits of AWS

  • Supports more than 10 PB of data
  • Runs over 10,000 bioinformatics workflows per month
  • Created the Pandemic Response Commons
  • Supports secure data environments
  • Supports data durability with persistent digital identifiers and other FAIR data services
  • Scales operation using virtual data centers
  • Facilitates innovation of data analysis and methods

AWS Services Used

Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can store and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. 


Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.


Amazon RDS

Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud.


AWS Professional Services

Adopting the AWS Cloud can deliver sustainable business benefits. Complementing your team with specialized skills and experience can help achieve these results.



Get Started

Organizations of all sizes across all industries are transforming their businesses and delivering on their missions every day using AWS. Contact our experts and start your own AWS journey today.