AWS Public Sector Blog

Stanford researchers accelerate autism research by sharing genomic data in the cloud

In 2014, the Wall Lab at Stanford University sought to answer one of the most pressing questions in neuroscience: What genes influence autism spectrum disorder (ASD)? According to the Centers for Disease Control and Prevention (CDC), this neurodevelopmental disorder affects roughly one in 54 children in America and is on the rise, with prevalence nearly tripling since 1992.

Dennis Wall, professor of pediatrics, psychiatry, and biomedical data sciences at Stanford University, started working in ASD research nearly 17 years ago. His lab studies ASD from novel angles, including ASD epidemiology and gut microbiome associations; the lab has even ventured into therapeutic devices with glasses that give real-time emotional feedback to ASD patients. In the lab’s study of ASD genetics, they chose the cloud—and a unique experimental approach—to speed the time to science.

Keeping it in the family

In genetics, most scientists favor the “genome-wide association study” (GWAS) as their experimental design of choice. GWAS finds common genetic variants associated with a disease by looking for DNA sequences that are shared by unrelated individuals who have the disease. The catch? The more complex the disease (that is, the more genes involved, the stronger the environmental influence, or the more the disease varies in presentation), the bigger the group of individuals required.
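To make the idea of "association" concrete, here is a minimal sketch of the kind of test a GWAS repeats at each variant: a chi-square statistic on a 2x2 table of allele counts in cases versus controls. The function name and the counts are invented for illustration, and this is not the analysis used in any of the studies mentioned here. Real GWAS pipelines also correct for ancestry, relatedness, and the very large number of variants tested.

```python
def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    # Product of the two row totals and the two column totals.
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and every column needs at least one allele")
    return n * (a * d - b * c) ** 2 / denominator


# Made-up counts: the alternate allele is more common in cases than in controls.
chi2 = allele_chi_square(case_alt=60, case_ref=40, control_alt=40, control_ref=60)
print(f"chi-square statistic: {chi2:.2f}")
```

A larger statistic means the allele frequencies differ more between cases and controls than chance would suggest. Small effects produce small differences in frequency, which is why complex diseases need such large groups.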

Researchers have conducted GWAS for ASD over the past 15 years, with ever-larger sample sizes. For example, a recent study used more than 46,000 ASD cases and controls. And while ASD is considered to be highly heritable, ASD genetics are far from simple. Two recent studies highlight 71 and 18 genes as extremely likely to drive autism risk, and the online resource SFARI lists more than 900 genes implicated over the years.

To tease out more of the genetics, the Wall Lab combined a time-honored approach with next-generation sequencing and high performance computing (HPC) resources. The researchers opted for an experimental design known as a linkage study. Linkage studies involve tracing diseases through a pedigree (a group of individuals with recorded biological relationships, usually spanning several generations) and seeing which pieces of DNA seem to follow the disease. The linkage study has a long and successful history in genetics: even before researchers published the first draft sequence of the human genome in 2001, linkage studies identified genomic regions associated with corneal disease, vision loss, and schizophrenia.
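The intuition behind "seeing which pieces of DNA seem to follow the disease" can be sketched as a simple co-segregation filter: keep the variants carried by every affected family member and by no unaffected member. This is a deliberately simplified illustration with a hypothetical pedigree and made-up variant names. It is not the Wall Lab's method, and real linkage analysis uses statistical scores that account for recombination and incomplete penetrance.

```python
def cosegregating_variants(carriers_by_variant, affected, unaffected):
    """Return variants carried by every affected member and by no unaffected member."""
    hits = []
    for variant, carriers in carriers_by_variant.items():
        carried_by_all_affected = affected <= carriers      # subset test
        carried_by_any_unaffected = bool(unaffected & carriers)
        if carried_by_all_affected and not carried_by_any_unaffected:
            hits.append(variant)
    return sorted(hits)


# Hypothetical three-generation pedigree.
affected = {"grandfather", "mother", "child_1", "child_2"}
unaffected = {"father", "child_3"}

# Hypothetical variants mapped to the family members who carry them.
carriers_by_variant = {
    "variant_A": {"grandfather", "mother", "child_1", "child_2"},
    "variant_B": {"mother", "child_1", "child_3"},
    "variant_C": {"father", "child_1", "child_2"},
}

candidates = cosegregating_variants(carriers_by_variant, affected, unaffected)
print(candidates)
```

In this toy family only variant_A tracks with the diagnosis. variant_B is missing from some affected members, and variant_C is also carried by an unaffected parent.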

For their linkage study, the Wall Lab worked with the Hartwell Autism Research and Technology Initiative (iHART) and the Autism Speaks Autism Genetic Resource Exchange to generate the iHART dataset. It consists of 1,010 families, each with at least two children diagnosed with ASD. That’s a total of 4,610 individuals, making this the largest multiplex family ASD dataset in the world with whole genome sequencing at an average depth of 30X. In genetics lingo, 30X means that each base in the three-billion-base-pair human genome was sequenced an average of 30 times, and it is the current gold standard for genome sequencing. A sequencing project of this magnitude (on the order of several hundred terabytes of genomic data) was a strong candidate for cloud-based storage and processing.
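A quick back-of-the-envelope calculation shows why a dataset like this is large. The genome size, depth, and number of individuals come from the paragraph above. The 150-base read length is an assumption made only for illustration, and the calculation counts sequenced bases rather than file sizes, which depend on file format and compression.

```python
GENOME_SIZE_BP = 3_000_000_000   # roughly three billion base pairs
MEAN_DEPTH = 30                  # "30X": each base read about 30 times on average
INDIVIDUALS = 4_610              # individuals in the dataset

# Assumption for illustration only: a typical short-read length.
ASSUMED_READ_LENGTH_BP = 150

bases_per_genome = GENOME_SIZE_BP * MEAN_DEPTH
total_bases = bases_per_genome * INDIVIDUALS
reads_per_genome = bases_per_genome // ASSUMED_READ_LENGTH_BP

print(f"Sequenced bases per genome: {bases_per_genome:,}")
print(f"Sequenced bases overall:    {total_bases:,}")
print(f"Reads per genome (assumed {ASSUMED_READ_LENGTH_BP} bp reads): {reads_per_genome:,}")
```

Under these assumptions, each genome accounts for about 90 billion sequenced bases, and the full cohort for roughly 415 trillion.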

AWS Promotional Credit and open data

With support from the AWS Cloud Credit for Research program, the Wall Lab funneled their raw genomic data into their analytic pipelines. Using resizable compute capacity in the cloud with Amazon Elastic Compute Cloud (Amazon EC2), the Wall Lab analyzed genetic variation across all 4,610 samples in parallel. By another Stanford lab’s estimation, a similar workload on an on-premises high performance computing cluster might have taken four times longer. The Wall Lab used a serverless analytics tool, Amazon Athena, to quickly query groups of genetic variants by sample or by family using standard SQL.

Jae-Yoon Jung, a postdoctoral associate in the Wall Lab, explains, “Once you have annotated genetic variation in your sample population, you then have to query subsets of the data. For example, ‘What are all the mutations uncharacterized, highly predicted to have bad effects in this family?’ Manual queries of this nature would normally take hours. With Amazon Athena, we can reduce query time to seconds.”
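For readers curious what such a query might look like, here is a minimal sketch that uses Python's built-in sqlite3 module as a local stand-in for Amazon Athena. The table name, column names, and rows are all hypothetical and do not reflect the Wall Lab's actual schema. The point is only the shape of the SQL: filter annotated variants by family, clinical status, and predicted impact.

```python
import sqlite3

# In-memory database standing in for a table of annotated variants.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE variants (
        family_id        TEXT,
        sample_id        TEXT,
        gene             TEXT,
        clinical_status  TEXT,   -- hypothetical: 'known' or 'uncharacterized'
        predicted_impact TEXT    -- hypothetical: 'HIGH' or 'LOW'
    )
    """
)

# Made-up rows for illustration.
rows = [
    ("FAM001", "S1", "GENE_A", "uncharacterized", "HIGH"),
    ("FAM001", "S2", "GENE_A", "uncharacterized", "HIGH"),
    ("FAM001", "S1", "GENE_B", "known", "HIGH"),
    ("FAM001", "S2", "GENE_C", "uncharacterized", "LOW"),
    ("FAM002", "S3", "GENE_A", "uncharacterized", "HIGH"),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", rows)


def damaging_uncharacterized(connection, family_id):
    """Uncharacterized variants predicted to be highly damaging in one family."""
    query = """
        SELECT sample_id, gene
        FROM variants
        WHERE family_id = ?
          AND clinical_status = 'uncharacterized'
          AND predicted_impact = 'HIGH'
        ORDER BY sample_id, gene
    """
    return connection.execute(query, (family_id,)).fetchall()


hits = damaging_uncharacterized(conn, "FAM001")
print(hits)
```

Swapping the family filter for a sample filter gives the per-sample version of the same question.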

The outcome of this work? In a 2019 Cell article, the Wall Lab reported 69 genes associated with ASD, including 16 not previously identified. They integrated the work of other groups to demonstrate an interactive network of ASD-implicated genes, many of which contribute to neural cortical development.

Co-localizing data and tools

The Wall Lab shares their data upon request. Approved users receive access to the iHART dataset stored in Amazon Simple Storage Service (Amazon S3), as well as an Amazon Machine Image pre-loaded with the Wall Lab’s processing and analytic pipelines. Information and directions for accessing the iHART dataset are available on the Registry of Open Data on AWS.

Ultimately, the Wall Lab hopes their work will be the foundation for an ASD data lake. “Centralizing clinical and genomic data and putting it in close proximity to the compute workflows will help us solve the hard problems,” says Wall. “And the hard problems are the ones that matter.”

Learn more about how iHART is advancing our understanding of Autism Spectrum Disorder.

Access open data and share your open data with the Registry of Open Data on AWS.

Learn more about how AWS supports researchers with the AWS Cloud Credit for Research program.

Explore resources for genomicists and life scientists at AWS Genomics.