AWS for Industries

Executive Conversations: Completing the Human Reference Genome with Leaders of the Telomere-to-Telomere (T2T) Consortium

Karen Miga, PhD, associate director of the University of California Santa Cruz Genomics Institute, and Adam Phillippy, PhD, senior investigator at the National Human Genome Research Institute, join Ankit Malhotra, PhD, Genomics Lead, Worldwide Public Sector, Amazon Web Services (AWS), to discuss how their team leveraged technological advancements to assemble the first complete human reference genome. The Telomere-to-Telomere (T2T) consortium is a collaboration of scientists across universities and government research institutions that aims to sequence and assemble unresolved regions of the human genome with the goal of accelerating genomics research across the board. Time Magazine has recognized Dr. Karen Miga and Dr. Adam Phillippy as part of The 100 Most Influential People of 2022 for their work on the T2T consortium.

This Executive Conversation is one in a series of discussions held with thought leaders in healthcare and life sciences, where we seek to learn more about the impact of technological innovation and cloud computing on their industries.

Ankit Malhotra: Welcome. This is something I’ve been looking forward to. It’s exciting to discuss genomics and the recently completed human genome assembly with both of you. To get us started, can you please give us an overview of your roles at your respective organizations and your interest in genome mapping?

Karen Miga: I’m an assistant professor in the Biomolecular Engineering Department and an associate director at the Genomics Institute. I did my master’s with Evan Eichler when the Human Genome Project came out. As a result, I am very invested in the public effort of thinking deeply about the parts of our genome that were missing. I did my PhD with Hunt Willard, one of the world’s experts in satellite DNA and centromere biology. So, from day one I wanted to visualize these gaps that span human centromere regions and understand the biology of satellite DNA. Now, that is a central theme of my research group.

Adam Phillippy: I’m a senior investigator at the National Human Genome Research Institute, part of the National Institutes of Health. I started my career thinking I would be a software engineer, so my undergrad degree is in computer science. Halfway through, I was exposed to genomics through a summer internship and realized it’s a great application of computer science. For the last 20 years, I’ve been doing genomics, developing methods for the alignment, assembly, and analysis of genomes.

For the first decade of my career, I was doing microbial genomics because in the early 2000s sequencing was expensive and it was easier to sequence and assemble bacterial genomes. Shortly after I joined the NIH, I met Karen through a collaboration on ultra-long Nanopore sequencing, and we realized its power to finish genome assemblies. Therefore, we teamed up to complete the human genome and create a complete reference genome assembly that can be used as a representative example of a person’s genetic code.

AM: The first human reference genome was published in 2001. Why are reference genomes important, and what was missing from the original published human genome?

AP: Reference genomes are powerful tools for several reasons. First, the annotation highlights the genes and their DNA patterns. Second, having a completed genome allows you to compare it to other genomes to understand their evolutionary relationship. Critically, nearly all genomics assays run today use a reference genome to accelerate mapping and alignment. It is a core foundation upon which genomics stands.

Many people didn’t realize that the original reference genome was not completely finished. Even many genomic researchers were unaware that their reference FASTA file has stretches of ‘N’s, representing missing regions. I don’t think these regions were omitted intentionally or for lack of importance. They were omitted because, until recently, limitations in sequencing and computational technology made them difficult to sequence and assemble. I am happy that the Human Genome Project community understood the importance of “completing” the human genome reference, has continued to work on this important goal, and has brought it to the forefront.
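As a concrete illustration of the ‘N’ stretches mentioned above, the following is a minimal sketch, not a tool used by the T2T consortium, of how one might locate runs of ‘N’ in a FASTA file. The function name `find_n_gaps`, the `min_len` parameter, and the toy record `chrDemo` are assumptions made for this example. It holds each sequence in memory, which is workable for a single chromosome but is not optimized for large assemblies.

```python
import re
from typing import Dict, Iterable, List, Tuple


def find_n_gaps(
    fasta_lines: Iterable[str], min_len: int = 1
) -> Dict[str, List[Tuple[int, int]]]:
    """Map each sequence name to the (start, end) spans of its N runs.

    Coordinates are 0-based and half-open, like Python slices.
    `min_len` (hypothetical parameter) ignores runs shorter than this.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    gaps: Dict[str, List[Tuple[int, int]]] = {}
    name = None
    chunks: List[str] = []

    def flush() -> None:
        # Record the gaps for the sequence collected so far, if any.
        if name is None:
            return
        seq = "".join(chunks)
        gaps[name] = [(m.start(), m.end()) for m in pattern.finditer(seq)]

    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    flush()
    return gaps


# Toy record for illustration only; real references are read from a file.
example = [
    ">chrDemo toy record",
    "ACGTNNNNNACGT",
    "ACNNACGT",
]
print(find_n_gaps(example))             # {'chrDemo': [(4, 9), (15, 17)]}
print(find_n_gaps(example, min_len=3))  # {'chrDemo': [(4, 9)]}
```

For a real reference, the same function could be called on an open file handle, for example `find_n_gaps(open("reference.fa"))`, where the file name is a placeholder.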

AM: Missing regions in the completed genome assembly are primarily large DNA repeats. Why are these repeats important, and what is their biological significance?

KM: To address that, I’d like to describe the broader structure of the genome first, for context. The genome is made up of both genes that code for RNA and large sections of intergenic DNA that do not. Historically, these intergenic regions were underappreciated and referred to as ‘junk DNA’. These regions are highly repetitive, making them difficult to sequence and assemble. In particular, there are large regions of our genome, often spanning millions of bases, that are enriched with repeats and are extremely gene-poor, yet are known to span regions fundamental to life. For example, every time our cells divide, our genome must replicate and be partitioned equally. Any error in that process can lead to a genetic imbalance between cells, which can lead to cell death and cancers. Chromosome structures known as centromeres span regions that are extremely repeat-rich and represent some of the largest, most persistent gaps in our reference genome. Needless to say, fully understanding the genome requires us to understand these large repeating regions.

I think now that we have a more holistic view of what a genome looks like, we’ll be able to do analysis to add to our knowledge of chromosomes, and open new lines of research. We can explore what they do at the ground level in regulation, spatial organization, replication, and DNA repair. There are almost more questions than answers.

AM: Are there examples of the biological impact of these repetitive genomic regions?

KM: An example is the Fragile X Syndrome, which is a rare genetic disorder that can lead to developmental delays, learning disabilities, and cognitive impairment. Fragile X is caused by an increased number of tandem repeats inside of the FMR1 gene. However, it’s important to keep in mind that these tandem repeated regions are incredibly dynamic parts of our genome. The regions expand and contract, and we in the scientific community don’t fully understand why. We don’t yet know how all genetic information translates to disease, but we are hoping that our work will motivate new studies. I’m hoping we’ll have more examples to answer your question thoroughly in the future.

We are now celebrating having access to base-level maps of many of these regions. Hopefully, many detailed disease association studies will emerge in the near future. We already know that many of the gene families being discovered may be involved in brain development, as shown by some of the early work characterizing segmental duplications directly adjacent to centromere regions on chromosome 1. We also expect that centromeres, which are the sites responsible for proper segregation of chromosomes every time a cell divides, could help us understand how chromosomes are gained and lost in human disease.

AP: People understand the concept of paralogs, which are multiple copies of the same gene, some of which are perfectly identical. However, for the last 20 years, the word ‘repeat’ has been associated with non-functionality. The genomics community needs to push back and reject this notion. I think this myth exists because we haven’t yet had the sequence for the repeated regions and were therefore unable to study their function. However, studies have now demonstrated these regions are functional, and that’s what we are most excited about. Now that we have a complete reference genome, we can run even more functional studies to understand the biological role of these repeats.

AM: Why now? What technological and computational advances made this possible?

AP: We are standing on the shoulders of giants—the technology developers. They have continued to push sequencing technology to be longer and more accurate, eventually reaching a level that allowed us to disambiguate the complex repetitive regions and assemble a complete genome. This goes hand-in-hand with computational advances, which sometimes get overlooked. Long read sequencing is amazing biochemistry, but it requires amazing computer science to do the base calling and downstream analysis. The computations rely on recurrent and convolutional neural networks to read off an electrical signal that we wouldn’t have been able to collect and compute at scale 20 years ago. That is why, as a computer scientist, I love the field so much. It’s the best of biology and the best of computer science enabling space-age advances.

KM: The T2T team has a tremendous amount of computational biology talent. Not only did they develop new techniques to put the genome together in repetitive areas, but they also had to demonstrate that the assembly is accurate. That took the development of many new tools, including repeat-aware mapping strategies, largely from Adam’s group. There was significant innovation required for us to be confident in the reference genome assembly.

AM: How are scientists like you using the computational infrastructure provided by the cloud to enable genomic advances?

AP: The part I’ve found most useful about the cloud is the collaborative aspect, especially with the giant datasets we share across our broad consortium. It’s been transformative in the way we can all look at the data and manipulate it in ways that make sense to us. For example, I can assemble it, someone else can annotate it, another can validate it, and the data stays in a centrally accessible location readily available for scientific collaboration. The cloud allows us to work in a way we couldn’t have previously. The storage and shareability of the data have been a huge benefit.

Scalability is also essential. These assemblies aren’t cheap to compute. Base-calling of the types of sequencing we had to do takes thousands of CPU hours. When we scale to thousands of genomes, suddenly we’re talking about having to wait a thousand years! Instead, by using the cloud, we can scale up to many instances so we can assemble in an accelerated and reasonable timeframe.
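To make the scaling argument above concrete, here is a back-of-the-envelope sketch. All numbers are hypothetical placeholders chosen only to match the orders of magnitude mentioned (thousands of CPU hours per genome, thousands of genomes), and the calculation assumes perfectly parallel work, which real pipelines only approximate.

```python
HOURS_PER_YEAR = 24 * 365


def wall_clock_hours(cpu_hours_per_genome: float, n_genomes: int, n_cores: int) -> float:
    """Elapsed hours if the total CPU time is split evenly across n_cores."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return cpu_hours_per_genome * n_genomes / n_cores


# Hypothetical workload: 5,000 CPU hours per genome, 1,000 genomes.
single_core = wall_clock_hours(5_000, 1_000, n_cores=1)
many_cores = wall_clock_hours(5_000, 1_000, n_cores=10_000)

print(f"1 core:       {single_core / HOURS_PER_YEAR:.0f} years")
print(f"10,000 cores: {many_cores / 24:.1f} days")
```

With these placeholder values, one core would need roughly 570 years, while 10,000 cores working in parallel would need about three weeks, which is the kind of gap that on-demand scaling is meant to close.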

KM: I’d like to add that the collaborative nature of the cloud has allowed us to develop innovative tools as well. This will become increasingly important as we move from development into production mode. Our goal is to build workflows that other researchers can use, and the scalability and on-demand nature of cloud platforms like AWS are essential for these goals.

AM: Could you expand on how Amazon’s Open Data Program allowed for sharing and collaboration and impacted data security and privacy?

AP: On the data side, the Vertebrate Genomes Project, T2T, and the Human Pangenome Reference Consortium are all AWS open datasets. In these research settings, we are happy to release our data freely and openly to build the consortium. Since these projects involve either non-human or human samples consented for open data release, we had no concerns regarding data privacy.

KM: There are valid concerns around privacy, especially as clinicians use genomic data, because data misuse may lead to infringement of privacy for individuals and their blood relatives. As we grow the consortium, we are actively exploring ways to ensure data security. We need an ecosystem of security for protected datasets, and a cloud-based federated data-sharing platform. This may create connections for new collaborations in which communities that offer compatible reference genomes that are not yet as open access as those included in the official Human Pangenome Reference Resource can still benefit from data sharing and variant calling.

AM: What does a completed reference genome in 2022 mean for the future of genomics? What are some of these upcoming research priorities?

KM: Adam and I are both co-leaders of the Assembly Working Group for the Human Pangenome Reference Consortium. What we have in T2T is one human haplotype that’s largely of European ancestry. That is completely insufficient to represent our species, and it contains genetic bias. The goal of the initial Human Genome Project was to understand our collective genome, so we need to collect more of these haplotypes. This way we can trace human histories and create an equitable genomic resource.

The question becomes, how do you do this? There are options: you could assemble 10,000 genomes of lower resolution, or assemble hundreds perfectly. Our team opted for the latter to gain a more complete picture of genetic diversity across different populations. Inclusivity has always been a thematic goal for the human genome reference—to reach a complete, comprehensive view of our collected genomes. To ensure benefits to people around the world, we need to engage participation in this global genomic resource.

AP: The critical keyword that Karen mentioned is ‘bias.’ All these projects aim to reduce bias in the analysis. For example, the T2T project aims to correct the positional bias against the gaps in the genome. But now we have the problem of variants across different populations not being fully represented in a single reference genome. For example, if you’re doing a disease association study you will be missing variants within sequences that aren’t included in the reference genome. So now we are collecting variants across the population to further reduce bias.

AM: Capturing genetic diversity in reference genomes is incredibly important. As part of the Human Pangenome Reference Consortium, you are unraveling 350 genomes. How are you selecting samples to capture human genetic diversity?

KM: Our team wants to have diverse representation at a societal and geographical level. 350 was a target number issued with the call to action from the NIH based on theoretical estimates from single nucleotide polymorphisms (SNPs). While we want to eliminate all bias, we have to start by collecting common variations, including larger variations and those within highly repetitive regions.

It’s important to note that we are working with cell lines to ensure sufficient sample material can be collected. With that in mind, we try to use early passage cells to avoid spurious variant events. Early in our reference production efforts, we paired samples of parents and their adult children. Trios are quite useful for putting genomes together, so we have prioritized these for sequencing and assembly.

Infrastructure from the 1000 Genomes Project allowed us to start adding value to this amazing resource on day one. We have sequence data from these cell lines and can study their variation relative to different reference genomes. We can project their variation using PCA plots to see how genomes offer different levels of dimensionality of variation and analyze this across a geographical map to ensure we’re representing variation around the world.

AM: How do you see the field advancing, both from a technology and computing perspective?

AP: Ultimately, we wanted this consortium to push the technology to make perfectly accurate genomes routine. This first genome is a proof of concept that we have the methods to do it, and in the next 5 to 10 years we want to make it clinically routine. To do that, the technology needs to be reliable, accurate, and as cheap as possible.

AM: How do you think our new understanding of genomic biology will advance healthcare, especially at the point of care?

KM: People want to see their genetic information lead to more precise treatments or diagnostics by informing targeted therapies. As Adam mentioned, sequencing technology and computational science go hand-in-hand, and we hope genomics and therapeutic discovery can go hand-in-hand as well. Once we build momentum, I think it will progress quickly. There are a lot of reasons to be hopeful that the next 10 years will bring more information about how sites in our genomes inform our health.

AP: I predict a transformation in how people think about diagnostics. Traditional diagnostics look for one specific disease at a time. Genomic sequencing, on the other hand, is an all-in-one global diagnostic: you find genetic information about every disease and disorder in one assay. I think this will be transformational in the clinic.

AM: That would be the Holy Grail of genomic applications in the clinic. How has this experience been for you as researchers?

KM: I am thrilled. I can’t imagine doing this work in a world where we don’t have multi-center, multi-production workflows. This is big science, and it works because we can capitalize on resources we’ve acquired through AWS, for storage, compute, and data sharing. It’s a joy to have this type of organization, and it’s made it easy for various sequencing centers to upload and assess the quality of data. I can’t speak enough about how central to our success these resources are. I am very optimistic about data equity as well. Many places don’t have the same computational power, but they can access the infrastructure through the cloud. This levels the playing field for genomics researchers around the world.

AP: It’s allowed us to move so much faster. The infrastructure on NCBI, for example, is capable of ingesting this amount of data, but it takes time and would have delayed things. Now, we can upload data straight from the sequencer into the cloud and our collaborators can start assembling it the following day. It opens the data pipe as wide as it goes to have a firehose of data. The collaborative tools from AWS have brought our team together in a way that wasn’t possible earlier in my career.

AM: Absolutely. Genomic diversity and equitable access fall into health equality, which is a key priority for AWS.

I appreciate you taking the time to discuss your breakthrough in assembling the first complete human genome, and how AWS was able to power this innovation. It is an exciting time in genomics, and we look forward to seeing what discoveries lie in these previously unexplored regions.

See how AWS is supporting other life science researchers in their quest to expand biological understanding and improve human health.

Karen Miga

Dr. Karen Miga is an Assistant Professor in the Biomolecular Engineering Department at UCSC, and an Associate Director of the UCSC Genomics Institute. In 2019, she co-founded the Telomere-to-Telomere (T2T) Consortium, an open, community-based effort to generate the first complete assembly of a human genome. Additionally, Dr. Miga is the Director of the Reference Production Center for the Human Pangenome Reference Consortium (HPRC). Central to Dr. Miga’s research program is the emphasis on satellite DNA biology and the use of long-read and new genome technologies to construct high-quality genetic and epigenetic maps of human peri/centromeric regions.

Adam Phillippy

Dr. Adam Phillippy is head of the Genome Informatics Section and a senior investigator in the Computational and Statistical Genomics Branch at NHGRI. He is a bioinformatician who bridges the fields of computer science and genomics, and his lab has developed numerous widely used tools for the problems of genome assembly, alignment, clustering, forensics and metagenomics.

Ankit Malhotra

Ankit Malhotra is the worldwide genomics lead on the Amazon Web Services (AWS) Public Sector healthcare team. At AWS, Ankit helps healthcare and biomedical research customers in the public sector integrate genomics into their workloads, helping them accelerate and innovate using the AWS Cloud. With cross-training in computer science, molecular biology, and genetics, he has over 10 years of experience as an NIH-funded computational genomic scientist.