AWS Public Sector Blog
AWS helps Genomics England’s Multimodal programme accelerate research with whole slide images
This is a guest post from Genomics England.
Pathologists have been looking at morphological patterns in patients’ tissue sections highlighted by hematoxylin and eosin (H&E) staining for more than a century. However, as the pathology transformation from glass slides to digital imaging gains momentum, it opens the door to artificial intelligence (AI) tools to complement expert assessment with quantitative measurements to enable data-driven medicine.
Yet challenges remain in handling digital image files, particularly storage and the pre-processing needed before AI tools can be applied. Genomics England has used Amazon Web Services (AWS) and tools such as Amazon SageMaker to demonstrate how to prepare digital pathology images for research and for the development of machine learning models.
Cancer diagnosis, pathology and slides
Cancer is a collection of diseases with complex causes and treatments, collectively affecting more than two million people in England. It is important to identify cancer early and to diagnose the correct subtype, so that patients can receive the most appropriate treatment and the best possible outcome.
For decades, histopathology has been the gold standard for cancer diagnosis; tissue is removed from the patient during a surgery, then assessed and sampled by a pathologist before preparation in the laboratory into a collection of glass slides (case) for review. The pathologist looks at the tissue on these slides at the microscopic level, extracting core information to diagnose the patient along with features that can guide treatment and provide prognostic information.
The visual analysis performed by a pathologist is supported by multiple types of adjunct testing, including special stains and immunohistochemistry. When applied to the slide, these can highlight protein expression in tumour tissue. Such patterns of expression can help guide diagnosis and treatment. For example, oestrogen receptor expression within breast cancers is an indicator that hormone-based therapy may help. In the last decade, an increasing number of molecular tests have become part of clinical pathways.
Modern technology
Two previously discrete diagnostic areas – genetics and histopathology – are now intrinsically linked and molecular testing is now the norm in many cancers. This has been a key enabler of precision medicine, and increasingly forms part of the World Health Organization’s (WHO) diagnostic classification systems.
Technological advances, matched with a reduction in cost, mean whole genome sequencing (WGS) datasets are more readily available than previously. Compared with other approaches, WGS enables more in-depth characterisation of tumour mutations, and is particularly suited to capturing structural variations, copy number variations, and mutations in non-coding regions.
Simultaneously, histopathologists are adapting to the increasing use of digital pathology (DP), in which traditional glass slides of tissue are converted to digital whole slide images (WSI). In addition to the many clinical workflow opportunities digital pathology enables, these complex images (much like high-throughput sequencing data) are well-suited to machine learning (ML). Several applications for biomarker scoring already exist. For example, scoring tools for oestrogen receptor, progesterone receptor, and programmed cell death ligand 1 have been approved by the U.S. Food and Drug Administration (FDA), and computer-aided diagnostic algorithms follow close behind.
The value of multimodal data
The value of a multimodal “pathogenomic” approach is intuitive, given the complementary information each dataset contributes. Whilst genomic and other molecular information reveals pathological change at a sub-cellular level, the tissue architecture captured in WSIs provides information about the phenotypic manifestations of these changes. It allows the tumour to be understood in the wider tissue context, including the cellular microenvironment, immune response, and invasiveness.
Already, these approaches have shown the ability to predict genomic features directly from the images (mutations, mutational signatures, copy number alterations, gene expression, and presence of oncogenic viruses) and to improve classification tasks (such as predicting patient survival) compared with unimodal approaches.
The inclusion of explainability and interpretability mechanisms in these models also offers the potential to gain new insights about the drivers of cancer, and may help identify new treatment targets. Complex spatial patterns such as tumour heterogeneity and the immune microenvironment may be particularly attractive targets. Linking visually detectable features on routinely generated WSIs to insights from genomic analysis offers a way to translate these discoveries into the clinic more readily through image analysis tools. In this way, we may avoid the slower, more expensive sequencing approaches.
Collecting data for the 100,000 Genomes Project Cancer Participants
Genomics England was initially founded to run the 100,000 Genomes Project, to demonstrate the value of WGS for diagnostics in rare disease and cancer patient cohorts. It has subsequently provided a research library built on this data, enabling academic and industry researchers to investigate hypotheses and produce publications.
The Multimodal programme in Genomics England aims to enrich the data with other data modalities – in particular WSI, with a focus on the programme’s 15,000-plus cancer participants.
The National Pathology Imaging Cooperative (NPIC) is based at the Leeds Teaching Hospitals NHS Trust. Leeds has been at the forefront of digital pathology and AI research and innovation for more than 15 years, with St James’s University Hospital being one of the world’s first fully digital pathology labs.
Successes at Leeds are being used as a blueprint for deployment in other Trusts. NPIC is delivering a core programme across four themes: clinical deployment, research and AI, training and validation, and quality. As part of the research offering, NPIC has developed a unique multi-scanner facility called AI FORGE (Facilitating Opportunities for Robust Generalisable data Emulation) for the development of datasets for AI.
Genomics England works with the hospitals which initially participated in the 100,000 Genomes Project to find the pathology slides associated with the investigation used for the original genomic sequencing, and ship them to NPIC. NPIC scans these slides and takes a wide range of actions to ensure data quality and robustness for ML, for instance by scanning some of the slides across multiple scanners.
Once the slides are scanned at NPIC, they are transferred to Genomics England using Genomics England’s AWS deployment of IBM Aspera, a tool for robust transfer of large files. The slides are sent along with manifests that identify the relevant participants and provide additional metadata. The files then land in Genomics England’s estate in an Amazon Simple Storage Service (Amazon S3) bucket. Genomics England uses several other AWS services to better understand its storage, including Amazon S3 Inventory for tracking the files in its buckets.
We also use Amazon S3 Intelligent-Tiering to ensure that our data is stored in the most cost-effective way. Depending on the interests of research teams, some images may be accessed frequently and others much less so, but this can be hard to predict so AWS automation helps find the right image-level trade-off between access and storage costs.
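One way to place objects into Intelligent-Tiering is an S3 lifecycle rule. The sketch below is illustrative only and is not taken from Genomics England's configuration: the prefix, rule ID, and bucket name are hypothetical placeholders. It builds the lifecycle configuration as a plain dictionary, which could then be passed to the `put_bucket_lifecycle_configuration` call in boto3.

```python
def intelligent_tiering_rule(prefix: str, rule_id: str = "wsi-to-intelligent-tiering") -> dict:
    """Build an S3 lifecycle configuration that transitions objects under
    `prefix` to the INTELLIGENT_TIERING storage class."""
    return {
        "Rules": [
            {
                "ID": rule_id,                  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},   # only objects under this prefix
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    }


# "slides/" is an assumed prefix, used here purely for illustration.
config = intelligent_tiering_rule("slides/")

# With boto3 installed and credentials configured, the rule could be applied
# like this (the bucket name is a placeholder):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-wsi-bucket", LifecycleConfiguration=config)
```

Once objects are in this storage class, S3 moves each one between access tiers based on its own access pattern, which is the image-level trade-off described above.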
Making data ready for researchers
As part of the Genomics England National Genomic Research Library, we must ensure that data we make available to researchers is de-identified, and that we have ongoing consent to use the participants’ data for research. AWS helps us process the data in a number of ways:
- We are able to use the Amazon SageMaker ML environment to collect and process both the manifests and image data to produce catalogues for researchers.
- AWS roles and permissions allow us to ensure that only specific internal staff have access to potentially identifiable data, or to images that have not been positively confirmed as appropriate for research. We can then set up role-based systems that define what is ‘research ready’ and make that data available to a wide group of users through different systems.
- We use our SageMaker environment to test access to this data from a research perspective and enable our own research teams to test a similar level of access to that available to external researchers.
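The cataloguing step in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Genomics England pipeline: the field names (`slide_key`, `participant_id`, `consent_ok`, `deidentified`) and the example keys are hypothetical, and the inventory is reduced to a mapping from object key to size. The sketch keeps only slides that are present in storage and positively confirmed as research ready.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestRow:
    slide_key: str       # S3 object key of the scanned slide (hypothetical field)
    participant_id: str  # de-identified research identifier (hypothetical field)
    consent_ok: bool     # ongoing consent for research has been confirmed
    deidentified: bool   # image has been checked for identifiable content


def build_catalogue(manifest, inventory_sizes):
    """Return catalogue entries for slides that exist in storage and are
    positively confirmed as appropriate for research."""
    catalogue = []
    for row in manifest:
        size = inventory_sizes.get(row.slide_key)
        if size is None:
            continue  # in the manifest but missing from the bucket inventory
        if not (row.consent_ok and row.deidentified):
            continue  # not confirmed research ready, so keep it out
        catalogue.append(
            {
                "slide_key": row.slide_key,
                "participant_id": row.participant_id,
                "size_bytes": size,
            }
        )
    return catalogue


# Made-up example data for illustration only.
manifest = [
    ManifestRow("slides/a.svs", "P001", consent_ok=True, deidentified=True),
    ManifestRow("slides/b.svs", "P002", consent_ok=False, deidentified=True),
    ManifestRow("slides/c.svs", "P003", consent_ok=True, deidentified=True),
]
inventory_sizes = {"slides/a.svs": 1_500_000_000, "slides/b.svs": 900_000_000}

catalogue = build_catalogue(manifest, inventory_sizes)
```

Treating "research ready" as an explicit allow-list, rather than removing known problem cases, means a slide with missing checks is excluded by default.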
We have also been able to use AWS services to quickly conduct internal experiments that may eventually be useful for our researchers. This includes testing SageMaker Ground Truth for labelling text and slide images, helping us choose specific images to run specific analyses against. We also use Amazon Elastic Compute Cloud (Amazon EC2), Amazon WorkSpaces, and Amazon S3 File Gateway to enable the use of applications such as QuPath, which are normally run in very different environments.
Research applications and implications
Once the slides are available, having been thoroughly checked for both quality and consent, researchers can use them to start exploring different aspects of cancer biology. Our internal team of engineers has been using SageMaker and the flexibility of the cloud to build ML models using the images. This allows us to explore how patterns in both a patient’s genome and the tumour structure affect important outcomes such as survival.
We have also built several workflows that allow us to extract numerical embeddings (compressed representations that capture the meaning of the data) for the WSIs. Because these embeddings are far smaller than the WSIs they summarise, they are much easier to work with in ML tasks. We have also been working to fill in missing labels in our manifests, so that when researchers are able to access pathology slides, they have the best understanding of what those slides represent.
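The post does not describe how the embeddings are computed. One common pattern, shown here purely as an assumption, is to split a slide into tiles, embed each tile with a model, and average the tile vectors into a single slide-level vector. The sketch below covers the tiling and pooling steps in plain Python, using made-up tile vectors in place of a real model, and compares sizes for hypothetical slide and embedding dimensions.

```python
def tile_origins(width, height, tile_size):
    """Top-left pixel coordinates of the non-overlapping tiles that fit
    entirely inside a slide of the given dimensions."""
    return [
        (x, y)
        for y in range(0, height - tile_size + 1, tile_size)
        for x in range(0, width - tile_size + 1, tile_size)
    ]


def mean_pool(tile_embeddings):
    """Average per-tile embedding vectors into one slide-level vector."""
    if not tile_embeddings:
        raise ValueError("at least one tile embedding is required")
    count = len(tile_embeddings)
    dim = len(tile_embeddings[0])
    return [sum(vec[i] for vec in tile_embeddings) / count for i in range(dim)]


# A small hypothetical slide of 1024 x 512 pixels, cut into 256-pixel tiles.
origins = tile_origins(1024, 512, 256)

# Made-up two-dimensional tile embeddings standing in for real model output.
slide_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])

# Size comparison under assumed dimensions: an uncompressed RGB image of
# 100,000 x 80,000 pixels versus a 1024-value float32 embedding.
raw_bytes = 100_000 * 80_000 * 3
embedding_bytes = 1024 * 4
```

Under these assumed dimensions the raw pixels occupy about 24 GB, while the embedding occupies 4 KB, which illustrates why embeddings are so much easier to store and feed into downstream models.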