AWS Public Sector Blog

Building NHM London’s Planetary Knowledge Base with Amazon Neptune and the Registry of Open Data on AWS


The Natural History Museum in London.

Introduction

The twin crises of biodiversity loss and climate change are perhaps the greatest challenge facing science and society. The Earth is losing species at up to 1,000 times the natural rate, and by some estimates, 30–50 percent of all species may be lost by 2050. The world’s largest museums have come together to examine how their data can be used to tackle these challenges.

The Natural History Museum in London (NHM) is a world-class visitor attraction and a leading science research center. NHM and Amazon Web Services (AWS) have worked together to transform and accelerate scientific research by bringing together a broad range of UK biodiversity and environmental data types in one place for the first time. The museum cares for a collection of more than 80 million natural science specimens that provide a historical record that can be used to understand the impacts of human activity and project future trends in biodiversity. In this post, the first in a two-part series, we provide an overview of the NHM-AWS project and the potential research benefits.

Applying technology to global specimen collections

An estimated 1.1 billion specimens are held in natural science collections worldwide. Using this data to create a knowledge graph (KG) and applying graph neural networks (GNNs) to that graph will enable NHM to build a powerful platform that allows researchers around the world to identify new relationships, align new specimens, and detect anomalies across the breadth of natural science collections. Applying these technologies across the worldwide specimen collection will also allow researchers to deduce special ecological interactions and identify misclassified species in collections. Drs. Qianqian Gu, Ben Scott, and Vincent Smith from the NHM’s Life Sciences, Diversity and Informatics Division have prototyped a method for using KGs and GNNs in this way, and documented their work in two articles in the Biodiversity Information Science and Standards (BISS) journal.

Although still in the prototyping phase, NHM plans to make this scientific resource, named the Planetary Knowledge Base (PKB), available globally to accelerate research on the effects of biodiversity and habitat loss, sustainable sources of minerals, human health, and the effects of climate change.

“Until now, museums like the UK’s Natural History Museum have had to manage their data in a series of data silos, curating and editing this data independently,” said Dr. Smith, head of the NHM’s Life Sciences, Diversity and Informatics Division. “The PKB breaks these silos, unlocking the enormous collective power of this data and changing the way the museum community works. With the PKB, we can begin to build real-time models of the past and present state of life on the planet, and from this information, draw inferences that predict the future of the natural world at almost any scale, under different environmental scenarios.”

In the past, the project used datasets such as the NHM Botany Collector database, comprising 105,780 entities, and the NHM Indian Region Botanical Specimen Dataset, comprising 110,043 entities with geographical information. For the AWS prototype, the team planned to use the Global Biodiversity Information Facility (GBIF) dataset. GBIF is an international network that integrates datasets documenting more than 2.5 billion species occurrences, and the GBIF species occurrence dataset is currently hosted on the Registry of Open Data on AWS. Previously, the team at NHM prototyped the platform on museum-provided laptops and machines, running scripts overnight to collect data from a variety of sources. The technical challenge, therefore, was how to scale the platform to build a graph capable of supporting an ever-growing dataset comprising billions of specimen occurrences.


Figure 1. The architecture overview of AWS services involved in the workflow described in this post. The main components are AWS Glue, Amazon S3, and Amazon Neptune Serverless.

The PKB uses AWS Glue to ingest and transform data stored in Amazon Simple Storage Service (Amazon S3). AWS Glue gives the NHM team the ability to prepare the GBIF dataset, and any other datasets they require, quickly and simply. The team uses an AWS Glue PySpark notebook to filter specific data and generate load files that hydrate an Amazon Neptune Serverless database.
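
To illustrate this step, the following minimal PySpark sketch filters GBIF occurrence records and writes them as Neptune bulk loader vertex files. The snapshot path, bucket names, filter criteria, and column choices are assumptions for illustration only, not the team’s actual Glue job.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gbif-to-neptune-load").getOrCreate()

# GBIF snapshots on the Registry of Open Data are published as Parquet;
# the snapshot path below is a placeholder.
occurrences = spark.read.parquet(
    "s3://gbif-open-data-eu-central-1/occurrence/2023-10-01/occurrence.parquet/"
)

# Keep preserved specimens that have a species name and coordinates.
specimens = occurrences.filter(
    (F.col("basisofrecord") == "PRESERVED_SPECIMEN")
    & F.col("species").isNotNull()
    & F.col("decimallatitude").isNotNull()
    & F.col("decimallongitude").isNotNull()
)

# Shape the rows into the Neptune bulk loader vertex CSV format,
# which requires ~id and ~label columns plus typed property columns.
vertices = specimens.select(
    F.col("gbifid").cast("string").alias("~id"),
    F.lit("occurrence").alias("~label"),
    F.col("species").alias("species:String"),
    F.col("decimallatitude").alias("latitude:Double"),
    F.col("decimallongitude").alias("longitude:Double"),
)

vertices.write.mode("overwrite").option("header", True).csv(
    "s3://example-pkb-bucket/neptune-load/occurrence-vertices/"
)

The resulting CSV files in Amazon S3 are the load files that the next step feeds to the Neptune bulk loader.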

Neptune allows the team to build a graph database that researchers can query using the Gremlin query language. The team uses the Neptune bulk loader to load the data prepared with AWS Glue into the database.
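
A bulk-load job is started by posting a request to the cluster’s /loader HTTP API. The sketch below is a minimal example; the cluster endpoint, S3 path, IAM role ARN, and AWS Region are placeholders.

import requests

# Placeholder values: replace with your cluster endpoint, S3 path, and role ARN.
NEPTUNE_ENDPOINT = "https://pkb-cluster.cluster-xxxxxxxx.eu-west-2.neptune.amazonaws.com:8182"

payload = {
    "source": "s3://example-pkb-bucket/neptune-load/occurrence-vertices/",
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "eu-west-2",
    "failOnError": "FALSE",
    "parallelism": "HIGH",
}

# The request must originate from inside the cluster's VPC, and it needs
# SigV4 signing if IAM authentication is enabled on the cluster.
response = requests.post(f"{NEPTUNE_ENDPOINT}/loader", json=payload)
print(response.json())  # returns a loadId that can be polled at /loader/<loadId>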

Neptune provides a capability called Amazon Neptune ML that can be used to train and deploy GNNs to make easy, fast, and accurate predictions using graph data. Neptune ML provides an integrated solution for deploying machine learning (ML) models on your graph data. It simplifies data preparation, model training, and hosting a secure inference endpoint into a set of workbench magic commands that can be run end to end in a Neptune graph notebook. The easiest way to get started with Neptune ML is to use the AWS CloudFormation quick-start template, which sets up all the necessary components, such as a Neptune DB cluster, the required AWS Identity and Access Management (IAM) roles, and a Neptune graph notebook.

Training a GNN using Neptune ML requires training data formatted so that it can be used by the Deep Graph Library (DGL). After the data is loaded into the Neptune graph database, it must be exported from the Neptune cluster to a customer-owned Amazon S3 bucket, chosen at export time. Formatting the data and exporting it from the Neptune graph database to Amazon S3 for training can be accomplished with a single line of code by running an export job using the Neptune workbench magics.
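
In a Neptune graph notebook, that single line is an export magic along the lines of the sketch below. Here, export_service_endpoint points at the Neptune-Export service and export_params is a Python dictionary defined in an earlier cell that names the cluster endpoint, the output S3 path, and the ML task; both are placeholders, and the exact magic options can vary between graph-notebook versions.

%%neptune_ml export start --export-url {export_service_endpoint} --export-iam --wait --store-to export_results
${export_params}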

Once graph data is exported, Neptune ML makes it easy to train the ML model and deploy it to an Amazon SageMaker endpoint. This hosted endpoint allows controlled access for authorized users to run graph queries and get model predictions integrated with the graph results. All network access to the SageMaker endpoint is restricted based on your configurations. You can review the AWS documentation on configuring security in Amazon SageMaker to understand the identity and network security features available for securing your ML endpoint access. In the case of the PKB, the endpoint is used by researchers to perform queries against the data.
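
Once data processing, training, and endpoint creation have completed (each is started with its own %neptune_ml workbench magic), a researcher can invoke the deployed model from within a Gremlin query. The following sketch assumes a hypothetical endpoint name, vertex label, and property, and asks the model to predict a property value for a handful of vertices.

%%gremlin
g.with("Neptune#ml.endpoint", "pkb-node-classification-endpoint").
  V().hasLabel("occurrence").limit(5).
  properties("species").with("Neptune#ml.classification").value()

Neptune ML calls the SageMaker inference endpoint behind the scenes and returns the predicted values alongside the standard graph results.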

Enabling global biodiversity research

The PKB enables a wide range of research use cases and breaks down data silos, potentially allowing a transformative way of working within the scientific community that looks after the world’s natural collections. Some of these use cases include:

  • A discovery tool for biodiversity researchers – PKB will allow researchers to interrogate existing biodiversity data faster and at scale. No previous tool has spanned the breadth of biodiversity data in this way.
  • Redetermination, discrepancy, and conflict flagging – Users will be able to identify species that require reexamination or reidentification due to taxonomic updates, inconsistencies, or anomalies. This could lead to remarkable discoveries about the natural world.
  • Change (loss) quantification – Specimen occurrences can be tracked over time, enabling the computation of conservation indicators. Historically, this has required coalescing multiple disparate systems and datasets.
  • New species and attribute discovery – PKB could help identify specimens that may represent new species and reveal previously unknown attributes of known specimens.

“The PKB is a step towards museum scientists being able to cocurate data with colleagues worldwide and ultimately cocurate the models that manage the quality of this data. Biodiversity provides the support systems for all life on Earth. Yet the natural world is in peril, and we face biodiversity and climate emergencies,” said Dr. Smith. “Solutions to these problems can be found in the massive amount of data associated with natural science collections, from digitized collections, real-time biodiversity observations, knowledge from the scientific literature, and a wealth of molecular information. The PKB provides the toolkit to unlock this data.”

For more on this topic, check out the AWS re:Invent 2023 session on Building next-generation sustainability workloads with open data (SUS304). 

In this post, we outlined the project and highlighted the potential benefits of the research. Stay tuned for our follow-up post, where we discuss in detail how we built the AWS architecture to support the PKB.

Nishant Casey

Nishant is a principal solutions architect at Amazon Web Services (AWS). He primarily supports non-profit organizations in the UK and EMEA. Nishant works with partners and colleagues to provide technical guidance and promote architectural best practices to customers. A DJ and gigging musician in his past life, he is an avid home chef, baker, and fan of RTS video games.

Ilan Gleiser

Ilan is a principal global impact computing specialist at Amazon Web Services (AWS) focusing on circular economy, responsible artificial intelligence (AI), and ESG. He is an expert advisor on digital technologies for the circular economy with the United Nations. Ilan’s background is in quantitative finance, and he led AI enterprise solutions at Wells Fargo before joining AWS.

Karsten Schroer

Dr. Schroer is a senior machine learning (ML) prototyping architect at Amazon Web Services (AWS), focused on helping customers leverage artificial intelligence (AI), ML, and generative AI technologies. With deep ML expertise, he collaborates with companies across industries to design and implement data- and AI-driven solutions that generate business value. Karsten holds a PhD in applied ML.

Sam Bydlon

Dr. Bydlon is a specialist solutions architect on the Advanced Compute, Emerging Technologies team at Amazon Web Services (AWS). Sam received his PhD in geophysics from Stanford University in 2018 and has 12 years of experience developing and using numerical simulations in research science and financial services. In his free time, he likes to watch birds, talk personal finance, and teach kids about science.