AWS Public Sector Blog

Bridging AI and biology: Inside the AWS and NVIDIA Open Data knowledge graph hackathon

Knowledge graphs (KGs) and large language models (LLMs) are transforming biomedical research, but ensuring artificial intelligence (AI) outputs are trustworthy and evidence-based remains challenging. At the recent Amazon Web Services (AWS) and NVIDIA Open Data knowledge graph hackathon, teams of researchers tackled this challenge by developing innovative solutions that combine knowledge graphs with graph-based retrieval-augmented generation (GraphRAG). These solutions demonstrate how we can create more reliable AI systems that ground their outputs in verifiable evidence while maintaining the powerful capabilities of large language models.

The hackathon brought together roughly 50 researchers in seven teams across two locations: the AWS Skills Center in Arlington, Virginia, US, and the European Bioinformatics Institute in Cambridge, UK. Over three intensive days, the teams developed prototype systems that showcase different approaches to building and deploying evidence-grounded AI systems. The projects built end-to-end solutions with AWS services, including Amazon Neptune for graph database management, Open Data on AWS for access to public datasets, and various compute and machine learning services, together with NVIDIA resources for PyTorch Geometric (PyG) RAG. The following section briefly describes the seven projects, along with links to the GitHub repositories where the teams documented their work across the three days.

Figure 1. Hackathon participants in Arlington, VA on the last day of the event.

Project descriptions

GeNETwork is a reproducible, multi-scale knowledge graph for precision oncology that integrates cancer genomics and pharmacological data currently siloed across specialized databases. Users can start a query from a specific variant, gene, pathway, disease, or drug and traverse the graph to discover related entities and relationships. GeNETwork addresses the reproducibility crisis of KGs (only 0.4% of KGs provide sufficient code and data for reproduction) by making all data files, loading scripts, and documentation publicly available, with transparent documentation of integration challenges to guide future development efforts.
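The kind of traversal described above (start from one entity and walk outward to related ones) can be sketched with a breadth-first search over a small in-memory graph. This is an illustrative sketch only, not the GeNETwork implementation: the entity names (`VARIANT_1`, `GENE_A`, and so on), the relation labels, and the helper functions `build_adjacency` and `neighbors_within` are all hypothetical placeholders.

```python
from collections import deque

# Toy edges as (subject, relation, object). All names are placeholders,
# not real biological facts and not taken from the GeNETwork data.
EDGES = [
    ("VARIANT_1", "located_in", "GENE_A"),
    ("GENE_A", "member_of", "PATHWAY_P"),
    ("GENE_A", "associated_with", "DISEASE_D"),
    ("DRUG_X", "targets", "GENE_A"),
    ("DRUG_Y", "targets", "GENE_B"),
    ("GENE_B", "member_of", "PATHWAY_P"),
]


def build_adjacency(edges):
    """Index edges in both directions so a query can start from any entity type."""
    adjacency = {}
    for subject, relation, obj in edges:
        adjacency.setdefault(subject, []).append((relation, obj))
        adjacency.setdefault(obj, []).append((f"inverse_{relation}", subject))
    return adjacency


def neighbors_within(adjacency, start, max_hops):
    """Breadth-first search returning {entity: hop distance} for entities within max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, neighbor in adjacency.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


adjacency = build_adjacency(EDGES)
reachable = neighbors_within(adjacency, "VARIANT_1", max_hops=2)
print(reachable)
```

Starting from the variant, two hops reach the gene and then its pathway, disease, and targeting drug. A production system would more likely issue an equivalent query to a graph database than walk an in-memory dictionary.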

ECoGraph is a knowledge graph framework that integrates genomic and proteomic data to identify biomarkers for colorectal adenocarcinoma. This integrated approach provides a systematic method for identifying and validating potential therapeutic targets while accounting for the demographic and molecular heterogeneity of colorectal cancer. Colorectal cancer is the third most common cancer by incidence and mortality, with recent evidence indicating that incidence has been rising among younger populations.

ClassiGraph integrates multi-omics and ontology-based data into a unified knowledge graph where nodes and edges capture biological relationships. From these relationships, users can train graph neural network models for tasks like cancer sub-type classification. The resulting embeddings and predictions can then be evaluated, visualized, and exported for downstream biological analysis.

EasyGiraffe is a simulator-based validation framework tailored for multicentric polygenic variant extraction. This simulator generates synthetic sequencing data embedded with known variants across multiple loci and samples. When processed through the variant calling pipeline, these outputs enable robust benchmarking by comparing detected variants against the ground truth.
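As a rough illustration of the benchmarking step, the comparison between detected variants and the simulator's ground truth can be expressed as set arithmetic. This is a minimal sketch under stated assumptions, not the EasyGiraffe code: the variant key `(sample, chromosome, position, ref, alt)`, the function name `benchmark_calls`, and the toy records are all hypothetical.

```python
def benchmark_calls(truth, called):
    """Compare called variants against simulated ground truth.

    Both arguments are iterables of hashable variant keys, assumed here to be
    (sample, chromosome, position, ref, alt) tuples.
    """
    truth, called = set(truth), set(called)
    true_pos = len(truth & called)    # variants correctly detected
    false_pos = len(called - truth)   # calls with no matching planted variant
    false_neg = len(truth - called)   # planted variants that were missed
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data: four planted variants, four calls, three of which match.
TRUTH = [
    ("S1", "chr1", 1000, "A", "G"),
    ("S1", "chr2", 2000, "C", "T"),
    ("S2", "chr1", 1000, "A", "G"),
    ("S2", "chr3", 3000, "G", "A"),
]
CALLED = [
    ("S1", "chr1", 1000, "A", "G"),
    ("S1", "chr2", 2000, "C", "T"),
    ("S2", "chr1", 1000, "A", "G"),
    ("S2", "chr4", 4000, "T", "C"),  # not in the truth set
]

metrics = benchmark_calls(TRUTH, CALLED)
print(metrics)
```

Exact-match comparison is the simplest choice. Real benchmarking tools typically also normalize variant representation before matching, which this sketch does not attempt.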

The Model Integration and Data Assembly System (MIDAS) turns heterogeneous datasets into a modular KG that can then be connected to an external KG. This makes current resources interoperable and creates scalable pathways for integrating future datasets. The system can be accessed through an agentic LLM with GraphRAG, so users can pose natural language questions (e.g., “Which therapies target Epidermal Growth Factor Receptor (EGFR) variants on chromosome 7?”) and receive evidence-grounded responses. This approach makes the KG directly usable in AI-driven discovery workflows, enabling clinicians, bioinformaticians, and data scientists to interactively explore complex biological relationships at scale.

The KG-LLM Garbage Collection Tool uses a combination of human review, grounded AI, and graph learning to identify and prune erroneous edges in biomedical knowledge graphs, improving their accuracy and trustworthiness in a more automated human-in-the-loop fashion. The prototype includes a conceptual front-end designed to bridge human and AI-assisted curation workflows, offering an intuitive environment where users can explore, validate, and refine graph-derived edges generated by the Graph Neural Network (GNN) models.
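One way to picture a human-in-the-loop pruning workflow is a three-way triage on per-edge confidence scores: keep high-scoring edges, drop low-scoring ones, and route the uncertain middle band to a curator. The sketch below assumes that design; it is not the team's implementation. The thresholds, the hard-coded scores (which stand in for GNN output), and the helpers `triage_edges` and `apply_review` are hypothetical.

```python
def triage_edges(scored_edges, prune_below=0.2, keep_above=0.8):
    """Split (edge, score) pairs into kept, needs-review, and pruned lists.

    Scores are assumed to be model confidences in [0, 1]; both thresholds
    are illustrative defaults, not values from the hackathon project.
    """
    if not 0.0 <= prune_below <= keep_above <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= prune_below <= keep_above <= 1")
    kept, review, pruned = [], [], []
    for edge, score in scored_edges:
        if score >= keep_above:
            kept.append(edge)      # confident enough to keep automatically
        elif score < prune_below:
            pruned.append(edge)    # confident enough to drop automatically
        else:
            review.append(edge)    # uncertain: send to a human curator
    return kept, review, pruned


def apply_review(review, rejected):
    """Keep reviewed edges unless a human curator rejected them."""
    return [edge for edge in review if edge not in rejected]


# Placeholder edges with made-up scores standing in for model output.
SCORED = [
    (("DRUG_X", "targets", "GENE_A"), 0.95),
    (("GENE_A", "associated_with", "DISEASE_D"), 0.55),
    (("GENE_B", "associated_with", "DISEASE_D"), 0.40),
    (("DRUG_Y", "targets", "DISEASE_D"), 0.05),
]

kept, review, pruned = triage_edges(SCORED)
curated = kept + apply_review(
    review, rejected={("GENE_B", "associated_with", "DISEASE_D")}
)
print(curated)
```

The middle band is where curator time is spent, so widening or narrowing the two thresholds trades automation against review effort.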

BioGraphRAG bridges biomedical data and publication KGs with GraphRAG, so that the natural language responses are generated solely from trusted knowledge sources. The system is designed to transform a free-text biomedical question into a grounded, citation-supported answer. Optionally, the output can include an evidence table summarizing the supporting snippets. This pipeline integrates graph-based retrieval, neural pruning, and language generation to provide accurate, evidence-linked biomedical question answering.
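The question-to-grounded-answer flow can be sketched in three steps: retrieve matching facts, tabulate the evidence, and assemble a prompt that restricts the model to those facts. The sketch below is a simplification and not the BioGraphRAG pipeline: it swaps graph-based retrieval and neural pruning for naive substring matching, stops before any LLM call, and uses hypothetical triples, source IDs, and function names (`retrieve`, `evidence_table`, `build_prompt`).

```python
# Placeholder triples with made-up source IDs and snippets.
TRIPLES = [
    {"subject": "DRUG_X", "relation": "targets", "object": "GENE_A",
     "source": "PUB_001", "snippet": "DRUG_X binds GENE_A."},
    {"subject": "GENE_A", "relation": "associated_with", "object": "DISEASE_D",
     "source": "PUB_002", "snippet": "GENE_A is linked to DISEASE_D."},
    {"subject": "DRUG_Y", "relation": "targets", "object": "GENE_B",
     "source": "PUB_003", "snippet": "DRUG_Y binds GENE_B."},
]


def retrieve(question, triples):
    """Return triples whose subject or object is mentioned in the question.

    Naive case-insensitive substring match, used here in place of real
    graph retrieval and pruning.
    """
    text = question.lower()
    return [
        t for t in triples
        if t["subject"].lower() in text or t["object"].lower() in text
    ]


def evidence_table(hits):
    """Render the supporting snippets as a pipe-delimited text table."""
    header = "citation | subject | relation | object | snippet"
    rows = [
        f'{t["source"]} | {t["subject"]} | {t["relation"]} | {t["object"]} | {t["snippet"]}'
        for t in hits
    ]
    return "\n".join([header] + rows)


def build_prompt(question, hits):
    """Assemble a prompt that limits the model to the retrieved, cited facts."""
    facts = "\n".join(
        f'- {t["subject"]} {t["relation"]} {t["object"]} [{t["source"]}]'
        for t in hits
    )
    return (
        "Answer using only the facts below and cite the bracketed sources.\n"
        f"Facts:\n{facts}\n"
        f"Question: {question}"
    )


question = "Which drugs target GENE_A?"
hits = retrieve(question, TRIPLES)
prompt = build_prompt(question, hits)
table = evidence_table(hits)
print(prompt)
print(table)
```

Because every fact in the prompt carries its source ID, citations in the generated answer can be checked against the evidence table.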

For more information on these projects, please refer to the preprint published by the participants in BioHackrXiv: A Blueprint for Open Science: How Transatlantic Teams Built and Deployed Knowledge Graphs to Enable Biological (AI) Models.

Figure 2. Hackathon participants introducing themselves on day one. Fifty-three researchers from the US and the UK attended the event.

Figure 3. NVIDIA Global Omics Alliances Manager Dr. Ben Busby provided scientific direction for the projects. The teams worked on seven different projects related to knowledge graphs in biomedical data.

Figure 4. AWS Open Data solutions architect Cristian Chicas explaining how to access data from AWS Open Data.

Summary

The projects demonstrated significant advances in creating more reliable AI systems for biomedical research. The teams’ approach to data integration addresses a critical challenge in biomedical research: accessing and connecting information across siloed datasets. Teams successfully built knowledge graphs that bring together data from genomics, proteomics, and scientific literature—making it possible to discover relationships within these disparate datasets.

These solutions were also built with scalability and transparency in mind. The cloud-native architecture enables deployment at scale, while the open-source nature of the projects lets other researchers build upon and improve these foundations. The combination of improved accuracy, enhanced transparency, and scalable architecture represents a significant step forward in making AI systems more trustworthy and useful for biomedical research.

Next steps

Looking ahead, each team has identified prospective directions and development priorities to extend these prototypes into reproducible, scalable, and open-source solutions for the biomedical community. Proposed directions include incorporating new data sources and relationship types, exploring advanced GraphRAG capabilities, and investigating novel retrieval methods that could further improve the accuracy and relevance of AI-generated responses. Collectively, these efforts aim to advance these solutions toward robust, continually learning frameworks for transparent and trustworthy generative AI in biomedicine.

Hackathon participants by team:

  1. GeNETwork – Deanne Taylor, Taha Mohseni Ahooyi, Polina Rusina, Chantera Lazard, Vivien Ho, Christine Withers, Seeta Ramaraju Pericherla, Karthick Subramanian, Sangeeta Shukla, Benjamin Wingfield
  2. ECoGraph – Fateema Bazzi, Kara Quaid, Nishat Anjum Bristy, Raghu Arrola, Carmen Diaz Soria, Arshi Arora, Jack Murphy
  3. ClassiGraph – Anurag Limdi, Marcin Domagalski, Bijan Paul, Shakuntala Mitra, Irene Lopez Santiago, Camille Daniels, Andrew Scouten, Raghavendra Kini, Simone Weyand
  4. EasyGiraffe – Cheng-Han Chung, Yaphet Kebede, Radu Robotin, Shilpa Sundar, David Yuan
  5. MIDAS – Evan Morris, Toshiaki Katayama, Viren Amin, Maria Kim, Daniall Masood, Likhitha Surapaneni, Kathleen Carter
  6. KG Model Garbage Collection – Evan Molinelli, Van Truong, Anne Ketter, Allen Baron, Yibei Chen, Samarpan Mohanty
  7. BioGraphRAG – Ben Stear, Jean-Paul Courneya, Mina Peyton, Arun Bondali, Aymen Maqsood Mulbagal, Michelle Holko, Victor Felix

AWS and NVIDIA technical support:

AWS solutions architects: Taylor Teske, Archana Sharma, Emmanuela Derival, Jessie Johnson, Ashley Chen, Kayla Taylor, Cristian Chicas, Gargi Singh Chhatwal, Nate Haynes, and Gregg Grieff

NVIDIA: Rishi Puri, lead engineer for PyTorch Geometric (PyG) and lead researcher for PyG RAG (GNN+LLM)

Ben Busby

Ben is a renowned leader in bioinformatics, computational biology, and interdisciplinary data science. With a career spanning academia, industry, and government, he is dedicated to making biomedical data science a more productive and collaborative environment for bioinformaticians and data scientists. He currently serves as global alliances manager, omics at NVIDIA, where he drives strategic collaborations and innovations in genomics, AI, and biomedical research. He is also an adjunct faculty member in the Computational Biology Department at Carnegie Mellon University, contributing to cutting-edge research and education.

Beryl Rabindran

Beryl is the life sciences lead for AWS Open Data. Beryl is a cell biologist by training and led clinical research for a medical technology AI startup in cancer imaging before joining AWS. She is passionate about working directly with researchers from around the world to grow the community of open life sciences data users.

Cristian Chicas

Cristian is a solutions architect at AWS based in Arlington, VA, where he serves as the AWS technical lead on the Open Data team within the worldwide public sector. With a deep passion for machine learning and artificial intelligence, Cristian bridges the gap between innovation and implementation. He helps data providers onboard high-value, cloud-optimized datasets while empowering data consumers to unlock maximum value from open data. In his free time, Cristian enjoys spending time with his family and dog, pursuing his goal of visiting every US national park, and staying active by playing sports.

Emily Richardson

Emily is a genomics solutions architect at AWS based in the UK. She has a background in microbial genomics. Before joining AWS, she was head of bioinformatics at MicrobesNG, a spinout from the University of Birmingham.