AWS Public Sector Blog
Leveraging generative AI to accelerate public health genomics data standardization
In the fields of precision medicine and pathogen surveillance, genomic sequencing has emerged as a critical component of public health research and response. Yet the advancement and application of genomics faces challenges in data interoperability and standardization. Public health laboratories across the United States process tens of thousands of genomic samples monthly, with each laboratory using locally optimized schemas and formats that must be standardized before communication with public health partners and submission to public data repositories.
The scale of this challenge is substantial. Public health laboratory staff can spend up to 2–4 hours manually preparing data for submission to a public genomic data repository like the National Center for Biotechnology Information (NCBI). This translates to over 400 hours annually per laboratory—time that could be spent on critical analysis and response activities. More concerning, these manual processes can introduce errors that result in submission rejections or corrections, creating delays in releasing vital genomic data to the broader scientific community.
To overcome this growing challenge, the Digital Transformation Hub (DxHub) at California Polytechnic State University (Cal Poly)—powered by Amazon Web Services (AWS) and part of the AWS Cloud Innovation Centers (CIC) program—took action. The DxHub joined forces with the Wisconsin State Laboratory of Hygiene (WSLH) and Virginia Department of General Services Division of Consolidated Laboratory Services (DCLS) to create AI Genomic Schema Harmonizer, a generative AI–powered application that revolutionizes how laboratories prepare genomic data for submission to public repositories.
AI Genomics Schema Harmonizer
AI Genomics Schema Harmonizer reduces the burden of genomics data reporting. Rather than the typical method of managing data in spreadsheets, the application uses the power of generative AI to automatically align diverse laboratory terminologies with standardized or required formats. Through natural language processing capabilities, AI Genomics Schema Harmonizer analyzes laboratory-specific terms and matches them to standardized NCBI BioSample definitions, providing consistent and accurate data submission.
The solution’s impact readily extends beyond simple field mapping. By streamlining the standardization process, AI Genomics Schema Harmonizer advances a broader initiative in healthcare and life sciences: achieving semantic interoperability. This capability means that genomic data can be seamlessly shared, understood, and utilized across different systems and organizations—a critical requirement for modern public health disease surveillance and research.
Solution overview
The application’s architecture is API-driven and uses several AWS managed services for security, scalability, and reliability. Amazon Bedrock and Anthropic’s Claude Sonnet 3.5 v2 foundation model (FM) power the secure processing of sensitive genomic metadata. AWS Lambda and Amazon API Gateway provide the backend application logic, and data mapping definitions are securely stored using Amazon Simple Storage Service (Amazon S3). The following diagram is the solution architecture.
The system’s sophisticated natural language processing capabilities enable it to parse and analyze source data structure and terminology and then match fields with corresponding NCBI standardized terms using comprehensive definition libraries. Scientists do a final pass to validate mappings against current NCBI requirements and generate submission-ready files that meet federal repository standards for quick and compliant submissions.
“AI Genomics Schema Harmonizer has the potential to transform the genomic data submission process,” said Dr. Kelsey Florek, WSLH senior genomics and data scientist. “Replacing manual data transformations and maintenance of custom macros and scripts with a simple and broadly applicable approach allows our team to focus on the critical work of genomic analysis rather than data formatting.”
The DxHub student experience
At the Cal Poly DxHub, students work alongside university staff and Amazon employees to develop innovative solutions for public sector challenges using the Amazon Working Backwards methodology. For Noor Dhaliwal, the student developer behind GenomicsMetaDataMapper, the experience opened new horizons in applied AI.
“Working at the DxHub gave me the opportunity to build a solution that impacts public health laboratories nationwide,” said Dhaliwal. “Learning to harness cutting-edge AI technologies like Amazon Bedrock while solving real-world problems has been transformative. The experience of creating a tool that helps accelerate critical public health work has shown me the true potential of cloud computing and AI in the public sector.”
Measurable impact and future direction
Early implementation results demonstrate the AI Genomics Schema Harmonizer’s transformative potential. Measurable impacts include eliminating the need to develop spreadsheet-based formulas and macros, increasing accuracy by eliminating typographical errors, and saving a potential 2–4 hours per week per data submission. Although this proof of concept was focused on a widespread issue affecting the submission of genomic data, it has broad applicability to any number of genomic data interfaces.
As genomic sequencing becomes increasingly central to public health infectious disease surveillance and clinical diagnostics, the need for efficient processing of data transformation will only grow. The partnership between the DxHub and public health laboratories demonstrates how generative AI can be used to solve complex interoperability challenges while maintaining the high standards required for scientific research.
To learn more about AI Genomics Schema Harmonizer and how it could help your lab or agency, email Dr. Kelsey Florek.
Contributing authors: Nick Osterbur, Dr. Dawn Heisey-Grove, and Darren Kraker
About the Cal Poly DxHub
Launched in 2017 as the first CIC housed in an institution of higher education, the Cal Poly DxHub provides opportunities for nonprofits, educational bodies, and government agencies to collaborate on their most pressing challenges, test new ideas, and access the technological expertise of AWS to help create cloud-based solutions. To learn more about the Cal Poly Digital Transformation Hub and how your organization can engage, reach out to Nick Osterbur (nosterb@amazon.com) or visit the DxHub website.