AWS Public Sector Blog

Leveraging generative AI to accelerate public health genomics data standardization

AWS Branded Background with text "Leveraging generative AI to accelerate public health genomics data standardization"

In the fields of precision medicine and pathogen surveillance, genomic sequencing has emerged as a critical component of public health research and response. Yet the advancement and application of genomics faces challenges in data interoperability and standardization. Public health laboratories across the United States process tens of thousands of genomic samples monthly, with each laboratory using locally optimized schemas and formats that must be standardized before communication with public health partners and submission to public data repositories.

The scale of this challenge is substantial. Public health laboratory staff can spend up to 2–4 hours manually preparing data for submission to a public genomic data repository like the National Center for Biotechnology Information (NCBI). This translates to over 400 hours annually per laboratory—time that could be spent on critical analysis and response activities. More concerning, these manual processes can introduce errors that result in submission rejections or corrections, creating delays in releasing vital genomic data to the broader scientific community.

To overcome this growing challenge, the Digital Transformation Hub (DxHub) at California Polytechnic State University (Cal Poly)—powered by Amazon Web Services (AWS) and part of the AWS Cloud Innovation Centers (CIC) program—took action. The DxHub joined forces with the Wisconsin State Laboratory of Hygiene (WSLH) and Virginia Department of General Services Division of Consolidated Laboratory Services (DCLS) to create AI Genomic Schema Harmonizer, a generative AI–powered application that revolutionizes how laboratories prepare genomic data for submission to public repositories.

AI Genomics Schema Harmonizer

AI Genomics Schema Harmonizer reduces the burden of genomics data reporting. Rather than the typical method of managing data in spreadsheets, the application uses the power of generative AI to automatically align diverse laboratory terminologies with standardized or required formats. Through natural language processing capabilities, AI Genomics Schema Harmonizer analyzes laboratory-specific terms and matches them to standardized NCBI BioSample definitions, providing consistent and accurate data submission.

The solution’s impact readily extends beyond simple field mapping. By streamlining the standardization process, AI Genomics Schema Harmonizer advances a broader initiative in healthcare and life sciences: achieving semantic interoperability. This capability means that genomic data can be seamlessly shared, understood, and utilized across different systems and organizations—a critical requirement for modern public health disease surveillance and research.

Solution overview

The application’s architecture is API-driven and uses several AWS managed services for security, scalability, and reliability. Amazon Bedrock and Anthropic’s Claude Sonnet 3.5 v2 foundation model (FM) power the secure processing of sensitive genomic metadata. AWS Lambda and Amazon API Gateway provide the backend application logic, and data mapping definitions are securely stored using Amazon Simple Storage Service (Amazon S3). The following diagram is the solution architecture.

Figure 1. Solution architecture diagram

The system’s sophisticated natural language processing capabilities enable it to parse and analyze source data structure and terminology and then match fields with corresponding NCBI standardized terms using comprehensive definition libraries. Scientists do a final pass to validate mappings against current NCBI requirements and generate submission-ready files that meet federal repository standards for quick and compliant submissions.

“AI Genomics Schema Harmonizer has the potential to transform the genomic data submission process,” said Dr. Kelsey Florek, WSLH senior genomics and data scientist. “Replacing manual data transformations and maintenance of custom macros and scripts with a simple and broadly applicable approach allows our team to focus on the critical work of genomic analysis rather than data formatting.”

The DxHub student experience

At the Cal Poly DxHub, students work alongside university staff and Amazon employees to develop innovative solutions for public sector challenges using the Amazon Working Backwards methodology. For Noor Dhaliwal, the student developer behind GenomicsMetaDataMapper, the experience opened new horizons in applied AI.

“Working at the DxHub gave me the opportunity to build a solution that impacts public health laboratories nationwide,” said Dhaliwal. “Learning to harness cutting-edge AI technologies like Amazon Bedrock while solving real-world problems has been transformative. The experience of creating a tool that helps accelerate critical public health work has shown me the true potential of cloud computing and AI in the public sector.”

Measurable impact and future direction

Early implementation results demonstrate the AI Genomics Schema Harmonizer’s transformative potential. Measurable impacts include eliminating the need to develop spreadsheet-based formulas and macros, increasing accuracy by eliminating typographical errors, and saving a potential 2–4 hours per week per data submission. Although this proof of concept was focused on a widespread issue affecting the submission of genomic data, it has broad applicability to any number of genomic data interfaces.

As genomic sequencing becomes increasingly central to public health infectious disease surveillance and clinical diagnostics, the need for efficient processing of data transformation will only grow. The partnership between the DxHub and public health laboratories demonstrates how generative AI can be used to solve complex interoperability challenges while maintaining the high standards required for scientific research.

To learn more about AI Genomics Schema Harmonizer and how it could help your lab or agency, email Dr. Kelsey Florek.

Contributing authors: Nick Osterbur, Dr. Dawn Heisey-Grove, and Darren Kraker

About the Cal Poly DxHub

Launched in 2017 as the first CIC housed in an institution of higher education, the Cal Poly DxHub provides opportunities for nonprofits, educational bodies, and government agencies to collaborate on their most pressing challenges, test new ideas, and access the technological expertise of AWS to help create cloud-based solutions. To learn more about the Cal Poly Digital Transformation Hub and how your organization can engage, reach out to Nick Osterbur (nosterb@amazon.com) or visit the DxHub website.

Learn more about the AWS Cloud Innovation Centers program.

Noor Dhaliwal

Noor Dhaliwal

Noor is a computer engineering student at California Polytechnic State University, San Luis Obispo, and currently serves as a software engineering intern at the Cal Poly AWS Cloud Innovation Center (CIC). His work focuses on developing technical solutions for complex challenges encountered by public sector organizations.

Dr. Kelsey Florek

Dr. Kelsey Florek

Kelsey is a senior genomics and data scientist who leads the Bioinformatics Team within the Communicable Disease Division at the Wisconsin State Laboratory of Hygiene. Her work focuses on expanding equitable access to genomic data analytics and strengthening the integration of genomics in public health practice. Kelsey collaborates closely with public health partners to build workforce capacity and advance the use of actionable genomic insights in infectious disease surveillance and response.

Logan Fink

Logan Fink

Logan is a bioinformatics lead scientist at the Department of General Service's Division of Consolidated Laboratory Services (DGS-DCLS). He also serves as the Bioinformatic Regional Resource for the mid-Atlantic region of state public health labs, where he coordinates trainings and offers support in the field of public health bioinformatics. Logan is excited to advance public health initiatives by finding ways to utilize cloud computing technologies and by fostering innovative approaches to problem-solving.