AWS for Industries

Accelerating Genomic Data Discovery with AI-Powered Natural Language Queries

In the rapidly evolving field of Life Sciences, researchers and healthcare professionals face the challenge of efficiently accessing and analyzing vast amounts of complex genomic, clinical, and imaging data. Traditional data querying methods often require specialized technical knowledge of SQL and database structures, creating bottlenecks in research workflows and limiting the accessibility of valuable insights.

With a recent release, Amazon Bedrock Knowledge Bases now supports natural language querying to retrieve structured data from your data sources, such as Amazon Redshift. Researchers can now ask questions in natural language such as, “What is the top gene mutation found in all patients” or “Give me all information on OR6Y1 gene,” and receive precise data from their genomic databases, patient records, and medical imaging repositories. Amazon Bedrock Knowledge Bases automatically translates these natural language queries into optimized SQL statements.

This approach accelerates research workflows, enabling faster discoveries and more efficient clinical decision-making.

We will explore how Amazon Bedrock Knowledge Bases can be implemented to transform the way Life Sciences organizations interact with their valuable data assets stored in Amazon Redshift.

Solution Overview

To illustrate this feature, we will build a solution using sample patient genomics data and set up Amazon Redshift as the knowledge base data store. This enables users and applications to access the information using natural language prompts. Figure 1 provides an overview of the solution.

Figure 1 - Solution Architecture for Genomics Data Analysis Using Natural Language


The steps to build and run the solution are the following:

1. Load patients’ data: Load the sample patient genomics data into Amazon Redshift using the COPY command.
2. Set up knowledge base: Configure Amazon Redshift as a knowledge base in Amazon Bedrock, grant access, and sync the metadata.
3. Prompt in natural language: The user or application sends prompts in natural language. (In this overview, we illustrate this using the built-in testing interface.)
4. Generate and run the query: Amazon Bedrock generates a SQL query from the prompt and the Amazon Redshift metadata, then runs the query on the Amazon Redshift instance.
5. Return results: The results of the query are returned from Amazon Redshift.
6. Return response in natural language: Amazon Bedrock interprets the tabular results and translates them into a natural language response.
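Steps 3 through 6 can also be driven from an application. The following is a minimal sketch using the AWS SDK for Python (Boto3); the knowledge base ID and model ARN shown in the docstring are placeholders, not values from this tutorial.

```python
def build_request(prompt, kb_id, model_arn):
    """Assemble a RetrieveAndGenerate request for a structured knowledge base.

    kb_id and model_arn are placeholders, e.g. "KB123456" and the ARN of the
    foundation model you enabled (such as Claude 3.5 Sonnet).
    """
    return {
        "input": {"text": prompt},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }


def ask(prompt, kb_id, model_arn):
    """Send the prompt; Bedrock generates SQL, runs it on Amazon Redshift,
    and returns a natural language answer (steps 4-6 above)."""
    import boto3  # assumes AWS credentials with Bedrock access are configured

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(**build_request(prompt, kb_id, model_arn))
    return response["output"]["text"]
```

Calling `ask("what is the top gene mutation found in all patients", kb_id, model_arn)` would return the natural language answer once the knowledge base in the following steps is in place.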

Implementation

The following tutorial will walk you through the process of loading sample patient data from data files in an Amazon Simple Storage Service (Amazon S3) bucket into your Amazon Redshift database tables, and then configuring Amazon Bedrock Knowledge Bases for natural language interactions with the data.

Step 1: Download the data files

Download a set of sample data files to your computer. Next, upload the files to an S3 bucket.

1. Download the zipped file: samplepatientdata.zip. Data sources attribution: The clinical datasets were generated using Synthea. The OMICS and Images data were sourced from The Cancer Genome Atlas (TCGA) open data sets.
2. Extract the files to a folder on your computer.

Step 2: Upload the files to an S3 bucket

Create an S3 bucket and upload the data files to the bucket.

1. Create a bucket in Amazon S3. For more information about creating a bucket, see Creating a bucket.
2. Upload the data files to the new S3 bucket. In the Upload wizard, choose Add files. Follow the Amazon S3 console instructions to upload all of the files you downloaded and extracted.
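If you prefer to script this step instead of using the Upload wizard, the sketch below uploads every extracted file with Boto3. The bucket name and key prefix here are illustrative placeholders, not values mandated by the tutorial.

```python
from pathlib import Path


def upload_plan(folder, prefix="samplepatientdata/"):
    """Map each extracted data file to its destination S3 object key."""
    files = sorted(Path(folder).glob("*"))
    return [(str(f), f"{prefix}{f.name}") for f in files if f.is_file()]


def upload_all(folder, bucket):
    """Upload all files in the folder to the bucket (assumes AWS credentials
    with s3:PutObject permission are configured)."""
    import boto3

    s3 = boto3.client("s3")
    for local_path, key in upload_plan(folder):
        s3.upload_file(local_path, bucket, key)
```

For example, `upload_all("samplepatientdata", "my-genomics-bucket")` uploads the extracted files under the `samplepatientdata/` prefix.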

Step 3: Create a Redshift Serverless instance

Create an Amazon Redshift Serverless instance, create tables, and load the data from the S3 bucket.

1. Follow Creating a data warehouse with Amazon Redshift Serverless documentation to create the data warehouse instance.
2. Download the SQL file SQL.txt to your computer. Replace “S3://redshift-kb-bedrock-logdata” with the URI of the S3 bucket where you uploaded the data files from Step 1.
3. Open Redshift Query Editor V2 by choosing Query data, and connect to your Amazon Redshift Serverless instance using the current admin credentials.
4. Run all the SQL commands in the SQL.txt file you downloaded. This creates the tables and loads the data into them from your S3 bucket. Verify that the following tables are created and populated: patient_reference_data_rs, patients_rs, gene_mutation_rs, gene_copy_number_rs, and image_data_rs.
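The CREATE TABLE and COPY commands in SQL.txt are authoritative; purely as an illustration of the load pattern, the sketch below generates one COPY command per table. The file names, CSV format, and IAM role ARN are assumptions, so use the statements in SQL.txt for the actual load.

```python
# The five tables created by SQL.txt.
TABLES = [
    "patient_reference_data_rs",
    "patients_rs",
    "gene_mutation_rs",
    "gene_copy_number_rs",
    "image_data_rs",
]


def copy_statement(table, bucket, iam_role, file_name=None):
    """Build a Redshift COPY command that loads one table from S3."""
    file_name = file_name or f"{table}.csv"  # assumed naming; see SQL.txt
    return (
        f"COPY {table} FROM 's3://{bucket}/{file_name}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1;"
    )


statements = [
    copy_statement(t, "my-genomics-bucket", "arn:aws:iam::123456789012:role/RedshiftLoadRole")
    for t in TABLES
]
```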

Step 4: Set up Amazon Bedrock Knowledge Bases

Create Amazon Bedrock Knowledge Bases for the Amazon Redshift database and sync the data.

1. Prerequisite 1: If you are using an Amazon Web Services (AWS) Identity and Access Management (IAM) role, it needs the appropriate policy permissions attached before it can run operations on Amazon Bedrock Knowledge Bases. Follow Prerequisites for creating an Amazon Bedrock knowledge base with a structured data store for instructions.

Prerequisite 2: If you are creating a knowledge base through the AWS Management Console, you can skip setting up a service role; the console automatically creates one with the permissions Amazon Bedrock Knowledge Bases needs to retrieve data from your new knowledge base and generate SQL queries for structured data stores. If you are not using the AWS Management Console, be sure to follow the Prerequisite 1 instructions.

2. Create your knowledge base. You can now incorporate a structured data store when setting up a knowledge base by selecting that option.

Figure 2 – Create Knowledge Base with structured data store

After naming and describing the knowledge base, select Amazon Redshift as the query engine and create a new IAM service role for resource management before proceeding to the next step. Note the name of the new IAM role.

Figure 3 – Select Amazon Redshift as the query engine and create a new IAM service role

3. Within the connection settings, select Redshift Serverless (Redshift Provisioned is also supported) with your chosen Workgroup. Authenticate using the previously created IAM role, and choose a metadata database from your Amazon Redshift database options. We chose ‘dev’ for this tutorial.

Figure 4 – Select Redshift Serverless as the query engine, the IAM role for authentication, and the ‘dev’ database for storing metadata

4. Provide the IAM role with access permissions to retrieve data from selected tables by running the GRANT command on the Amazon Redshift database. You can scope access to specific databases, tables, rows, or columns. For example: GRANT SELECT on dev.public.patient_reference_data_rs to “IAMR:AmazonBedrockExecutionRoleForKnowledgeBase_izzap”;
5. For this tutorial, grant this permission on all of the tables created earlier, replacing “AmazonBedrockExecutionRoleForKnowledgeBase_izzap” with the name of the IAM role you noted earlier in this step.
6. Sync your Amazon Redshift database with your knowledge base. Select Knowledge Bases and choose your knowledge base. In the query engine section, select the Amazon Redshift database source and choose Sync. Once the sync is complete, the status shows COMPLETE. Note that whenever you modify your database schema, you need to sync the changes again.
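Rather than typing five GRANT statements by hand for steps 4 and 5, they can be generated in one pass. A small sketch, assuming the placeholder role name from the example above (Redshift identifies IAM roles with the “IAMR:” prefix):

```python
# The five tables loaded in Step 3.
TABLES = [
    "patient_reference_data_rs",
    "patients_rs",
    "gene_mutation_rs",
    "gene_copy_number_rs",
    "image_data_rs",
]


def grant_statements(tables, iam_role, database="dev", schema="public"):
    """GRANT SELECT on each table to the knowledge base's IAM service role."""
    return [
        f'GRANT SELECT ON {database}.{schema}.{table} TO "IAMR:{iam_role}";'
        for table in tables
    ]


# Replace the role name with the one you noted when creating the knowledge base.
for stmt in grant_statements(TABLES, "AmazonBedrockExecutionRoleForKnowledgeBase_izzap"):
    print(stmt)
```

Paste the printed statements into Redshift Query Editor V2 and run them.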

Figure 5 – Sync the Amazon Redshift database
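Syncs can also be triggered programmatically, which is useful after schema changes. A minimal sketch using Boto3's bedrock-agent client; the knowledge base and data source IDs are placeholders, and using StartIngestionJob for the re-sync is an assumption to verify against the API reference.

```python
def sync_params(kb_id, data_source_id):
    """Parameters for StartIngestionJob, which re-syncs the metadata."""
    return {"knowledgeBaseId": kb_id, "dataSourceId": data_source_id}


def sync_knowledge_base(kb_id, data_source_id):
    """Kick off a metadata sync and return the job status."""
    import boto3  # assumes AWS credentials with Bedrock permissions

    client = boto3.client("bedrock-agent")
    job = client.start_ingestion_job(**sync_params(kb_id, data_source_id))
    return job["ingestionJob"]["status"]
```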

Step 5: Test the Amazon Bedrock Knowledge Bases for Amazon Redshift database
Run queries against the newly created Amazon Bedrock knowledge base for the Amazon Redshift database. You can set up your application to query the knowledge base, or attach the knowledge base to an agent by proceeding to Deploy your knowledge base for your AI application. For this tutorial, use the native testing interface in the Amazon Bedrock Knowledge Bases console.

1. Choose Test to test the knowledge base by running a query and generating responses.

Figure 6 – Test the Knowledge Base

2. Turn on the Generate responses toggle and select a foundation model from the Amazon Bedrock model providers. You may need to request access to Amazon Bedrock foundation models. For this tutorial, we selected Claude 3.5 Sonnet from Anthropic.

Figure 7 – Select Claude 3.5 Sonnet foundation model

3. Test queries in natural language. The following are some sample queries; you can also try queries specific to your own data.

Query 1 – ‘what is the top gene mutation found in all patients’

Query 2 – ‘what are start and end columns for OR10R2 gene’

Query 3 – ‘give me all information on OR6Y1 gene’

Query 4 – ‘give me all information about patient [Noel608] Wolf938’

Considerations

As you explore this new feature of Amazon Bedrock Knowledge Bases, consider the following three design aspects.

1 – Data sources, structures and analysis

1. Data types and data sources: Users are no longer constrained by a lack of knowledge of a specific language like SQL; the natural language interface translates requests into the commands Amazon Redshift requires.
2. Metadata accuracy: Depending on the accuracy of the metadata (data structure information), users may sometimes need to reword analysis requests (prompts). Provide descriptive column names so that the model has the context to interpret the metadata correctly.
3. Data sources: Data from several sources can be made available in Amazon Redshift for quick exploration and testing.

2 – Cost considerations

Because this solution uses a serverless architecture, customers pay only for what they use. Explore Amazon Bedrock pricing and Amazon Redshift Serverless cost considerations.

3 – Security, privacy, safety and compliance

1. Security of data: Amazon Bedrock provides the ability to encrypt data from knowledge base sources. Amazon Redshift also provides comprehensive data security, including fine-grained access control.
2. Hallucinations, privacy, data sensitivity, security, and safety: Amazon Bedrock Guardrails provides configurable safeguards to help safely build generative AI applications at scale. With a consistent, standard approach applied across components like foundation models (FMs), Amazon Bedrock Guardrails delivers industry-leading safety protections for generative AI use cases.
3. Data refresh: Knowledge base data can be synced to pick up data set updates on a schedule or as needed.

Conclusion

We demonstrated how the natural language querying capability of Amazon Bedrock Knowledge Bases with Amazon Redshift enables rapid development. Life Sciences organizations can use this solution to solve ‘access and analysis’ challenges for researchers and healthcare professionals who need to efficiently leverage vast amounts of complex genomic, clinical, and imaging data. Natural language interaction removes the knowledge barrier of SQL and database structures, accelerating decision making and innovation.

AWS experts are here to help your company rapidly build data-driven products and solutions using AWS analytics and generative AI services like Amazon Bedrock and Amazon Redshift. Contact an AWS Representative to learn how we can help accelerate your business.

Further reading

  • AWS provides prescriptive guidance for generative AI and natural language processing (NLP) for analysis of large data sets. Depending on the NLP task you want to perform, different architectures might be best suited for your use case.
  • From pharma R&D to the point-of-care, learn the latest on Generative AI and Healthcare and Life Sciences for innovations and improved patient experiences.
  • Accelerate access to and insights from your first-party, third-party, and multi-modal data with the most comprehensive set of data capabilities and deepest set of artificial intelligence and machine learning (AI/ML) services with AWS Health Data Portfolio.
Anuj Patel

Anuj Patel is a Senior Solutions Architect at AWS. He has an M.S. in Computer Science and over two decades of experience in software design and development. He helps customers in the Life Sciences industry through their AWS journey. His passion lies in simplifying complex problems, enabling customers to unlock the full potential of AWS solutions.

Shailesh Doshi

Shailesh Doshi is a Senior Analytics Specialist Solutions Architect focused on analytics, machine learning, and generative AI at Amazon Web Services. Shailesh helps customers architect solutions that accelerate data-driven business outcomes. He has over 30 years of data and analytics experience in customer-facing roles. He works with customers in industry verticals such as healthcare, medical devices, life sciences, insurance, retail, finance, manufacturing, and supply chain.

Satesh Sonti

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 17 years of experience building data assets and leading complex data platform programs for banking and insurance clients across the globe.