AWS Machine Learning Blog
Build an intelligent search solution with automated content enrichment
Unstructured data belonging to the enterprise continues to grow, making it a challenge for customers and employees to get the information they need. Amazon Kendra is a highly accurate intelligent search service powered by machine learning (ML). It helps you easily find the content you’re looking for, even when it’s scattered across multiple locations and content repositories.
Amazon Kendra leverages deep learning and reading comprehension to deliver precise answers. It offers natural language search for a user experience that’s like interacting with a human expert. When documents don’t have a clear answer or if the question is ambiguous, Amazon Kendra returns a list of the most relevant documents for the user to choose from.
To help narrow down a list of relevant documents, you can assign metadata at the time of document ingestion to provide filtering and faceting capabilities, for an experience similar to the Amazon.com retail site where you’re presented with filtering options on the left side of the webpage. But what if the original documents have no metadata, or users have a preference for how this information is categorized? You can automatically generate metadata using ML in order to enrich the content and make it easier to search and discover.
This post outlines how you can automate and simplify metadata generation using Amazon Comprehend Medical, a natural language processing (NLP) service that uses ML to find insights related to healthcare and life sciences (HCLS) such as medical entities and relationships in unstructured medical text. The metadata generated is then ingested as custom attributes alongside documents into an Amazon Kendra index. For repositories with documents containing generic information or information related to domains other than HCLS, you can use a similar approach with Amazon Comprehend to automate metadata generation.
To demonstrate an intelligent search solution with enriched data, we use Wikipedia pages of the medicines listed in the World Health Organization (WHO) Model List of Essential Medicines. We combine this content with metadata automatically generated using Amazon Comprehend Medical, into a unified Amazon Kendra index to make it searchable. You can visit the search application and try asking it some questions of your own, such as “What is the recommended paracetamol dose for an adult?” The following screenshot shows the results.
We take a two-step approach to custom content enrichment during the content ingestion process:
- Identify the metadata for each document using Amazon Comprehend Medical.
- Ingest the document along with the metadata in the search solution based on an Amazon Kendra index.
Amazon Comprehend Medical uses NLP to extract medical insights from documents by detecting medical entities (such as medication, medical condition, and anatomical location), the relationships between entities (such as route and medication), and traits (such as negation). In this example, for the Wikipedia page of each medicine from the WHO Model List of Essential Medicines, we use the DetectEntitiesV2 operation of Amazon Comprehend Medical to detect entities in categories such as MEDICAL_CONDITION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. We use these entities to generate the document metadata.
We prepare the Amazon Kendra index by defining custom attributes of type STRING_LIST corresponding to the entity categories detected by Amazon Comprehend Medical (for example, TIME_EXPRESSION). For each document, the DetectEntitiesV2 operation of Amazon Comprehend Medical returns a categorized list of entities. Each entity from this list with a sufficiently high confidence score (for this use case, greater than 0.97) is added to the custom attribute corresponding to its category. After all the detected entities are processed in this way, the populated attributes are used to generate the metadata JSON file corresponding to that document. Amazon Kendra has an upper limit of 10 strings for an attribute of STRING_LIST type. In this example, we therefore keep, for each category, the 10 entities that occur most frequently in the processed document.
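The selection logic described above can be sketched in Python roughly as follows. This is a minimal illustration and not the code in meds.py: the function name `select_top_entities` is hypothetical, it assumes each entity is a dictionary with the `Category`, `Text`, and `Score` keys found in a DetectEntitiesV2 response, and lowercasing the text before counting is an extra assumption made here so that "Paracetamol" and "paracetamol" count as one entity.

```python
from collections import Counter, defaultdict

CONFIDENCE_THRESHOLD = 0.97      # threshold used in this post
MAX_STRINGS_PER_ATTRIBUTE = 10   # Amazon Kendra limit for STRING_LIST attributes


def select_top_entities(entities, threshold=CONFIDENCE_THRESHOLD,
                        limit=MAX_STRINGS_PER_ATTRIBUTE):
    """Return {category: [most frequent entity texts]} for confident entities."""
    counts = defaultdict(Counter)
    for entity in entities:
        # Keep only entities whose confidence score exceeds the threshold.
        if entity["Score"] > threshold:
            # Assumption: normalize case so spelling variants are counted together.
            counts[entity["Category"]][entity["Text"].lower()] += 1
    # For each category, keep the `limit` entities with the highest frequency.
    return {
        category: [text for text, _ in counter.most_common(limit)]
        for category, counter in counts.items()
    }
```

Calling `select_top_entities(response["Entities"])` on a DetectEntitiesV2 response would give one list per category, each short enough to fit a STRING_LIST attribute.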
After the metadata JSON files for all the documents are created, they’re copied to the Amazon Simple Storage Service (Amazon S3) bucket configured as a data source to the Amazon Kendra index, and a data source sync is performed to ingest the documents in the index along with the metadata.
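As a rough illustration of what a metadata file for one document might look like, the sketch below builds a metadata object and writes it next to the document. The helper names `build_metadata` and `write_metadata_file` are hypothetical, the `DocumentId` and `Title` values are placeholders, and the `<document>.metadata.json` naming is an assumption based on the convention used by the Amazon Kendra S3 connector; check the file layout against the data source configuration you deploy.

```python
import json
from pathlib import Path


def build_metadata(document_id, title, attributes):
    """Build a metadata object with custom STRING_LIST attributes."""
    return {
        "DocumentId": document_id,
        "Title": title,
        # Drop empty lists and cap each list at 10 strings (Kendra limit).
        "Attributes": {
            name: values[:10] for name, values in attributes.items() if values
        },
    }


def write_metadata_file(document_path, metadata):
    """Write <document file name>.metadata.json beside the document."""
    document_path = Path(document_path)
    metadata_path = document_path.with_name(document_path.name + ".metadata.json")
    metadata_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata_path
```

For a document stored as `Paracetamol.txt`, this would produce `Paracetamol.txt.metadata.json` containing the attribute lists for that page.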
To deploy and work with the solution in this post, make sure you have the following:
- An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
- Basic knowledge of AWS and the AWS Command Line Interface (AWS CLI). For more information about the AWS CLI, see AWS CLI Command Reference.
- An S3 bucket to store the documents and metadata. For more information, see Creating a bucket and What is Amazon S3?
- Access to AWS CloudShell, Amazon Kendra, and Amazon Comprehend Medical.
We use the AWS CloudFormation template medkendratemplate.yaml to deploy an Amazon Kendra index with custom attributes of type STRING_LIST corresponding to the entity categories.
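In this post the custom attributes are created by the CloudFormation template. For readers who want to see what such a definition amounts to, the following sketch expresses it through the Amazon Kendra `update_index` API instead; it is an alternative illustration, not the contents of medkendratemplate.yaml. The helper names are hypothetical, and the `Searchable` and `Displayable` settings are assumed defaults rather than values taken from the template.

```python
def build_attribute_updates(categories, facetable=()):
    """Build STRING_LIST attribute definitions for an Amazon Kendra index."""
    return [
        {
            "Name": category,
            "Type": "STRING_LIST_VALUE",
            "Search": {
                # Only facetable attributes appear as facets in search results.
                "Facetable": category in facetable,
                "Searchable": True,    # assumed setting
                "Displayable": True,   # assumed setting
            },
        }
        for category in categories
    ]


def apply_attribute_updates(index_id, updates, client=None):
    """Apply the attribute definitions to an existing index."""
    if client is None:
        import boto3  # imported lazily so the builder works without boto3
        client = boto3.client("kendra")
    client.update_index(Id=index_id, DocumentMetadataConfigurationUpdates=updates)
```

For example, `build_attribute_updates(["MEDICAL_CONDITION", "TIME_EXPRESSION"], facetable={"MEDICAL_CONDITION"})` would define both attributes but expose only MEDICAL_CONDITION as a facet.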
The following diagram illustrates our solution architecture.
Based on this architecture, the steps to build and use the solution are as follows:
- On CloudShell, a Bash script called getpages.sh downloads the Wikipedia pages of the medicines and stores them as text files.
- A Python script called meds.py, which contains the core logic for automating metadata generation, makes the detect_entities_v2 API call to Amazon Comprehend Medical to detect entities in each of the Wikipedia pages and generates metadata based on the entities returned. The script performs the following steps:
  - Split the Wikipedia page text into chunks smaller than the maximum text size allowed by the detect_entities_v2 API call.
  - Make the detect_entities_v2 call.
  - Filter the entities detected by the detect_entities_v2 call using a threshold confidence score (0.97 for this example).
  - Keep track of each unique entity corresponding to its category and the frequency of occurrence of that entity.
  - For each entity category, sort the entities in that category from highest to lowest frequency of occurrence and select the top 10 entities.
  - Create a metadata object based on the selected entities and output it in JSON format.
- We use the AWS CLI to copy the text data and the metadata to the S3 bucket that is configured as a data source to the Amazon Kendra index using the S3 connector.
- We perform a data source sync using the Amazon Kendra console to ingest the contents of the documents along with the metadata in the Amazon Kendra index.
- Finally, we use the Amazon Kendra search console to make queries to the index.
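The chunking and entity detection steps performed by the Python script can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual meds.py: the function names are hypothetical, splitting on whitespace is one possible chunking strategy, and the 20,000-byte default is an assumed service limit that should be checked against the current Amazon Comprehend Medical quotas.

```python
MAX_CHUNK_BYTES = 20000  # assumed DetectEntitiesV2 limit; verify against quotas


def split_into_chunks(text, max_bytes=MAX_CHUNK_BYTES):
    """Split text on whitespace into chunks smaller than max_bytes (UTF-8)."""
    chunks, current, size = [], [], 0
    for word in text.split():
        word_bytes = len(word.encode("utf-8"))
        extra = word_bytes + (1 if current else 0)  # +1 for the joining space
        if current and size + extra >= max_bytes:
            chunks.append(" ".join(current))
            current, size = [word], word_bytes
        else:
            # Note: a single word longer than max_bytes still becomes a chunk.
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def detect_entities(text, client=None, max_bytes=MAX_CHUNK_BYTES):
    """Call detect_entities_v2 on each chunk and collect all entities."""
    if client is None:
        import boto3  # imported lazily so the chunking helper works without boto3
        client = boto3.client("comprehendmedical")
    entities = []
    for chunk in split_into_chunks(text, max_bytes):
        response = client.detect_entities_v2(Text=chunk)
        entities.extend(response["Entities"])
    return entities
```

The combined entity list can then be filtered by confidence score and counted per category, as described in the steps above.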
Create an Amazon S3 bucket to be used as a data source
Create an Amazon S3 bucket that you will use as a data source for the Amazon Kendra index.
Deploy the infrastructure as a CloudFormation stack
To deploy the infrastructure and resources for this solution, complete the following steps:
In a separate browser tab, open the AWS Management Console, and make sure that you’re logged in to your AWS account. Click the following button to launch the CloudFormation stack to deploy the infrastructure.
After that you should see a page similar to the following image:
For S3DataSourceBucket, enter your data source bucket name without the s3:// prefix, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack.
Stack creation can take 30–45 minutes to complete. You can monitor the stack creation status on the Stack info tab. You can also look at the different tabs, such as Events, Resources, and Template. While the stack is being created, you can work on getting the data and generating the metadata in the next few steps.
Get the data and generate the metadata
To fetch your data and start generating metadata, complete the following steps:
- On the AWS Management Console, choose the icon circled in red in the following picture to start AWS CloudShell.
- Copy the file code-data.tgz and extract the contents by using the following commands on AWS CloudShell:
- Change the working directory to
At this point, you have two options. You can run the end-to-end workflow: get the data, create the metadata using Amazon Comprehend Medical (which takes about 35–40 minutes), and then ingest the data along with the metadata into the Amazon Kendra index. Alternatively, you can complete only the last step and ingest the data with the metadata that was already generated using Amazon Comprehend Medical and is supplied in the package for convenience.
- To use the metadata supplied in the package, enter the following code and then jump to Step 6:
- Perform this step to get hands-on experience of building the end-to-end solution. The following command runs a Bash script called main.sh, which calls the following scripts:
  - prereq.sh to install prerequisites and create subdirectories to store data and metadata
  - getpages.sh to get the Wikipedia pages of medicines in the list
  - getmetapar.sh to call the meds.py Python script for each document
The Python script meds.py contains the logic to make the detect_entities_v2 call to Amazon Comprehend Medical and then process the output to produce the JSON metadata file. It takes about 30–40 minutes for this to complete.
While performing Step 5, if CloudShell times out, the security tokens are refreshed, or the script stops before all the data is processed, start the CloudShell session again and run getmetapar.sh, which resumes the data processing from the point where it stopped:
- Upload the data and metadata to the S3 bucket being used as the data source for the Amazon Kendra index using the following AWS CLI commands:
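The same upload can also be scripted in Python. The following is a minimal sketch under stated assumptions rather than the commands used in this post: the helper names are hypothetical, and the local directory names and key prefixes in the usage note are placeholders that must match the layout expected by your data source configuration.

```python
from pathlib import Path


def plan_uploads(local_dir, key_prefix):
    """Map every file under local_dir to an S3 key below key_prefix."""
    local_dir = Path(local_dir)
    prefix = key_prefix.strip("/")
    plan = []
    for path in sorted(local_dir.rglob("*")):
        if path.is_file():
            relative = path.relative_to(local_dir).as_posix()
            key = f"{prefix}/{relative}" if prefix else relative
            plan.append((path, key))
    return plan


def upload_all(bucket, plan, client=None):
    """Upload each planned file to the bucket and return the number uploaded."""
    if client is None:
        import boto3  # imported lazily so planning works without boto3
        client = boto3.client("s3")
    for path, key in plan:
        client.upload_file(Filename=str(path), Bucket=bucket, Key=key)
    return len(plan)
```

For example, `upload_all("my-bucket", plan_uploads("data", "data"))` followed by the equivalent call for the metadata directory would copy both sets of files, where `my-bucket`, `data`, and the prefixes are hypothetical names.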
Review Amazon Kendra configuration and start the data source sync
Before starting this step, make sure that the CloudFormation stack creation is complete. In the following steps, we start the data source sync to begin crawling and indexing documents.
- On the Amazon Kendra console, choose the index AuthKendraIndex, which was created as part of the CloudFormation stack.
- In the navigation pane, choose Data sources.
- On the Settings tab, you can see the data source bucket being configured.
- Choose the data source and choose Sync now.
The data source sync can take 10–15 minutes to complete.
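If you prefer to start and monitor the sync programmatically instead of using the console, a sketch along the following lines may help. It is an illustration only: the function name `run_sync` is hypothetical, the polling interval and the set of statuses treated as final are assumptions, and the index and data source IDs must come from your own deployment.

```python
import time

# Statuses treated as final in this sketch (an assumption, not an exhaustive list).
FINAL_STATUSES = {"SUCCEEDED", "FAILED", "INCOMPLETE", "ABORTED"}


def run_sync(client, index_id, data_source_id,
             poll_seconds=60, max_polls=60, sleep=time.sleep):
    """Start a data source sync job and poll until it reaches a final status."""
    started = client.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
    execution_id = started["ExecutionId"]
    for _ in range(max_polls):
        response = client.list_data_source_sync_jobs(Id=data_source_id,
                                                     IndexId=index_id)
        # Find the job we started; it may not be listed immediately.
        status = next(
            (job["Status"] for job in response.get("History", [])
             if job.get("ExecutionId") == execution_id),
            None,
        )
        if status in FINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("sync job did not finish within the polling window")
```

With `client = boto3.client("kendra")`, calling `run_sync(client, index_id, data_source_id)` would block until the job finishes and return its final status.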
Observe Amazon Kendra index facet definition
In the navigation pane, choose Facet definition. The following screenshot shows the entries for the entity categories, such as TIME_EXPRESSION. These are the categories of the entities detected by Amazon Comprehend Medical. They are defined as custom attributes in the CloudFormation template that we used to create the Amazon Kendra index. The Facetable check boxes for some categories, including TIME_EXPRESSION, aren’t selected, so those categories aren’t shown in the facets of the search user interface.
Query the repository of WHO Model List of Essential Medicines
We’re now ready to make queries to our search solution.
- On the Amazon Kendra console, navigate to your index and choose Search console.
- In the search field, enter “What is the treatment for diabetes?”
The following screenshot shows the results.
- Choose Filter search results to see the facets.
The headings, such as MEDICAL_CONDITION and TEST_TREATMENT_PROCEDURE, are the categories defined as Amazon Kendra facets, and the items listed under each heading are the entities of that category as detected by Amazon Comprehend Medical in the documents being searched. Categories that aren’t facetable, such as TIME_EXPRESSION, are not shown.
- Under MEDICAL_CONDITION, select pregnancy to refine the search results.
You can go back to the Facet definition page, make TIME_EXPRESSION facetable, and save the configuration. Then go back to the search console, make a new query, and observe the facets again. Experiment with these facets to see what suits your needs best.
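The facet selection made in the search console corresponds to an attribute filter on the query. The sketch below shows one way to build such a request for the Amazon Kendra `query` API; it is a hedged illustration in which the helper names `build_query_args` and `run_query` are hypothetical, and `ContainsAny` is chosen here as a reasonable filter for STRING_LIST attributes.

```python
def build_query_args(index_id, query_text, facet_keys=(), selections=None):
    """Build keyword arguments for an Amazon Kendra query with facets."""
    args = {"IndexId": index_id, "QueryText": query_text}
    if facet_keys:
        # Ask Kendra to return value counts for these attributes.
        args["Facets"] = [{"DocumentAttributeKey": key} for key in facet_keys]
    # One ContainsAny filter per attribute that has selected values.
    filters = [
        {"ContainsAny": {"Key": key, "Value": {"StringListValue": list(values)}}}
        for key, values in (selections or {}).items()
        if values
    ]
    if len(filters) == 1:
        args["AttributeFilter"] = filters[0]
    elif filters:
        # Require every selected facet to match.
        args["AttributeFilter"] = {"AndAllFilters": filters}
    return args


def run_query(args, client=None):
    """Send the query and return the list of result items."""
    if client is None:
        import boto3  # imported lazily so the builder works without boto3
        client = boto3.client("kendra")
    return client.query(**args).get("ResultItems", [])
```

For example, `build_query_args(index_id, "What is the treatment for diabetes?", ["MEDICAL_CONDITION"], {"MEDICAL_CONDITION": ["pregnancy"]})` mirrors selecting pregnancy under MEDICAL_CONDITION in the console.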
Make additional queries and use the facets to refine the search results. You can use the following queries to get started, but you can also experiment with your own:
- What is a common painkiller?
- Is paracetamol safe for children?
- How to manage high blood pressure?
- When should BCG vaccine be administered?
You can observe how domain-specific facets improve the search experience.
To delete the infrastructure that was deployed as part of the CloudFormation stack, delete the stack from the AWS CloudFormation console. Stack deletion can take 20–30 minutes.
When the stack status shows as Delete Complete, go to the Events tab and confirm that each of the resources has been removed. You can also cross-verify by checking on the Amazon Kendra console that the index is deleted.
You must delete your data source bucket separately because it wasn’t created as part of the CloudFormation stack.
In this post, we demonstrated how to automate the process to enrich the content by generating domain-specific metadata for an Amazon Kendra index using Amazon Comprehend or Amazon Comprehend Medical, thereby improving the user experience for the search solution.
This example used the entities detected by Amazon Comprehend Medical to generate the Amazon Kendra metadata. Depending on the domain of the content repository, you can use a similar approach with the pretrained model or custom trained models of Amazon Comprehend. Try out our solution and let us know what you think! You can further enhance the metadata by using other elements such as protected health information (PHI) for Amazon Comprehend Medical and events, key phrases, personally identifiable information (PII), dominant language, sentiment, and syntax for Amazon Comprehend.
About the Authors
Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS partners to help them in their cloud journey.
Udi Hershkovich has been a Principal WW AI/ML Service Specialist at AWS since 2018. Prior to AWS, Udi held multiple leadership positions with AI startups and Enterprise initiatives including co-founder and CEO at LeanFM Technologies, offering ML-powered predictive maintenance in facilities management, CEO of Safaba Translation Solutions, a machine translation startup acquired by Amazon in 2015, and Head of Professional Services for Contact Center Intelligence at Amdocs. Udi holds Law and Business degrees from the Interdisciplinary Center in Herzliya, Israel, and lives in Pittsburgh, Pennsylvania, USA.