This Guidance demonstrates how to use the Custom Document Enrichment feature of Amazon Kendra to improve search experiences. Documents with precise content and rich metadata are more searchable and yield more accurate results. You can make large repositories of raw documents more searchable by modifying their content or adding metadata before indexing.
Amazon Kendra ingests the text returned by Amazon Textract and Amazon Transcribe through a pre-extraction Lambda function.
Using the advanced Custom Document Enrichment operation, a post-extraction Lambda function calls Amazon Comprehend to detect entities in the text ingested by Amazon Kendra and returns those entities.
The detected entities are used to update the document metadata, which is presented as facets in Amazon Kendra search results.
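As a rough illustration of the entity-to-metadata step, the sketch below turns an Amazon Comprehend `DetectEntities`-style response into the `metadataUpdates` list that a post-extraction Lambda function could return to Amazon Kendra. The helper names, the `min_score` threshold, and the choice to use the entity type as the attribute name are assumptions made for this example and are not taken from the Guidance's sample code. The response field names (`version`, `s3ObjectKey`, `metadataUpdates`, `stringListValue`) reflect the Custom Document Enrichment Lambda contract as we understand it and should be checked against the current Amazon Kendra documentation.

```python
def build_metadata_updates(comprehend_response, min_score=0.8):
    """Group detected entities by type into Kendra-style metadata updates.

    `comprehend_response` is expected to look like Amazon Comprehend
    DetectEntities output: {"Entities": [{"Type", "Text", "Score"}, ...]}.
    """
    grouped = {}
    for entity in comprehend_response.get("Entities", []):
        if entity["Score"] < min_score:
            continue  # skip low-confidence detections (threshold is an assumption)
        values = grouped.setdefault(entity["Type"], [])
        if entity["Text"] not in values:  # avoid duplicate facet values
            values.append(entity["Text"])

    # One string-list attribute per entity type, in first-seen order.
    return [
        {"name": entity_type, "value": {"stringListValue": texts}}
        for entity_type, texts in grouped.items()
    ]


def build_hook_response(s3_object_key, comprehend_response):
    """Assemble the dictionary a post-extraction hook returns (assumed shape)."""
    return {
        "version": "v0",
        "s3ObjectKey": s3_object_key,
        "metadataUpdates": build_metadata_updates(comprehend_response),
    }


# Hypothetical sample data, shaped like a DetectEntities response.
sample = {
    "Entities": [
        {"Type": "PERSON", "Text": "Ana Silva", "Score": 0.99},
        {"Type": "ORGANIZATION", "Text": "Example Corp", "Score": 0.95},
        {"Type": "PERSON", "Text": "Ana Silva", "Score": 0.97},  # duplicate
        {"Type": "LOCATION", "Text": "Lisbon", "Score": 0.42},   # below threshold
    ]
}
response = build_hook_response("enriched/report.txt", sample)
print(response)
```

For these values to show up as facets, the index would also need matching string-list attributes that are configured as facetable.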
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Amazon Kendra uses Amazon CloudWatch Logs to provide insight into the operation of data sources. It logs process details for documents as they are indexed, along with errors from the data source that occur during indexing. You can use CloudWatch Logs to monitor, store, and access the log files. With minimal user intervention, CloudWatch can also apply anomaly detection, continuously analyzing system and application metrics to determine normal baselines and surface anomalies. The AWS CloudFormation template can be modified and extended to integrate changes.
The CloudFormation infrastructure as code (IaC) automation deploys resources to the AWS Cloud securely. This reduces the risk of human error associated with manual configuration or management.
Lambda functions are configured through AWS Identity and Access Management (IAM) with least-privilege access, limiting access to just the required Amazon S3 data buckets.
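To make the least-privilege idea concrete, here is a minimal sketch that builds an IAM policy document granting object-level access to named buckets only. The bucket names, the statement IDs, and the specific actions (`s3:GetObject`, `s3:PutObject`) are assumptions for illustration; the policies actually deployed by the Guidance's CloudFormation template may differ.

```python
import json


def least_privilege_s3_policy(read_buckets, write_buckets):
    """Return an IAM policy document limited to objects in the given buckets."""
    statements = []
    if read_buckets:
        statements.append({
            "Sid": "ReadSourceObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            # Object-level ARNs only; no bucket-wide or account-wide wildcard.
            "Resource": [f"arn:aws:s3:::{name}/*" for name in read_buckets],
        })
    if write_buckets:
        statements.append({
            "Sid": "WriteEnrichedObjects",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{name}/*" for name in write_buckets],
        })
    return {"Version": "2012-10-17", "Statement": statements}


# Hypothetical bucket names, used only for this example.
policy = least_privilege_s3_policy(
    read_buckets=["example-source-bucket"],
    write_buckets=["example-enriched-bucket"],
)
print(json.dumps(policy, indent=2))
```

Keeping read and write permissions in separate statements makes it easier to see, and to audit, exactly which buckets each function can touch.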
Amazon Kendra Enterprise Edition is highly available by default within a Region and can tolerate Availability Zone failures. Lambda runs in multiple Availability Zones to ensure that it is available to process events if service is interrupted in a single zone.
The pre-extraction Lambda function is configured to run for a maximum of 5 minutes, so text extraction from each audio or video file must complete within that time. The post-extraction Lambda function is configured to run for a maximum of 1 minute, so Amazon Comprehend must detect entities in the text within that time.
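One way to work within these time limits is to process text in pieces and stop before the function times out. The sketch below is an illustration under stated assumptions, not the Guidance's implementation: the chunking, the `safety_margin_ms` value, and the helper `process_within_budget` are hypothetical. It relies on `get_remaining_time_in_millis()`, which the Lambda Python context object provides; a small stand-in class simulates that object so the example runs locally.

```python
def process_within_budget(chunks, handle_chunk, remaining_ms, safety_margin_ms=5_000):
    """Handle chunks in order until the remaining time drops below the margin.

    Returns (results, unprocessed_chunks) so the caller can decide what to do
    with any text that did not fit in the time budget.
    """
    results = []
    for index, chunk in enumerate(chunks):
        if remaining_ms() < safety_margin_ms:
            return results, list(chunks[index:])
        results.append(handle_chunk(chunk))
    return results, []


class FakeContext:
    """Stand-in for the Lambda context object, used only for local testing."""

    def __init__(self, budget_ms, cost_per_call_ms):
        self._remaining = budget_ms
        self._cost = cost_per_call_ms

    def get_remaining_time_in_millis(self):
        # Report the current budget, then pretend one chunk's worth of time passes.
        remaining = self._remaining
        self._remaining -= self._cost
        return remaining


# Simulate a 1-minute budget where each chunk takes about 20 seconds.
context = FakeContext(budget_ms=60_000, cost_per_call_ms=20_000)
done, pending = process_within_budget(
    ["chunk-a", "chunk-b", "chunk-c", "chunk-d"],
    handle_chunk=str.upper,  # placeholder for the real entity-detection call
    remaining_ms=context.get_remaining_time_in_millis,
)
print(done, pending)
```

In a real handler, `remaining_ms` would be `context.get_remaining_time_in_millis` from the Lambda invocation, and `handle_chunk` would wrap the call to the extraction or entity-detection service.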
Amazon Kendra is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, a role, or an AWS service in Amazon Kendra. CloudTrail captures all API calls for Amazon Kendra as events, including calls from the Amazon Kendra console and code calls to the Amazon Kendra APIs.
The services used in this Guidance are purpose-built for their tasks. Amazon Transcribe creates transcriptions of audio and video files. Amazon Textract extracts text from scanned image documents. Amazon Comprehend detects entities within the text.
With CloudFormation IaC, this Guidance can be deployed to any supported AWS Region close to the user base to decrease latency and improve performance.
The code runs in Lambda functions, which provide serverless compute without infrastructure to manage. The functions automatically scale in and out to meet changes in demand.
Serverless architectures and services such as Lambda, Amazon Textract, Amazon Comprehend, and Amazon Transcribe provide a pay-as-you-go pricing model that is based on usage. And because they're serverless, these services scale based on demand.
After the application logic has run, Lambda shuts down the function's execution environment. This saves on infrastructure use and cost.
A detailed guide is provided for you to experiment with and use within your AWS account. Each stage of building the Guidance, including deployment, usage, and cleanup, is examined to prepare it for deployment.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.