Getting Started Resource Center

Developer Tools

TUTORIAL

Extract information from unstructured documents with Amazon Bedrock and Amazon Textract

Introduction

Overview

In this tutorial, you will learn how to utilize Amazon Bedrock and Amazon Textract to extract and process information from unstructured documents.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, layout elements, and data from scanned documents.

What you will accomplish

In this tutorial, you will:

Enable access to a foundation model on your AWS account
Create a new Jupyter notebook to write test code and run tests
Generate code
Clean up your resources

Prerequisites

Before starting this tutorial, you will need:

An AWS account: if you don't already have one follow the Setup Your Environment tutorial.

Implementation

AWS experience

Beginner

Time to complete

20 minutes

Cost to complete

Less than $0.15 if completed within two hours and the notebook is deleted at the end of the tutorial.

Get help

Troubleshooting Amazon Bedrock models

Debugging traning issues

Last update

November 14, 2024

Enable Anthropic FM

In this step, you will enable the use of Anthropic models on your AWS account.

Already requested and obtained access to Anthropic models on Amazon Bedrock?
Skip to Create a Jupyter Notebook.

1. Open Amazon Bedrock

Sign in to the AWS Management console, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/home.

In the left navigation pane, under Bedrock configurations, choose Model Access.

2. Enable a model

On the What is Model access? page, choose Enable specific models.

"Amazon Bedrock model access page with options to enable all models or enable specific models, and links to IAM permissions and quotas."

3. Choose the Anthropic models

On the Edit model access page, select the Anthropic models, and choose Next.

A user interface for editing model access, showing a list of AI models grouped by provider, with all models under "Anthropic" selected and the "Next" button highlighted.

4. Review and submit the change

On the Review and submit page, review your selections, and choose Submit.

"Interface for editing model access in AWS, listing eight Claude models with 'Request access' options, and a 'Submit' button highlighted at the bottom."

Create a Jupyter Notebook

In this step, you will create a Jupyter notebook to write your Proof-of-Concept code and test it out with real documents.

1. Open Amazon SageMaker

Open the Amazon Sagemaker console at https://console.aws.amazon.com/sagemaker/home.

In the left navigation pane, under Applications and IDEs, choose Notebooks.

2. Create a notebook instance

On the Notebooks and Git repos page, choose Create notebook instance.

Amazon SageMaker interface showing the "Notebook instances" tab with no resources listed and an orange "Create notebook instance" button highlighted.

3. Configure notebook instance settings

On the Create notebook instance page:

In the Notebook instance settings section:

For Notebook instance name, enter a name for your Jupyter instance.
For Notebook instance type, verify ml.t3.medium is selected.
Keep all other default settings.

"Amazon SageMaker interface for creating a notebook instance, showing settings for instance name, type, IAM role creation success, and the 'Create notebook instance' button highlighted."

4. Configure permissions and encryption

In the Permissions and encryption section:

For IAM role, choose Create a new role.
On the Create an IAM role pop up window, for S3 buckets you specify – optional, choose None, and then choose Create role.

Then, choose Create notebook instance.

Alt-text: "AWS IAM role creation screen with S3 bucket access options, 'None' selected, and 'Create role' button highlighted."

Generate code to process your documents

In this step, you will use Bedrock playground to generate code for your Jupyter notebook.

1. Open JupyterLab

On the Notebook instance page, choose Open JupyterLab for the instance you created in the previous step.

Note: The notebook will open in a separate browser tab.

2. Create a new notebook

On the JupyterLab tab, right-click the file area, and then select New Notebook.

Alt-text: A JupyterLab interface showing a context menu with the "New Notebook" option highlighted in red.

3. Select kernel

On the Select Kernel pop up window, choose conda_python3, and choose Select.

Dialog box titled "Select Kernel" with a dropdown showing "conda_python3" selected and two buttons: "No Kernel" and "Select."

4. Open the chat playground

In a new tab, open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/home.

In the left navigation pane, under Playgrounds, choose Chat/text.

Alt-text: Amazon Bedrock interface showing navigation options on the left, including "Playgrounds" with "Chat/text" highlighted, and foundation model options on the right.

5. Select the model

On the Mode page, choose Select model.

Alt-text: Amazon Bedrock interface showing a "Select model" button highlighted in orange, with options for input, output, and latency.

5. Specify the model details

In the Select model dialog box:

For Categories, choose Anthropic.
For Models with access, choose the Claude 3.5 Sonnet model.
Then, choose Apply.

Note: The Claude 3.5 Sonnet is the most intelligent model from Anthropic. You can see a more detailed model comparison here.

A model selection interface showing "Anthropic" as the chosen provider, "Claude 3.5 Sonnet" as the selected model, and an "Apply" button highlighted in orange.

6. Generate code

In the Chat playground, you can now ask the LLM to write sample code. The following is an example prompt that you can use to extract information from an unstructured document.

I am writing a Jupyter notebook with a proof of concept python code snippets to perform a few tasks.

To start, write a snippet to iterate the current folder and read all the jpg/png files and for each file call textract DetectDocumentText API to extract all the text on the image.

Re-save the result with the same file name and txt extension.

Also make sure to:

- Not reprocess any files that already have the txt file existing in the directory

- Print a progress bar output using tdqm

- Keep everything readable and properly componentized in methods

- No need for __main__ implementations as it's a snippet to run on Jupyter notebook.

Once you enter your prompt, and choose Run, the prompt response will include code and also explanation of everything that the model generated. The code will typically be enclosed in quotation marks.

"Screenshot of Amazon Bedrock's Chat/text playground interface with a prompt about writing a Python script for text extraction using the DetectDocumentText API in a Jupyter notebook."

7. Check the output

The generated code with the example prompt should look similar to the following example. You can also use the copy function to paste the code directly into the Jupyter notebook.

import osimport boto3from tqdm import tqdmfrom PIL import Imagedef get_image_files(directory):    """Get all jpg and png files in the given directory."""    return [f for f in os.listdir(directory) if f.lower().endswith(('.jpg', '.png'))]def should_process_file(file_path):    """Check if the file should be processed (i.e., no corresponding txt file exists)."""    txt_path = os.path.splitext(file_path)[0] + '.txt'    return not os.path.exists(txt_path)def extract_text_from_image(image_path):    """Extract text from the image using Amazon Textract."""    client = boto3.client('textract')        with open(image_path, 'rb') as image:        response = client.detect_document_text(Document={'Bytes': image.read()})        extracted_text = []    for item in response['Blocks']:        if item['BlockType'] == 'LINE':            extracted_text.append(item['Text'])        return '\n'.join(extracted_text)def save_text_to_file(text, file_path):    """Save the extracted text to a file."""    txt_path = os.path.splitext(file_path)[0] + '.txt'    with open(txt_path, 'w', encoding='utf-8') as f:        f.write(text)def process_images_in_directory(directory):    """Process all images in the given directory."""    image_files = get_image_files(directory)        for image_file in tqdm(image_files, desc="Processing images"):        image_path = os.path.join(directory, image_file)                if should_process_file(image_path):            extracted_text = extract_text_from_image(image_path)            save_text_to_file(extracted_text, image_path)# Usage in Jupyter notebookdirectory = '.'  # Current directoryprocess_images_in_directory(directory)

Note: The previous example code is built to process all files on the current directory and needs an image in order to fully process the code.

Screenshot of a coding interface showing a Python script for processing image files and extracting text using Amazon Textract, with a highlighted copy icon in the top-right corner.

8. Prepare your image file

You can use your own image or download and save this image. Then, find the image you want to use on your local machine, and drag the file to the Jupyter Notebook file explorer in order to copy and paste it.

Screenshot of a Jupyter Notebook showing Python code for processing image files and extracting text using AWS Textract, alongside a file browser displaying a health insurance card image file.

9. Configure permissions

Before you can run the code in your JupyterLab, the IAM role that was previously created for your Jupyter notebook, needs the appropriate permissions to run the AWS services that your code is going to use. If you chose to use the previous example, Amazon Textract is the AWS service that would need the appropriate permissions.

Open the AWS IAM console at https://console.aws.amazon.com/iam/home.

In the left navigation pane, choose Roles.

"Identity and Access Management (IAM) menu with 'Roles' highlighted in red under Access management."

10. Search for the IAM role

In the search box, find the previously created AmazonSageMaker-ExecutionRole-<timestamp> role, and open the role.

Screenshot of the AWS IAM Roles page showing a search for "AmazonSageMaker-ExecutionRole" with one matching result listed.

11. Add permissions

On the AmazonSageMaker-ExecutionRole-<timestamp> page, choose the Add permissions drop down, and select Attach policies.

Alt-text: AWS SageMaker execution role management interface showing summary details, permissions policies, and options to add or attach policies.

12. Attach the policy

On the Attach policy to AmazonSageMaker-ExecutionRole-<timestamp> page, in the Other permissions policies section search bar, enter AmazonTextractFullAccess. Then, select the policy, and choose Add permissions.

Screenshot of AWS console showing the attachment of the "AmazonTextractFullAccess" policy to an Amazon SageMaker execution role, with the policy selected and the "Add permissions" button highlighted.

13. Run the notebook

Navigate back to your JupyterLab tab, and select Run.

A Jupyter Notebook interface showing Python code for processing image files, including functions to list image files, check if a file should be processed, and extract text using Amazon Textract.

14. View the text file

After your code runs you should now be able to see a .txt file with the extracted text in the left navigation pane of your JupyterLab.

"Screenshot of a file explorer and text editor showing a health insurance card's redacted details, including member name, ID, plan type, and coverage information."

Clean up resources

In this step, you will go through the steps to delete all the resources you created throughout this tutorial. It is recommended that you stop the Jupyter notebook you created to prevent unexpected costs.

1. Delete the notebook

In the SageMaker console, in the left navigation pane, choose Notebooks, and select the Notebook. Then, choose Actions, and select Stop.

Note: The stop operation might take around 5 minutes. Once the notebook is stopped you can also delete it by choosing Actions and selecting Delete.

Congratulations

You have created a sample Proof-of-Concept to extract information from documents.

Next steps

Learn more about Amazon Textract

Read the Amazon Textract Developer guide.

Learn more

Learn about Amazon Bedrock

Learn how to build and scale generative AI applications with foundation models.

Learn more

Find More Hands-on Tutorials

Find more hands-on tutorials to learn how to leverage compute, storage, or connect to a database.

Learn more

Did you find what you were looking for today?

Let us know so we can improve the quality of the content on our pages

Select your cookie preferences

Extract information from unstructured documents with Amazon Bedrock and Amazon Textract

Introduction

Overview

What you will accomplish

Prerequisites

Implementation

AWS experience

Time to complete

Cost to complete

Get help

Last update

Enable Anthropic FM

1. Open Amazon Bedrock

2. Enable a model

3. Choose the Anthropic models

4. Review and submit the change

Create a Jupyter Notebook

1. Open Amazon SageMaker

2. Create a notebook instance

3. Configure notebook instance settings

4. Configure permissions and encryption

Generate code to process your documents

1. Open JupyterLab

2. Create a new notebook

3. Select kernel

4. Open the chat playground

5. Select the model

5. Specify the model details

6. Generate code

7. Check the output

8. Prepare your image file

9. Configure permissions

10. Search for the IAM role

11. Add permissions

12. Attach the policy

13. Run the notebook

14. View the text file

Clean up resources

1. Delete the notebook

Congratulations

Next steps

Learn more about Amazon Textract

Learn about Amazon Bedrock

Find More Hands-on Tutorials

Did you find what you were looking for today?