Extract information from unstructured documents with Amazon Bedrock and Amazon Textract
TUTORIAL
In this tutorial, you will learn how to use Amazon Bedrock and Amazon Textract to extract and process information from unstructured documents.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, layout elements, and data from scanned documents.
What you will accomplish
In this tutorial, you will:
- Enable access to a foundation model on your AWS account
- Create a new Jupyter notebook to write test code and run tests
- Generate code
- Clean up your resources
Prerequisites
Before starting this tutorial, you will need:
- An AWS account: if you don't already have one, follow the Setup Your Environment tutorial.
Implementation
AWS experience
Beginner
Time to complete
20 minutes
Cost to complete
Less than $0.15 if completed within two hours and the notebook is deleted at the end of the tutorial.
Requires
- AWS account with administrator-level access*
*Accounts created within the past 24 hours might not yet have access to the services required for this tutorial.
Services used
Amazon Bedrock, Amazon Textract, Amazon SageMaker
Last updated
November 14, 2024
Step 1: Enable Anthropic FM
In this step, you will enable the use of Anthropic models on your AWS account.
Already requested and obtained access to Anthropic models on Amazon Bedrock? Skip to Create a Jupyter Notebook.
1. Sign in to the AWS Management Console, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/home.
2. In the left navigation pane, under Bedrock configurations, choose Model access.
3. On the What is Model access? page, choose Enable specific models.
4. On the Edit model access page, select the Anthropic models, and choose Next.
5. On the Review and submit page, review your selections, and choose Submit.
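Note: If you want to confirm model availability programmatically, you can list the Anthropic models offered in your Region with boto3. This is a minimal sketch, assuming your AWS credentials are already configured; it shows which models the Region offers, while the Model access page remains the source of truth for your access grants.

import boto3

# List the Anthropic foundation models offered in the current Region.
bedrock = boto3.client("bedrock")
response = bedrock.list_foundation_models(byProvider="Anthropic")
for summary in response["modelSummaries"]:
    print(summary["modelId"])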
Step 2: Create a Jupyter Notebook
In this step, you will create a Jupyter notebook to write your proof-of-concept code and test it with real documents.
1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/home.
2. In the left navigation pane, under Applications and IDEs, choose Notebooks.
3. On the Notebooks and Git repos page, choose Create notebook instance.
4. On the Create notebook instance page:
- In the Notebook instance settings section:
- For Notebook instance name, enter a name for your Jupyter instance.
- For Notebook instance type, verify ml.t3.medium is selected.
- Keep all other default settings.
- In the Permissions and encryption section:
- For IAM role, choose Create a new role.
- On the Create an IAM role pop-up window, for S3 buckets you specify – optional, choose None, and then choose Create role.
5. Then, choose Create notebook instance.
Note: The notebook instance may take up to 5 minutes to create.
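Note: If you would rather watch the instance status from code than refresh the console, the following is a minimal sketch using boto3; the notebook instance name is a placeholder you must replace with the name you chose above.

import boto3

# Poll the status of the notebook instance created in this step.
# "my-notebook-instance" is a placeholder for the name you entered.
sagemaker = boto3.client("sagemaker")
status = sagemaker.describe_notebook_instance(
    NotebookInstanceName="my-notebook-instance"
)["NotebookInstanceStatus"]
print(status)  # e.g. Pending, InService, Stopping, Stopped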
Step 3: Generate code to process your documents
In this step, you will use the Amazon Bedrock playground to generate code for your Jupyter notebook.
1. On the Notebook instances page, choose Open JupyterLab for the instance you created in the previous step.
Note: The notebook will open in a separate browser tab.
2. On the JupyterLab tab, right-click the file area, and then select New Notebook.
3. On the Select Kernel pop-up window, choose conda_python3, and choose Select.
4. In a new tab, open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/home.
5. In the left navigation pane, under Playgrounds, choose Chat/text.
6. On the Mode page, choose Select model.
7. In the Select model dialog box:
- For Categories, choose Anthropic.
- For Models with access, choose the Claude 3.5 Sonnet model.
- Then, choose Apply.
Note: Claude 3.5 Sonnet was Anthropic's most intelligent model available on Amazon Bedrock at the time of writing. See the Anthropic documentation for a more detailed model comparison.
8. In the Chat playground, you can now ask the LLM to write sample code. The following is an example prompt that you can use to extract information from an unstructured document.
I am writing a Jupyter notebook with proof-of-concept Python code snippets to perform a few tasks. To start, write a snippet to iterate over the current folder, read all the jpg/png files, and for each file call the Textract DetectDocumentText API to extract all the text on the image. Save the result with the same file name and a txt extension. Also make sure to:
- Not reprocess any files that already have the txt file existing in the directory
- Print a progress bar output using tqdm
- Keep everything readable and properly componentized in methods
- No need for __main__ implementations as it's a snippet to run on Jupyter notebook.
9. Once you enter your prompt and choose Run, the response will include the generated code along with an explanation of what the model produced. The code is typically enclosed in a code block.
10. The code generated from the example prompt should look similar to the following. You can also use the copy function to paste the code directly into your Jupyter notebook.
import os
import boto3
from tqdm import tqdm
from PIL import Image

def get_image_files(directory):
    """Get all jpg and png files in the given directory."""
    return [f for f in os.listdir(directory) if f.lower().endswith(('.jpg', '.png'))]

def should_process_file(file_path):
    """Check if the file should be processed (i.e., no corresponding txt file exists)."""
    txt_path = os.path.splitext(file_path)[0] + '.txt'
    return not os.path.exists(txt_path)

def extract_text_from_image(image_path):
    """Extract text from the image using Amazon Textract."""
    client = boto3.client('textract')
    with open(image_path, 'rb') as image:
        response = client.detect_document_text(Document={'Bytes': image.read()})
    extracted_text = []
    for item in response['Blocks']:
        if item['BlockType'] == 'LINE':
            extracted_text.append(item['Text'])
    return '\n'.join(extracted_text)

def save_text_to_file(text, file_path):
    """Save the extracted text to a file."""
    txt_path = os.path.splitext(file_path)[0] + '.txt'
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(text)

def process_images_in_directory(directory):
    """Process all images in the given directory."""
    image_files = get_image_files(directory)
    for image_file in tqdm(image_files, desc="Processing images"):
        image_path = os.path.join(directory, image_file)
        if should_process_file(image_path):
            extracted_text = extract_text_from_image(image_path)
            save_text_to_file(extracted_text, image_path)

# Usage in Jupyter notebook
directory = '.'  # Current directory
process_images_in_directory(directory)
Note: The previous example code processes all image files in the current directory, so the directory needs at least one image before the code can do any real work.
11. You can use your own image or download and save this image. Then, drag the file from your local machine into the JupyterLab file browser to upload it.
Before you can run the code in JupyterLab, the IAM role created for your Jupyter notebook in Step 2 needs permissions for the AWS services your code calls. If you used the previous example, Amazon Textract is the service that needs permissions.
12. Open the AWS IAM console at https://console.aws.amazon.com/iam/home.
13. In the left navigation pane, choose Roles.
14. In the search box, find the previously created AmazonSageMaker-ExecutionRole-<timestamp> role, and open the role.
15. On the AmazonSageMaker-ExecutionRole-<timestamp> page, choose the Add permissions dropdown, and select Attach policies.
16. On the Attach policy to AmazonSageMaker-ExecutionRole-<timestamp> page, in the Other permissions policies section search bar, enter AmazonTextractFullAccess. Then, select the policy, and choose Add permissions.
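Note: If you prefer to attach the policy programmatically instead of through the console, a sketch like the following works, assuming your credentials allow IAM changes; the role name below is a placeholder that you must replace with your actual AmazonSageMaker-ExecutionRole-<timestamp> name.

import boto3

# Attach the AmazonTextractFullAccess managed policy to the notebook's
# execution role. Replace the RoleName placeholder with your real role name.
iam = boto3.client("iam")
iam.attach_role_policy(
    RoleName="AmazonSageMaker-ExecutionRole-<timestamp>",
    PolicyArn="arn:aws:iam::aws:policy/AmazonTextractFullAccess",
)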
17. Navigate back to your JupyterLab tab, select the cell containing your code, and choose Run.
18. After your code runs, you should see a .txt file with the extracted text for each image in the JupyterLab file browser.
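With the text extracted, you can have the same Bedrock model you used in the playground process it. The following is a minimal sketch, not part of the generated code: it sends the contents of one extracted .txt file to Claude through the Bedrock Converse API and asks for the key facts. The file name, prompt, and model ID are illustrative, and the execution role would also need Amazon Bedrock permissions (for example, the AmazonBedrockFullAccess managed policy), attached the same way as the Textract policy above.

import boto3

# Illustrative follow-on: ask a Bedrock model to summarize the key facts
# in the text Textract extracted. "example.txt" is a placeholder file name.
bedrock_runtime = boto3.client("bedrock-runtime")

with open("example.txt", "r", encoding="utf-8") as f:
    document_text = f.read()

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=[{
        "role": "user",
        "content": [{"text": "List the key names, dates, and amounts in this document:\n\n" + document_text}],
    }],
)
print(response["output"]["message"]["content"][0]["text"])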
Step 4: Clean up resources
In this step, you will delete the resources you created throughout this tutorial. It is recommended that you stop the Jupyter notebook you created in Step 2 to prevent unexpected costs.
1. In the SageMaker console, in the left navigation pane, choose Notebooks, and select the Notebook. Then, choose Actions, and select Stop.
Note: The stop operation might take around 5 minutes. Once the notebook is stopped you can also delete it by choosing Actions and selecting Delete.
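Note: You can also stop and delete the instance from code. The following is a minimal sketch with boto3; the instance name is a placeholder, and deletion is only possible once the instance has stopped.

import boto3

# Stop the notebook instance, wait for it to stop, then delete it.
# "my-notebook-instance" is a placeholder for your instance name.
sagemaker = boto3.client("sagemaker")
sagemaker.stop_notebook_instance(NotebookInstanceName="my-notebook-instance")
sagemaker.get_waiter("notebook_instance_stopped").wait(
    NotebookInstanceName="my-notebook-instance"
)
sagemaker.delete_notebook_instance(NotebookInstanceName="my-notebook-instance")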
Congratulations
You have created a sample proof of concept to extract information from unstructured documents.