Posted On: Jan 26, 2022

Amazon Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents and goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

Previously, customers had to convert PDF documents to PNG or JPEG before calling Textract's synchronous APIs (DetectDocumentText, AnalyzeDocument, AnalyzeExpense, and AnalyzeID) to extract text and data from documents such as claim forms, invoices and receipts, contracts and agreements, ID documents, and application forms. Starting today, Amazon Textract removes that pre-processing step: synchronous operations now accept single-page PDF documents directly, so customers can extract text and data without first converting them to PNG or JPEG.
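As a rough illustration of the workflow described above, the following Python sketch passes the raw bytes of a single-page PDF to DetectDocumentText through boto3 and collects the detected lines of text. The helper names (`is_pdf`, `detect_lines`, `main`) are hypothetical and exist only for this example. The sketch assumes boto3 is installed and AWS credentials and a region are already configured; it is a minimal outline, not a reference implementation.

```python
from typing import List


def is_pdf(data: bytes) -> bool:
    """Cheap local sanity check: PDF files begin with the '%PDF-' marker."""
    return data.startswith(b"%PDF-")


def detect_lines(client, pdf_bytes: bytes) -> List[str]:
    """Send PDF bytes to DetectDocumentText and return the text of LINE blocks.

    `client` is expected to be a boto3 Textract client (or any object that
    exposes a compatible `detect_document_text` method).
    """
    if not is_pdf(pdf_bytes):
        raise ValueError("input does not look like a PDF document")

    # No conversion to PNG or JPEG: the PDF bytes are sent as-is.
    response = client.detect_document_text(Document={"Bytes": pdf_bytes})

    # The response contains PAGE, LINE, and WORD blocks; keep only the lines.
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def main(path: str) -> None:
    """Hypothetical entry point: read a local PDF and print its lines."""
    import boto3  # imported here so the helpers above work without boto3

    with open(path, "rb") as handle:
        pdf_bytes = handle.read()

    for line in detect_lines(boto3.client("textract"), pdf_bytes):
        print(line)
```

Calling `main` with the path to a local single-page PDF would print one detected line per row. Passing the client in as a parameter keeps `detect_lines` easy to exercise with a stub object in place of a real Textract client.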

Additionally, Amazon Textract now supports JPEG 2000 encoded images inside PDF documents, so you can extract text and data from those images as well.

To get started, log in to the Amazon Textract console and test your PDF documents. To learn more about Textract capabilities, visit the Amazon Textract website, developer guide, or resource page.