Q: What is Amazon Textract?
A: Amazon Textract is a document analysis service that detects and extracts text, structured data (such as fields of interest and their values), and tables from images and scans of documents. Amazon Textract's machine learning models have been trained on millions of documents so that virtually any document type you upload is automatically recognized and processed for text extraction. When information is extracted from documents, the service returns a confidence score for each element it identifies so that you can make informed decisions about how you want to use the results. For instance, if you are extracting information from tax documents, you can set custom rules to flag any extracted information with a confidence score lower than 95%. All extracted data is also returned with bounding box coordinates, which describe a rectangular frame that fully encompasses each piece of data identified, so that you can quickly locate where a word or number appears in a document. You can access these features with the Amazon Textract API, in the AWS Management Console, or using the AWS command-line interface (CLI).
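As a rough illustration of how bounding box coordinates might be used, the sketch below converts one box into pixel coordinates. It assumes the box uses the `Left`, `Top`, `Width`, and `Height` fields of a Textract `Geometry.BoundingBox`, each expressed as a ratio of the page width or height. The function name `to_pixel_box` and the sample values are hypothetical.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a ratio-based bounding box to integer pixel coordinates.

    Assumes Left/Top/Width/Height are fractions (0..1) of the page size.
    Returns (left, top, right, bottom) in pixels.
    """
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return left, top, right, bottom


# Hypothetical box covering the middle of a 1000 x 800 pixel scan.
sample_box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
pixel_box = to_pixel_box(sample_box, image_width=1000, image_height=800)
print(pixel_box)
```

The resulting tuple can be passed to an image library to crop or highlight the region where the text was found.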
Q: What are the most common use cases for Amazon Textract?
A: The most common use cases for Amazon Textract include:
- Import Documents and Forms into Business Applications
- Create Smart Search Indexes
- Build Automated Document Processing Workflows
- Maintain Compliance in Document Archives
- Extract Text for Natural Language Processing (NLP)
- Extract Text for Document Classification
Q: How do I get started with Amazon Textract?
A: If you are not already signed up for the Amazon Textract preview, you can click the "Sign up for the preview" button on the Amazon Textract page and complete the sign-up process. You must have an Amazon Web Services account; if you do not already have one, you will be prompted to create one during the sign-up process. Once you have completed the form, you will receive an email confirming that your request to join the preview has been received.
Q: What APIs does Amazon Textract offer?
A: Amazon Textract offers APIs that detect and extract text from scanned images of documents, extract structured data such as tables, and perform key-value pairing on extracted text. For details, please refer to the Amazon Textract API reference.
Q: What document formats does Amazon Textract support?
A: Amazon Textract currently supports PNG, JPEG, and PDF formats. For synchronous APIs, you can submit images either as an S3 object or as a byte array. For asynchronous APIs, you can submit S3 objects.
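The two input options for synchronous calls can be sketched as follows. This is a minimal example, assuming the `Document` parameter shape used by the boto3 Textract client (`S3Object` with `Bucket` and `Name`, or `Bytes`). The helper names, bucket name, and file name are hypothetical.

```python
def document_from_s3(bucket, key):
    """Build the Document argument for a file stored in Amazon S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def document_from_bytes(data):
    """Build the Document argument from raw image bytes (a byte array)."""
    return {"Bytes": bytes(data)}


def detect_text(client, document):
    """Run synchronous text detection with a Textract client.

    `client` is expected to behave like boto3.client("textract").
    """
    return client.detect_document_text(Document=document)


# Typical usage (requires boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   client = boto3.client("textract")
#   with open("scan.png", "rb") as f:
#       response = detect_text(client, document_from_bytes(f.read()))
#   response = detect_text(client, document_from_s3("my-bucket", "scan.png"))
```

Passing bytes avoids an upload to S3 for one-off images, while the S3 form suits documents that are already stored in a bucket.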
Q: In which AWS regions is Amazon Textract available?
A: Amazon Textract is currently available in the US East (Northern Virginia), US East (Ohio), US West (Oregon), and EU (Ireland) regions.
Detect Document Text API
Q: What is optical character recognition (OCR)?
A: Optical character recognition (OCR) refers to the technology that can recognize and extract text characters from a digital image. OCR is used to recognize and extract text in scanned documents, such as tax forms or invoices.
Amazon Textract performs OCR using the Detect Document Text API, but goes a step further in the document analysis process by also detecting key-value pairs, so that extracted text remains organized in its intended structure.
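To make the OCR output more concrete, here is a small sketch that pulls the detected lines of text out of a response. It assumes the response contains a `Blocks` list in which each detected line has `BlockType` set to `LINE` and a `Text` field, as described in the Amazon Textract API reference. The sample response is made up for illustration.

```python
def extract_lines(response):
    """Return the text of every LINE block, in the order it appears."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for a Detect Document Text response.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 1001"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "WORD", "Text": "1001"},
        {"BlockType": "LINE", "Text": "Total due"},
    ]
}
lines = extract_lines(sample_response)
print(lines)
```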
Q: How do I use the confidence score Amazon Textract provides?
A: A confidence score is a number between 0 and 100 that indicates the probability that a given prediction is correct. Amazon Textract returns a confidence score, along with bounding box coordinates (a rectangular frame that fully encompasses each piece of data identified), for all extracted text and structured data. This allows you to identify the score for each extracted entity so that you can make informed decisions about how you want to use the results.
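One way to act on confidence scores is to split results into accepted and flagged groups. The sketch below uses the 95% threshold from the tax document example and assumes each block carries a `Confidence` field on the 0 to 100 scale. The function name and sample blocks are hypothetical.

```python
def split_by_confidence(blocks, threshold=95.0):
    """Separate blocks into (accepted, flagged) by confidence score.

    Blocks scoring below `threshold` are flagged for review.
    Blocks without a Confidence value are skipped.
    """
    accepted, flagged = [], []
    for block in blocks:
        confidence = block.get("Confidence")
        if confidence is None:
            continue
        if confidence < threshold:
            flagged.append(block)
        else:
            accepted.append(block)
    return accepted, flagged


# Made-up blocks standing in for extracted words.
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "WORD", "Text": "Wages", "Confidence": 99.2},
    {"BlockType": "WORD", "Text": "48100", "Confidence": 87.5},
    {"BlockType": "WORD", "Text": "2019", "Confidence": 95.0},
]
accepted, flagged = split_by_confidence(sample_blocks)
print([b["Text"] for b in flagged])
```

Flagged items could then be routed to a human reviewer, while accepted items flow straight into the downstream application.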
Q: What type of text can Amazon Textract detect and extract?
A: Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols.
Analyze Document API
Q: What is the Analyze Document API?
A: The Analyze Document API can detect text, fields, values, their relationships, tables, and other entities within a document along with their associated confidence scores. With the Analyze Document API, developers can automatically capture structured data from a wide variety of documents including tax forms, financial reports, medical records, and loan applications.
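The relationship between fields and values can be illustrated with a short sketch that rebuilds key-value pairs from a response. It assumes the block layout described in the Amazon Textract API reference for form data: `KEY_VALUE_SET` blocks whose `EntityTypes` contain `KEY` or `VALUE`, linked through `Relationships` of type `VALUE` and `CHILD`. The helper names and sample data are hypothetical.

```python
def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks that are children of `block`."""
    words = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Map each detected field name to the text of its value."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        values = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                for value_id in relationship["Ids"]:
                    values.append(block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[block_text(block, blocks_by_id)] = " ".join(values)
    return pairs


# Hand-written stand-in for an Analyze Document response with one form field.
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    ]
}
fields = extract_key_values(sample_response)
print(fields)
```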
Q: How does Amazon Textract count the number of pages processed?
A: An image (PNG or JPEG) counts as a single page. For PDFs, each page in the document is counted as a page processed.
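The counting rule can be expressed as a short sketch. The function name `pages_processed` is hypothetical, and the PDF page count is assumed to be supplied by the caller (for example, from a PDF library) because the sketch does not parse files itself.

```python
import os


def pages_processed(filename, pdf_page_count=None):
    """Return how many pages a file counts as under the rule above.

    Images (PNG or JPEG) count as one page. PDFs count once per page.
    """
    extension = os.path.splitext(filename)[1].lower()
    if extension in (".png", ".jpg", ".jpeg"):
        return 1
    if extension == ".pdf":
        if pdf_page_count is None:
            raise ValueError("pdf_page_count is required for PDF files")
        return pdf_page_count
    raise ValueError(f"unsupported format: {filename}")


# Hypothetical batch: two images and one 12-page PDF.
batch = [("receipt.png", None), ("photo.JPG", None), ("contract.pdf", 12)]
total_pages = sum(pages_processed(name, count) for name, count in batch)
print(total_pages)
```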
Q: Which APIs am I charged for with Amazon Textract?
A: Refer to the Amazon Textract pricing page to learn more about pricing.
Q: How much does Amazon Textract cost?
A: Amazon Textract charges you based on the number of pages and images processed. For more information, visit the pricing page.
Q: Does Amazon Textract participate in the AWS Free Tier?
A: Yes. As part of the AWS Free Usage Tier, you can get started with Amazon Textract for free. For the first three months, new customers can analyze up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API.
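As a worked example of the free tier allowance, the sketch below estimates how many pages in a month fall outside it. The limits come from the answer above. The function name, the dictionary keys, and the convention of counting months from zero are assumptions made for illustration, and the pricing page remains the authoritative source.

```python
# Monthly free allowances stated above, keyed by hypothetical API labels.
FREE_PAGES_PER_MONTH = {"detect_document_text": 1000, "analyze_document": 100}
FREE_TIER_MONTHS = 3


def billable_pages(api, pages_this_month, months_since_signup):
    """Return the pages in a month that exceed the free allowance.

    `months_since_signup` is 0 for the first month of use (an assumption).
    """
    if months_since_signup < FREE_TIER_MONTHS:
        allowance = FREE_PAGES_PER_MONTH[api]
    else:
        allowance = 0
    return max(0, pages_this_month - allowance)


print(billable_pages("analyze_document", 250, months_since_signup=0))
print(billable_pages("detect_document_text", 800, months_since_signup=1))
print(billable_pages("detect_document_text", 800, months_since_signup=3))
```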
Q: Do your prices include taxes?
A: For details on taxes, please see Amazon Web Services Tax Help.
Integrating Amazon Textract with other AWS Services
Q: How does Amazon Textract work with other AWS products?
A: Amazon Textract integrates with Amazon S3, AWS Lambda, AWS Batch, and Amazon Elasticsearch Service to enable intelligent searches on data stored in your documents. For large volumes of documents, you can use AWS Batch to call the Amazon Textract APIs to extract text and analyze documents asynchronously, without being throttled for sending too many concurrent requests. You can use Amazon Elasticsearch Service to search intelligently through the documents based on the responses from the Amazon Textract APIs. Additionally, Amazon Textract's layout and structure analysis can be augmented with Amazon Comprehend's entity recognition feature to better analyze your customers' documents.
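The asynchronous flow mentioned above can be sketched as a start-then-poll loop. This is a minimal example, assuming the boto3 Textract operations `start_document_text_detection` and `get_document_text_detection` and their `JobId`, `JobStatus`, `Blocks`, and `NextToken` fields. The function name, bucket, and key are hypothetical, and `FakeTextractClient` is an offline stand-in so the sketch runs without AWS access.

```python
import time


def run_text_detection_job(client, bucket, key, poll_seconds=5.0, max_polls=60):
    """Start an asynchronous text detection job and collect all result blocks."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state or we give up.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        status = result["JobStatus"]
        if status == "SUCCEEDED":
            break
        if status != "IN_PROGRESS":
            raise RuntimeError(f"job {job_id} ended with status {status}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"job {job_id} did not finish in time")

    # Results may be paginated, so follow NextToken until it is absent.
    blocks = list(result.get("Blocks", []))
    next_token = result.get("NextToken")
    while next_token:
        result = client.get_document_text_detection(JobId=job_id, NextToken=next_token)
        blocks.extend(result.get("Blocks", []))
        next_token = result.get("NextToken")
    return blocks


class FakeTextractClient:
    """Offline stand-in for boto3.client("textract"), for demonstration only."""

    def __init__(self):
        self.calls = 0

    def start_document_text_detection(self, DocumentLocation):
        return {"JobId": "job-1"}

    def get_document_text_detection(self, JobId, NextToken=None):
        self.calls += 1
        if self.calls == 1:
            return {"JobStatus": "IN_PROGRESS"}
        if NextToken is None:
            return {"JobStatus": "SUCCEEDED", "NextToken": "t1",
                    "Blocks": [{"BlockType": "LINE", "Text": "page 1"}]}
        return {"JobStatus": "SUCCEEDED",
                "Blocks": [{"BlockType": "LINE", "Text": "page 2"}]}


blocks = run_text_detection_job(
    FakeTextractClient(), "my-bucket", "scan.pdf", poll_seconds=0
)
print([block["Text"] for block in blocks])
```

In a real deployment the fake client would be replaced with `boto3.client("textract")`, and the polling interval tuned so that many concurrent jobs do not exceed request limits.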
Q: Are document and image inputs processed by Amazon Textract stored, and how are they used by AWS?
A: Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract and other Amazon machine-learning/artificial-intelligence technologies. Use of your content is necessary for continuous improvement of your Amazon Textract customer experience, including the development and training of related technologies. We do not use any personally identifiable information that may be contained in your content to target products, services or marketing to you or your end users. Your trust, privacy, and the security of your content are our highest priority and we implement appropriate and sophisticated technical and physical controls, including encryption at rest and in transit, designed to prevent unauthorized access to, or disclosure of, your content and ensure that our use complies with our commitments to you. Please see https://aws.amazon.com/compliance/data-privacy-faq/ for more information.
Q: Who has access to my content that is processed and stored by Amazon Textract?
A: Only authorized employees will have access to your content that is processed by Amazon Textract. Your trust, privacy, and the security of your content are our highest priority, and we implement appropriate and sophisticated technical and physical controls, including encryption at rest and in transit, designed to prevent unauthorized access to, or disclosure of, your content and ensure that our use complies with our commitments to you. Please see https://aws.amazon.com/compliance/data-privacy-faq/ for more information.
Q: Can I delete images and documents stored by Amazon Textract?
A: Yes. You can request deletion of document and image inputs associated with your account by contacting AWS Support. Deleting image and document inputs may degrade your Amazon Textract experience.
Q: Do I still own my content that is processed and stored by Amazon Textract?
A: Yes. You always retain ownership of your content, and we will only use your content with your consent.