AWS Machine Learning Blog
Amazon Textract becomes PCI DSS certified, and retrieves even more data from tables and forms
Amazon Textract automatically extracts text and data from scanned documents, and goes beyond simple optical character recognition (OCR) to also identify the contents of fields and information in tables, without templates, configuration, or machine learning experience required. Customers such as Intuit, PitchBook, Change Healthcare, Alfresco, and more are already using Amazon Textract to automate their document processing workflows so that they can accurately process millions of pages in hours. Additionally, you can create smart search indexes, build automated approval workflows, and better maintain compliance with document archival rules by flagging data that may require redaction.
Today, Amazon Web Services (AWS) announced that Amazon Textract is now PCI DSS certified. This means that you can now use Amazon Textract for workloads that must comply with the Payment Card Industry Data Security Standard (PCI DSS), such as those that handle cardholder data (CHD) or sensitive authentication data (SAD). You can also process protected health information (PHI) workloads on Amazon Textract, because it is a HIPAA eligible service. Also starting today, AWS has launched new quality enhancements so you can retrieve even more data from tables (structured data organized into rigid rows and columns) and forms (structured data represented as key-value pairs and selectable elements such as check boxes and radio buttons).
Amazon Textract now retrieves more data, more accurately, from complex tables that contain split cells and merged cells. It also identifies rows and columns for cells with wrapped text (text that spans multiple lines) more accurately, even for tables without explicitly drawn borders. In addition, it more accurately retrieves form data from documents that contain tables on the same page, and key-value pairs that are nested within a table. These enhancements build upon an update launched in October 2019 to improve the accuracy of text retrieval and to more accurately correct the rotation and deformation present in documents with imperfect scans.
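As a rough illustration of how form data comes back, the following Python sketch pairs keys with values from the `Blocks` list that the Amazon Textract analysis operations return when the `FORMS` feature is requested. The function name `key_value_pairs` is hypothetical and chosen for this example. The sketch assumes the documented block layout: `KEY_VALUE_SET` blocks whose `EntityTypes` contains `KEY`, linked to their value blocks through a `VALUE` relationship and to their words through `CHILD` relationships. It works on an already-retrieved response, so it does not call AWS itself.

```python
def key_value_pairs(blocks):
    """Return {key text: value text} from a list of Textract blocks.

    Hypothetical helper for illustration. If the same key text appears
    more than once on a page, later occurrences overwrite earlier ones.
    """
    by_id = {block["Id"]: block for block in blocks}

    def related_ids(block, rel_type):
        # Collect the Ids of all relationships of the requested type.
        return [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == rel_type
            for block_id in rel["Ids"]
        ]

    def text_of(block):
        # Join the words (or selection states) that make up a key or value.
        parts = []
        for child_id in related_ids(block, "CHILD"):
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
        return " ".join(parts)

    pairs = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if is_key:
            values = [text_of(by_id[v]) for v in related_ids(block, "VALUE")]
            pairs[text_of(block)] = " ".join(values)
    return pairs
```

With the AWS SDK for Python, the `blocks` argument would typically be `response["Blocks"]` from a call such as `analyze_document(..., FeatureTypes=["TABLES", "FORMS"])`.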
To illustrate the benefits of these new quality enhancements, this post analyzes one of the Acord forms that is common in the insurance industry: the Acord 25. This document often contains multiple tables: a table to represent different insurers, a second table to list liability types, and a third table to capture coverage limits. There may be multiple key-value pairs and check boxes to capture information about the insured, the liability conditions, and more.
The following Acord 25 document contains fictitious content for illustrative purposes.
The following image shows the original output you would have previously received. Amazon Textract correctly identified two of the primary tables and determined that the section on the right side of the second table was not part of that same table. However, it failed to identify that section as a third table.
The following image shows the output from Amazon Textract with its new updates. It now performs better on complex table structures with merged cells and wrapped text, and correctly identifies all three tables.
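To check a result like this programmatically, you can count the tables in a response and rebuild each one as rows of text. The following Python sketch is one way to do that, assuming the documented block layout in which a `TABLE` block lists its `CELL` blocks as `CHILD` relationships and each cell carries 1-based `RowIndex` and `ColumnIndex` values. The function names `tables_from_blocks` and `analyze_file` are hypothetical, and the sketch does not handle merged cells specially: a merged cell simply appears at its starting row and column.

```python
from collections import defaultdict


def tables_from_blocks(blocks):
    """Rebuild every TABLE block as a list of rows of cell text."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    def cell_text(cell):
        # A cell's children are words and, for check boxes, selection elements.
        parts = []
        for child_id in child_ids(cell):
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                selected = child["SelectionStatus"] == "SELECTED"
                parts.append("[X]" if selected else "[ ]")
        return " ".join(parts)

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        grid = defaultdict(dict)  # row index -> {column index: text}
        for child_id in child_ids(block):
            cell = by_id[child_id]
            if cell["BlockType"] == "CELL":
                grid[cell["RowIndex"]][cell["ColumnIndex"]] = cell_text(cell)
        n_cols = max((max(row) for row in grid.values()), default=0)
        tables.append([
            [grid[r].get(c, "") for c in range(1, n_cols + 1)]
            for r in sorted(grid)
        ])
    return tables


def analyze_file(path):
    """Send a local image to Amazon Textract and return its tables.

    Requires boto3 and AWS credentials; imported here so the parsing
    function above can be used without either.
    """
    import boto3

    client = boto3.client("textract")
    with open(path, "rb") as handle:
        response = client.analyze_document(
            Document={"Bytes": handle.read()},
            FeatureTypes=["TABLES", "FORMS"],
        )
    return tables_from_blocks(response["Blocks"])
```

For a document like the Acord 25 example, `len(tables_from_blocks(response["Blocks"]))` gives the number of tables detected, which makes it straightforward to compare outputs before and after an update.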
Customers using Amazon Textract
PitchBook, MSP Recovery, and Filevine are customers using Amazon Textract, and have shared their experiences with AWS.
PitchBook is the leading provider of data in the private capital markets, specifically VC, PE, and M&A. As a part of that market, a portion of their data comes from surveys, particularly in PDF. PitchBook started using Amazon Textract to improve this part of their research process. “Before using Amazon Textract, this process took hundreds of manual hours going through PDFs and manually entering information as it came in,” says Tyler Martinez, Director of Data Science and Software Engineering at PitchBook. “With Amazon Textract, we have seen gains as high as 60% in our process. We’re hoping to use Amazon Textract in other areas that may improve our data collection processes as well.”
MSP Recovery offers a comprehensive healthcare claims platform to determine primary payment responsibility among multiple insurance carriers. “Amazon Textract is very impressive,” said Franklin Perez, Head of Software Development at MSP Recovery. “We decided to use Amazon Textract to detect different document formats to process information and data properly and efficiently. The feature is designed to have the ability to recognize the various different formats it’s pulling text from, whether this is tables or forms, which is an AI dream come true for us. We needed a solution that would be scalable to various documents, as we receive different document types on a regular basis and need to be efficient at reading them. With a lean team, we are able to allow the machine learning to handle the heavy lifting by automating reading thousands of documents, allowing our team to focus on higher-order assignments.”
Filevine is the operating core for legal professionals, including cloud-based case and matter management, document management, and in-depth reporting analytics. From its launch in 2015, Filevine focused on rapid innovation and award-winning design, and earned the highest ratings from independent review sites. “Millions of matters and case files are handled in Filevine every day,” says Ryan Anderson, Chief Executive Officer at Filevine. “We chose Amazon Web Services because we wanted to deliver best-in-class document search solutions for our customers. Amazon Textract is fast, accurate, and scalable—it helps Filevine meet the exacting requirements of the world’s largest and most sophisticated legal organizations. With Filevine and Amazon, finding the proverbial needle in the haystack has never been easier for legal professionals.”
With the newest improvements to Amazon Textract, you can retrieve more information from the same document, with more accuracy. And Amazon Textract continues to improve; at AWS re:Invent 2019, AWS announced a public preview of Amazon Textract’s integration with the Amazon Augmented Artificial Intelligence service for the forms features. This enables you to apply human validation on your AI inference output from Amazon Textract. Amazon Textract has also increased the file size limit for synchronous APIs to 10 MB. You can also continue to use asynchronous APIs to process files up to 500 MB each. For more information, see the video AWS re:Invent 2019: [REPEAT] AI document processing for business automation on YouTube.
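The two size limits suggest a simple routing rule for a processing pipeline. The following Python sketch picks between the synchronous and asynchronous paths based on file size. The function name `choose_api` and the constants are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption; check the current Amazon Textract quotas before relying on the exact thresholds.

```python
MB = 1024 * 1024  # assumption: binary megabytes

SYNC_LIMIT_BYTES = 10 * MB    # synchronous limit stated in this post
ASYNC_LIMIT_BYTES = 500 * MB  # asynchronous limit stated in this post


def choose_api(size_bytes):
    """Return "sync" or "async" for a file of the given size in bytes.

    Raises ValueError if the size is negative or above the asynchronous limit.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    if size_bytes <= SYNC_LIMIT_BYTES:
        return "sync"
    if size_bytes <= ASYNC_LIMIT_BYTES:
        return "async"
    raise ValueError("file exceeds the 500 MB asynchronous limit")
```

In practice you might call `choose_api(os.path.getsize(path))` before deciding which Amazon Textract operation to invoke. Size is only one factor: the asynchronous operations read their input from Amazon S3, so a pipeline that already stores documents there may prefer the asynchronous path regardless of size.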
You can get started with Amazon Textract today. Try Amazon Textract with your images or PDF documents and get high-quality results in seconds.
About the Author
Kriti Bharti is the Product Lead for Amazon Textract. Kriti has over 15 years’ experience in Product Management, Program Management, and Technology Management across multiple industries such as Healthcare, Banking and Finance, and Retail. In her spare time, you can find Kriti spending pawsome time with Fifi and her cousins, reading, or learning different dance forms.