AWS Startups Blog

How Amazon Textract helped Fyle boost data extraction accuracy

Fyle is an intelligent spend management product that helps businesses keep track of their expenses. They eliminate all manual activity around expense management. They know it’s frustrating for employees to fill out expense reports with details like amount, category, etc., because most of these details are already available in the receipt. With their Data-Extractor service, when an employee uploads the receipt, fields like amount, currency, date of spend, category, vendor, etc., fill in automatically.

Originally, the Data-Extractor service relied on an external service provider for optical character recognition (OCR) and on Fyle’s internal machine learning algorithm to detect amount, category, date, currency, and vendor information. Unfortunately, customers reported that the tool wasn’t very accurate. So Fyle rewrote the Data-Extractor service to use Amazon Textract. They chose it for its intuitive web console, which allowed them to test the APIs in real time with their own input. This let them try out the Amazon Textract APIs quickly, which helped them achieve their goal of turning around a solution in two months. After implementing the new solution, Fyle saw a 51.7% improvement in accuracy for the Data-Extractor service.

Testing the original solution

Before embarking on any major changes, Fyle built a test suite to systematically measure the accuracy of the Data-Extractor service. They first built a reference dataset labeled by humans, which took about a month. After that, they compared the output of the Data-Extractor service against the human-labeled dataset. Table 1 shows the results.

Table 1. Data-Extractor’s performance before using Amazon Textract
Receipt feature    Extraction accuracy (%)
Date               43.46
Amount             45.37
Category           16.53
Currency           46.04
Vendor             21.89
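
The post does not describe how the test suite scored a match, so the following is only a minimal sketch of one way to compute per-field accuracy against a human-labeled dataset. The field names, the `normalize` rule (case-insensitive exact match), and the sample receipts are all assumptions made for illustration, not Fyle’s actual implementation or data.

```python
from typing import Dict, List, Optional

# Hypothetical field names; the post lists these five receipt features.
FIELDS = ["date", "amount", "category", "currency", "vendor"]


def normalize(value: Optional[str]) -> str:
    """Assumed matching rule: ignore case and surrounding whitespace."""
    return "" if value is None else str(value).strip().lower()


def field_accuracy(predicted: List[Dict], labeled: List[Dict]) -> Dict[str, float]:
    """Return, per field, the percentage of receipts matching the human label."""
    if not labeled or len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must cover the same non-empty set of receipts")
    accuracy = {}
    for field in FIELDS:
        hits = sum(
            normalize(p.get(field)) == normalize(t.get(field))
            for p, t in zip(predicted, labeled)
        )
        accuracy[field] = round(100.0 * hits / len(labeled), 2)
    return accuracy


# Made-up receipts, purely for illustration.
labeled = [
    {"date": "2022-01-05", "amount": "12.50", "category": "Food", "currency": "USD", "vendor": "Cafe Uno"},
    {"date": "2022-01-09", "amount": "40.00", "category": "Taxi", "currency": "USD", "vendor": "City Cabs"},
]
predicted = [
    {"date": "2022-01-05", "amount": "12.50", "category": "food", "currency": "USD", "vendor": "Cafe"},
    {"date": "2022-01-08", "amount": "40.00", "category": "Travel", "currency": "USD", "vendor": "city cabs"},
]

print(field_accuracy(predicted, labeled))
```

A stricter or looser matching rule (for example, tolerating date format differences) would change the reported numbers, so the rule should be fixed before comparing two versions of a service.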

Admittedly, these numbers are not great. So, Fyle embarked on improving accuracy. They had two options: 1) make incremental improvements or 2) rewrite the service. Fyle was leaning towards a rewrite for maintainability reasons, but then they found Amazon Textract.

Applying Amazon Textract to improve accuracy

Fyle’s team of two engineers working on this project was tasked with producing a solution within two months. This seemed like a difficult ask, especially since they also needed to support the current version of the Data-Extractor service that was in production.

But after finding Amazon Textract, they saw that they could quickly test APIs and that all of Amazon Textract’s API contracts and code examples in different languages are listed with a detailed explanation of possible errors. This made the task considerably easier.

Once they decided to use Amazon Textract, all Fyle had to do was to convert their files into the required format for particular Amazon Textract APIs, AnalyzeExpense and DetectDocumentText, and make a request to the APIs from their code.

Amazon Textract AnalyzeExpense API extracts financial information

The AnalyzeExpense API synchronously analyzes an input document for financially related relationships between text. This helps Fyle extract information from uploaded receipts, including:

  • Vendor Name: VENDOR_NAME
  • Total: TOTAL
  • Receiver Address: RECEIVER_ADDRESS
  • Invoice/Receipt Date: INVOICE_RECEIPT_DATE
  • Invoice/Receipt ID: INVOICE_RECEIPT_ID
  • Payment Terms: PAYMENT_TERMS
  • Subtotal: SUBTOTAL
  • Due Date: DUE_DATE
  • Tax: TAX
  • Invoice Tax Payer ID (SSN/ITIN or EIN): TAX_PAYER_ID
  • Item Name: ITEM_NAME
  • Item Price: PRICE
  • Item Quantity: QUANTITY

Amazon Textract returns each of these fields with a confidence score that indicates how certain it is of the extraction. It also provides the location (geometry coordinates) of the text on the receipt, which users can use to tag features on the receipt.
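
As a rough sketch of what consuming this API can look like from Python, the snippet below calls `analyze_expense` through boto3 and flattens the summary fields into type, value, confidence, and bounding box. The helper names, the `min_confidence` threshold, and the hand-written, heavily trimmed sample response are assumptions for illustration. The post does not show Fyle’s code.

```python
from typing import Any, Dict, List


def analyze_receipt(image_bytes: bytes) -> Dict[str, Any]:
    """Call the synchronous AnalyzeExpense API on raw image bytes (needs AWS credentials)."""
    import boto3  # imported lazily so the parsing helper below works without AWS access

    client = boto3.client("textract")
    return client.analyze_expense(Document={"Bytes": image_bytes})


def summary_fields(response: Dict[str, Any], min_confidence: float = 0.0) -> List[Dict[str, Any]]:
    """Flatten SummaryFields, dropping values below an assumed confidence threshold (0-100)."""
    fields = []
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            value = field.get("ValueDetection", {})
            confidence = value.get("Confidence", 0.0)
            if confidence < min_confidence:
                continue
            fields.append({
                "type": field.get("Type", {}).get("Text"),  # e.g. VENDOR_NAME, TOTAL
                "value": value.get("Text"),
                "confidence": confidence,
                # BoundingBox values are ratios of the page width and height.
                "box": value.get("Geometry", {}).get("BoundingBox"),
            })
    return fields


# Hand-written, trimmed response used only to exercise the parser offline.
sample_response = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME", "Confidence": 99.1},
             "ValueDetection": {"Text": "Cafe Uno", "Confidence": 97.5,
                                "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05,
                                                             "Width": 0.4, "Height": 0.03}}}},
            {"Type": {"Text": "TOTAL", "Confidence": 98.0},
             "ValueDetection": {"Text": "12.50", "Confidence": 62.0,
                                "Geometry": {"BoundingBox": {"Left": 0.6, "Top": 0.8,
                                                             "Width": 0.2, "Height": 0.03}}}},
        ]
    }]
}

for field in summary_fields(sample_response, min_confidence=80.0):
    print(field["type"], field["value"], field["confidence"])
```

Filtering on confidence is one possible design choice: low-confidence values could instead be kept and flagged so the employee can confirm them.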

DetectDocumentText

The DetectDocumentText API detects text in the input document. This gives Fyle detailed OCR of entire receipts. The OCR output is returned as lines and words, each with a confidence score for the extraction.

Figure 1 shows a sample output.

Figure 1. Sample output of Amazon Textract DetectDocumentText
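
To make the lines-and-words structure concrete, here is a minimal sketch that calls `detect_document_text` through boto3 and splits the returned blocks by type. The helper names and the hand-written sample response are assumptions for illustration, and the sample omits most of the keys that a real response carries.

```python
from typing import Any, Dict, List, Tuple

TextWithConfidence = Tuple[str, float]


def detect_text(image_bytes: bytes) -> Dict[str, Any]:
    """Call the synchronous DetectDocumentText API on raw image bytes (needs AWS credentials)."""
    import boto3  # imported lazily so the parsing helper below works without AWS access

    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": image_bytes})


def lines_and_words(response: Dict[str, Any]) -> Tuple[List[TextWithConfidence], List[TextWithConfidence]]:
    """Split the response blocks into LINE and WORD entries, each with its confidence."""
    lines: List[TextWithConfidence] = []
    words: List[TextWithConfidence] = []
    for block in response.get("Blocks", []):
        entry = (block.get("Text", ""), block.get("Confidence", 0.0))
        if block.get("BlockType") == "LINE":
            lines.append(entry)
        elif block.get("BlockType") == "WORD":
            words.append(entry)
        # PAGE blocks carry no text of their own and are skipped.
    return lines, words


# Hand-written, trimmed response used only to exercise the parser offline.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Cafe Uno", "Confidence": 99.2},
        {"BlockType": "WORD", "Text": "Cafe", "Confidence": 99.5},
        {"BlockType": "WORD", "Text": "Uno", "Confidence": 98.9},
    ]
}

lines, words = lines_and_words(sample_response)
print(lines)
print(words)
```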

Fyle rebuilt their data extraction service on top of Amazon Textract. This let them write a considerably more maintainable version of the Data-Extractor service that works across paper and digital receipts, with automated tests and 87% code coverage.

After applying Amazon Textract, they measured the new Data-Extractor service against the human-labeled dataset. Table 2 shows the results.

Table 2. Data-Extractor’s performance after using Amazon Textract
Receipt feature    Extraction accuracy (%)
Date               74.85
Amount             68.84
Category           44.47
Currency           72.18
Vendor             47.12

Figure 2 shows a visualization of the change in accuracy after the Amazon Textract integration.

Figure 2. Old vs new accuracy for receipts
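
The per-feature change can also be read off Tables 1 and 2 by subtraction. The short sketch below does that arithmetic and reports each gain in percentage points. It only restates the published table values and is not a recomputation of the overall improvement figure.

```python
# Accuracy values copied from Table 1 (before) and Table 2 (after), in percent.
before = {"Date": 43.46, "Amount": 45.37, "Category": 16.53, "Currency": 46.04, "Vendor": 21.89}
after = {"Date": 74.85, "Amount": 68.84, "Category": 44.47, "Currency": 72.18, "Vendor": 47.12}

# Gain in percentage points for each receipt feature.
gains = {feature: round(after[feature] - before[feature], 2) for feature in before}

for feature, gain in gains.items():
    print(f"{feature:<9}{before[feature]:>6.2f} -> {after[feature]:>6.2f}  (+{gain:.2f} points)")
```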

Fyle rolled out the change to all their customers over a 10-day period without any major issues. Customers even noticed the difference and sent thank you notes on the improvements.

Conclusion

With Amazon Textract and good engineering practices, Fyle wrote a considerably more maintainable version of the Data-Extractor service with a small team. It took them a few steps further on their mission of removing the drudgery associated with expense management for their customers.

Madhav Mansuriya

Madhav Mansuriya is a senior member of technical staff at Fyle. He was the lead engineer of the project.

Vikas Prasad

Vikas Prasad is an engineering manager and has been part of the core team at Fyle for many years. He planned and managed the entire rewrite, including the testing and evaluation of Amazon Textract.