Training a machine learning model to automate expense reporting
This is a guest post by Dale Brown, VP of Business Development, Figure Eight
Before you can build a successful machine learning (ML) algorithm that works in the real world, you must ensure that your data is appropriately annotated by human annotators. Creating accurate training data helps ensure better accuracy when you launch your ML models into production. Accurate training data also makes it more likely that you are creating a model that achieves your business goal in the real world. Without the ingestion of quality training data, ML models are useless. Poorly-labeled training data leads to poor-performing algorithms.
In the following tutorial, I will show how to use the Figure Eight Data Labeling Platform to obtain accurate, annotated data to train an ML model. I will also show how to use Figure Eight to extract specific entities such as transaction amount and date from paper receipts. This training data will feed into an automated expense reimbursement model.
To follow this tutorial, you need:
- An AWS account.
- Raw source data that needs annotation.
Data science or machine learning expertise is not required. For an easy procurement process, the Figure Eight Data Labeling Platform is available to AWS customers in AWS Marketplace.
How to train your machine learning model to automate expense reporting
Subscribe and log in
- Subscribe to the Figure Eight Data Labeling Platform in AWS Marketplace. Review the Figure Eight listing to ensure it covers the volume and user requirements needed for your data labeling initiative. If not, please contact Figure Eight to learn more about a Private Offer. Select Continue to Subscribe.
- Log in to the Figure Eight platform using the credentials provided by Figure Eight via email.
Create and design your job
Once you have logged into the Figure Eight platform with your emailed credentials, select Create job. Select a job template, or you can start a job from scratch. The following screenshot shows some of your template choices, including Sentiment Analysis, Search Relevance, Data Categorization, Data Collection & Enrichment, Data Validation, Image Annotation, Transcription, and Content Moderation.
- For the purposes of demonstration, I am going to select a transcription template for my automated expense reimbursement job. We offer a choice of basic and advanced transcription templates. The choice depends on the type of information you would like to receive from the image. If you are unsure which template to use, select Preview to see an example of each template to determine which is best for your use case.
- Next, drag and drop your data into the platform. Figure Eight supports .csv, .tsv, .xlsx, and .ods formats. Note that your team maintains ownership and governance over all source data. The data that your team supplies never leaves your servers. For added security, private buckets can be used while processing training data in the Figure Eight platform.
- Select Design your job. You can edit your job design via the graphical editor or code editor. I’m going to use the graphical editor. On the Design page, enter your instructions for the annotators in the text box to specify what you would like annotated. For my automated expense reimbursement job, in the Overview section I entered the task definition: In this task, you will review purchase receipts for business names and dates. In the Steps section, I entered the following:
- Determine if you are able to read the text in the image.
- If yes, enter the Business Name on the receipt and the Date of the receipt in the provided boxes.
In the Rules & Tips section, I entered the following:
- Anything that has a date and an amount should be considered a receipt. Screenshots, handwritten receipts, emails, and invoices are all acceptable receipts.
- Ignore any handwritten elements when considering the text’s readability.
- Do not include any punctuation.
- Enter the date in this format: YYYY-MM-DD.
In the menu on the right side of the page, select Preview to see what your job looks like from an annotator’s point of view.
When you are done designing your job, select Next to proceed to the Quality tab.
Test and launch your job
You can ensure annotation quality by creating a golden data set using test questions. Test questions are a quality control measure within the Figure Eight platform. To do so, annotate a subset of your own data. The platform inserts your annotations in each job that launches to test your annotators’ work against your quality threshold.
- Select the Create Test Questions. You will be asked to annotate a subset of your data. Figure Eight will insert those annotations in each job that launches in the platform. The annotations will test contributors to see if they have annotated the work to your quality threshold. If you want a large amount of golden data but you don’t want to take the time to do the annotation yourself, you can use the test question generator directly in the platform. Unlock this feature by doing reaching out to our Customer Success team via email.
- In the upper right side of the screen, select Launch Your Job to begin the annotation process. You can track the progress from the platform on the Monitor page and pull your results via an API or CSV on the Results tab. These results can easily be exported and used in ML platforms such as Amazon SageMaker to build and deploy your ML models.
In this post, I showed how to use the Figure Eight Data Labeling Platform to label data for creating an automated expense reimbursement model. I also demonstrated how to create an ML data labeling job in order to obtain accurate, annotated training data.
Resources and next steps
To find out more about Figure Eight, visit the Figure Eight seller page in AWS Marketplace. To find out more and subscribe to Figure Eight Data Labeling Platform, visit the product listing in AWS Marketplace.
About the author
Dale Brown is the VP of Business Development, Head of Operations, and part of the executive team at Figure Eight.
The content and opinions in this post are those of the third-party author, and AWS is not responsible for the content or accuracy of this post.