AWS Architecture Blog
Running a Cost-effective NLP Pipeline on Serverless Infrastructure at Scale
Amenity Analytics develops enterprise natural language processing (NLP) platforms for the finance, insurance, and media industries that extract critical insights from mountains of documents. We provide a scalable way for businesses to get a human-level understanding of information from text.
In this blog post, we will show how Amenity Analytics improved the continuous integration (CI) pipeline speed by 15x. We hope that this example can help other customers achieve high scalability using AWS Step Functions Express.
Amenity Analytics’ models are developed using both a test-driven development (TDD) and a behavior-driven development (BDD) approach. We verify the model accuracy throughout the model lifecycle—from creation to production, and on to maintenance.
One of the actions in the Amenity Analytics model development cycle is backtesting. It is an important part of our CI process. This process consists of two steps running in parallel:
- Unit tests (TDD): check that the code performs as expected
- Backtesting tests (BDD): validate that the precision and recall of our models are similar to or better than those of previous versions
The backtesting process utilizes hundreds of thousands of annotated examples in each “code build.” To accomplish this, we initially used the AWS Step Functions default workflow. AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Workflows manage failures, retries, parallelization, service integrations, and observability.
Challenge with the existing Step Functions solution
We found that Step Functions standard workflows enforce a state transition quota implemented as a token bucket: a bucket size of 5,000 transitions with a refill rate of 1,500 transitions per second. Each annotated example requires ~10 state transitions, which adds up to millions of state transitions per code build, so we often faced delays and timeouts against the default quota. Although the quota could be increased, developers wanted a better solution than coordinating their builds with each other, which slowed down the development cycle.
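A rough back-of-the-envelope calculation shows why the default quota became a bottleneck. The bucket size and refill rate are the documented Standard Workflows defaults; the 200,000-example figure is an assumption for illustration:

```python
# Step Functions standard workflows: state transition token bucket (default quota)
BUCKET_SIZE = 5_000          # tokens available immediately
REFILL_PER_SECOND = 1_500    # tokens added back per second

# Illustrative assumption: 200,000 annotated examples, ~10 transitions each
examples = 200_000
transitions = examples * 10  # 2,000,000 state transitions per code build

# Minimum time spent just waiting for quota refill (ignoring the actual work)
throttle_seconds = max(0, transitions - BUCKET_SIZE) / REFILL_PER_SECOND
print(f"~{throttle_seconds / 60:.0f} minutes of quota refill alone")  # ~22 minutes
```

Even before any real computation, a single build of this size spends tens of minutes waiting on the quota, and concurrent builds compete for the same bucket.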
To resolve these challenges, we migrated from Step Functions standard workflows to Step Functions Express workflows, which have no limits on state transitions. In addition, we changed the way each step in the pipeline is initiated, from an async call to a sync API call.
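That change maps directly onto the Step Functions API: standard workflows are started with `StartExecution` (asynchronous, fire-and-forget), while Express workflows can also be invoked with `StartSyncExecution`, which blocks until the workflow finishes and returns its output in the same response. A minimal boto3-style sketch, where the payload shape and function names are placeholders:

```python
import json

# `sfn` is a boto3 Step Functions client, e.g. boto3.client("stepfunctions").

def build_input(document_id: str) -> str:
    """Serialize the workflow input payload (shape is a placeholder)."""
    return json.dumps({"documentId": document_id})

def start_backtest_async(sfn, state_machine_arn: str, document_id: str) -> str:
    """Standard workflow: returns immediately with an execution ARN; the
    result must be polled or delivered through another channel."""
    resp = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=build_input(document_id),
    )
    return resp["executionArn"]

def start_backtest_sync(sfn, state_machine_arn: str, document_id: str) -> dict:
    """Express workflow: blocks until completion and returns the output
    directly, so no extra result-delivery plumbing is needed."""
    resp = sfn.start_sync_execution(
        stateMachineArn=state_machine_arn,
        input=build_input(document_id),
    )
    return json.loads(resp["output"])  # present only when status == "SUCCEEDED"
```

The synchronous call removes a whole class of "wait for the result" machinery from the caller's side, at the cost of holding a connection open for the duration of the workflow.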
Step Functions Express workflow solution
When a model developer merges their new changes, the CI process starts the backtesting for all existing models.
- Each model is checked to see if its annotated examples were already uploaded and saved in the Amazon Simple Storage Service (Amazon S3) cache. The check uses a unique key representing the list of items. Once a model is reviewed, its review items rarely change.
- If the review items haven’t been uploaded yet, it uploads them and initiates an unarchive process. This way the review items can be used in the next phase.
- When the items are uploaded, an API call is made through Amazon API Gateway with the review item keys (see Figure 1).
- The request is forwarded to an AWS Lambda function. It is responsible for validating the request and sending a job message to an Amazon Simple Queue Service (SQS) queue.
- The SQS messages are consumed by concurrent Lambda functions, which synchronously invoke a Step Functions Express workflow. The Lambda concurrency is capped to ensure it doesn't exceed its limit in the production environment.
- When an item finishes in the Step Functions workflow, a notification message is created and inserted into an SQS queue. These messages are consumed in batches by a Lambda function, which then sends an AWS IoT message containing all relevant messages for each individual user.
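The consumer side of this flow can be sketched as a single Lambda handler. Everything here is an illustrative assumption, not the actual Amenity Analytics implementation: the message fields, the IoT topic scheme, and the injected clients (`sfn` would be a boto3 Step Functions client, `iot` a boto3 `iot-data` client):

```python
import json

def handler(event, context, sfn, iot):
    """Consume a batch of SQS job messages, run each job synchronously
    through the Express workflow, and notify the requesting user over IoT."""
    for record in event["Records"]:          # SQS delivers a batch of messages
        job = json.loads(record["body"])
        resp = sfn.start_sync_execution(     # blocks until the workflow finishes
            stateMachineArn=job["stateMachineArn"],
            input=json.dumps({"itemKeys": job["itemKeys"]}),
        )
        # Push the result to the user's topic so the client updates in real time
        iot.publish(
            topic=f"backtest/results/{job['userId']}",  # hypothetical topic scheme
            qos=1,
            payload=json.dumps({"status": resp["status"]}),
        )
```

In a real Lambda deployment the clients would be created once at module scope; they are parameters here only to keep the sketch testable.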
Main Step Function Express workflow pipeline
Step Functions Express workflows support only synchronous (request-response) service integrations. Therefore, we replaced the previous asynchronous Amazon Simple Notification Service (SNS) and Amazon SQS calls with synchronous calls through API Gateway.
Figure 2 shows the workflow for a single document in Step Function Express:
- Generate a document ID for the current iteration
- Perform base NLP analysis by calling another Step Functions Express workflow wrapped by API Gateway
- Reformat the response to be the same as all other “logic” steps results
- Verify the result with a “Choice” state: if it failed, go to the end; otherwise, continue
- Perform the Amenity core NLP analysis in three model invocations: Group, Patterns, and Business Logic (BL)
- For each of the model runtime steps:
  - Check if the result is correct
  - If it failed, go to the end; otherwise, continue
- Return a formatted result at the end
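The steps above can be sketched in Amazon States Language, shown here as a Python dict. This is an illustrative reduction, not the actual definition: state names are invented, and the per-model Choice checks after Patterns and Business Logic are elided for brevity.

```python
# Illustrative ASL sketch of the main Express pipeline for a single document.
# Each NLP step calls a downstream service through API Gateway (request-response),
# and a Choice state after each step routes failures straight to the end.
definition = {
    "StartAt": "GenerateDocumentId",
    "States": {
        "GenerateDocumentId": {"Type": "Pass", "Next": "BaseNlpAnalysis"},
        "BaseNlpAnalysis": {
            "Type": "Task",
            # Sync call to the base-NLP Express workflow wrapped by API Gateway
            "Resource": "arn:aws:states:::apigateway:invoke",
            "Next": "ReformatResponse",
        },
        "ReformatResponse": {"Type": "Pass", "Next": "CheckBaseResult"},
        "CheckBaseResult": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.ok", "BooleanEquals": True, "Next": "GroupModel"}],
            "Default": "FormatResult",  # failed -> skip the remaining steps
        },
        # Core NLP analysis: Group -> Patterns -> Business Logic, each followed
        # by its own Choice check (elided here for brevity)
        "GroupModel": {
            "Type": "Task",
            "Resource": "arn:aws:states:::apigateway:invoke",
            "Next": "FormatResult",
        },
        "FormatResult": {"Type": "Pass", "End": True},
    },
}
```

Keeping every step behind a Choice gate means a single failed document short-circuits cleanly to a formatted error result instead of failing the whole batch.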
Base NLP analysis Step Function Express
For our base NLP analysis, we use spaCy. Figure 3 shows how we use it in Step Functions Express:
- Check if the text exists in the cache (meaning it has been previously analyzed)
- If it exists, return the cached result
- If it doesn’t exist, split the text into manageable parts (spaCy has text size limitations)
- All the text parts are analyzed in parallel by spaCy
- Merge the results into a single, analyzed document and save it to the cache
- If there was an exception during the process, it is handled in the “HandleStepFunctionExceptionState”
- Send a reference to the analyzed document if successful
- Send an error message if there was an exception
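spaCy enforces a configurable maximum text length (`nlp.max_length`, 1,000,000 characters by default), so oversized documents must be split before analysis. A dependency-free sketch of such a splitter, where the limit and the newline-boundary heuristic are assumptions and the real pipeline may split differently:

```python
def split_text(text: str, limit: int = 1_000_000) -> list[str]:
    """Split text into chunks no longer than `limit`, preferring to break
    at the last newline inside each window so paragraphs stay intact."""
    parts = []
    while len(text) > limit:
        cut = text.rfind("\n", 0, limit)  # prefer a paragraph boundary
        if cut <= 0:
            cut = limit                   # no boundary found: hard split
        parts.append(text[:cut])
        text = text[cut:].lstrip("\n")
    if text:
        parts.append(text)
    return parts

# The parts can then be analyzed in parallel (e.g. with spaCy's nlp.pipe)
# and the per-part results merged back into a single analyzed document.
```

Splitting on paragraph boundaries keeps sentences whole, which matters because an NLP model's output near a hard mid-sentence cut is unreliable.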
Our backtesting migration was deployed on August 10, and the unit testing migration on September 14. After the first migration, CI was limited by the unit tests, which took ~25 minutes. After the second migration was deployed, the process time dropped to ~6 minutes (P95).
By migrating from standard Step Functions workflows to Step Functions Express, Amenity Analytics increased processing speed 15x. A complete pipeline that used to take ~45 minutes with standard workflows now takes ~3 minutes with Express workflows. This migration removed the need for users to coordinate workflow processes to create a build. Unit testing (TDD) went from ~25 minutes to ~30 seconds. Backtesting (BDD) went from more than 1 hour to ~6 minutes.
Switching to Step Functions Express allows us to focus on delivering business value faster. We will continue to explore how AWS services can help us drive more value to our users.
For further reading: