AWS Partner Network (APN) Blog

Reducing Inference Times by 87% for Darwinbox’s Talent Search Engine Using AWS Inferentia

By Rony K Roy, Sr. Specialist Partner Solutions Architect, AI/ML – AWS
By Anirban De, Lead Data Analytics and AI – Minfy Technologies
By Blesson Davis, Data Scientist – Minfy Technologies
By Manish Dixit, Sr. Partner Management SA – AWS


Darwinbox is a new-age, enterprise-ready human capital management (HCM) platform that enables organizations to automate day-to-day HR processes while simplifying human interactions and delivering actionable insights to build better workplaces.

Darwinbox is building an in-house product for finding the right talent for defined job requirements. As its user base grew into thousands, the 2.2+ seconds it took to match candidates’ resumes to job descriptions was no longer acceptable.

This led Darwinbox to seek a solution that would speed up the inference times of the platform's PyTorch models when processing and matching resumes against job descriptions.

In this post, we will explore how Minfy Technologies helped Darwinbox overcome these challenges and deliver an 87% reduction in inference times using AWS Inferentia.

Minfy is an AWS Premier Tier Services Partner and Managed Service Provider (MSP) that helps customers navigate digital journeys leveraging artificial intelligence (AI) and cloud.

Project Scope and Background

Darwinbox leverages two transformer-based PyTorch models, one for semantic search and the other for named entity recognition, in its search-and-match solution to score and shortlist the best-fit resumes for a given job description. The solution was developed by training two HuggingFace models on proprietary datasets. However, running inference on a large number of resumes was taking a significant amount of time.

Darwinbox wanted to reduce the time to process a resume as much as possible and thus create a better experience for end users. The team had observed how traffic fluctuates throughout the week, and wanted minimal changes to its approach and existing code while adhering to the engineering team's requirements.

Solution Overview

Darwinbox's existing PyTorch models were running on compute-optimized instances, but these were unable to deliver the performance the organization needed. An instance powered by a deep learning accelerator, either GPUs or AWS Inferentia, would be a better fit.

Based on the customer’s specific requirements and architecture, Minfy approached the challenge in two ways:

  • Run these PyTorch models, with minimal code changes, on instances powered by deep learning accelerators that are purpose-built for inference workloads such as natural language processing (NLP). This was achieved by compiling the existing models using the Neuron SDK and extending the pre-built Amazon SageMaker Docker containers with the required dependencies.
  • Configure a real-time endpoint for each model separately and use AWS Lambda with Amazon API Gateway to invoke the endpoints, per the API design requirements of the customer's DevOps team. To run inference on large datasets, Minfy proposed sending requests in parallel as mini batches of a few resumes each. SageMaker Inference Recommender was used to determine how many concurrent calls could be made to the model per minute.


Figure 1 – High level solution architecture.

The diagram above gives a high-level overview of the architecture (development stage). Key services used are:

  • Amazon SageMaker: To process the data, train the model, and deploy the model to endpoints, which are configured with autoscaling to address high traffic at peak hours.
  • Amazon S3: All of the data used for training the models were stored in Amazon Simple Storage Service (Amazon S3) buckets. After training, the model artifacts were also stored in S3 buckets.
  • AWS Lambda: Lambda was configured to invoke the two endpoints sequentially and send the response to Amazon API Gateway; a minimal sketch of such a handler appears after this list.
  • Amazon API Gateway: The entire solution was made available as an API using Amazon API Gateway, which made it easy to manage and monitor REST APIs at scale.
  • AWS Inferentia: Darwinbox's PyTorch models were compiled into an Inferentia-compatible format using the Neuron SDK.
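
For illustration, the following is a minimal sketch of such a Lambda handler. The endpoint names, environment variables, and payload shape are assumptions for this sketch; the handler Minfy built for Darwinbox differs in its details.

# A minimal, illustrative Lambda handler. Endpoint names, environment variables,
# and the payload structure are assumptions for this sketch.
import json
import os

import boto3

runtime = boto3.client("sagemaker-runtime")

NER_ENDPOINT = os.environ.get("NER_ENDPOINT", "darwinbox-ner-endpoint")
SEARCH_ENDPOINT = os.environ.get("SEARCH_ENDPOINT", "darwinbox-semantic-search-endpoint")

def invoke(endpoint_name, payload):
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())

def lambda_handler(event, context):
    body = json.loads(event["body"])

    # 1. Extract entities (skills, titles, and so on) from the resumes.
    entities = invoke(NER_ENDPOINT, {"resumes": body["resumes"]})

    # 2. Score each resume against the job description with the semantic search model.
    scores = invoke(SEARCH_ENDPOINT, {
        "job_description": body["job_description"],
        "entities": entities,
    })

    return {"statusCode": 200, "body": json.dumps(scores)}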

In the rest of this post, we'll focus on optimizing the HuggingFace models to reduce their inference time on large datasets.

HuggingFace Transformers

HuggingFace is a popular platform for transformer models and libraries. The HuggingFace Hub hosts a collection of models for different types of NLP applications, such as sentiment analysis, named entity recognition, and text classification, to name a few. These models can be used directly in Amazon SageMaker to train them on custom data and deploy them to production.

See the documentation for blogs and resources to get started with HuggingFace on Amazon SageMaker.

Note that this use case was started before the introduction of Optimum Neuron, which simplifies the adoption of Inferentia and Trainium accelerators even further. SageMaker provides built-in container images for training these models and running inference on them. These containers are stored in public Amazon Elastic Container Registry (Amazon ECR) repositories.

For this use case, we'll explore how to take a pre-built PyTorch container, install custom libraries on top of it, and save the extended image to our own ECR repository.

Building and Deploying the Models

At a fundamental level, NLP and computer vision workloads operate on large matrices of numbers. GPUs became popular for their ability to perform these matrix calculations in parallel, thanks to the large number of threads they employ.

Using GPU-powered instances speeds up the model training process by a significant amount. Once the training is completed, one of the key requirements during the inference phase is to make these models available to millions of users at the same time.

Inferencing at Scale Using AWS Inferentia

AWS Inferentia accelerators are designed by AWS to deliver high performance at the lowest cost for most deep learning inference applications. These accelerators were built from the ground up to deliver up to 2.3x higher throughput and reduce inference costs by up to 70%.

Compiling the HuggingFace Models with Neuron SDK

In order to use Inferentia chips for our NLP workload, the model needs to be compiled into an Inferentia-optimized representation. AWS provides the Neuron SDK to run deep learning workloads on Inferentia instances, and it supports both PyTorch and TensorFlow frameworks.

After training the model on custom data, we can deploy it on Inferentia. To do this, we trace and compile the model with the Neuron SDK, which optimizes the model for Inferentia accelerators. A model cannot be deployed to Inferentia instances unless it has been traced and compiled with the Neuron SDK.

Here, we'll use a pre-trained HuggingFace model to demonstrate how to trace, compile, and deploy a PyTorch model. You can access the entire code used for this blog in this GitHub repository.

Step 1: Install and Import All the Libraries

The following code installs transformers and torch-neuron packages:

!pip install --upgrade --no-cache-dir torch-neuron neuron-cc[tensorflow] torchvision torch --extra-index-url=https://pip.repos.neuron.amazonaws.com
!pip install --upgrade --no-cache-dir 'transformers==4.6.0'

import torch
import torch_neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

Step 2: Create and Organize Example Inputs

The following code defines the tokenizer, model, and sample inputs, and converts the sample inputs into a format compatible with TorchScript tracing:

# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gchhablani/bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("gchhablani/bert-base-cased-finetuned-mrpc", return_dict=False)

# Setup some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

max_length=128
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")

# Run the original PyTorch model on compilation example
paraphrase_classification_logits = model(**paraphrase)[0]

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

Step 3: Compile and Save the Model into Neuron-Optimized TorchScript

Next, convert the existing model into an AWS Neuron-optimized format that can be readily run on Inferentia accelerators:

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
model_neuron = torch.neuron.trace(model, example_inputs_paraphrase, verbose=1, compiler_workdir='./compilation_artifacts')

# Save the TorchScript for later use
model_neuron.save('neuron_compiled_model.pt')

Step 4: Package and Upload the Model to Amazon S3

Finally, upload the model artifacts to Amazon S3:

# Create a model.tar.gz file to be used by SageMaker endpoint
!tar -czvf model.tar.gz neuron_compiled_model.pt

import boto3
import time
from sagemaker.utils import name_from_base
import sagemaker

# upload model to S3
role = sagemaker.get_execution_role()
sess=sagemaker.Session()
region=sess.boto_region_name
bucket=sess.default_bucket()
sm_client=boto3.client('sagemaker')
model_key = '{}/model/model.tar.gz'.format('inf1_compiled_model')
model_path = 's3://{}/{}'.format(bucket, model_key)
boto3.resource('s3').Bucket(bucket).upload_file('model.tar.gz', model_key)
print("Uploaded model to S3:")
print(model_path)

Extending Pre-Built Docker Containers

Once the model is compiled and saved to Amazon S3, we need to create a container image that installs the transformers library, since our inference code requires it and it's not included in the default SageMaker PyTorch image.

If you're using a HuggingFace DLC (Deep Learning Container) image instead, you'll need this step to install the Neuron library rather than transformers.

We'll take a pre-built image and extend it by installing the additional libraries required by the inference code. The SageMaker execution role used here has the necessary permissions to create, access, and push container images to Amazon ECR.

Step 1: Create a New Folder and Text File Named ‘Dockerfile’

In this step, we create a folder named container and a file named Dockerfile (without a .txt extension) with the following contents. The Dockerfile pulls a public image from Amazon ECR with the tag 1.7.1-neuron-py36-ubuntu18.04, which comes preconfigured with PyTorch, the Neuron SDK, and other dependencies for running inference on AWS Inferentia instances.

After pulling the base image, it installs a specific version of the transformers Python package from the HuggingFace Transformers library.

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuron:1.7.1-neuron-py36-ubuntu18.04

# Install packages
RUN pip install "transformers==4.7.0"

Figure 2 – Create a Dockerfile in a new folder ‘container’.

Step 2: Create an ECR Repository and Pull the SageMaker PyTorch Container Image

The public Amazon ECR image is read-only, so for this tutorial we'll pull it into our own repository in order to make changes before using it for inference. To do this, we first create an ECR repository and pull the public image into it. To authenticate Docker with ECR, we retrieve the login command from ECR and log in.

The AWS region specified during the SageMaker Notebook creation should match the region of the repository hosting the custom Docker image. Failure to synchronize these regions can result in errors during the execution of this step.

%%sh

# The name of our algorithm
algorithm_name=neuron-py36-inference
cd container
account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to ap-south-1 if none defined)
region=$(aws configure get region)
region=${region:-ap-south-1}
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository does not exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR in order to pull the SageMaker PyTorch image
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

Step 3: Build the Docker Image Locally

After a successful login, we build the Docker image locally using the Dockerfile we created in the container directory. The following code sets the image name to ${algorithm_name} and passes the REGION build argument while building the Docker image locally. The built image is then tagged with the full ECR repository URI (Uniform Resource Identifier).

%%sh

# Build the docker image locally with the image name and then push it to ECR with the full name.
docker build  -t ${algorithm_name} . --build-arg REGION=${region}
docker tag ${algorithm_name} ${fullname}

Step 4: Push the Extended Container Image to ECR with the Tag

Next, we push the extended image to the ECR repository we created, using the same approach as above. We first retrieve an authentication token from ECR for the specified region; once Docker is authenticated, the following commands push the Docker image tagged with the full ECR repository URI (${fullname}) to our repository.

%%sh

# The name of our algorithm
algorithm_name=neuron-py36-inference
cd container
account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to ap-south-1 if none defined)
region=$(aws configure get region)
region=${region:-ap-south-1}
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository does not exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR in order to pull the SageMaker PyTorch image
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

# Build the docker image locally with the image name and then push it to ECR with the full name.
docker build -t ${algorithm_name} . --build-arg REGION=${region}
docker tag ${algorithm_name} ${fullname}

# Get the login command from ECR for our own account and push the image
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com
docker push ${fullname}

Deploying the Container on AWS Inferentia

Next, we can deploy the model on AWS Inferentia instances with the following steps:

  1. Create an inference.py file that includes four key functions: input_fn(), output_fn(), model_fn(), and predict_fn(), along the lines of this inference file; a minimal sketch follows this list.
  2. Once the container image is created and the inference code is ready, deploy the model by creating a PyTorchModel object and specifying details like the location of the model artifacts, the ECR image URI, the PyTorch version, the IAM role, and the inference.py file as the entry_point.
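
For reference, below is a minimal sketch of what such an inference.py could look like for the paraphrase model compiled earlier. The JSON request keys, response format, and hard-coded tokenizer name are assumptions for illustration; the actual handler in the linked repository may differ.

# inference.py (illustrative sketch)
import json
import os

import torch
import torch_neuron  # noqa: F401 - registers the Neuron runtime ops
from transformers import AutoTokenizer

JSON_CONTENT_TYPE = "application/json"

def model_fn(model_dir):
    # Load the Neuron-compiled TorchScript model and the matching tokenizer
    tokenizer = AutoTokenizer.from_pretrained("gchhablani/bert-base-cased-finetuned-mrpc")
    model = torch.jit.load(os.path.join(model_dir, "neuron_compiled_model.pt"))
    return model, tokenizer

def input_fn(request_body, content_type=JSON_CONTENT_TYPE):
    if content_type == JSON_CONTENT_TYPE:
        return json.loads(request_body)
    raise ValueError("Unsupported content type: {}".format(content_type))

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    encoded = tokenizer.encode_plus(
        data["sequence_0"], data["sequence_1"],
        max_length=128, padding="max_length", truncation=True, return_tensors="pt",
    )
    # The compiled model expects the same positional inputs used during tracing
    logits = model(encoded["input_ids"], encoded["attention_mask"], encoded["token_type_ids"])[0]
    return int(torch.argmax(logits, dim=1).item())

def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    return json.dumps({"prediction": prediction})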

We first import the PyTorchModel class from the SageMaker Python SDK, which is used to define a SageMaker model based on the PyTorch framework. While configuring it, we need to specify the URI of the image we pushed to our ECR repository (the URI of this custom image can also be found in the ECR console).

Additionally, we need to indicate that the model has already been compiled using the AWS Neuron compiler. Since the model is already compiled, there is no need for on-the-fly compilation during inference, which ensures low latency and efficient hardware utilization. We then deploy the model on Inferentia by configuring the instance_type.

# Create a PyTorchModel using the inference code and the extended ECR image
from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=pretrained_model_data,  # S3 URI of the compiled model.tar.gz uploaded earlier
    role=role,
    source_dir="code",
    framework_version="1.7.1",
    entry_point="inference.py",
    image_uri=ecr_image
)

# Indicate to SageMaker that the model has already been compiled using neuron-cc.
pytorch_model._is_compiled_model = True

# Deploy the model
predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="ml.inf1.2xlarge")
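
Once deployed, the endpoint can be tested from the notebook. The following is an illustrative call that assumes the inference.py handler accepts and returns JSON:

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Assumes the inference.py input_fn/output_fn handle JSON payloads
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

result = predictor.predict({
    "sequence_0": "The company HuggingFace is based in New York City",
    "sequence_1": "HuggingFace's headquarters are situated in Manhattan",
})
print(result)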

Mini Batch Processing

In this use case, Darwinbox stored the input data in a MongoDB database and had logic in place to send the resume data, along with the job description, to the API created by Minfy. The incoming data was either real-time or taken from the database. As mentioned earlier, traffic peaks at certain hours of the day.

It was therefore possible to invoke the SageMaker endpoints by sending the data in small batches of a job description and resumes. This logic was implemented and the data was sent from the database in mini batches, which bypassed the dependency on S3, one of the customer's requirements.

However, there were some key considerations to address when implementing this (a minimal sketch of the batching pattern follows the list below):

  • What is the ideal number of resumes and job descriptions in these mini-batches?
  • What’s the maximum number of concurrent endpoint invocations?
  • How many instances should be configured for autoscaling?
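
As an illustration of the batching pattern, the following sketch sends resumes to the API in parallel mini batches. The API URL, batch size, concurrency, and payload shape are assumptions; in practice, the batch size and concurrency come from the Inference Recommender analysis described next.

# Illustrative client-side batching; the API URL, batch size, and payload shape are assumptions.
import json
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "https://<api-id>.execute-api.ap-south-1.amazonaws.com/prod/match"  # hypothetical
BATCH_SIZE = 5    # resumes per request, tuned with Inference Recommender results
MAX_WORKERS = 4   # parallel requests, kept below the endpoint's invocation limit

def score_batch(job_description, resume_batch):
    payload = {"job_description": job_description, "resumes": resume_batch}
    response = requests.post(API_URL, data=json.dumps(payload), timeout=29)
    response.raise_for_status()
    return response.json()  # assumed to be a list of scores, one per resume

def score_all(job_description, resumes):
    batches = [resumes[i:i + BATCH_SIZE] for i in range(0, len(resumes), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results = pool.map(lambda batch: score_batch(job_description, batch), batches)
    return [score for batch_scores in results for score in batch_scores]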

Inference Recommender

In the machine learning development cycle, there are many decisions to be made, such as determining the right instance size and configuring autoscaling parameters sensibly. AWS provides Amazon SageMaker Inference Recommender to compare the performance of different instances across parameters like cost, latency, and throughput. It can be used either from the SageMaker console or from a notebook.

Key steps involved are:

  1. Create a sample payload (2-3 files) and package it as a tar.gz file.
  2. Specify the model artifact’s location.
  3. Specify parameters like stopping criteria and instance types you want to test and compare.

Note that the inference file used must contain all of the functions mentioned earlier: input_fn(), output_fn(), model_fn(), and predict_fn(). You can use the default PyTorch inference handler and customize it for your use case.
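
As a rough sketch, an Inference Recommender job can also be launched from the notebook with the SageMaker Python SDK's right_size() method. The payload location, content type, and candidate instance types below are assumptions, and the exact arguments may vary with the SDK version.

# Illustrative sketch of launching an Inference Recommender job from a notebook
from sagemaker.pytorch.model import PyTorchModel

model = PyTorchModel(
    model_data=pretrained_model_data,   # S3 URI of the compiled model.tar.gz
    role=role,
    source_dir="code",
    entry_point="inference.py",
    framework_version="1.7.1",
    image_uri=ecr_image,
)

# right_size() launches an Inference Recommender job and summarizes latency,
# throughput, and cost across the candidate instance types.
model.right_size(
    sample_payload_url="s3://<bucket>/inference-recommender/payload.tar.gz",  # hypothetical path
    supported_content_types=["application/json"],
    supported_instance_types=["ml.inf1.xlarge", "ml.inf1.2xlarge", "ml.c5.2xlarge"],
    framework="PYTORCH",
)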

After the Inference Recommender job is completed, we use its output to compare the performance of different instances (across sizes and types), choose the instance type that fits our requirements, and then configure autoscaling based on the expected peak traffic.

The recommender job compares the model latency, maximum invocations per minute, and cost incurred for all of the instance types specified. The “maximum invocations per minute” metric helps to configure autoscaling; a minimal sketch of such a configuration follows. This AWS blog post discusses the details of configuring autoscaling for SageMaker endpoints.
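
For illustration, the following sketch registers an endpoint variant for target-tracking autoscaling with boto3. The endpoint name, variant name, capacity limits, and target value are assumptions; the target value would typically be derived from the recommender's maximum invocations result.

import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names
resource_id = "endpoint/darwinbox-search-endpoint/variant/AllTraffic"

# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Track invocations per instance, using a target derived from the recommender results
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # hypothetical invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)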

Using the “model latency” results from SageMaker Inference Recommender, we can get the time taken for a model prediction. Since we're using AWS Lambda and Amazon API Gateway to invoke our model endpoints, there are two key service limits to account for: Lambda has a payload limit of 6 MB, and API Gateway has a timeout of 29 seconds. With this, we can calculate the number of resumes that can be processed in a single request before the API Gateway timeout is reached.
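
As an illustrative calculation (the latency, overhead, and payload figures below are hypothetical placeholders, not measured values):

import math

# Hypothetical figures for illustration only
model_latency_per_resume_s = 0.25   # from the Inference Recommender "model latency" metric
request_overhead_s = 2.0            # Lambda, API Gateway, and (de)serialization overhead
api_gateway_timeout_s = 29.0
avg_resume_payload_mb = 0.25
lambda_payload_limit_mb = 6.0

max_by_time = math.floor((api_gateway_timeout_s - request_overhead_s) / model_latency_per_resume_s)
max_by_payload = math.floor(lambda_payload_limit_mb / avg_resume_payload_mb)

batch_size = min(max_by_time, max_by_payload)
print("Resumes per request:", batch_size)   # min(108, 24) -> 24 with these example numbers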

Customer Benefits

Darwinbox achieved an over 87% reduction in inference time for both models by running them on Inf1 (AWS Inferentia) instances using the Neuron SDK. Further improvement was achieved by sending the resumes in mini batches, effectively bringing the overall inference time down by 91%.

Minfy also identified other benefits like:

  • Lower cost: Inferentia instances deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon Elastic Compute Cloud (Amazon EC2) instances, thus reducing the carbon footprint.
  • Increased productivity of the data science team: Amazon SageMaker is a fully managed service and provides features to automate various steps in the ML lifecycle.
  • Scalable solution: The combination of AWS Lambda, Amazon API Gateway, and the autoscaling feature of SageMaker endpoints enabled Darwinbox to efficiently support peak loads.

Conclusion

Darwinbox's growing user base had to be accompanied by processes and workflows that could keep up with the growth. Minfy leveraged the capabilities of Amazon SageMaker and AWS Inferentia to realize an 87% reduction in inference time for Darwinbox's PyTorch models, without retraining or making any major changes to the existing code or model artifacts.

Read more about Amazon SageMaker, AWS Inferentia, and AWS Trainium accelerators, and how to get started with AWS Neuron.



Minfy Technologies – AWS Partner Spotlight

Minfy Technologies is an AWS Premier Tier Services Partner and MSP that helps customers navigate digital journeys leveraging AI and cloud.

Contact Minfy | Partner Overview