AWS Machine Learning Blog

How Patsnap used GPT-2 inference on Amazon SageMaker with low latency and cost

This blog post was co-authored, and includes an introduction, by Zilong Bai, senior natural language processing engineer at Patsnap.

You’re likely familiar with the autocomplete suggestion feature when you search for something on Google or Amazon. Although the search terms in these scenarios are pretty common keywords or expressions that we use in daily life, in some cases search terms are very specific to the scenario. Patent search is one of them. Recently, the AWS Generative AI Innovation Center collaborated with Patsnap to implement a feature to automatically suggest search keywords as an innovation exploration to improve user experiences on their platform.

Patsnap provides a global one-stop platform for patent search, analysis, and management. They use big data (such as a history of past search queries) to provide many powerful yet easy-to-use patent tools. These tools have enabled Patsnap’s global customers to have a better understanding of patents, track recent technological advances, identify innovation trends, and analyze competitors in real time.

At the same time, Patsnap is embracing the power of machine learning (ML) to develop features that can continuously improve user experiences on the platform. A recent initiative is to simplify the difficulty of constructing search expressions by autofilling patent search queries using state-of-the-art text generation models. Patsnap had trained a customized GPT-2 model for such a purpose. Because there is no such existing feature in a patent search engine (to their best knowledge), Patsnap believes adding this feature will increase end-user stickiness.

However, in their recent experiments, the inference latency and queries per second (QPS) of a PyTorch-based GPT-2 model couldn’t meet certain thresholds that can justify its business value. To tackle this challenge, AWS Generative AI Innovation Center scientists explored a variety of solutions to optimize GPT-2 inference performance, resulting in lowering the model latency by 50% on average and improving the QPS by 200%.

Large language model inference challenges and optimization approaches

In general, applying such a large model in a real-world production environment is non-trivial. The prohibitive computation cost and latency of PyTorch-based GPT-2 made it difficult to be widely adopted from a business operation perspective. In this project, our objective is to significantly improve the latency with reasonable computation costs. Specifically, Patsnap requires the following:

  • The average latency of model inference for generating search expressions needs to be controlled within 600 milliseconds in real-time search scenarios
  • The model requires high throughput and QPS to do a large number of searches per second during peak business hours

In this post, we discuss our findings using Amazon Elastic Compute Cloud (Amazon EC2) instances, featuring GPU-based instances using NVIDIA TensorRT.

In a short summary, we use NVIDIA TensorRT to optimize the latency of GPT-2 and deploy it to an Amazon SageMaker endpoint for model serving, which reduces the average latency from 1,172 milliseconds to 531 milliseconds

In the following sections, we go over the technical details of the proposed solutions with key code snippets and show comparisons with the customer’s status quo based on key metrics.

GPT-2 model overview

Open AI’s GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on the WebText dataset, containing 8 million web pages. The GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and let it generate a lengthy continuation. In this situation, we exploit it to generate search queries. As GPT models keep growing larger, inference costs are continuously rising, which increases the need to deploy these models with acceptable cost.

Achieve low latency on GPU instances via TensorRT

TensorRT is a C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators, supporting major deep learning frameworks such as PyTorch and TensorFlow. Previous studies have shown great performance improvement in terms of model latency. Therefore, it’s an ideal choice for us to reduce the latency of the target model on NVIDIA GPUs.

We are able to achieve a significant reduction in GPT-2 model inference latency with a TensorRT-based model on NVIDIA GPUs. The TensorRT-based model is deployed via SageMaker for performance tests. In this post, we show the steps to convert the original PyTorch-based GPT-2 model to a TensorRT-based model.

Converting the PyTorch-based GPT-2 to the TensorRT-based model is not difficult via the official tool provided by NVIDIA. In addition, with such straightforward conversions, no obvious model accuracy degradation has been observed. In general, there are three steps to follow:

  1. Analyze your GPT-2. As of this writing, NVIDIA’s conversion tool only supports Hugging Face’s version of GPT-2 model. If the current GPT-2 model isn’t the original version, you need to modify it accordingly. It’s recommended to strip out custom code from the original GPT-2 implementation of Hugging Face, which is very helpful for the conversion.
  2. Install the required Python packages. The conversion process first converts the PyTorch-based model to the ONNX model and then converts the ONNX-based model to the TensorRT-based model. The following Python packages are needed for this two-step conversion:
  1. Convert your model. The following code contains the functions for the two-step conversion:
def torch2onnx():
    metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=True), other=GPT2Metadata(kv_cache=False))
    gpt2 = GPT2TorchFile('cpu'), metadata)
    onnx_path = ('Your own path to save ONNX-based model') # e.g, ./model_fp16.onnx
    gpt2.as_onnx_model(onnx_path, force_overwrite=False)
    return onnx_path, metadata
def onnx2trt(onnx_path, metadata):
    trt_path = 'Your own path to save TensorRT-based model' # e.g., ./model_fp16.onnx.engine
    batch_size = 10
    max_sequence_length = 42
    profiles = [Profile().add(
        min=(1, 1),
        opt=(batch_size, max_sequence_length // 2),
        max=(batch_size, max_sequence_length),
    gpt2_engine = GPT2ONNXFile(onnx_path, metadata).as_trt_engine(output_fpath=trt_path, profiles=profiles)
    gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config, max_sequence_length=42, batch_size=10)

Latency comparison: PyTorch vs. TensorRT

JMeter is used for performance benchmarking in this project. JMeter is an Apache project that can be used as a load testing tool for analyzing and measuring the performance of a variety of services. We record the QPS and latency of the original PyTorch-based model and our converted TensorRT-based GPT-2 model on an AWS P3.2xlarge instance. As we show later in this post, due to the powerful acceleration ability of TensorRT, the latency of GPT-2 is significantly reduced. When the request concurrency is 1, the average latency has been reduced by 274 milliseconds (2.9 times faster). From the perspective of QPS, it is increased to 7 from 2.4, which is around a 2.9 times boost compared to the original PyTorch-based model. Moreover, as the concurrency increases, QPS keeps increasing. This suggests lower costs with acceptable latency increase (but still much faster than the original model).

The following table compares latency:

. Concurrency QPS Maximum Latency Minumum Latency Average Latency
Customer PyTorch version (on p3.2xlarge) 1 2.4 632 105 417
2 3.1 919 168 636
3 3.4 1911 222 890
4 3.4 2458 277 1172
AWS TensorRT version (on p3.2xlarge) 1 7 (+4.6) 275 22 143 (-274 ms)
2 7.2 (+4.1) 274 51 361 (-275 ms)
3 7.3 (+3.9) 548 49 404 (-486 ms)
4 7.5 (+4.1) 765 62 531 (-641 ms)

Deploy TensorRT-based GPT-2 with SageMaker and a custom container

TensorRT-based GPT-2 requires a relatively recent TensorRT version, so we choose the bring your own container (BYOC) mode of SageMaker to deploy our model. BYOC mode provides a flexible way to deploy the model, and you can build customized environments in your own Docker container. In this section, we show how to build your own container, deploy your own GPT-2 model, and test with the SageMaker endpoint API.

Build your own container

The container’s file directory is presented in the following code. Specifically, Dockerfile and are used to build the Docker container. gpt2 and implement the model and the inference API. serve, nginx.conf, and provide the configuration for the NGINX web server.

├── Dockerfile    # build our docker based on this file.
├──      # create our own image and push it to Amazon ECR
├── gpt2          # model directory
├──  # backend function for invoke the model
├── serve         # web server setting file
├── nginx.conf    # web server setting file
└──       # web server setting file

You can run sh ./ to build the container.

Deploy to a SageMaker endpoint

After you have built a container to run the TensorRT-based GPT-2, you can enable real-time inference via a SageMaker endpoint. Use the following code snippets to create the endpoint and deploy the model to the endpoint using the corresponding SageMaker APIs:

import boto3from time import gmtime, strftime
from sagemaker import get_execution_role

sm_client = boto3.client(service_name='sagemaker')
runtime_sm_client = boto3.client(service_name='sagemaker-runtime')
account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
s3_bucket = '${Your s3 bucket}'
role = get_execution_role()
model_name = '${Your Model Name}'
# you need to upload your container to S3 first
container = '${Your Image Path}'
instance_type = 'ml.p3.2xlarge'
container = {
    'Image': container
create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    Containers = [container])
# Endpoint Setting
endpoint_config_name = '${Your Endpoint Config Name}'
print('Endpoint config name: ' + endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
        'InstanceType': instance_type,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic'}])
print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

# Deploy Model
endpoint_name = '${Your Endpoint Name}'
print('Endpoint name: ' + endpoint_name)
create_endpoint_response = sm_client.create_endpoint(
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)
print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
waiter = sm_client.get_waiter('endpoint_in_service')

Test the deployed model

After the model is successfully deployed, you can test the endpoint via the SageMaker notebook instance with the following code:

import json
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name='us-east-2')
endpoint_name = "${Your Endpoint Name}"
request_body = {"input": "amazon"}
payload = json.dumps(request_body)
content_type = "application/json"
response = sagemaker_runtime.invoke_endpoint(
                            Body=payload # Replace with your own data.
result = json.loads(response['Body'].read().decode())


In this post, we described how to enable low-latency GPT-2 inference on SageMaker to create business value. Specifically, with the support of NVIDIA TensorRT, we can achieve 2.9 times acceleration on the NVIDIA GPU instances with SageMaker for a customized GPT-2 model.

If you want help with accelerating the use of GenAI models in your products and services, please contact the AWS Generative AI Innovation Center. The AWS Generative AI Innovation Center can help you make your ideas a reality faster and more effectively. To get started with the Generative AI Innovation Center, visit here.

About the Authors

Hao Huang
is an applied scientist at the AWS Generative AI Innovation Center. He specializes in Computer Vision (CV) and Visual-Language Model (VLM). Recently, he has developed a strong interest in generative AI technologies and has already collaborated with customers to apply these cutting-edge technologies to their business. He is also a reviewer for AI conferences such as ICCV and AAAI.

Zilong Bai is a senior natural language processing engineer at Patsnap. He is passionate about research and proof-of-concept work on cutting-edge techniques for generative language models.

Yuanjun Xiao is a Solution Architect at AWS. He is responsible for AWS architecture consulting and design. He is also passionate about building AI and analytic solutions.

Xuefei Zhang is an applied scientist at the AWS Generative AI Innovation Center, works in NLP and AGI areas to solve industry problems with customers.

Guang Yang is a senior applied scientist at the AWS Generative AI Innovation Center where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.