AWS Machine Learning Blog

Reduce computer vision inference latency using gRPC with TensorFlow Serving on Amazon SageMaker

AWS customers are increasingly using computer vision (CV) models for improved efficiency and an enhanced user experience. For example, a live broadcast of sports can be processed in real time to detect specific events automatically and provide additional insights to viewers at low latency. Inventory inspection systems at large warehouses capture and process millions of images across their networks to identify misplaced inventory.

CV models can be built with multiple deep learning frameworks like TensorFlow, PyTorch, and Apache MXNet. These models typically have a large input payload of images or videos of varying size. Advanced deep learning models for use cases like object detection return large response payloads ranging from tens to hundreds of megabytes. Large request and response payloads increase model serving latency, which in turn degrades application performance. You can further optimize model serving stacks for each of these frameworks for low latency and high throughput.

Amazon SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML. SageMaker provides state-of-the-art open-source serving containers for XGBoost (container, SDK), Scikit-Learn (container, SDK), PyTorch (container, SDK), TensorFlow (container, SDK) and Apache MXNet (container, SDK).

In this post, we show you how to serve TensorFlow CV models with SageMaker’s pre-built container to easily deliver high-performance endpoints using TensorFlow Serving (TFS). As with all SageMaker endpoints, requests arrive using REST, as shown in the following diagram. Inside of the endpoint, you can add preprocessing and postprocessing steps and dispatch the prediction to TFS using either RESTful APIs or gRPC APIs. For small payloads, either API yields similar performance. We demonstrate that for CV tasks like image classification and object detection, using gRPC inside of a SageMaker endpoint reduces overall latency by 75% or more. The code for these use cases is available in the following GitHub repo.


For image classification, we use the Keras model MobileNetV2, pre-trained with 1,000 classes from the ImageNet dataset. The default input image resolution is 224*224*3, and the output is a dense vector of probabilities for each of the 1,000 classes. For object detection, we use the TensorFlow 2 model EfficientDet D1, pre-trained with 91 classes from the COCO 2017 dataset. The default input image resolution is 640*640*3, and the output is a dictionary containing the number of detections, bounding box coordinates, detection classes, detection scores, raw detection boxes, raw detection scores, detection anchor indexes, and detection multiclass scores. You can fine-tune both models through transfer learning on a custom dataset with SageMaker, and use SageMaker to deploy and serve the models.

The following is an example of image classification.

Class ID : 281 , probability = 0.76 , class label = Tabby cat

The following is an example of object detection.

Bird, detection scores = 0.88, 0.76, 0.66, classID = 16, bounding boxes

Model deployment on SageMaker

The code to deploy the preceding pre-trained models is in the following GitHub repo. SageMaker provides a managed TensorFlow Serving environment that makes it easy to deploy TensorFlow models. The SageMaker TensorFlow Serving container works with any model stored in TensorFlow’s SavedModel format and allows you to add customized Python code to process input and output data.

We download the pre-trained models and extract them with the following code:

# Image classification
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
model = MobileNetV2()
# Save in TensorFlow SavedModel format under version folder 1
model.save('model/1/')

# Object detection
!tar -xvf efficientdet_d1_coco17_tpu-32.tar.gz --no-same-owner
!mv efficientdet_d1_coco17_tpu-32/saved_model/* model/1/

SageMaker models need to be packaged in .tar.gz format. We archive the TensorFlow SavedModel bundle and upload it to Amazon Simple Storage Service (Amazon S3):

 model
     └── <version number>
         ├── saved_model.pb
         └── variables
             └── ...
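
The archiving step can be sketched with Python's standard tarfile module. This is a minimal sketch, not the post's own packaging code: the helper name package_saved_model and the archive name model.tar.gz are assumptions made for illustration, and the demo builds a stand-in directory rather than a real SavedModel.

```python
import os
import tarfile
import tempfile


def package_saved_model(model_dir: str, archive_path: str) -> list:
    """Archive model_dir (for example 'model/') into a gzip-compressed tarball.

    The top-level folder name is kept, so the archive unpacks to
    model/<version number>/saved_model.pb, matching the layout above.
    Returns the sorted member names for inspection.
    """
    top_level = os.path.basename(os.path.normpath(model_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=top_level)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


if __name__ == "__main__":
    # Build a stand-in directory tree so the sketch runs without a real model.
    with tempfile.TemporaryDirectory() as tmp:
        version_dir = os.path.join(tmp, "model", "1")
        os.makedirs(os.path.join(version_dir, "variables"))
        with open(os.path.join(version_dir, "saved_model.pb"), "wb") as f:
            f.write(b"placeholder")
        names = package_saved_model(
            os.path.join(tmp, "model"), os.path.join(tmp, "model.tar.gz")
        )
        print(names)
```

The resulting archive can then be uploaded to Amazon S3, for example with the SageMaker Python SDK's Session.upload_data or boto3's upload_file; the returned S3 URI is what the deployment step expects as the model artifact path.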

We can add customized Python code to process input and output data via the input_handler and output_handler functions. The customized Python code must be named inference.py and specified through the entry_point parameter. We add preprocessing to accept an image byte stream as input, then read and transform the byte stream with tensorflow.keras.preprocessing:

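# Pre-processing
import io

import numpy as np
from PIL import Image
from tensorflow.keras.preprocessing import image

# WIDTH and HEIGHT hold the model's expected input resolution.
# `data` is assumed to be the file-like request body passed to the handler.
if context.request_content_type == 'application/x-image':
    stream = io.BytesIO(data.read())
    img = Image.open(stream).convert('RGB')
    img = img.resize((WIDTH, HEIGHT))
    img_array = image.img_to_array(img)

    # Add additional model-specific preprocessing
    img_array = img_array.reshape((HEIGHT, WIDTH, 3)).astype(np.uint8)  # "channels_last"
    x = np.expand_dims(img_array, axis=0)  # shape: (1, HEIGHT, WIDTH, 3)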
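
The preprocessing fragment runs inside a handler function in the inference script. As a rough sketch of how an input_handler and output_handler pair fits together, the following uses only the standard library and handles JSON pass-through; the image branch shown earlier would slot in as an additional content type. FakeContext is a hypothetical stand-in for the context object the container supplies, modeling only the two fields used here.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for the container's context object; only the
# two fields used below are modeled.
FakeContext = namedtuple("FakeContext", ["request_content_type", "accept_header"])


def input_handler(data, context):
    """Turn the incoming request body into the JSON body sent to TFS."""
    if context.request_content_type == "application/json":
        # Pass JSON requests through unchanged.
        return data.read().decode("utf-8")
    # An 'application/x-image' branch would decode, resize, and serialize here.
    raise ValueError(f"Unsupported content type: {context.request_content_type}")


def output_handler(response, context):
    """Return (body, content_type) built from the TFS response."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header
```

Splitting the work this way keeps the TFS call itself inside the container's default REST path; implementing a single handler function instead gives full control over that call, which is what the REST and gRPC sections rely on.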

After we have the S3 model artifact path, we can use the following code to deploy a SageMaker endpoint:

from sagemaker.tensorflow.serving import TensorFlowModel
model_data = '<your-model-s3-path>'
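# Placeholders: substitute your own IAM role ARN and TensorFlow version
model = TensorFlowModel(
    source_dir='code',
    entry_point='inference.py',
    model_data=model_data,
    role='<your-sagemaker-execution-role-arn>',
    framework_version='<your-tensorflow-version>',
)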
predictor = model.deploy(initial_instance_count=1, instance_type='ml.g4dn.xlarge')

Calling deploy starts the process of creating a SageMaker endpoint. This process includes the following steps:

  • Starts initial_instance_count Amazon Elastic Compute Cloud (Amazon EC2) instances of the type instance_type.
  • On each instance, SageMaker does the following:
    • Starts a Docker container optimized for TensorFlow Serving (see SageMaker TensorFlow Serving containers).
    • Starts a TensorFlow Serving process configured to run your model.
    • Starts an HTTP server that provides access to TensorFlow Serving through the SageMaker InvokeEndpoint API.

REST communication with TensorFlow Serving

We have complete control over the inference request by implementing the handler method in the entry point inference script. The Python service creates a context object. We convert the preprocessed image NumPy array to JSON and retrieve the REST URI from the context object to trigger a TFS invocation via REST.

# REST communication with TFS
import json
import requests

# Convert the preprocessed image array to JSON
inst_json = json.dumps({'instances': instance.tolist()})
print('rest call')
# Use the context object to retrieve the REST URI
response = requests.post(context.rest_uri, data=inst_json)
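
To get a feel for why the REST path is costly for images, it helps to compare the raw tensor size with its JSON encoding. The following is a small sketch using only the standard library; it fills a 224*224*3 tensor with random 8-bit values (an assumption, since a real photo has a different digit distribution) and measures the serialized request body.

```python
import json
import random

HEIGHT, WIDTH, CHANNELS = 224, 224, 3  # MobileNetV2 default input resolution

random.seed(0)
# Random stand-in for a decoded RGB image with values in [0, 255].
image = [
    [[random.randint(0, 255) for _ in range(CHANNELS)] for _ in range(WIDTH)]
    for _ in range(HEIGHT)
]

raw_bytes = HEIGHT * WIDTH * CHANNELS  # one byte per uint8 value
json_bytes = len(json.dumps({"instances": [image]}).encode("utf-8"))

print(f"raw tensor: {raw_bytes / 1024:.0f} KiB")
print(f"JSON body:  {json_bytes / 1024:.0f} KiB")
print(f"expansion:  {json_bytes / raw_bytes:.1f}x")
```

Every pixel value becomes several characters of text plus separators and brackets, so the JSON body is several times larger than the tensor it describes, and it must be serialized in the handler and parsed again by TFS.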

gRPC communication with TensorFlow Serving

Alternatively, we can use gRPC for in-server communication with TFS via the handler method. We import the gRPC libraries, retrieve the gRPC port from the context object, and trigger a TFS invocation via gRPC:

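import grpc
from tensorflow.compat.v1 import make_tensor_proto
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

request = predict_pb2.PredictRequest()
request.model_spec.name = 'model'

# specify the serving signature from the model
request.model_spec.signature_name = 'serving_default'

# copy the preprocessed image into the request; the input tensor name is
# model specific, so '<input-tensor-name>' is a placeholder
request.inputs['<input-tensor-name>'].CopyFrom(make_tensor_proto(instance))

# raise the gRPC message size limits to accommodate large payloads
options = [
    ('grpc.max_send_message_length', MAX_GRPC_MESSAGE_LENGTH),
    ('grpc.max_receive_message_length', MAX_GRPC_MESSAGE_LENGTH),
]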

# retrieve the gRPC port from the context object; TFS runs in the same
# container, so the host is assumed to be localhost
channel = grpc.insecure_channel(f'localhost:{context.grpc_port}', options=options)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# make a call that returns immediately, without blocking, with a
# gRPC future for the request running in the background
result_future = stub.Predict.future(request, 30)  # 30-second timeout

# retrieve the output based on the model output types
output_tensor_proto = result_future.result().outputs['predictions']
output_shape = [dim.size for dim in output_tensor_proto.tensor_shape.dim]

# convert the repeated float values to a NumPy array
output_np = np.array(output_tensor_proto.float_val).reshape(output_shape)

# create JSON response
prediction_json = {'predictions': output_np.tolist()}
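
The gRPC snippet references MAX_GRPC_MESSAGE_LENGTH without defining it. The following sketch shows one way to set it and why it matters: gRPC's default maximum receive size is 4 MiB per message, far below the object detection response size reported in the results table. The 512 MiB ceiling and the helper name grpc_channel_options are assumptions chosen for illustration, not values taken from the post.

```python
# gRPC's default maximum receive size per message.
DEFAULT_GRPC_RECEIVE_LIMIT = 4 * 1024 * 1024

# Hypothetical ceiling for this sketch: 512 MiB, which still fits in the
# 32-bit integer that gRPC channel options expect.
MAX_GRPC_MESSAGE_LENGTH = 512 * 1024 * 1024


def grpc_channel_options(max_bytes: int = MAX_GRPC_MESSAGE_LENGTH) -> list:
    """Build channel options that raise both send and receive limits."""
    return [
        ("grpc.max_send_message_length", max_bytes),
        ("grpc.max_receive_message_length", max_bytes),
    ]


# Object detection response size from the results table, taken as 110 MiB.
response_bytes = 110 * 1024 * 1024
exceeds_default = response_bytes > DEFAULT_GRPC_RECEIVE_LIMIT
fits_raised_limit = response_bytes <= MAX_GRPC_MESSAGE_LENGTH
print(exceeds_default, fits_raised_limit)
```

Without raising the receive limit, a response of that size would be rejected by the gRPC client before the handler could read it.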

Prediction invocation comparison

We can invoke the deployed model with an input image to retrieve image classification or object detection outputs:

import boto3

input_image = open('image.jpg', 'rb').read()
runtime_client = boto3.client('runtime.sagemaker')
response = runtime_client.invoke_endpoint(
    EndpointName='<your-endpoint-name>',
    ContentType='application/x-image',
    Body=input_image)
res = response['Body'].read().decode('ascii')
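
For the image classification endpoint, the decoded response can be reduced to a top-1 prediction. This is a minimal sketch that assumes the response body has the shape {'predictions': [[p0, p1, ...]]}, as built in the gRPC handler; the helper name top_prediction is hypothetical.

```python
import json


def top_prediction(response_body: str):
    """Return (class_id, probability) for the highest-scoring class.

    Assumes one image per request, so only the first row is inspected.
    """
    probabilities = json.loads(response_body)["predictions"][0]
    class_id = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_id, probabilities[class_id]


if __name__ == "__main__":
    # Synthetic response: 1,000 classes with the peak at class 281.
    fake = [0.0] * 1000
    fake[281] = 0.76
    print(top_prediction(json.dumps({"predictions": [fake]})))
```

Mapping the class ID to a human-readable label requires the ImageNet label list, which is not part of the endpoint response.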

We then trigger 100 invocations to generate latency statistics for comparison:

import time
import numpy as np

results = []
for _ in range(100):
    start = time.time()
    response = runtime_client.invoke_endpoint(
        EndpointName='<your-endpoint-name>',
        ContentType='application/x-image',
        Body=input_image)
    results.append((time.time() - start) * 1000)
print("\nPredictions for TF2 serving : \n")
print('\nP95: ' + str(np.percentile(results, 95)) + ' ms\n')
print('P90: ' + str(np.percentile(results, 90)) + ' ms\n')
print('Average: ' + str(np.average(results)) + ' ms\n')
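
The same measurement can be wrapped in a small reusable helper so that the REST and gRPC endpoints are timed identically. The following sketch uses only the standard library; measure_latency_ms and percentile are hypothetical names, and the percentile uses the nearest-rank definition, so its values can differ slightly from np.percentile, which interpolates by default.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency_ms(call, n=100):
    """Invoke call() n times and return each wall-clock latency in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


if __name__ == "__main__":
    # Stand-in workload; replace with a function that invokes the endpoint.
    samples = measure_latency_ms(lambda: sum(range(1000)), n=100)
    print("P95:", percentile(samples, 95), "ms")
    print("P90:", percentile(samples, 90), "ms")
    print("Average:", sum(samples) / len(samples), "ms")
```

Passing a function that wraps invoke_endpoint for each endpoint keeps the loop, the clock, and the statistics the same across both runs.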

The following table summarizes our results from the invocation tests. The results show a 75% improvement in latency with gRPC compared to REST calls to TFS for image classification, and an 85% improvement for object detection. The size of the improvement depends on the size of the request payload and of the response payload from the model.

| Use case | Model | Input image size | Request payload size | Response payload size | Average invocation latency via REST | Average invocation latency via gRPC | Performance gain via gRPC |
|---|---|---|---|---|---|---|---|
| Image classification | MobileNetV2 | 20 KB | 600 KB | 15 KB | 266 ms | 58 ms | 75% |
| Object detection | EfficientDet D1 | 100 KB | 1 MB | 110 MB | 4057 ms | 468 ms | 85% |
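
The gain column can be checked against the latency columns. The post does not spell out the formula, so this sketch assumes the gain is the relative reduction in average latency, (REST - gRPC) / REST; the function name is hypothetical.

```python
def latency_reduction_pct(rest_ms: float, grpc_ms: float) -> float:
    """Relative reduction in average latency when moving from REST to gRPC."""
    return (rest_ms - grpc_ms) / rest_ms * 100


# Average latencies taken from the results table.
measurements = [
    ("Image classification", 266, 58),
    ("Object detection", 4057, 468),
]

for name, rest_ms, grpc_ms in measurements:
    reduction = latency_reduction_pct(rest_ms, grpc_ms)
    speedup = rest_ms / grpc_ms
    print(f"{name}: {reduction:.1f}% lower latency ({speedup:.1f}x faster)")
```

Under that assumption the computed reductions come out slightly above the 75% and 85% stated in the table, which suggests the published figures were rounded down.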


In this post, we demonstrated how to reduce model serving latency for TensorFlow computer vision models on SageMaker via in-server gRPC communication. We walked through a step-by-step process of in-server communication with TensorFlow Serving via REST and gRPC and compared the performance using two different models and payload sizes. For more information, see Maximize TensorFlow performance on Amazon SageMaker endpoints for real-time inference to understand the throughput and latency gains you can achieve from tuning endpoint configuration parameters such as the number of threads and workers.

SageMaker provides a powerful and configurable platform for hosting real-time computer vision inference in the cloud with low latency. In addition to using gRPC, we suggest other techniques to further reduce latency and improve throughput, such as model compilation, model server tuning, and hardware and software acceleration technologies. Amazon SageMaker Neo lets you compile and optimize ML models for various ML frameworks to a wide variety of target hardware. Select the most appropriate SageMaker compute instance for your specific use case, including g4dn featuring NVIDIA T4 GPUs, a CPU instance type coupled with Amazon Elastic Inference, or inf1 featuring AWS Inferentia.

About the Authors

Hasan Poonawala is a Machine Learning Specialist Solutions Architect at AWS, based in London, UK. Hasan helps customers design and deploy machine learning applications in production on AWS. He is passionate about the use of machine learning to solve business problems across various industries. In his spare time, Hasan loves to explore nature outdoors and spend time with friends and family.



Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.