AWS Machine Learning Blog

Speed up YOLOv4 inference to twice as fast on Amazon SageMaker

Machine learning (ML) models have been deployed successfully across a variety of use cases and industries, but due to the high computational complexity of recent ML models such as deep neural networks, inference deployments have been limited by performance and cost constraints. To add to the challenge, preparing a model for inference involves packaging the model in the right format and optimizing the model for each target hardware such as CPU, GPU, or AWS Inferentia. ML acceleration technologies have evolved to close the gap between productivity-focused ML frameworks and performance-oriented and efficiency-oriented hardware backends. However, optimizing a model for target hardware still involves assembling a complex tool chain of framework-specific converters and hardware-specific compilers, each with their own dependencies and configuration choices that can be difficult to understand, and then using it to compile the model.

Amazon SageMaker is a fully managed service that enables data scientists and developers to build, train, and deploy ML models at 50% lower total cost of ownership than self-managed deployments on Amazon Elastic Compute Cloud (Amazon EC2). Amazon SageMaker Neo is a capability of SageMaker that automatically compiles ML models for any ML framework and to any target hardware. With Neo, you don’t need to set up third-party or framework-specific compiler software, or tune the model manually for optimizing inference performance. We’re continually updating Neo to support more operators and expand model coverage for frameworks, including TensorFlow, PyTorch, XGBoost, MXNet, Darknet, and ONNX.

In this post, we show you how to deploy a PyTorch YOLOv4 model on a SageMaker ML CPU-based instance. You download a pre-trained model artifact, compile your pre-trained model using Neo, set up a SageMaker endpoint for both compiled and uncompiled model versions, and benchmark performance to evaluate latency, comparing a compiled and uncompiled YOLOv4 model on the same instance.

In our performance comparison, deploying YOLOv4 with Neo improved performance on SageMaker ML instances. Benchmark testing on a SageMaker ML c5.9xlarge instance revealed improved inference performance compared to a baseline model without Neo optimizations running on the same instance type. The Neo compiled model achieved a speedup in latency twice as fast compared to an uncompiled model on the same SageMaker ML instance.

You Only Look Once

Object detection stands out as a computer vision (CV) task that has seen large accuracy improvements due to deep learning (DL) model architectures. An object detection model tries to localize and classify objects in an image, allowing for applications ranging from real-time inspection of manufacturing defects to medical imaging.

YOLO (You Only Look Once) is part of the DL single-stage object detection model family, which includes models such as Single Shot Detector (SSD) and RetinaNet. These models are built by stacking neural networks (backbone, neck, and head) that together perform detection and classification tasks. The prediction outputs are bounding boxes with confidence scores for identified objects and associated classes.

The backbone network takes care of extracting features of the input image, while the head gets trained on a supervised prediction task to predict the edges of the bounding box and classify its contents. The addition of a neck neural network allows the head network to process features from intermediate steps of the backbone. The whole pipeline processes the images only once, hence the name You Only Look Once.

Single-stage models allow for multiple predictions of the same object in a single image. These predictions get disambiguated by a process called non-maximal suppression (NMS), which takes care of leaving only the highest detection probability bounding boxes that don’t overlap significantly. It’s a less computationally expensive workflow than the two-stage approach and is commonly used in real-time inference. With YOLOv4, you can achieve real-time inference above the human perception of around 30 frames per second (FPS). In this post, you explore ways to push the performance of this model even further using Neo as an accelerator for real-time object detection.


For this walkthrough, you need an AWS account and an environment running Python 3.x.


First, we need to ensure we have SageMaker Python SDK 1.x and import the necessary Python packages. If you’re using SageMaker notebook instances, select conda_pytorch_p36 as your kernel. You may have to restart your kernel after upgrading packages. Use the following code to import your packages:

import numpy as np
import time
import json
import requests
import boto3
import os
import sagemaker

Next, we get the AWS Identity and Access Management (IAM) execution role and a few other SageMaker-specific variables from our notebook environment, so that SageMaker can access resources in your AWS account later:

from sagemaker import get_execution_role
from sagemaker.session import Session

role = get_execution_role()
sess = Session()
region = sess.boto_region_name
bucket = sess.default_bucket()

import torch


import sys

3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01)
[GCC 9.3.0]

Import pre-trained YOLOv4

The original pre-trained model is from GitHub. For this post, we provide a traced version of the model artifact packaged in a tarball. Tracing requires no changes to your Python code and converts your PyTorch model to TorchScript, a more portable format for usage with the model server included in SageMaker containers. See the following code:

model_archive = 'yolov4.tar.gz'
--2021-03-30 20:07:02--
Resolving (
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 239656714 (229M) [application/x-gzip]
Saving to: ‘yolov4.tar.gz’

yolov4.tar.gz       100%[===================>] 228.55M  87.7MB/s    in 2.6s    

2021-03-30 20:07:05 (87.7 MB/s) - ‘yolov4.tar.gz’ saved [239656714/239656714]

We upload the model archive to Amazon Simple Storage Service (Amazon S3) with the following code:

from sagemaker.utils import name_from_base
compilation_job_name = name_from_base('torchvision-yolov4-neo-1')
prefix = compilation_job_name+'/model'
model_path = sess.upload_data(path=model_archive, key_prefix=prefix)
compiled_model_path = 's3://{}/{}/output'.format(bucket, compilation_job_name)

Create a SageMaker model and endpoint

Now that the model archive is in Amazon S3, we can create a SageMaker model and deploy it to a SageMaker endpoint. An entry_point script isn’t necessary and can be a blank file. The environment variables in the env parameter are also optional. Create the model and deploy it with the following code:

framework_version = '1.6'
py_version = 'py3'
instance_type = 'ml.c5.9xlarge'
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor

sm_model = PyTorchModel(model_data=model_path,
                               env={"COMPILEDMODEL": 'False', 'MMS_MAX_RESPONSE_SIZE': '100000000', 'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'}
uncompiled_predictor = sm_model.deploy(initial_instance_count=1, instance_type=instance_type)

Use Neo to compile the model

Next, we can compile the model using Neo. The resulting compiled_model is also a SageMaker model and can be deployed to a SageMaker endpoint. When the compiled model is deployed, SageMaker automatically integrates the TVM runtime to interpret the compiled model. Compile the model with the following code:

input_layer_name = 'input0'
input_shape = [1,3,416,416]
data_shape = json.dumps({input_layer_name: input_shape})
target_device = 'ml_c5'
framework = 'PYTORCH'
sm_model_compiled = PyTorchModel(model_data=model_path,
                               framework_version = framework_version,
compiled_model = sm_model_compiled.compile(target_instance_family=target_device, 
compiled_model.env = compiled_env

Deploy the compiled model as an optimized predictor with the following code:

optimized_predictor = compiled_model.deploy(initial_instance_count = 1,
                                  instance_type = instance_type

Make predictions using the endpoints

Finally, we can compare the performance between the uncompiled and compiled models. We run 1,000 sequential iterations and calculate the round trip latency for each endpoint request:

iters = 1000
warmup = 100
client = boto3.client('sagemaker-runtime', region_name=region)

content_type = 'application/x-image'

sample_img_url = ""
body = requests.get(sample_img_url).content
compiled_perf = []
uncompiled_perf = []
for i in range(iters):
    t0 = time.time()
    response = client.invoke_endpoint(EndpointName=optimized_predictor.endpoint_name, Body=body, ContentType=content_type)
    t1 = time.time()
    #convert to millis
    compiled_elapsed = (t1-t0)*1000

    t0 = time.time()
    response = client.invoke_endpoint(EndpointName=uncompiled_predictor.endpoint_name, Body=body, ContentType=content_type)
    t1 = time.time()
    #convert to millis
    uncompiled_elapsed = (t1-t0)*1000

    if warmup == 0:
        print(f'warmup ({i}, {iters}) : c - {compiled_elapsed} ms . uc - {uncompiled_elapsed} ms')
        warmup = warmup – 1

Performance comparison

The following graph shows the measured latency speedup of the compiled model compared with a uncompiled model on the same instance. The default SageMaker PyTorch container uses Intel one-DNN libraries for inference acceleration, so any speedup from Neo is on top of what’s provided by Intel libraries. Speedup is specific to the model and instance type, so the performance gain achieved with Neo varies based on your model architecture and target instance type.

On the ml.c5.9xlarge instance, we see an average latency of 397 milliseconds for the baseline endpoint and 188 milliseconds for the Neo optimized endpoint. Similarly, for the tail latency (95th percentile), we see 446 milliseconds for the baseline endpoint and 254 milliseconds for the Neo optimized endpoint. Optimizing the model with Neo resulted in twice as fast performance.

Speedup across common models and frameworks

As you saw in the preceding section, using Neo for model compilation provides a speedup over an uncompiled model using Intel one-DNN libraries alone. The following table lists latency speedups that you might see from a few other common models across frameworks in CPU and GPU instances.

Task Framework Model Target SageMaker Speedup
Image Classification TensorFlow mobilenetv2 GPU 200%
Image Classification TensorFlow resnet50 CPU 286%
Image Classification PyTorch resnet152 CPU 33%
Semantic Segmentation TensorFlow u-net CPU 22%

These numbers are only benchmarks and vary for your specific model, instance type, and payload. The numbers in the table are measured end to end on SageMaker. Other optimizations such as pruning and quantization are also worth looking into as part of your overall model optimization strategy.


In this post, we deployed a PyTorch YOLOv4 model on a SageMaker ML CPU-based instance and compared performance between an uncompiled model and a model compiled with Neo. We saw a performance increase in the Neo compiled model—twice as fast compared to an uncompiled model on the same SageMaker ML instance.

We continue to improve Neo’s operator coverage and performance across different frameworks and models. If you have any questions or comments, use the Amazon SageMaker Discussion Forums or send an email to

About the Author

Santosh Bhavani is a Senior Technical Product Manager with the Amazon SageMaker Elastic Inference team. He focuses on helping SageMaker customers accelerate model inference and deployment. In his spare time, he enjoys traveling, playing tennis, and drinking lots of Pu’er tea.



Vamshidhar Dantu is a Software Developer with AWS Deep Learning. He focuses on building scalable and easily deployable deep learning systems. In his spare time, he enjoy spending time with family and playing badminton.