AWS Machine Learning Blog

Scale YOLOv5 inference with Amazon SageMaker endpoints and AWS Lambda

After data scientists carefully come up with a satisfying machine learning (ML) model, the model must be deployed to be easily accessible for inference by other members of the organization. However, deploying models at scale with optimized cost and compute efficiencies can be a daunting and cumbersome task. Amazon SageMaker endpoints provide an easily scalable and cost-optimized solution for model deployment. The YOLOv5 model, distributed under the GPLv3 license, is a popular object detection model known for its runtime efficiency as well as detection accuracy. In this post, we demonstrate how to host a pre-trained YOLOv5 model on SageMaker endpoints and use AWS Lambda functions to invoke these endpoints.

Solution overview

The following image outlines the AWS services used to host the YOLOv5 model using a SageMaker endpoint and invoke the endpoint using Lambda. The SageMaker notebook accesses a YOLOv5 PyTorch model from an Amazon Simple Storage Service (Amazon S3) bucket, converts it to YOLOv5 TensorFlow SavedModel format, and stores it back to the S3 bucket. This model is then used when hosting the endpoint. When an image is uploaded to Amazon S3, it acts as a trigger to run the Lambda function. The function utilizes OpenCV Lambda layers to read the uploaded image and run inference using the endpoint. After the inference is run, you can use the results obtained from it as needed.

AWS architecture

In this post, we walk through the process of utilizing a YOLOv5 default model in PyTorch and converting it to a TensorFlow SavedModel. This model is hosted using a SageMaker endpoint. Then we create and publish a Lambda function that invokes the endpoint to run inference. Pre-trained YOLOv5 models are available on GitHub. For the purpose of this post, we use the yolov5l model.


As a prerequisite, we need to set up the following AWS Identity and Access Management (IAM) roles with appropriate access policies for SageMaker, Lambda, and Amazon S3:

  • SageMaker IAM role – This requires AmazonS3FullAccess policies attached for storing and accessing the model in the S3 bucket
  • Lambda IAM role – This role needs multiple policies:
    • To access images stored in Amazon S3, we require the following IAM policies:
      • s3:GetObject
      • s3:ListBucket
    • To run the SageMaker endpoint, we need access to the following IAM policies:
      • sagemaker:ListEndpoints
      • sagemaker:DescribeEndpoint
      • sagemaker:InvokeEndpoint
      • sagemaker:InvokeEndpointAsync

You also need the following resources and services:

  • The AWS Command Line Interface (AWS CLI), which we use to create and configure Lambda.
  • A SageMaker notebook instance. These come with Docker pre-installed, and we use this to create the Lambda layers. To set up the notebook instance, complete the following steps:
    • On the SageMaker console, create a notebook instance and provide the notebook name, instance type (for this post, we use ml.c5.large), IAM role, and other parameters.
    • Clone the public repository and add the YOLOv5 repository provided by Ultralytics.

Host YOLOv5 on a SageMaker endpoint

Before we can host the pre-trained YOLOv5 model on SageMaker, we must export and package it in the correct directory structure inside model.tar.gz. For this post, we demonstrate how to host YOLOv5 in the saved_model format. The YOLOv5 repo provides an file that can export the model in many different ways. After you clone the YOLOv5 and enter the YOLOv5 directory from command line, you can export the model with the following command:

$ cd yolov5
$ pip install -r requirements.txt tensorflow-cpu
$ python --weights --include saved_model --nms

This command creates a new directory called yolov5l_saved_model inside the yolov5 directory. Inside the yolov5l_saved_model directory, we should see the following items:

    ├─ assets
    ├─ variables
    │    ├──
    │    └── variables.index
    └── saved_model.pb

To create a model.tar.gz file, move the contents of yolov5l_saved_model to export/Servo/1. From the command line, we can compress the export directory by running the following command and upload the model to the S3 bucket:

$ mkdir export && mkdir export/Servo
$ mv yolov5l_saved_model export/Servo/1
$ tar -czvf model.tar.gz export/
$ aws s3 cp model.tar.gz "<s3://BUCKET/PATH/model.tar.gz>"

Then, we can deploy a SageMaker endpoint from a SageMaker notebook by running the following code:

import os
import tensorflow as tf
from tensorflow.keras import backend
from sagemaker.tensorflow import TensorFlowModel

model_data = '<s3://BUCKET/PATH/model.tar.gz>'
role = '<IAM ROLE>'

model = TensorFlowModel(model_data=model_data,
                        framework_version='2.8', role=role)

INSTANCE_TYPE = 'ml.m5.xlarge'
ENDPOINT_NAME = 'yolov5l-demo'

predictor = model.deploy(initial_instance_count=1,

The preceding script takes approximately 2–3 minutes to fully deploy the model to the SageMaker endpoint. You can monitor the status of the deployment on the SageMaker console. After the model is hosted successfully, the model is ready for inference.

Test the SageMaker endpoint

After the model is successfully hosted on a SageMaker endpoint, we can test it out, which we do using a blank image. The testing code is as follows:

import numpy as np

ENDPOINT_NAME = 'yolov5l-demo'

modelHeight, modelWidth = 640, 640
blank_image = np.zeros((modelHeight,modelWidth,3), np.uint8)
data = np.array(resized_image.astype(np.float32)/255.)
payload = json.dumps([data.tolist()])
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,

result = json.loads(response['Body'].read().decode())
print('Results: ', result)

Set up Lambda with layers and triggers

We use OpenCV to demonstrate the model by passing an image and getting the inference results. Lambda doesn’t come with external libraries like OpenCV pre-built, therefore we need to build it before we can invoke the Lambda code. Furthermore, we want to make sure that we don’t build external libraries like OpenCV every time Lambda is being invoked. For this purpose, Lambda provides a functionality to create Lambda layers. We can define what goes in these layers, and they can be consumed by the Lambda code every time it’s invoked. We also demonstrate how to create the Lambda layers for OpenCV. For this post, we use an Amazon Elastic Compute Cloud (Amazon EC2) instance to create the layers.

After we have the layers in place, we create the script, which is the Lambda code that uses the layers, runs the inference, and gets results. The following diagram illustrates this workflow.

Lambda permission

Create Lambda layers for OpenCV using Docker

Use Dockerfile as follows to create the Docker image using Python 3.7:

FROM amazonlinux

RUN yum update -y
RUN yum install gcc openssl-devel bzip2-devel libffi-devel wget tar gzip zip make -y

# Install Python 3.7
RUN wget
RUN tar -xzvf Python-3.7.12.tgz
WORKDIR /Python-3.7.12
RUN ./configure --enable-optimizations
RUN make altinstall

# Install Python packages
RUN mkdir /packages
RUN echo "opencv-python" >> /packages/requirements.txt
RUN mkdir -p /packages/opencv-python-3.7/python/lib/python3.7/site-packages
RUN pip3.7 install -r /packages/requirements.txt -t /packages/opencv-python-3.7/python/lib/python3.7/site-packages

# Create zip files for Lambda Layer deployment
WORKDIR /packages/opencv-python-3.7/
RUN zip -r9 /packages/ .
WORKDIR /packages/
RUN rm -rf /packages/opencv-python-3.7/

Build and run Docker and store the output ZIP file in the current directory under layers:

$ docker build --tag aws-lambda-layers:latest <PATH/TO/Dockerfile>
$ docker run -rm -it -v $(pwd):/layers aws-lambda-layers cp /packages/ /layers

Now we can upload the OpenCV layer artifacts to Amazon S3 and create the Lambda layer:

$ aws s3 cp layers/ s3://<BUCKET>/<PATH/TO/STORE/ARTIFACTS>
$ aws lambda publish-layer-version --layer-name cv2 --description "Open CV" --content S3Bucket=<BUCKET>,S3Key=<PATH/TO/STORE/ARTIFACTS>/ --compatible-runtimes python3.7

After the preceding commands run successfully, you have an OpenCV layer in Lambda, which you can review on the Lambda console.

Lambda console

Create the Lambda function

We utilize the script to create the Lambda function and use OpenCV. In the following code, change the values for BUCKET_NAME and IMAGE_LOCATION to the location for accessing the image:

import os, logging, json, time, urllib.parse
import boto3, botocore
import numpy as np, cv2

logger = logging.getLogger()
client = boto3.client('lambda')

s3 = boto3.resource('s3')

ENDPOINT_NAME = 'yolov5l-demo'
config = botocore.config.Config(read_timeout=80)
runtime = boto3.client('runtime.sagemaker', config=config)
modelHeight, modelWidth = 640, 640

def lambda_handler(event, context):
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')

    # INPUTS - Download Image file from S3 to Lambda /tmp/
    input_imagename = key.split('/')[-1]'Input Imagename: {input_imagename}')
    s3.Bucket(BUCKET_NAME).download_file(IMAGE_LOCATION + '/' + input_imagename, '/tmp/' + input_imagename)

    # INFERENCE - Invoke the SageMaker Inference Endpoint'Starting Inference ... ')
    orig_image = cv2.imread('/tmp/' + input_imagename)
    if orig_image is not None:
        start_time_iter = time.time()
        # pre-processing input image
        image = cv2.resize(orig_image.copy(), (modelWidth, modelHeight), interpolation = cv2.INTER_AREA)
        data = np.array(image.astype(np.float32)/255.)
        payload = json.dumps([data.tolist()])
        # run inference
        response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME, ContentType='application/json', Body=payload)
        # get the output results
        result = json.loads(response['Body'].read().decode())
        end_time_iter = time.time()
        # get the total time taken for inference
        inference_time = round((end_time_iter - start_time_iter)*100)/100'Inference Completed ... ')

    # OUTPUTS - Using the output to utilize in other services downstream
    return {
        "statusCode": 200,
        "body": json.dumps({
                "message": "Inference Time:// " + str(inference_time) + " seconds.",
                "results": result

Deploy the Lambda function with the following code:

$ zip
$ aws s3 cp s3://<BUCKET>/<PATH/TO/STORE/FUNCTION>
$ aws lambda create-function --function-name yolov5-lambda --handler app.lambda_handler --region us-east-1 --runtime python3.7 --environment "Variables={BUCKET_NAME=$BUCKET_NAME,S3_KEY=$S3_KEY}" --code S3Bucket=<BUCKET>,S3Key="<PATH/TO/STORE/FUNCTION/>"

Attach the OpenCV layer to the Lambda function

After we have the Lambda function and layer in place, we can connect the layer to the function as follows:

$ aws lambda update-function-configuration --function-name yolov5-lambda --layers cv2

We can review the layer settings via the Lambda console.

Lambda settings

Trigger Lambda when an image is uploaded to Amazon S3

We use an image upload to Amazon S3 as a trigger to run the Lambda function. For instructions, refer to Tutorial: Using an Amazon S3 trigger to invoke a Lambda function.

You should see the following function details on the Lambda console.

function overview

Run inference

After you set up Lambda and the SageMaker endpoint, you can test the output by invoking the Lambda function. We use an image upload to Amazon S3 as a trigger to invoke Lambda, which in turn invokes the endpoint for inference. As an example, we upload the following image to the Amazon S3 location <S3 PATH TO IMAGE>/test_image.png configured in the previous section.

test image

After the image is uploaded, the Lambda function is triggered to download and read the image data and send it to the SageMaker endpoint for inference. The output result from the SageMaker endpoint is obtained and returned by the function in JSON format, which we can use in different ways. The following image shows example output overlayed on the image.

inference result

Clean up

Depending on the instance type, SageMaker notebooks can require significant compute usage and cost. To avoid unnecessary costs, we advise stopping the notebook instance when it’s not in use. Additionally, Lambda functions and SageMaker endpoints incur charges only when they’re invoked. Therefore, no cleanup is necessary for those services. However, if an endpoint isn’t being used any longer, it’s good practice to remove the endpoint and the model.


In this post, we demonstrated how to host a pre-trained YOLOv5 model on a SageMaker endpoint and use Lambda to invoke inference and process the output. The detailed code is available on GitHub.

To learn more about SageMaker endpoints, check out Create your endpoint and deploy your model and Build, test, and deploy your Amazon SageMaker inference models to AWS Lambda, which highlights how you can automate the process of deploying YOLOv5 models.

About the authors

Kevin Song is an IoT Edge Data Scientist at AWS Professional Services. Kevin holds a PhD in Biophysics from The University of Chicago. He has over 4 years of industry experience in Computer Vision and Machine Learning. He is involved in helping customers in the sports and life sciences industry deploy  Machine Learning models.

Romil Shah is an IoT Edge Data Scientist at AWS Professional Services. Romil has over 6 years of industry experience in Computer Vision, Machine Learning and IoT edge devices. He is involved in helping customers optimize and deploy their Machine Learning models for edge devices for industrial setup.