AWS Machine Learning Blog

Deploying PyTorch models for inference at scale using TorchServe

Many services you interact with today rely on machine learning (ML). From online search and product recommendations to speech recognition and language translation, these services need ML models to serve predictions. As ML finds its way into even more services, you face the challenge of taking the results of your hard work and deploying the model quickly and reliably to production. And as the number of people consuming these services increases, it’s even more challenging to make sure that the models deliver low-latency predictions securely and reliably to millions of users simultaneously.

Developers use many different open-source frameworks for model development. Over the last few years, PyTorch has become the deep learning framework of choice for many researchers, developers, and data scientists developing ML-powered applications. They prefer PyTorch for its simplicity and Pythonic way of implementing and training models, and the ability to seamlessly switch between eager and graph modes. However, until now there wasn’t an easy and natively supported way to serve PyTorch models in production at scale.

AWS is excited to share the experimental release of TorchServe, an open-source model serving library for PyTorch.

AWS developed TorchServe in partnership with Facebook. AWS and Facebook will maintain and continue contributing to TorchServe, along with the broader PyTorch community. With over 83% of the cloud-based PyTorch projects happening on AWS, we are excited to launch TorchServe to address the difficulty of deploying PyTorch models. With TorchServe, you can deploy PyTorch models in either eager or graph mode using TorchScript, serve multiple models simultaneously, version production models for A/B testing, load and unload models dynamically, and monitor detailed logs and customizable metrics.

TorchServe is easy to use. It comes with a convenient CLI to deploy locally and is easy to package into a container and scale out with Amazon SageMaker or Amazon EKS. With default handlers for common problems such as image classification, object detection, image segmentation, and text classification, you can deploy with just a few lines of code—no more writing lengthy service handlers for initialization, preprocessing, and post-processing. TorchServe is open-source, which means it’s fully open and extensible to fit your deployment needs.

This post takes an in-depth look at TorchServe, its key features and capabilities, and how to use it. It provides code examples to illustrate its benefits and key concepts, and also shows an example for scaling PyTorch inference using TorchServe and Amazon SageMaker.

Before the release of TorchServe, if you wanted to serve PyTorch models, you had to develop your own model serving solutions. You had to develop custom handlers for your model, develop a model server, build your own Docker container, figure out how to make it accessible via the network and integrate it with your cluster orchestration system, and so on.

With TorchServe, you get many features out-of-the-box. It gives you full flexibility of deploying trained PyTorch models performantly at scale without having to write custom handlers for popular models. You can go from a trained model to production deployment with just a few lines of code.

Getting started with TorchServe

Getting started with TorchServe is easy. This post tested the examples on a c5.xlarge Amazon EC2 instance running the Deep Learning AMI. You can also try the steps on your local laptop or desktop. For instructions on launching an Amazon EC2 instance, see Getting Started with Amazon EC2.

To install TorchServe, follow the instructions on GitHub. It is recommended to use a Conda or other virtual environment to manage dependencies. After you install TorchServe, you are ready to deploy your first model by completing the following steps:

  1. Download the TorchServe repository for access to examples. Run the following code:
    mkdir torchserve-examples
    cd torchserve-examples
    git clone
  2. Download a DenseNet image classification model from the official PyTorch model repository. Run the following code:
  3. Convert the model from PyTorch to TorchServe format. TorchServe uses a model archive format with the extension .mar. A .mar file packages model checkpoints, or a model definition file with its state_dict (a dictionary object that maps each layer to its parameter tensor). You can use the torch-model-archiver tool in TorchServe to create a .mar file. You don’t have to create a custom handler; just specify --handler image_classifier, and it automatically sets up a handler for you. Run the following code:
    torch-model-archiver --model-name densenet161 \
    --version 1.0 --model-file serve/examples/image_classifier/densenet_161/ \
    --serialized-file densenet161-8d451a50.pth \
    --extra-files serve/examples/image_classifier/index_to_name.json \
    --handler image_classifier
    ls *.mar

    You receive the following output:

    densenet161.mar serve
  4. Host a model. Run the following code:
    mkdir model_store
    mv densenet161.mar model_store/
    torchserve --start --model-store model_store --models densenet161=densenet161.mar
  5. Test TorchServe by opening another terminal on the same host and running the following code (you can use tmux to manage multiple sessions):
    curl -O
    curl -X POST -T kitten.jpg

    You receive the following output:

     "tiger_cat": 0.4693356156349182
     "tabby": 0.46338796615600586
     "Egyptian_cat": 0.06456131488084793
     "lynx": 0.0012828155886381865
     "plastic_bag": 0.00023323005007114261

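If you prefer to call the server from Python rather than curl, the request can be sketched as follows. This is a minimal sketch, not part of the original walkthrough: the function names are hypothetical, and it assumes the inference API is reachable on localhost port 8080 and serves predictions under /predictions/<model name>. Adjust both if your server is configured differently.

```python
import urllib.request


def build_prediction_request(image_bytes, model_name="densenet161",
                             host="localhost", port=8080):
    """Build (but do not send) an HTTP POST request for a prediction.

    ASSUMPTION: the inference API listens on http://<host>:<port> and
    exposes /predictions/<model_name>. Both are illustrative defaults.
    """
    url = f"http://{host}:{port}/predictions/{model_name}"
    return urllib.request.Request(url, data=image_bytes, method="POST")


def predict(image_path, **kwargs):
    """Send the image stored at image_path and return the response text."""
    with open(image_path, "rb") as f:
        request = build_prediction_request(f.read(), **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server from step 4 running, predict("kitten.jpg") would return the same text that the curl command prints.
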
Zero-code change deployment for standard models with default handlers

Many deep learning use cases fall under one of the following categories: image classification, object detection, image segmentation, and text classification. If you’re working on one of these applications, you can deploy with zero code changes, as in the previous section. You don’t need to convert from eager to TorchScript or vice versa, or write service handlers for initialization, preprocessing, and post-processing.

The TorchServe torch-model-archiver tool can automatically detect and handle PyTorch’s different representations (eager mode and TorchScript). For common models supported by packages such as TorchVision, TorchText, and TorchAudio, torch-model-archiver uses a default handler for initialization, preprocessing, and post-processing. You can still write custom handlers if you have a custom model or want to introduce custom logic to extend the default handlers.

For a full list of supported default handlers, see the GitHub repo. For instructions on writing your own custom handler, see the GitHub repo.
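
To give a feel for what a custom handler involves, the following is a rough sketch of the three stages mentioned above (preprocessing, inference, and post-processing). It assumes the server calls a module-level handle(data, context) function with a list of requests and expects a list of the same length back; treat that entry point, the request keys, and the word-counting "model" as illustrative assumptions rather than the definitive interface, and check the GitHub repo for the exact contract.

```python
def preprocess(requests):
    """Extract a text payload from each request in the batch."""
    texts = []
    for request in requests:
        # ASSUMPTION: the payload arrives under a "data" or "body" key.
        payload = request.get("data") or request.get("body")
        if isinstance(payload, (bytes, bytearray)):
            payload = payload.decode("utf-8")
        texts.append(payload)
    return texts


def inference(texts):
    """Placeholder for a real model: counts the words in each input."""
    return [len(text.split()) for text in texts]


def postprocess(outputs):
    """Return one result per request, in the same order."""
    return [{"word_count": count} for count in outputs]


def handle(data, context):
    """Hypothetical entry point; context is unused in this sketch."""
    if data is None:
        return None
    return postprocess(inference(preprocess(data)))
```

A real handler would load the PyTorch model once during initialization and replace the placeholder inference function with a forward pass.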

Hosting multiple models and scaling workers

TorchServe provides a management API to list registered models, register new models to existing servers, unregister current models, increase or decrease the number of workers per model, describe the status of a model, add versions, and set default versions. The management API listens on port 8081 and is only accessible from localhost by default, but you can change the default behavior.

To register a new model, complete the following steps:

  1. Download a new model with the following code:
    torch-model-archiver --model-name fastrcnn --version 1.0 \
    --model-file serve/examples/object_detector/fast-rcnn/ \
    --serialized-file fasterrcnn_resnet50_fpn_coco-258fb6c6.pth \
    --handler object_detector \
    --extra-files serve/examples/object_detector/index_to_name.json
    mv fastrcnn.mar model_store/
  2. Register the new model with the following code:
    curl -X POST "http://localhost:8081/models?url=fastrcnn.mar"

    You receive the following output:

     "status": "Model \"fastrcnn\" registered"

    You can also query the list of registered models with the following code:

    curl "http://localhost:8081/models"

    You receive the following output:

        "models": [
                "modelName": "densenet161",
                "modelUrl": "densenet161.mar"
                "modelName": "fastrcnn",
                "modelUrl": "fastrcnn.mar"
  3. Scale workers for a model. A new model has no workers assigned to it, so set a minimum number of workers with the following code:
    curl -v -X PUT "http://localhost:8081/models/fastrcnn?min_worker=2"
    curl "http://localhost:8081/models/fastrcnn"

    You receive the following output:

        "modelName": "fastrcnn",
        "modelVersion": "1.0",
        "modelUrl": "fastrcnn.mar",
        "runtime": "python",
        "minWorkers": 2,
        "maxWorkers": 2,
        "batchSize": 1,
        "maxBatchDelay": 100,

    If your model is hosted on a CPU with many cores such as the c5.24xlarge EC2 instance with 96 vCPUs, you can easily scale the number of threads by using the method described previously.

  4. Unregister the model with the following code:
    curl -X DELETE http://localhost:8081/models/fastrcnn/
  5. To version a model, pass a version number to the --version argument when calling torch-model-archiver. See the following code:
    torch-model-archiver --model-name fastrcnn --version 1.0 ...
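
The management calls in the preceding steps can also be issued from Python. The sketch below mirrors the curl commands using only the standard library; the helper names are hypothetical, and the base URL assumes the default localhost port 8081 described earlier.

```python
import urllib.parse
import urllib.request

# Default management address from this post; change it if you reconfigure the server.
MANAGEMENT_URL = "http://localhost:8081"


def register_model(mar_file):
    """POST /models?url=<mar_file> registers an archive from the model store."""
    query = urllib.parse.urlencode({"url": mar_file})
    return urllib.request.Request(f"{MANAGEMENT_URL}/models?{query}", method="POST")


def scale_workers(model_name, min_worker):
    """PUT /models/<name>?min_worker=<n> sets the minimum number of workers."""
    query = urllib.parse.urlencode({"min_worker": min_worker})
    return urllib.request.Request(
        f"{MANAGEMENT_URL}/models/{model_name}?{query}", method="PUT")


def describe_model(model_name):
    """GET /models/<name> describes the status of a model."""
    return urllib.request.Request(f"{MANAGEMENT_URL}/models/{model_name}", method="GET")


def unregister_model(model_name):
    """DELETE /models/<name> unregisters a model."""
    return urllib.request.Request(f"{MANAGEMENT_URL}/models/{model_name}", method="DELETE")


def send(request):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, send(register_model("fastrcnn.mar")) followed by send(scale_workers("fastrcnn", 2)) corresponds to steps 2 and 3.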

Running batch predictions (or batch inferences)

For some applications, you may need to run inferences in batches. If you have a large dataset and want to generate inferences offline, it’s computationally more efficient to gather your dataset into large batches and process them. In some real-time use cases where you have a more tolerant latency budget, you can batch several requests and serve the results together, which improves resource utilization and reduces operational expense. Another use case of batching is preprocessing training data with other models before training a new model.

TorchServe supports batch inferences natively. For instructions on running TorchServe batch inferences, see the GitHub repo.
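
For the offline case, the grouping step itself is simple. The following generic sketch (not TorchServe's internal batching mechanism; the function name is hypothetical) splits any iterable of inputs into fixed-size batches that you could then send for inference one batch at a time.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items from items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


# Five inputs with a batch size of 2 give two full batches and one partial batch.
print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Larger batches generally improve throughput at the cost of latency, which is why the batch size is usually tuned against the latency budget mentioned above.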

Logging and metrics

TorchServe gives you easy access to logs and metrics that are fully customizable. By default, TorchServe prints log messages to stderr and stdout. TorchServe uses log4j, and you can customize logging by modifying log4j properties.

TorchServe also collects system-level metrics such as CPUUtilization, DiskUtilization, and others by default. You can also specify custom metrics using the metrics API. The following screenshot shows the default log output when an inference is requested from TorchServe.

Large-scale PyTorch deployments using TorchServe and Amazon SageMaker

Amazon SageMaker is a fully managed service with capabilities for data labeling, model development, large-scale model training, and model deployments. For deployment and hosting, Amazon SageMaker offers convenient one-click deployment of models trained on Amazon SageMaker training clusters. However, it’s also fully modular—you can bring in your own algorithms and containers and use only the services that you need.

This post demonstrates how you can build a TorchServe container and host it using Amazon SageMaker. Amazon SageMaker provides a fully-managed hosting experience. Just specify the type of instance, and the maximum and minimum number desired, and Amazon SageMaker takes care of the rest.

With a few lines of code, you can ask Amazon SageMaker to launch the instances, download your model from Amazon S3 to your TorchServe container, and set up the secure HTTPS endpoint for your application. On the client side, you get predictions with a simple API call to this secure endpoint backed by TorchServe.

Running the example

You can run the example on Amazon SageMaker Notebook instances, Amazon EC2, or your laptop or desktop. If you’re using a local laptop or desktop, make sure you install and configure the AWS CLI, and install the AWS SDK for Python (boto3) and the Amazon SageMaker Python SDK. After you deploy, the models are hosted on Amazon SageMaker fully managed deployment instances.

For the most convenient experience, launch an Amazon SageMaker notebook instance, which offers a Jupyter notebook interface and comes with all AWS libraries installed and ready to go. For more information, see Use Amazon SageMaker Notebook Instances.

This post assumes that you are running the following steps on an Amazon SageMaker notebook instance.

IAM roles and policies

When you create a new notebook instance, you are prompted to create a new IAM role. You want to create a role with Amazon S3 access. The Amazon SageMaker console guides you through this process.

Your new role looks like AmazonSageMaker-ExecutionRole-XXX, with a unique identifier for XXX. Because you need to build and push a TorchServe container to Amazon ECR, you also need to add the AmazonEC2ContainerRegistryFullAccess policy to your notebook instance role. You can do this on the IAM console or by using the AWS CLI. See the following code:

aws iam attach-role-policy \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess \
    --role-name AmazonSageMaker-ExecutionRole-XXX

For more information, see Adding and Removing IAM Identity Permissions.

Following along

The code, configuration files, Jupyter notebooks, and Dockerfiles used in this post are available on GitHub. The steps in the following example are from the deploy_torchserve.ipynb Jupyter notebook.

To follow along, open the deploy_torchserve.ipynb notebook and execute each cell. Alternatively, you could copy and paste the following steps into a separate cell in a new Jupyter notebook and run them in order.

The following screenshot shows the deploy_torchserve.ipynb example.

Cloning the example repository

To clone the example repository, enter the following code:

git clone
cd torchserve-examples

Clone the TorchServe repository and install torch-model-archiver

Use the `torch-model-archiver` tool to create a model archive file. The .mar model archive file contains model checkpoints along with its `state_dict` (a dictionary object that maps each layer to its parameter tensor).

!git clone
!pip install serve/model-archiver/

Downloading a PyTorch model, creating a TorchServe archive and uploading it to Amazon S3

To download a PyTorch model and create a TorchServe archive, enter the following code:

!wget -q

model_file_name = 'densenet161'
!torch-model-archiver --model-name {model_file_name} \
--version 1.0 --model-file serve/examples/image_classifier/densenet_161/ \
--serialized-file densenet161-8d451a50.pth \
--extra-files serve/examples/image_classifier/index_to_name.json \
--handler image_classifier

Uploading the model to Amazon S3

To upload the model to Amazon S3, complete the following steps:

  1. Create a boto3 session and get the Region and account information
    import boto3, time, json
    sess    = boto3.Session()
    sm      = sess.client('sagemaker')
    region  = sess.region_name
    account = boto3.client('sts').get_caller_identity().get('Account')
    import sagemaker
    role = sagemaker.get_execution_role()
    sagemaker_session = sagemaker.Session(boto_session=sess)
  2. Get the default Amazon SageMaker S3 bucket name
    bucket_name = sagemaker_session.default_bucket()
    prefix = 'torchserve'
  3. Create a compressed tar.gz file out of the densenet161.mar file, because Amazon SageMaker expects models to be in a tar.gz file.
    !tar cvfz {model_file_name}.tar.gz densenet161.mar
  4. Upload the model to your S3 bucket under the models directory.
    !aws s3 cp {model_file_name}.tar.gz s3://{bucket_name}/{prefix}/models/
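
If you are not working in a notebook, the packaging in step 3 can be done with Python's standard tarfile module instead of the tar command. This is a minimal sketch; the function name is hypothetical, and placing the .mar file at the root of the archive mirrors what the tar command in step 3 produces.

```python
import os
import tarfile


def package_model_archive(mar_path, tar_path=None):
    """Wrap a .mar file in a gzip-compressed tar archive and return its path."""
    if tar_path is None:
        # densenet161.mar -> densenet161.tar.gz, next to the original file
        tar_path = os.path.splitext(mar_path)[0] + ".tar.gz"
    with tarfile.open(tar_path, "w:gz") as archive:
        # arcname drops any directory prefix so the .mar sits at the archive root
        archive.add(mar_path, arcname=os.path.basename(mar_path))
    return tar_path
```

Calling package_model_archive("densenet161.mar") writes densenet161.tar.gz, which you can then upload as in step 4.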

Creating an Amazon ECR registry

Create a new Docker container registry for your TorchServe container images. Amazon SageMaker pulls the TorchServe container from this registry. See the following code:

registry_name = 'torchserve'
!aws ecr create-repository --repository-name {registry_name}

Building a TorchServe Docker container and pushing it to Amazon ECR

The repository for this post already contains a Dockerfile for building a TorchServe container. Build a Docker container image locally and push it to your Amazon ECR repository you created in the previous step. See the following code:

image_label = 'v1'
image = f'{account}.dkr.ecr.{region}.amazonaws.com/{registry_name}:{image_label}'

!docker build -t {registry_name}:{image_label} .
!$(aws ecr get-login --no-include-email --region {region})
!docker tag {registry_name}:{image_label} {image}
!docker push {image}

The command output confirms that the container was built and pushed to Amazon ECR successfully.
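
The image name used for tagging and pushing has to follow the Amazon ECR naming layout: account ID, Region, repository, and tag. The helper below is a small sketch of that layout under the assumption that you are in a standard commercial Region (other partitions use a different domain suffix); the function name is hypothetical.

```python
def ecr_image_uri(account, region, repository, tag):
    """Compose an ECR image URI such as <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    # ASSUMPTION: standard AWS partition, so the registry domain ends in amazonaws.com.
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# 123456789012 is a placeholder account ID, not a real account.
print(ecr_image_uri("123456789012", "us-west-2", "torchserve", "v1"))
# 123456789012.dkr.ecr.us-west-2.amazonaws.com/torchserve:v1
```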

Hosting an inference endpoint

There are multiple ways to host an inference endpoint and make predictions. The quickest approach is to use the Amazon SageMaker Python SDK. However, if you’re going to invoke the endpoint from a client application, you should use the AWS SDK for the language of your choice.

Hosting an inference endpoint and making predictions with Amazon SageMaker SDK

To host an inference endpoint and make predictions using Amazon SageMaker SDK, complete the following steps:

  1. Create a model. The model function expects the name of the TorchServe container image and the location of your trained models. Passing predictor_cls makes the deploy call return a predictor object that you can use to invoke the endpoint. See the following code:
    import sagemaker
    from sagemaker.model import Model
    from sagemaker.predictor import RealTimePredictor
    role = sagemaker.get_execution_role()
    # Must match the S3 location the model archive was uploaded to
    model_data = f's3://{bucket_name}/{prefix}/models/{model_file_name}.tar.gz'
    sm_model_name = 'torchserve-densenet161'
    torchserve_model = Model(model_data = model_data,
                            image = image,
                            role = role,
                            predictor_cls = RealTimePredictor,
                            name = sm_model_name)

    For more information about the model function, see Model.

  2. On the Amazon SageMaker console, to see the model details, choose Models.
  3. Deploy the model endpoint. Specify the instance type and number of instances you want Amazon SageMaker to run the container on. See the following code:
    endpoint_name = 'torchserve-endpoint-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
    predictor = torchserve_model.deploy(instance_type='ml.m4.xlarge',
     initial_instance_count=1,
     endpoint_name = endpoint_name)

    You can also set it up to automatically scale based on metrics, such as the total number of invocations. For more information, see Automatically Scale Amazon SageMaker Models.

  4. On the Amazon SageMaker console, to see the hosted endpoint, choose Endpoints.
  5. Test the model with the following code:
    !wget -q 
    file_name = 'kitten.jpg'
    with open(file_name, 'rb') as f:
        # Read the raw image bytes to send as the request payload
        payload = f.read()
    response = predictor.predict(data=payload)
    print(*json.loads(response), sep = '\n')

The following screenshot shows the output of invoking the model hosted by TorchServe. The model thinks the kitten in the image is either a tiger cat or a tabby cat.

If you’re building applications such as mobile apps or webpages that need to invoke the TorchServe endpoint for getting predictions on new data, you can use the AWS API rather than the Amazon SageMaker SDK. For example, if you’re using Python on the client side, use the AWS SDK for Python (boto3). For an example of how to use boto3 to create a model, configure an endpoint, create an endpoint, and finally run inferences on the inference endpoint, refer to this example Jupyter notebook on GitHub.
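
As a rough illustration of the client side, the sketch below invokes an endpoint through the SageMaker runtime client and decodes the JSON response. It is a sketch under stated assumptions, not the code from the example notebook: the function name is hypothetical, the content type is an assumption, and the client is passed in as a parameter so that the function does not depend on how you create your boto3 session.

```python
import json


def classify_image(runtime_client, endpoint_name, image_bytes):
    """Send image bytes to a SageMaker endpoint and return the decoded JSON.

    runtime_client is expected to behave like boto3.client('sagemaker-runtime').
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/x-image",  # ASSUMPTION: adjust to what your handler expects
        Body=image_bytes,
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())
```

In a client application, you would create the client once with boto3.client('sagemaker-runtime') and pass it in together with the endpoint name you chose at deployment time.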


This post introduced TorchServe and its key features and benefits. TorchServe is easy to use for both developers getting models ready for production and Ops engineers deploying containers in production. TorchServe supports eager mode and TorchScript and comes with default handlers for the most commonly deployed models, so you can deploy with zero code changes. TorchServe can host multiple models simultaneously, and supports versioning. For a full list of features, see the GitHub repo.

This post also presented an end-to-end demo of deploying PyTorch models on TorchServe using Amazon SageMaker. You can use this as a template to deploy your own PyTorch models on Amazon SageMaker. A complete example is available on GitHub.

If you have questions or comments about TorchServe or this post, please leave a comment below or create an issue on the GitHub repo.

About the Authors

Shashank Prasanna is an AI & Machine Learning Technical Evangelist at Amazon Web Services (AWS) where he focuses on helping engineers, developers and data scientists solve challenging problems with machine learning. Prior to joining AWS, he worked at NVIDIA, MathWorks (makers of MATLAB & Simulink) and Oracle in product marketing, product management, and software development roles.




Manoj Rao is a Software Developer with AWS. He works on making ML Runtime / Inference as AWSome as possible. In his spare time, he tinkers with the Linux Kernel, Emacs, and scribbles on his blog.