AWS Startups Blog

How Startups Deploy Pretrained Models on Amazon SageMaker

By Allie K. Miller, US Head of AI Business Development for Startups and Venture Capital, and Sean Wilkinson, Solutions Architect, Machine Learning, AWS

For most machine learning startups, the most valuable resource is time. They want to focus on developing the unique aspects of their business, not managing the dynamic compute infrastructure needed to run their applications. Productionizing machine learning should be easier, and that’s where AWS comes in. In this blog post and corresponding GitHub repo, you will learn how to bring a pre-trained model to Amazon SageMaker and have production-ready model serving in under 15 minutes.

SageMaker is a managed service designed to accelerate machine learning development. It includes components for building, training, and deploying machine learning models. Each SageMaker component is modular, so you can pick and choose which features you want—from experiment management to concept drift detection. One SageMaker feature frequently used by startups is model hosting. With model hosting, you can quickly deploy models on SageMaker as a RESTful API without worrying about scaling it as your startup grows.


SageMaker hosting creates a managed API endpoint for models that your applications can use to retrieve real-time predictions. Each endpoint supports load balancing, auto-scaling, A/B testing, and advanced security features (like end-to-end encryption with custom keys), so it can scale with your startup. SageMaker also includes built-in containers for popular frameworks like PyTorch and TensorFlow that come with robust model serving stacks, so all you need to provide is your model and inference code. This example uses the pre-built PyTorch container, but you’re encouraged to use the TensorFlow containers or your own custom containers as needed.

Though SageMaker provides the container, you’ll need to supply a model and an inference script. This example uses the popular GPT-2 model developed by OpenAI to generate text. The inference script runs in the SageMaker container and loads our model, makes predictions, and performs input/output processing. A simple example of model loading is provided below, but you can reference the SageMaker Python SDK documentation for a thorough overview of inference script requirements.

import os

from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextGenerationPipeline

def model_fn(model_dir):
    """Load the model for inference."""

    # Load the GPT-2 tokenizer from disk.
    vocab_path = os.path.join(model_dir, 'model/vocab.json')
    merges_path = os.path.join(model_dir, 'model/merges.txt')
    tokenizer = GPT2Tokenizer(vocab_file=vocab_path,
                              merges_file=merges_path)

    # Load the GPT-2 model from disk.
    model_path = os.path.join(model_dir, 'model/')
    model = GPT2LMHeadModel.from_pretrained(model_path)

    return TextGenerationPipeline(model=model, tokenizer=tokenizer)
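In addition to model_fn, the built-in PyTorch container lets you override input_fn, predict_fn, and output_fn to customize the input/output processing mentioned above. Here is a minimal sketch, assuming JSON requests (the request format and handler logic are illustrative):

```python
import json

def input_fn(request_body, request_content_type):
    # Deserialize the incoming request body into a prompt string.
    if request_content_type == 'application/json':
        return json.loads(request_body)
    raise ValueError('Unsupported content type: {}'.format(request_content_type))

def predict_fn(input_data, model):
    # `model` is whatever model_fn returned -- here, a TextGenerationPipeline.
    return model(input_data)

def output_fn(prediction, response_content_type):
    # Serialize the generated text for the response.
    return json.dumps(prediction)
```

Any handler you don’t define falls back to the container’s default implementation.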

The inference script loads the GPT-2 model from the Hugging Face Transformers library, which isn’t included by default in the PyTorch container. To ensure the library is available at runtime, you’ll create a requirements.txt file that specifies the external libraries needed. The packages listed in the requirements file are installed when the container starts.
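For this example, the requirements file only needs to pull in the Transformers library (the pinned version below is illustrative; use whichever release your model code targets):

```
transformers==3.1.0
```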

Once you have completed the requirements file, you are ready to create a deployment package. The deployment package should include the serialized model, inference script, and requirements file. When using the built-in containers, the directory structure of the package must conform to the structure specified in the documentation. SageMaker expects a tar archive with gzip compression, which you can create with the following code:

import os
import tarfile

zipped_model_path = os.path.join(model_path, "model.tar.gz")

with, "w:gz") as tar:
    # Add the package contents (serialized model, inference script,
    # and requirements file) at the root of the archive.
    for name in os.listdir(model_path):
        if name != "model.tar.gz":
            tar.add(os.path.join(model_path, name), arcname=name)

Now that the deployment package is complete, you can use the SageMaker Python SDK to deploy the endpoint. Notice that you specify the PyTorch version (1.5) and Python version (3) when creating the endpoint. The SDK uses this information to select the compatible container. You may choose different instance types at this point. This example uses an m5 instance because it has a high memory/vCPU ratio so that it can hold multiple copies of the large GPT-2 model in memory.

model = PyTorchModel(entry_point='',  # name of your inference script (illustrative)
                     model_data=zipped_model_path,    # S3 URI of the deployment package
                     role=get_execution_role(),
                     framework_version='1.5',
                     py_version='py3')

predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.m5.xlarge')

It will take a few minutes for the instance to deploy, but once it completes, you can use the SageMaker Runtime API to query the endpoint for predictions:

import json
import boto3

sm = boto3.client('sagemaker-runtime')

response = sm.invoke_endpoint(EndpointName=endpoint_name,
                              ContentType='application/json',  # must match what your inference script accepts
                              Body=json.dumps(prompt))
result = json.loads(response['Body'].read().decode())

By providing the endpoint with the prompt “Working with SageMaker makes machine learning”, GPT-2 generates the following output: “Working with SageMaker makes machine learning a lot easier than it used to be.”

You have successfully created a scalable API backed by a GPT-2 model – awesome! For an example with another popular natural language processing model, BERT, visit this notebook. To avoid incurring unnecessary charges, shut down the endpoint with the following code once you are done:

predictor.delete_endpoint()

Amazon SageMaker improves not only startups’ speed of deployment, but every phase of the machine learning workflow. If you are interested in experimenting with more advanced features of SageMaker hosting, check out Model Monitor to detect concept drift, Autoscaling to dynamically adjust the number of instances, or VPC config to control network access to and from your endpoint. You could even try deploying a different open source model and share your results with others!

And as always, stay up to date on the latest machine learning news for startups here.


Interested in ML on AWS? Contact us today!