AWS Machine Learning Blog

Deploy BLOOM-176B and OPT-30B on Amazon SageMaker with large model inference Deep Learning Containers and DeepSpeed

April 2023: This post was reviewed and updated for accuracy.

The last few years have seen rapid development in the field of deep learning. Although hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly encounter issues deploying their large deep learning models for applications such as natural language processing (NLP).

In an earlier post, we discussed capabilities and configurable settings in Amazon SageMaker model deployment that can make inference with these large models easier. Today, we announce a new Amazon SageMaker Deep Learning Container (DLC) that you can use to get started with large model inference in a matter of minutes. This DLC packages some of the most popular open-source libraries for model parallel inference, such as DeepSpeed and Hugging Face Accelerate.

In this post, we use a new SageMaker large model inference DLC to deploy two of the most popular large NLP models: BigScience’s BLOOM-176B and Meta’s OPT-30B from the Hugging Face repository. In particular, we use Deep Java Library (DJL) serving and tensor parallelism techniques from DeepSpeed to achieve 0.1 second latency per token in a text generation use case.

You can find our complete example notebooks in our GitHub repository.

Large model inference techniques

Language models have recently exploded in both size and popularity. With easy access from model zoos such as Hugging Face and improved accuracy and performance in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, large models are often too big to fit within the memory of a single accelerator. For example, the BLOOM-176B model can require more than 350 gigabytes of accelerator memory, which far exceeds the capacity of hardware accelerators available today. This necessitates the use of  model parallel techniques from libraries like DeepSpeed and Hugging Face Accelerate to distribute a model across multiple accelerators for inference. In this post, we use the SageMaker large model inference container to generate and compare latency and throughput performance using these two open-source libraries.

DeepSpeed and Accelerate use different techniques to optimize large language models for inference. The key difference is DeepSpeed’s use of optimized kernels. These kernels can dramatically improve inference latency by reducing bottlenecks in the computation graph of the model. Optimized kernels can be difficult to develop and are typically specific to a particular model architecture; DeepSpeed supports popular large models such as OPT and BLOOM with these optimized kernels. In contrast, Hugging Face’s Accelerate library doesn’t include optimized kernels at the time of writing. As we discuss in our results section, this difference is responsible for much of the performance edge that DeepSpeed has over Accelerate.

A second difference between DeepSpeed and Accelerate is the type of model parallelism. Accelerate uses pipeline parallelism to partition a model between the hidden layers of a model, whereas DeepSpeed uses tensor parallelism to partition the layers themselves. Pipeline parallelism is a flexible approach that supports more model types and can improve throughput when larger batch sizes are used. Tensor parallelism requires more communication between GPUs because model layers can be spread across multiple devices, but can improve inference latency by engaging multiple GPUs simultaneously. You can learn more about parallelism techniques in Introduction to Model Parallelism and Model Parallelism.

Solution overview

To effectively host large language models, we need features and support in the following key areas:

  • Building and testing solutions – Given the iterative nature of ML development, we need the ability to build, rapidly iterate, and test how the inference endpoint will behave when these models are hosted, including the ability to fail fast. These models can typically be hosted only on larger instances like p4dn or g5, and given the size of the models, it can take a while to spin up an inference instance and run any test iteration. Local testing usually has constraints because you need a similar instance in size to test, and these models aren’t easy to obtain.
  • Deploying and running at scale – The model files need to be loaded onto the inference instances, which presents a challenge in itself given the size. Tar / Un-Tar as an example for the Bloom-176B takes about 1 hour to create and another hour to load. We need an alternate mechanism to allow easy access to the model files.
  • Loading the model as singleton – For a multi-worker process, we need to ensure the model gets loaded only once so we don’t run into race conditions and further spend unnecessary resources. In this post, we show a way to load directly from Amazon Simple Storage Service (Amazon S3). However, this only works if we use the default settings of the DJL. Furthermore, any scaling of the endpoints needs to be able to spin up in a few minutes, which calls for reconsidering how the models might be loaded and distributed.
  • Sharding frameworks – These models typically need to be , usually by a tensor parallelism mechanism or by pipeline sharding as the typical sharding techniques, and we have advanced concepts like ZeRO sharding built on top of tensor sharding. For more information about sharding techniques, refer to Model Parallelism. To achieve this, we can have various combinations and use frameworks from NIVIDIA, DeepSpeed, and others. This needs the ability to test BYOC or use 1P containers and iterate over solutions and run benchmarking tests. You might also want to test various hosting options like asynchronous, serverless, and others.
  • Hardware selection – Your choice in hardware is determined by all the aforementioned points and further traffic patterns, use case needs, and model sizes.

In this post, we use DeepSpeed’s optimized kernels and tensor parallelism techniques to host BLOOM-176B and OPT-30B on SageMaker. We also compare results from Accelerate to demonstrate the performance benefits of optimized kernels and tensor parallelism. For more information on DeepSpeed and Accelerate, refer to DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale and Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate.

We use DJLServing as the model serving solution in this example. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about the DJL and DJLServing, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

It’s worth noting that optimized kernels can result in precision changes and a modified computation graph, which could theoretically result in changed model behavior. Although this could occasionally change the inference outcome, we do not expect these differences to materially impact the basic evaluation metrics of a model. Nevertheless, practitioners are advised to confirm the model outputs are as expected when using these kernels.

The following steps demonstrate how to deploy a BLOOM-176B model in SageMaker using DJLServing and a SageMaker large model inference container. The complete example is also available in our GitHub repository.

Using the DJLServing SageMaker DLC image

Leverage the SageMaker SDK to retrieve the DJL Serving SageMaker DLC image corresponding to a specific version. Use the following code after replacing the region with your specific region you are running the notebook in:

inference_image_uri = image_uris.retrieve(

Create our configuration file

First, we create a file called This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. The DJL Inference Image ships with a number of built-in inference handlers for a wide variety of tasks including:

  • text-generation
  • question-answering
  • text-classification
  • token-classification

You can refer to this GitRepo for a list of additional handlers and available NLP Tasks. These handlers can be utilized as is without having to write any custom inference code. We simply need to create a text file with our desired hosting options and package it into a tar.gz artifact. A complete example that illustrates the hosting of a large language model like OPT-30B using the built-in inference handlers of the container can be found in this notebook.

Here’s the configuration file to host OPT-30B on an instance with 4 GPUs:

engine = DeepSpeed

Let’s go through the list of settings that we use in this configuration file.

  • engine: The engine for DJL to use. In this case, it is DeepSpeed.
  • option.entryPoint: The entrypoint python file or module that will be used to host the model. djl_python.deepspeed refers to module from the djl_python repository.
  • option.model_id: Set this to the URI of the Amazon S3 bucket that contains the model artifacts.
  • option.tensor_parallel_degree: Set to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the number of workers per model which will be started up when DJL serving runs. As an example if we have a 8 GPU machine and we are creating 8 partitions then we will have 1 worker per model to serve the requests.

In this post, we demonstrate how to deploy a BLOOM-176B model with custom inference code. We have hosted the model in a public S3 location for ease of use. In that case, you should use the parameter option.s3url. Set this to the URI of the Amazon S3 bucket that contains the model artifacts. The DJL container automatically downloads the model artifacts from the S3 bucket to the hosting instance using the highly optimized s5cmd. The model artifacts are downloaded into /tmp. SageMaker makes the mounted EBS volume specified by VolumeSizeInGB available under /tmp on the container. This same location is also used when mapping the SSD memory available for instances which support SSD.

Here’s the code from the :


DJL Serving general settings has additional configurations and settings that can be used in for DeepSpeed.

Create custom inference code (Optional)

Next, we create our file, which defines the code needed to load and then serve the model. We read the value of tensor_parallel_degree from the properties and use it to set the number of partitions that need to be created. Note that DeepSpeed provides a few built-in partition definitions, including one for BLOOM models. We use it by specifying replace_method and relpace_with_kernel_inject. If you have a customized model and need DeepSpeed to partition effectively, you need to change relpace_with_kernel_inject to false and add injection_policy to make the runtime partition work. For more information, refer to Initializing for Inference. For our example, we used the pre-partitioned BLOOM model on DeepSpeed.

from djl_python import Input, Output
import deepspeed
import torch
import logging
import math
import os
from transformers import AutoConfig,

model = None
tokenizer = None
generator = None

def load_model(properties):
    tensor_parallel = properties["tensor_parallel_degree"]
    model_location = properties["model_id"]"Loading model in {model_location}")

    tokenizer = AutoTokenizer.from_pretrained(model_location)

    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(model_location), torch_dtype=torch.bfloat16
        )"Starting DeepSpeed init with TP={tensor_parallel}")
    model = deepspeed.init_inference(
    model = model.module
    return model, tokenizer

def run_inference(model, tokenizer, data, params):
    generate_kwargs = params
    tokenizer.pad_token = tokenizer.eos_token
    input_tokens = tokenizer.batch_encode_plus(data, return_tensors="pt", padding=True)
    for t in input_tokens:
        if torch.is_tensor(input_tokens[t]):
            input_tokens[t] = input_tokens[t].to(torch.cuda.current_device())
    outputs = model.generate(**input_tokens, **generate_kwargs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def handle(inputs: Input):
    global model, tokenizer
    if not model:
        model, tokenizer = load_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None
    data = inputs.get_as_json()

    input_sentences = data["inputs"]
    params = data["parameters"]

    outputs = run_inference(model, tokenizer, input_sentences, params)
    result = {"outputs": outputs}
    return Output().add_as_json(result)

We have created a directory called code , that contains the, and files. To view the files, you can run the following code from the terminal:

mkdir -p code
cat code/ 
cat code/

The following figure shows the structure of the model.tar.gz.

Lastly, we create the model file and upload it to Amazon S3:

tar cvfz model.tar.gz code
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

Download and store the model from Hugging Face (Optional)

We have provided the steps in this section in case you want to download the model to Amazon S3 and use it from there. The steps are provided in the Jupyter file on GitHub. The following screenshot shows a snapshot of the steps.

Create a SageMaker model

We now create a SageMaker model. We use the Amazon Elastic Container Registry (Amazon ECR) image URI retrieved earlier and the model artifact from the previous step to create the SageMaker model. See the following code:

from sagemaker.utils import name_from_base

model_name = name_from_base(f"bloom-djl-ds")

create_model_response = sm_client.create_model(
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    # Uncomment if providing networking configs
    # VpcConfig=privateVpcConfig
model_arn = create_model_response["ModelArn"]

After you run the preceding cell in the Jupyter file, you see output similar to the following:

    "ModelArn": "arn:aws:sagemaker:us-east-1:<account_id>:model/bloom-djl-ds-<date_time>"

Create a SageMaker endpoint

You can use any instances with multiple GPUs for testing. In this demo, we use a p4d.24xlarge instance. In the following code, note how we set the ModelDataDownloadTimeoutInSeconds, ContainerStartupHealthCheckTimeoutInSeconds, and VolumeSizeInGB parameters to accommodate the large model size. The VolumeSizeInGB parameter is applicable to GPU instances supporting the EBS volume attachment.

endpoint_config_response = sm_client.create_endpoint_config(
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            #"VolumeSizeInGB" : 200,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,

Lastly, we create a SageMaker endpoint:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name

You see it printed out in the following code:

    "EndpointArn": "arn:aws:sagemaker:us-east-1:<aws-account-id>:endpoint/bloom-djl-ds-<date_time>"

Starting the endpoint might take a while. You can try a few more times if you run into the InsufficientInstanceCapacity error, or you can raise a request to AWS to increase the limit in your account.

Performance tuning

If you intend to use this post and accompanying notebook with a different model, you may want to explore some of the tunable parameters that SageMaker, DeepSpeed, and the DJL offer. Iteratively experimenting with these parameters can have a material impact on the latency, throughput, and cost of your hosted large model. To learn more about tuning parameters such as number of workers, degree of tensor parallelism, job queue size, and others, refer to DJL Serving configurations and Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.


We have benchmarked a number of prominent large language models on Amazon Sagemaker using the DJL container and leveraging DeepSpeed. The following charts contain the results.

Customers are responsible for making their own independent assessment of the information in this document. This benchmark: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers, or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers. The end-to-end latency and throughput depends on various factors including but not restricted to model size, underlying protocol used to communicate with the inference server, overhead related to creating new TLS connections, deserialization time of the request/response payload, request queuing and batching features provided by the underlying inference server, request scheduling capabilities provided by the underlying inference server, underlying runtime performance of the inference server, performance of preprocessing and postprocessing libraries before calling the model prediction function, underlying ML framework backend performance, model-specific and hardware-specific optimizations (i.e. quantization, compression, etc.), underlying infrastructure hardware (i.e. compute, storage, and networking), customer inference code, model sharding/partitioning library performance, underlying accelerator’s capabilities and inter-accelerator communication, batch size, model parallel techniques, model architecture, model-specific inference-related hyperparameters (i.e. top-k, temperature, search strategy, etc.), storage performance, and many more. The above benchmarking numbers are only for reference and may not represent the most optimal experiment setup.

These results demonstrate the difference in latency and throughput of different model sizes. If you have strict latency, throughput, or cost limitations, consider using the smallest model possible that will still achieve your functional requirements.

Clean up

As part of best practices it is always recommended to delete idle instances. The below code shows you how to delete the instances.

# - Delete the end point

# - In case the end point failed we still want to delete the model

Optionally delete the model check point from your S3

!aws s3 rm --recursive s3://<your_bucket>/{s3_model_prefix}


In this post, we demonstrated how to use SageMaker large model inference containers to host two large language models, BLOOM-176B and OPT-30B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker ML instance.

For more details about Amazon SageMaker and its large model inference capabilities, refer to Amazon SageMaker now supports deploying large models through configurable volume size and timeout quotas and Real-time inference.

About the authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Rupinder Grewal is a Sr Ai/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on SageMaker. Prior to this role he has worked as Machine Learning Engineer building and hosting models. Outside of work he enjoys playing tennis and biking on mountain trails.

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

Alan Tan is a Senior Product Manager with SageMaker leading efforts on large model inference. He’s passionate about applying Machine Learning to the area of Analytics. Outside of work, he enjoys the outdoors.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focusses on building solutions for large model inference. Prior to AWS he worked in the Amazon Grocery org building new payment features for customers world-wide. Outside of work, he enjoys skiing, the outdoors, and watching sports.

Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or catching up with sports.