AWS Open Source Blog

Deploy Large Language Models Easily with the New ezsmdeploy Python SDK

AWS created the ezsmdeploy open source Python package to make it easy to deploy machine learning models, with options such as passing one or more model files, automatic instance selection, and autoscaling.

The new ezsmdeploy Python SDK from AWS makes it much simpler to deploy large language models (LLMs) from the Hugging Face Hub and SageMaker JumpStart as production-ready APIs on Amazon SageMaker. This blog post covers the new features of the ezsmdeploy 2.0 SDK and provides code examples that demonstrate how to launch and interact with popular foundation models (FMs).

Version 2.0 of the SDK has several new capabilities to further simplify deploying foundation models. With just a few lines of code, users can now deploy popular large language models such as Llama 2, Falcon, and Stable Diffusion. The SDK automatically selects instance types, configures autoscaling, and handles other details required to launch production-ready deployments.

Overall, the updates in ezsmdeploy 2.0 make it easier for developers to take large language models from the Hugging Face Hub and SageMaker JumpStart and launch them as production services. By handling instance selection, autoscaling, and other deployment details automatically, ezsmdeploy reduces the code needed to put state-of-the-art models into production from hundreds of lines to just a few.

New features of ezsmdeploy 2.0

ezsmdeploy is a Python SDK that lets you go from local model files or S3 tarballs to fully managed, autoscaling SageMaker endpoints. It handles everything from Docker builds to load testing so you can focus on your machine learning. You can test out multi-model endpoints by simply passing in a list of models, or low-cost serverless inference by passing in serverless=True. You can even capture model data and monitor drift automatically with a single parameter, all without becoming a DevOps expert or changing how you do model prediction.
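For example, here is a minimal sketch of those two options. The model file names are placeholders, and the script and requirements parameters follow the patterns shown in the ezsmdeploy README:

import ezsmdeploy

# Multi-model endpoint: pass a list of model files along with a model script
ez_multi = ezsmdeploy.Deploy(model=["model1.pth", "model2.pth"],
                             script="modelscript_pytorch.py",
                             requirements=["numpy", "torch"])

# Serverless inference: enable a single flag
ez_serverless = ezsmdeploy.Deploy(model="model1.pth",
                                  script="modelscript_pytorch.py",
                                  requirements=["numpy", "torch"],
                                  serverless=True)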

Version 2.0 of ezsmdeploy adds support for SageMaker JumpStart foundation models. SageMaker JumpStart provides pre-built foundation models that can be used to bootstrap machine learning projects, and this release lets users deploy those pre-built models with one line of code.

The project has also added support for models from the Hugging Face Hub, a repository of machine learning models. Users can now deploy Hugging Face Hub models to SageMaker just as easily.

Additionally, OpenChatKit support has been included for conversational AI models. OpenChatKit is a toolkit for building chat models, and users can now leverage compatible OpenChatKit models for conversational applications. Testing has been performed on a variety of models and configurations.

Code examples

The following code can be executed in Amazon SageMaker notebooks. For more information on the machine learning development environments that SageMaker offers, see the Amazon SageMaker documentation.

To install the ezsmdeploy package on a SageMaker notebook instance, or any notebook environment with permissions to access AWS resources, run the following commands:

# If a previous version of ezsmdeploy is installed, uninstall it first
%pip uninstall -y ezsmdeploy --quiet
%pip install -U ezsmdeploy
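To verify the installation, you can check the installed package version:

%pip show ezsmdeploy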

You can now deploy state-of-the-art models like Llama 2, Falcon, and BLOOM directly from Hugging Face or JumpStart to SageMaker, without having to build custom containers or write complex deployment code.

For example, to deploy the 40B parameter Falcon instruct model from Hugging Face, here is the Python code:

from ezsmdeploy import Deploy

ez_falcon = Deploy(model="tiiuae/falcon-40b-instruct",
                   foundation_model=True,
                   huggingface_model=True)

That’s it! ezsmdeploy will use an appropriate service-provided Docker container (or build one), deploy the model on Amazon SageMaker, and provide an endpoint to query the model. This is considerably simpler than the standard way of deploying the same model (in this case, Falcon 40B Instruct).
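Once the endpoint is up, you can send it a quick test request. The input follows the same Hugging Face text-generation format used in the examples later in this post (the prompt here is illustrative):

# Send a test prompt to the newly created endpoint
response = ez_falcon.predictor.predict({"inputs": "Write a haiku about the cloud."})
print(response)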

Here are some more examples that demonstrate deploying large models using ezsmdeploy.

1. JumpStart LLM deployment

After importing ezsmdeploy, we can call the Deploy method, passing in the model name, the foundation_model flag, and an instance type. This deploys the model to an Amazon SageMaker hosted endpoint.

import ezsmdeploy

ez_flant5 = ezsmdeploy.Deploy(model="huggingface-text2text-flan-t5-xxl-fp16",
                              foundation_model=True,
                              instance_type='ml.g5.12xlarge')

To test the deployed model, a payload containing the prompt and inference hyperparameters is passed to the model. Here, the query asks for the steps to make a pizza.

payload = {
    "text_inputs": "Steps to make a pizza\n",
    "max_length": 100,
    "max_time": 10,
    "top_k": 50,
    "top_p": 0.95,
    "do_sample": True}

response = ez_flant5.predictor.predict(payload)

The model returns a response similar to the following:

These are the steps to make a pizza

  • Make dough and let rise
  • Preheat oven to 500°F
  • Roll out dough thin
  • Top with sauce, cheese and toppings of choice
  • Bake on a hot stone or baking sheet for 10-15 minutes until crust is crisp and cheese is melted and brown
  • Allow to cool slightly
  • Slice into wedges
  • Serve warm and enjoy!

It is a best practice not to leave unused resources running on AWS. To delete the endpoint, we just call the delete_endpoint API.

ez_flant5.predictor.delete_endpoint()

2. Hugging Face model deployment from the Hub

In the following example, we show how to deploy the state-of-the-art pre-trained Falcon 7B instruct model from Hugging Face for text2text generation. You can use a Falcon 7B model directly for many natural language processing (NLP) tasks without fine-tuning it. To deploy the Falcon 7B model from Hugging Face, first import ezsmdeploy and then call the Deploy method with the model name, the huggingface_model flag, the foundation_model flag, and an instance type, as follows:

import ezsmdeploy

ez_falcon = ezsmdeploy.Deploy(model="tiiuae/falcon-7b-instruct",
                              huggingface_model=True,
                              foundation_model=True,
                              instance_type='ml.g5.16xlarge')

The input is sent to the model using the predictor's predict method:

response = ez_falcon.predictor.predict({"inputs": "..."})

In this example, the input is “Paris is the capital of” and the Falcon model completes the sentence, returning the full text “Paris is the capital of France.”

response = ez_falcon.predictor.predict({"inputs": "Paris is the capital of "})
response

[{'generated_text': 'Paris is the capital of France.'}]
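Since the endpoint returns a list of generations, the completed text can be extracted directly:

completed_text = response[0]["generated_text"]
print(completed_text)  # Paris is the capital of France.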

3. Hugging Face model deployment on serverless inference

In this example, we show how to deploy a Hugging Face model to SageMaker serverless inference simply by enabling the serverless flag and setting the memory size.

ez_tinybert = ezsmdeploy.Deploy(model="Intel/dynamic_tinybert",
                                huggingface_model=True,
                                huggingface_model_task='question-answering',
                                serverless=True,
                                serverless_memory=6144)

payload = {"inputs": {
    "question": "Who discovered silk?",
    "context": "Legend has it that the process for making silk cloth was first invented by the wife of the Yellow Emperor, Leizu, around the year 2696 BC. The idea for silk first came to Leizu while she was having tea in the imperial gardens. The production of silk originates in China in the Neolithic (Yangshao culture, 4th millennium BCE). Silk remained confined to China until the Silk Road opened at some point during the later half of the first millennium BCE."
}}

response = ez_tinybert.predictor.predict(payload)
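For an extractive question-answering model like this one, the response typically contains the answer span and a confidence score (the values shown in the comment are illustrative):

# Example response shape for a Hugging Face question-answering model:
# {'score': 0.98, 'start': 101, 'end': 106, 'answer': 'Leizu'}
print(response)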

4. Chat interface with LLM chat models

Chat interfaces powered by large language models like RedPajama are becoming increasingly sophisticated. With the ezsmdeploy library, developers can easily integrate LLM chat models into their applications, creating engaging and interactive experiences for users. These models can be deployed with just one line of code, as shown in the code sample here:

ez_redpajama = ezsmdeploy.Deploy(model="togethercomputer/RedPajama-INCITE-7B-Chat",
                                 huggingface_model=True,
                                 foundation_model=True,
                                 instance_type='ml.g5.8xlarge')

ez_redpajama.chat()

[Screenshot: the interactive OpenChatKit chat interface launched by ezsmdeploy]
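When you are done chatting, delete the endpoint as in the earlier examples:

ez_redpajama.predictor.delete_endpoint()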

5. Image generation with Stable Diffusion 2-1

Stable Diffusion is a text-to-image model that enables you to create photorealistic images from just a text prompt. A diffusion model trains by learning to remove noise that was added to a real image, and this de-noising process generates a realistic image. These models can also generate images from text alone by conditioning the generation process on the text. For instance, Stable Diffusion is a latent diffusion model in which the model learns to recognize shapes in a pure-noise image and gradually brings those shapes into focus when they match the words in the input text.

In this example, you will learn how to use ezsmdeploy to deploy a Stable Diffusion model (“model-txt2img-stabilityai-stable-diffusion-v2-1-base”) from JumpStart. Then you will use the predictor to generate images from text prompts.

model_id = 'model-txt2img-stabilityai-stable-diffusion-v2-1-base'

ez_stable = ezsmdeploy.Deploy(model=model_id,
                              foundation_model=True)

Now that the model is deployed, the prompt and the image width and height are assembled into a payload and sent to the model.

import json

prompt_text = "Empire state building, realistic style"
content_type = "application/x-text"
accept = "application/json"
image_w = 512
image_h = 512
payload = {"prompt": prompt_text, "width": image_w, "height": image_h}
encoded_payload = json.dumps(payload).encode("utf-8")
response = ez_stable.predictor.predict(
    encoded_payload,
    {
        "ContentType": content_type,
        "Accept": accept,
    },
)

[Generated image: the Empire State Building in a realistic style]

We can parse the generated image and the prompt from the response and display the image with a helper function.

img, prompt = response["generated_image"], response["prompt"]
display_img_and_prompt(img, prompt)
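The display_img_and_prompt helper is not part of ezsmdeploy. Here is a minimal sketch, assuming the endpoint returns the image as a nested list of RGB pixel values (as the SageMaker JumpStart Stable Diffusion examples do):

import numpy as np
import matplotlib.pyplot as plt

def display_img_and_prompt(img, prompt):
    # Convert the nested pixel list to an array and render it with the
    # prompt as the title
    plt.figure(figsize=(8, 8))
    plt.imshow(np.array(img, dtype=np.uint8))
    plt.title(prompt)
    plt.axis("off")
    plt.show()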

Clean up

Once we are done experimenting, we should delete the endpoint. In the last example, we simply call the delete_endpoint() method.

ez_stable.predictor.delete_endpoint()

End-to-end examples

You can find more end-to-end examples in the GitHub repo, covering JumpStart, Hugging Face, PyTorch, and more.

Conclusion

The new ezsmdeploy Python SDK introduced in this post enables deploying large machine learning models with a single API call, significantly reducing the complexity of putting large models into production. By handling deployment intricacies behind the scenes, ezsmdeploy lets data scientists go from model to production API in minutes, so you can focus on your model while the SDK handles the infrastructure.

Shreyas Subramanian

Shreyas Subramanian is a Principal Data Scientist at AWS who helps customers solve their business challenges using machine learning on the AWS platform. Shreyas has a background in large-scale optimization and machine learning, and in the use of machine learning and reinforcement learning to accelerate optimization tasks.

Ray Khorsandi

Ray Khorsandi is an AI/ML specialist at AWS, supporting strategic customers with AI/ML best practices. With an M.Sc. and Ph.D. in Electrical Engineering and Computer Science, he helps enterprises build secure, scalable AI/ML and big data solutions to optimize their cloud adoption. His passions include computer vision, NLP, generative AI, and MLOps. Ray enjoys playing soccer and snowboarding in the mountains.