Unleash the power of GPUs and pre-trained models: Deploying the Whisper voice-to-text model in Switzerland
Amazon SageMaker is a fully managed machine learning service that offers a comprehensive set of tools and features to simplify and accelerate the process of building, training, and deploying ML models. The seamless integration between Amazon SageMaker and the Hugging Face model hub allows for easy access and deployment of thousands of pre-trained AI models to power applications.
This blog post will describe how to leverage the powerful combination of Amazon SageMaker and the GPU-backed Amazon Elastic Compute Cloud (Amazon EC2) instances available in the AWS Europe (Zurich) Region to run OpenAI Whisper, an automatic speech recognition (ASR) model that can transcribe and translate speech in multiple languages, including Swiss German.
A detailed explanation will show how to spin up a GPU-backed endpoint on Amazon SageMaker to start transcribing audio files with low latency and high performance.
Introduction
GPUs offer significant benefits for machine learning (ML) and generative AI applications, making them an integral part of these technologies. The speed and parallelism that GPUs provide allow for faster inference and output generation from trained ML and generative AI models, and also accelerate model training and the creation of new models. GPUs are generally more performant than CPUs for workloads that are easy to parallelize, such as graphics and video processing or deep learning algorithms.
The AWS Europe (Zurich) Region provides Amazon EC2 G6 instances powered by NVIDIA L4 Tensor Core GPUs. G6 instances feature up to 8 NVIDIA L4 Tensor Core GPUs with 24 GB of memory per GPU and third generation AMD EPYC processors. They also support up to 192 vCPUs, up to 100 Gbps of network bandwidth, and up to 7.52 TB of local NVMe SSD storage.
You can find which instances are available by running the below AWS Command Line Interface command:
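```bash
aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters Name=instance-type,Values=g6.* \
    --region eu-central-2
```

This sketch assumes the AWS CLI is installed and configured; eu-central-2 is the Region code for AWS Europe (Zurich), and you can adjust the instance-type filter to query other instance families.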
In the next section we will explain how to deploy and run the Whisper models in the AWS Europe (Zurich) Region.
Deploying Whisper in Amazon SageMaker
SageMaker is a fully managed service to prepare data, build, train, and deploy machine learning (ML) models, including large language models (LLMs), with fully managed infrastructure, tools, and workflows. You can deploy your own model or rely on pre-trained models from the Hugging Face hub available through SageMaker.
This blog shows a real-time inference example, but it is important to mention that SageMaker provides the flexibility to choose among different endpoint types depending on your specific inference requirements:
- Real-time inference is ideal for workloads where you have real-time, interactive, low latency requirements. It is a suitable option for payload sizes up to 6 MB.
Figure 1: Amazon SageMaker real-time endpoint
- Asynchronous inference queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1 GB) and/or long processing times (up to 1 hour).
Figure 2: Amazon SageMaker asynchronous endpoint
- Batch transform allows you to run predictions on both large and small batch data. There is no need to break down the dataset into multiple chunks or manage real-time endpoints. With a simple API, you can request predictions for a large number of data records and transform the data quickly and easily.
Deployment walkthrough of the Whisper model on a real-time inference endpoint:
Pre-requisites:
- Create an AWS account and enable the AWS Europe (Zurich) Region as described in this blog
- Set up SageMaker as described here
- Create an Amazon SageMaker notebook instance
- Create or download a sample mp3 file on your local drive, name it sample_file_to_transcribe.mp3, and drag and drop it into your Jupyter notebook file section
Follow the implementation steps below in your newly created notebook instance:
Endpoint deployment steps:
- Import all necessary dependencies.
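A minimal set of imports for this walkthrough might look as follows (a sketch; the sagemaker package is the SageMaker Python SDK, which comes pre-installed on notebook instances):

```python
import time  # used later to measure inference latency

import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serializers import DataSerializer
```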
- Get the SageMaker execution role. You can find additional configuration details in this link.
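On a notebook instance, the role can be retrieved directly from the session:

```python
# Retrieve the IAM role attached to the notebook instance.
# When running outside of SageMaker, pass the role ARN explicitly instead.
role = sagemaker.get_execution_role()
```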
- Select the Hugging Face model (whisper-base in this example) and the task (automatic-speech-recognition or translation) to create the model. It is recommended to check for updated versions of the deep learning container images at this location.
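A sketch of the model definition, assuming the openai/whisper-base checkpoint from the Hugging Face hub; the framework versions below are illustrative and should be verified against the currently available container images:

```python
# Hugging Face hub configuration: model checkpoint and task
hub = {
    "HF_MODEL_ID": "openai/whisper-base",
    "HF_TASK": "automatic-speech-recognition",  # or "translation"
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.26",  # illustrative; check for updated versions
    pytorch_version="1.13",
    py_version="py39",
)
```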
- Deploy the model to an EC2 instance. In this example, a g6.2xlarge instance type has been selected.
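Deployment could then look like the following; note that SageMaker instance types carry an ml. prefix, so the g6.2xlarge instance is requested as ml.g6.2xlarge:

```python
# Provision a real-time endpoint backed by a single GPU instance.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.2xlarge",
)
```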
- Alternatively, if you already have a deployed SageMaker endpoint, you can create the predictor with the following code.
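A sketch, where whisper-endpoint is a placeholder for the name of your existing endpoint:

```python
from sagemaker.huggingface import HuggingFacePredictor

# "whisper-endpoint" is a placeholder; use your endpoint's name.
predictor = HuggingFacePredictor(
    endpoint_name="whisper-endpoint",
    sagemaker_session=sagemaker.Session(),
)
```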
Execute inference to transcribe the audio file:
- Set the audio data serializer, run the inference, and display the transcription and measured execution time.
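A sketch, assuming the sample file from the pre-requisites sits next to the notebook and the endpoint returns a JSON object with a text field (the usual output of the automatic-speech-recognition task):

```python
# Serialize raw audio bytes with an audio content type.
predictor.serializer = DataSerializer(content_type="audio/x-audio")

with open("sample_file_to_transcribe.mp3", "rb") as f:
    audio_data = f.read()

start = time.time()
response = predictor.predict(data=audio_data)
elapsed = time.time() - start

print(f"Transcription: {response['text']}")
print(f"Execution time: {elapsed:.2f} seconds")
```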
Clean-up resources:
- Clean up resources to avoid incurring unnecessary costs.
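For the real-time endpoint created above, this amounts to two calls:

```python
# Delete the model and the endpoint to stop incurring charges.
predictor.delete_model()
predictor.delete_endpoint()
```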
Additional considerations include:
- Rightsizing the EC2 instances (CPU/GPU) can help you minimize the costs while maintaining the desired inference performance. Features such as Amazon SageMaker Inference Recommender can help you select the instance type needed for your model.
- SageMaker also supports multi-model endpoints, capable of deploying several models under a single endpoint. This can be an efficient solution, especially when models have similar sizes and latency requirements, or different utilization patterns.
- Auto scaling can be enabled to automatically increase the number of instances serving the models based on incoming traffic; a sketch follows this list.
- Finally, to prepare your workload for production deployment, you may need to consider a few key factors such as storage or any orchestration required. AWS services such as Amazon Simple Storage Service (Amazon S3), AWS Step Functions, and AWS Lambda can help you build and automate the flow of your workload processes, ensuring efficient and reliable execution.
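As an illustration of the auto scaling point above, the following boto3 sketch registers a target-tracking scaling policy on an endpoint's production variant; the endpoint name, variant name, capacities, and target value are placeholders to tune for your workload:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder endpoint and production variant names.
resource_id = "endpoint/whisper-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on the number of invocations per instance.
autoscaling.put_scaling_policy(
    PolicyName="whisper-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```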
Pricing considerations
The latest g6.2xlarge EC2 instances in the AWS Europe (Zurich) Region have an on-demand price, as of today, of $1.31796 per hour. To provide some context, transcribing 70 seconds of audio in the previous example took less than 2 seconds using this instance type.
If you want to transcribe 1 hour of similar audio, it would cost approximately 3.5 cents on a g6.2xlarge instance: at under 2 seconds of processing per 70 seconds of audio, an hour of audio needs roughly 100 seconds of instance time, or about $0.035 at the hourly rate above. This demonstrates the cost-efficiency of leveraging powerful GPU EC2 instances for voice-to-text workloads.
To optimize the costs of your machine learning workloads, you can consider signing up for an AWS Machine Learning Savings Plan. This allows you to save up to 64% on your compute costs in exchange for committing to a consistent level of usage, measured in dollars per hour.
Finally, we recommend using the AWS Pricing Calculator, our free web-based planning tool to create detailed cost estimates for various AWS services, so that you can accurately plan and budget for your cloud infrastructure needs.
Conclusion
The availability of GPU-powered EC2 instances in the AWS Europe (Zurich) Region expands the capabilities of AWS for customers in Switzerland. AI and machine learning models, including some of the latest large language models, can be deployed with even greater performance and flexibility.
One of the key advantages of the Amazon SageMaker service is the ease with which you can operationalize models using the automation and managed capabilities it provides. Whether you need to build a custom computer vision model or deploy an open-source LLM, SageMaker makes it seamless to get your models into production.
The Hugging Face hub provides thousands of pre-trained models ready to be deployed on SageMaker, such as Whisper. By deploying Whisper on SageMaker, you can enjoy enterprise-grade scalability and operationalization while keeping your data within your desired geographic boundaries.
As an alternative to the option described in this blog, customers can use Amazon Transcribe in Regions such as AWS Europe (Frankfurt) (full list here). Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. It is a fully managed service that includes multiple features such as real-time and asynchronous transcription, summarization, and sentiment analysis, to name a few. More than 100 languages are supported.