Host the Whisper Model with Streaming Mode on Amazon EKS and Ray Serve
OpenAI Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It has demonstrated strong ASR performance across many languages, including the ability to transcribe speech in multiple languages and translate it into English. The Whisper model is open-sourced under the Apache 2.0 license, making it accessible for developers to build useful applications such as transcription services, voice assistants, and accessibility tools for individuals with hearing impairments.
For applications such as live broadcast and online meeting transcription, ASR must run in streaming mode. Streaming has different APIs (full-duplex, bidirectional streams), a lower latency requirement (usually a few seconds), and a need to maintain state across audio chunks. Overall, it needs a system architecture that is different from the one used for batch processing.
Because these streaming ASR applications have a tight latency requirement, scaling becomes a challenge. Ray Serve helps solve this scaling problem: it is a scalable and easy-to-use framework for building and deploying machine learning (ML) models as web services. It is part of the Ray ecosystem, an open-source project for distributed computing. Ray Serve is specifically designed to simplify the process of turning ML models into production-ready, scalable services that can handle high throughput and dynamic workloads.
In this post, we explore how to build an ML inference solution based on pure Python and Ray Serve that can run locally on a single Amazon Elastic Compute Cloud (Amazon EC2) instance and expose the streaming ASR service through the WebSocket protocol. We then use the power of KubeRay and Amazon Elastic Kubernetes Service (Amazon EKS) to make that simple Python program scale automatically, transcribe multiple audio streams, and provision GPU instances with minimal code changes.
Solution overview
The following diagram shows the main components of this solution.
- Ray Serve application: Runs a WebSocket-based web service that accepts audio streams and distributes Voice Activity Detection (VAD) and ASR requests to multiple replicas. Ray Serve replicas are autoscaled according to the ongoing request count.
- Amazon EKS: The underlying infrastructure of the ML model serving solution. With the Data on EKS blueprint, the EKS cluster and the essential add-ons and tooling, such as Karpenter, NodePool, and the KubeRay operator, can be deployed from a Terraform module with a single command.
Figure 1. Architecture of Ray on Amazon EKS stack
Here we walk through a few concepts for converting a local Python ML app running on a single EC2 instance into a scalable, distributed Ray Serve app on Kubernetes.
1. Converting a Python app to a Ray Serve app
We forked the open source project VoiceStreamAI, a Python app that accepts an audio stream over a WebSocket connection and calls the transcribe function in the FasterWhisper library to transcribe the audio. For audio streaming, the connection and audio buffer must be kept on the server side, but the compute-intensive transcribe call is better kept stateless so that it can be distributed across multiple replicas.
The framework we chose is Ray Serve, a scalable model serving library for building online inference APIs. With only a few serve.deployment annotations, as shown in the sketch below, we transformed a single Python app into a distributed Ray Serve app.
The original Python app uses two ML models: Voice Activity Detection and Whisper ASR. We use multiple Ray Serve deployments, which makes it easy to divide the application’s steps into independent deployments that scale independently. To learn more about converting a Python app to a Ray Serve app, read the documentation Ray Serve – Getting Started.
Figure 2. Streaming ASR system
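The following is a minimal sketch (not the actual VoiceStreamAI code) of how the three deployments referenced later in this post, PyannoteVAD, FasterWhisperASR, and TranscriptionServer, can be wired together. The method names, buffering logic, and resource settings are illustrative assumptions.

```python
# Minimal sketch of the Ray Serve app structure; method names, buffering logic,
# and resource settings are illustrative assumptions rather than the project's code.
from fastapi import FastAPI, WebSocket
from ray import serve

fastapi_app = FastAPI()


@serve.deployment  # stateless VAD replicas
class PyannoteVAD:
    async def detect(self, chunk: bytes) -> bool:
        ...  # run the VAD model and report whether the chunk contains speech


@serve.deployment(ray_actor_options={"num_gpus": 1})  # stateless ASR replicas on GPU
class FasterWhisperASR:
    async def transcribe(self, audio: bytes) -> str:
        ...  # call FasterWhisper's transcribe() on the buffered audio


@serve.deployment
@serve.ingress(fastapi_app)  # WebSocket ingress; keeps the per-connection audio buffer
class TranscriptionServer:
    def __init__(self, vad, asr):
        self.vad = vad  # handle to the PyannoteVAD deployment
        self.asr = asr  # handle to the FasterWhisperASR deployment

    @fastapi_app.websocket("/")
    async def handle(self, ws: WebSocket):
        await ws.accept()
        buffer = b""
        while True:
            chunk = await ws.receive_bytes()  # audio stays buffered on the server side
            buffer += chunk
            if await self.vad.detect.remote(chunk):  # fan out to a VAD replica
                text = await self.asr.transcribe.remote(buffer)  # then to an ASR replica
                await ws.send_text(text)
                buffer = b""


# Compose the application graph; deployment handles are injected through bind().
entrypoint = TranscriptionServer.bind(PyannoteVAD.bind(), FasterWhisperASR.bind())
```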
You can run the same code on your local machine during development and testing as you would in production. You can also run the code on an EC2 instance with a GPU, using an AWS Deep Learning AMI with Ubuntu. You don’t actually need a Kubernetes cluster to build and test a Ray Serve app.
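As a quick local test, the application graph can be deployed from a Python script with serve.run (a sketch; the sample repository may use the Serve CLI or its own entry point instead):

```python
import time

from ray import serve
from transcription_app import entrypoint  # hypothetical module holding the sketch above

# Deploy the graph locally; Ray starts automatically and Serve listens on
# http://localhost:8000 (the default port) for HTTP and WebSocket traffic.
serve.run(entrypoint)

# Keep the driver process alive so the local Serve instance stays up for testing.
while True:
    time.sleep(60)
```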
2. Going to production with autoscaling and Karpenter on Amazon EKS
After your code is tested locally, you can put your app into production on Kubernetes using the KubeRay RayService custom resource. KubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of Ray applications on Kubernetes. The RayService custom resource automatically handles important production requirements such as health checking, status reporting, failure recovery, and upgrades. You can also define the autoscaling policy for your Ray Serve application without writing code for the scaling logic.
In a Ray Serve app, autoscaling happens at two levels: the application level and the cluster level. The application-level autoscaling policy is specified with target_ongoing_requests. We have three Ray Serve deployments (TranscriptionServer, FasterWhisperASR, and PyannoteVAD) that can scale independently. You can adjust target_ongoing_requests based on your latency objective (the shorter you want your latency to be, the smaller this number should be), as illustrated in the sketch below.
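As an illustration, the application-level policy is set per deployment through the autoscaling_config argument of serve.deployment. The numbers below are placeholder assumptions, not the values used in the sample repository.

```python
from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        # Ray Serve adds or removes replicas to keep roughly this many in-flight
        # requests per replica; lower it for a tighter latency objective.
        "target_ongoing_requests": 2,
        "min_replicas": 1,
        "max_replicas": 8,
    },
)
class FasterWhisperASR:
    async def transcribe(self, audio: bytes) -> str:
        ...
```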
Cluster-level autoscaling can be enabled by specifying spec.rayClusterConfig.enableInTreeAutoscaling in the RayService resource; a new Ray worker pod is then created automatically when provisioning new workers is necessary. If new EC2 instances are required for a new Ray worker, Karpenter provisions the GPU instances automatically. Karpenter is an open-source node provisioning project built for Kubernetes that observes the resource requests of unschedulable pods and makes decisions to launch nodes.
Follow the Production Guide and Ray Serve Autoscaling documentation if you are interested in moving your ML application into production on a Ray cluster running on Kubernetes.
Walkthrough
The code samples and detailed deployment guide can be found on GitHub.
Note that the estimated cost for this solution is around 3,600 USD per month in the Tokyo Region (including one EKS cluster, three managed node group EC2 nodes, and at least three g5.xlarge GPU instances in this example). Your actual cost may vary depending on the Region, the load you ingest, and the instances launched by Karpenter. Consider destroying the setup once you finish testing.
Prerequisites
The following prerequisites are required:
- Common tools such as Terraform, kubectl, and the AWS Command Line Interface (AWS CLI).
- A Hugging Face token with read permissions to run the demo app. Request access to the Voice-Activity-Detection model through this link.
1. Provision the EKS cluster and add-ons
The EKS cluster is deployed through a Terraform module. First, clone the repo and make sure that you have fulfilled the prerequisites mentioned earlier.
The infra directory contains what you need to set up the entire infrastructure. Before setting up, modify the variables in dev.auto.tfvars. Only pyannote_auth_token is mandatory; it is your Hugging Face auth token.
Next, modify variables.tf to choose the AWS Region for your deployment. Then deploy the required infrastructure with Terraform (the exact commands are in the deployment guide on GitHub). The whole process takes around 20 minutes. Once it's done, Terraform prints a success message and its outputs.
Among the Terraform outputs you can find configure_kubectl = "aws eks --region <Your Region> update-kubeconfig --name ray-cluster". Copy this AWS CLI command and run it to set up Kubernetes access for kubectl.
Now you have the cluster set up. Check the node status with the following command:
kubectl get node
You should see three nodes up and running.
Check the Karpenter setup with kubectl get karpenter
Check that the kuberay-operator is running with kubectl get po -n kuberay-operator
2. Deploy the RayService
Now that the infrastructure setup is done, you are ready to deploy the Ray application. Change to the root directory of your repository and run the command to deploy the RayService. The RayService YAML configuration sets up a Ray Serve deployment on the EKS cluster to serve the voice transcription model.
Check that the Ray workers and Ray Serve deployments are ready. When the service is ready, you can see the whisper-streaming-serve-svc Service in the Kubernetes cluster. Getting the Ray cluster running takes about 15 minutes; the cold start time can be reduced by following Reduce container startup time on Amazon EKS with Bottlerocket data volume.
You can access the Ray Dashboard by port-forwarding port 8265. Launch a browser and enter http://localhost:8265. The ASR service is exposed as a WebSocket service; you can port-forward it on port 8000.
To test the app, open VoiceStreamAI_Client on your computer and send a live audio stream through the Web UI in a browser. Select the Connect button in the top right corner and then select Start Streaming; the audio recorded by your mic is sent to the service, and the transcription is displayed in the output box.
Figure 3. Demo App Web Interface
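If you prefer to test without a browser, the following is a hypothetical programmatic client. It assumes the server accepts raw 16 kHz, 16-bit mono PCM chunks over the WebSocket and returns transcription text; the actual VoiceStreamAI protocol may differ (for example, it may require an initial JSON configuration message).

```python
import asyncio
import wave

import websockets  # pip install websockets


async def stream_file(path: str, uri: str = "ws://localhost:8000") -> None:
    async with websockets.connect(uri) as ws:
        with wave.open(path, "rb") as wav:
            chunk_frames = wav.getframerate()  # roughly one second of audio per message
            while True:
                frames = wav.readframes(chunk_frames)
                if not frames:
                    break
                await ws.send(frames)    # stream the chunk to the ASR service
                await asyncio.sleep(1)   # pace the stream like a live microphone
        print(await ws.recv())           # print a transcription message from the server


# sample_16khz_mono.wav is a placeholder file name.
asyncio.run(stream_file("sample_16khz_mono.wav"))
```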
3. Test autoscaling with Locust
Locust is an open source performance and load testing tool. It is primarily used for HTTP, but its developer-friendly approach lets you define tests for other protocols, such as WebSocket, in Python code. We use Locust in this solution for two purposes: 1) simulate end-user audio streaming by sending audio chunks to the server, and 2) run stress tests and verify the autoscaling capability.
The Locust script is written in Python, so you need a Python environment with Locust installed.
For this demonstration, we temporarily redirect the Whisper streaming service endpoint to a local port. In a production environment, you should establish a more robust setup, such as a load balancer or ingress controller, to manage production-level traffic securely. The target endpoint URL for the Whisper streaming server is ws://localhost:8000; we use it as the host when running the Locust script.
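Locust has no built-in WebSocket client, so the script implements a custom User class and reports results through Locust's request event. The sketch below shows the general shape; the payload (one second of silent 16 kHz, 16-bit mono PCM per task) is an assumption and not necessarily what the sample script sends.

```python
import time

from locust import User, constant, task
from websocket import create_connection  # from the websocket-client package


class AudioStreamUser(User):
    host = "ws://localhost:8000"
    wait_time = constant(1)  # each simulated user sends roughly one chunk per second

    def on_start(self):
        self.ws = create_connection(self.host)  # one WebSocket connection per user

    def on_stop(self):
        self.ws.close()

    @task
    def stream_chunk(self):
        start = time.perf_counter()
        exc = None
        try:
            self.ws.send_binary(b"\x00" * 32000)  # ~1 s of silent 16-bit, 16 kHz PCM
        except Exception as e:  # report failures to Locust instead of crashing the user
            exc = e
        self.environment.events.request.fire(
            request_type="WS",
            name="audio_chunk",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=0,
            exception=exc,
        )
```

Save the script as a Locust file and point Locust at it; each simulated user then opens its own WebSocket connection, which is what drives the autoscaling behavior shown in the following steps.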
Run Locust with its web interface enabled, and access the interface at http://0.0.0.0:8089/ in your browser. Select START SWARM to run.
Figure 4. Locust start page
Select Edit in the Locust UI to simulate more than a single user accessing the target server. In this case, we try 20 users.
Figure 5. Edit users settings on Locust UI
Then you gradually see the number of WebSocket connections increase by one per second.
Figure 6. Statistics page on Locust UI
We see the autoscaling is triggered, and several new pods are starting up.
Figure 7. Autoscaling status on Kubernetes cluster
After a while, you can see that the Ray worker nodes have scaled out and reached the ALIVE state.
Figure 8. Autoscaling status on Ray dashboard
Locust also provides prebuilt real-time charts, in which you can see the request rate, the response time, and how the simulated users are ramping up.
Figure 9. Locust real-time dashboard for load testing status
Cleaning up
To destroy and clean up the infrastructure created in this post, run the provided ./cleanup.sh script. It deletes the RayService and runs terraform destroy to remove the infrastructure resources that were created.
Conclusion
In this post, we described how to build a scalable, distributed ML inference solution that uses the Whisper model for streaming audio transcription, deployed on Amazon EKS with Ray Serve.
During the development phase, you can convert a local Python ML application into a scalable, distributed app using Ray Serve. Your data science team can implement the code with Ray Serve and test it locally or on a single machine. To go to production, deploy the solution on Amazon EKS with autoscaling capabilities provided by Ray Serve and Karpenter: Ray Serve scales replicas based on the number of ongoing requests, and Karpenter provisions GPU worker nodes automatically. Finally, we demonstrated the autoscaling behavior by running load tests with Locust.
Feel free to check out the sample code for this project on GitHub and share your comments with us.