Host the Whisper Model with Streaming Mode on Amazon EKS and Ray Serve
OpenAI Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It has demonstrated strong ASR performance across many languages, including the ability to transcribe speech in multiple languages and translate it into English. The Whisper model is open-sourced under the Apache 2.0 license, making it accessible for developers to build useful applications such as transcription services, voice assistants, and accessibility tools for individuals with hearing impairments.
For applications such as live broadcast and online meeting transcription, ASR must run in streaming mode. Streaming has different APIs (full-duplex, bidirectional streams), a lower latency requirement (usually a few seconds), and a need to maintain state across audio chunks. Overall, it needs a system architecture that is different from the one used for batch processing.
Because these streaming ASR applications have a tight latency requirement, scaling becomes a challenge. Ray Serve helps solve this scaling problem: it is a scalable and easy-to-use framework for building and deploying machine learning (ML) models as web services. It is part of the Ray ecosystem, an open-source project for distributed computing. Ray Serve is specifically designed to simplify the process of turning ML models into production-ready, scalable services that can handle high throughput and dynamic workloads.
In this post, we explore how to build an ML inference solution based on pure Python and Ray Serve that can run locally on a single Amazon Elastic Compute Cloud (Amazon EC2) instance and expose the streaming ASR service through the WebSocket protocol. We then use the power of KubeRay and Amazon Elastic Kubernetes Service (Amazon EKS) to make that simple Python program scale automatically, transcribe multiple audio streams, and provision GPU instances with minimal code changes.
Solution overview
The following diagram shows the main components of this solution.
- Ray Serve application: Runs a WebSocket-based web service that accepts audio streams and distributes Voice Activity Detection (VAD) and ASR requests to multiple replicas. Ray Serve replicas are autoscaled according to the ongoing request count.
- Amazon EKS: The underlying infrastructure of the ML model serving solution. With the Data on EKS blueprint, the EKS cluster and the essential add-ons and tooling, such as Karpenter, NodePool, and the KubeRay operator, can be deployed from a Terraform module with a single command.
Figure 1. Architecture of Ray on Amazon EKS stack
Here we walk through a few concepts for converting a local Python ML app running on a single EC2 instance into a scalable, distributed Ray Serve app on Kubernetes.
1. Converting a Python app to a Ray Serve app
We forked the open source project VoiceStreamAI, a Python app that accepts an audio stream over a WebSocket connection and calls the transcribe function in the FasterWhisper library to transcribe the audio. For audio streaming, the connection and audio buffer must be kept on the server side, but the compute-intensive transcribe call is better kept stateless so that it can be distributed across multiple replicas.
The framework we chose is Ray Serve, a scalable model serving library for building online inference APIs. With only a few serve.deployment annotations, as shown in the sketch below, we transformed a single Python app into a distributed Ray Serve app.
The original Python app uses two ML models: Voice Activity Detection and Whisper ASR. We use multiple Ray Serve deployments, which makes it easy to divide the application’s steps into independent deployments that scale independently. To learn more about converting a Python app to a Ray Serve app, read the documentation Ray Serve – Getting Started.
Figure 2. Streaming ASR system
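The following is a minimal sketch (not the actual VoiceStreamAI code) of how the three deployments referenced later in this post, PyannoteVAD, FasterWhisperASR, and TranscriptionServer, can be wired together. The method names, buffering logic, and resource settings are illustrative assumptions.

```python
# Minimal sketch of the Ray Serve app structure; method names, buffering logic,
# and resource settings are illustrative assumptions rather than the project's code.
from fastapi import FastAPI, WebSocket
from ray import serve

fastapi_app = FastAPI()


@serve.deployment  # stateless VAD replicas
class PyannoteVAD:
    async def detect(self, chunk: bytes) -> bool:
        ...  # run the VAD model and report whether the chunk contains speech


@serve.deployment(ray_actor_options={"num_gpus": 1})  # stateless ASR replicas on GPU
class FasterWhisperASR:
    async def transcribe(self, audio: bytes) -> str:
        ...  # call FasterWhisper's transcribe() on the buffered audio


@serve.deployment
@serve.ingress(fastapi_app)  # WebSocket ingress; keeps the per-connection audio buffer
class TranscriptionServer:
    def __init__(self, vad, asr):
        self.vad = vad  # handle to the PyannoteVAD deployment
        self.asr = asr  # handle to the FasterWhisperASR deployment

    @fastapi_app.websocket("/")
    async def handle(self, ws: WebSocket):
        await ws.accept()
        buffer = b""
        while True:
            chunk = await ws.receive_bytes()  # audio stays buffered on the server side
            buffer += chunk
            if await self.vad.detect.remote(chunk):  # fan out to a VAD replica
                text = await self.asr.transcribe.remote(buffer)  # then to an ASR replica
                await ws.send_text(text)
                buffer = b""


# Compose the application graph; deployment handles are injected through bind().
entrypoint = TranscriptionServer.bind(PyannoteVAD.bind(), FasterWhisperASR.bind())
```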
You can run the same code on your local machine during development and testing as you would in production. You can also run the code on an EC2 instance with a GPU, using an AWS Deep Learning AMI with Ubuntu. You don’t actually need a Kubernetes cluster to build and test a Ray Serve app.
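As a quick local test, the application graph can be deployed from a Python script with serve.run (a sketch; the sample repository may use the Serve CLI or its own entry point instead):

```python
import time

from ray import serve
from transcription_app import entrypoint  # hypothetical module holding the sketch above

# Deploy the graph locally; Ray starts automatically and Serve listens on
# http://localhost:8000 (the default port) for HTTP and WebSocket traffic.
serve.run(entrypoint)

# Keep the driver process alive so the local Serve instance stays up for testing.
while True:
    time.sleep(60)
```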
2. Going to production with autoscaling and Karpenter on Amazon EKS
After your code is tested locally, you can put your app into production on Kubernetes using the KubeRay RayService custom resource. KubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of Ray applications on Kubernetes. The RayService custom resource automatically handles important production requirements such as health checking, status reporting, failure recovery, and upgrades. You can also define the autoscaling policy for your Ray Serve application without writing code for the scaling logic.
In a Ray Serve app, autoscaling happens at two levels: the application level and the cluster level. The application-level autoscaling policy is specified with target_ongoing_requests. We have three Ray Serve deployments (TranscriptionServer, FasterWhisperASR, and PyannoteVAD) that can scale independently. You can adjust target_ongoing_requests based on your latency objective (the shorter you want your latency to be, the smaller this number should be), as illustrated in the sketch below.
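As an illustration, the application-level policy is set per deployment through the autoscaling_config argument of serve.deployment. The numbers below are placeholder assumptions, not the values used in the sample repository.

```python
from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        # Ray Serve adds or removes replicas to keep roughly this many in-flight
        # requests per replica; lower it for a tighter latency objective.
        "target_ongoing_requests": 2,
        "min_replicas": 1,
        "max_replicas": 8,
    },
)
class FasterWhisperASR:
    async def transcribe(self, audio: bytes) -> str:
        ...
```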
Cluster-level autoscaling can be enabled by specifying spec.rayClusterConfig.enableInTreeAutoscaling in the RayService resource; a new Ray worker pod is then created automatically when provisioning new workers is necessary. If new EC2 instances are required for a new Ray worker, Karpenter provisions the GPU instances automatically. Karpenter is an open-source node provisioning project built for Kubernetes that observes the resource requests of unschedulable pods and makes decisions to launch nodes.
Follow the Production Guide and Ray Serve Autoscaling documentation if you are interested in moving your ML application into production on a Ray cluster running on Kubernetes.
Walkthrough
The code samples and detailed deployment guide can be found on GitHub.
Note that the estimated cost for this solution is around 3,600 USD per month in the Tokyo Region (including one EKS cluster, three managed node group EC2 nodes, and at least three g5.xlarge GPU instances in this example). Your actual cost may vary depending on the Region, the load you ingest, and the instances launched by Karpenter. Consider destroying the setup once you finish testing.
Prerequisites
The following prerequisites are required:
- Common tools such as Terraform, kubectl, and the AWS Command Line Interface (AWS CLI).
- A Hugging Face token with read permissions to run the demo app. Request access to the Voice-Activity-Detection model through this link.
1. Provision the EKS cluster and add-ons
The EKS cluster is deployed through a Terraform module. First, clone the repo and make sure that you have fulfilled the prerequisites mentioned earlier.
The infra directory contains what you need to set up the entire infrastructure. Before setting up, modify the variables in dev.auto.tfvars. Only pyannote_auth_token is mandatory; it is your Hugging Face auth token.
Next, modify variables.tf to choose the AWS Region for your deployment. Then deploy the required infrastructure with Terraform (the exact commands are in the deployment guide on GitHub). The whole process takes around 20 minutes. Once it's done, Terraform prints a success message and its outputs.
Among the Terraform outputs you can find configure_kubectl = "aws eks --region <Your Region> update-kubeconfig --name ray-cluster". Copy this AWS CLI command and run it to set up Kubernetes access for kubectl.
Now you have the cluster set up. Check the node status with the following command:
kubectl get node
You should see three nodes up and running.
Check the Karpenter setup with kubectl get karpenter
Check that the kuberay-operator is running with kubectl get po -n kuberay-operator
2. Deploy the RayService
Now that the infrastructure setup is done, you are ready to deploy the Ray application. Change to the root directory of your repository and run the command to deploy the RayService. The RayService YAML configuration sets up a Ray Serve deployment on the EKS cluster to serve the voice transcription model.
Check that the Ray workers and Ray Serve deployments are ready. When the service is ready, you can see the whisper-streaming-serve-svc Service in the Kubernetes cluster. Getting the Ray cluster running takes about 15 minutes; the cold start time can be reduced by following Reduce container startup time on Amazon EKS with Bottlerocket data volume.
You can access the Ray Dashboard by port-forwarding port 8265. Launch a browser and enter http://localhost:8265. The ASR service is exposed as a WebSocket service; you can port-forward it on port 8000.
To test the app, open VoiceStreamAI_Client on your computer and send a live audio stream through the Web UI in a browser. Select the Connect button in the top right corner and then select Start Streaming; the audio recorded by your mic is sent to the service, and the transcription is displayed in the output box.
Figure 3. Demo App Web Interface
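If you prefer to test without a browser, the following is a hypothetical programmatic client. It assumes the server accepts raw 16 kHz, 16-bit mono PCM chunks over the WebSocket and returns transcription text; the actual VoiceStreamAI protocol may differ (for example, it may require an initial JSON configuration message).

```python
import asyncio
import wave

import websockets  # pip install websockets


async def stream_file(path: str, uri: str = "ws://localhost:8000") -> None:
    async with websockets.connect(uri) as ws:
        with wave.open(path, "rb") as wav:
            chunk_frames = wav.getframerate()  # roughly one second of audio per message
            while True:
                frames = wav.readframes(chunk_frames)
                if not frames:
                    break
                await ws.send(frames)    # stream the chunk to the ASR service
                await asyncio.sleep(1)   # pace the stream like a live microphone
        print(await ws.recv())           # print a transcription message from the server


# sample_16khz_mono.wav is a placeholder file name.
asyncio.run(stream_file("sample_16khz_mono.wav"))
```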
3. Test autoscaling with Locust
Locust is an open source performance and load testing tool. It is primarily used for HTTP, but its developer-friendly approach lets you define tests for other protocols, such as WebSocket, in Python code. We use Locust in this solution for two purposes: 1) simulate end-user audio streaming by sending audio chunks to the server, and 2) run stress tests and verify the autoscaling capability.
The Locust script is written in Python, so you need a Python environment with Locust installed.
For this demonstration, we temporarily redirect the Whisper streaming service endpoint to a local port. In a production environment, you should establish a more robust setup, such as a load balancer or ingress controller, to manage production-level traffic securely. The target endpoint URL for the Whisper streaming server is ws://localhost:8000; we use it as the host when running the Locust script.
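Locust has no built-in WebSocket client, so the script implements a custom User class and reports results through Locust's request event. The sketch below shows the general shape; the payload (one second of silent 16 kHz, 16-bit mono PCM per task) is an assumption and not necessarily what the sample script sends.

```python
import time

from locust import User, constant, task
from websocket import create_connection  # from the websocket-client package


class AudioStreamUser(User):
    host = "ws://localhost:8000"
    wait_time = constant(1)  # each simulated user sends roughly one chunk per second

    def on_start(self):
        self.ws = create_connection(self.host)  # one WebSocket connection per user

    def on_stop(self):
        self.ws.close()

    @task
    def stream_chunk(self):
        start = time.perf_counter()
        exc = None
        try:
            self.ws.send_binary(b"\x00" * 32000)  # ~1 s of silent 16-bit, 16 kHz PCM
        except Exception as e:  # report failures to Locust instead of crashing the user
            exc = e
        self.environment.events.request.fire(
            request_type="WS",
            name="audio_chunk",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=0,
            exception=exc,
        )
```

Save the script as a Locust file and point Locust at it; each simulated user then opens its own WebSocket connection, which is what drives the autoscaling behavior shown in the following steps.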
Run Locust with its web interface enabled, and access the interface at http://0.0.0.0:8089/ in your browser. Select START SWARM to run.
Figure 4. Locust start page
Select Edit in the Locust UI to simulate more than a single user accessing the target server. In this case, we try 20 users.
Figure 5. Edit users settings on Locust UI
Then you gradually see the number of WebSocket connections increase by one per second.
Figure 6. Statistics page on Locust UI
We see the autoscaling is triggered, and several new pods are starting up.
Figure 7. Autoscaling status on Kubernetes cluster
After a while, you can see that the Ray worker nodes have scaled out and reached the ALIVE state.
Figure 8. Autoscaling status on Ray dashboard
Locust also provides prebuilt real-time charts, in which you can see the request rate, the response time, and how the simulated users are ramping up.
Figure 9. Locust real-time dashboard for load testing status
Cleaning up
To destroy and clean up the infrastructure created in this post, run the provided ./cleanup.sh script. It deletes the RayService and runs terraform destroy to remove the infrastructure resources that were created.
Conclusion
In this post, we described how to build a scalable, distributed ML inference solution that uses the Whisper model for streaming audio transcription, deployed on Amazon EKS with Ray Serve.
During the development phase, you can convert a local Python ML application into a scalable, distributed app using Ray Serve. Your data science team can implement the code with Ray Serve and test it locally or on a single machine. To go to production, deploy the solution on Amazon EKS with autoscaling capabilities provided by Ray Serve and Karpenter: Ray Serve scales replicas based on the number of ongoing requests, and Karpenter provisions GPU worker nodes automatically. Finally, we demonstrated the autoscaling behavior by running load tests with Locust.
Feel free to check out the sample code for this project on GitHub and share your comments with us.