How Quora modernized MLOps on Amazon EKS to improve customer experience with scalable ML applications

This blog post was co-written by Lida Li of Quora.

Introduction

Quora is a leading Q&A platform with a mission to share and grow the world’s knowledge, serving hundreds of millions of users worldwide every month. Quora uses machine learning (ML) to generate a custom feed of questions, answers, and content recommendations based on each user’s activity, interests, and preferences. ML also drives targeted advertising on the platform, where advertisers use Quora’s vast user data and sophisticated targeting capabilities to deliver highly personalized ads to their audiences. Moreover, ML plays a pivotal role in maintaining high-quality content for users by effectively filtering spam and moderating content.

Quora successfully modernized its machine learning operations (MLOps) platform from a self-managed, custom Amazon Elastic Compute Cloud (Amazon EC2) platform to Amazon Elastic Kubernetes Service (Amazon EKS). This move enabled a small team of engineers to manage, operate, and enhance Quora’s ML platform with highly efficient MLOps. This platform is currently responsible for training and serving over one hundred models used across the product. This post delves into the design decisions, benefits of running MLOps on Amazon EKS, and how Quora reduced new model deployment time from days to hours.

Solution overview

Previous MLOps platform

Before embracing containers and Amazon EKS, Quora’s MLOps platform was running on Amazon EC2 instances. The platform team had developed frameworks that provided high-level abstractions for launching training jobs or serving models. However, the underlying infrastructure wasn’t scalable and resulted in operational challenges, which slowed down the iteration speed. Some common challenges were:

  • Compute resource management and cost overrun: Allocating dedicated Amazon EC2 instances for each training job led to increased operational costs and wasted compute resources. Sharing instances between multiple jobs caused problems with resource and environment isolation, requiring manual adjustments that were time-consuming and prone to errors when jobs exceeded the resources of their assigned Amazon EC2 instances.
  • Training environment management: Quora uses the mono-repo approach to store all code in one Git repository. Launching a new Amazon EC2 instance was required for each new training job, but creating a new Amazon Machine Image (AMI) for each commit in the mono-repo was impractical. Instead, all packages were installed at runtime, which was both time-consuming, taking 20–30 minutes per job, and unreliable.
  • System reliability and operational overhead: Platform and machine learning engineers spent a substantial amount of time dealing with reliability issues caused by the lack of scalability in the existing solution, which could stem from environment failures, misconfigurations, and faulty code. This imposed a heavy operational overhead on small ML engineering teams, leaving them less time for innovation.

Walkthrough

Migration to Amazon EKS

In the past, addressing the challenges faced by Quora’s MLOps platform required significant engineering investment and manual effort. The emergence of containers and Kubernetes allowed small engineering teams to create scalable and high-performing MLOps platforms. Kubernetes offers a reliable and scalable solution for deploying containerized applications in production. Quora evaluated different cloud providers to build a cloud-native MLOps platform on Kubernetes and decided to use Amazon EKS, a managed Kubernetes service, for its new MLOps platform. Amazon EKS gave Quora straightforward deployment, upgrades, and management of Kubernetes clusters, along with seamless integration with other AWS services.

The process of migrating model training and serving to Kubernetes took about six months and consisted of the following steps:

  1. Prepare the Amazon EKS cluster: To lower operational overhead, we used Terraform to manage the Amazon EKS environment as infrastructure as code. Terraform files, stored in a Git repository, contain the cluster configurations such as AWS Identity and Access Management (AWS IAM) roles, security groups, subnets, and managed node group settings.
  2. Provision foundational add-ons on the newly created Amazon EKS cluster: This includes the Cluster Autoscaler, the Amazon EBS CSI driver, and the NVIDIA device plugin. We used Istio as our service mesh for networking, Elasticsearch and Beats for log collection, and VictoriaMetrics and Grafana for model metrics monitoring.
  3. Build container images for model training and serving environments: We created a continuous integration and continuous deployment (CI/CD) pipeline to update the images as needed and leveraged Argo Rollouts for canary deployments.
  4. Begin model migration: With the prerequisites in place, we started migrating model training by progressively updating our code to launch new training jobs on Amazon EKS. We began with a small percentage of experimental jobs, monitored them to ensure stability, resolved the issues we encountered, and gradually ramped up to more jobs. For model serving, we initially shadowed traffic to the model running on Amazon EKS and validated the results. Then, we gradually redirected a small percentage of traffic to the Amazon EKS-based model, eventually ramping up to 100% (a minimal traffic-shifting sketch follows this list). This approach provided the flexibility to select the models we wanted to migrate and allowed us to pause or revert the migration if necessary. After that, we migrated ML applications such as the feature extraction service, the Ads candidate generation services, and the orchestration service for recommender systems to the Amazon EKS cluster.
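
As an illustration of the gradual traffic shift described in step 4, the sketch below patches an Istio VirtualService through the Kubernetes API to move serving traffic from the legacy model endpoint to the Amazon EKS-based one in stages. This is a minimal sketch only: the namespace, VirtualService name, destination hosts, and canary percentages are hypothetical placeholders, not Quora's actual configuration.

```python
# Minimal sketch: gradually shift traffic to the EKS-hosted model by patching
# an Istio VirtualService. The VirtualService, namespace, and host names are
# hypothetical placeholders, not Quora's actual configuration.
from kubernetes import client, config


def set_traffic_split(namespace: str, vs_name: str, eks_weight: int) -> None:
    """Route `eks_weight` percent of traffic to the EKS-based model service."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()

    patch = {
        "spec": {
            "http": [
                {
                    "route": [
                        {
                            "destination": {"host": "model-serving-eks"},
                            "weight": eks_weight,
                        },
                        {
                            "destination": {"host": "model-serving-legacy"},
                            "weight": 100 - eks_weight,
                        },
                    ]
                }
            ]
        }
    }

    api.patch_namespaced_custom_object(
        group="networking.istio.io",
        version="v1beta1",
        namespace=namespace,
        plural="virtualservices",
        name=vs_name,
        body=patch,
    )


# Example: start with a small canary, then ramp up after validating results.
if __name__ == "__main__":
    for weight in (5, 25, 50, 100):
        set_traffic_split("ml-serving", "ranking-model", weight)
        input(f"Serving {weight}% on EKS; press Enter to continue the ramp-up...")
```

In practice, a progressive delivery tool such as Argo Rollouts (mentioned in step 3) can manage this kind of staged shift automatically.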

Modernized MLOps platform on Amazon EKS

Quora relies on Kafka to transfer collected data, including user actions and updates to questions, answers, and posts, for building machine learning features. Depending on the time sensitivity of the features, the data is stored in either online stores or offline stores. For batch data processing, Spark and Trino are used to read the collected data from a unified data lake, where Hive tables are stored as Parquet files on Amazon Simple Storage Service (Amazon S3). These are later used to generate features for online serving. Apache Airflow is the primary orchestrator for data and ML pipelines, initiating processing and training tasks on an hourly or daily basis.
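
To make the orchestration concrete, here is a minimal sketch of what an hourly Airflow pipeline along these lines could look like: a Spark job builds features from the data lake, then a training task is kicked off. The DAG ID, commands, script names, and S3 paths are illustrative assumptions, not Quora's actual pipeline definitions.

```python
# Minimal sketch of an hourly Airflow pipeline that runs a Spark feature job
# and then launches training. Commands, script names, and S3 paths are
# illustrative assumptions, not Quora's actual pipeline definitions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hourly_feature_and_training_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Read raw events from the data lake (Parquet on Amazon S3) and write features.
    build_features = BashOperator(
        task_id="build_features",
        bash_command=(
            "spark-submit --deploy-mode cluster jobs/build_features.py "
            "--input s3://data-lake/events/ --output s3://feature-store/offline/"
        ),
    )

    # Launch the containerized training job on the Amazon EKS cluster.
    train_model = BashOperator(
        task_id="train_model",
        bash_command=(
            "python launch_training.py --model ranking "
            "--features s3://feature-store/offline/"
        ),
    )

    build_features >> train_model
```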

Figure 1: Quora machine learning data platform

To train ML models, Quora uses ML frameworks such as TensorFlow, PyTorch, and scikit-learn, which run in containers on an Amazon EKS cluster. MLflow Tracking is used to monitor training progress and metrics. Once trained, the models are saved in the model repository on Amazon S3. Personalized services, such as the home feed service and the ads serving service, use the trained models and extracted features to rank results. These services also log the features into Kafka, where, combined with user actions, they form the training dataset.
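
The sketch below shows the general shape of such a training run: parameters and an evaluation metric are logged to MLflow Tracking, and the resulting model artifact is uploaded to an S3 model repository. The tracking URI, experiment name, bucket, and object key are illustrative assumptions, and the scikit-learn model is a stand-in for whichever framework a given job uses.

```python
# Minimal sketch: log training parameters and metrics to MLflow Tracking and
# store the trained model in an S3 model repository. The tracking URI, bucket
# name, and object key are illustrative assumptions.
import boto3
import joblib
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://mlflow-tracking.ml.svc.cluster.local:5000")
mlflow.set_experiment("ranking-model")

# Synthetic data stands in for features built from the data lake.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    params = {"n_estimators": 200, "learning_rate": 0.05}
    mlflow.log_params(params)

    model = GradientBoostingClassifier(**params).fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", auc)

    # Save the trained model artifact to the S3 model repository.
    joblib.dump(model, "model.joblib")
    boto3.client("s3").upload_file(
        "model.joblib", "ml-model-repository", "ranking/model.joblib"
    )
```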

To reduce management cost and achieve better bin-packing, we run a heterogeneous Amazon EKS cluster that supports diverse workloads, including both CPU and GPU workloads. The key stages of Quora’s MLOps practices are:

  • Data engineering and model experimentation: The model lifecycle begins with data engineering and model experimentation, during which ML engineers use Jupyter notebooks to generate training data, test model concepts, and assess performance. These notebooks are hosted on pods provisioned in Amazon EKS, using a pre-built container image that mirrors the training environment. Pod resources, such as CPU, GPU, and memory, can be tailored by users to accommodate varying needs. Jupyter notebooks are stored on a shared Amazon Elastic File System (Amazon EFS) file system. Engineers can either train models directly within the notebooks or initiate development training jobs, which is helpful for experimenting with different hyperparameters. This setup empowers ML engineers to rapidly iterate on their ideas without being constrained by computing resources.
  • Model training: ML engineers initiate training sessions through Python scripts on an ad hoc basis or within Apache Airflow pipelines. During these sessions, they can specify compute resource requirements and runtime configurations. A Kubernetes Job is generated for each session (a minimal launch sketch follows this list). The model training parameters and metrics are reported to MLflow Tracking, while training logs, events, and resource utilization data are also collected and easily accessible for ML engineers to troubleshoot any issues that may arise.
  • Model serving: ML engineers can launch their models in a self-service manner. They define a new model using two configuration files in the codebase. One file contains static settings, such as the model ID, model state, and framework. The other contains dynamic settings, such as CPU/memory, the number of replicas, Horizontal Pod Autoscaler (HPA) rules, and rate limits. Once the code changes have been pushed, the deployment pipeline generates the corresponding Kubernetes manifests and applies them through Skaffold, which keeps the manifests and the resources in Kubernetes in sync. Once the model is ready, it can serve production traffic. When a new version of the model is pushed and tested, the training pipeline triggers a restart of the specific model deployment so that it serves the new version.
  • Model monitoring: The model training pipeline consistently assesses the newly trained model, evaluating the loss, area under the curve (AUC), and model-specific metrics (for example, click rate loss, impression loss, and P50/P90 inference time) on an evaluation dataset. All metrics are logged to MLflow for tracking purposes. These metrics aid in identifying potential issues with overfitting or underfitting. After deploying the model in production, we log its prediction results and monitor an extensive set of metrics using VictoriaMetrics and Grafana. We monitor model metrics such as prediction accuracy, recall, uptime, and resource consumption, as well as key business metrics such as ad click rate and home feed click rate. The ML engineering team that owns the model handles on-call duties if any issues are detected through monitoring.
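
As referenced in the model training stage above, a training session ultimately becomes a Kubernetes Job. The sketch below shows one way such a Job could be created with the official Kubernetes Python client; the namespace, container image, command, and resource values are illustrative assumptions rather than Quora's actual configuration.

```python
# Minimal sketch: turn a training request into a Kubernetes Job using the
# official Python client. Namespace, image, command, and resource values are
# illustrative assumptions rather than Quora's actual configuration.
from kubernetes import client, config


def launch_training_job(name: str, command: list[str], cpu: str, memory: str, gpus: int = 0) -> None:
    config.load_kube_config()  # or load_incluster_config() when run from a pod

    resources = {"cpu": cpu, "memory": memory}
    if gpus:
        resources["nvidia.com/gpu"] = str(gpus)  # requires the NVIDIA device plugin

    container = client.V1Container(
        name="trainer",
        image="registry.example.com/ml-training:latest",
        command=command,
        resources=client.V1ResourceRequirements(requests=resources, limits=resources),
    )
    pod_spec = client.V1PodSpec(restart_policy="Never", containers=[container])
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name, labels={"team": "ml-platform"}),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=2,
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="ml-training", body=job)


launch_training_job(
    name="ranking-model-train-001",
    command=["python", "train.py", "--model", "ranking"],
    cpu="8",
    memory="32Gi",
    gpus=1,
)
```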

Benefits of modernized architecture

The modernized MLOps platform enables Quora to achieve cost savings, reliability improvements, and substantial feature enrichment. Some significant wins observed after the migration include:

  • Faster deployments and quicker rollbacks: The time to deploy a new model dropped from days to hours, and the time to change an existing model from hours to a few minutes. Rollbacks, which previously were not automated, can now be done instantly. Provisioning a new environment for model training has been reduced from 30 minutes per training job to 3 minutes in the new architecture.
  • Simplified configuration: Configuring model serving is much simpler. By updating a few lines in the model configuration files, ML engineers can adjust the resource requirements, autoscaling, rate limiting, and framework version, which previously was either not possible or required writing a lot of custom code.
  • More time to focus on innovation: With Amazon EKS, engineers don’t need to spend time on undifferentiated infrastructure management. It helps reduce operational burden such as on-call pages. ML engineers can dedicate more time towards experimenting, training, and serving new models for improved customer experience.
  • Cost reduction: Moving from a self-managed Amazon EC2-based architecture to the Amazon EKS cluster reduced the cost of operations through shared Kubernetes nodes, autoscaling, and improved pod density. We were able to bring down the cost by 25% through automation and tuning.

Lessons learned

Adopting new technologies can be a challenging journey, often fraught with unexpected obstacles and setbacks. Here are some of the lessons we learned:

Achieve consistent performance across a diverse set of Amazon EC2 instance types within the same cluster

The use of multiple node groups in Kubernetes, each running different types of instances, brings additional challenges, such as Auto Scaling group (ASG) limits, and requires complex taints and tolerations. To simplify the system, Quora removed extra taints and relied solely on the Kubernetes scheduler for job scheduling, resulting in a 10% improvement in bin-packing efficiency. When using mixed instance types for compute-sensitive workloads, performance inconsistencies can occur among pods due to varying CPU usage and request latency, especially if some pods are scheduled onto older CPUs or onto nodes with multiple non-uniform memory access (NUMA) domains. This can result in system unavailability, because high CPU usage may overload some pods while the Kubernetes HPA won’t activate, since it considers the average across all pods. To avoid this challenge, Quora uses NUMA-aware scheduling on large nodes and uses taints and tolerations to keep latency-sensitive workloads off older instances until they are upgraded.
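
Quora expresses this constraint with taints and tolerations; a closely related way to express the same intent is a node affinity rule that keeps latency-sensitive pods off specific instance types. The minimal sketch below, using the Kubernetes Python client, shows that alternative; the instance types listed are hypothetical examples, not the generations Quora actually excludes.

```python
# Minimal sketch: keep a latency-sensitive pod off older instance types with a
# node affinity rule. The instance-type values are hypothetical examples.
from kubernetes import client

avoid_old_instances = client.V1Affinity(
    node_affinity=client.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=client.V1NodeSelector(
            node_selector_terms=[
                client.V1NodeSelectorTerm(
                    match_expressions=[
                        client.V1NodeSelectorRequirement(
                            key="node.kubernetes.io/instance-type",
                            operator="NotIn",
                            values=["c5.4xlarge", "m5.4xlarge"],  # older generations to avoid
                        )
                    ]
                )
            ]
        )
    )
)

# Attach the affinity to a pod spec, e.g. the serving deployment's pod template.
pod_spec = client.V1PodSpec(
    affinity=avoid_old_instances,
    containers=[
        client.V1Container(
            name="model-server",
            image="registry.example.com/model-serving:latest",
        )
    ],
)
```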

Handle ML experimentation smoothly

Experimentation is a crucial aspect of developing models; ML engineers are able to release a model simply by editing configuration files and pushing the changes. However, for a seamless transition to production in terms of reliability and cost efficiency, ML engineers should follow a standard practice for model releases. Quora provides ML engineers with easy-to-use debugging environments, appropriate resource allocation and HPA rules, pre-warming of models before increasing production traffic, and circuit breaker mechanisms to handle sudden spikes in traffic.

Transition from single, bulky container images to environment-specific images

At the start of the migration, we constructed monolithic container images for both ML training and serving purposes. Virtual environments were used to accommodate various frameworks and versions. However, as time progressed, this approach became problematic. With the growing number of ML frameworks and versions, the image size increased significantly, resulting in slow container build and start times. This negatively impacted engineering velocity and model uptime, as deployments were unable to scale up quickly. Additionally, a single image could not support different CUDA versions. As a solution, we redesigned our container image pipeline to be more flexible and separated each runtime into its own container image.

Build cost management culture

Amazon EKS offers an elastic compute infrastructure that enables more efficient workload execution; however, creating a culture of cost management across the organization is also important. Every Quora team has an assigned Directly Responsible Individual (DRI) who acts as the single point of contact and driver for cost management. The DRI helps set cost goals, regularly reviews resource usage, and ensures cost awareness and accountability across the entire organization.

Effective cost management also involves monitoring and automation wherever possible. Metrics and usage data are collected daily and summarized with trends. The platform team provides detailed Grafana dashboards for diving deep into resource consumption. Auto-tuners have been built to adjust container CPU and memory resource requests based on usage history (a minimal sketch of this idea follows). Finally, staying up to date on new Amazon EKS features and best practices ensures that your organization is using the most efficient and cost-effective methods for managing resources.
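
As a rough illustration of the auto-tuner idea, the sketch below queries a Prometheus-compatible endpoint (VictoriaMetrics exposes one) for a workload's historical CPU usage and derives a suggested CPU request from the p95 plus some headroom. The endpoint, PromQL query, label selectors, and headroom factor are illustrative assumptions, not Quora's actual tuning logic.

```python
# Minimal sketch of the auto-tuner idea: query a Prometheus-compatible endpoint
# (such as VictoriaMetrics) for historical container CPU usage and derive a
# recommended CPU request. Endpoint, query, and headroom are illustrative assumptions.
import requests

VM_URL = "http://victoriametrics.monitoring.svc.cluster.local:8428"


def recommended_cpu_request(namespace: str, deployment: str, headroom: float = 1.2) -> float:
    """Return a suggested CPU request (in cores) based on the p95 of the last 7 days."""
    query = (
        "quantile_over_time(0.95, "
        "rate(container_cpu_usage_seconds_total{"
        f'namespace="{namespace}", pod=~"{deployment}-.*"'
        "}[5m])[7d:5m])"
    )
    resp = requests.get(f"{VM_URL}/api/v1/query", params={"query": query}, timeout=30)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    if not results:
        raise ValueError("No usage data found for this workload")

    # Take the busiest pod's p95 usage and add headroom before suggesting a request.
    p95 = max(float(sample["value"][1]) for sample in results)
    return round(p95 * headroom, 2)


print(recommended_cpu_request("ml-serving", "ranking-model"))
```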

Conclusion

In this post, we showed you how Quora modernized its MLOps platform on Amazon EKS, which provided a strong foundation for flexible, reliable, and efficient ML applications. The new platform accelerated the iteration speed of model development, which enabled Quora to quickly adapt to changing business requirements. The key factors that drove the modernization decisions were the ability to scale the ML platform with effective compute resource management for model training and serving, to increase system reliability, and to reduce the cost of operations. The modernized MLOps platform on Amazon EKS also decreased the ongoing operational support burden for engineers, and the scalability of the design improved customer experience and opened up opportunities for innovation and growth.

Quora gained valuable insights and experience in running MLOps on Amazon EKS. We’re excited to share our learnings with the wider community through this blog post, and to support other organizations that are starting their ML journey or looking to improve their existing MLOps pipelines. Based on our experience, we highly recommend modernizing your MLOps pipeline on Amazon EKS together with Amazon Elastic File System (Amazon EFS), which can help improve storage scalability, reduce operational complexity, and enhance overall performance.

Author bios

Lida Li, Quora

Lida Li is a Software Engineer on the Machine Learning Platform team at Quora. He is an expert in building large-scale distributed systems and platforms, and is passionate about driving and leading tech innovation in the world.

Purna Sanyal

Purna Sanyal is an architect at AWS, helping digital native customers solve their business problems through successful adoption of cloud-native architecture. He provides technical thought leadership and architecture guidance, and conducts proofs of concept to enable customers’ digital transformation. He specializes in data strategy and machine learning, and is passionate about building innovative solutions around Kubernetes, databases, analytics, and machine learning.

Ravi Yadav

Ravi leads the go-to-market strategy for AWS Container Services, supporting ISV and digital native customers. Prior to AWS, Ravi has experience in product management, product/corporate strategy, and engineering across companies including Moody's, Mesosphere, IBM, and Cerner.