Quora achieved 3x lower latency and 25% lower Costs by modernizing model serving with Nvidia Triton on Amazon EKS


Quora is a leading Q&A platform with a mission to share and grow the world’s knowledge, serving hundreds of millions of users worldwide every month. Quora uses machine learning (ML) to generate a custom feed of questions, answers, and content recommendations based on each user’s activity, interests, and preferences. ML drives targeted advertising on the platform, where advertisers use Quora’s vast user data and sophisticated targeting capabilities to deliver highly personalized ads to the audience. Moreover, ML plays a pivotal role in maintaining high-quality content for users by effectively filtering spam and moderating content.

Quora launched Poe, a generative artificial intelligence (AI) based chatbot app by leveraging different Large Language Models (LLMs) and offering fast and accurate responses. Poe aims to simplify the user experience and provide continuous back-and-forth dialogue while integrating with the major LLMs and other generative AI models.

Quora successfully modernized its model serving with NVIDIA Triton Inference Server (Triton) on Amazon Elastic Kubernetes Service (Amazon EKS). This move enabled a small team of ML engineers to manage, operate, and enhance model serving efficiently. This post delves into the design decisions, benefits of running NVIDIA Triton Server on Amazon EKS, and how Quora reduced model serving latency by three times and model serving cost by 25%.

Previous model serving architecture

Quora was running its model serving in hybrid mode where around half of the models were hosted on TensorFlow Serving (TFS), and the other half were hosted on a Custom Python Engine. The Custom Python Engine supported different model frameworks, such as PyTorch, XGBoost, Microsoft LightGBM, sklearn, whereas TFS was used only for the TensorFlow model framework.

Figure 1: Previous model serving architecture

Figure 1: Previous model serving architecture

Challenges with previous model serving architecture

Custom Python Engine uses Apache Thrift, whereas TFS uses gRPC framework. Maintaining different frameworks for implementing and managing remote procedure calls (RPC) in model serving architecture added significant complexity.

The existing system faced challenges with using GPUs effectively for serving, which led to unnecessary resource waste and increased costs. Furthermore, both had limited support for GPU optimization techniques that restrict model performance and efficiency.

There was a pressing need at Quora to serve Recommendation models with large embeddings on GPUs instead of CPU to improve cost.

Limitations of Custom Python Engine

  • Performance: Models deployed on Custom Python Engine, which used Apache Thrift for RPC communication, encountered high latency issues that impact model performance. On certain occasions, response time could soar up to 1500 milliseconds (ms), in stark contrast to the anticipated latency of 50 ms.
  • Service Mesh Integration: Quora uses Istio service mesh. gRPC natively supports HTTP2 and integrates seamlessly with service mesh technologies, which provide ease of support for features such as traffic mirroring and rate limiting. Apache Thrift does not support HTTP2 and is not natively integrated with Istio Service mesh.
  • High Traffic management: Custom Python Engine models faced challenges in handling high-traffic scenarios due to limitations in its client-side rate limiting mechanism. gRPC integrates seamlessly with server-side mesh-based rate limiting solutions, providing a much more robust and scalable solution to manage surges in traffic and maintain system stability. This method has been particularly effective for making sure of smooth operation during spikes in queries per second (QPS).

The significant disparity in response times across different models underscores the need for an optimized solution to enhance overall model serving performance and to meet specific latency and throughput requirements, particularly in critical use cases such as ads ranking and user feed. Quora was looking for a new model serving solution that resolves the preceding challenges, and also supports multi-ML frameworks such as ONNX, and TensorRT.

Solution overview

Overview of NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is an open-source software solution purpose-built for serving ML models. It optimizes the deployment of models in production by maximizing hardware use, supporting multiple frameworks, and providing a range of flexible serving options.

Why did Quora select NVIDIA Triton Inference Server on Amazon EKS?

To improve performance and optimize the cost of its model serving, Quora investigated various software and hardware, aiming to reduce latency and increase model throughput. Quora eventually selected NVIDIA Triton Inference Server due to its potential to meet the challenges in its model serving infrastructure. Triton is designed to effectively utilize GPUs for serving a wide variety of models, and flexible deployment options made it an optimal choice for modernizing Quora’s model serving. The reasons for choosing Triton include:

  • Multi-ML frameworks: Triton supports multiple ML frameworks, such as, TensorFlow, PyTorch, ONNX, TensorRT, OpenVINO, HugeCTR, FIL (Forest Inference Library). The broad framework support facilitates the migration of all models from current custom Python engines to Triton.
  • HTTP/GRPC endpoints: Triton provides HTTP/GRPC endpoints for model serving, which simplifies integration with Quora’s existing Istio service mesh.
  • High performance: Triton quickly and efficiently processes requests, making it perfect for applications requiring low latency. It includes essential features such as rate limiting status, health checks, dynamic batching, and concurrent model execution capabilities.
  • Scalability: It can easily scale up to handle large workloads and is designed to handle multiple models and data sources. Additionally, it supports a wide range of hardware (such as GPUs and CPUs), multi-node deployment, model versioning, and ensemble models handling. This makes it easy to deploy models on different hardware configurations.
  • Managed observability: Integration with Prometheus and Grafana for metrics, tools that are already in use at Quora for monitoring ML systems.
  • Recommendation models serving on GPUs: The NVIDIA Merlin HugeCTR (Huge Click-Through-Rate) is a GPU-accelerated deep neural network (DNN) training and inference framework designed for efficiently serving Recommendation models with large embeddings on NVIDIA GPUs.
  • Auto-tuning tools for model optimization:
    • Model Analyzer: Assesses runtime performance and suggests optimized configurations (batch size, instance group, CPU, memory, etc.)
    • Model Navigator: Automates the transition of models from source to optimal format and configuration for Triton deployment


The following walkthrough guides you through this solution.

Architecture of running NVIDIA Triton server on Amazon EKS

Quora chose gRPC as the standard client communication framework and Triton as the model serving for all ML models. There is a separate namespace for training and model serving in the Amazon EKS cluster. Within the model serving, separate node groups are used for the CPU-based models and the GPU-based models. Quora decided to move all new ML models on the following architecture:

Figure 2: Modernized model serving

Migration to NVIDIA Triton Server on Amazon EKS

The existing ML model serving architecture was designed to accommodate multiple ML Serving engines, such as Custom Python Engine and TFS. The following steps are performed to add Triton Server into model serving architecture and migrate GPU models to Triton:

  1. Generate stubs for gRPC service: Quora chose to use the gRPC framework with Triton. To generate the stubs necessary for RPC communication between the server and client sides, we followed HTTP/REST and GRPC Protocol and used Triton’s protobuf specification to generate these stubs.
  2. Setup NVIDIA Triton on Amazon EKS as the serving server
    • Customize the base image of NVIDIA with ONNX framework: NVIDIA provides pre-built Docker containers for the NVIDIA Triton Inference Server, which are available in their NGC Catalog. However, to tailor the Triton container to our specific environment, we followed the instructions detailed in Triton’s customization guide. This process included selecting the particular framework that our environment needs (for example, ONNX) and installing any additional libraries required by our models. To accommodate a variety of our models based on different frameworks, we built multiple Triton packages.
    • Add Triton-specific model configurations: Triton requires specific configuration details, such as the model’s name, version, and procedures for preprocessing inputs and post-processing outputs. Triton is added as the third engine in the model serving architecture to incorporate Triton specific settings within the existing model configuration. These configurations are serialized into the pbtxt file, which serves as the required model configuration in the model repository for Triton deployment.
  3. Prepare the model to deploy on Triton: We took an existing PyTorch model and converted that to the ONNX format and uploaded it to an Amazon Simple Storage Service (Amazon S3) model repository. We used MLFlow model registry for model versioning and incorporated Triton packages into our Continuous Integration/Continuous Deployment (CI/CD) pipeline. With these steps, we successfully integrated the NVIDIA Triton Inference Server into the model serving architecture.
  4. Migrate models to NVIDIA Triton Server: In the initial phase, we successfully migrated four PyTorch models, running on Python engine, and two TensorFlow models, running on TFS engine, to the Triton server running with the ONNX framework. This led to substantial improvements in model availability, reducing latency and cost by at least 50%. After the initial success, three new PyTorch GPU models were added directly to the Triton server.

Benefits of modernized architecture

The modernized model serving platform enables Quora to achieve performance enhancement, cost savings, and substantial feature enrichment. Some significant wins observed after the migration include:

  • Performance enhancement: Latency of the PyTorch GPU model was slashed by seven times (from 230ms to 30ms) and latency for the TensorFlow GPU model was reduced by two times (from 20ms to 8ms). Notably, significant enhancements have been observed in Transformer and BERT-based models, such as DeBERTa, RoBERTa, XLM-RoBERTa, and E5 Text Embedding, with latency reductions exceeding seven times.

Improved performance occurs due to conversion to the ONNX format, and model quantization from fp-32 to fp-16. This reduces the model size and memory usage, using ONNX runtime as inference backend engine and using gRPC for the communication framework

  • Cost savings: The GPU model serving cost is reduced by 52%, which leads to 25% overall savings in model serving. The primary contributors to cost savings are conversion to ONNX, and Model Quantization. The model size gets smaller, and Quora could enhance throughput by two times and GPU utilization by three times. Ultimately, this improves the efficiency and cuts down cost.
  • GPU use: The adoption of ONNX frameworks improved GPU use from 40% to 80%, leading to two times serving efficiency.
  • Unified RPC framework: The new setup promotes a unified framework by migrating all models to use gRPC and service mesh functionalities. This unification simplifies client-side RPC support and streamlines the operations.
  • More time to focus on innovation: With Amazon EKS, engineers don’t need to spend time on undifferentiated infrastructure management. It helps reduce operational burden, such as on-call pages. This allows ML engineers to dedicate more time to experimentation, training, and serving new models for an improved customer experience.

Lessons learned

Adopting new technologies can be a challenging journey, often fraught with unexpected obstacles and setbacks. Here are some of the lessons we learned:

  • ONNX as a Preferred Exchange Format: Quora found ONNX to be an ideal open standard format for model serving. It’s designed for interoperability, making it a perfect choice when working with models trained with various frameworks. After training an ML model in PyTorch or TensorFlow, we could easily convert it to ONNX and apply post-training quantization. This process led to significant improvements in latency and efficiency.
  • gRPC as the communication framework: Quora’s experience has shown gRPC to be a reliable RPC framework offering improved performance and reliability.
  • Remote model repository feature in Triton: Although Triton supports remote model repository in Amazon S3, our testing indicated that this feature did not function as anticipated. We recommend incorporating a step to fetch the model files from Amazon S3 and place them into a predefined local path, such as: /mnt/models/. This method guarantees the availability of model files at a recognized location, a critical need for Triton backends such as the python_backend, which require Python runtime and libraries, or the hugectr_backend, which requires access to embedding files.
  • Support of multi-ML frameworks: NVIDIA Triton Inference Server supports multiple frameworks, such as PyTorch, TensorFlow, TensorRT, or ONNX Runtime with different hardware.
  • Amazon EKS as ML service: Quora needed an extensible, self-serving ML service based on microservice architecture that helps ML Engineers iterate quicker before deploying the model. Ideally, this service should support various training and serving environments, essentially being a truly framework-agnostic training and model serving service. We found Amazon EKS as the most suitable ML service.


In this post, we showed how Quora modernized its model serving with NVIDIA Triton Inference Server on Amazon EKS, which provided a strong foundation for flexible, reliable, and efficient model serving. This service reduced model serving complexity, which enabled Quora to quickly adapt to changing business requirements. The key factors that drove the modernization decisions were the ability to support multiple ML frameworks, scale the model serving with effective compute resource management, increase system reliability, and reduce the cost of operations. The modernized model serving on Amazon EKS also decreased the ongoing operational support burden for engineers, and the scalability of the design improved customer experience and opened up opportunities for innovation and growth.

We’re excited to share our learnings with the wider community through this post, and to support other organizations that are starting their model serving journey or looking to improve the existing model serving pipelines. As part of our experience, we highly recommend modernizing your model serving with NVIDIA Triton on Amazon EKS.

Purna Sanyal

Purna Sanyal

Purna Sanyal is an architect at AWS, helping digital native customers to solve their business problems with successful adoption of cloud native architecture and digital transformation. He has specialization in data strategy, machine learning and Generative AI. He is passionate about building innovative solutions with Kubernetes, database, analytics, and machine learning framework.

Michael Chen

Michael Chen

Michael Chen is a Director of Engineering at Quora. He leads the Platform Engineering Organization focusing on enhancing developer productivity across the company. Michael and his team have helped to transform Quora's machine learning platform to be ready for Generative AI applications.

Tuan Vu

Tuan Vu

Tuan Vu is a Software Engineer on the Machine Learning Platform team at Quora. He leads initiatives to modernize the company's ML Serving systems and optimize cost efficiency across the ML organization. As an expert in Machine Learning and Distributed Systems, Tuan is passionate about building large-scale ML systems that can serve global users with optimal performance. Under his guidance, the team has transformed Quora's ML infrastructure, creating a highly scalable and efficient foundation powering new AI applications with wide-ranging social impact