Partner Success with AWS / Software & Internet / United States

May 2024

Baseten Delivers Fast, Scalable Generative AI Inference with AWS and NVIDIA

2X

faster delivery throughput for customers in production

50%

decrease in time to first token with TensorRT-LLM

Early access

to TensorRT-LLM through NVIDIA's Inception program

Overview

Baseten is a San Francisco-based machine learning infrastructure company with a focus on model inference. Baseten offers an advanced machine learning operations (MLOps) platform for model deployment, model serving, and model fine-tuning, and customers come to it to run large language models (LLMs) at scale reliably, performantly, and cost-efficiently. With LLM performance as a top priority, Baseten teamed up with AWS Partner NVIDIA and Amazon Web Services (AWS) to deliver measurable throughput and latency improvements, dramatically improving time to first token (TTFT).

Aiming to Never Keep a Customer Waiting

As a machine learning (ML) infrastructure company with a focus on model inference, Baseten helps customers run their models at scale. In many cases, customers are running LLMs to power generative artificial intelligence (AI) applications, which require high-performance hardware. Without state-of-the-art GPUs, these models can keep end users waiting while a generative AI application produces its text response. These lags in content generation create frustration, delays, and customer service issues. Reducing this latency, particularly the time it takes to generate the first token, was critical for Baseten and its customers.

“Our partnership with NVIDIA, plus their software and hardware stack, has allowed our customers to bring their ideas to market more quickly and cost-efficiently.”

Amir Haghighat
Co-Founder and CTO, Baseten

Choosing NVIDIA to Support Large Language Models

Baseten knew AWS Partner NVIDIA was a leader in AI and accelerated computing and partnered with the company through NVIDIA Inception, a free program for technology startups. “Our customers are running language models, diffusion models, and different large models that require hardware that only a few vendors provide,” said Baseten co-founder and CTO Amir Haghighat. “NVIDIA is one of them—but their value goes beyond GPUs. Aside from their hardware stack, their very extensive software stack allows you to package up your models and get them ready for inference.”

Building a Foundation with AWS Services

As a company built on AWS from day one, Baseten hosted its NVIDIA GPUs on Amazon Elastic Compute Cloud (Amazon EC2). This allowed the team to reduce latency and shorten its customers’ TTFT. Amazon EC2 delivers reliable, scalable infrastructure on demand, with the capacity to scale within minutes and 99.99 percent availability. With security from the AWS Nitro System built into its foundation, Amazon EC2 provides secure compute for Baseten’s applications. Amazon EC2 instances powered by NVIDIA GPUs drive some of today’s most sophisticated computational workloads.
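For a sense of scale, the 99.99 percent availability figure above implies a small downtime budget; the conversion below is simple arithmetic, not an AWS-published figure beyond the availability number itself.

```python
# Rough downtime budget implied by a 99.99% availability target.
# The 99.99% figure comes from the text; the rest is arithmetic.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes(availability: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of allowed downtime per period at a given availability level."""
    return (1.0 - availability) * period_minutes

print(round(downtime_minutes(0.9999), 1))  # ≈ 52.6 minutes per year
```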

To support containers running on its NVIDIA GPU-enabled Amazon EC2 instances, Baseten used Amazon Elastic Kubernetes Service (Amazon EKS). Amazon EKS allows Baseten to run and manage the Kubernetes cluster that serves as the foundation of its infrastructure. In addition, Baseten uses the Karpenter open-source software for scaling clusters as demand for requests, throughputs, and hardware increases.
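The core decision Karpenter automates can be illustrated with a toy calculation: given workload that cannot be scheduled, provision enough new nodes to place it. The function and capacities below are hypothetical illustrations, not Karpenter’s actual API.

```python
# Hypothetical sketch of a Karpenter-style scale-up decision:
# given GPU requests that cannot be scheduled on existing nodes,
# compute how many new nodes to provision. Illustrative only.
import math

def nodes_needed(pending_gpu_requests: int, gpus_per_node: int) -> int:
    """Number of new nodes required to place all pending GPU requests."""
    if pending_gpu_requests <= 0:
        return 0
    return math.ceil(pending_gpu_requests / gpus_per_node)

print(nodes_needed(13, 8))  # 2 nodes cover 13 pending GPUs at 8 GPUs/node
```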

Gaining Access to TensorRT-LLM through NVIDIA’s Inception Program

Baseten joined NVIDIA Inception, a free program designed to nurture startups, providing co-marketing support and opportunities to connect directly with NVIDIA experts. Through the Inception program, NVIDIA gave Baseten early access to TensorRT-LLM, an open-source library for defining, optimizing, and executing LLMs for inference in production. “Our partnership with NVIDIA has been crucial for us. The TensorRT-LLM library has massively improved the experience we can give our customers—now they can run large language models and get the throughput and latency improvements they need to maintain the level of service that sets them apart in the marketplace,” said Haghighat.

NVIDIA’s extensive software stack enabled Baseten to take advantage of the NVIDIA Triton Inference Server, an open-source AI model serving platform that streamlines and accelerates the deployment of AI inference workloads in production. It helps enterprises reduce the complexity of model serving infrastructure, shorten the time needed to deploy new AI models in production, and increase AI inferencing and prediction capacity. Both NVIDIA TensorRT-LLM and Triton Inference Server are included as a part of NVIDIA AI Enterprise, which provides a production-grade, secure, end-to-end software platform for enterprises building and deploying accelerated AI software.

Increasing Throughput by 2X and Accelerating TTFT by 50%

By using TensorRT-LLM on AWS, Baseten customers have seen significant improvements in model performance, including higher throughput, lower latency, and an accelerated TTFT. “We've seen customers in production get roughly a 2X improvement in throughput with TensorRT-LLM, essentially allowing them to service twice as many requests with the same amount of hardware—at the same cost basis,” said Haghighat. On the latency side, TensorRT-LLM has helped Baseten speed up TTFT by 50 percent. “TensorRT-LLM helps reduce latency, which is especially important where there’s a human waiting on the other side for the text to be generated,” Haghighat said.
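The quoted gains can be sanity-checked with simple arithmetic. The baseline figures below are invented for illustration; only the 2X throughput and 50 percent TTFT factors come from the case study.

```python
# Illustrative math behind the quoted gains. Baseline values are hypothetical;
# the 2x throughput and 50% TTFT improvement factors come from the text.

baseline_rps = 10.0          # hypothetical requests/sec on fixed hardware
baseline_ttft_ms = 800.0     # hypothetical time to first token

rps_with_trtllm = baseline_rps * 2.0        # ~2x throughput in production
ttft_with_trtllm = baseline_ttft_ms * 0.5   # 50% faster time to first token

# Serving twice the requests on the same hardware halves cost per request.
cost_per_req_ratio = baseline_rps / rps_with_trtllm
print(rps_with_trtllm, ttft_with_trtllm, cost_per_req_ratio)  # 20.0 400.0 0.5
```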

Working with NVIDIA, Baseten has also gained support for streaming, dynamic batching, continuous batching, and quantization as part of the NVIDIA stack. “Our partnership with NVIDIA, plus their software and hardware stack, has allowed our customers to bring their ideas to market quickly and cost-efficiently,” Haghighat said. “It’s really been a game-changer all around.”
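The continuous batching mentioned above can be sketched with a toy scheduler: finished sequences free their slots mid-flight and queued requests join the running batch immediately, rather than waiting for the whole batch to drain. This is a simplified illustration of the idea, not TensorRT-LLM’s actual implementation or API.

```python
# Toy illustration of continuous (in-flight) batching. Each request is
# (name, decode_steps_remaining); queued requests fill freed slots at once.
from collections import deque

def continuous_batching(requests, max_batch: int):
    """Return, per step, the names of sequences decoded in that step."""
    queue = deque(requests)
    batch, timeline = [], []
    while queue or batch:
        while queue and len(batch) < max_batch:   # admit new work in-flight
            batch.append(list(queue.popleft()))
        timeline.append([name for name, _ in batch])
        for item in batch:
            item[1] -= 1                          # one decode step each
        batch = [item for item in batch if item[1] > 0]
    return timeline

steps = continuous_batching([("a", 2), ("b", 3), ("c", 1)], max_batch=2)
print(steps)  # [['a', 'b'], ['a', 'b'], ['b', 'c']]
```

Note that "c" starts decoding the moment "a" finishes, while "b" is still in flight; with naive static batching, "c" would have waited for the entire first batch to complete.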

About Baseten

Baseten makes going from machine learning models to production-grade applications fast and easy. With Baseten, data science and machine learning teams can build applications without backend, frontend, or MLOps knowledge.

About AWS Partner NVIDIA

Since its founding in 1993, NVIDIA has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI, and is fueling industrial digitalization across markets. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

AWS Services Used

Amazon EKS

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service to run Kubernetes in the AWS cloud and on-premises data centers. In the cloud, Amazon EKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks.


Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 750 instances and choice of the latest processor, storage, networking, operating system, and purchase model to help you best match the needs of your workload.


More Software & Internet Success Stories


  • Software & Internet

    Improvement-IT Uses TechNative to Migrate to AWS, Speeds Customer Onboarding, and Reduces Support Calls by 15%

    Improvement-IT, based in the Netherlands, provides IoT solutions to a variety of organizations with an emphasis on tracking, tracing, and monitoring the status of assets. Together with its other companies Port Pay and Alltrack Medical, it offers these innovative solutions to help customers track assets in the field, manage warehouses, and optimize supply chains. However, it was being hampered by its own managed services provider, which was running both Amazon Web Services (AWS) and on-premises assets for it. It wanted a proactive partner with deep expertise to help optimize its systems, improve client onboarding times, and better detect problems before they affected customers. AWS Partner TechNative has helped it to achieve those goals, reducing customer support calls by 15 percent and cutting onboarding time by 50 percent.

    2025
  • Software & Internet

    Atlassian Reduces Latency by 17% and Saves $2.1 Million with Amazon FSx for NetApp ONTAP

    Atlassian faced a critical challenge when the storage solution hosting its Bitbucket platform’s 2.3 petabytes of data was being retired. To solve this, Atlassian joined forces with AWS Partner NetApp to migrate that data to an Amazon Web Services (AWS) storage service that used NetApp’s ONTAP file system. The migration was seamless, resulting in no discernable customer impact and no service disruptions. Within a one-month period, Atlassian successfully migrated 60 million repositories, achieving $2.1 million in annual cost savings and reducing application latency by 17 percent.

    2025
  • Software & Internet

    Glympse Reduces Onsite Safety Challenges by Enhancing Navigation with HERE and AWS

    Since 2008, Glympse, the pioneer in location-sharing technology, has been providing innovative solutions that predictively visualize and provide notifications and updates about where people, products, and assets are while in motion. When Glympse learned that companies in heavy industrial environments struggle with safety challenges, it saw an opportunity to help. Glympse, with AWS Partner HERE Technologies and supported by Amazon Web Services (AWS), developed a unique solution—one that provides onsite drivers and visitors with web-based, turn-by-turn directions. The Glympse solution reduced collision risk and supports timely hazard response. It keeps yard operators aware of all onsite activity, supporting quicker responses to potential hazards and alerting operators about who is in the yard.

    2025
  • Software & Internet

    OpenText Accelerates FedRAMP Moderate Authorization with InfusionPoints, Schellman, and AWS

    InfusionPoints, an AWS GSCA Partner, and AWS GSCA Partner Schellman Compliance worked with Canada-based OpenText Corporation to achieve a FedRAMP Moderate authorization for the OpenText IT Management Platform. After connecting with InfusionPoints through the AWS Global Security & Compliance Acceleration program, OpenText achieved a FedRAMP Moderate certification in 18 months, enabling the company to serve its US government customers seeking cloud modernization and to expand its business to other federal, state, and local government agencies and contractors.

    2025

Get Started

Organizations of all sizes across all industries are transforming their businesses and delivering on their missions every day using AWS. Contact our experts and start your own AWS journey today.