Partner Success with AWS / Software & Internet / United States

May 2024

Baseten Delivers Fast, Scalable Generative AI Inference with AWS and NVIDIA

2X

higher throughput for customers in production

50%

decrease in time to first token with TensorRT-LLM

Early access

to TensorRT-LLM through NVIDIA's Inception program

Overview

Baseten is a San Francisco-based machine learning infrastructure company with a focus on model inference. It offers an advanced machine learning operations (MLOps) platform for model deployment, model serving, and model fine-tuning, and customers come to Baseten to run large language models (LLMs) at scale reliably, performantly, and cost-efficiently. With LLM performance as a top priority, Baseten teamed up with AWS Partner NVIDIA and Amazon Web Services (AWS) to deliver measurable throughput and latency improvements—dramatically improving time to first token (TTFT).

Aiming to Never Keep a Customer Waiting

As a machine learning (ML) infrastructure company with a focus on model inference, Baseten helps customers run their models at scale. In many cases, customers are running LLMs to power generative artificial intelligence (AI) applications, which require high-performance hardware. Without state-of-the-art GPUs, these models may cause lag time for end users and keep them waiting while generative AI applications present a text response. These lags in content generation create frustration, delays, and customer service issues. Reducing this latency—particularly the time it takes to generate an initial token—was a critical issue for Baseten and its customers.
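TTFT is simply the delay between submitting a request and receiving the first streamed token back. A minimal measurement sketch in Python, using a hypothetical stand-in for a streaming inference client (the generator and its delay are illustrative, not from the case study):

```python
import time

def time_to_first_token(stream):
    """Measure seconds from request start until the first token arrives.

    `stream` is any iterator that yields tokens as they are generated
    (a hypothetical stand-in for a streaming inference client).
    """
    start = time.perf_counter()
    first = next(stream)          # blocks until the model emits a token
    ttft = time.perf_counter() - start
    return first, ttft

def fake_stream(delay=0.05, tokens=("Hello", ",", " world")):
    """Toy model that 'thinks' briefly (prefill/queueing) before streaming."""
    time.sleep(delay)
    yield from tokens

token, ttft = time_to_first_token(fake_stream())
print(f"first token {token!r} after {ttft:.3f}s")
```

The same pattern works against any real streaming API: start the clock when the request is sent and stop it on the first yielded chunk.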

“Our partnership with NVIDIA, plus their software and hardware stack, has allowed our customers to bring their ideas to market more quickly and cost-efficiently.”

Amir Haghighat
Co-Founder and CTO, Baseten

Choosing NVIDIA to Support Large Language Models

Baseten knew AWS Partner NVIDIA was a leader in AI and accelerated computing and partnered with the company through NVIDIA Inception, a free program for technology startups. “Our customers are running language models, diffusion models, and different large models that require hardware that only a few vendors provide,” said Baseten co-founder and CTO Amir Haghighat. “NVIDIA is one of them—but their value goes beyond GPUs. Aside from their hardware stack, their very extensive software stack allows you to package up your models and get them ready for inference.”

Building a Foundation with AWS Services

As a company built on AWS from day one, Baseten hosted its NVIDIA GPUs on Amazon Elastic Compute Cloud (Amazon EC2), which allowed the team to reduce latency and shorten its customers’ TTFT. Amazon EC2 delivers reliable, scalable infrastructure on demand, along with the capacity to scale within minutes and 99.99 percent availability. With security from the AWS Nitro System built into its foundation, Amazon EC2 provides secure compute for Baseten’s applications. Amazon EC2 instances powered by NVIDIA GPUs drive some of today’s most sophisticated computational workloads.

To support containers running on its NVIDIA GPU-enabled Amazon EC2 instances, Baseten used Amazon Elastic Kubernetes Service (Amazon EKS). Amazon EKS allows Baseten to run and manage the Kubernetes cluster that serves as the foundation of its infrastructure. In addition, Baseten uses the Karpenter open-source software to scale clusters as request volume, throughput demands, and hardware needs increase.
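At its core, a cluster autoscaler like Karpenter provisions enough nodes to satisfy pending pod requests. A deliberately simplified sketch of that sizing decision, assuming homogeneous GPU nodes (Karpenter itself does far more, including instance-type selection, bin-packing across CPU and memory, and consolidation):

```python
import math

def nodes_needed(pending_gpu_requests, gpus_per_node):
    """Return how many GPU nodes must be provisioned to satisfy the
    pending requests, assuming every node exposes gpus_per_node GPUs.
    Illustrative only; real autoscalers consider many more dimensions.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    total = sum(pending_gpu_requests)
    return math.ceil(total / gpus_per_node)

# e.g. five pending pods each requesting one GPU, on 4-GPU instances:
print(nodes_needed([1, 1, 1, 1, 1], gpus_per_node=4))  # -> 2
```

The ceiling division is the key idea: a fractional node requirement always rounds up to a whole provisioned instance.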

Gaining Access to TensorRT-LLM through NVIDIA’s Inception Program

Baseten joined NVIDIA Inception, a free program designed to nurture startups, providing co-marketing support and opportunities to connect directly with NVIDIA experts. Through the Inception program, NVIDIA gave Baseten early access to TensorRT-LLM, an open-source library for defining, optimizing, and executing LLMs for inference in production. “Our partnership with NVIDIA has been crucial for us. The TensorRT-LLM library has massively improved the experience we can give our customers—now they can run large language models and get the throughput and latency improvements they need to maintain the level of service that sets them apart in the marketplace,” said Haghighat.

NVIDIA’s extensive software stack enabled Baseten to take advantage of the NVIDIA Triton Inference Server, an open-source AI model serving platform that streamlines and accelerates the deployment of AI inference workloads in production. It helps enterprises reduce the complexity of model serving infrastructure, shorten the time needed to deploy new AI models in production, and increase AI inferencing and prediction capacity. Both NVIDIA TensorRT-LLM and Triton Inference Server are included as a part of NVIDIA AI Enterprise, which provides a production-grade, secure, end-to-end software platform for enterprises building and deploying accelerated AI software.
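One reason a serving platform like Triton Inference Server raises throughput is dynamic batching: individual requests that arrive close together are grouped into a single batch for the GPU. A much-simplified illustration of the grouping idea (Triton's actual batcher works with configurable queue delays and preferred batch sizes):

```python
def dynamic_batch(queue, max_batch_size):
    """Group queued requests into batches of at most max_batch_size,
    preserving arrival order. Illustrative stand-in for a server-side
    dynamic batcher; it ignores timing windows entirely.
    """
    batches = []
    while queue:
        batches.append(queue[:max_batch_size])
        queue = queue[max_batch_size:]
    return batches

requests = ["req1", "req2", "req3", "req4", "req5"]
print(dynamic_batch(requests, max_batch_size=2))
# -> [['req1', 'req2'], ['req3', 'req4'], ['req5']]
```

Because a GPU processes a small batch nearly as fast as a single request, serving batches instead of one request at a time multiplies effective throughput.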

Increasing Throughput by 2X and Reducing TTFT by 50%

By using TensorRT-LLM on AWS, Baseten customers have seen huge improvements in model performance, including higher throughput, lower latency, and a faster TTFT. “We've seen customers in production get roughly a 2X improvement in throughput with TensorRT-LLM, essentially allowing them to service twice as many requests with the same amount of hardware—at the same cost basis,” said Haghighat. On the latency side, TensorRT-LLM has helped Baseten reduce TTFT by 50 percent. “TensorRT-LLM helps reduce latency, which is especially important where there’s a human waiting on the other side for the text to be generated,” Haghighat said.
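The cost implication of a 2X throughput gain follows directly: serving twice as many requests on the same hardware halves the cost per request. As back-of-the-envelope arithmetic (the hourly cost and request rates below are illustrative, not figures from the case study):

```python
def cost_per_request(hourly_hardware_cost, requests_per_hour):
    """Amortized hardware cost of serving one request."""
    return hourly_hardware_cost / requests_per_hour

# Hypothetical numbers: same GPU spend, double the served requests.
baseline = cost_per_request(hourly_hardware_cost=4.00, requests_per_hour=1000)
doubled = cost_per_request(hourly_hardware_cost=4.00, requests_per_hour=2000)

print(f"baseline: ${baseline:.4f}/req, at 2X throughput: ${doubled:.4f}/req")
assert doubled == baseline / 2  # same hardware, twice the requests
```

This is what "the same cost basis" in the quote above means in practice: the spend is unchanged, so unit economics improve in direct proportion to throughput.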

Working with NVIDIA, Baseten has also gained support for streaming, dynamic batching, continuous batching, and quantization as part of the NVIDIA stack. “Our partnership with NVIDIA, plus their software and hardware stack, has allowed our customers to bring their ideas to market quickly and cost-efficiently,” Haghighat said. “It’s really been a game-changer all around.”
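Of the techniques just listed, quantization is the easiest to illustrate: weights are rescaled from floating point into a narrower integer range (int8 here), trading a small amount of precision for less memory traffic and faster math. A minimal symmetric-quantization sketch in plain Python (production stacks such as TensorRT-LLM use far more sophisticated schemes, e.g. per-channel scales and calibration):

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats in [-m, m] to [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 values and the scale."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored value is within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
```

Storing `q` as int8 takes a quarter of the memory of float32 weights, which is where the bandwidth and capacity savings come from.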

About Baseten

Baseten makes going from machine learning models to production-grade applications fast and easy. With Baseten, data science and machine learning teams can build applications without backend, frontend, or MLOps knowledge.

About AWS Partner NVIDIA

Since its founding in 1993, NVIDIA has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI, and is fueling industrial digitalization across markets. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

AWS Services Used

Amazon EKS

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service to run Kubernetes in the AWS cloud and on-premises data centers. In the cloud, Amazon EKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks.

Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 750 instances and choice of the latest processor, storage, networking, operating system, and purchase model to help you best match the needs of your workload.

More Software & Internet Success Stories


  • Software & Internet

    Palo Alto Networks Boosts 2,000 Developers’ Productivity Using AI Solutions from AWS, Anthropic, and Sourcegraph

    Palo Alto Networks, a leading cybersecurity company, sought to boost developer productivity using generative artificial intelligence (AI) technology. The goal was to create a custom solution that would enhance the speed and quality of coding while maintaining strict security standards. By leveraging Amazon Web Services (AWS), Claude 3.5 Sonnet and Claude 3 Haiku from AWS Partner Anthropic, and Cody from AWS Partner Sourcegraph, Palo Alto Networks developed a secure AI tool for generating, optimizing, and troubleshooting code. Within three months, Palo Alto Networks onboarded 2,000 developers and increased productivity up to 40 percent, with an average of 25 percent. This custom AI solution has empowered both senior and junior developers, and the company expects further improvements in code quality and efficiency.

    2024
  • Software & Internet

    IBM Reduces the Co-Selling Lifecycle by 90% and Boosts Sales Opportunities with AWS by 117% with ACE CRM Integration Using Labra Platform

    IBM, a global technology enterprise, wanted to simplify the process of creating and sharing AWS co-selling opportunities from Salesforce. IBM deployed an ACE CRM integration from AWS Partner Labra, a provider of software as a service (SaaS) solutions. The integration helps IBM sales and marketing teams move campaign responses and sales opportunities from within Salesforce directly into ACE. With Labra’s co-sell automation, IBM has cut co-sell time by 90 percent, increased co-sell opportunities by 117 percent, increased revenue, and created a custom integration that streamlines marketing nurture tools.

    2024
  • Software & Internet

    Starburst Accelerates AWS Co-Selling with Tackle ACE CRM Integration

    Starburst, which provides an open data lakehouse platform for global customers, sought to reduce the manual effort required to collaborate with its strategic cloud partner, AWS. Starburst uses Salesforce as its CRM system and needed a solution to replicate relevant opportunity data to the APN Customer Engagements (ACE) pipeline manager. Starburst worked with AWS Partner Tackle and implemented the Tackle ACE CRM integration, which allows Starburst to enter, manage, and monitor sales activity to enable sophisticated co-sell activity within ACE, the AWS Partner Network (APN) sales collaboration tool. As a result, Starburst cut opportunity sharing time by up to 50 percent, reduced opportunity rejections, enhanced opportunity data quality, and enabled teams to focus on other joint go-to-market (GTM) opportunities.
     

    2024
  • Software & Internet

    ERIN’s Cloud Transformation Cuts Costs, Promotes Growth, and Enables New Features

    An innovative employee referral platform operating with a team of 25 people, ERIN faced challenges keeping up with rapid growth. A collaboration with AWS Partner SourceFuse improved ERIN’s cloud setup to use its resources more effectively and eliminate unnecessary services. The outcome was transformative: ERIN cut hosting costs by 40 percent, found the flexibility it needed to meet demand, and advanced its feature roadmap. The solution helped ERIN achieve near triple-digit growth in revenue year over year, broaden its reach, and deliver new offerings to customers.

    2024

Get Started

Organizations of all sizes across all industries are transforming their businesses and delivering on their missions every day using AWS. Contact our experts and start your own AWS journey today.