Partner Success with AWS / Software & Internet / United States

May 2024
baseten
NVIDIA

Baseten Delivers Fast, Scalable Generative AI Inference with AWS and NVIDIA

2X

improvement in throughput for customers in production

50%

decrease in time to first token with TensorRT-LLM

Early access

to TensorRT-LLM through NVIDIA's Inception program

Overview

Baseten is a San Francisco-based machine learning infrastructure company with a focus on model inference. Its advanced machine learning operations (MLOps) platform handles model deployment, model serving, and model fine-tuning, and customers come to Baseten to run large language models (LLMs) at scale reliably, performantly, and cost-efficiently. With LLM performance as a top priority, Baseten teamed up with AWS Partner NVIDIA and Amazon Web Services (AWS) to deliver measurable throughput and latency improvements—dramatically improving time to first token (TTFT).

Aiming to Never Keep a Customer Waiting

As a machine learning (ML) infrastructure company with a focus on model inference, Baseten helps customers run their models at scale. In many cases, customers are running LLMs to power generative artificial intelligence (AI) applications, which require high-performance hardware. Without state-of-the-art GPUs, these models may cause lag time for end users and keep them waiting while generative AI applications present a text response. These lags in content generation create frustration, delays, and customer service issues. Reducing this latency—particularly the time it takes to generate an initial token—was a critical issue for Baseten and its customers.
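TTFT is simply the delay between sending a request and receiving the first streamed token. A minimal measurement sketch, assuming a generic streaming client (the `fake_token_stream` generator below is a hypothetical stand-in, not Baseten's API):

```python
import time

def fake_token_stream():
    # Hypothetical stand-in for a streaming LLM client; the first
    # token arrives only after a simulated prefill/queueing delay.
    time.sleep(0.05)                     # prefill latency
    for token in ["Hello", ",", " world"]:
        yield token
        time.sleep(0.01)                 # per-token decode latency

def time_to_first_token(stream):
    """Return (ttft_seconds, tokens) for any token iterator."""
    start = time.perf_counter()
    ttft, tokens = None, []
    for token in stream:
        if ttft is None:
            # Latency until the very first token: the metric end
            # users actually perceive as "the app responding."
            ttft = time.perf_counter() - start
        tokens.append(token)
    return ttft, tokens

ttft, tokens = time_to_first_token(fake_token_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, tokens: {tokens}")
```

Because TTFT is dominated by the prefill pass over the prompt, it is the latency component most visible to a human waiting for a response to begin.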


“Our partnership with NVIDIA, plus their software and hardware stack, has allowed our customers to bring their ideas to market more quickly and cost-efficiently.”

Amir Haghighat
Co-Founder and CTO, Baseten

Choosing NVIDIA to Support Large Language Models

Baseten knew AWS Partner NVIDIA was a leader in AI and accelerated computing and partnered with the company through NVIDIA Inception, a free program for technology startups. “Our customers are running language models, diffusion models, and different large models that require hardware that only a few vendors provide,” said Baseten co-founder and CTO Amir Haghighat. “NVIDIA is one of them—but their value goes beyond GPUs. Aside from their hardware stack, their very extensive software stack allows you to package up your models and get them ready for inference.”

Building a Foundation with AWS Services

As a company built on AWS from day one, Baseten hosted its NVIDIA GPUs on Amazon Elastic Compute Cloud (Amazon EC2). This allowed the team to reduce latency and improve its customers’ TTFT. Amazon EC2 delivers reliable, scalable infrastructure on demand, along with the capacity to scale within minutes and 99.99 percent availability. With security from the AWS Nitro System built into its foundation, Amazon EC2 provides secure compute for Baseten’s applications. Amazon EC2 instances, powered by NVIDIA GPUs, drive some of today's most sophisticated computational workloads.

To support containers running on its NVIDIA GPU-enabled Amazon EC2 instances, Baseten used Amazon Elastic Kubernetes Service (Amazon EKS). Amazon EKS allows Baseten to run and manage the Kubernetes cluster that serves as the foundation of its infrastructure. In addition, Baseten uses the open-source Karpenter autoscaler to scale clusters as request volume, throughput requirements, and hardware demand grow.
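Karpenter itself is driven by declarative Kubernetes configuration rather than application code, but the core scale-up decision it automates—how many GPU nodes are needed to satisfy pending pod demand—can be sketched as a simple bin-packing calculation (illustrative only; real provisioning also weighs instance types, zones, and pricing):

```python
import math

def nodes_needed(pending_gpu_requests: int, gpus_per_node: int) -> int:
    """Additional GPU nodes required to satisfy pending pod demand.

    Simplified model of an autoscaler's scale-up decision: pack the
    outstanding GPU requests onto identical nodes and round up.
    """
    if pending_gpu_requests <= 0:
        return 0
    return math.ceil(pending_gpu_requests / gpus_per_node)

# e.g. 13 pending GPU requests on 8-GPU instances -> 2 new nodes
print(nodes_needed(13, 8))
```

Rounding up means capacity briefly overshoots demand, which is the usual trade-off for keeping inference requests from queueing behind node provisioning.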

Gaining Access to TensorRT-LLM through NVIDIA’s Inception Program

Baseten joined NVIDIA Inception, a free program designed to nurture startups, providing co-marketing support and opportunities to connect directly with NVIDIA experts. Through the Inception program, NVIDIA gave Baseten early access to TensorRT-LLM, an open-source library for defining, optimizing, and executing LLMs for inference in production. “Our partnership with NVIDIA has been crucial for us. The TensorRT-LLM library has massively improved the experience we can give our customers—now they can run large language models and get the throughput and latency improvements they need to maintain the level of service that sets them apart in the marketplace,” said Haghighat.

NVIDIA’s extensive software stack enabled Baseten to take advantage of the NVIDIA Triton Inference Server, an open-source AI model serving platform that streamlines and accelerates the deployment of AI inference workloads in production. It helps enterprises reduce the complexity of model serving infrastructure, shorten the time needed to deploy new AI models in production, and increase AI inferencing and prediction capacity. Both NVIDIA TensorRT-LLM and Triton Inference Server are included as a part of NVIDIA AI Enterprise, which provides a production-grade, secure, end-to-end software platform for enterprises building and deploying accelerated AI software.

Increasing Throughput by 2X and Accelerating TTFT by 50%

By using TensorRT-LLM via AWS, Baseten customers have seen huge improvements in model performance, including higher throughput, lower latency, and faster TTFT. “We've seen customers in production get roughly a 2X improvement in throughput with TensorRT-LLM, essentially allowing them to service twice as many requests with the same amount of hardware—at the same cost basis,” said Haghighat. On the latency side, TensorRT-LLM has helped Baseten speed up TTFT by 50 percent. “TensorRT-LLM helps reduce latency, which is especially important where there’s a human waiting on the other side for the text to be generated,” Haghighat said.
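The headline numbers translate directly into unit economics: doubling throughput on fixed hardware halves cost per request, and a 50 percent TTFT reduction halves the first-token wait. A back-of-the-envelope sketch, where the baseline figures are illustrative assumptions rather than numbers from the case study:

```python
# Illustrative baseline figures (assumptions, not from the source)
baseline_rps = 10.0       # requests/second on a given GPU fleet
hourly_hw_cost = 40.0     # $/hour for that fleet
baseline_ttft_ms = 800.0  # baseline time to first token

# Reported improvements: ~2X throughput, 50 percent faster TTFT
optimized_rps = baseline_rps * 2.0
optimized_ttft_ms = baseline_ttft_ms * 0.5

# Same hardware bill spread over twice the requests per hour
cost_per_request_before = hourly_hw_cost / (baseline_rps * 3600)
cost_per_request_after = hourly_hw_cost / (optimized_rps * 3600)

print(f"cost/request: ${cost_per_request_before:.5f} -> "
      f"${cost_per_request_after:.5f}")
print(f"TTFT: {baseline_ttft_ms:.0f} ms -> {optimized_ttft_ms:.0f} ms")
```

The key point is that the cost saving requires no change in hardware spend: the same instances simply serve twice the request volume.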

Working with NVIDIA, Baseten has also gained support for streaming, dynamic batching, continuous batching, and quantization as part of the NVIDIA stack. “Our partnership with NVIDIA, plus their software and hardware stack, has allowed our customers to bring their ideas to market quickly and cost-efficiently,” Haghighat said. “It’s really been a game-changer all around.”

About Baseten

Baseten makes going from machine learning models to production-grade applications fast and easy. With Baseten, data science and machine learning teams can build applications without backend, frontend, or MLOps knowledge.

About AWS Partner NVIDIA

Since its founding in 1993, NVIDIA has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI, and is fueling industrial digitalization across markets. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

AWS Services Used

Amazon EKS

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service to run Kubernetes in the AWS cloud and on-premises data centers. In the cloud, Amazon EKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks.


Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 750 instances and choice of the latest processor, storage, networking, operating system, and purchase model to help you best match the needs of your workload.


More Software & Internet Success Stories


  • Software & Internet

    NeuralSpace Accelerates AI Model Training Speed by 96% in Migration to AWS with Rebura

    NeuralSpace, a London-based AI startup, had the same problem that many startups have: not enough time, not enough money, and too much to do. It needed to develop and train the AI models that powered its language AI applications—automatic translation of text and speech, automated subtitling, and automated AI dubbing of content—but these processes were taking too long. With 20–30 TB of data being used to train each model, it could take 3–6 months to train just one. And the company needed to train multiple models to develop its products. NeuralSpace knew that it needed to find a way to speed up model training that would fit within its limited budget. With the help of AWS Partner Rebura, NeuralSpace migrated to Amazon Web Services (AWS) to enable faster modeling and a crucial pivot in focus.

    2024
  • Software & Internet

    FloQast Uses Tackle ACE CRM Integration to Boost Win Rate by 26% and Cut Deal Cycle Time by 30%

    FloQast provides close management software for corporate accounting departments. Working with AWS Partner Tackle, the company wanted to move to a more strategic partnership with Amazon Web Services (AWS) by automating co-selling processes. To address its needs, FloQast deployed the Tackle ACE CRM integration, helping salespeople enter AWS opportunities into ACE directly from Salesforce. This streamlined process has helped FloQast boost its win rate by 26 percent and reduce the average deal cycle time by 30 percent.

    2024
  • Software & Internet

    Peak Defence Leverages Generative AI to Transform Cybersecurity Audits

    Peak Defence needed to scale up its processes to meet increasing demand for its cybersecurity consulting and solutions services. The company collaborated with AWS Partner Neurons Lab to automate critical security and compliance processes for customers using generative AI and added a Software as a Service (SaaS) offering to its portfolio. With help from the AI solution development experts at Neurons Lab, Peak Defence leveraged advanced, cloud-based AI tools and a scalable, serverless infrastructure to significantly improve operational efficiency. This empowers the company to handle growing demand while maintaining the strong data security measures its customers require.

    2024
  • Software & Internet

    Honeycomb Doubles AWS Opportunity Submissions in 11 Days with Clazar ACE CRM Integration

    Honeycomb provides an observability platform that helps software engineering teams determine why problems happen and who is impacted. The company sought to reduce manual processes to more effectively grow its AWS Marketplace business. Honeycomb worked with AWS Partner Clazar to implement the Clazar Salesforce ACE CRM integration, which automates data entry and integrates Salesforce and ACE data. As a result, Honeycomb doubled its opportunity submissions to AWS in 11 days and achieved a 100 percent opportunity approval rate from AWS. In addition, Honeycomb can scale to support 110 percent annual growth.

    2024

Get Started

Organizations of all sizes across all industries are transforming their businesses and delivering on their missions every day using AWS. Contact our experts and start your own AWS journey today.