Partner Success with AWS / Software & Internet / United States

May 2024

Baseten Delivers Fast, Scalable Generative AI Inference with AWS and NVIDIA

2X

faster delivery throughput for customers in production

50%

decrease in time to first token with TensorRT-LLM

Early access

to TensorRT-LLM through NVIDIA's Inception program

Overview

Baseten is a San Francisco-based machine learning infrastructure company with a focus on model inference. Baseten offers an advanced machine learning operations (MLOps) platform for model deployment, model serving, and model fine-tuning, and customers come to it to run large language models (LLMs) at scale reliably, performantly, and cost-efficiently. With LLM performance as a top priority, Baseten teamed up with AWS Partner NVIDIA and Amazon Web Services (AWS) to deliver measurable throughput and latency improvements, dramatically improving time to first token (TTFT).

Aiming to Never Keep a Customer Waiting

As a machine learning (ML) infrastructure company with a focus on model inference, Baseten helps customers run their models at scale. In many cases, customers are running LLMs to power generative artificial intelligence (AI) applications, which require high-performance hardware. Without state-of-the-art GPUs, these models can keep end users waiting while a generative AI application produces its text response. These lags in content generation create frustration, delays, and customer service issues. Reducing this latency, particularly the time it takes to generate the first token, was critical for Baseten and its customers.

“Our partnership with NVIDIA, plus their software and hardware stack, has allowed our customers to bring their ideas to market more quickly and cost-efficiently.”

Amir Haghighat
Co-Founder and CTO, Baseten

Choosing NVIDIA to Support Large Language Models

Baseten knew AWS Partner NVIDIA was a leader in AI and accelerated computing and partnered with the company through NVIDIA Inception, a free program for technology startups. “Our customers are running language models, diffusion models, and different large models that require hardware that only a few vendors provide,” said Baseten co-founder and CTO Amir Haghighat. “NVIDIA is one of them—but their value goes beyond GPUs. Aside from their hardware stack, their very extensive software stack allows you to package up your models and get them ready for inference.”

Building a Foundation with AWS Services

As a company built on AWS from day one, Baseten hosted its NVIDIA GPUs on Amazon Elastic Compute Cloud (Amazon EC2). This allowed the team to reduce latency and shorten its customers’ TTFT. Amazon EC2 delivers reliable, scalable infrastructure on demand, with the capacity to scale within minutes and 99.99 percent availability. With security from the AWS Nitro System built into its foundation, Amazon EC2 provides secure compute for Baseten’s applications. Amazon EC2 instances powered by NVIDIA GPUs drive some of today’s most sophisticated computational workloads.
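For a sense of scale, the 99.99 percent availability figure above implies a small downtime budget; the conversion below is simple arithmetic, not an AWS-published figure beyond the availability number itself.

```python
# Rough downtime budget implied by a 99.99% availability target.
# The 99.99% figure comes from the text; the rest is arithmetic.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes(availability: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of allowed downtime per period at a given availability level."""
    return (1.0 - availability) * period_minutes

print(round(downtime_minutes(0.9999), 1))  # ≈ 52.6 minutes per year
```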

To support containers running on its NVIDIA GPU-enabled Amazon EC2 instances, Baseten used Amazon Elastic Kubernetes Service (Amazon EKS). Amazon EKS allows Baseten to run and manage the Kubernetes cluster that serves as the foundation of its infrastructure. In addition, Baseten uses the Karpenter open-source software for scaling clusters as demand for requests, throughputs, and hardware increases.
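The core decision Karpenter automates can be illustrated with a toy calculation: given workload that cannot be scheduled, provision enough new nodes to place it. The function and capacities below are hypothetical illustrations, not Karpenter’s actual API.

```python
# Hypothetical sketch of a Karpenter-style scale-up decision:
# given GPU requests that cannot be scheduled on existing nodes,
# compute how many new nodes to provision. Illustrative only.
import math

def nodes_needed(pending_gpu_requests: int, gpus_per_node: int) -> int:
    """Number of new nodes required to place all pending GPU requests."""
    if pending_gpu_requests <= 0:
        return 0
    return math.ceil(pending_gpu_requests / gpus_per_node)

print(nodes_needed(13, 8))  # 2 nodes cover 13 pending GPUs at 8 GPUs/node
```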

Gaining Access to TensorRT-LLM through NVIDIA’s Inception Program

Baseten joined NVIDIA Inception, a free program designed to nurture startups, providing co-marketing support and opportunities to connect directly with NVIDIA experts. Through the Inception program, NVIDIA gave Baseten early access to TensorRT-LLM, an open-source library for defining, optimizing, and executing LLMs for inference in production. “Our partnership with NVIDIA has been crucial for us. The TensorRT-LLM library has massively improved the experience we can give our customers—now they can run large language models and get the throughput and latency improvements they need to maintain the level of service that sets them apart in the marketplace,” said Haghighat.

NVIDIA’s extensive software stack enabled Baseten to take advantage of the NVIDIA Triton Inference Server, an open-source AI model serving platform that streamlines and accelerates the deployment of AI inference workloads in production. It helps enterprises reduce the complexity of model serving infrastructure, shorten the time needed to deploy new AI models in production, and increase AI inferencing and prediction capacity. Both NVIDIA TensorRT-LLM and Triton Inference Server are included as a part of NVIDIA AI Enterprise, which provides a production-grade, secure, end-to-end software platform for enterprises building and deploying accelerated AI software.

Increasing Throughput by 2X and Accelerating TTFT by 50%

By using TensorRT-LLM on AWS, Baseten customers have seen significant improvements in model performance, including higher throughput, lower latency, and an accelerated TTFT. “We've seen customers in production get roughly a 2X improvement in throughput with TensorRT-LLM, essentially allowing them to service twice as many requests with the same amount of hardware—at the same cost basis,” said Haghighat. On the latency side, TensorRT-LLM has helped Baseten speed up TTFT by 50 percent. “TensorRT-LLM helps reduce latency, which is especially important where there’s a human waiting on the other side for the text to be generated,” Haghighat said.
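The quoted gains can be sanity-checked with simple arithmetic. The baseline figures below are invented for illustration; only the 2X throughput and 50 percent TTFT factors come from the case study.

```python
# Illustrative math behind the quoted gains. Baseline values are hypothetical;
# the 2x throughput and 50% TTFT improvement factors come from the text.

baseline_rps = 10.0          # hypothetical requests/sec on fixed hardware
baseline_ttft_ms = 800.0     # hypothetical time to first token

rps_with_trtllm = baseline_rps * 2.0        # ~2x throughput in production
ttft_with_trtllm = baseline_ttft_ms * 0.5   # 50% faster time to first token

# Serving twice the requests on the same hardware halves cost per request.
cost_per_req_ratio = baseline_rps / rps_with_trtllm
print(rps_with_trtllm, ttft_with_trtllm, cost_per_req_ratio)  # 20.0 400.0 0.5
```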

Working with NVIDIA, Baseten has also gained support for streaming, dynamic batching, continuous batching, and quantization as part of the NVIDIA stack. “Our partnership with NVIDIA, plus their software and hardware stack, has allowed our customers to bring their ideas to market quickly and cost-efficiently,” Haghighat said. “It’s really been a game-changer all around.”
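The continuous batching mentioned above can be sketched with a toy scheduler: finished sequences free their slots mid-flight and queued requests join the running batch immediately, rather than waiting for the whole batch to drain. This is a simplified illustration of the idea, not TensorRT-LLM’s actual implementation or API.

```python
# Toy illustration of continuous (in-flight) batching. Each request is
# (name, decode_steps_remaining); queued requests fill freed slots at once.
from collections import deque

def continuous_batching(requests, max_batch: int):
    """Return, per step, the names of sequences decoded in that step."""
    queue = deque(requests)
    batch, timeline = [], []
    while queue or batch:
        while queue and len(batch) < max_batch:   # admit new work in-flight
            batch.append(list(queue.popleft()))
        timeline.append([name for name, _ in batch])
        for item in batch:
            item[1] -= 1                          # one decode step each
        batch = [item for item in batch if item[1] > 0]
    return timeline

steps = continuous_batching([("a", 2), ("b", 3), ("c", 1)], max_batch=2)
print(steps)  # [['a', 'b'], ['a', 'b'], ['b', 'c']]
```

Note that "c" starts decoding the moment "a" finishes, while "b" is still in flight; with naive static batching, "c" would have waited for the entire first batch to complete.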

About Baseten

Baseten makes going from machine learning models to production-grade applications fast and easy. With Baseten, data science and machine learning teams can build applications without backend, frontend, or MLOps knowledge.

About AWS Partner NVIDIA

Since its founding in 1993, NVIDIA has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI, and is fueling industrial digitalization across markets. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

AWS Services Used

Amazon EKS

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service to run Kubernetes in the AWS cloud and on-premises data centers. In the cloud, Amazon EKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks.


Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 750 instances and choice of the latest processor, storage, networking, operating system, and purchase model to help you best match the needs of your workload.


More Software & Internet Success Stories


  • Software & Internet

    Improvement-IT Uses TechNative to Migrate to AWS, Speeds Customer Onboarding, and Reduces Support Calls by 15%

    Improvement-IT, based in the Netherlands, provides IoT solutions to a variety of organizations with an emphasis on tracking, tracing, and monitoring the status of assets. Together with its other companies Port Pay and Alltrack Medical, it offers these innovative solutions to help customers track assets in the field, manage warehouses, and optimize supply chains. However, it was being hampered by its own managed services provider, which was running both Amazon Web Services (AWS) and on-premises assets for it. It wanted a proactive partner with deep expertise to help optimize its systems, improve client onboarding times, and better detect problems before they affected customers. AWS Partner TechNative has helped it to achieve those goals, reducing customer support calls by 15 percent and cutting onboarding time by 50 percent.

    2025
  • Software & Internet

    Atlassian Reduces Latency by 17% and Saves $2.1 Million with Amazon FSx for NetApp ONTAP

    Atlassian faced a critical challenge when the storage solution hosting its Bitbucket platform’s 2.3 petabytes of data was being retired. To solve this, Atlassian joined forces with AWS Partner NetApp to migrate that data to an Amazon Web Services (AWS) storage service that used NetApp’s ONTAP file system. The migration was seamless, resulting in no discernable customer impact and no service disruptions. Within a one-month period, Atlassian successfully migrated 60 million repositories, achieving $2.1 million in annual cost savings and reducing application latency by 17 percent.

    2025
  • Software & Internet

    Glympse Reduces Onsite Safety Challenges by Enhancing Navigation with HERE and AWS

    Since 2008, Glympse, the pioneer in location-sharing technology, has been providing innovative solutions that predictively visualize and provide notifications and updates about where people, products, and assets are while in motion. When Glympse learned that companies in heavy industrial environments struggle with safety challenges, it saw an opportunity to help. Glympse, with AWS Partner HERE Technologies and supported by Amazon Web Services (AWS), developed a unique solution—one that provides onsite drivers and visitors with web-based, turn-by-turn directions. The Glympse solution reduced collision risk and supports timely hazard response. It keeps yard operators aware of all onsite activity, supporting quicker responses to potential hazards and alerting operators about who is in the yard.

    2025
  • Software & Internet

    OpenText Accelerates FedRAMP Moderate Authorization with InfusionPoints, Schellman, and AWS

    InfusionPoints, an AWS GSCA Partner, and AWS GSCA Partner Schellman Compliance worked with Canada-based OpenText Corporation to achieve a FedRAMP Moderate authorization for the OpenText IT Management Platform. After connecting with InfusionPoints through the AWS Global Security & Compliance Acceleration program, OpenText achieved a FedRAMP Moderate certification in 18 months, enabling the company to serve its US government customers seeking cloud modernization and to expand its business to other federal, state, and local government agencies and contractors.

    2025

Get Started

Organizations of all sizes across all industries are transforming their businesses and delivering on their missions every day using AWS. Contact our experts and start your own AWS journey today.