
Fireworks AI Delivers 4x Throughput for Generative AI and Cuts Latency by up to 50% Using AWS and NVIDIA

Learn how Fireworks AI built a cost-optimized generative AI solution using Amazon EC2 P5 Instances powered by NVIDIA H100 Tensor Core GPUs.

4x higher throughput per instance than open-source solutions

Up to 50% cut in latency

4x reduction in overall costs for some customers


developer access to large language models

Overview

Fireworks AI set out to build a lightning-fast, affordable, and customizable generative artificial intelligence (AI) inference solution for its customers. With billions of parameters, foundation models require powerful, often costly compute resources as they’re put into production. The company’s founders sought to make these foundation models widely available for developers to incorporate into their applications while keeping costs reasonable for customers, and they turned to Amazon Web Services (AWS).

Fireworks AI powers its solution using specialized instances from Amazon Elastic Compute Cloud (Amazon EC2), which provides secure and resizable compute capacity for virtually any workload. It upgraded to Amazon EC2 P5 Instances powered by NVIDIA H100 Tensor Core GPUs, which are the highest-performance GPU-based instances in Amazon EC2 for deep learning and high-performance computing applications. Fireworks AI’s generative AI solution delivers four times higher throughput per instance than open-source solutions, cuts latency by up to 50 percent for some customers, and meets strict enterprise-level security standards.


Opportunity | Using Amazon EC2 P5 Instances to Build a Flexible, Cost-Optimized Generative AI Solution for Fireworks AI

The CEO and cofounder of Fireworks AI, Lin Qiao, led the development and productionization of the popular open-source deep learning framework PyTorch while she served as senior director of engineering at Meta. In fact, many on the founding team worked on PyTorch to empower developers to build fast, secure, affordable generative AI products. “Achieving optimal cost-performance for scale and productionization is a primary challenge for customers developing on PyTorch,” says Dmytro Dzhulgakov, cofounder and chief technology officer of Fireworks AI. “It’s particularly true with generative AI products and models because of their sheer size as well as how new and fast this field is. We wanted to use AWS to help to bridge this gap.”

Fireworks AI’s inference solution both hosts generative AI software as a service and supports containerized deployments in a customer’s virtual private cloud. Developers can use state-of-the-art, open-source language, image, and multimodal foundation models off the shelf, or they can customize and fine-tune models. For example, Fireworks AI supports Stable Diffusion XL, a deep learning text-to-image model; Llama 2 large language models with up to 70 billion parameters; and StarCoder, a model designed specifically for code-related tasks.


“Using AWS, Fireworks AI helps developers integrate powerful open models into their prototype applications without breaking the bank as they experiment, explore, and play with different models.”

Dmytro Dzhulgakov
Cofounder and Chief Technology Officer, Fireworks AI

Solution | Helping Customers Triple Traffic per Instance as They Lower Costs 4x

From its inception, Fireworks AI has collaborated closely with AWS teams to determine how best to optimize compute power for its highly demanding inference engine. At first, the company used Amazon EC2 P4d Instances, powered by NVIDIA A100 Tensor Core GPUs, which offer high performance for machine learning training and high-performance computing applications in the cloud. In July 2023, AWS announced Amazon EC2 P5 Instances, the successor to Amazon EC2 P4 instances and the highest-performance GPU-based instances in Amazon EC2 for deep learning and high-performance computing applications. “AWS has the latest and greatest hardware,” Dzhulgakov says.

Using Amazon EC2 P5 Instances, Fireworks AI meets demanding customer requirements for performance. For example, one customer lowered the latency of its summarization model, which uses natural language processing techniques to condense written content, by 30–50 percent. “We support about three times higher traffic with a single instance,” says Dzhulgakov. “Even though the instance itself is more expensive, if you calculate the cost per single request, there actually ends up being overall cost improvement.” In fact, the customer cut total costs by a factor of four.
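The cost-per-request reasoning in the quote can be illustrated with a minimal sketch. The price and throughput values below are hypothetical, normalized placeholders, not AWS prices or Fireworks AI measurements; only the threefold traffic multiplier comes from the quote.

```python
def cost_per_request(hourly_price: float, requests_per_hour: float) -> float:
    """Return the instance cost attributed to one request."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return hourly_price / requests_per_hour


# Hypothetical, normalized inputs (not AWS prices or Fireworks AI measurements).
OLD_HOURLY_PRICE = 1.0           # baseline instance, normalized to 1
NEW_HOURLY_PRICE = 2.0           # assumed: the newer instance costs twice as much
OLD_REQUESTS_PER_HOUR = 1_000.0  # assumed baseline traffic per instance
TRAFFIC_MULTIPLIER = 3.0         # "about three times higher traffic" from the quote

old_cost = cost_per_request(OLD_HOURLY_PRICE, OLD_REQUESTS_PER_HOUR)
new_cost = cost_per_request(
    NEW_HOURLY_PRICE, OLD_REQUESTS_PER_HOUR * TRAFFIC_MULTIPLIER
)
cost_ratio = new_cost / old_cost  # below 1.0 means cheaper per request

print(f"cost per request, baseline: {old_cost:.6f}")
print(f"cost per request, newer:    {new_cost:.6f}")
print(f"newer / baseline ratio:     {cost_ratio:.3f}")
```

Under these assumptions, the newer instance is cheaper per request whenever its price multiple is smaller than its traffic multiple. The sketch does not attempt to reproduce the customer's reported fourfold saving, which likely reflects factors beyond instance price and throughput.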

Another customer, Sourcegraph, uses StarCoder, which runs on Fireworks AI, to power inference and output for the open-source edition of its AI coding assistant, Cody. After Fireworks AI adopted Amazon EC2 P5 Instances, Cody doubled its completion acceptance rate, a measure of a model’s output efficiency. Additionally, Cody’s backend latency improved more than twofold. “Using AWS, Fireworks AI helps developers integrate powerful open models into their prototype applications without breaking the bank as they experiment, explore, and play with different models,” Dzhulgakov says. “As their product grows and usage increases, the focus shifts toward speed and cost. Using Amazon EC2 P5 Instances, we provide outstanding cost per performance for our customers’ use cases.”

Because the inference solution built on AWS is HIPAA and SOC 2 Type II compliant, it also meets the needs of companies with strict data security requirements. “Using AWS as an infrastructure provider does help a lot in achieving certification for our solution,” says Dzhulgakov.

Outcome | Providing Excellent Cost per Performance for Customers’ Use Cases

In addition to receiving technical guidance, Fireworks AI has collaborated with the AWS team on marketing opportunities. Fireworks AI customers can simplify deployment and billing by purchasing Fireworks AI’s generative AI platform as a service through AWS Marketplace, a curated digital catalog that customers can use to find, buy, deploy, and manage third-party software, data, and services to build solutions and run their businesses. Fireworks AI aims to add more multimodal foundation models to its solution so that customers can implement function calling, invoke other external tools, and interact with their applications in richer ways.

“It primarily comes down to the accessibility of the latest hardware, making sure that it works reliably and scales to capacity needs as our customers grow and as we grow,” Dzhulgakov says. “Those are the major highlights of working alongside AWS.”


Founded in 2022, Fireworks AI provides a fast, affordable, and customizable solution for generative artificial intelligence that helps product developers run, fine-tune, and share large language models.

AWS Services Used

Amazon Elastic Compute Cloud (Amazon EC2)

Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 750 instances and choice of the latest processor, storage, networking, operating system, and purchase model to help you best match the needs of your workload.


Amazon EC2 P5 Instances

Amazon Elastic Compute Cloud (Amazon EC2) P5 instances, powered by the latest NVIDIA H100 Tensor Core GPUs, deliver the highest performance in Amazon EC2 for deep learning (DL) and high performance computing (HPC) applications.

