
Fireworks.ai Delivers 4x Throughput for Generative AI and Cuts Latency by up to 50% Using AWS and NVIDIA
Learn how Fireworks.ai built a cost-optimized generative AI solution using Amazon EC2 P5 Instances powered by NVIDIA H100 Tensor Core GPUs.
4x higher throughput
per instance than open-source solutions
Up to 50%
cut in latency
4x reduction
in overall costs for some customers
Simplified
developer access to large language models
Overview
Fireworks.ai set out to build a lightning-fast, affordable, and customizable generative artificial intelligence (AI) inference solution for its customers. With billions of parameters, foundation models require powerful, often costly compute resources when they're put into production. The company's founders sought to make these foundation models widely available for developers to incorporate into their applications while keeping costs reasonable, so Fireworks.ai turned to Amazon Web Services (AWS).
Fireworks.ai powers its solution using specialized instances from Amazon Elastic Compute Cloud (Amazon EC2), which provides secure and resizable compute capacity for virtually any workload. It upgraded to Amazon EC2 P5 Instances powered by NVIDIA H100 Tensor Core GPUs, which are the highest-performance GPU-based instances for deep learning and high-performance computing applications. Fireworks.ai’s generative AI solution delivers four times higher throughput per instance than open-source solutions, cuts latency in half for some customers, and meets strict enterprise-level security standards.

Opportunity | Using Amazon EC2 P5 Instances to Build a Flexible, Cost-Optimized Generative AI Solution for Fireworks.ai
The CEO and cofounder of Fireworks.ai, Lin Qiao, led the development and productionization of PyTorch, the popular open-source deep learning framework, while serving as senior director of engineering at Meta. In fact, many on the Fireworks.ai founding team worked on PyTorch, and they founded the company to empower developers to build fast, secure, and affordable generative AI products. "Achieving optimal cost-performance for scale and productionization is a primary challenge for customers developing on PyTorch," says Dmytro Dzhulgakov, Fireworks.ai cofounder and chief technology officer. "It's particularly true with generative AI products and models because of their sheer size as well as how new and fast this field is. We wanted to use AWS to help bridge this gap."
Fireworks.ai's inference solution is offered both as hosted generative AI software as a service and as containerized deployments in a customer's virtual private cloud. Developers can choose to use state-of-the-art, open-source language, image, and multimodal foundation models off the shelf, or they can customize and fine-tune models. For example, Fireworks.ai supports Stable Diffusion XL, a deep learning, text-to-image model; Llama 2 large language models with up to 70 billion parameters; and StarCoder, designed specifically for code-related tasks.
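As an illustration of what integrating a hosted open model looks like for a developer, the sketch below builds an OpenAI-style chat-completion request payload. The endpoint URL and model identifier are hypothetical placeholders for this example, not taken from Fireworks.ai's documentation.

```python
import json

# Hypothetical sketch of a chat-completion request for a hosted open model.
# The endpoint and model identifier are illustrative placeholders, not
# actual Fireworks.ai values.
API_URL = "https://inference.example.com/v1/chat/completions"  # placeholder

def build_request(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Serialize an OpenAI-style chat-completion payload as a JSON string."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)

# The JSON body would then be POSTed to API_URL with an API-key header.
body = build_request("llama-v2-70b-chat", "Summarize the release notes.")
```

In practice the request would carry an API key and be sent with an HTTP client; the same payload shape applies whether the model runs as hosted software as a service or as a containerized deployment in a customer's virtual private cloud.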

"Using AWS, Fireworks.ai helps developers integrate powerful open models into their prototype applications without breaking the bank as they experiment, explore, and play with different models."
Dmytro Dzhulgakov
Cofounder and Chief Technology Officer, Fireworks.ai
Solution | Helping Customers Triple Traffic per Instance as They Lower Costs 4x
From its inception, Fireworks.ai has collaborated closely with AWS teams to determine how best to optimize compute power for its highly demanding inference engine. At first, the company used Amazon EC2 P4d Instances, powered by NVIDIA A100 Tensor Core GPUs, which offer high performance for machine learning training and high-performance computing applications in the cloud. In July 2023, AWS announced Amazon EC2 P5 Instances, the next generation of Amazon EC2 P4 Instances and the highest-performance GPU-based instances for deep learning and high-performance computing applications. "AWS has the latest and greatest hardware," Dzhulgakov says.
Using Amazon EC2 P5 Instances, Fireworks.ai meets demanding customer requirements for performance. For example, one customer lowered the latency of its summarization model, which uses natural language processing techniques to condense written content, by 30–50 percent. "We support about three times higher traffic with a single instance," says Dzhulgakov. "Even though the instance itself is more expensive, if you calculate the cost per single request, there actually ends up being an overall cost improvement." In fact, the customer reduced its total costs by a factor of four.
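The cost arithmetic Dzhulgakov describes can be made concrete with a back-of-the-envelope calculation. The threefold traffic figure comes from the story; the relative instance price below is an assumed placeholder, since actual pricing varies.

```python
# Back-of-the-envelope cost-per-request comparison. The 3x traffic figure
# is from the story; the 1.8x relative instance price is an assumed
# placeholder, not actual AWS pricing.
p4d_hourly = 1.0          # previous-generation instance price (normalized)
p5_hourly = 1.8           # assume the newer instance costs 1.8x as much
traffic_multiplier = 3.0  # "about three times higher traffic" per instance

# Cost per request scales with instance price and inversely with the
# traffic a single instance can serve.
cost_per_request_ratio = (p5_hourly / p4d_hourly) / traffic_multiplier

print(f"{cost_per_request_ratio:.2f}x cost per request")  # prints "0.60x cost per request"
```

Even with a pricier instance, the per-request cost falls whenever the throughput gain exceeds the price premium, which is consistent with the overall cost improvement the customer reports.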
Another Fireworks.ai customer, Sourcegraph, uses StarCoder, running on Fireworks.ai, to power inference and output for the open-source edition of its AI coding assistant, Cody. After Fireworks.ai adopted Amazon EC2 P5 Instances, Cody doubled its completion acceptance rate, a measure of a model's output efficiency, and its backend latency was cut by more than half. "Using AWS, Fireworks.ai helps developers integrate powerful open models into their prototype applications without breaking the bank as they experiment, explore, and play with different models," Dzhulgakov says. "As their product grows and usage increases, the focus shifts toward speed and cost. Using Amazon EC2 P5 Instances, we provide outstanding cost per performance for our customers' use cases."
And because the Fireworks.ai inference solution built on AWS is HIPAA and SOC 2 Type II compliant, it meets the needs of companies with strict security requirements for their data. "Using AWS as an infrastructure provider does help a lot in achieving certification for our solution," says Dzhulgakov.
Outcome | Providing Excellent Cost per Performance for Customers’ Use Cases
In addition to receiving technical guidance, Fireworks.ai has collaborated with the AWS team on marketing opportunities. Fireworks.ai customers can simplify deployment and billing by purchasing Fireworks.ai’s generative AI platform-as-a-service through AWS Marketplace, a curated digital catalog that customers can use to find, buy, deploy, and manage third-party software, data, and services to build solutions and run their businesses.
Fireworks.ai aims to add more multimodal foundation models to its solution so that customers can implement function-calling, invoke other external tools, and interact with their applications in richer ways.
“It primarily comes down to the accessibility of the latest hardware, making sure that it works reliably and scales to capacity needs as our customers grow and as we grow,” Dzhulgakov says. “Those are the major highlights of working alongside AWS.”
About Fireworks.ai
Founded in 2022, Fireworks.ai provides a fast, affordable, and customizable solution for generative artificial intelligence that helps product developers run, fine-tune, and share large language models.
AWS Services Used
Amazon Elastic Compute Cloud (Amazon EC2)
Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 750 instances and choice of the latest processor, storage, networking, operating system, and purchase model to help you best match the needs of your workload.
Amazon EC2 P5 Instances
Amazon Elastic Compute Cloud (Amazon EC2) P5 instances, powered by the latest NVIDIA H100 Tensor Core GPUs, deliver the highest performance in Amazon EC2 for deep learning (DL) and high performance computing (HPC) applications.
More Software & Internet Customer Stories
The Netherlands
Improvement-IT Uses TechNative to Migrate to AWS, Speeds Customer Onboarding, and Reduces Support Calls by 15%
Improvement-IT, based in the Netherlands, provides IoT solutions to a variety of organizations with an emphasis on tracking, tracing, and monitoring the status of assets. Together with its other companies Port Pay and Alltrack Medical, it offers these innovative solutions to help customers track assets in the field, manage warehouses, and optimize supply chains. However, it was being hampered by its own managed services provider, which was running both Amazon Web Services (AWS) and on-premises assets for it. It wanted a proactive partner with deep expertise to help optimize its systems, improve client onboarding times, and better detect problems before they affected customers. AWS Partner TechNative has helped it to achieve those goals, reducing customer support calls by 15 percent and cutting onboarding time by 50 percent.
Argentina
Kovix Improves Route Efficiency by 20% With HERE on AWS
To optimize high-volume, complex routes for municipal recycling collection in Argentina, Kovix turned to AWS Partner HERE Technologies, a leader in location data and routing solutions. HERE offered enterprise-grade routing capabilities that Kovix deployed to dynamically manage hundreds of waypoints with scalability and precision. Using HERE Tour Planning, Kovix reduced route times by 20 percent and fuel expenses by 17 percent, improving operational performance for municipalities across Argentina.
United States
AWS Partner Pinecone Helps Hyperleap Build Job Seeker-focused AI-powered Job Board
Hyperleap, a company specializing in building SaaS solutions for the recruiting industry, worked with AWS Partner Pinecone to create a job board where job seekers could employ generative AI to stand out in the initial resume filter and put their best foot forward. Together, they developed Jennie Johnson, a job seeker-focused, AI-powered job board that increased click-through rates by 50 percent and provided job seekers with customized matches.
Palo Alto Networks Boosts 2,000 Developers’ Productivity Using AI Solutions from AWS, Anthropic, and Sourcegraph
Palo Alto Networks, a leading cybersecurity company, sought to boost developer productivity using generative artificial intelligence (AI) technology. The goal was to create a custom solution that would enhance the speed and quality of coding while maintaining strict security standards. By leveraging Amazon Web Services (AWS), Claude 3.5 Sonnet and Claude 3 Haiku from AWS Partner Anthropic, and Cody from AWS Partner Sourcegraph, Palo Alto Networks developed a secure AI tool for generating, optimizing, and troubleshooting code. Within three months, Palo Alto Networks onboarded 2,000 developers and increased productivity up to 40 percent, with an average of 25 percent. This custom AI solution has empowered both senior and junior developers, and the company expects further improvements in code quality and efficiency.
Get Started
Organizations of all sizes across all industries are transforming their businesses and delivering on their missions every day using AWS. Contact our experts and start your own AWS journey today.