AWS Storage Blog
How Tavily reduced AI search caching costs by 95% with Amazon S3 Express One Zone
Tavily is an AI infrastructure company building the web access layer for agents and large language models (LLMs). The company provides developer-friendly APIs that enable real-time, structured retrieval from the web. Their mission is to make information instantly accessible for intelligent systems, and they’re trusted by thousands of leading research, commercial AI teams, and enterprises worldwide. Tavily’s platform centers on AI-powered Search, Extract, Map, and Crawl APIs—designed to deliver instant, up-to-date, cleaned web content in structured formats.
AI search engines powering autonomous agents face critical performance challenges that directly impact user experience and operational efficiency. These systems require single-digit millisecond response times to maintain conversation flow and enable real-time decision making. Furthermore, they must handle unpredictable traffic patterns and scale elastically while minimizing operational overhead. Tavily’s AI search engine, designed specifically for autonomous agents, encountered these challenges with its existing document database caching layer. The cache layer was becoming increasingly expensive and difficult to manage as the user base expanded and the workload scaled. Moreover, latency spikes disrupted agent interactions, and manual capacity planning consumed engineering resources that were needed for core product development.
In this post, we discuss how Tavily successfully migrated their caching layer from a traditional document database to Amazon S3 Express One Zone, achieving performance improvements and reducing costs by up to 95%. We walk through their initial architecture challenges, the decision-making process that led to selecting the S3 Express One Zone storage class, and the step-by-step implementation strategy. The migration reduced Tavily’s caching costs while simultaneously improving response times to consistently meet their single-digit millisecond requirement. Furthermore, we examine the technical approach and performance optimization techniques that made this transformation possible, providing practical insights for organizations that are facing similar challenges with high-performance, cost-effective caching at scale.
The challenge: Why existing architecture couldn’t meet AI workload demands
The document database caching layer was originally designed for human-facing applications, not the high-throughput, low-latency requirements of autonomous AI agents. As Tavily’s platform scaled, two critical issues emerged:
Inconsistent latency
Latency inconsistency became a persistent challenge. The database’s response times varied unpredictably, which made it unsuitable for use as a hot cache. For AI agents built to deliver real-time data, even minor delays were unacceptable—these systems require contextually ranked results within a strict 10 ms window, which is a performance benchmark that the database’s variability couldn’t consistently meet.
Cost inefficiency due to lack of elasticity
Cost inefficiency challenged the platform’s scalability. The caching layer alone accounted for tens of thousands of dollars in Tavily’s monthly database bill, yet the team constantly provisioned for peak capacity that went unused during off-hours.
This wasn’t just about getting a faster database. Tavily needed a solution that scales with the business as users and activity grow. Tavily evaluated several alternatives, including self-managed caching solutions. Although they continued using Redis as an in-memory cache for their most frequently accessed data (representing about 1% of their cached URLs), they needed a secondary cache that could scale elastically with their traffic, deliver high single-digit millisecond latency, and integrate seamlessly with their existing Amazon S3-based storage workflows. S3 Express One Zone was the ideal fit.
Solution overview
This solution demonstrates using S3 Express One Zone as a high-performance cache.
Why S3 Express One Zone?
S3 Express One Zone is a purpose-built storage class that delivers consistent single-digit millisecond latency, which makes it ideal for performance-sensitive applications such as AI search engines. Unlike traditional storage solutions, S3 Express One Zone scales automatically without provisioning, eliminating capacity planning headaches while maintaining performance under varying loads. Its serverless nature removes operational overhead, so that engineering teams can focus on product development rather than infrastructure management. S3 Express One Zone uses an API-driven architecture to integrate into existing workflows while providing performance characteristics previously available only in premium-priced, specialized database solutions.
Three verified advantages drove Tavily’s decision to adopt S3 Express One Zone:
- The storage class provides predictable performance through its strong consistency model, so that AI agents always receive the most current cached data without stale reads. The migration delivered a significant improvement, transforming Tavily’s ability to handle peak traffic confidently without latency spikes or performance degradation.
- Cost efficiency was equally transformative. Unlike the document database’s provisioned capacity model, S3 Express One Zone charges only for actual usage through a pay-per-request pricing structure. This aligned perfectly with Tavily’s bursty traffic patterns, where demand could spike unpredictably.
- Because Tavily already used S3 for long-term storage, migrating the caching layer required minimal architectural changes. They configured S3 Express One Zone as their primary cache and updated their application logic to write cached data to S3 Standard as a backup for redundancy. This dual-write approach provides durability without relying on automatic tiering or third-party tools.
Co-locating compute and storage for minimum latency
To further minimize latency, Tavily deployed their S3 Express One Zone cache in the same Availability Zone (AZ) as their compute resources. This co-location reduces network hops, so that in-Availability Zone requests achieve single-digit millisecond latency. Cross-AZ requests incur slightly higher latency in the low double-digit millisecond range. The strategic co-location delivered consistent performance that was previously unattainable with their database solution.
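Co-location can be checked in code, because a directory bucket name embeds the zone ID it lives in, following the pattern base-name--zone-id--x-s3. The following Python sketch compares that zone ID with the zone ID of the compute host. The bucket name and function names are hypothetical, the regular expression covers only Availability Zone IDs such as use1-az4, and how the compute zone ID is obtained (for example, from instance metadata) is left to the caller.

```python
import re

# Directory bucket names follow the pattern <base-name>--<zone-id>--x-s3.
# This pattern is a simplification that only recognizes Availability Zone IDs
# of the form "use1-az4"; it is not a full bucket-name validator.
_DIRECTORY_BUCKET = re.compile(
    r"^(?P<base>[a-z0-9][a-z0-9-]*)--(?P<zone>[a-z0-9]+-az\d+)--x-s3$"
)


def bucket_zone_id(bucket_name: str) -> str:
    """Return the zone ID embedded in a directory bucket name."""
    match = _DIRECTORY_BUCKET.match(bucket_name)
    if match is None:
        raise ValueError(f"not a directory bucket name: {bucket_name!r}")
    return match.group("zone")


def is_colocated(bucket_name: str, compute_zone_id: str) -> bool:
    """True when the bucket and the compute host share a zone ID.

    Compare zone IDs (use1-az4), not zone names (us-east-1a): zone names
    can map to different physical zones in different AWS accounts.
    """
    return bucket_zone_id(bucket_name) == compute_zone_id
```

A check like this could run at service startup and log a warning when the cache and the compute host are in different zones.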
Business impact
This section outlines how the solution provides faster performance, lower costs, and effortless scaling.
Performance gains: From spikes to stability
The performance improvements were immediate and measurable. S3 Express One Zone delivered single-digit millisecond median response times for in-Availability Zone cached content and high single-digit millisecond latency for cross-Availability Zone requests. Since the migration, previously persistent latency spikes gave way to reliable performance. This enabled seamless handling of peak traffic without degradation. This consistency in latency response has been critical for maintaining the performance of autonomous AI agents.
Throughput and scalability benefits
S3 Express One Zone can handle hundreds of thousands of requests per second per directory bucket, which meets Tavily’s current performance requirements and removes the bottlenecks that previously plagued their database cluster. The previous system required constant manual scaling and capacity planning. S3 Express One Zone enables the caching layer to scale automatically to accommodate traffic spikes without intervention. With database capacity no longer a concern, the team redirected its focus to advancing AI models and expanding the platform’s feature set.
Cost savings: Paying only for actual usage
The financial impact was equally transformative. Tavily replaced the document database’s provisioned capacity with the S3 Express One Zone pay-only-for-what-you-use pricing, which reduced their caching costs by 95%.
This pay-per-request model eliminated the need for capacity planning. With their previous database solution, every traffic spike required either overpaying for headroom or risking performance issues. With S3 Express One Zone, the system scales elastically.
Scalability: Built for exponential growth
Tavily’s stateless, S3-based architecture enables effortless scaling without manual intervention. The system now handles traffic spikes seamlessly and serves diverse workloads—from latency-critical AI agent queries to cost-sensitive batch processing—within the same framework.
Tavily could match the storage class to each workload: S3 Express One Zone for low-latency, real-time agent queries and S3 Standard for cost-efficient backups, all within a single architecture and codebase. This flexibility became critical as the user base grew.
How Tavily implemented the solution
1. Creating the directory bucket
Tavily started their migration by creating an S3 Express One Zone directory bucket for the primary cache. Using AWS Command Line Interface (AWS CLI) commands, the team provisioned the bucket and had it ready in less than five minutes.
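Tavily used the AWS CLI for this step, and the post does not show the commands. As an illustration only, the following Python sketch builds the equivalent request for the boto3 create_bucket call. The base name and zone ID are placeholders, the helper names are hypothetical, and the client is passed in so that the sketch does not depend on credentials or a specific Region.

```python
from typing import Any, Dict


def directory_bucket_request(base_name: str, zone_id: str) -> Dict[str, Any]:
    """Build create_bucket arguments for a single-AZ directory bucket."""
    # Directory bucket names must end with --<zone-id>--x-s3.
    bucket = f"{base_name}--{zone_id}--x-s3"
    return {
        "Bucket": bucket,
        "CreateBucketConfiguration": {
            "Location": {"Type": "AvailabilityZone", "Name": zone_id},
            "Bucket": {
                "Type": "Directory",
                "DataRedundancy": "SingleAvailabilityZone",
            },
        },
    }


def create_directory_bucket(s3_client: Any, base_name: str, zone_id: str) -> str:
    """Create the bucket with the given client and return its full name.

    s3_client is assumed to behave like boto3.client("s3").
    """
    request = directory_bucket_request(base_name, zone_id)
    s3_client.create_bucket(**request)
    return request["Bucket"]
```

Calling create_directory_bucket(boto3.client("s3"), "example-cache", "use1-az4") would request a bucket named example-cache--use1-az4--x-s3, assuming that zone ID belongs to the client's Region and supports directory buckets.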
2. Migrating data with dual writes
Tavily updated their application logic to write all cached data to both S3 Express One Zone and S3 Standard. This dual-write approach provides redundancy without relying on automatic tiering or third-party tools. The application handles the writes to both buckets, which keeps the architecture streamlined and predictable.
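A minimal sketch of application-level dual writes follows. It is not Tavily's code: the class name and bucket names are hypothetical, the write order (primary cache first, backup second) is an assumption, and s3_client stands in for any object that exposes put_object the way a boto3 S3 client does.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DualWriteCache:
    """Writes each cache entry to two buckets from application code."""

    s3_client: Any         # assumed to expose put_object(Bucket=, Key=, Body=)
    express_bucket: str    # S3 Express One Zone directory bucket (primary cache)
    standard_bucket: str   # S3 Standard bucket (backup copy)

    def put(self, key: str, body: bytes) -> None:
        # Write the primary copy first so that readers see fresh data quickly.
        self.s3_client.put_object(Bucket=self.express_bucket, Key=key, Body=body)
        # Then write the backup. An exception here propagates to the caller,
        # which can retry safely because a repeated put overwrites the same key.
        self.s3_client.put_object(Bucket=self.standard_bucket, Key=key, Body=body)
```

Keeping both writes in one method means every caller gets the backup copy without having to remember it, which matches the post's goal of a streamlined and predictable architecture.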
3. Implementing the caching hierarchy
Tavily’s caching layer now operates in a three-tier hierarchy. Redis serves as the first-tier cache for the most frequently accessed URLs (approximately 1% of all cached data, representing approximately 30% of requests). S3 Express One Zone serves as the second-tier cache, providing single-digit millisecond latency for the remaining hot data. S3 Standard serves as both a backup and cold cache for long-tail content, as shown in the following figure.
The application reads from S3 Express One Zone first for low latency and writes a backup copy of each entry to S3 Standard for durability. If the primary cache becomes unavailable, then the application falls back to S3 Standard.
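The read path through these tiers can be sketched as a loop that tries each tier in order and falls through on a miss or an error. In the following Python sketch the tier readers are hypothetical callables that return None on a miss. In a real system they would wrap a Redis GET and S3 GetObject calls, and catching every exception as "tier unavailable" is a simplification.

```python
from typing import Callable, Optional, Sequence, Tuple

# A reader takes a cache key and returns the cached bytes, or None on a miss.
Reader = Callable[[str], Optional[bytes]]


def tiered_get(
    key: str, tiers: Sequence[Tuple[str, Reader]]
) -> Tuple[Optional[str], Optional[bytes]]:
    """Return (tier_name, value) from the first tier that holds the key.

    Tiers are tried in the order given. Returns (None, None) when every
    tier misses or fails.
    """
    for name, read in tiers:
        try:
            value = read(key)
        except Exception:
            # Treat any failure as "tier unavailable" and try the next one.
            continue
        if value is not None:
            return name, value
    return None, None
```

Returning the tier name with the value makes it straightforward to record which tier served each request.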

4. Monitoring with Amazon CloudWatch
Amazon CloudWatch tracks latency, request volumes, and cache efficiency. Tavily uses these metrics to optimize data placement and fine-tune their caching logic.
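The post does not say which metrics Tavily publishes or how. As one possible approach, the following Python sketch builds custom metric data points for read latency and cache hits in the shape that the CloudWatch put_metric_data call accepts. The namespace, the metric names, and the Tier dimension are assumptions, and the client is passed in rather than created here.

```python
from typing import Any, Dict, List


def cache_metric_data(tier: str, latency_ms: float, hit: bool) -> List[Dict[str, Any]]:
    """Build two data points for one cache read: latency and hit/miss."""
    dimensions = [{"Name": "Tier", "Value": tier}]
    return [
        {
            "MetricName": "ReadLatency",  # hypothetical metric name
            "Dimensions": dimensions,
            "Value": latency_ms,
            "Unit": "Milliseconds",
        },
        {
            "MetricName": "CacheHit",     # hypothetical metric name
            "Dimensions": dimensions,
            "Value": 1.0 if hit else 0.0,
            "Unit": "Count",
        },
    ]


def publish(cloudwatch_client: Any, namespace: str, data: List[Dict[str, Any]]) -> None:
    """Send the data points. The client is assumed to behave like
    boto3.client("cloudwatch")."""
    cloudwatch_client.put_metric_data(Namespace=namespace, MetricData=data)
```

Because CacheHit is recorded as 1 or 0, its average over a period is the hit ratio for that tier. A production service would likely batch data points rather than publish on every read.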
Lessons learned and best practices
Tavily’s migration offers several key insights for teams considering similar architectures.
Co-locate compute and storage for optimal latency: Deploying S3 Express One Zone in the same Availability Zone as compute resources is essential for minimizing latency in performance-critical applications. Co-location reduces network hops, which was crucial for meeting Tavily’s latency requirements. This design choice helped Tavily achieve single-digit millisecond response times for in-Availability Zone requests.
Design for redundancy with explicit backup writes: S3 Express One Zone only replicates data within a single Availability Zone. Therefore, Tavily implemented explicit backup writes to S3 Standard using application logic to provide redundancy and durability while keeping the architecture clear.
Monitor performance continuously and iterate: The CloudWatch metrics have been invaluable for optimizing Tavily’s caching layer. The team tracks latency, request volumes, and cache efficiency to identify opportunities for improvement. A data-driven approach to tracking workload-critical metrics enabled simultaneous gains in performance and cost efficiency.
Start with a pilot migration: Tavily began their migration by moving a subset of their cache to S3 Express One Zone, validating performance and cost savings before committing to a full migration. This phased approach minimized risk and enabled rapid iteration based on real-world results.
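The post does not describe how Tavily chose the pilot subset. One common way to move a stable fraction of keys is to hash each key into one of 100 slots and route the lowest slots to the new cache. The following Python sketch shows that idea as an assumption, not as Tavily's method, and the function name is hypothetical.

```python
import hashlib


def in_pilot(key: str, percent: int) -> bool:
    """Decide deterministically whether a cache key belongs to the pilot.

    The key is hashed into a slot from 0 to 99. Keys whose slot is below
    `percent` use the new cache, so raising `percent` only adds keys and
    never moves an existing pilot key back.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % 100
    return slot < percent
```

Because the decision depends only on the key, the same URL is always served by the same cache during the pilot, which keeps before-and-after latency comparisons meaningful.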
Conclusion
Tavily’s migration to S3 Express One Zone demonstrates how AI workloads can benefit from serverless, high-performance storage. They combined the low latency of S3 Express One Zone with S3 Standard for backups to build a caching system that scales elastically, performs consistently, and cuts caching costs by 95% compared with their previous database solution.
If your caching layer struggles with latency inconsistency or provisioning overhead, then evaluate S3 Express One Zone for your workload. Latency-sensitive AI applications can achieve consistent single-digit millisecond in-Availability Zone response times. The pay-per-request pricing model aligns costs directly with usage, while explicit backup writes to S3 Standard ensure multi-Availability Zone resilience. Most importantly, the stateless architecture enables effortless scaling without operational overhead.
To explore S3 Express One Zone for your caching layer, refer to the S3 Express One Zone documentation. To estimate your costs, use the AWS Pricing Calculator. For further guidance, contact your AWS account team, or visit the AWS Storage Blog for more customer stories and best practices.