AWS Database Blog
GroundTruth reduces costs by 45% and improves reliability migrating from Aerospike to Amazon ElastiCache for Valkey
This is a guest post by Aman Malkar, Software Engineer, and Xun Gong, Software Engineering Manager, at GroundTruth, in partnership with AWS.
GroundTruth, an advertising platform leading the way in location- and behavior-based marketing, empowers brands to connect with consumers through real-world behavioral data to drive real business results. GroundTruth delivers high-impact, targeted advertising across mobile, desktop, connected TV, digital out-of-home, digital audio, and emerging digital channels, all with built-in attribution to real-world outcomes, like foot traffic, market share growth, and sales lift.
As our advertising platform scaled to process an increasing volume of ad requests and third-party segment ingestion, maintaining our Aerospike-based caching infrastructure introduced significant operational complexity and rising costs, while also compromising performance and limiting our ability to scale efficiently. Our architecture required a robust caching solution to enrich ad requests with audience data while maintaining low latency for our advertising delivery platform.
To meet our requirements, we implemented Amazon ElastiCache for Valkey, which streamlined our operations, improved reliability, and reduced costs. ElastiCache for Valkey is a fully managed, in-memory data store offering automatic failover, scaling, and maintenance with zero downtime. The service runs in multiple Availability Zones for high availability and integrates seamlessly with other AWS services through AWS Identity and Access Management (IAM) roles. ElastiCache for Valkey offered performance improvements, open source innovations, and a significant price advantage.
In this post, we walk through our migration journey, covering the migration strategy we adopted, the optimizations we made to reduce cost by 45%, reliability improvements including reducing write failures by 20x, and operational gains from managed service capabilities.
Solution overview
Our user segment store caching infrastructure serves two critical functions in our advertising platform:
- Read operations – Amazon Elastic Compute Cloud (Amazon EC2) clusters perform API calls to enrich incoming ad requests with user profile data. The user store cache layer maintains historical profiles to enhance targeting attributes for advertising campaigns. This system enables real-time ad delivery through high-throughput, low-latency profile lookups.
- Write operations – Amazon EMR jobs process upstream events, including new user profile additions, user profile updates, and user profile deletions. These writers perform CRUD operations on our cache and ensure current data is available for reader queries. Completeness and robust writes with low error rates are critical for data accuracy.
The following diagram compares our data processing layer architecture with Aerospike (left) and ElastiCache for Valkey (right).

During proof-of-concept (POC) load testing at a sustained throughput of 1 million transactions per second (TPS), both Aerospike and ElastiCache for Valkey achieved comparable p90 read latency of approximately 4.5 milliseconds. Notably, ElastiCache exhibited a substantially lower error and timeout rate (approximately 0.14%, compared to approximately 1% for Aerospike).
Challenges with Aerospike caching solutions
For GroundTruth, the Aerospike caching deployment presented a range of technical and operational challenges that hindered agility and cost-efficiency:
- Upgrading the Aerospike NoSQL cache to newer versions is a complex and resource-intensive process. It often requires a dedicated engineering effort, consuming significant time and operational bandwidth, without offering immediate functional or cost benefits. This level of overhead slows innovation and increases technical debt.
- Licensing is based on the volume of unique data stored, which is difficult to predict and optimize. As a result, capacity must be provisioned upfront for peak usage, leading to underutilized licenses and missed cost-saving opportunities.
- Scaling the Aerospike cache cluster is a manual and time-consuming process. Nodes can only be added one at a time, and each addition can take hours to complete, which creates operational bottlenecks, especially when rapid scaling is needed to meet demand.
Migration objectives
To guide our migration, we defined four clear objectives:
- Optimize TCO by at least 30% while driving operational efficiency and removing licensing dependencies
- Boost reliability and performance to provide consistent ad delivery at scale
- Unlock scalability to support our rapidly growing business needs
- Preserve low latency essential for real-time ad targeting
Migration journey
We chose Amazon ElastiCache for Valkey because it addressed our scaling and reliability challenges while reducing costs: it eliminated licensing fees, its Multi-AZ architecture improved availability, and it provided built-in scalability. The fully managed service also simplified operations, freeing our teams to focus on delivering new customer features.
To achieve a zero-downtime migration, we followed a phased approach with built-in rollback procedures and established a comprehensive monitoring framework to track key performance indicators throughout the transition. The following sections describe the key tasks accomplished at each stage.
Key space analysis
This migration provided an opportunity to reassess our audience targeting requirements, driving a comprehensive analysis and sanitization of existing data. As a result, we eliminated stale data columns no longer aligned with current needs and removed low-value user profiles—defined as users observed only once in the past few months within the GroundTruth bid stream.
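The low-value profile rule can be expressed as a small filter. The following is an illustrative sketch, not GroundTruth's actual implementation; the 90-day window and the shape of the `observations` input are assumptions:

```python
from datetime import datetime, timedelta

def is_low_value(observations, now, window_days=90):
    """Return True if a profile was observed at most once in the lookback
    window, mirroring the 'seen only once in the past few months' rule.

    observations: list of datetime objects for bid-stream sightings
    (assumed shape for this sketch).
    """
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in observations if t >= cutoff]
    return len(recent) <= 1

now = datetime(2024, 6, 1)
# A profile seen once in the window is a removal candidate;
# one seen repeatedly is kept.
single = is_low_value([datetime(2024, 5, 1)], now)
repeat = is_low_value([datetime(2024, 5, 1), datetime(2024, 5, 20)], now)
```

A batch job applying such a predicate during extraction is one way the stale-profile cleanup could be wired into the migration pipeline.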
Data modeling optimization
In Aerospike, we modeled user profile data using the JSON data structure. As part of optimizing the data model in ElastiCache for Valkey, we evaluated three options—JSON, Hashes, and Protocol Buffers (Protobuf)—to understand their respective trade-offs, strengths, and limitations. This analysis led us to transition from JSON to Protobuf, delivering significantly improved space efficiency for our use case.
The following table shows a sample data model.
| Key | Attribute1 | Attribute2 | Attribute3 | Attribute4 | Attribute5 |
| --- | --- | --- | --- | --- | --- |
| xxx | TYPE_A | MAP('{1001:25847, 2002:1892, 3003:156}') | LIST('["item1", "item2", "item3", "item4"]') | 42 | active |
The following table compares the data structures.
| Type | Trade-offs |
| --- | --- |
| JSON | Offered path-based operations and atomic nested updates, but consumed more memory and required full document reads for field access. |
| Hashes | Provided space efficiency and fast operations, but required manual parsing for each field and lacked built-in nested operations. |
| Protocol Buffers | For our use case, Protobuf emerged as the optimal solution: it reduced infrastructure costs through compact serialization while maintaining type safety for our complex user profile data structures. The full-profile read-modify-write access pattern aligned well with the Protobuf serialization approach. |
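The memory advantage of a binary encoding over JSON can be illustrated with a small sketch. Protobuf requires generated classes, so this example uses Python's `struct` module to mimic a compact, schema-driven binary layout; the profile fields follow the sample data model above and are otherwise hypothetical:

```python
import json
import struct

# Hypothetical user profile resembling the sample data model above.
profile = {
    "type": "TYPE_A",
    "segments": {1001: 25847, 2002: 1892, 3003: 156},  # segment ID -> score
    "items": ["item1", "item2", "item3", "item4"],
    "count": 42,
    "status": "active",
}

def encode_json(p):
    # JSON repeats field names as text and stores numbers as digit strings.
    return json.dumps(p, separators=(",", ":")).encode()

def encode_binary(p):
    # Schema-driven layout: no field names on the wire, fixed-width ints,
    # length-prefixed strings (Protobuf uses varints, but the idea is the same).
    out = bytearray()
    out += struct.pack("<B", len(p["type"])) + p["type"].encode()
    out += struct.pack("<H", len(p["segments"]))
    for k, v in p["segments"].items():
        out += struct.pack("<II", k, v)
    out += struct.pack("<B", len(p["items"]))
    for item in p["items"]:
        out += struct.pack("<B", len(item)) + item.encode()
    out += struct.pack("<I", p["count"])
    out += struct.pack("<B", len(p["status"])) + p["status"].encode()
    return bytes(out)

json_size = len(encode_json(profile))
bin_size = len(encode_binary(profile))
```

Because the schema lives in code rather than in every stored value, the binary form is substantially smaller, which is the effect that drove the memory savings at cache scale.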
Dual read/write implementation and performance validation
To implement the dual read/write mechanism, we followed these steps to facilitate a safe transition to ElastiCache for Valkey:
- Preparation – Used the EMR Loader to extract data from the Aerospike cluster and stored it in Amazon Simple Storage Service (Amazon S3) in Parquet format, then used that to bootstrap the ElastiCache for Valkey cluster
- Dual write/read – To sync ongoing changes, we maintained parallel writes to both Aerospike and ElastiCache while implementing a configurable read traffic routing mechanism in our ad servers and data servers, which allowed us to gradually increase cache utilization from 0% to 100%. We started small, routing 1–5% of traffic to ElastiCache to validate its performance against Aerospike, monitoring key metrics such as cache hit rates, latency, and error rates at each increment. Traffic was then scaled up in stages (5% → 10% → 25% → 50% → 100%), with the flexibility to reduce the percentage immediately if any issues arose during real-world testing under production load
- Validation phase – Conducted continuous real-time monitoring and performance comparisons between Aerospike and ElastiCache under production load, ensuring stability, accuracy, and seamless migration
- Cutover phase – For the final migration, we froze writers during full data extraction to ensure synchronization, executed a 27-hour data dump from Aerospike, and performed a one-hour EMR load into ElastiCache before resuming normal operations.
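The configurable read-routing mechanism from the dual write/read step can be sketched as a deterministic percentage gate. This is an assumed implementation, not GroundTruth's actual code; hashing the key (rather than random sampling) keeps each user's reads pinned to one store, making side-by-side metric comparisons reproducible:

```python
import zlib

def route_to_valkey(user_key: str, rollout_pct: int) -> bool:
    """Route rollout_pct% of read traffic to the new store.

    Each key lands in a stable bucket 0-99; keys below the threshold
    go to ElastiCache, the rest stay on Aerospike. Raising the
    threshold never un-routes a key that was already migrated.
    """
    bucket = zlib.crc32(user_key.encode("utf-8")) % 100
    return bucket < rollout_pct

# Ramp schedule from the migration: raise the percentage stage by stage.
for pct in (1, 5, 10, 25, 50, 100):
    routed = sum(route_to_valkey(f"user-{i}", pct) for i in range(10_000))
```

Because the threshold is a single config value, rolling back to a lower percentage is an immediate, deterministic operation.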
To further improve performance, we implemented two key strategies with ElastiCache for Valkey. First, we used Redisson’s asynchronous APIs with readMode: MASTER_SLAVE to distribute read operations across primary and replica nodes, taking advantage of rapid data replication. Second, we implemented distributed locking to prevent data conflicts. These improvements resolved the consistency issues we previously experienced with our Aerospike implementation, where simultaneous updates could cause data mismatches.
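The distributed locking strategy typically follows the SET-with-NX-and-TTL token pattern (which Redisson's `RLock` implements for you). The sketch below illustrates the pattern's semantics with an in-memory stand-in for the cluster; the class and function names are hypothetical:

```python
import time
import uuid

class FakeStore:
    """Minimal in-memory stand-in for a Valkey connection (illustration only)."""

    def __init__(self):
        self._data = {}  # key -> (token, expiry in monotonic seconds)

    def set_nx_px(self, key, token, ttl_ms):
        # Mirrors SET key token NX PX ttl: succeeds only if the key is
        # absent or its TTL has expired, so crashed writers can't wedge the lock.
        current = self._data.get(key)
        if current is not None and current[1] > time.monotonic():
            return False
        self._data[key] = (token, time.monotonic() + ttl_ms / 1000.0)
        return True

    def delete_if_token(self, key, token):
        # Mirrors the compare-and-delete release (done via a Lua script in
        # practice), so a writer never releases a lock it no longer holds.
        current = self._data.get(key)
        if current is not None and current[0] == token:
            del self._data[key]
            return True
        return False

def with_profile_lock(store, user_key, update_fn, ttl_ms=5000):
    """Run update_fn only while holding the per-profile lock."""
    lock_key = "lock:" + user_key
    token = str(uuid.uuid4())
    if not store.set_nx_px(lock_key, token, ttl_ms):
        raise RuntimeError("profile is being updated by another writer")
    try:
        return update_fn()
    finally:
        store.delete_if_token(lock_key, token)
```

Scoping the lock to a single profile key keeps contention low: two writers collide only when they target the same user at the same moment.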
As part of our migration strategy to ElastiCache for Valkey, we implemented a dual read/write architecture. The following diagram illustrates the end-to-end data flow—from Aerospike through Amazon EMR processing to ElastiCache clusters—demonstrating how reads and writes are distributed across the system.
Results and benefits
Migrating to ElastiCache for Valkey delivered substantial improvements over our previous implementation, including higher performance, greater reliability, lower costs, and an enhanced end-user experience. Notably, GroundTruth reduced costs by 45% and improved write reliability by 20×. The following table summarizes the full set of improvements.
| Dimension | Aerospike | Amazon ElastiCache for Valkey | Notes |
| --- | --- | --- | --- |
| Cost | Baseline | 45% savings | Savings driven by key space reduction, Protobuf memory efficiency, and elimination of licensing fees |
| Reliability: write failure rate | 2.5% (average) | 0.14% (average) | Approximately 20 times lower |
| Reliability: reader timeouts | 0.14% | 0.000006% | |
| Reliability: reader exceptions | Occasional | Zero exceptions | |
| Reliability: write success rate | 97.5% (average) | 99.86% (average) | |
| Operations | Self-managed | Fully managed | |
| Memory efficiency | Baseline | 40% less memory | Optimized through key space analysis and Protobuf serialization |
| p90 read latency at a sustained 1 million TPS | Approximately 4.5 milliseconds | Approximately 4.5 milliseconds | Comparable p90 latency |
| Write latency at a sustained 1 million TPS | Approximately 10 milliseconds (atomic Lua script operations) | Higher, due to read-modify-write pattern | Our new read-modify-write pattern introduced higher write latency, particularly with Protobuf, due to its serialization overhead and distributed writer locks; JSON and Hashes showed lower latency. Our main priority for writers is robustness: minimizing errors, exceptions, and timeouts. Here, ElastiCache outperformed Aerospike, and the modest latency trade-off was well worth the reduced costs and improved reliability. |
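The write-latency trade-off comes from the full-profile read-modify-write pattern: every write pays a fetch, decode, re-encode, and store round trip instead of a single server-side atomic operation. A minimal sketch of the pattern, using JSON in place of the production Protobuf codec and a dict standing in for the cluster:

```python
import json

store = {}  # stand-in for a Valkey connection; values are serialized blobs

def read_modify_write(key, mutate):
    """Fetch the serialized profile, decode it, apply the change,
    re-encode, and write back. In production this runs under the
    per-profile distributed lock so concurrent writers can't interleave."""
    raw = store.get(key)
    profile = json.loads(raw) if raw is not None else {}
    mutate(profile)  # apply the upstream event (add/update/delete fields)
    store[key] = json.dumps(profile, separators=(",", ":"))
    return profile

# Two upstream events applied to the same (hypothetical) profile key.
read_modify_write("user:42", lambda p: p.update(status="active"))
read_modify_write("user:42", lambda p: p.setdefault("segments", []).append(1001))
```

The decode/re-encode work on each event is the serialization overhead the table attributes to Protobuf; the compensating benefit is that every stored value is always a complete, type-safe profile.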
Conclusion
Our migration to ElastiCache for Valkey has been truly transformational, exceeding our goals of reducing costs, enhancing reliability, eliminating licensing dependencies, and simplifying operations through AWS fully managed services. The migration achieved near-zero timeouts, maintained exceptional performance, and ensured consistent, real-time ad delivery at scale. With a scalable, low-latency architecture now in place, we’re well positioned to support future growth and continue delivering an outstanding ad experience to our customers.
For more information about ElastiCache for Valkey, refer to the Amazon ElastiCache for Valkey documentation.

