AWS Database Blog
GroundTruth reduces costs by 45% and improves reliability migrating from Aerospike to Amazon ElastiCache for Valkey
This is a guest post by Aman Malkar, Software Engineer, and Xun Gong, Software Engineering Manager, at GroundTruth, in partnership with AWS.
GroundTruth, an advertising platform leading the way in location- and behavior-based marketing, empowers brands to connect with consumers through real-world behavioral data to drive real business results. GroundTruth delivers high-impact, targeted advertising across mobile, desktop, connected TV, digital out-of-home, digital audio, and emerging digital channels, all with built-in attribution to real-world outcomes, like foot traffic, market share growth, and sales lift.
As our advertising platform scaled to process an increasing volume of ad requests and third-party segment ingestion, maintaining our Aerospike-based caching infrastructure introduced significant operational complexity and rising costs, while also compromising performance and limiting our ability to scale efficiently. Our architecture required a robust caching solution to enrich ad requests with audience data while maintaining low latency for our advertising delivery platform.
To meet our requirements, we implemented Amazon ElastiCache for Valkey, which streamlined our operations, improved reliability, and reduced costs. ElastiCache for Valkey is a fully managed, in-memory data store offering automatic failover, scaling, and maintenance with zero downtime. The service runs in multiple Availability Zones for high availability and integrates seamlessly with other AWS services through AWS Identity and Access Management (IAM) roles. ElastiCache for Valkey offered performance improvements, open source innovations, and a significant price advantage.
In this post, we walk through our migration journey, covering the migration strategy we adopted, the optimizations we made to reduce cost by 45%, reliability improvements including reducing write failures by 20x, and operational gains from managed service capabilities.
Solution overview
Our user segment store caching infrastructure serves two critical functions in our advertising platform:
- Read operations – Amazon Elastic Compute Cloud (Amazon EC2) clusters perform API calls to enrich incoming ad requests with user profile data. The user store cache layer maintains historical profiles to enhance targeting attributes for advertising campaigns. This system enables real-time ad delivery through high-throughput, low-latency profile lookups.
- Write operations – Amazon EMR jobs process upstream events, including new user profile additions, user profile updates, and user profile deletions. These writers perform CRUD operations on our cache and ensure current data is available for reader queries. Completeness and robust writes with low error rates are critical for data accuracy.
The following diagram compares our data processing layer architecture with Aerospike (left) and ElastiCache for Valkey (right).

During proof-of-concept (POC) load testing at a sustained throughput of 1 million transactions per second (TPS), both Aerospike and ElastiCache for Valkey achieved comparable p90 read latency of approximately 4.5 milliseconds. Notably, ElastiCache exhibited a substantially lower error and timeout rate (approximately 0.14%, compared to approximately 1% for Aerospike).
Challenges with Aerospike caching solutions
For GroundTruth, the Aerospike caching deployment presented a range of technical and operational challenges that hindered agility and cost-efficiency:
- Upgrading the Aerospike NoSQL cache to newer versions is a complex and resource-intensive process. It often requires a dedicated engineering effort, consuming significant time and operational bandwidth, without offering immediate functional or cost benefits. This level of overhead slows innovation and increases technical debt.
- Licensing is based on the volume of unique data stored, which is difficult to predict and optimize. As a result, capacity must be provisioned upfront for peak usage, leading to underutilized licenses and missed cost-saving opportunities.
- Scaling the Aerospike cache cluster is a manual and time-consuming process. Nodes can only be added one at a time, and each addition can take hours to complete, which creates operational bottlenecks, especially when rapid scaling is needed to meet demand.
Migration objectives
To guide our migration, we defined four clear objectives:
- Optimize TCO by at least 30% while driving operational efficiency and removing licensing dependencies
- Boost reliability and performance to provide consistent ad delivery at scale
- Unlock scalability to support our rapidly growing business needs
- Preserve low latency essential for real-time ad targeting
Migration journey
We chose Amazon ElastiCache for Valkey because it addressed our scaling and reliability challenges while reducing costs: it eliminated licensing fees, its Multi-AZ architecture improved availability, and it provided built-in scalability. The fully managed service also simplified operations, freeing our teams to focus on delivering new customer features.
To achieve a zero-downtime migration, we followed a phased approach with built-in rollback procedures and established a comprehensive monitoring framework to track key performance indicators throughout the transition. The following sections describe the key tasks accomplished at each stage.
Key space analysis
This migration provided an opportunity to reassess our audience targeting requirements, driving a comprehensive analysis and sanitization of existing data. As a result, we eliminated stale data columns no longer aligned with current needs and removed low-value user profiles—defined as users observed only once in the past few months within the GroundTruth bid stream.
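The low-value profile rule can be expressed as a small filter. The following is an illustrative sketch, not GroundTruth's actual implementation; the 90-day window and the shape of the `observations` input are assumptions:

```python
from datetime import datetime, timedelta

def is_low_value(observations, now, window_days=90):
    """Return True if a profile was observed at most once in the lookback
    window, mirroring the 'seen only once in the past few months' rule.

    observations: list of datetime objects for bid-stream sightings
    (assumed shape for this sketch).
    """
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in observations if t >= cutoff]
    return len(recent) <= 1

now = datetime(2024, 6, 1)
# A profile seen once in the window is a removal candidate;
# one seen repeatedly is kept.
single = is_low_value([datetime(2024, 5, 1)], now)
repeat = is_low_value([datetime(2024, 5, 1), datetime(2024, 5, 20)], now)
```

A batch job applying such a predicate during extraction is one way the stale-profile cleanup could be wired into the migration pipeline.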
Data modeling optimization
In Aerospike, we modeled user profile data using the JSON data structure. As part of optimizing the data model in ElastiCache for Valkey, we evaluated three options—JSON, Hashes, and Protocol Buffers (Protobuf)—to understand their respective trade-offs, strengths, and limitations. This analysis led us to transition from JSON to Protobuf, delivering significantly improved space efficiency for our use case.
The following table shows a sample data model.
| Key | Attribute1 | Attribute2 | Attribute3 | Attribute4 | Attribute5 |
| --- | --- | --- | --- | --- | --- |
| xxx | TYPE_A | MAP('{1001:25847, 2002:1892, 3003:156}') | LIST('["item1", "item2", "item3", "item4"]') | 42 | active |
The following table compares the data structures.
| Type | Trade-offs |
| --- | --- |
| JSON | Offered path-based operations and atomic nested updates, but consumed more memory and required full document reads for field access. |
| Hashes | Provided space efficiency and fast operations, but required manual parsing for each field and lacked built-in nested operations. |
| Protocol Buffers | For our use case, Protobuf emerged as the optimal solution: it reduced infrastructure costs through compact serialization while maintaining type safety for our complex user profile data structures. The full-profile read-modify-write access pattern aligned well with the Protobuf serialization approach. |
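The memory advantage of a binary encoding over JSON can be illustrated with a small sketch. Protobuf requires generated classes, so this example uses Python's `struct` module to mimic a compact, schema-driven binary layout; the profile fields follow the sample data model above and are otherwise hypothetical:

```python
import json
import struct

# Hypothetical user profile resembling the sample data model above.
profile = {
    "type": "TYPE_A",
    "segments": {1001: 25847, 2002: 1892, 3003: 156},  # segment ID -> score
    "items": ["item1", "item2", "item3", "item4"],
    "count": 42,
    "status": "active",
}

def encode_json(p):
    # JSON repeats field names as text and stores numbers as digit strings.
    return json.dumps(p, separators=(",", ":")).encode()

def encode_binary(p):
    # Schema-driven layout: no field names on the wire, fixed-width ints,
    # length-prefixed strings (Protobuf uses varints, but the idea is the same).
    out = bytearray()
    out += struct.pack("<B", len(p["type"])) + p["type"].encode()
    out += struct.pack("<H", len(p["segments"]))
    for k, v in p["segments"].items():
        out += struct.pack("<II", k, v)
    out += struct.pack("<B", len(p["items"]))
    for item in p["items"]:
        out += struct.pack("<B", len(item)) + item.encode()
    out += struct.pack("<I", p["count"])
    out += struct.pack("<B", len(p["status"])) + p["status"].encode()
    return bytes(out)

json_size = len(encode_json(profile))
bin_size = len(encode_binary(profile))
```

Because the schema lives in code rather than in every stored value, the binary form is substantially smaller, which is the effect that drove the memory savings at cache scale.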
Dual read/write implementation and performance validation
To implement the dual read/write mechanism, we followed these steps to facilitate a safe transition to ElastiCache for Valkey:
- Preparation – Used the EMR Loader to extract data from the Aerospike cluster and stored it in Amazon Simple Storage Service (Amazon S3) in Parquet format, then used that to bootstrap the ElastiCache for Valkey cluster
- Dual write/read – To sync ongoing changes, we maintained parallel writes to both Aerospike and ElastiCache while implementing a configurable read traffic routing mechanism in our ad servers and data servers, which allowed us to gradually increase cache utilization from 0% to 100%. We started small, routing 1–5% of traffic to ElastiCache to validate its performance against Aerospike, monitoring key metrics such as cache hit rates, latency, and error rates at each increment. Traffic was then scaled up in stages (5% → 10% → 25% → 50% → 100%), with the flexibility to reduce the percentage immediately if any issues arose during real-world testing under production load
- Validation phase – Conducted continuous real-time monitoring and performance comparisons between Aerospike and ElastiCache under production load, ensuring stability, accuracy, and seamless migration
- Cutover phase – For the final migration, we froze writers during full data extraction to ensure synchronization, executed a 27-hour data dump from Aerospike, and performed a one-hour EMR load into ElastiCache before resuming normal operations.
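The configurable read-routing mechanism from the dual write/read step can be sketched as a deterministic percentage gate. This is an assumed implementation, not GroundTruth's actual code; hashing the key (rather than random sampling) keeps each user's reads pinned to one store, making side-by-side metric comparisons reproducible:

```python
import zlib

def route_to_valkey(user_key: str, rollout_pct: int) -> bool:
    """Route rollout_pct% of read traffic to the new store.

    Each key lands in a stable bucket 0-99; keys below the threshold
    go to ElastiCache, the rest stay on Aerospike. Raising the
    threshold never un-routes a key that was already migrated.
    """
    bucket = zlib.crc32(user_key.encode("utf-8")) % 100
    return bucket < rollout_pct

# Ramp schedule from the migration: raise the percentage stage by stage.
for pct in (1, 5, 10, 25, 50, 100):
    routed = sum(route_to_valkey(f"user-{i}", pct) for i in range(10_000))
```

Because the threshold is a single config value, rolling back to a lower percentage is an immediate, deterministic operation.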
To further improve performance, we implemented two key strategies with ElastiCache for Valkey. First, we used Redisson’s asynchronous APIs with readMode: MASTER_SLAVE to distribute read operations across primary and replica nodes, taking advantage of rapid data replication. Second, we implemented distributed locking to prevent data conflicts. These improvements resolved the consistency issues we previously experienced with our Aerospike implementation, where simultaneous updates could cause data mismatches.
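The distributed locking strategy typically follows the SET-with-NX-and-TTL token pattern (which Redisson's `RLock` implements for you). The sketch below illustrates the pattern's semantics with an in-memory stand-in for the cluster; the class and function names are hypothetical:

```python
import time
import uuid

class FakeStore:
    """Minimal in-memory stand-in for a Valkey connection (illustration only)."""

    def __init__(self):
        self._data = {}  # key -> (token, expiry in monotonic seconds)

    def set_nx_px(self, key, token, ttl_ms):
        # Mirrors SET key token NX PX ttl: succeeds only if the key is
        # absent or its TTL has expired, so crashed writers can't wedge the lock.
        current = self._data.get(key)
        if current is not None and current[1] > time.monotonic():
            return False
        self._data[key] = (token, time.monotonic() + ttl_ms / 1000.0)
        return True

    def delete_if_token(self, key, token):
        # Mirrors the compare-and-delete release (done via a Lua script in
        # practice), so a writer never releases a lock it no longer holds.
        current = self._data.get(key)
        if current is not None and current[0] == token:
            del self._data[key]
            return True
        return False

def with_profile_lock(store, user_key, update_fn, ttl_ms=5000):
    """Run update_fn only while holding the per-profile lock."""
    lock_key = "lock:" + user_key
    token = str(uuid.uuid4())
    if not store.set_nx_px(lock_key, token, ttl_ms):
        raise RuntimeError("profile is being updated by another writer")
    try:
        return update_fn()
    finally:
        store.delete_if_token(lock_key, token)
```

Scoping the lock to a single profile key keeps contention low: two writers collide only when they target the same user at the same moment.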
As part of our migration strategy to ElastiCache for Valkey, we implemented a dual read/write architecture. The following diagram illustrates the end-to-end data flow—from Aerospike through Amazon EMR processing to ElastiCache clusters—demonstrating how reads and writes are distributed across the system.
Results and benefits
Migrating to ElastiCache for Valkey delivered substantial improvements over our previous implementation, including higher performance, greater reliability, lower costs, and an enhanced end-user experience. Notably, GroundTruth reduced costs by 45% and improved write reliability by 20×. The following table summarizes the full set of improvements.
| Dimension | Aerospike | Amazon ElastiCache for Valkey | Notes |
| --- | --- | --- | --- |
| Cost | Baseline | 45% savings | Savings driven by key space reduction, Protobuf memory efficiency, and elimination of licensing fees |
| Reliability: write failure rate | 2.5% (average) | 0.14% (average) | Approximately 20 times lower |
| Reliability: reader timeouts | 0.14% | 0.000006% | |
| Reliability: reader exceptions | Occasional | Zero exceptions | |
| Reliability: write success rate | 97.5% (average) | 99.86% (average) | |
| Operations | Self-managed | Fully managed | |
| Memory efficiency | Baseline | 40% less memory | Optimized through key space analysis and Protobuf serialization |
| p90 read latency at a sustained 1 million TPS | Approximately 4.5 milliseconds | Approximately 4.5 milliseconds | Comparable p90 latency |
| Write latency at a sustained 1 million TPS | Approximately 10 milliseconds (atomic Lua script operations) | Higher, due to read-modify-write pattern | Our new read-modify-write pattern introduced higher write latency, particularly with Protobuf, due to its serialization overhead and distributed writer locks; JSON and Hashes showed lower latency. Our main priority for writers is robustness: minimizing errors, exceptions, and timeouts. Here, ElastiCache outperformed Aerospike, and the modest latency trade-off was well worth the reduced costs and improved reliability. |
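The write-latency trade-off comes from the full-profile read-modify-write pattern: every write pays a fetch, decode, re-encode, and store round trip instead of a single server-side atomic operation. A minimal sketch of the pattern, using JSON in place of the production Protobuf codec and a dict standing in for the cluster:

```python
import json

store = {}  # stand-in for a Valkey connection; values are serialized blobs

def read_modify_write(key, mutate):
    """Fetch the serialized profile, decode it, apply the change,
    re-encode, and write back. In production this runs under the
    per-profile distributed lock so concurrent writers can't interleave."""
    raw = store.get(key)
    profile = json.loads(raw) if raw is not None else {}
    mutate(profile)  # apply the upstream event (add/update/delete fields)
    store[key] = json.dumps(profile, separators=(",", ":"))
    return profile

# Two upstream events applied to the same (hypothetical) profile key.
read_modify_write("user:42", lambda p: p.update(status="active"))
read_modify_write("user:42", lambda p: p.setdefault("segments", []).append(1001))
```

The decode/re-encode work on each event is the serialization overhead the table attributes to Protobuf; the compensating benefit is that every stored value is always a complete, type-safe profile.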
Conclusion
Our migration to ElastiCache for Valkey has been truly transformational, exceeding our goals of reducing costs, enhancing reliability, eliminating licensing dependencies, and simplifying operations through AWS fully managed services. The migration achieved near-zero timeouts, maintained exceptional performance, and ensured consistent, real-time ad delivery at scale. With a scalable, low-latency architecture now in place, we’re well positioned to support future growth and continue delivering an outstanding ad experience to our customers.
For more information about ElastiCache for Valkey, refer to the Amazon ElastiCache for Valkey documentation.

