AWS Database Blog
How Alight Solutions achieved 60% cost savings with Amazon ElastiCache for Valkey
This is a guest post by Vipin Garg, Technical Senior Consultant at Alight, in partnership with AWS.
Alight Solutions is a leading cloud-based human capital technology and services provider that has focused its operations on integrated benefits administration, healthcare navigation, and employee experience solutions. The company serves hundreds of enterprise customers globally across North America, Europe, and Asia-Pacific regions, with services that support millions of people worldwide.
Alight’s technology stack requires a high-performance storage layer for user session data and configuration details that enables rapid data retrieval immediately after user login—essential for delivering seamless user experiences across their human capital technology offerings. Previously, Alight relied on a self-hosted enterprise key-value store with around-the-clock vendor support. However, as part of their strategic evaluation of cloud-based managed services aimed at optimizing costs and enhancing performance, Alight determined that migrating to a fully managed solution would better align with their business goals and technical requirements. After comprehensive assessment, Alight selected Amazon ElastiCache as their managed caching solution, recognizing that it helps deliver the critical capabilities required for their use case: sub-millisecond latency, automatic resource scaling, reduced operational overhead through managed service capabilities, AWS environment integration, and robust support for high-throughput workloads.
In this post, we share how Alight Solutions transformed their caching infrastructure using ElastiCache while maintaining strict performance requirements, achieving over 60% cost reduction, 70-80% reduction in operational overhead, migration of gigabytes of data with sub-0.5 millisecond performance for millions of users, and a 99.99% reduction in incident rate.
Challenges with scaling constraints and rising costs
Although their enterprise solution met their sub-0.5 millisecond latency goal, it came with substantial operational overhead and cost challenges that resulted in inefficient resource utilization and unnecessarily high infrastructure costs.
Alight’s legacy enterprise key-value store infrastructure presented complex technical and operational challenges that needed immediate attention. Their solution required stringent performance benchmarks, specifically sub-0.5 millisecond latency for both reads and writes, while handling a dynamic workload that fluctuated from 55,000 operations per second during non-peak times to 150,000-200,000 operations per second during peak utilization periods. The self-hosted infrastructure posed significant operational hurdles, requiring continuous manual effort for operating system (OS) patching, installations, and scaling operations, which demanded 200 work hours of dedicated engineering time per year. Additionally, the fixed shard-based licensing model constrained their ability to optimize resources effectively.
The most significant challenge was the cost implications of their existing setup. The fixed shard-based licensing model forced them to over-provision resources to handle peak periods, but they couldn’t scale down during non-peak times due to licensing constraints. Faced with these constraints and challenges, Alight Solutions sought a more efficient, scalable, and cost-effective solution to meet their performance requirements without the operational burden. This led Alight to explore transformative alternatives, culminating in a strategic partnership with AWS.
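To make the over-provisioning problem concrete, the following is a rough sketch of how little of a peak-sized deployment is used on an average day. The operations-per-second figures come from this post; the number of peak hours per day is a hypothetical assumption chosen only for illustration, and the function name is not from Alight's tooling.

```python
# Illustrative arithmetic only. The ops/sec figures are taken from the post;
# PEAK_HOURS_PER_DAY is a hypothetical assumption, not a number Alight reported.
OFF_PEAK_OPS = 55_000
PEAK_OPS = 200_000          # upper end of the 150,000-200,000 peak range
PEAK_HOURS_PER_DAY = 4      # assumption


def average_utilization(peak_ops, off_peak_ops, peak_hours, hours_per_day=24):
    """Return the fraction of peak-sized capacity actually used over one day."""
    off_peak_hours = hours_per_day - peak_hours
    used = peak_ops * peak_hours + off_peak_ops * off_peak_hours
    provisioned = peak_ops * hours_per_day
    return used / provisioned


peak_ratio = PEAK_OPS / OFF_PEAK_OPS
utilization = average_utilization(PEAK_OPS, OFF_PEAK_OPS, PEAK_HOURS_PER_DAY)
print(f"Peak-to-off-peak ratio: {peak_ratio:.2f}x")
print(f"Average utilization of peak-sized capacity: {utilization:.0%}")
```

Under these assumptions, capacity that cannot be scaled down sits mostly idle outside peak windows, which is the gap that demand-based scaling is meant to close.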
Solution overview
Alight partnered with AWS to design a modern, cloud-based caching solution that would address their challenges while positioning them for future growth.
The decision to choose ElastiCache was driven by several compelling factors:
- Managed service benefits that would alleviate infrastructure management overhead
- Reduction of licensing complexities and associated fixed costs of traditional caching solutions
- Serverless options for automatic scaling and cost savings through pay-as-you-go pricing
- Dynamic scaling capabilities to scale resources based on actual demand
- Proven performance capabilities to meet their sub-millisecond latency requirements and high throughput demands
A critical decision in Alight’s migration was selecting ElastiCache for Valkey as their caching engine. The multithreaded and asynchronous architecture of Valkey significantly boosted throughput while reducing latency, aligning with Alight’s sub-0.5 millisecond performance goals. From a cost perspective, Valkey delivered comparable or better performance than their enterprise solution at a 60% lower price point, contributing to the overall cost savings. Ongoing investments from AWS and active research in Valkey demonstrated a strong commitment to continuous enhancement, making it a strategic long-term choice.
The performance optimization features that proved invaluable included asynchronous request handling for improved concurrency, multithreaded I/O for better resource utilization, optimized memory management (enhanced on the client side through the lightweight Lettuce driver, whose non-blocking architecture reduces memory overhead and minimizes garbage collection pressure), and enhanced support for bundled operations like MGET (which retrieves multiple keys in a single request).
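The benefit of bundled operations can be sketched by counting round trips. The example below uses a hypothetical in-memory class, `FakeKeyValueStore`, as a stand-in for a Valkey server; it is not the Lettuce API or Alight's code, and the key names are made up.

```python
class FakeKeyValueStore:
    """In-memory stand-in for a Valkey server that counts client round trips."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        # One request/response exchange per key.
        self.round_trips += 1
        return self._data.get(key)

    def mget(self, keys):
        # One request/response exchange for the whole batch of keys.
        self.round_trips += 1
        return [self._data.get(key) for key in keys]


store = FakeKeyValueStore({"session:1": "a", "session:2": "b", "session:3": "c"})
keys = ["session:1", "session:2", "session:3", "session:404"]

one_by_one = [store.get(key) for key in keys]
trips_individual = store.round_trips

store.round_trips = 0
batched = store.mget(keys)
trips_batched = store.round_trips

print(f"Individual GETs: {trips_individual} round trips")
print(f"Single MGET:     {trips_batched} round trip")
```

Both approaches return the same values, including `None` for the missing key, but the batched call pays the network latency once. In a sharded cluster, the keys in a single MGET generally need to map to the same hash slot unless the client library splits the request per slot.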
Alight implemented a hybrid approach to maximize both performance and cost efficiency. Their decision framework was straightforward: they use ElastiCache Serverless for Valkey as the default choice for the majority of workloads, leveraging its pay-as-you-go pricing for cost optimization. Node-based Valkey clusters are reserved for specific use cases that require sub-0.5 millisecond response times.
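This decision framework can be expressed as a small rule. The sketch below is one possible reading of it: the `Workload` type, the workload names, and the exact comparison against the threshold are assumptions for illustration, not Alight's actual selection logic.

```python
from dataclasses import dataclass

# The 0.5 ms figure is from the post; treating it as a hard cutoff is an assumption.
NODE_BASED_LATENCY_THRESHOLD_MS = 0.5


@dataclass(frozen=True)
class Workload:
    name: str                   # hypothetical workload label
    required_latency_ms: float  # latency the workload must stay under


def choose_deployment(workload):
    """Default to serverless; use node-based clusters for the strictest latency targets."""
    if workload.required_latency_ms <= NODE_BASED_LATENCY_THRESHOLD_MS:
        return "node-based"
    return "serverless"


workloads = [
    Workload("user-session-store", 0.5),
    Workload("reporting-config-cache", 5.0),
]
choices = {w.name: choose_deployment(w) for w in workloads}
print(choices)
```

Keeping serverless as the default means a workload only moves to node-based clusters when a measured latency requirement justifies the extra capacity planning.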
For their mission-critical, latency-sensitive node-based workloads serving millions of users globally, high availability and scalability were non-negotiable requirements. Alight deployed a multi-shard ElastiCache architecture distributed across multiple Availability Zones, providing the following benefits:
- Fault tolerance through geographic distribution that provides service continuity
- Horizontal scalability, where multiple shards help the system handle growing workloads efficiently
- Ability to reconfigure shards of the cluster offline
- Performance optimization through distributed architecture that reduces latency
- Robust disaster recovery capabilities
This architecture choice directly supported their business requirements for a highly available, globally accessible suite of services that could scale dynamically with demand.
The following diagram illustrates the solution architecture.

Building a comprehensive migration plan
The migration project was completed in four months. The first three months were dedicated to proof of concept validation, application driver changes, comprehensive performance testing, and instance type selection, and the actual migration to ElastiCache was completed in the final month. This schedule reflects Alight’s commitment to thorough planning and execution while maintaining their stringent performance requirements.
Alight engaged the AWS team during the evaluation phase, establishing a partnership that would prove invaluable throughout the migration. Their Technical Account Manager (TAM) served as the dedicated point of contact throughout the entire journey, orchestrating comprehensive support that extended far beyond initial consultation. The TAM coordinated multiple sessions with AWS service subject matter experts to address specific service capabilities, troubleshoot issues as they arose, and provide detailed performance benchmarking guidance. This collaborative approach gave Alight access to deep technical expertise at every stage, with the TAM playing a crucial role in migration strategy development and serving as the central liaison between Alight’s team and specialized AWS resources.
The migration team, comprising Alight and AWS Solutions Architects, developed a comprehensive plan focusing on critical areas:
- Migration from a Jedis application driver to a Lettuce application driver for ElastiCache compatibility with optimized connection handling, retry logic, and MGET request handling
- Migration to the Valkey engine
- Extensive stress testing across multiple instance types to validate latency and throughput requirements, and comprehensive validation of high availability and disaster recovery scenarios
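The post does not describe how the retry logic was configured in the driver. As a generic illustration of the idea, the following sketch retries a cache call with exponential backoff. `TransientCacheError`, `call_with_retry`, and the delay values are hypothetical and do not correspond to Lettuce settings.

```python
import time


class TransientCacheError(Exception):
    """Hypothetical error standing in for a timeout or dropped connection."""


def call_with_retry(operation, max_attempts=3, base_delay_s=0.005, sleep=time.sleep):
    """Run operation, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientCacheError:
            if attempt == max_attempts:
                raise  # retry budget exhausted, surface the error to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))


# Demo: an operation that fails twice, then succeeds.
attempts = {"count": 0}
delays = []


def flaky_get():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientCacheError("simulated timeout")
    return "session-data"


# Record the backoff delays instead of actually sleeping.
result = call_with_retry(flaky_get, sleep=delays.append)
print(result, attempts["count"], delays)
```

For a workload with a sub-millisecond latency target, the retry budget and delays would need to be tuned carefully, because each retry adds directly to the latency the caller observes.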
Data migration and cache warming strategy
The migration to Amazon ElastiCache involved both historical data and new data. The existing 60-70 GB of data from the backend systems was migrated to the ElastiCache clusters, facilitating continuity of service for the cached information. The team developed a Jenkins-based automation pipeline that first migrated this historical data and then continuously populated the new ElastiCache clusters with new data while the legacy system remained operational, helping prevent cold cache performance degradation. Using a dual-write strategy, applications updated both systems simultaneously, with background jobs maintaining data consistency between them. Because Alight uses ElastiCache as a persistent in-memory database rather than a traditional cache, the team also configured a no-eviction policy, which maintained critical data availability throughout the transition.
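A minimal sketch of the dual-write pattern with a backfill and a consistency check follows. Plain dictionaries stand in for the legacy store and the new cluster, and `DualWriter`, `backfill`, and `find_mismatches` are hypothetical names; Alight's Jenkins pipeline and background jobs are not shown in the post.

```python
class DualWriter:
    """Write every update to both the legacy store and the new store."""

    def __init__(self, legacy, target):
        self.legacy = legacy
        self.target = target

    def put(self, key, value):
        self.legacy[key] = value
        self.target[key] = value


def backfill(legacy, target):
    """Copy historical keys that the target lacks; return how many were copied."""
    copied = 0
    for key, value in legacy.items():
        if key not in target:  # never overwrite a newer dual-written value
            target[key] = value
            copied += 1
    return copied


def find_mismatches(legacy, target):
    """Return keys whose value is missing or different in the target store."""
    return sorted(k for k in legacy if k not in target or target[k] != legacy[k])


legacy_store = {"user:1": "v1", "user:2": "v1"}
new_store = {}

writer = DualWriter(legacy_store, new_store)
writer.put("user:2", "v2")                  # live write arriving during migration
copied = backfill(legacy_store, new_store)  # copies only user:1
mismatches = find_mismatches(legacy_store, new_store)
print(copied, mismatches, new_store)
```

The backfill skips keys that already exist in the target so that historical data cannot clobber a fresher value written by the application in the meantime.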
Alight executed a controlled phased cutover: warming up new clusters, operating in shadow mode, then gradually restarting Amazon Elastic Container Service (Amazon ECS) tasks with new connection strings while validating sub-0.5 millisecond targets in real time. This approach provided an instant rollback capability and contributed directly to Alight’s zero-incident record post-migration—a critical factor in both their cost savings and improved customer experience.
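The gradual restart with latency validation can be sketched as a batch loop with a rollback condition. In this illustration, `phased_cutover`, the task identifiers, the batch size, and the latency probe are all hypothetical; the post does not describe the actual orchestration used with Amazon ECS.

```python
def phased_cutover(task_ids, measure_latency_ms, target_ms=0.5, batch_size=2):
    """Switch tasks to the new cluster in batches; roll back if latency misses the target."""
    switched = []
    latency = None
    for start in range(0, len(task_ids), batch_size):
        batch = task_ids[start:start + batch_size]
        switched.extend(batch)  # restart this batch with the new connection string
        latency = measure_latency_ms(switched)
        if latency >= target_ms:
            # Point every task back at the legacy system.
            return {"status": "rolled_back", "on_new_cluster": [], "observed_ms": latency}
    return {"status": "completed", "on_new_cluster": switched, "observed_ms": latency}


tasks = ["task-a", "task-b", "task-c", "task-d", "task-e"]


def healthy_probe(switched):
    return 0.3  # simulated latency that always meets the target


def degraded_probe(switched):
    return 0.3 if len(switched) <= 2 else 0.7  # simulated regression after the first batch


ok = phased_cutover(tasks, healthy_probe)
bad = phased_cutover(tasks, degraded_probe)
print(ok["status"], bad["status"])
```

Because the legacy system keeps receiving writes during the transition, rolling back only requires pointing tasks at the old connection string again.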
Monitoring and security implementation
Alight implemented comprehensive monitoring using Amazon CloudWatch metrics specifically tailored for ElastiCache for Valkey performance tracking. Key metrics include cache hit ratios, CPU utilization (which reflects the multithreading of Valkey), and real-time operations per second, used to confirm consistent throughput of over 150,000 operations per second during peak periods. The monitoring setup also tracks MGET operation performance, which is critical for their bulk data retrieval patterns, and validates sub-0.5 millisecond latency across cache operations.
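The checks described above reduce to a few simple calculations over collected samples. The sketch below assumes hit and miss counters (for example, values read from metrics such as CacheHits and CacheMisses), a list of latency samples, and an operations-per-second reading. The sample values, the use of the 99th percentile, and the function names are assumptions, not details from Alight's dashboards.

```python
import math


def cache_hit_ratio(hits, misses):
    """Return hits / (hits + misses), or 0.0 when there was no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate(hits, misses, latencies_ms, ops_per_sec,
             latency_target_ms=0.5, min_peak_ops=150_000):
    """Summarize whether the latency and throughput targets from the post are met."""
    p99 = percentile(latencies_ms, 99)
    return {
        "hit_ratio": cache_hit_ratio(hits, misses),
        "p99_ms": p99,
        "latency_ok": p99 < latency_target_ms,
        "throughput_ok": ops_per_sec >= min_peak_ops,
    }


# Made-up sample data for illustration only.
latency_samples = [0.21, 0.25, 0.30, 0.28, 0.45, 0.33, 0.27, 0.31, 0.29, 0.26]
report = evaluate(hits=9_500, misses=500,
                  latencies_ms=latency_samples, ops_per_sec=160_000)
print(report)
```

Checking a high percentile rather than the average is a design choice in this sketch: an average can hide the slow tail that users notice right after login.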
The team is further exploring ServiceNow integration to enable automated incident generation for threshold breaches, while specialized Valkey metrics provide insights into connection pooling efficiency and memory optimization that support their operational excellence goals.
Results
The migration to ElastiCache delivered transformative results across multiple dimensions, exceeding expectations in both cost savings and performance improvements.
“Migrating from an enterprise key-value solution to ElastiCache has transformed our caching layer—reducing costs by over 60% and improving performance beyond expectations,” said the Alight team. “AWS Enterprise Support was highly responsive, knowledgeable, and collaborative, especially in helping us diagnose and resolve critical performance challenges during and after the migration. AWS’s expertise was invaluable in optimizing our ElastiCache setup to achieve this migration goal.”
Key Improvements from the Migration:
1. Business Impact:
- Cost Savings: Alight achieved a 60% reduction in costs by transitioning from fixed licensing and over-provisioned infrastructure to a pay-as-you-go model.
- Incident Reduction: The migration led to a 99.99% decrease in incident rate, with zero incidents recorded in the past six months compared to an average of nine incidents annually with the previous solution.
- Operational Efficiency: Manual operational efforts were reduced by 70–80%, alleviating the need for OS patching and manual scaling tasks.
2. Performance Enhancements:
- Latency and Throughput: The ElastiCache system consistently delivered sub-0.5 millisecond latency and sustained a throughput of 150,000–200,000 operations per second during peak periods.
- Resource Utilization: There was a noticeable improvement in CPU utilization and memory efficiency, along with enhanced performance for MGET operations.
3. Customer Experience Improvements:
- Reliability: The significant reduction in incidents directly translated to improved customer experience through enhanced system reliability and reduced service disruptions.
- Business Agility: The ability to scale resources dynamically based on demand provided unprecedented flexibility, enabling faster deployment of new features and allowing the team to focus on innovation rather than infrastructure management.
Lessons learned and best practices
Through their migration journey, Alight gained valuable insights that can benefit other organizations. “If we were starting over, we would always go with node-based ElastiCache cluster for Valkey for all latency-sensitive applications and serverless for all other general use cases,” the team reflected. Understanding specific performance requirements upfront saves significant time and resources.
They identified the following key best practices:
- Early AWS engagement: Involving AWS from the evaluation phase provided crucial architectural guidance and helped avoid potential pitfalls throughout the migration process.
- Requirements baseline: Establishing baseline performance requirements and determining specific workload characteristics upfront enabled optimal selection of deployment models and instance types, saving significant time and resources.
- Driver and engine optimization: Using the Lettuce driver for workloads with high-volume multi-key operations, combined with the Valkey engine, delivered optimal performance for bundled operations through enhanced connection pooling, improved handling of concurrent MGET requests, and reduced network round trips for bulk data retrieval.
- Comprehensive testing: Detailed performance validation and comprehensive testing proved critical to building confidence in the production deployment.
Conclusion
Alight’s migration to ElastiCache for Valkey demonstrates how enterprises can achieve significant cost savings and performance improvements by embracing cloud-based managed services. Through careful planning, strategic technology choices like the Valkey engine, and close partnership with AWS, Alight transformed a costly, operationally intensive infrastructure into a modern, efficient, and scalable solution.
The successful migration has positioned Alight for continued innovation and growth, with plans to migrate additional caching workloads to ElastiCache, explore analytics session caches, and expand customer configuration stores. The team is also implementing auto-discovery capabilities, expanding serverless usage for appropriate workloads, and performing additional Valkey tuning for performance optimization. The substantial cost savings and operational efficiency gains have strengthened Alight’s ability to accelerate their strategic technology investments, with the team now dedicating more focus to innovation initiatives, including advancing their AI and machine learning capabilities that enhance their human capital technology offerings.