AWS Architecture Blog

How Salesforce migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000 EKS clusters

As organizations scale their Kubernetes deployments, cluster scaling has traditionally been complex and slow, requiring careful management of node groups and auto scaling configurations. Karpenter, an open source node provisioning project for Kubernetes, can help transform this approach by directly provisioning right-sized nodes based on real-time workload demands. A recent Datadog report reveals that the percentage of nodes provisioned by Karpenter rose by 22% over the last two years as organizations migrate from traditional auto scaling approaches. This growth underscores Amazon Web Services (AWS) leadership in cloud-based innovation and the container ecosystem’s recognition of Karpenter’s strong performance and cost efficiency benefits. The following post examines how Salesforce, operating one of the world’s largest Kubernetes deployments, successfully migrated from Cluster Autoscaler to Karpenter across its fleet of more than 1,000 Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

Salesforce operates one of the world’s most complex Kubernetes platforms, managing over 1,000 EKS clusters that serve thousands of internal tenants across the company. These clusters power a wide range of applications, from mission-critical services to experimental projects, and demand a high degree of scalability, reliability, and operational efficiency.

As the platform grew, Salesforce’s Kubernetes platform team began to face major hurdles with its traditional auto scaling approach based on AWS Auto Scaling groups and the Kubernetes Cluster Autoscaler. These limitations hampered the team’s ability to respond to application demands quickly, optimize compute resources, and empower internal developers to self-serve infrastructure needs.

To address these challenges, Salesforce undertook a large-scale migration to Karpenter, an open source Kubernetes auto scaler built by AWS. This blog post details the motivation behind the transition, the implementation strategy, the challenges encountered along the way, and the impact on cost, performance, and operational complexity.

Opportunity for operational transformation

At Salesforce’s massive scale, the traditional Kubernetes infrastructure faced several critical challenges. The need to accommodate diverse workload requirements led to a proliferation of thousands of node groups and Auto Scaling groups, creating operational bottlenecks and slowing innovation. This architectural complexity was compounded by significant scaling performance issues, where the Auto Scaling group-dependent Cluster Autoscaler struggled to handle dynamic workloads, often resulting in multi-minute delays during demand spikes and degraded user experience. Resource utilization suffered as well, with inefficient bin-packing and conservative scale-down strategies leading to stranded resources and underutilized infrastructure—a particular concern given Salesforce’s focus on cost-to-serve and sustainability goals. These challenges were further exacerbated by structural limitations in the Auto Scaling group–based architecture, including poor Availability Zone balance and performance bottlenecks in large clusters, particularly for memory-intensive workloads. The combination of these factors made it clear that a more modern, flexible auto scaling solution was essential for maintaining Salesforce’s competitive edge and operational efficiency.

Solution overview

To migrate over 1,000 production clusters without disruption, Salesforce engineered a highly automated, risk-mitigated transition process centered on Karpenter. Here’s how the migration was executed.

At this scale, a manual migration was infeasible. The team developed an in-house Karpenter transition tool to orchestrate the switchover safely and consistently, along with a Karpenter patching check tool. Together, these tools provide a comprehensive solution for migrating Kubernetes clusters to and from Karpenter node management while maintaining operational continuity through automated node rotation, Amazon Machine Image (AMI) validation, and graceful pod eviction handling.

Key design principles included:

  • Zero disruption – The tool cordoned and drained legacy nodes with full respect for pod disruption budgets (PDBs), maintaining workload safety
  • Rollback support – A reverse transition capability allowed fast recovery to Auto Scaling group–based auto scaling if needed
  • Continuous integration and continuous delivery (CI/CD) integration – The tool was embedded in the core infrastructure provisioning pipeline, standardizing the migration across services

This foundation enabled repeatability across thousands of clusters and node pools, inspiring confidence in Salesforce developers.

Automated configuration mapping

To convert existing Auto Scaling group configurations to Karpenter-based definitions, the team automated the mapping logic between legacy and modern configurations. For example:

  • Auto Scaling group instance types → NodePool instance type requirements
  • Root volume sizes → EC2NodeClass blockDeviceMappings parameters
  • Node labels → Applied in both NodePool and EC2NodeClass

With over 1,180 node pools containing highly diverse configurations, automation was essential to minimize errors and reduce manual toil.

For example, a legacy node group configuration looked like the following:

metadata:
  name: m5.8xlarge-min-300-max-2500
data:
  k8s_instance_type: m6i.8xlarge
  k8s_root_volume_size: '100'
  k8s_root_volume_iops: '3000'
  k8s_root_volume_type: 'gp3'
  k8s_root_volume_throughput: '125'
  k8s_min_node_number: '300'
  k8s_max_node_number: '2500'
  multi_az_provisioned_workers: 'false'
  asg_launch_type: 'launch_template'
  gpu_enabled: 'false'
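
For illustration, the legacy configuration above might translate into a Karpenter NodePool along the following lines. This is a minimal sketch rather than Salesforce’s actual manifest: the Karpenter API version, resource names, and Availability Zone value are assumptions.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: m6i-8xlarge-pool            # hypothetical name
spec:
  template:
    spec:
      requirements:
        # Pin the pool to the single instance type from the legacy config
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.8xlarge"]
        # Single-AZ pool, mirroring multi_az_provisioned_workers: 'false'
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a"]    # placeholder zone
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: m6i-8xlarge-class     # see the EC2NodeClass sketch later in this post
  # Karpenter caps capacity with resource limits rather than node counts:
  # 2,500 nodes x 32 vCPUs = 80,000 vCPUs. A minimum node count has no
  # direct Karpenter equivalent.
  limits:
    cpu: "80000"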

A deliberate, phased rollout strategy was adopted:

  • Mid-2025 to Early 2026 – A multistage migration across internal environments with soak times between stages
  • Start with lower-risk environments – Less critical workloads were migrated first to validate tooling and operational processes
  • Risk-based sequencing – High-stakes production environments are being migrated last, after the tooling and process have been validated

By using this approach, Salesforce continuously learned and adapted, avoiding large-scale regressions.

Key insights from the migration

During this migration journey, the Salesforce team gained valuable insights and best practices that we’ll share to help guide your own transformation initiatives.

Managing application availability during node updates

PDBs emerged as a critical consideration during the migration because several services had overly restrictive or misconfigured PDBs that blocked node replacements. The team addressed this by identifying problematic configurations, partnering with application owners on remediation, and implementing Open Policy Agent (OPA) policies for proactive PDB validation. This experience highlighted how proper PDB configuration is essential for safe auto scaling and helped establish stronger governance practices.
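
To make the failure mode concrete, consider a hedged example: a PDB whose minAvailable equals the replica count can never tolerate a voluntary eviction, so node drains stall, whereas budgeting for one unavailable pod lets rotation proceed. The workload names here are hypothetical.

# Too strict for a 3-replica Deployment: no pod may ever be evicted, so drains block
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb-too-strict
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: web
---
# Safe: at most one pod down at a time, so nodes can be rotated gracefully
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb-safe
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web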

Optimizing node maintenance workflows

The initial migration approach of cordoning Karpenter nodes in parallel led to unexpected cluster health issues. To address this, the team refined their strategy by implementing sequential node cordoning, adding manual verification checkpoints with rollback capabilities, and deploying enhanced monitoring for early detection of cluster instability. This experience reinforced that even with modern infrastructure tooling, careful orchestration of node maintenance remains crucial for system reliability.

Understanding Kubernetes label constraints

During the migration, the team discovered that Salesforce’s human-friendly legacy naming conventions often exceeded Kubernetes’s 63-character label length limit, creating challenges with Karpenter’s label-dependent operations. The team resolved this by refactoring naming conventions across node pools to comply with Kubernetes standards. This experience highlighted how seemingly minor technical constraints, such as label length limits, can become significant blockers in automated infrastructure management if not properly addressed early in the migration process.

For example, a legacy node pool name like the following collided with this limit when propagated into labels:

analytics-bigdata-spark-executor-pool-m6a-32xlarge-az-a-b-c

Attempting to apply it produced the following error:

error: metadata.labels: Invalid value: must be no more than 63 characters

Protecting single-instance applications

The team discovered that Karpenter’s efficient bin-packing and consolidation features could unexpectedly impact applications running single-replica pods, leading to service disruptions in critical scenarios. To address this, the team began implementing guaranteed pod lifetime features and workload-aware disruption policies to safeguard these singleton workloads. This experience demonstrated that effective auto scaling solutions must balance infrastructure efficiency with application availability requirements, particularly for mission-critical services.
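
One upstream mechanism in this space is Karpenter’s do-not-disrupt pod annotation, which exempts a pod, and therefore its node, from voluntary consolidation. A minimal sketch with a hypothetical single-replica Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: singleton-worker            # hypothetical singleton service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: singleton-worker
  template:
    metadata:
      labels:
        app: singleton-worker
      annotations:
        # Tells Karpenter not to voluntarily disrupt the node hosting this pod
        karpenter.sh/do-not-disrupt: "true"
    spec:
      containers:
        - name: app
          image: example.com/singleton-worker:latest  # placeholder image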

Managing storage requirements in node migrations

The migration revealed that certain workloads failed to schedule due to incomplete ephemeral storage configurations. The team resolved this by implementing precise 1:1 mappings between the original Auto Scaling group–defined volume settings and Karpenter’s EC2NodeClass parameters. This experience emphasized the importance of carefully translating storage requirements during infrastructure migrations, particularly for I/O-intensive applications.
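
Using the root volume settings from the legacy configuration shown earlier, such a 1:1 mapping might look like the following EC2NodeClass sketch; the resource name, IAM role, discovery tags, and AMI alias are assumptions.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: m6i-8xlarge-class
spec:
  amiSelectorTerms:
    - alias: al2023@latest                   # assumed AMI family
  role: KarpenterNodeRole-example            # hypothetical IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # hypothetical discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi      # k8s_root_volume_size: '100'
        volumeType: gp3        # k8s_root_volume_type: 'gp3'
        iops: 3000             # k8s_root_volume_iops: '3000'
        throughput: 125        # k8s_root_volume_throughput: '125'
        deleteOnTermination: true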

Realized value

The transition to Karpenter delivered measurable impact across multiple dimensions—performance, cost, and developer experience.

Operational efficiency

Salesforce eliminated thousands of node groups, significantly simplifying infrastructure management across its Kubernetes platform. Manual operational overhead was reduced by 80% through automation and the introduction of self-service capabilities. Developers can now define their own node pool requirements without waiting for centralized approvals, resulting in faster onboarding and greater agility.

Performance gains

With Karpenter, scaling latency was reduced from minutes to seconds by provisioning nodes based on actual pending pods, effectively bypassing delays associated with Auto Scaling groups. Node utilization improved significantly due to advanced bin-packing algorithms, resulting in fewer stranded resources and better efficiency. The migration eliminated Auto Scaling group thrashing, leading to more stable workloads and fewer scaling events during traffic spikes.

Cost optimization

Salesforce achieved 5% cost savings in FY2026 by improving bin-packing efficiency and reducing idle capacity across its Kubernetes clusters. With the Karpenter rollout still in progress, an additional 5–10% in savings is projected for FY2027. The migration also lowered the overall cost-to-serve (CTS) by reducing the number of required nodes and improving multi-instance type handling.

Enhanced developer and customer experience

The migration to Karpenter introduced true self-service infrastructure, allowing developers to define their capacity needs through straightforward node pool declarations. It also enabled greater flexibility by supporting heterogeneous instance types, including GPU, ARM, and x86, within a single node pool. Karpenter further improved IP efficiency by decoupling node provisioning from specific subnets, helping reduce IP fragmentation and exhaustion across the platform.
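
As a sketch of what that flexibility can look like, a single NodePool can admit several architectures and instance families through its requirements; the values below are illustrative, not Salesforce’s configuration.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: flexible-pool               # hypothetical name
spec:
  template:
    spec:
      requirements:
        # Let Karpenter choose between x86 and ARM (Graviton) nodes
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        # Allow several general-purpose families instead of one fixed type
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6i", "m6a", "m7g"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default               # hypothetical node class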

Conclusion

The migration to Karpenter represents a fundamental shift in how Salesforce manages Kubernetes infrastructure at scale. By addressing the limitations of traditional auto scaling approaches, we’ve achieved significant improvements in operational efficiency, cost optimization, and customer experience.

The key to our success was a combination of careful planning, custom tooling, and a phased approach that prioritized stability and zero-disruption migration. The results demonstrate that modern Kubernetes auto scaling solutions like Karpenter can transform platform operations while maintaining the reliability required for enterprise-scale deployments.

Salesforce’s success with Amazon EKS and Karpenter demonstrates how AWS continues to innovate alongside its largest enterprise customers, delivering solutions that scale from hundreds to thousands of clusters while reducing costs and operational complexity. This partnership showcases the power of combining AWS managed Kubernetes service with open source innovations like Karpenter to solve real-world challenges at unprecedented scale. To learn more, refer to the Karpenter Best Practices Guide in the Amazon EKS documentation.


About the Authors

Sana Jawad

Sana has 16 years of experience and leads a global, cross-functional team building a substrate-agnostic, multi-tenant Kubernetes Platform at Salesforce. Her team delivers Kubernetes Platform-as-a-Service, allowing engineers to run microservices without managing infrastructure complexity. The platform, one of the largest in the industry, supports Salesforce's Hyperforce transformation with thousands of clusters powering products like Sales, Service, Commerce, MuleSoft, and Tableau. It emphasizes scalability, observability, and developer agility using technologies like Kubernetes, Knative, Karpenter, Terraform, Argo, and Spinnaker.

Min Wang

Min is a Lead Software Engineer at Salesforce. She has 10+ years of experience in large-scale and distributed computing. She is actively working on the Karpenter rollout at Salesforce and helped migrate thousands of nodes from Auto Scaling groups to Karpenter. She also wrote the PDB bypass and node drain timeout features in Salesforce’s Karpenter patching pipeline, which make the patching process much smoother. She is passionate about contributing to open source: Min was an active contributor to the open source project Sloop and an early contributor to the OpenStack software load balancer Octavia. In 2020, she gave a talk at Grace Hopper about her journey with Sloop. A particular interest of hers is improving service quality by building automation tools and creating well-designed services. Min believes that a community that embraces openness, diversity of thought, and knowledge sharing will grow bigger. Outside of work, she enjoys volunteering, hiking, and reading.

Ganga Hiremath

Ganga Hiremath

Ganga Hiremath has over 19 years of IT experience and 10+ years in cloud and distributed systems, currently serving as a Lead Member of Technical Staff at Salesforce. Since joining Salesforce in 2022, Ganga has contributed to transforming the company’s batch data processing capabilities, driving a 4x improvement in Spark job submission rates on Kubernetes and a 95% reduction in the operator’s resource consumption, and has contributed these improvements back to open source. His expertise spans Kubernetes and cloud-native architectures.

Mithilesh Satapathy

Mithilesh Satapathy

Mithilesh Satapathy has over 15 years of IT experience and 8+ years in product management, currently serving as a Senior Product Manager at Salesforce. As a Product Manager, he prioritizes features that simplify resource management, allowing users to define complex node requirements (like Graviton or specialized GPUs) with simple NodePool configurations. He is instrumental in driving the product strategy and roadmap for making node management an invisible utility, ensuring Karpenter remains the leading cost and performance optimization engine for Kubernetes on AWS.

Anuj Butail

Anuj Butail

Anuj Butail is a Principal Solutions Architect at AWS. He is based in San Francisco and helps customers in San Francisco and Silicon Valley design and build large-scale applications on AWS. His expertise spans edge services and containers. He enjoys playing tennis, reading, and spending time with his family.

Shalaka Dhayatkar

Shalaka Dhayatkar

Shalaka Dhayatkar is a Senior Customer Solutions Manager for Strategic Accounts at AWS, where she drives large-scale cloud adoption and transformation for enterprise customers. With over a decade of experience spanning finance and technology, she partners with executives and engineering teams to accelerate business outcomes, resolve complex challenges, and ensure long-term success. Shalaka is a trusted customer advocate with deep expertise in compute, Kubernetes, and enterprise cloud strategy, enabling organizations to scale, innovate, and realize the full value of AWS.

Vikram Venkataraman

Vikram Venkataraman is a Principal Solutions Architect at Amazon Web Services and a container enthusiast. He helps organizations with best practices for running workloads on AWS. In his spare time, he loves to play with his two kids and follows cricket.