AWS for Industries
Transforming the BMW Connected Vehicle Backend with Karpenter
BMW Connected Company, a division within the BMW Group, develops and operates premium digital services for BMW’s connected vehicle fleet of more than 23 million vehicles worldwide.
In 2019, BMW made the strategic decision to migrate the entire Connected Vehicle backend, comprising over 1,300 microservices, from on-premises to the Amazon Web Services, Inc. (AWS) cloud. Today, the BMW Connected Vehicle backend operates as a highly complex, high-performance service mesh comprising more than 375 Amazon Elastic Kubernetes Service (EKS) clusters that process over 12 billion requests and 145TB of data traffic daily across 4 AWS Regions.
In this blog, we describe BMW’s journey migrating from Kubernetes Cluster Autoscaler (CAS) to Karpenter to help BMW achieve increased flexibility, operational efficiency and reduced costs. We’ll highlight key considerations that drove this decision, walk through the implementation process and share the valuable lessons learned along the way.
Challenges for optimizing at scale
BMW initially implemented CAS as the primary autoscaling solution to help manage the dynamic scaling requirements of its Amazon Elastic Kubernetes Service (Amazon EKS) clusters. However, as the service continued to scale, several critical challenges emerged. Managing multiple Auto Scaling groups became more complex, and maintaining high availability for important BMW applications while ensuring swift cluster upgrades and adhering to BMW’s security requirements proved increasingly challenging.
With hundreds of clusters in operation, the limitations of CAS became apparent, and a more modern solution was required to meet BMW’s evolving performance and efficiency demands.
Primary benefits driving BMW’s decision to migrate to Karpenter include:
- Computational Efficiency: Dynamic node lifecycle management, intelligent bin-packing, and flexible instance selection help maximize utilization and reduce latency through optimized Availability Zone (AZ) placement.
- Cost Optimization: By eliminating idle capacity and supporting diverse instance types, including Spot Instances, Karpenter helps enable right-sized provisioning that reduces cloud spend without compromising performance.
- Scalable Automation: Karpenter’s declarative approach and deep integration with Kubernetes simplify scaling operations and help support BMW’s future ambitions for full infrastructure automation.
- Operational Excellence: Drift detection identifies new Amazon Machine Image (AMI) releases, helping maintain BMW’s system security by ensuring timely upgrades and addressing vulnerabilities.
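To make the bin-packing idea concrete, the following is a minimal sketch of a first-fit-decreasing heuristic that places pod CPU requests onto equally sized nodes. It is an illustration only and is not Karpenter's actual scheduling algorithm; the `Node` class, the function name, and the sample requests are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical node with a fixed number of vCPUs."""
    cpu_capacity: int
    pods: list = field(default_factory=list)

    @property
    def cpu_free(self) -> int:
        return self.cpu_capacity - sum(self.pods)


def first_fit_decreasing(pod_cpu_requests, node_cpu):
    """Place the largest requests first, opening a new node only when no existing node fits."""
    nodes = []
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_cpu:
            raise ValueError(f"request of {request} vCPU exceeds node size {node_cpu}")
        # Reuse the first node that still has room for this pod.
        target = next((n for n in nodes if n.cpu_free >= request), None)
        if target is None:
            target = Node(cpu_capacity=node_cpu)
            nodes.append(target)
        target.pods.append(request)
    return nodes


# Illustrative vCPU requests (assumed values, not BMW data).
requests = [2, 1, 3, 1, 2, 1, 4, 2]
packed = first_fit_decreasing(requests, node_cpu=8)
utilization = sum(requests) / sum(n.cpu_capacity for n in packed)
print(f"{len(packed)} nodes, utilization {utilization:.0%}")
```

Packing larger pods first tends to leave fewer partially filled nodes, which is the intuition behind the utilization gains described in this post.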
Introducing Karpenter
To address these challenges, BMW integrated Karpenter, AWS’s purpose-built autoscaling solution for Amazon EKS. This marked a fundamental change in how BMW approached resource allocation and cluster management.
Although CAS and Karpenter share the same goal of autoscaling, their underlying approaches differ substantially. CAS scales a cluster by resizing preconfigured Auto Scaling groups, whereas Karpenter provisions nodes directly through the Amazon EC2 application programming interfaces (APIs), which helps improve responsiveness and enables more intelligent and more secure autoscaling.
Figure 1 illustrates these architectural differences.
Figure 1: Architectural differences between Cluster Autoscaler and Karpenter
Migration Approach
BMW started by deploying a proof of concept to quantify the benefits and collate learnings that would be carried forward to ensure a more seamless, low-risk production implementation.
Proof of Concept
For the initial phase of the migration, BMW followed the official guidelines from Karpenter’s open-source documentation, adapting the steps to fit BMW’s infrastructure standards and tooling. However, rather than using the AWS Command Line Interface directly, BMW used Terraform with GitHub Actions to implement a controlled, repeatable deployment pipeline. This allowed for gradual rollout and easy rollback, aligned with DevOps best practices.
The following outlines the methodology and steps taken to implement the proof of concept:
- Create the Identity and Access Management (IAM) roles and policies required by Karpenter.
- Add the necessary tags to virtual private cloud (VPC) subnets and security groups to support dynamic provisioning, and include labels and selectors that enable Karpenter to manage workloads efficiently.
- Update the aws-auth ConfigMap to enable Karpenter node access.
- Place Karpenter in a dedicated node group to help prevent Karpenter from disrupting the nodes it runs on during cluster reconciliation.
- Deploy Karpenter using Terraform, including custom node affinity configurations.
- Verify the custom resource definitions (CRDs), then create NodePools and EC2NodeClasses for the different workload types.
- Assign affinity for critical workloads to ensure correct placement.
- Decommission CAS and the Managed Node Groups after migration.
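As an illustration of the NodePool and EC2NodeClass step, the sketch below builds a minimal pair of manifests as Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes the Karpenter v1 APIs and the `karpenter.sh/discovery` tag convention from Karpenter's public documentation. The cluster name, IAM role name, `workload-type` label, and resource names are hypothetical and do not reflect BMW's actual configuration.

```python
import json

CLUSTER_NAME = "example-cluster"         # hypothetical cluster name
NODE_ROLE = "KarpenterNodeRole-example"  # hypothetical IAM role for nodes


def ec2_node_class(name: str) -> dict:
    """EC2NodeClass that discovers subnets and security groups by tag."""
    discovery_tag = {"karpenter.sh/discovery": CLUSTER_NAME}
    return {
        "apiVersion": "karpenter.k8s.aws/v1",
        "kind": "EC2NodeClass",
        "metadata": {"name": name},
        "spec": {
            "role": NODE_ROLE,
            "amiSelectorTerms": [{"alias": "al2023@latest"}],
            "subnetSelectorTerms": [{"tags": discovery_tag}],
            "securityGroupSelectorTerms": [{"tags": discovery_tag}],
        },
    }


def node_pool(name: str, node_class: str, workload_type: str) -> dict:
    """NodePool that labels its nodes so workloads can target them via affinity."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "metadata": {"labels": {"workload-type": workload_type}},
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["on-demand"],
                        }
                    ],
                },
            }
        },
    }


manifests = [
    ec2_node_class("stateless"),
    node_pool("stateless", node_class="stateless", workload_type="stateless"),
]
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

Generating manifests from code like this is only one option; the same objects can be expressed directly in Terraform or YAML, which is closer to the pipeline described above.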
Though the infrastructure setup was largely smooth, BMW also needed to overcome several challenges:
- Ephemeral storage: The default 20GB was insufficient for stateless node pools because of large container images, so additional Amazon Elastic Block Store (Amazon EBS) volume mappings were introduced.
- Image pull delays: Resolved by increasing registryPullQPS from 5 to 50 in the kubelet configuration to handle BMW’s container registry load more efficiently.
- Pod reconciliation: Excessive pod disruption was mitigated by reducing the disruption budget on NodePools to 10%, which limits how many nodes can be replaced simultaneously.
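The storage and disruption fixes above map to fields on the Karpenter resources. The sketch below shows what those fragments might look like, together with the arithmetic behind a 10% budget. It assumes the round-up rule described in Karpenter's disruption documentation (allowed disruptions = the percentage of total nodes, rounded up, minus nodes already deleting or not ready). The 100Gi volume size, the `/dev/xvda` device name, and the function name are assumptions for illustration.

```python
# EC2NodeClass fragment: a larger root volume for nodes pulling big images.
# The size and device name are assumed values, not BMW's settings.
block_device_mappings = [
    {
        "deviceName": "/dev/xvda",
        "ebs": {"volumeSize": "100Gi", "volumeType": "gp3", "encrypted": True},
    }
]

# NodePool fragment: at most 10% of the pool's nodes disrupted at once.
disruption = {"budgets": [{"nodes": "10%"}]}


def allowed_disruptions(total_nodes: int, budget_percent: int,
                        deleting: int = 0, not_ready: int = 0) -> int:
    """Nodes that may be disrupted right now under a percentage budget."""
    # Integer ceiling of total_nodes * budget_percent / 100, avoiding float error.
    rounded_up = -(-total_nodes * budget_percent // 100)
    return max(0, rounded_up - deleting - not_ready)


for nodes in (5, 45, 200):
    print(nodes, "nodes ->", allowed_disruptions(nodes, 10), "may be disrupted")
```

With a percentage budget, small pools still allow one node at a time to be replaced, while large pools scale the replacement rate with their size.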
The proof of concept yielded significant improvements, as illustrated in Figure 2 and Figure 3, with an approximately 9% increase in overall CPU utilization and a 13% reduction in hourly costs.
Figure 2: Cluster Autoscaler utilization
Figure 3: Karpenter utilization
Production Implementation
Following the promising results from the proof of concept, BMW decided to promote Karpenter to production.
Rollback Strategy:
In case the Karpenter rollout did not proceed as expected, the following rollback strategy was defined:
- Disable Karpenter via Infrastructure-as-Code (IaC).
- Scale out Cluster Autoscaler to resume node group management and scaling through it.
- Remove Karpenter components (pods and CRDs) from the cluster; reset min/max for Managed Node Groups.
- Perform cleanup of any residual Karpenter-related resources, if needed.
Production Rollout:
BMW’s production rollout of Karpenter followed a carefully structured plan designed by BMW and AWS to help ensure a safe and efficient transition:
- Dedicated Node Group: A tainted and labeled Managed Node Group was created exclusively for running Karpenter pods, helping ensure isolation from application workloads.
- Terraform Integration: Karpenter-specific IAM roles, node affinity rules, and configurations were embedded into EKS Terraform modules for automated provisioning.
- Workload Segmentation: Custom NodePools and EC2NodeClasses were defined for stateless, stateful, and GPU workloads, helping enable optimal instance selection per workload type.
- Cluster Autoscaler Deactivation: CAS was scaled down to zero to help avoid conflict with Karpenter’s provisioning logic.
- Controlled Rollout: A feature flag (use_karpenter) enabled selective activation, with pre-flight checks to help verify pod readiness and system health before draining Managed Node Groups.
- Observability: Custom monitoring dashboards were created to track Karpenter metrics, performance, and provisioning behaviors in real-time.
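The controlled rollout step can be pictured as a simple gate: nothing changes unless the flag is on, and Managed Node Groups are drained only after the pre-flight checks pass. The sketch below is a hypothetical illustration of that logic. Only the `use_karpenter` flag name comes from this post; the `PodStatus` type, the health inputs, and the returned action strings are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodStatus:
    """Hypothetical summary of a pod's state, e.g. gathered from the Kubernetes API."""
    name: str
    ready: bool


def preflight_ok(pods, karpenter_healthy: bool) -> bool:
    """Pass only if Karpenter is healthy and every observed pod is ready."""
    return karpenter_healthy and all(pod.ready for pod in pods)


def next_action(use_karpenter: bool, pods, karpenter_healthy: bool) -> str:
    """Decide the next rollout step for one cluster."""
    if not use_karpenter:
        return "keep-managed-node-groups"   # flag off: cluster stays on CAS
    if not preflight_ok(pods, karpenter_healthy):
        return "hold"                       # flag on, but not yet safe to drain
    return "drain-managed-node-groups"


pods = [PodStatus("api-0", ready=True), PodStatus("api-1", ready=False)]
print(next_action(use_karpenter=True, pods=pods, karpenter_healthy=True))
```

Keeping the decision in one small function makes it straightforward to enable clusters selectively and to hold a rollout without rolling it back.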
Figure 4 shows the resultant architecture following BMW’s migration to Karpenter. The platform VPC hosts an Amazon EKS managed node group running Karpenter, which provisions Amazon EC2 On-Demand or Spot Instances and uses Amazon EventBridge and Amazon Simple Queue Service (SQS) for termination handling. VPC peering provides connectivity between the platform VPC and the product VPC, enabling internal teams to deploy resources such as Amazon Relational Database Service (RDS) for consumption by Amazon EKS applications.
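Karpenter consumes interruption events from the queue itself, so no custom handler is required. Purely to show what flows through the EventBridge-to-SQS path, the sketch below parses a sample message body in the shape of an EC2 Spot Instance Interruption Warning event. The instance ID, Region, and function name are made-up values for illustration.

```python
import json
from typing import Optional

# Sample SQS message body: EventBridge delivers the event JSON as the body.
SAMPLE_BODY = json.dumps({
    "version": "0",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "region": "eu-central-1",            # assumed Region
    "detail": {
        "instance-id": "i-0123456789abcdef0",  # made-up instance ID
        "instance-action": "terminate",
    },
})


def interrupted_instance(body: str) -> Optional[str]:
    """Return the instance ID if the message is a Spot interruption warning."""
    event = json.loads(body)
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    return event["detail"]["instance-id"]


print(interrupted_instance(SAMPLE_BODY))
```

Receiving the warning ahead of termination is what allows a node to be cordoned and drained, and replacement capacity to be launched, before the instance is reclaimed.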
The migration to Karpenter helped deliver significant, measurable improvements across infrastructure efficiency, performance, and cost optimization. CPU utilization efficiency improved by ~12%, rising from 84% to 93%, primarily driven by Karpenter’s dynamic bin-packing and real-time instance provisioning. In addition, the total CPU core count fell by between 10% and 16%, depending on the environment.
The varying reduction percentages reflect differences in environment-specific factors, including budget constraints for the compaction algorithm, cluster sizes, and workload characteristics. These factors directly influence how Karpenter allocates and optimizes resources in each environment.
This resulted in total annual savings for BMW of over $1M in AWS infrastructure cost.
Figure 5: Production deployment CPU core count pre and post migration
In addition to increased efficiency and cost savings, the migration provided BMW with several key operational and strategic benefits. It helped:
- Improve workload startup latency through faster node provisioning.
- Reduce over-provisioning by matching instance types to real-time pod resource requirements.
- Increase resiliency through proactive Spot Instance interruption handling.
- Streamline operations with fewer node group configurations and simplified scaling logic.
- Enhance platform flexibility by enabling Arm support and improved GPU capabilities.
- Create custom NodePool offerings for internal BMW teams that increase BMW EKS platform efficiency.
- Enable dynamic node sizing for efficient handling of large containers, including those required for AI workloads.
These outcomes are consistent with observations from other enterprise implementations of Karpenter, where organizations reported better resource utilization, enhanced automation, and cost reductions without sacrificing application performance or reliability. Karpenter’s real-time decision-making, topology-aware scheduling, and support for multiple instance types proved critical in helping unlock these benefits.
AWS Sample Karpenter Migration
To support broader adoption, AWS has open-sourced a sample project that mirrors the Karpenter implementation described in this blog. This resource helps users more quickly test and understand the migration process from CAS to Karpenter within their own Amazon EKS environments. With it, users can:
- Provision Infrastructure: Set up a VPC and Amazon EKS cluster with essential components.
- Install Add-ons: Deploy core EKS add-ons (CoreDNS, kube-proxy, and CAS) for initial benchmarking.
- Deploy Sample Workloads: Install a sample application to simulate real workloads.
- Install and Configure Karpenter: Deploy Karpenter with EC2NodeClasses and NodePools for dynamic scaling.
- Migrate Workloads: Seamlessly transition workloads from CAS management to Karpenter.
- Automate via CI/CD: Use manually or integrate into CI/CD pipelines for repeatable infrastructure delivery.
This project provides a framework for other Amazon EKS customers to experiment with and validate Karpenter’s capabilities prior to a full-scale rollout. To get started, refer to the Karpenter-cluster reference project and follow the instructions in its README.md.
Conclusion
The migration from CAS to Karpenter at BMW represents a pivotal advancement in modernizing workload scaling on Amazon EKS. This transition helped empower BMW to respond more swiftly to demand, enhance CPU utilization, and minimize operational overhead through intelligent bin-packing and dynamic provisioning.
The initiative helped yield measurable improvements to BMW in cost efficiency and resource optimization while helping provide a flexible foundation for future innovation and scaling. Karpenter’s seamless integration with AWS APIs, advanced scheduling logic, and open-source extensibility helped BMW align its cloud infrastructure with the demands of high-performance connected vehicle services.
BMW’s journey demonstrates that, for organizations managing large-scale Kubernetes clusters, Karpenter can provide a transformative upgrade in autoscaling capabilities, paving the way for enhanced operational efficiency and agility. While current optimization gains are driven primarily by Karpenter’s bin-packing capabilities, additional cost savings are expected once Spot Instance adoption is enabled.
“Karpenter enables us to operate smarter, scale faster, and prepare our platform for the next generation of connected vehicle services.” – Dr. Céline Laurent-Winter, BMW Group Vice President Connected Vehicle Platforms