How CoStar uses Karpenter to optimize their Amazon EKS resources
Introduction
CoStar is well known as a market leader in Commercial Real Estate data, but they also run major home, rental, and apartments websites, including apartments.com, which many have seen advertised by Jeff Goldblum. CoStar’s traditional Commercial Real Estate customers are highly informed users who rely on large and complex data to make critical business decisions. Successfully helping customers analyze and decide which of the 6 million properties with 130 billion sq. ft. of space to rent has made CoStar a leader in data and analytics technology. When CoStar began building the next generation of their Apartments and Homes websites, it became clear that the user profile and customer demands differed in important ways from those of their long-running Commercial Real Estate customers. CoStar needed to deliver the same decision-making value to their new customer base, but for orders of magnitude more customers and data. This initiated CoStar’s migration from their legacy data centers into AWS for the speed and elasticity needed to deliver the same value to millions of users accessing hundreds of millions of properties.
Challenge
CoStar’s biggest challenge has always been to collect data from hundreds of sources, enrich that data with important insights, and deliver it in a meaningful and user-friendly system. CoStar Suite’s Commercial Real Estate, Apartments, and Homes products all have different data sources that update at different times and with different volumes of data. The systems that support this data ingestion and these updates must be fast, accurate, and able to scale up and down to remain affordable. Many of these systems are in the process of being migrated from legacy data centers into CoStar’s AWS environment, so running them on parallel and interoperable platforms was necessary to avoid massive duplication of engineering support. These needs all pointed to running Kubernetes both on-premises and in AWS, with the ability to scale their container clusters as usage rises and falls. After months of successful testing and production use, CoStar decided to optimize their engineering stack even further, while still keeping their AWS and on-premises Kubernetes management as parallel as possible.
In the Kubernetes cluster architecture, the control plane and its components manage cluster operations (scheduling containers, maintaining application availability, and storing cluster data, among other key tasks), while worker nodes host the pods that run containerized application workloads. Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service on AWS that handles the availability and scalability of the Kubernetes control plane. For the worker nodes, customers can schedule Kubernetes pod workloads on any combination of provisioned Amazon Elastic Compute Cloud (Amazon EC2) instances and AWS Fargate. In this post, we’ll focus on how CoStar used the Karpenter autoscaling solution to provision Amazon EC2 instances for their worker nodes.
The default method for provisioning worker nodes is to use Amazon EKS managed node groups, which automate the provisioning and lifecycle management of the underlying Amazon EC2 instances through Amazon EC2 Auto Scaling groups. For dynamic adjustment of Amazon EC2 capacity, managed node groups can be paired with the Cluster Autoscaler solution, which watches for pending pods awaiting compute capacity as well as for underutilized worker nodes. When pods are pending due to insufficient resources, the Cluster Autoscaler increases the desired count of the Amazon EC2 Auto Scaling group, which provisions new worker nodes so those pods can be scheduled and run. The Cluster Autoscaler also terminates nodes that remain underutilized or unused beyond configurable thresholds and delays.
For CoStar’s workloads running on Amazon EKS, the goal was to maximize availability and performance while using resources efficiently. While the Cluster Autoscaler solution provides a degree of dynamic compute provisioning and cost-efficiency, it comes with considerations and limitations that can make it challenging or even restrictive to use. Namely, the Amazon EC2 instance types in a given node group must have similar Central Processing Unit (CPU), memory, and Graphics Processing Unit (GPU) specifications to minimize undesired behavior, because the Cluster Autoscaler uses the first instance type specified in the node group policy to simulate pod scheduling. If the policy includes additional instance types with higher specifications, node resources may be wasted after scaling out, since scheduling is simulated only against the size of the first instance type. If the policy includes instance types with lower specifications, pods may fail to schedule on those nodes due to insufficient node resources. To diversify instance sizes for CoStar’s varied pod resource requirements, they would have needed to create multiple node groups, each with similarly specified instance types. Furthermore, the Cluster Autoscaler only deprovisions underutilized nodes; it doesn’t replace them with cheaper instance types in response to changes in the workloads. Additionally, for CoStar’s stateless workloads, preferring and targeting Spot capacity for deeper discounts over On-Demand was cumbersome to implement with node groups.
Solution overview
Why Karpenter
CoStar needed a more efficient means of provisioning nodes for their diverse workload demands without the overhead of managing multiple node groups. This was addressed with the open-source Karpenter node provisioning solution. Karpenter is a flexible, high-performance Kubernetes cluster autoscaler that provides dynamic, groupless provisioning of worker node capacity in response to unschedulable pods. Because of Karpenter’s groupless design, CoStar was no longer tied to using similarly specified instance types. Karpenter continuously evaluates the aggregate resource requirements of pending pods and other scheduling constraints (e.g., node selectors, affinities, tolerations, and topology spread constraints) and provisions the optimal instance compute capacity as defined in the Provisioner Custom Resource Definition (CRD). With this added flexibility, different teams within CoStar can use their own Provisioner configurations based on their application and scaling needs. Additionally, Karpenter provisions nodes directly through the Amazon EC2 Fleet application programming interface (API) without the need for node groups and Amazon EC2 Auto Scaling groups, which enables quicker provisioning and retry times (i.e., milliseconds versus minutes) that support CoStar’s performance service level agreements (SLAs). Furthermore, the CoStar team elected to run the Karpenter controller on AWS Fargate, which eliminates the need for managed node groups altogether.
The following diagram illustrates how Karpenter observes the aggregate resource requests of unscheduled pods, makes decisions to launch new nodes, and terminates nodes to reduce infrastructure costs:
To achieve cost-effectiveness for CoStar’s stateless workloads and lower environments, the CoStar team configured the Karpenter Provisioner to prefer Spot capacity and only provision On-Demand capacity if no Spot capacity is available. Karpenter uses the price-capacity-optimized allocation strategy for Spot capacity, which balances cost and lowers the probability of interruptions in the near term. For stateful workloads in production clusters, the Karpenter Provisioner defines a selection of compute and storage optimized instance families running On-Demand, most of which is covered by Compute Savings Plans and Reserved Instances to obtain discounts. For further optimization, CoStar enabled the consolidation capability, which allows Karpenter to actively reduce cluster costs by monitoring the utilization of nodes and checking whether existing workloads can run on other nodes or be replaced with cheaper variants. By evaluating multiple factors, such as the number of running pods, configured node expiry times, use of lower priority pods, and existing Pod Disruption Budgets (PDBs), the consolidation actions are performed in a manner to minimize workload disruptions.
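A minimal sketch of a Provisioner along these lines is shown below, using the v1alpha5 Karpenter API; the resource name is illustrative, and the referenced `AWSNodeTemplate` is assumed to exist separately. Listing both `spot` and `on-demand` under the capacity-type requirement causes Karpenter to prefer Spot and fall back to On-Demand when Spot capacity is unavailable.

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: stateless-spot          # illustrative name
spec:
  requirements:
    # Prefer Spot; Karpenter falls back to On-Demand if no Spot capacity is available
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  # Let Karpenter actively consolidate workloads onto fewer or cheaper nodes
  consolidation:
    enabled: true
  providerRef:
    name: default               # assumes an AWSNodeTemplate named "default"
```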
Prerequisites
To carry out the example in this post, you’ll need to set up the following:
- Provision a Kubernetes cluster in AWS
- Install Karpenter for cluster autoscaling
- Install Amazon Elastic Kubernetes Service (Amazon EKS) Node Viewer
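For the last two steps, the commands below are one possible approach (the Helm chart values vary by Karpenter version and require IAM roles created beforehand; consult the Karpenter documentation for your cluster's exact setup):

```shell
# Install Karpenter into the cluster with Helm
# (values shown are illustrative; substitute your cluster name)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter --create-namespace \
  --set settings.aws.clusterName=<your-cluster-name>

# Install eks-node-viewer to visualize node usage (requires Go)
go install github.com/awslabs/eks-node-viewer/cmd/eks-node-viewer@latest
```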
Walkthrough
In this section, we provide a simple demonstration of the replace mechanism, which is part of Karpenter’s consolidation capability. The following Karpenter Provisioner and node template configuration constrains provisioning to a set of compute-optimized instance types with the On-Demand capacity type:
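A sketch consistent with that description is shown below, using the v1alpha5 Provisioner and AWSNodeTemplate APIs; the specific instance types, resource names, and discovery tags are illustrative assumptions.

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Constrain to a set of compute-optimized instance types (illustrative list)
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["c6a.xlarge", "c6a.2xlarge", "c6a.4xlarge", "c6a.8xlarge"]
    # On-Demand capacity only
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  # Enable consolidation so underutilized nodes can be replaced with cheaper ones
  consolidation:
    enabled: true
  providerRef:
    name: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  # Discover subnets and security groups tagged for this cluster
  subnetSelector:
    karpenter.sh/discovery: <your-cluster-name>
  securityGroupSelector:
    karpenter.sh/discovery: <your-cluster-name>
```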
Here is the application deployment manifest we use to demonstrate the consolidation behavior:
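A minimal stand-in for such a manifest is sketched below; the deployment name, image, replica count, and CPU request are illustrative assumptions, following the common pattern of using a pause container to reserve resources without doing real work.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate                 # illustrative name
spec:
  replicas: 60
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          # Pause container: consumes its resource request but performs no work
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: "400m"
```

With 60 replicas requesting 400m CPU each, the aggregate request of roughly 24 vCPU exceeds what a 16-vCPU node can hold, so Karpenter selects a larger compute-optimized instance such as a c6a.8xlarge (32 vCPU).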
Using the eks-node-viewer tool to visualize node usage, we observe that Karpenter provisioned a c6a.8xlarge instance, the most optimal type to run the deployment based on its resource requests (along with four daemon pods):
Let’s assume the traffic loads are lower during off business hours. You may use the Horizontal Pod Autoscaler to dynamically scale the number of pods based on resource utilization. In our example, we use kubectl to downscale the number of replicas to 30 to simulate a decrease in traffic loads:
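Assuming the deployment is named `inflate` (a hypothetical name; substitute your own), the scale-down can be issued as:

```shell
# Simulate lower off-hours traffic by reducing the replica count to 30
kubectl scale deployment/inflate --replicas=30
```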
With 30 replicas, the overall resource requirements are much lower. Karpenter provisions a replacement node of a cheaper variant (in this case, a c6a.4xlarge) and cordons the original node, as shown in the following screenshot.
After the pods are rescheduled on the replacement node, the previous node is terminated.
As the example shows, Karpenter gives CoStar the ability to scale efficiently and optimize for cost, provisioning the cheaper c6a.4xlarge as resource requirements decreased during off-business hours.
Cleaning up
To avoid incurring additional operational costs, remember to destroy all the infrastructure you created for the examples in this post.
Conclusion
There are many use cases and options to consider when evaluating container solutions on AWS. For CoStar, Karpenter consolidated the Amazon EC2 Spot capacity running in their Dev and Test environments and moved workloads to the lowest-cost instance types while still running those workloads effectively. Customers who need to maintain parity during migrations, want to continue using their deep investments in Kubernetes, and seek a proven path to cost optimization should consider Amazon EKS with Karpenter. For those already using Amazon EKS, the eks-node-viewer tool is available to assess node efficiency and determine whether you would benefit from Karpenter’s capabilities. Customers looking to consolidate pods onto fewer nodes and use the most cost-effective instance types should explore Karpenter and use its documentation to get started.