How CoStar uses Karpenter to optimize their Amazon EKS Resources

Introduction

CoStar is well known as a market leader in Commercial Real Estate data, but they also run major home, rental, and apartment websites, including apartments.com, which many have seen advertised by Jeff Goldblum. CoStar’s traditional Commercial Real Estate customers are highly informed users who rely on large and complex data to make critical business decisions. Successfully helping customers analyze and decide which of 6 million properties, spanning 130 billion sq. ft. of space, to rent has made CoStar a leader in data and analytics technology. When CoStar began building the next generation of their Apartments and Homes websites, it became clear that the user profile and customer demands differed in important ways from those of their long-running Commercial Real Estate customers. CoStar needed to deliver the same decision-making value to their new customer base, but for orders of magnitude more customers and data. This initiated CoStar’s migration from their legacy data centers into AWS for the speed and elasticity needed to deliver the same value to millions of users accessing hundreds of millions of properties.

Challenge

CoStar’s biggest challenge has always been to collect data from hundreds of sources, enrich that data with important insights, and deliver it in a meaningful and user-friendly system. CoStar Suite’s Commercial Real Estate, Apartments, and Homes products all have different data sources that update at different times and with different volumes of data. The systems that support this data ingestion and these updates must be fast, accurate, and able to scale up and down to remain affordable. Many of these systems are being migrated from legacy data centers into CoStar’s AWS environment, so running them on parallel, interoperable systems was necessary to avoid massive duplication of engineering support. These needs all pointed to running Kubernetes both on premises and in AWS, with the ability to scale their container clusters as usage increases and decreases. After months of successful testing and production, CoStar decided to optimize their engineering stack even further, while still keeping on-premises Kubernetes management as parallel as possible.

In the Kubernetes cluster architecture, the control plane and its components are responsible for managing cluster operations (e.g., scheduling containers, managing application availability, and storing cluster data, among other key tasks), while worker nodes host the pods that run containerized application workloads. Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service on AWS that manages the availability and scalability of the Kubernetes control plane. For the worker nodes, customers have the option to schedule Kubernetes pod workloads on any combination of provisioned Amazon Elastic Compute Cloud (Amazon EC2) instances and AWS Fargate. In this post, we’ll focus on how CoStar used the Karpenter autoscaling solution to provision Amazon EC2 instances for the worker nodes.

The default method to provision worker nodes is to use Amazon EKS managed node groups, which automate the provisioning and lifecycle management of the underlying Amazon EC2 instances using Amazon EC2 Auto Scaling groups. For dynamic adjustment of Amazon EC2 capacity, managed node groups can be paired with the Cluster Autoscaler solution. This autoscaler watches for pending pods that are waiting for compute capacity, as well as for underutilized worker nodes. When pods are pending due to insufficient resources, the Cluster Autoscaler increases the desired number of instances in the Amazon EC2 Auto Scaling group, which provisions new worker nodes and allows those pods to be scheduled and run. Conversely, the Cluster Autoscaler terminates nodes that have been underutilized for a period of time, provided their pods can be scheduled onto other nodes.
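For illustration, here is a minimal eksctl sketch of this pattern (the cluster and node group names are hypothetical, not from CoStar's setup): a managed node group with similarly sized instance types, a fixed size range, and the tags that Cluster Autoscaler's auto-discovery looks for:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster  # hypothetical cluster name
  region: us-east-1
managedNodeGroups:
  - name: general-purpose-ng
    # instance types kept to similar CPU/memory specs (a constraint discussed below)
    instanceTypes: ["c5.xlarge", "c5a.xlarge", "c6i.xlarge"]
    minSize: 1
    maxSize: 10
    desiredCapacity: 2
    tags:
      # tags that let Cluster Autoscaler auto-discover this group
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/example-cluster: "owned"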

For CoStar’s workloads running on Amazon EKS, the goal was to maximize availability and performance while remaining resource- and cost-efficient. While the Cluster Autoscaler solution provides a degree of dynamic compute provisioning and cost-efficiency, it comes with considerations and limitations that can make it challenging, or even restrictive, to use. Namely, the Amazon EC2 instance types within a given node group must have similar Central Processing Unit (CPU), memory, and Graphics Processing Unit (GPU) specifications to minimize undesired behavior. This is because the Cluster Autoscaler uses the first instance type specified in the node group policy to simulate the scheduling of pods. If the policy includes additional instance types with higher specs, node resources may be wasted after scaling out, since pods are only scheduled based on the size of the first instance type. If the policy includes additional instance types with lower specs, then pods may fail to schedule on those nodes due to resource constraints. To diversify instance sizes to accommodate CoStar’s varied pod resource requirements, they needed to create multiple node groups, each with similarly specified instance types. Furthermore, the Cluster Autoscaler only deprovisions underutilized nodes; it doesn’t replace them with cheaper instance types in response to changes in the workloads. Additionally, for CoStar’s stateless workloads, preferring and targeting Spot capacity for deeper discounts over On-Demand was cumbersome to implement with node groups.

Solution overview

Why Karpenter

CoStar needed a more efficient means of provisioning nodes for their diverse workload demands, without the overhead of managing multiple node groups. This was addressed with the open-source Karpenter node provisioning solution. Karpenter is a flexible, high-performance Kubernetes cluster autoscaler that provides dynamic, groupless provisioning of worker node capacity in response to unschedulable pods. Because of Karpenter’s groupless design, CoStar was no longer tied to similarly specified instance types. Karpenter continuously evaluates the aggregate resource requirements of pending pods and other scheduling constraints (e.g., node selectors, affinities, tolerations, and topology spread constraints) and provisions the optimal instance compute capacity as defined in the Provisioner Custom Resource Definition (CRD). With this added flexibility, different teams within CoStar can use their own Provisioner configurations based on their application and scaling needs. Additionally, Karpenter provisions nodes directly using the Amazon EC2 Fleet application programming interface (API), without node groups or Amazon EC2 Auto Scaling groups, which enables quicker provisioning and retry times (i.e., milliseconds versus minutes) that enhance CoStar’s performance service level agreements (SLAs). Furthermore, the CoStar team elected to run the Karpenter controller on AWS Fargate, which eliminates the need for managed node groups altogether.
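As a sketch of that last point, running the controller on Fargate only requires a Fargate profile that selects the namespace Karpenter is installed into (the cluster name and namespace below are illustrative defaults, not CoStar's actual configuration):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster  # hypothetical cluster name
  region: us-east-1
fargateProfiles:
  - name: karpenter
    selectors:
      # run all pods in the karpenter namespace on Fargate,
      # so the controller needs no EC2 node group of its own
      - namespace: karpenter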

The following diagram illustrates how Karpenter observes the aggregate resource requests of unscheduled pods, makes decisions to launch new nodes, and terminates them to reduce infrastructure costs:

An architectural diagram that illustrates a typical Karpenter architecture.

To achieve cost-effectiveness for CoStar’s stateless workloads and lower environments, the CoStar team configured the Karpenter Provisioner to prefer Spot capacity and only provision On-Demand capacity when no Spot capacity is available. Karpenter uses the price-capacity-optimized allocation strategy for Spot capacity, which balances cost against the probability of interruptions in the near term. For stateful workloads in production clusters, the Karpenter Provisioner defines a selection of compute- and storage-optimized instance families running On-Demand, most of which is covered by Compute Savings Plans and Reserved Instances to obtain discounts. For further optimization, CoStar enabled the consolidation capability, which allows Karpenter to actively reduce cluster costs by monitoring node utilization and checking whether existing workloads can run on other nodes or be replaced with cheaper variants. By evaluating factors such as the number of running pods, configured node expiry times, lower-priority pods, and existing Pod Disruption Budgets (PDBs), consolidation actions are performed in a manner that minimizes workload disruption.
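A Provisioner along these lines could express that Spot preference (the names here are illustrative; with the v1alpha5 API, listing both capacity types lets Karpenter favor Spot and fall back to On-Demand when Spot is unavailable):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-preferred  # hypothetical name
spec:
  providerRef:
    name: default  # references an AWSNodeTemplate, as in the walkthrough below
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      # Karpenter chooses Spot when available, falling back to On-Demand
      values: ["spot", "on-demand"]
  # allow Karpenter to consolidate underutilized nodes
  consolidation:
    enabled: true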

Prerequisites

To carry out the example in this post, you’ll need to set up the following:

- An Amazon EKS cluster with Karpenter installed and configured
- kubectl configured to communicate with your cluster
- The eks-node-viewer tool to visualize node usage

Walkthrough

In this section, we provide a simple demonstration of the replace mechanism, which is part of Karpenter’s consolidation capability. The following Karpenter Provisioner and node template configuration constrains provisioning to a set of compute-optimized instance families with the On-Demand capacity type:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: consolidation-replace
spec:
  # reference the AWSNodeTemplate defined below
  providerRef:
    name: consolidation-enabled
  requirements:
    # restrict provisioning to On-Demand capacity
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]
    # restrict provisioning to compute-optimized instance families
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["c4", "c5", "c5a", "c5n", "c6a", "c6i", "c6in"]
  # enable consolidation so Karpenter can replace underutilized nodes
  consolidation:
    enabled: true
  # label that the demo deployment's nodeSelector targets
  labels:
    type: karpenter-node
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: consolidation-enabled
spec:
  # discover subnets and security groups by tag
  subnetSelector:
    karpenter.sh/discovery: consolidation-subnet-example
  securityGroupSelector:
    karpenter.sh/discovery: consolidation-sg-example
  tags:
    app.kubernetes.io/created-by: consolidation-replace-example

Here is the application deployment manifest we use to demonstrate the consolidation behavior:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: consolidation-replace-deployment
  namespace: default
spec:
  replicas: 60
  selector:
    matchLabels:
      app: consolidation-replace-deployment
  template:
    metadata:
      labels:
        app: consolidation-replace-deployment
    spec:
      # target only nodes created by the Provisioner above (via its label)
      nodeSelector:
        type: karpenter-node
      containers:
        - name: consolidation-replace-container
          # pause is a minimal placeholder image; only the resource
          # requests matter for this consolidation demonstration
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.8
          resources:
            requests:
              memory: "128Mi"
              cpu: "500m"

Using the eks-node-viewer tool to visualize node usage, we observe that Karpenter provisioned a c6a.8xlarge, the instance type best suited to run the deployment based on its aggregate resource requests (along with four daemon pods):

Image showing the eks-node-viewer output for the c6a.8xlarge instance type.
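To reproduce this view yourself (assuming eks-node-viewer is installed and your kubeconfig points at the cluster), you can run the tool and select the resources to display:

eks-node-viewer --resources cpu,memory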

Let’s assume traffic is lower outside of business hours. You could use the Horizontal Pod Autoscaler to dynamically scale the number of pods based on resource utilization, as sketched after the next command. In our example, we use kubectl to downscale the number of replicas to 30 to simulate a decrease in traffic:

kubectl scale deployment consolidation-replace-deployment --replicas 30
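In a real environment, rather than scaling manually, a Horizontal Pod Autoscaler along these lines could adjust replicas automatically (the name and utilization threshold below are illustrative, not part of the original example):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: consolidation-replace-hpa  # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: consolidation-replace-deployment
  minReplicas: 30
  maxReplicas: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # illustrative threshold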

With 30 replicas, the overall resource requirements are much lower. Karpenter provisions a replacement node with a cheaper variant (in this case, a c6a.4xlarge) and cordons the original node, as shown in the following screenshot.

Image showing replacement c6a.4xlarge instance types being provisioned.
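You can also observe this from the command line. While the replacement is in progress, the cordoned node reports Ready,SchedulingDisabled in the STATUS column:

kubectl get nodes -l type=karpenter-node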

After the pods are rescheduled on the replacement node, the previous node is terminated.

Image showing that only the c6a.4xlarge instances currently exist.

As you can see from our example, Karpenter gives CoStar the ability to scale efficiently and optimize for cost by provisioning the c6a.4xlarge as resource requirements decrease outside of business hours.

Cleaning up

To avoid incurring additional operational costs, remember to destroy all the infrastructure you created for the examples in this post.
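Deleting the demo resources lets Karpenter deprovision the nodes it launched. A minimal cleanup, assuming the resource names used in this walkthrough:

kubectl delete deployment consolidation-replace-deployment
kubectl delete provisioner consolidation-replace
kubectl delete awsnodetemplate consolidation-enabled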

Conclusion

There are many use cases and options to evaluate among AWS container solutions. For CoStar, Karpenter consolidated the Amazon EC2 Spot capacity running in their Dev and Test environments and moved workloads to the lowest-cost instance types that still run them effectively. Customers focused on maintaining parity during migrations, continuing to use deep investments in Kubernetes, and seeking a proven path to cost optimization should consider Amazon EKS with Karpenter. For those already using Amazon EKS, the eks-node-viewer tool is available to assess node efficiency and whether you would benefit from Karpenter’s capabilities. Customers looking to consolidate pods onto fewer nodes and use the most cost-effective instance types should begin exploring Karpenter and use the documentation to get started.

Muhammed Karakas

Muhammed Karakas is a Senior Technical Account Manager at Amazon Web Services with a focus on Container services. He is passionate about problem solving and helping customers with their cloud journeys.

Josh Manner

Josh Manner is a Senior Technical Account Manager at Amazon Web Services with a focus on Machine Learning and Artificial Intelligence. He works with a variety of customers, helping them with cloud adoption, cost optimization, and emerging technologies.

Peter Ildefonso

Peter Ildefonso is an Enterprise Solutions Architect at Amazon Web Services. He is responsible for working with Financial Services Customers to identify business problems and working backwards to identify viable and scalable technical solutions. Peter has been helping customers plan and migrate critical workloads for more than 10 years with a recent focus on securely operationalizing data to provide Capital Markets buy-side customers with more effective and efficient decision making.