Containers

How Slack adopted Karpenter to increase operational and cost efficiency

Bedrock – Slack’s internal Kubernetes platform

Slack is the AI-powered platform for work that connects people, conversations, apps, and systems together in one place. Slack adopted Amazon Elastic Kubernetes Service (Amazon EKS) to build “Bedrock,” the codename for an internal compute orchestration platform that simplifies container deployment and management. Bedrock handles build, deploy, and runtime environments through a single YAML file, significantly reducing complexity for internal developers at Slack. It also leverages tools such as Jenkins, FQDN service discovery, and the Nebula overlay network for efficient operations. With over 80% of Slack applications now operating on Bedrock, Slack has improved testing precision and refined infrastructure management, empowering developers to operate with greater efficiency.

In this post, we dive into Slack’s journey to modernize its container platform on Amazon EKS and how they increased cost savings and improved operational efficiency by leveraging Karpenter.

Opportunity for improved operational efficiency

Before Karpenter, we leveraged another solution to autoscale Amazon EKS compute, but encountered limitations as our internal teams’ needs grew. With more applications and a growing variety of instance types and instance needs, we were managing multiple Auto Scaling groups (ASGs). This became a challenge, as Slack has strict compliance requirements and needed to make frequent updates to EKS clusters with thousands of worker nodes. These frequent updates, coupled with managing multiple ASGs, further slowed our overall upgrade cadence. We were also concerned about the single-replica architecture of the autoscaler in that setup. Slack needed a resilient autoscaler that could launch nodes faster, offered high availability, and provided better cluster bin packing.

Solution overview

To overcome the operational challenges mentioned previously, Slack decided to use Karpenter, an open-source cluster autoscaler that automatically provisions new nodes in response to unschedulable pods. Karpenter evaluates the aggregate resource requirements of the pending pods and chooses the optimal instance type to run them. It automatically scales in or terminates instances that are no longer running non-daemonset pods to reduce waste. It also supports a consolidation feature that actively moves pods and either deletes nodes or replaces them with cheaper alternatives to reduce cluster cost.
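
To make this concrete, consolidation and instance flexibility are configured declaratively on a NodePool. The following is a minimal sketch using the Karpenter v1 API (field names differ in older versions); the names and values here are illustrative, not Slack’s actual configuration:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # illustrative EC2NodeClass name
      requirements:
        # Keep instance choice broad so Karpenter can bin pack effectively
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    # Delete empty nodes and replace underutilized ones with cheaper instances
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```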

All of these features addressed Slack’s challenges, and with the help of the AWS team, we successfully validated Karpenter in our Bedrock environment. In addition to the features mentioned previously, Karpenter makes direct Amazon Elastic Compute Cloud (Amazon EC2) Fleet API calls, which makes sure that we get the right compute without delays.

We started our journey with a careful two-phase rollout.

In the first phase, we rolled out Karpenter in a managed node group alongside the core deployments and applications. Karpenter was validated for a subset of the applications, and consolidation was disabled in this phase.

In the second phase, we moved the Karpenter controller workloads to their own ASGs, since we didn’t want Karpenter to run on Karpenter-managed nodes. After intense testing and careful consideration of all use cases, we finally rolled out Karpenter across our fleet of 200+ EKS clusters running thousands of worker nodes. Slack also enabled Karpenter’s consolidation feature to achieve significant cost savings.
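
A common way to keep the controller off Karpenter-managed capacity is to pin it to the dedicated ASG nodes through the Helm chart’s scheduling values. This is a hedged sketch, assuming a hypothetical node label and taint on that ASG (chart value names can vary by chart version):

```yaml
# karpenter-values.yaml (illustrative)
replicas: 2                        # multiple replicas instead of a single-replica setup
nodeSelector:
  node-group: karpenter-system     # hypothetical label applied to the dedicated ASG nodes
tolerations:
  - key: dedicated
    value: karpenter
    effect: NoSchedule             # matches a hypothetical taint on those nodes
```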

Due to the phased rollout of Karpenter, we could control which clusters had Karpenter enabled. This allowed us to validate workload performance on Karpenter and quickly roll back when issues were reported. When a workload didn’t have proper requests/limits, Karpenter would allocate smaller instances or only a small portion of a large instance, resulting in high pod churn when load increased. Karpenter helped Slack discover this, in turn helping Slack improve its platform by working with service owners to make sure that pods were set up with resource requirements that got them allocated on the proper nodes. For workloads that needed specific instance types, Slack was able to tweak the NodePool custom resource and used well-known labels with Karpenter to pin pods to the relevant instance types, as sketched below.
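
For example, a workload can state its resource needs explicitly and, where required, pin itself to an instance type through a well-known label. This is an illustrative sketch, not one of Slack’s actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service            # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      nodeSelector:
        # Well-known label understood by both Karpenter and the scheduler
        node.kubernetes.io/instance-type: m5.8xlarge
      containers:
        - name: app
          image: example/service:latest   # hypothetical image
          resources:
            requests:              # explicit requests let Karpenter size nodes correctly
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
```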

Architecture of Slack’s Bedrock EKS cluster

Figure 1: Slack’s EKS Architecture

Achieved outcomes

Following the rollout of Karpenter across our fleet, we began tainting the ASG nodes and transitioning applications to Karpenter-managed instances. The outcomes of this initiative were both significant and quantifiable.

With the help of Karpenter, Slack was able to successfully bin pack applications and leverage a wide range of instance sizes, from 8xlarge to 32xlarge, based on the resources requested by pending pods. This resulted in increased cluster utilization and cost savings. Workloads without specific instance requirements began efficiently using available resources across the board. Moreover, Karpenter’s consolidation eliminated idle instances, rather than retaining them as part of a minimum ASG size across Availability Zones (AZs), as was the case with our previous solution. Additionally, we observed accelerated node provisioning owing to Karpenter’s prompt scaling decisions.

To summarize, the dynamic selection of instances based on pod resource demands, coupled with the elimination of hardcoded instance types in cluster Terraform files, facilitated quicker pod launches. Concerns regarding Slack’s system upgrades were also alleviated, as we could swiftly drain and rotate nodes during upgrade procedures. Karpenter’s ability to interact directly with the Amazon EC2 API, along with its improved retry mechanism, made sure we recovered faster in case of AZ failures. Applications can scale into as much instance capacity as AWS has available, which means less overhead for us in managing ASGs and a better experience for Slack’s users! Slack runs Karpenter today with custom overprovisioning to provide a buffer for its mission-critical applications during burst scaling activities.
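
The post doesn’t detail Slack’s overprovisioning mechanism, but a common pattern is a low-priority “pause” deployment that holds warm capacity and is evicted as soon as real pods need the room, prompting Karpenter to provision replacement nodes. A minimal sketch of that pattern:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                         # lower than any real workload, so buffer pods are preempted first
globalDefault: false
description: Placeholder pods that reserve headroom for burst scaling
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer            # hypothetical name
spec:
  replicas: 10                     # size of the warm buffer
  selector:
    matchLabels:
      app: capacity-buffer
  template:
    metadata:
      labels:
        app: capacity-buffer
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:              # each replica reserves 1 vCPU and 1 GiB of headroom
              cpu: "1"
              memory: 1Gi
```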

By leveraging Helm’s templating feature, Slack has customized the Karpenter Helm chart and uses a single NodePool and EC2NodeClass across 200+ EKS clusters.
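
A hedged sketch of what that templating can look like for the EC2NodeClass; the `.Values.cluster.*` value names are hypothetical, not Slack’s chart:

```yaml
# templates/ec2nodeclass.yaml (illustrative Helm template)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest         # resolve to the latest Amazon Linux 2023 AMI
  role: {{ .Values.cluster.nodeRole }}   # hypothetical per-cluster node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: {{ .Values.cluster.name }}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: {{ .Values.cluster.name }}
```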

With the wide selection of instance types available in the instance families Karpenter provides, engineering teams at Slack find it helpful to switch from one instance type to another using dynamic scheduling constraints. This has reduced the run-the-business (RTB) burden for infrastructure teams and de-risked instance type changes compared to maintaining ASG configurations. By leveraging Karpenter, Slack was able to achieve 12% savings on its EKS compute cost.

Figure 2: Bin packing efficiency

Future enhancements

Slack is currently working on streamlining the current Karpenter configuration to further improve operations and capture more cost savings. Some of the features on the roadmap include:

  • Managed Karpenter: This helps Slack focus on the constraints to run the pods while leaving the heavy lifting of managing the Karpenter controller to AWS.
  • Customizing kubelets: Use the kubelet configuration within Karpenter’s EC2NodeClass rather than passing it through the Infrastructure as Code (IaC) solution, to improve instance boot times.
  • Warm pool/minimum: As Karpenter is open source, Slack is exploring ways to contribute to reducing bootstrapping time by making Karpenter pick instances from a warm pool rather than making Amazon EC2 Fleet API calls.
  • Disruption control: Slack leverages disruption control to manage the disruptions caused by consolidation and to limit the impact on application availability, as sketched after this list.
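
For reference, the kubelet customization and disruption control items above map to declarative fields in Karpenter’s v1 API. The following is an illustrative sketch, not Slack’s configuration, and field availability varies by Karpenter version:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # ...AMI, role, and subnet selectors elided...
  kubelet:
    maxPods: 110                   # example kubelet setting applied at node boot
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "10%"               # disrupt at most 10% of nodes at a time
      - schedule: "0 9 * * mon-fri"
        duration: 8h
        nodes: "0"                 # block voluntary disruptions during business hours
```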

Conclusion

In this post, we discussed how Slack’s Bedrock team improved Amazon EKS cluster operations and cost savings by implementing Karpenter. The collaboration between AWS and Slack was crucial to the successful rollout of Karpenter and the modernization of Slack’s Amazon EKS environment. We also discussed how Slack was able to improve its upgrade cadence and increase its cost savings by using Karpenter as the autoscaler for its EKS clusters. Looking ahead, Slack is focused on further optimizing its Karpenter environment by contributing to and leveraging new features to build a robust platform on top of Amazon EKS.

Vikram Venkataraman

Vikram Venkataraman is a Principal Solutions Architect at Amazon Web Services and a container enthusiast. He helps organizations with best practices for running workloads on AWS. In his spare time, he loves to play with his two kids and follows cricket.

Chandra Vellaichamy

Chandra Vellaichamy is a Senior Customer Solutions Manager for Strategic Accounts at AWS. He has 25 years of experience managing business applications across multiple domains and industries, including high tech, supply chain, and finance. In his current role at AWS, he collaborates with his team to ensure that customers attain their targeted outcomes during their AWS journey.

Ganesh Kumar Kattamuri

Ganesh Kumar Kattamuri is a Staff Infrastructure Engineer at Slack, working as the technical lead within the Bedrock Platform Team. He dedicates his time to delivering the best, most cost-effective platform to developers, allowing them to run their services seamlessly in the cloud and focus more on product development.

Gene Ting

Gene Ting is a Principal Solutions Architect at Amazon Web Services. He is focused on helping enterprise customers build and operate workloads securely on AWS. In his free time, Gene enjoys teaching kids technology and sports, as well as following the latest in cybersecurity.

Harpreet Singh

Harpreet Singh is a Senior Infrastructure Engineer at Slack in the Bedrock Platform Team. He is passionate about improving the core Kubernetes experience to enable teams using the platform to scale efficiently and reliably while removing operational burdens on the service owners.