Amazon AWS Official Blog

Migrating from Cluster Autoscaler to Karpenter on Amazon EKS

Introduction

Karpenter is a production-ready, open-source node provisioning controller built by AWS for Kubernetes. Compared with the traditional Cluster Autoscaler, Karpenter offers faster scheduling, greater flexibility, and higher resource utilization, making it the preferred auto scaling solution for Amazon Elastic Kubernetes Service (EKS); see the comparison below. This post walks through migrating from Cluster Autoscaler to Karpenter on EKS.

Cluster Autoscaler vs. Karpenter

Resource management
- Cluster Autoscaler: Takes a reactive approach, scaling nodes based on the resource utilization of existing nodes.
- Karpenter: Takes a proactive approach, provisioning nodes based on the current resource requirements of unschedulable pods.

Node management
- Cluster Autoscaler: Manages nodes using predefined Auto Scaling groups, based on the resource demands of the present workload.
- Karpenter: Scales, provisions, and manages nodes based on the configuration of custom Provisioners.

Scaling
- Cluster Autoscaler: Focuses on node-level scaling; it can effectively add nodes to meet increased demand, but may be less effective at scaling resources back down.
- Karpenter: Offers more effective and granular scaling based on specific workload requirements; in other words, it scales according to actual usage, and lets users specify scaling policies or rules to match their needs.

Scheduling
- Cluster Autoscaler: Scheduling is simpler, since it is designed to scale up or down based on the present requirements of the workload.
- Karpenter: Can schedule workloads effectively based on factors such as Availability Zones and resource requirements; it can optimize for the cheapest pricing via Spot, but is unaware of commitments such as RIs or Savings Plans.

Migration Steps Overview

To demonstrate the migration, we use an EKS 1.26 cluster to simulate a production environment. The cluster uses 3 public subnets and 3 private subnets, runs the current workload on a single managed node group, and has its OIDC provider associated with IAM for service account integration (IRSA); eksctl and the AWS CLI are installed and configured in advance. These setup steps are omitted here; refer to the official quick start guide.

To minimize the migration's impact on applications, we strongly recommend configuring Pod Disruption Budgets (PDBs). During the migration we need to evict application pods and shrink the managed node group; with a PDB in place, the number of running replicas never drops below the ratio you configure. For example, if a deployment currently runs 10 pods and you set minAvailable to "50%", at least 5 pods are guaranteed to keep running during the disruption.
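For illustration, a minimal sketch of this minAvailable form is shown below, assuming the deployment's pods carry the label app: nginx; the walkthrough below uses the equivalent maxUnavailable form instead. Do not apply two overlapping PDBs to the same pods, since multiple PDBs matching one pod cause evictions to fail:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-min-available   # hypothetical name, for illustration only
spec:
  minAvailable: 50%           # at least half of the matching pods must keep running
  selector:
    matchLabels:
      app: nginx              # assumed pod label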

Our migration steps are therefore roughly as follows:

  1. Configure PDBs for the application
  2. Prepare the Karpenter runtime environment
  3. Install Karpenter
  4. Migrate pods and disable Cluster Autoscaler

Part 1: Configure the PDB

1. Check the EKS cluster information

$ kubectl get node
NAME                                                STATUS   ROLES    AGE   VERSION
ip-192-168-31-163.ap-southeast-1.compute.internal   Ready    <none>   40h   v1.26.4-eks-0a21954
ip-192-168-44-153.ap-southeast-1.compute.internal   Ready    <none>   40h   v1.26.4-eks-0a21954
ip-192-168-7-103.ap-southeast-1.compute.internal    Ready    <none>   40h   v1.26.4-eks-0a21954

$ eksctl get nodegroup --region ap-southeast-1 --cluster prd-eks
CLUSTER NODEGROUP       STATUS   CREATED                 MIN SIZE     MAX SIZE   DESIRED CAPACITY    INSTANCE TYPE   IMAGE ID        ASG NAME                                               TYPE
prd-eks worknode        ACTIVE   2023-05-31T02:26:25Z       2            4             3               t3.medium     AL2_x86_64      eks-worknode-c2c437d0-581a-8182-d61f-d7888271bfbb      managed

2. Deploy a test Nginx application

Create an nginx.yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 4
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

kubectl apply -f nginx.yaml

3. Deploy the PDB

Create an nginx-pdb.yaml file:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 50%
  selector:
    matchLabels:
      app: nginx

kubectl apply -f nginx-pdb.yaml

Check the PDB status. This way we can guarantee that at most 50% of the pods are affected during the migration.

kubectl get poddisruptionbudgets 

NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   N/A             50%               2                     3m9s

Part 2: Prepare the Karpenter Runtime Environment

1. Prepare the environment variables

CLUSTER_NAME=<your cluster name>
AWS_PARTITION="aws" # if you are not using standard partitions, you may need to configure to aws-cn / aws-us-gov
AWS_REGION="$(aws configure list | grep region | tr -s " " | cut -d" " -f3)"
OIDC_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} \
    --query "cluster.identity.oidc.issuer" --output text)"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' \
    --output text)
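Optionally, sanity-check these variables before continuing; an empty value means the corresponding lookup failed:

echo "CLUSTER_NAME=${CLUSTER_NAME} AWS_REGION=${AWS_REGION} AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID}"
echo "OIDC_ENDPOINT=${OIDC_ENDPOINT}"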

2. Create the two IAM roles Karpenter needs

1) Create the role needed by Karpenter nodes

echo '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}' > node-trust-policy.json
aws iam create-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --assume-role-policy-document file://node-trust-policy.json

2) Attach policies to this role

aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonSSMManagedInstanceCore
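As an optional check (not part of the original walkthrough), confirm that all four managed policies are attached:

aws iam list-attached-role-policies \
    --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --query 'AttachedPolicies[].PolicyName' --output text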

3) Add the role to an EC2 instance profile

aws iam create-instance-profile \
    --instance-profile-name "KarpenterNodeInstanceProfile-${CLUSTER_NAME}"

aws iam add-role-to-instance-profile \
    --instance-profile-name "KarpenterNodeInstanceProfile-${CLUSTER_NAME}" \
    --role-name "KarpenterNodeRole-${CLUSTER_NAME}"

// Output:

{
    "InstanceProfile": {
        "Path": "/",
        "InstanceProfileName": "KarpenterNodeInstanceProfile-prd-eks",
        "InstanceProfileId": "AIPAXGRAMRX5HISZDSH2O",
        "Arn": "arn:aws:iam::495062xxxxx:instance-profile/KarpenterNodeInstanceProfile-prd-eks",
        "CreateDate": "2023-05-31T03:31:40Z",
        "Roles": []
    }
}

4) Create the role needed by the Karpenter controller itself; it relies on OIDC for IRSA (IAM Roles for Service Accounts) authorization

$ cat << EOF > controller-trust-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ENDPOINT#*//}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "${OIDC_ENDPOINT#*//}:aud": "sts.amazonaws.com",
                    "${OIDC_ENDPOINT#*//}:sub": "system:serviceaccount:karpenter:karpenter"
                }
            }
        }
    ]
}
EOF


$ aws iam create-role --role-name KarpenterControllerRole-${CLUSTER_NAME} \
    --assume-role-policy-document file://controller-trust-policy.json
cat << EOF > controller-policy.json
{
    "Statement": [
        {
            "Action": [
                "ssm:GetParameter",
                "ec2:DescribeImages",
                "ec2:RunInstances",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeAvailabilityZones",
                "ec2:DeleteLaunchTemplate",
                "ec2:CreateTags",
                "ec2:CreateLaunchTemplate",
                "ec2:CreateFleet",
                "ec2:DescribeSpotPriceHistory",
                "pricing:GetProducts"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "Karpenter"
        },
        {
            "Action": "ec2:TerminateInstances",
            "Condition": {
                "StringLike": {
                    "ec2:ResourceTag/karpenter.sh/provisioner-name": "*"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "ConditionalEC2Termination"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}",
            "Sid": "PassNodeIAMRole"
        },
        {
            "Effect": "Allow",
            "Action": "eks:DescribeCluster",
            "Resource": "arn:${AWS_PARTITION}:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/${CLUSTER_NAME}",
            "Sid": "EKSClusterEndpointLookup"
        }
    ],
    "Version": "2012-10-17"
}
EOF


$ aws iam put-role-policy --role-name KarpenterControllerRole-${CLUSTER_NAME} \
    --policy-name KarpenterControllerPolicy-${CLUSTER_NAME} \
    --policy-document file://controller-policy.json

// Output:

{
    "Role": {
        "Path": "/",
        "RoleName": "KarpenterControllerRole-prd-eks",
        "RoleId": "AROAXGRAMRX5A5OS6FYJ3",
        "Arn": "arn:aws:iam::495062xxxxx:role/KarpenterControllerRole-prd-eks",
        "CreateDate": "2023-05-31T03:35:56Z",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws:iam::495062xxxxxx:oidc-provider/oidc.eks.ap-southeast-1.amazonaws.com/id/BC13F53BC96566C2087EEF318D307864"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "oidc.eks.ap-southeast-1.amazonaws.com/id/BC13F53BC96566C2087EEF318D307864:aud": "sts.amazonaws.com",
                            "oidc.eks.ap-southeast-1.amazonaws.com/id/BC13F53BC96566C2087EEF318D307864:sub": "system:serviceaccount:karpenter:karpenter"
                        }
                    }
                }
            ]
        }
    }
}
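As an optional check, confirm the inline policy is present on the controller role:

aws iam get-role-policy --role-name KarpenterControllerRole-${CLUSTER_NAME} \
    --policy-name KarpenterControllerPolicy-${CLUSTER_NAME} \
    --query PolicyName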

3. Add tags to all subnets and security groups

1) Tag the subnets

for NODEGROUP in $(aws eks list-nodegroups --cluster-name ${CLUSTER_NAME} \
    --query 'nodegroups' --output text); do aws ec2 create-tags \
        --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" \
        --resources $(aws eks describe-nodegroup --cluster-name ${CLUSTER_NAME} \
        --nodegroup-name $NODEGROUP --query 'nodegroup.subnets' --output text )
done

2) Tag the security groups of the managed node group's launch template

NODEGROUP=$(aws eks list-nodegroups --cluster-name ${CLUSTER_NAME} \
    --query 'nodegroups[0]' --output text)
LAUNCH_TEMPLATE=$(aws eks describe-nodegroup --cluster-name ${CLUSTER_NAME} \
    --nodegroup-name ${NODEGROUP} --query 'nodegroup.launchTemplate.{id:id,version:version}' \
    --output text | tr -s "\t" ",")
SECURITY_GROUPS=$(aws ec2 describe-launch-template-versions \
    --launch-template-id ${LAUNCH_TEMPLATE%,*} --versions ${LAUNCH_TEMPLATE#*,} \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.[NetworkInterfaces[0].Groups||SecurityGroupIds]' \
    --output text)
aws ec2 create-tags \
    --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" \
    --resources ${SECURITY_GROUPS}
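These karpenter.sh/discovery tags are what the AWSNodeTemplate's subnetSelector and securityGroupSelector will match later. As an optional check, list the tagged resources:

aws ec2 describe-subnets \
    --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query 'Subnets[].SubnetId' --output text
aws ec2 describe-security-groups \
    --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query 'SecurityGroups[].GroupId' --output text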

4. Update the aws-auth ConfigMap

We need to grant the node role we just created for Karpenter permission to join the cluster.

kubectl edit configmap aws-auth -n kube-system

In the official guide's template, replace the {AWS_PARTITION} variable with your account partition, {AWS_ACCOUNT_ID} with your account ID, and {CLUSTER_NAME} with your cluster name, but do not replace {{EC2PrivateDNSName}}; that token is resolved by EKS itself. The example below already shows concrete values.

Check the following and add the second group entry (the second - groups: block, which maps the Karpenter node role):

apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::495062xxxxx:role/eksctl-prd-eks-nodegroup-worknode-NodeInstanceRole-1WM8CGG0R5KS3
      username: system:node:{{EC2PrivateDNSName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::495062xxxxx:role/KarpenterNodeRole-prd-eks
      username: system:node:{{EC2PrivateDNSName}}
kind: ConfigMap
metadata:
  creationTimestamp: "2023-05-31T02:27:10Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "1143"
  uid: 1c952f89-d8a3-4c62-8b44-d0d070f6460a
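If the cluster was created with eksctl, you can optionally verify the identity mappings with:

eksctl get iamidentitymapping --cluster ${CLUSTER_NAME} --region ${AWS_REGION}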

Part 3: Deploy Karpenter

1. Set the environment variable

// Set the Karpenter version; v0.27.5 is the latest version at the time of writing

export KARPENTER_VERSION=v0.27.5

2. Generate the Karpenter YAML template

helm template karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} --namespace karpenter \
    --set settings.aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
    --set settings.aws.clusterName=${CLUSTER_NAME} \
    --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}" \
    --set controller.resources.requests.cpu=1 \
    --set controller.resources.requests.memory=1Gi \
    --set controller.resources.limits.cpu=1 \
    --set controller.resources.limits.memory=1Gi > karpenter.yaml
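Note that helm template pulls the chart from an OCI registry, which requires Helm v3.8 or later; check your client with:

helm version --short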

3. Configure node affinity

For system-critical workloads such as CoreDNS, controllers, CNI/CSI drivers, and operators, which have modest elasticity needs but high stability requirements, we recommend running them on the managed node group. The following uses Karpenter itself as the example for configuring node affinity.

Edit the karpenter.yaml file and locate the affinity rules of the Karpenter deployment. Modify them so that Karpenter runs on one of the existing node group nodes. Change the values to match your $NODEGROUP, one node group per line.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: karpenter.sh/provisioner-name
          operator: DoesNotExist
      - matchExpressions:
        - key: eks.amazonaws.com/nodegroup
          operator: In
          values:
          - ${NODEGROUP}
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: "kubernetes.io/hostname"

4. Deploy the Karpenter resources
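The commands below are a sketch of this step based on the official migrating-from-cas guide for v0.27.x: create the namespace, install the Provisioner and AWSNodeTemplate CRDs, then apply the template generated above (verify the CRD URLs against the documentation for your Karpenter version):

kubectl create namespace karpenter
kubectl create -f \
    https://raw.githubusercontent.com/aws/karpenter/${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_provisioners.yaml
kubectl create -f \
    https://raw.githubusercontent.com/aws/karpenter/${KARPENTER_VERSION}/pkg/apis/crds/karpenter.k8s.aws_awsnodetemplates.yaml
kubectl apply -f karpenter.yaml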

5. Create a default Provisioner

cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: [c, m, r]
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values: ["2"]
  providerRef:
    name: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"
EOF
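Note that as written, this Provisioner adds nodes but never removes under-utilized ones. If you also want Karpenter to reclaim capacity when load drops, the v1alpha5 API lets you enable consolidation by adding the following to the Provisioner spec (optional, and mutually exclusive with ttlSecondsAfterEmpty):

spec:
  consolidation:
    enabled: true   # Karpenter may consolidate or remove under-utilized nodes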

Check Karpenter's status:

$ kubectl get po -n karpenter
NAME                         READY   STATUS    RESTARTS   AGE
karpenter-5d7f8596f6-2ml6s   1/1     Running   0          10s
karpenter-5d7f8596f6-t9lwk   1/1     Running   0          10s
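Before moving on, you can optionally confirm that Karpenter really provisions capacity. The test nginx deployment sets no resource requests, so add one and scale the deployment beyond what the node group can hold, then watch the controller logs (the label and container name below assume the chart's defaults):

kubectl set resources deployment nginx-to-scaleout --requests=cpu=1
kubectl scale deployment nginx-to-scaleout --replicas=10
kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller
kubectl get node -w   # a Karpenter-provisioned node should join shortly

Remember to scale the deployment back down and remove the request after the test.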

Part 4: Remove Cluster Autoscaler

1. As shown above, Karpenter is running normally, so we can now disable Cluster Autoscaler (CAS)

kubectl scale deploy/cluster-autoscaler -n kube-system --replicas=0

We can then evict the pods and reclaim the managed node group's resources.

2. Reclaim excess node group capacity

Because we configured a PDB, the scale-in process will not disrupt the service. We can observe this in either of two ways.

Option 1: Expose nginx through a ClusterIP Service and check its availability from another pod

$ kubectl expose deployment nginx-to-scaleout --port=80 --target-port=80

$ kubectl get svc
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
kubernetes          ClusterIP   10.100.0.1      <none>        443/TCP   7d8h
nginx-to-scaleout   ClusterIP   10.100.101.74   <none>        80/TCP    20h

// Exec into a pod that has curl installed

kubectl exec -it curl-777d588d65-lk6xm -- /bin/bash

// Check the Service's availability

root@curl-777d588d65-lk6xm:/# while true;do
> curl -Is http://10.100.101.74 | head -1
> sleep 1
> done
HTTP/1.1 200 OK
HTTP/1.1 200 OK
HTTP/1.1 200 OK
HTTP/1.1 200 OK

Option 2: Watch the pods change during the process

$ kubectl get deploy nginx-to-scaleout -w
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
nginx-to-scaleout   4/4     4            4           22h

The scale-in command is as follows:

aws eks update-nodegroup-config --cluster-name ${CLUSTER_NAME} \
    --nodegroup-name ${NODEGROUP} \
    --scaling-config "minSize=2,maxSize=2,desiredSize=2"

Since this is a test environment, we keep only 2 nodes. For production environments, we still recommend keeping 3 nodes across 3 AZs in the managed node group.
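While the node group shrinks, the PDB throttles evictions, so at most 50% of the nginx pods are disrupted at any moment. You can watch the nodes drain and the pods reschedule with:

kubectl get node -w
kubectl get po -o wide -w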

Summary

As a newer Kubernetes auto scaling tool, Karpenter's advanced capabilities make it our preferred solution for node scaling. This post covered the entire migration from Cluster Autoscaler to Karpenter, using PDBs to ensure the service stays available throughout the process.

References

https://aws-quickstart.github.io/quickstart-amazon-eks/#_overview

https://karpenter.sh/preview/getting-started/migrating-from-cas/

https://kubernetes.io/zh-cn/docs/tasks/run-application/configure-pdb/

About the Authors

林进风

CTO of Xiamen Hanwei Software Technology Co., Ltd., with more than 10 years in the IT industry. Before entering cloud computing he worked mainly on systems integration, leading SAP ECC and HANA projects in manufacturing as well as Oracle RAC database upgrade projects in the healthcare and finance industries, with experience in hardware and application disaster recovery and high availability. He moved into cloud computing in 2017, working on AWS as a Solutions Architect, focusing on compute, networking, databases, and serverless.

张振威

AWS APN Solutions Architect, mainly responsible for partner architecture consulting and solution design, and dedicated to the adoption and promotion of AWS cloud services in China. He previously worked on the database team of an Internet company and has extensive experience in platform engineering and automated operations.