Containers
Autoscaling EKS on Fargate with custom metrics
NOTICE: October 04, 2024 – This post no longer reflects the best guidance for configuring a service mesh with Amazon EKS and its examples no longer work as shown. Please refer to newer content on Amazon VPC Lattice.
——–
This is a guest post by Stefan Prodan of Weaveworks.
Autoscaling is an approach to automatically scale workloads up or down based on resource usage. In Kubernetes, the Horizontal Pod Autoscaler (HPA) can scale pods based on observed CPU utilization and memory usage. Starting with Kubernetes 1.7, an aggregation layer was introduced that allows third-party applications to extend the Kubernetes API by registering themselves as API add-ons. Such an add-on can implement the Custom Metrics API and give the HPA access to arbitrary metrics.
What follows is a step-by-step guide on configuring HPA with metrics provided by Prometheus to automatically scale pods running on Amazon EKS on AWS Fargate.
Prerequisites
Install eksctl and fluxctl for macOS with Homebrew like so:
brew tap weaveworks/tap
brew install weaveworks/tap/eksctl
brew install fluxctl
And for Windows you can use Chocolatey:
choco install eksctl
choco install fluxctl
Last but not least, for Linux you can download the eksctl and fluxctl binaries from GitHub.
Create an EKS cluster
First, create an EKS cluster with two EC2 managed nodes and a Fargate profile using the following config file:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-fargate-hpa
  region: eu-west-1
managedNodeGroups:
  - name: default
    instanceType: m5.large
    desiredCapacity: 2
    volumeSize: 120
    iam:
      withAddonPolicies:
        appMesh: true
        albIngress: true
fargateProfiles:
  - name: default
    selectors:
      - namespace: demo
        labels:
          scheduler: fargate
Save the above cluster config in a file called cluster-config.yaml and then run:
eksctl create cluster -f cluster-config.yaml
You’ll use the managed nodes for the cluster add-ons (CoreDNS, kube-proxy) and for the following HPA metrics add-ons:
- Prometheus: scrapes pods and stores metrics
- Prometheus metrics adapter: queries Prometheus and exposes metrics for the Kubernetes custom metrics API
- Metrics server: collects pods CPU and memory usage and exposes metrics for the Kubernetes resource metrics API
For the demo, you’ll be using the application podinfo running on Fargate. Note that only the pods deployed in the demo namespace with the scheduler: fargate label will run on Fargate, as defined in the Fargate profile above.
Create a GitHub repository
To configure HPA for Fargate you’ll be using an eksctl GitOps profile. This profile allows you to create a Kubernetes application platform tailored for a specific use case. So, let’s do this: create a GitHub repository and clone it locally. Replace GH_USER and GH_REPO with your own GitHub username and the name of the new repository you created in the previous step. Use these variables to clone your repo and set up GitOps for your cluster:
export GH_USER=YOUR_GITHUB_USERNAME
export GH_REPO=YOUR_GITHUB_REPOSITORY
git clone https://github.com/${GH_USER}/${GH_REPO}
cd ${GH_REPO}
Now, run the following eksctl command to set up the pipeline:
export EKSCTL_EXPERIMENTAL=true
eksctl enable repo \
--cluster=eks-fargate-hpa \
--region=eu-west-1 \
--git-url=git@github.com:${GH_USER}/${GH_REPO} \
--git-user=fluxcd \
--git-email=${GH_USER}@users.noreply.github.com
The above command takes an existing EKS cluster and an empty repository and sets up a GitOps pipeline. After the command finishes installing FluxCD and the Helm Operator, you will be asked to add Flux’s deploy key to your GitHub repository: copy the public key and create a deploy key with write access. To do this:
- Go to “Settings > Deploy keys” and click on “Add deploy key”.
- Make sure to also check “Allow write access”.
- Then, paste the Flux public key and click “Add key”.
Once that is done, Flux will be able to pick up changes in the repository and deploy them to the cluster.
Install the metrics add-ons
Next, to install the metrics add-ons, run the following command:
eksctl enable profile \
--name=https://github.com/stefanprodan/eks-hpa-profile \
--cluster=eks-fargate-hpa \
--region=eu-west-1 \
--git-url=git@github.com:${GH_USER}/${GH_REPO} \
--git-user=fluxcd \
--git-email=${GH_USER}@users.noreply.github.com
The above command adds the HPA metrics add-ons and the demo app manifests to the configured repository.
Now sync your local repository with the GitHub repo, using:
git pull origin master
And now run the following command to apply the manifests to your EKS cluster:
fluxctl sync --k8s-fwd-ns flux
Flux reconciles your GitHub repo with the EKS cluster every five minutes; the above command can be used to speed up the synchronization.
Now it’s time to list the installed components; compare your output with the following:
$ kubectl -n monitoring-system get helmreleases
NAME                 RELEASE              STATUS
metrics-server       metrics-server       DEPLOYED
prometheus           prometheus           DEPLOYED
prometheus-adapter   prometheus-adapter   DEPLOYED
Install sample app
You’ll use a Go web sample app named podinfo to test HPA. The app is instrumented with Prometheus and exposes a counter called http_requests_total. The HPA controller, part of the Kubernetes control plane, will scale the pods running on Fargate based on the number of HTTP requests per second, as derived from this Prometheus counter.
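Behind the scenes, the Prometheus adapter needs a rule that turns this raw counter into a per-second rate before the HPA can use it. A hypothetical rule along these lines would do the conversion; the label names and the 1m rate window here are assumptions, and the actual rule shipped in the eks-hpa-profile may differ:

```yaml
# Hypothetical prometheus-adapter rule (sketch): label names and the
# 1m window are assumptions, not necessarily the profile's exact config.
rules:
  custom:
    - seriesQuery: 'http_requests_total{kubernetes_namespace!="",kubernetes_pod_name!=""}'
      resources:
        overrides:
          kubernetes_namespace: {resource: "namespace"}
          kubernetes_pod_name: {resource: "pod"}
      name:
        matches: "http_requests_total"
        as: "http_requests_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
```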
So, install podinfo by setting fluxcd.io/ignore to false in demo/namespace.yaml:
cat << EOF | tee demo/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo
  annotations:
    fluxcd.io/ignore: "false"
EOF
And apply changes via git like so:
git add -A && \
git commit -m "init demo" && \
git push origin master && \
fluxctl sync --k8s-fwd-ns flux
Now, wait for EKS on Fargate to schedule and launch the podinfo app using watch kubectl -n demo get po -l scheduler=fargate. When podinfo starts, Prometheus will scrape the metrics endpoint and the Prometheus adapter will export the HTTP requests-per-second metric to the Kubernetes custom metrics API:
$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "namespaces/http_requests_per_second",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "pods/http_requests_per_second",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
Configure autoscaling based on HTTP traffic
To configure autoscaling you can set up an HPA definition that uses the http_requests_per_second metric. The HPA manifest is in demo/podinfo/hpa.yaml and is set to scale podinfo up when the average req/sec per pod exceeds 10:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo
  namespace: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metricName: http_requests_per_second
        targetAverageValue: 10
Note that the podinfo deployment manifest has no replicas defined in deployment.spec.replicas, since the HPA controller updates the number of replicas based on the metric average value.
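Conceptually, the controller scales in proportion to how far the observed average is from the target. Here is a minimal Python sketch of that rule; the real controller additionally applies a tolerance band and the cooldown windows discussed below:

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float) -> int:
    """Core HPA scaling rule (simplified sketch): scale proportionally
    to the ratio of the observed average metric to the target value."""
    return math.ceil(current_replicas * current_value / target_value)

# 3 pods averaging 25 req/sec each against a 10 req/sec target -> 8 pods
print(desired_replicas(3, 25, 10))  # 8
```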
Once the metric is available to the metrics API, the HPA controller will display the current value:
$ watch kubectl -n demo get hpa
NAME      REFERENCE            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
podinfo   Deployment/podinfo   200m/10   1         10        1          8m58s
The m in 200m above represents milli-units; that is, 200m means 0.2 req/sec. This baseline traffic is generated by Prometheus itself, which scrapes the /metrics endpoint every five seconds.
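To see where a value like 200m comes from, here is a small Python sketch of the conversion: a per-second rate over a counter window (what PromQL's rate() computes), then the milli-unit encoding used by the metrics API. The sample numbers are illustrative only:

```python
def per_second_rate(count_start: float, count_end: float,
                    window_seconds: float) -> float:
    """Per-second rate of a monotonically increasing counter over a
    window (ignoring counter resets for simplicity)."""
    return (count_end - count_start) / window_seconds

# Illustrative: the counter grew by 12 requests over a 60s window,
# i.e. 0.2 req/sec, which the metrics API reports as 200m (milli-units).
rate = per_second_rate(100, 112, 60)
print(f"{rate:.1f} req/sec = {round(rate * 1000)}m")  # 0.2 req/sec = 200m
```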
You can exec into the loadtester pod with the following command:
kubectl -n demo exec -it loadtester-xxxx-xxxx -- sh
And generate additional traffic using hey as shown below:
hey -z 10m -c 5 -q 5 -disable-keepalive http://podinfo.demo
After a few minutes the HPA begins to scale up the deployment:
$ kubectl -n demo describe hpa podinfo
Events:
Type    Reason             Age  From                       Message
----    ------             ---- ----                       -------
Normal  SuccessfulRescale  2m   horizontal-pod-autoscaler  New size: 3; reason: pods metric http_requests_per_second above target
When the load test finishes, the HPA scales the deployment down to its initial replica count:
...
Events:
Type    Reason             Age  From                       Message
----    ------             ---- ----                       -------
Normal  SuccessfulRescale  21s  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
You may have noticed that the autoscaler doesn’t react immediately to usage spikes. The metrics sync happens every 30 seconds and scaling up or down can only happen if there was no rescaling within the last three to five minutes. In this way, the HPA prevents rapid execution of conflicting decisions.
Configure autoscaling based on CPU usage
For workloads that aren’t instrumented with Prometheus, you can use the Kubernetes metrics server and configure auto-scaling based on CPU and/or memory usage.
Update the HPA manifest by replacing the HTTP metric with CPU average utilization:
cat << EOF | tee demo/podinfo/hpa.yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo
  namespace: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 90
EOF
And apply the changes, as usual, via git:
git add -A && \
git commit -m "cpu hpa" && \
git push origin master && \
fluxctl sync --k8s-fwd-ns flux
Now, run a load test to push CPU usage above 90% and trigger the HPA (again, exec into the loadtester pod first):
hey -z 10m -c 5 -q 5 -m POST -d 'test' -disable-keepalive http://podinfo.demo/token
The Kubernetes Metrics Server is a cluster-wide aggregator of resource usage data. It collects CPU and memory usage for nodes and pods by pulling data from the kubernetes.summary_api. The summary API is a memory-efficient API for passing data from the Kubelet to the metrics server.
Configure autoscaling based on App Mesh traffic
One of the advantages of using a service mesh like AWS App Mesh is the built-in monitoring capability. You don’t have to instrument your web apps in order to monitor the L7 traffic.
The Envoy sidecar used by App Mesh exposes a counter called envoy_cluster_upstream_rq. You can configure the Prometheus adapter to transform this metric into a req/sec rate with the following Helm configuration:
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: prometheus-adapter
  namespace: monitoring-system
spec:
  releaseName: prometheus-adapter
  chart:
    repository: https://kubernetes-charts.storage.googleapis.com/
    name: prometheus-adapter
    version: 1.4.0
  values:
    prometheus:
      url: http://prometheus.monitoring-system
      port: 9090
    rules:
      default: false
      custom:
        - seriesQuery: 'envoy_cluster_upstream_rq{kubernetes_namespace!="",kubernetes_pod_name!=""}'
          resources:
            overrides:
              kubernetes_namespace: {resource: "namespace"}
              kubernetes_pod_name: {resource: "pod"}
          name:
            matches: "envoy_cluster_upstream_rq"
            as: "appmesh_requests_per_second"
          metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
Now you can use the appmesh_requests_per_second metric in the HPA definition:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo
  namespace: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metricName: appmesh_requests_per_second
        targetAverageValue: 10
Wrapping up
Not all systems can meet their SLAs by relying on CPU and memory usage metrics alone; most web and mobile backends require autoscaling based on requests per second to handle traffic bursts. For ETL apps, autoscaling could be triggered by the job queue length exceeding a threshold, and so on. By instrumenting your applications with Prometheus and exposing the right metrics for autoscaling, you can fine-tune your apps to better handle bursts and ensure high availability.