Accelerate the testing and verification of Amazon EKS upgrades with upgrade insights

Introduction

Amazon Elastic Kubernetes Service (Amazon EKS) removes a lot of the heavy lifting that goes into managing Kubernetes. For example, AWS manages the Kubernetes control plane on your behalf, including patching, tuning, and updating it as necessary. Features such as managed node groups give you a mechanism for managing the lifecycle of worker nodes, or you can elect to completely offload the management of worker nodes by using AWS Fargate. You can also make use of Amazon EKS add-ons, which offer a mechanism for installing and managing the lifecycle of add-ons like the VPC CNI, kube-proxy, and so on.

Since its inception, Amazon EKS has orchestrated the upgrades of the Kubernetes control plane on your behalf. While this reduced the amount of time spent upgrading clusters, you still had to do a fair amount of research to identify which resources and applications might be impacted by the upgrade. Ascertaining which Kubernetes APIs were deprecated or removed often required careful analysis of the Amazon EKS and Kubernetes release notes. After gathering this information, you frequently had to search for and remediate applications that referenced those APIs before upgrading your cluster. Given upstream’s commitment to releasing three versions of Kubernetes a year and mounting pressure to keep those clusters up-to-date, this could impede your ability to adopt new versions of Kubernetes.

A small cottage industry of open-source and commercial tools has emerged because of the complexity surrounding cluster upgrades. These tools are largely intended to help you identify and remediate compatibility issues before upgrading your cluster. While there’s no denying the value of these tools, they typically don’t provide Amazon EKS-specific guidance.

Introducing upgrade insights

Upgrade insights addresses these issues. Once a day, Amazon EKS scans the cluster’s audit logs for resources that have been deprecated and surfaces that information in the Amazon EKS Console. The information can also be retrieved programmatically using the Amazon EKS API or the AWS Command Line Interface (AWS CLI). Each insight includes a brief recommendation describing the actions you should take to remediate the issues it identified, along with links to supplemental information, such as release notes and blog posts.

Included with each insight is a list of Kubernetes resource types (e.g., CronJobs and PodDisruptionBudgets) and their respective statuses (PASSING, WARNING, ERROR, or UNKNOWN), which reflect the severity of each finding. For example, if upgrade insights discovers a call to an API that is removed in the next minor version of Kubernetes, then the status of that resource is set to ERROR. An ERROR status indicates that the cluster will reject calls that reference that specific API version after the cluster is upgraded. A WARNING, on the other hand, indicates an impending issue that requires no immediate action. This can happen when the Kubernetes resource is scheduled to be removed in a version that is at least two versions away from the current cluster version. Although rare, the status for a finding could be set to UNKNOWN if there is a backend processing error. The highest severity status of all the resources in an insight is always reflected in the insight’s overall status. In this way, you can see at a glance whether your cluster requires remediation before it’s upgraded. These details become apparent later when we walk through the feature.
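The status rules described above can be sketched in a few lines of Python. This is only an illustration of the description in this post, not the logic Amazon EKS actually runs. The function names finding_status and overall_status, the PASSING fallback for APIs with no observed calls, and the position of UNKNOWN in the severity ordering are all assumptions.

```python
from typing import List

# Illustrative sketch only; not the Amazon EKS implementation.
# Assumed severity ordering: the post does not say where UNKNOWN ranks.
SEVERITY = {"PASSING": 0, "UNKNOWN": 1, "WARNING": 2, "ERROR": 3}


def minor(version: str) -> int:
    """Return the minor component of a 'major.minor' Kubernetes version."""
    return int(version.split(".")[1])


def finding_status(cluster_version: str, removed_in: str, api_was_called: bool) -> str:
    """Classify one deprecated API according to the rules described in the text."""
    if not api_was_called:
        return "PASSING"  # assumption: no observed usage means nothing to fix
    distance = minor(removed_in) - minor(cluster_version)
    if distance <= 1:
        return "ERROR"  # removed in the next minor version (or sooner)
    return "WARNING"  # removal is two or more versions away


def overall_status(statuses: List[str]) -> str:
    """The insight's status is the most severe status among its resources."""
    return max(statuses, key=SEVERITY.__getitem__, default="PASSING")


print(finding_status("1.24", "1.25", api_was_called=True))
print(finding_status("1.24", "1.27", api_was_called=True))
print(overall_status(["PASSING", "WARNING", "ERROR"]))
```

For a cluster on 1.24, an API removed in 1.25 is classified as ERROR, one removed in 1.27 as WARNING, and an insight containing any ERROR resource rolls up to ERROR.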

Reducing time-to-remediation

Perhaps the hardest part of an upgrade is finding the resources that need to be remediated before the upgrade. Upgrade insights shortens the time spent remediating issues by surfacing actionable information in the report, like the userAgents that have made calls to deprecated APIs within the last 30 days and the specific Kubernetes resources those API calls were performed against. For example, say you are running a controller that’s using a deprecated version of the CronJob batch API. The userAgent of the controller and the CronJobs that it acts upon would both appear in the report. Armed with this information, you can quickly narrow your search to applications that make use of that specific userAgent. If you’ve adopted GitOps and are storing your application configuration in a Git repository, then you can use the information from upgrade insights to update the appropriate resources in Git prior to upgrading the cluster.

How it works

Upgrade insights scans the cluster’s audit logs for events related to APIs that have been deprecated. Each event includes information about who initiated it (i.e., the caller) and the Kubernetes resource(s) that it was initiated against. Upgrade insights presents this information to you in a concise and easily consumable way so you can identify and remediate the appropriate resources before executing the upgrade.
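To make the mechanism concrete, here is a minimal Python sketch of the kind of aggregation described above. It is not the Amazon EKS implementation, which this post does not describe in detail. It relies on the fact that Kubernetes annotates audit events for requests to deprecated APIs with k8s.io/deprecated and k8s.io/removed-release. The sample events and the helper name summarize_deprecated_calls are made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical audit events, one JSON document per line, trimmed to the
# fields used here. Real audit events carry many more fields.
SAMPLE_AUDIT_LOG = """
{"requestURI": "/apis/batch/v1beta1/namespaces/default/cronjobs/print-date", "userAgent": "kubectl/v1.24.0 (linux/amd64)", "annotations": {"k8s.io/deprecated": "true", "k8s.io/removed-release": "1.25"}}
{"requestURI": "/apis/batch/v1/namespaces/default/cronjobs/print-date", "userAgent": "kubectl/v1.24.0 (linux/amd64)", "annotations": {}}
{"requestURI": "/apis/policy/v1beta1/podsecuritypolicies/eks.privileged", "userAgent": "kube-apiserver/v1.24.0", "annotations": {"k8s.io/deprecated": "true", "k8s.io/removed-release": "1.25"}}
"""


def summarize_deprecated_calls(log_text, target_version):
    """Group deprecated-API requests removed in target_version by caller."""
    summary = defaultdict(set)
    for line in log_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        annotations = event.get("annotations") or {}
        if annotations.get("k8s.io/deprecated") != "true":
            continue  # request used an API that is not deprecated
        if annotations.get("k8s.io/removed-release") != target_version:
            continue  # deprecated, but removed in a different release
        # Keep only the client name, dropping the version suffix.
        caller = event.get("userAgent", "unknown").split("/")[0]
        summary[caller].add(event["requestURI"])
    return {caller: sorted(uris) for caller, uris in summary.items()}


for caller, uris in summarize_deprecated_calls(SAMPLE_AUDIT_LOG, "1.25").items():
    print(caller, uris)
```

The result pairs each caller with the resources it touched through a deprecated API, which is the same pairing the report gives you.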

Walkthrough

In this walkthrough, we show you how you can use upgrade insights when upgrading your cluster from Amazon EKS v1.24 to v1.25. The first thing we’ll do is get a list of insights. Since we are upgrading to 1.25, we filter the results to only show information about Amazon EKS v1.25.

$ aws eks list-insights --filter kubernetesVersions=1.25 --cluster-name preflight | jq .
{
  "insights": [
    {
      "id": "149e0168-c889-4038-9d7f-f8cb4582560f",
      "name": "Deprecated APIs removed in Kubernetes v1.25",
      "category": "UPGRADE_READINESS",
      "kubernetesVersion": "1.25",
      "lastRefreshTime": "2023-11-09T19:16:56-06:00",
      "lastTransitionTime": "2023-11-06T19:16:46-06:00",
      "description": "Checks for usage of deprecated APIs that are scheduled for removal in Kubernetes v1.25. Upgrading your cluster before migrating to the updated APIs supported by v1.25 could cause application impact.",
      "insightStatus": {
        "status": "ERROR"
      }
    }
  ]
}

The last refresh time is the last time the report was refreshed. By default, the report is refreshed once per day. The last transition time is the last time the status of the insight changed. In this particular instance, the status transitioned from WARNING to ERROR on November 6, 2023. We can also see the insight’s overall status (ERROR), which always reflects the most severe finding of all the findings in the report.
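If you script your upgrade checks, the same response can be inspected programmatically. The following Python sketch parses a payload shaped like the list-insights output above and reports which insights would block an upgrade. The helper name blocking_insights is hypothetical, and treating only ERROR as blocking is an assumption drawn from the status descriptions in this post.

```python
import json
from datetime import datetime

# Payload trimmed from the list-insights output shown above.
LIST_INSIGHTS_OUTPUT = """
{
  "insights": [
    {
      "id": "149e0168-c889-4038-9d7f-f8cb4582560f",
      "name": "Deprecated APIs removed in Kubernetes v1.25",
      "kubernetesVersion": "1.25",
      "lastRefreshTime": "2023-11-09T19:16:56-06:00",
      "lastTransitionTime": "2023-11-06T19:16:46-06:00",
      "insightStatus": {"status": "ERROR"}
    }
  ]
}
"""


def blocking_insights(payload):
    """Return (name, status, last transition time) for insights in ERROR."""
    results = []
    for insight in json.loads(payload)["insights"]:
        status = insight["insightStatus"]["status"]
        if status == "ERROR":
            changed = datetime.fromisoformat(insight["lastTransitionTime"])
            results.append((insight["name"], status, changed))
    return results


for name, status, changed in blocking_insights(LIST_INSIGHTS_OUTPUT):
    print(f"{status}: {name} (since {changed:%Y-%m-%d})")
```

In practice you would pass in the text printed by the aws eks list-insights command instead of the embedded sample.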

We can see a similar view of this information in the Amazon EKS Console on a new tab called Upgrade insights:

Next, we use the insight’s ID to get additional details about the insight. For simplicity’s sake, we split the output of this command into multiple sections below.

$ aws eks describe-insight --id 149e0168-c889-4038-9d7f-f8cb4582560f --cluster-name preflight | jq .
{
  "insight": {
    "id": "149e0168-c889-4038-9d7f-f8cb4582560f",
    "name": "Deprecated APIs removed in Kubernetes v1.25",
    "category": "UPGRADE_READINESS",
    "kubernetesVersion": "1.25",
    "lastRefreshTime": "2023-11-09T19:16:56-06:00",
    "lastTransitionTime": "2023-11-06T19:16:46-06:00",
    "description": "Checks for usage of deprecated APIs that are scheduled for removal in Kubernetes v1.25. Upgrading your cluster before migrating to the updated APIs supported by v1.25 could cause application impact.",
    "insightStatus": {
      "status": "ERROR"
    },
    "recommendation": "Update manifests and API clients to use newer Kubernetes APIs if applicable before upgrading to Kubernetes v1.25. Migrate from Pod Security Policy (PSP) to an alternative approach such as Pod Security Standards or an admission controller.",
    "additionalInfo": {
      "EKS Pod Security Policy": "https://docs.aws.amazon.com/eks/latest/userguide/pod-security-policy.html",
      "EKS update cluster documentation": "https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html",
      "Implementing Pod Security Standards in Amazon EKS": "https://aws.amazon.com/blogs/containers/implementing-pod-security-standards-in-amazon-eks/",
      "Kubernetes v1.25 deprecation guide": "https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25",
      "Migrate from PSP": "https://kubernetes.io/docs/tasks/configure-pod-container/migrate-from-psp/"
    },

In the first section, we’re immediately presented with guidance about migrating from Pod Security Policies (PSPs) to Pod Security Standards, or an alternate solution, before upgrading to Amazon EKS v1.25 because PSPs are removed from Amazon EKS v1.25 and above. Next to the recommendation are links to additional resources that we can use to help us with that task. The following screenshot illustrates how this information is displayed in the Amazon EKS Console:

Resources

The resources section displays a list of Kubernetes resources that may need remediation before we upgrade to Amazon EKS v1.25. The output below shows that the CronJob print-date (line 12) and the PodDisruptionBudget inflate-pdb (line 18) both reference the v1beta1 version of their respective APIs. These APIs are removed in Amazon EKS v1.25 and above. In-cluster resources like these, however, have already been upgraded to v1. For example, since Amazon EKS v1.21, CronJobs that reference v1beta1 are automatically converted to v1. Nevertheless, we still need to update the Kubernetes manifests, charts, and kustomizations that reference these resources and/or the controllers that created them.

    "resources": [
      {
        "insightStatus": {
          "status": "ERROR"
        },
        "kubernetesResourceUri": "/apis/policy/v1beta1/podsecuritypolicies/null"
      },
      {
        "insightStatus": {
          "status": "ERROR"
        },
        "kubernetesResourceUri": "/apis/batch/v1beta1/namespaces/default/cronjobs/print-date"
      },
      {
        "insightStatus": {
          "status": "ERROR"
        },
        "kubernetesResourceUri": "/apis/policy/v1beta1/namespaces/default/poddisruptionbudgets/inflate-pdb"
      },
      {
        "insightStatus": {
          "status": "ERROR"
        },
        "kubernetesResourceUri": "/apis/policy/v1beta1/podsecuritypolicies/eks.privileged"
      },
      {
        "insightStatus": {
          "status": "ERROR"
        },
        "kubernetesResourceUri": "/apis/policy/v1beta1/podsecuritypolicies/jicowan.test"
      }
    ],

You can see a similar view of this information from the Amazon EKS Console.
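When a report lists many resources, it can help to break each kubernetesResourceUri into its parts so the findings can be sorted by resource type or namespace. The Python sketch below does this for the paths in the resources section above. The ResourceRef type and the parse_resource_uri helper are hypothetical names, and the parser assumes the two path shapes that appear in this report (cluster-scoped and namespaced resources under /apis).

```python
from typing import NamedTuple, Optional


class ResourceRef(NamedTuple):
    group: str
    version: str
    namespace: Optional[str]  # None for cluster-scoped resources
    resource: str
    name: str


def parse_resource_uri(uri: str) -> ResourceRef:
    """Split an /apis/<group>/<version>/... path into its components."""
    parts = uri.strip("/").split("/")
    if len(parts) < 5 or parts[0] != "apis":
        raise ValueError(f"unsupported resource URI: {uri}")
    group, version, rest = parts[1], parts[2], parts[3:]
    namespace = None
    if rest[0] == "namespaces" and len(rest) == 4:
        namespace, rest = rest[1], rest[2:]
    if len(rest) != 2:
        raise ValueError(f"unsupported resource URI: {uri}")
    resource, name = rest
    return ResourceRef(group, version, namespace, resource, name)


# kubernetesResourceUri values copied from the resources section above.
uris = [
    "/apis/policy/v1beta1/podsecuritypolicies/null",
    "/apis/batch/v1beta1/namespaces/default/cronjobs/print-date",
    "/apis/policy/v1beta1/namespaces/default/poddisruptionbudgets/inflate-pdb",
    "/apis/policy/v1beta1/podsecuritypolicies/eks.privileged",
    "/apis/policy/v1beta1/podsecuritypolicies/jicowan.test",
]

for uri in uris:
    ref = parse_resource_uri(uri)
    scope = ref.namespace or "(cluster-scoped)"
    print(f"{ref.resource:<22} {scope:<16} {ref.name}")
```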

    "categorySpecificSummary": {
      "deprecationDetails": [
        {
          "usage": "/apis/batch/v1beta1/cronjobs",
          "replacedWith": "/apis/batch/v1/cronjobs",
          "stopServingVersion": "1.25",
          "startServingReplacementVersion": "1.21",
          "clientStats": [
            {
              "userAgent": "kubectl",
              "numberOfRequestsLast30Days": 1,
              "lastRequestTime": "2023-11-06T16:25:07-06:00"
            }
          ]
        },
        {
          "usage": "/apis/discovery.k8s.io/v1beta1/endpointslices",
          "replacedWith": "/apis/discovery.k8s.io/v1/endpointslices",
          "stopServingVersion": "1.25",
          "startServingReplacementVersion": "1.21",
          "clientStats": []
        },
        {
          "usage": "/apis/events.k8s.io/v1beta1/events",
          "replacedWith": "/apis/events.k8s.io/v1/events",
          "stopServingVersion": "1.25",
          "startServingReplacementVersion": "1.19",
          "clientStats": []
        },
        {
          "usage": "/apis/autoscaling/v2beta1/horizontalpodautoscalers",
          "replacedWith": "/apis/autoscaling/v2/horizontalpodautoscalers",
          "stopServingVersion": "1.25",
          "startServingReplacementVersion": "1.23",
          "clientStats": []
        },
        {
          "usage": "/apis/policy/v1beta1/poddisruptionbudgets",
          "replacedWith": "/apis/policy/v1/poddisruptionbudgets",
          "stopServingVersion": "1.25",
          "startServingReplacementVersion": "1.21",
          "clientStats": [
            {
              "userAgent": "kubectl",
              "numberOfRequestsLast30Days": 1,
              "lastRequestTime": "2023-11-06T17:14:37-06:00"
            }
          ]
        },
        {
          "usage": "/apis/policy/v1beta1/podsecuritypolicies",
          "stopServingVersion": "1.25",
          "clientStats": [
            {
              "userAgent": "pluto",
              "numberOfRequestsLast30Days": 1,
              "lastRequestTime": "2023-11-02T16:47:30-05:00"
            },
            {
              "userAgent": "kubectl",
              "numberOfRequestsLast30Days": 7,
              "lastRequestTime": "2023-11-07T19:20:08-06:00"
            },
            {
              "userAgent": "node-fetch",
              "numberOfRequestsLast30Days": 9,
              "lastRequestTime": "2023-11-02T16:55:31-05:00"
            },
            {
              "userAgent": "kube-controller-manager",
              "numberOfRequestsLast30Days": 1732,
              "lastRequestTime": "2023-11-09T18:46:51-06:00"
            },
            {
              "userAgent": "kube-apiserver",
              "numberOfRequestsLast30Days": 3483,
              "lastRequestTime": "2023-11-09T18:51:39-06:00"
            }
          ]
        },
        {
          "usage": "/apis/node.k8s.io/v1beta1/runtimeclasses",
          "replacedWith": "/apis/node.k8s.io/v1/runtimeclasses",
          "stopServingVersion": "1.25",
          "startServingReplacementVersion": "1.20",
          "clientStats": []
        }
      ]
    }
  }
}

The categorySpecificSummary section is where we can see the top five userAgents that called a deprecated API, the number of times that userAgent called the API in the last 30 days, and the last time it issued a request to that API. With this information we can narrow our search to the userAgents in the report.
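To see which callers deserve attention first, you can total the request counts per userAgent across all deprecated APIs. The Python sketch below does this with a trimmed copy of the deprecationDetails entries shown above. The helper name requests_by_user_agent is hypothetical, and because the report lists only the top userAgents for each API, the totals should be read as a lower bound rather than an exact count.

```python
from collections import Counter

# Trimmed copy of the deprecationDetails entries shown above.
DEPRECATION_DETAILS = [
    {"usage": "/apis/batch/v1beta1/cronjobs",
     "clientStats": [{"userAgent": "kubectl", "numberOfRequestsLast30Days": 1}]},
    {"usage": "/apis/discovery.k8s.io/v1beta1/endpointslices", "clientStats": []},
    {"usage": "/apis/policy/v1beta1/poddisruptionbudgets",
     "clientStats": [{"userAgent": "kubectl", "numberOfRequestsLast30Days": 1}]},
    {"usage": "/apis/policy/v1beta1/podsecuritypolicies",
     "clientStats": [
         {"userAgent": "pluto", "numberOfRequestsLast30Days": 1},
         {"userAgent": "kubectl", "numberOfRequestsLast30Days": 7},
         {"userAgent": "node-fetch", "numberOfRequestsLast30Days": 9},
         {"userAgent": "kube-controller-manager", "numberOfRequestsLast30Days": 1732},
         {"userAgent": "kube-apiserver", "numberOfRequestsLast30Days": 3483},
     ]},
]


def requests_by_user_agent(details):
    """Sum numberOfRequestsLast30Days per userAgent across all APIs."""
    totals = Counter()
    for entry in details:
        for stat in entry.get("clientStats", []):
            totals[stat["userAgent"]] += stat["numberOfRequestsLast30Days"]
    return totals


# Print callers from most to least active.
for agent, count in requests_by_user_agent(DEPRECATION_DETAILS).most_common():
    print(f"{count:>6}  {agent}")
```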

We can create a condensed version of the detailed report by running the following command:

$ aws eks describe-insight --id 149e0168-c889-4038-9d7f-f8cb4582560f --cluster-name preflight | jq '.insight.categorySpecificSummary.deprecationDetails[] | select(.stopServingVersion == "1.25" and .clientStats != [])'
{
  "usage": "/apis/batch/v1beta1/cronjobs",
  "replacedWith": "/apis/batch/v1/cronjobs",
  "stopServingVersion": "1.25",
  "startServingReplacementVersion": "1.21",
  "clientStats": [
    {
      "userAgent": "kubectl",
      "numberOfRequestsLast30Days": 1,
      "lastRequestTime": "2023-11-06T16:25:07-06:00"
    }
  ]
}
{
  "usage": "/apis/policy/v1beta1/poddisruptionbudgets",
  "replacedWith": "/apis/policy/v1/poddisruptionbudgets",
  "stopServingVersion": "1.25",
  "startServingReplacementVersion": "1.21",
  "clientStats": [
    {
      "userAgent": "kubectl",
      "numberOfRequestsLast30Days": 1,
      "lastRequestTime": "2023-11-06T17:14:37-06:00"
    }
  ]
}
{
  "usage": "/apis/policy/v1beta1/podsecuritypolicies",
  "stopServingVersion": "1.25",
  "clientStats": [
    {
      "userAgent": "pluto",
      "numberOfRequestsLast30Days": 1,
      "lastRequestTime": "2023-11-02T16:47:30-05:00"
    },
    {
      "userAgent": "kubectl",
      "numberOfRequestsLast30Days": 7,
      "lastRequestTime": "2023-11-07T19:20:08-06:00"
    },
    {
      "userAgent": "node-fetch",
      "numberOfRequestsLast30Days": 9,
      "lastRequestTime": "2023-11-02T16:55:31-05:00"
    },
    {
      "userAgent": "kube-controller-manager",
      "numberOfRequestsLast30Days": 1732,
      "lastRequestTime": "2023-11-09T18:46:51-06:00"
    },
    {
      "userAgent": "kube-apiserver",
      "numberOfRequestsLast30Days": 3483,
      "lastRequestTime": "2023-11-09T18:51:39-06:00"
    }
  ]
}

Whereas before we saw a list of all the deprecated APIs, this filtered view shows only those deprecated APIs that have been called by a userAgent (in the last 30 days).

Note: Upgrade insights reads the Kubernetes audit log using a 30-day rolling window. Resources using deprecated APIs that haven’t been applied to the cluster in the last 30 days may be absent from the report. Similarly, if the status of a finding is marked ERROR, its status won’t change until the last audit log entry for that finding falls outside the 30-day rolling window.
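As a worked example of the rolling window, the Python sketch below estimates the earliest time at which the PodSecurityPolicy finding could clear, assuming no further calls are made to the deprecated API. The helper name earliest_clear_time is hypothetical, and the calculation is an approximation based on the note above. Because the report is refreshed once per day, the status shown may change somewhat later than the computed time.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)  # rolling window length stated in the note


def earliest_clear_time(last_request_times):
    """Earliest moment the newest request would fall outside the window."""
    newest = max(datetime.fromisoformat(t) for t in last_request_times)
    return newest + WINDOW


# lastRequestTime values reported for /apis/policy/v1beta1/podsecuritypolicies.
psp_requests = [
    "2023-11-02T16:47:30-05:00",
    "2023-11-07T19:20:08-06:00",
    "2023-11-02T16:55:31-05:00",
    "2023-11-09T18:46:51-06:00",
    "2023-11-09T18:51:39-06:00",
]

print(earliest_clear_time(psp_requests).isoformat())
```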

We can also view this information from the console. For example, when we click on a deprecated API in the Deprecation details section (e.g., PodSecurityPolicies), we see the list of userAgents, the number of times they placed a call to that API, and the last time the API was called.

Once you’ve upgraded to a newer version of Amazon EKS, e.g. 1.25, the insights for that version (1.25) are no longer available.

Future enhancements

Checking for APIs that have been deprecated is only the beginning. Over the next year, upgrade insights will add checks for additional upgrade-impacting issues, including kubelet version skew and add-on version compatibility. As Amazon EKS releases support for new versions of Kubernetes, the list of checks will continue to grow. If you feel a check is missing, please add your suggestions to our containers roadmap on GitHub.

Conclusion

Upgrade insights is built upon best practices learned by Amazon EKS over the course of managing hundreds of thousands of Kubernetes clusters. This initial version of upgrade insights intends to make upgrades easier by surfacing the userAgents that are referencing deprecated APIs. With this information, you can quickly find and remediate the affected controllers and configuration before upgrading your cluster. The status of each finding in the report allows you to focus on the resources and applications impacted by the upgrade, while the version-specific guidance provides you with important information about transitioning to the next release.

Upgrade insights can help you avoid issues that arise when APIs are removed from Kubernetes. However, this doesn’t mean you can avoid testing your applications against the next minor release. When possible, you should perform the upgrade in a test environment prior to upgrading your production environment. Be methodical about your approach to upgrades, too. Create an upgrade plan or runbook that addresses each of the findings surfaced by upgrade insights and be sure to update your Kubernetes manifests, Helm charts, and kustomize templates with kubectl convert or another tool.

Upgrade insights is a complimentary service and is on by default for all Amazon EKS versions. We’re looking forward to hearing your feedback about this feature and are excited about the impact it will have on your upgrade experience.

Jeremy Cowan

Jeremy Cowan is a Specialist Solutions Architect for containers at AWS, although his family thinks he sells "cloud space". Prior to joining AWS, Jeremy worked for several large software vendors, including VMware, Microsoft, and IBM. When he's not working, you can usually find him on a trail in the wilderness, far away from technology.