Announcing pull through cache for registry.k8s.io in Amazon Elastic Container Registry

Introduction

Container images are stored in registries and pulled into the environments where they run. Registries range from private, self-hosted registries to public, unauthenticated ones. The registry you use is a direct dependency that affects how quickly you can scale, the security of the software you run, and the availability of your systems. For production environments, we recommend that customers limit external dependencies in these areas and host container images in a private registry.

Amazon Elastic Container Registry (Amazon ECR) is a managed service for hosting your Open Container Initiative (OCI) images and artifacts. It can also be used as a pull through cache for container images you depend on from external registries. A pull through cache copies images from an upstream registry and keeps them up to date without giving you a direct dependency on the external registry. If the upstream registry or a container image becomes unavailable, your cached copy can still be used.

At launch, Amazon ECR supported Amazon ECR Public and the Quay container registry as pull through cache sources. Starting today, you can also use Amazon ECR as a pull through cache for the official Kubernetes registry at registry.k8s.io. With a pull through cache, you won’t have external dependencies on the community-run registry for commonly used images such as the Kubernetes metrics server or cluster autoscaler. This feature is generally available today and can be used in all Regions that support Amazon ECR pull through cache.

The Kubernetes community image registry recently changed from k8s.gcr.io to registry.k8s.io in an effort to keep the registry sustainable and improve performance for AWS users. This change has been mostly transparent for users, but it requires updating manifests to keep receiving new releases. The core components of an Amazon EKS cluster don’t use the community registry, and their base images come from Amazon-hosted repositories. However, workloads you deploy to the cluster may come from the community registry.

Solution overview

Even with the Kubernetes registry changes, AWS and the Kubernetes project recommend customers take ownership of their dependencies to avoid unexpected availability incidents and for disaster recovery. There are two typical options to own upstream container dependencies:

  1. Manually sync images from one registry to another
  2. Cache images as they are requested

Syncing images between registries requires you to first identify all of the images and tags that you want to sync, and then use a tool like crane or skopeo to pull images from one registry and push them into another. This can be tedious and error-prone work if you use lots of images or have multiple accounts and Regions. Keeping images up to date requires you to run the sync commands regularly.
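
For example, a one-time sync of a single image with crane could look like this (the image and tag are illustrative):

# Copy one image from the upstream registry into a private Amazon ECR
# repository; the destination repository must already exist
crane copy registry.k8s.io/metrics-server/metrics-server:v0.6.3 \
  <account number>.dkr.ecr.<region>.amazonaws.com/k8s/metrics-server/metrics-server:v0.6.3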

A pull through cache is an automatic way to store images in a new repository when they are requested. The pull through cache automatically creates the image repository in your registry when an image is first requested and keeps the image updated and available for future pulls. You aren’t required to identify upstream dependencies or manually sync images to keep them current.

Further benefits for Amazon EKS customers include:

  • Reduce image pull time by storing images in the same Region
  • Optional automatic replication to multiple Regions and accounts
  • Cross-account pull permissions
  • Image vulnerability scanning and encryption

In addition to those benefits, you also support the upstream Kubernetes project by reducing image pulls from the upstream sources. There is no additional cost to use Amazon ECR pull through cache, and standard Amazon ECR storage pricing applies to cached images.

Walkthrough

Getting started with Amazon ECR pull through cache

Log in to the AWS Management Console and create a new pull through cache rule. Select the Private Registry tab on the left and then select Pull through cache to update the rules for caching.

Select Add rule and, in the Public registry drop-down, select registry.k8s.io. In the destination tab, create a namespace. Cached images keep the same path as upstream, with the namespace prefixed to their path.

[Image: AWS Management Console with a drop-down box selecting the Kubernetes public registry]

If you create the namespace k8s, then your cached images will be available at:

<account number>.dkr.ecr.<region>.amazonaws.com/k8s
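
For example, the upstream metrics server image would map to the cached path like this (tag shown for illustration):

registry.k8s.io/metrics-server/metrics-server:v0.6.3 becomes
<account number>.dkr.ecr.<region>.amazonaws.com/k8s/metrics-server/metrics-server:v0.6.3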

You can also create a pull through cache rule from the AWS Command Line Interface (AWS CLI) with:

aws ecr create-pull-through-cache-rule --ecr-repository-prefix k8s \
  --upstream-registry-url registry.k8s.io

Use the cache

To test the cache, you can manually pull an image found in registry.k8s.io using the new rule.
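
Your Docker client needs to authenticate to your private registry first (the standard Amazon ECR login):

aws ecr get-login-password --region <region> | docker login --username AWS \
  --password-stdin <account number>.dkr.ecr.<region>.amazonaws.com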

The first time you pull an image using the pull through cache namespace, it automatically creates the repository. This command pulls the busybox image, which creates the repository and populates it with the upstream image.

docker pull <account number>.dkr.ecr.<region>.amazonaws.com/k8s/busybox:latest

Now you can configure all of your workloads and clusters to pull from the cache instead of the community registry. Here are three examples for how you can use the new cached repositories depending on how you manage your Kubernetes workloads.

Manually update manifests

The first option to use the new cached images is the most straightforward. If you have static Kubernetes manifest files, then you can update the image: field in the manifests to use the new repository. This works for a Git repo full of manifests that are manually applied to the cluster or for a GitOps repo of rendered manifest files.

grep -lr 'image: registry.k8s.io/' . \
  | xargs sed -i \
    's,registry.k8s.io,<account number>.dkr.ecr.<region>.amazonaws.com/k8s,g'

To identify running workloads in a cluster that use the registry.k8s.io or k8s.gcr.io registry, you can use the community-images kubectl plugin.
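
If you use the krew plugin manager, you can install and run the plugin with commands like these (check the plugin’s documentation for exact usage):

# Install the plugin, then scan the current cluster for workloads
# pulling images from community-owned registries
kubectl krew install community-images
kubectl community-images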

Add helm variables to override the repository

If you’re using helm to install and manage workloads, then you can override the image repository to pull from your private repositories. Most community charts have the image.repository variable, but you may need to verify your chart’s variables.
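
The following example assumes you’ve already added the metrics-server chart repository:

helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo update

Then install or upgrade the chart with the repository override: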

helm upgrade --install metrics-server \
  --set image.repository=<account number>.dkr.ecr.<region>.amazonaws.com/k8s/metrics-server/metrics-server \
  metrics-server/metrics-server

You can also set this variable in a values file to make it easier to configure the correct Region settings.
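
As a minimal sketch, a per-Region values file could look like this (the file name is hypothetical, and the key names depend on the chart):

# values-us-east-1.yaml
image:
  repository: <account number>.dkr.ecr.us-east-1.amazonaws.com/k8s/metrics-server/metrics-server

helm upgrade --install metrics-server -f values-us-east-1.yaml \
  metrics-server/metrics-server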

Automatically rewrite registry URI with policy

A third option is to have your image specification modified when workloads are submitted to the cluster. This can be accomplished with a custom webhook or generically with a policy. Here we’ll show you how to write a policy for Kyverno.

The benefit of dynamically rewriting jobs to use a cache is that it also modifies sidecars, init containers, and debug containers that may not have predefined manifests.

You can follow the installation instructions for Kyverno to get started. To deploy a prebuilt release image, you can use a command like the following in a development cluster as an admin user (pin the version to a current Kyverno release):
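
# Install a pinned Kyverno release manifest; substitute a current
# version from the Kyverno releases page
kubectl create -f https://github.com/kyverno/kyverno/releases/download/v1.10.0/install.yaml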

Now you can create a ClusterPolicy to perform the registry rewrite dynamically for workloads that try to use the upstream registry.k8s.io registry.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: change-deprecated-registry
  annotations:
    policies.kyverno.io/title: Use ECR Pull Through Cache for registry.k8s.io
    policies.kyverno.io/category: Best Practices, EKS Best Practices
    policies.kyverno.io/severity: high
    policies.kyverno.io/minversion: 1.6.0
    policies.kyverno.io/description: >-
      Use an ECR Pull Through Cache instead of the upstream registry.k8s.io community registry.
spec:
  mutateExistingOnPolicyUpdate: false
  rules:
  - name: change-deprecated-containers
    match:
      any:
      - resources:
          kinds:
          - Pod
    preconditions:
      all:
      - key: "{{request.operation || 'BACKGROUND'}}"
        operator: AnyIn
        value:
        - CREATE
        - UPDATE
      - key: registry.k8s.io
        operator: AnyIn
        value: "{{ images.containers.*.registry[] || `[]` }}"
    mutate:
      foreach:
      - list: "request.object.spec.containers"
        patchStrategicMerge:
          spec:
            containers:
            - name: "{{ element.name }}"
              image: <account number>.dkr.ecr.<region>.amazonaws.com/k8s/{{ images.containers."{{element.name}}".path}}:{{images.containers."{{element.name}}".tag}}
  - name: change-deprecated-initcontainers
    match:
      any:
      - resources:
          kinds:
          - Pod
    preconditions:
      all:
      - key: "{{request.operation || 'BACKGROUND'}}"
        operator: AnyIn
        value:
        - CREATE
        - UPDATE
      - key: "{{ request.object.spec.initContainers[] || '' | length(@) }}"
        operator: GreaterThanOrEquals
        value: 1
      - key: registry.k8s.io
        operator: AnyIn
        value: "{{ images.initContainers.*.registry[] || `[]` }}"
    mutate:
      foreach:
      - list: "request.object.spec.initContainers"
        patchStrategicMerge:
          spec:
            initContainers:
            - name: "{{ element.name }}"
              image: <account number>.dkr.ecr.<region>.amazonaws.com/k8s/{{ images.initContainers."{{element.name}}".path}}:{{images.initContainers."{{element.name}}".tag}}

It’s important to note that this policy may not catch every workload deployed to the cluster, depending on the failurePolicy set for your Kyverno webhook. Ignoring the webhook on failure may be needed during an outage, but it’s up to you to determine how your policy webhook should be configured.
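
You can check how the mutating webhooks in your cluster are currently configured with:

kubectl get mutatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy'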

Additional usage options

Here are some additional ways you can take advantage of a pull through cache with other Amazon ECR features.

Replication and cross-account permissions

Replication rules are only required in the Region where the pull through cache rule is created. You should create a replication rule before pulling images because replication happens when the repository is populated. If the pull through cache rule exists in us-east-1 and we want to replicate to us-west-2 and us-east-2, then we can use the following replication rule.

{
  "rules": [
    {
      "destinations": [
        {
          "region": "us-west-2",
          "registryId": "<account number>"
        },
        {
          "region": "us-east-2",
          "registryId": "<account number>"
        }
      ],
      "repositoryFilters": [
        {
          "filter": "k8s",
          "filterType": "PREFIX_MATCH"
        }
      ]
    }
  ]
}
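
If you save the rule as replication.json (an assumed file name), you can apply it with:

aws ecr put-replication-configuration \
  --replication-configuration file://replication.json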

After the rule has been created, all repositories that are pulled and cached in the primary Region are automatically created and replicated to the other Regions.

If you need to pull images from other accounts, then you need to add permissions on each repository in each Region. Make sure the repositories have already been created and replicated before adding cross-account permissions.

First create a repo-policy.json file to allow image pulls from the additional accounts you need.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "pull access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<your 2nd account>:root"
      },
      "Action": [
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ]
    }
  ]
}

Then get a list of all of the repositories that need the policy.

aws ecr describe-repositories --query 'repositories[].[repositoryName]' \
  --output text | grep 'k8s/'

If the list of repositories looks correct, you can set the policy for all of them with this command:

for REPO in $(aws ecr describe-repositories \
    --query 'repositories[].[repositoryName]' \
    --output text | grep 'k8s/'); do \
        AWS_PAGER="" aws ecr set-repository-policy \
            --repository-name "${REPO}" \
            --policy-text file://repo-policy.json
done

Make sure to repeat that step for each Region where you store containers.

Automatic repo creation when worker nodes pull images

Some companies have restrictions on what images can be used in their environments. If you have those restrictions, then you should identify and prepopulate the pull through cache images and tags. By default, Kubernetes worker nodes won’t be able to pull a new image through a pull through cache because creating a repository requires additional AWS Identity and Access Management (AWS IAM) permissions.

If you want to have repositories created automatically when Amazon EKS nodes request upstream images, then you need to add the following AWS IAM permission to worker nodes.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EksPtc",
            "Effect": "Allow",
            "Action": [
                "ecr:CreateRepository",
                "ecr:ReplicateImage",
                "ecr:BatchImportUpstreamImage"
            ],
            "Resource": "*"
        }
    ]
}
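
One way to grant this is an inline policy on the node instance role (the role and file names below are placeholders):

aws iam put-role-policy --role-name <node instance role> \
  --policy-name EksPtc --policy-document file://ptc-policy.json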

If you have a Kubernetes cluster in the same account and Region as the Amazon ECR registry, then you can deploy the following pod to validate image pulls are working.

apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: busybox
    image: <account number>.dkr.ecr.<region>.amazonaws.com/k8s/busybox:latest
    command: ['sh', '-c', 'echo "Success!" && sleep 1m']
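
Apply the manifest and check the logs to confirm the image pull succeeded (assuming you saved it as busybox-pod.yaml):

kubectl apply -f busybox-pod.yaml
kubectl wait --for=condition=Ready pod/busybox --timeout=2m
kubectl logs busybox   # prints "Success!"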

This pulls the image registry.k8s.io/busybox:latest and caches it in the Amazon ECR repo under the k8s/ namespace.

If your worker nodes are on a private subnet without internet access, then you need to prepopulate the images you want to use because the pull through cache requires internet access to query the upstream registry for image metadata. You can do that from a separate cluster that has internet access or manually via the command line with docker or finch. Once the image repository is created, it remains up-to-date without needing additional syncing.
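
As a sketch, prepopulating a few images from a machine with internet access could look like this (the image list and tags are illustrative):

for IMAGE in metrics-server/metrics-server:v0.6.3 \
             autoscaling/cluster-autoscaler:v1.26.2; do
  # Each pull creates and populates the cached repository
  docker pull <account number>.dkr.ecr.<region>.amazonaws.com/k8s/${IMAGE}
done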

Self-hosted Kubernetes

If you are hosting your own Kubernetes control plane on AWS, you should use the pull through cache for your control plane components. This ensures you can continue to deploy and manage clusters without relying on the community registry.

Tools like Kubespray and kOps have container mirror options, but you need to check your tooling’s support for private, authenticated mirrors. Some of the tools, including kOps, don’t yet support authenticated mirrors, which Amazon ECR pull through cache requires.

Cleaning up

If you no longer want to use the Amazon ECR pull through cache, you can delete the rule with the following command:

aws ecr delete-pull-through-cache-rule --ecr-repository-prefix k8s

To delete each repository that was created, first list all of the repositories:

aws ecr describe-repositories --query 'repositories[].[repositoryName]' \
  --output text | grep 'k8s/'

Then delete them with the following command. The --force flag is required because the cached repositories contain images:

for REPO in $(aws ecr describe-repositories \
    --query 'repositories[].[repositoryName]' \
    --output text | grep 'k8s/'); do \
        AWS_PAGER="" aws ecr delete-repository \
            --repository-name "${REPO}" --force
done

You will need to update your Kubernetes manifests, Helm charts, or policy rules to revert the image URI back to registry.k8s.io.
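
For static manifests, that’s the earlier search and replace in reverse:

grep -lr 'image: <account number>.dkr.ecr.<region>.amazonaws.com/k8s/' . \
  | xargs sed -i \
    's,<account number>.dkr.ecr.<region>.amazonaws.com/k8s,registry.k8s.io,g'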

Conclusion

In this post, we showed you how to create a container image pull through cache for Kubernetes images from registry.k8s.io. The upstream Kubernetes registry is run by volunteers in the Kubernetes community and is funded by credits from AWS and other cloud providers. There is no on-call schedule or service level agreement (SLA) for availability. While the community has done a fantastic job of scaling it and making it performant (thank you all!), it’s an external risk to depend on for critical availability.

With a container pull through cache and updated workload definitions, you have additional control of your workload dependencies and reliability. These changes have an initial setup cost, but they help the upstream registry and provide more control and insights into how these images are being used in your environment.

Justin Garrison

Justin Garrison is a Sr Developer Advocate in the AWS containers team. He is a long time open source contributor and cares deeply for open communities. Before AWS, Justin built infrastructure for Disney+ and animated movies such as Frozen II and Moana. You can reach him on Twitter via @rothgar