Speeding up Windows container launch times with EC2 Image Builder and an image cache strategy

I have often heard from customers that Windows containers are slow to launch because of the container image size. This is partly true; however, it is important to demystify “the big image” and to show how an image cache strategy avoids expensive disk operations (layer extraction) and speeds up Windows container launches.

I also often hear Linux containers compared with Windows containers, and how much faster Linux is. That is true, but the comparison doesn’t bring much value because each platform addresses completely different problems. As an example, a developer should not run ASP.NET applications on a Linux container, nor should they run Python on a Windows container.

Let’s put the comparison aside and boost the Windows container launch.

Before we dig into how to solve the problem, let’s understand it first. Let’s say you are running an Amazon Elastic Container Service (Amazon ECS) cluster based on Windows or an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with Windows node groups. In a high-pressure container environment, where EC2 Auto Scaling is frequently triggered to add capacity to the cluster, it may take around 4 to 8 minutes from the time EC2 Auto Scaling is triggered to the time the Windows container accepts traffic. This can happen when nothing is done to avoid expensive I/O operations.

Demystifying the Windows container image

Two container base images are part of the ECS/EKS Optimized Windows AMI.

mcr.microsoft.com/windows/servercore
mcr.microsoft.com/windows/nanoserver

These built-in base images are already extracted on the ECS/EKS Optimized Windows AMI. During a push/pull operation, only the layers that compose your image are uploaded to or downloaded from the repository. The following example shows an Amazon Elastic Container Registry (Amazon ECR) image called iis-dnn-a82378d43adb that is only 302.25MB compressed. This is the size of the upload/download during push/pull operations.

However, the following output from docker image ls shows the iis-dnn-a82378d43adb image size as 5.73GB on disk. That doesn’t mean Docker pulled and extracted that amount: during the pull operation, only the compressed 302.25MB mentioned above was downloaded and extracted.

REPOSITORY                                                          TAG                 IMAGE ID            CREATED             SIZE
mcr.microsoft.com/windows/servercore                                ltsc2019            152749f71f8f        2 weeks ago         5.27GB
010101011575.dkr.ecr.us-east-1.amazonaws.com/iis-dnn-a82378d43adb   latest              de4f5f1edfe0        3 weeks ago         5.73GB

The size column shows the overall size of 5.73GB. Breaking it down:

In-built base image = 5.27GB
Application layers = compressed 302.25MB / Extracted on disk = 460MB
Total image size on disk = 5.73GB

The base image already exists on disk, so the image adds only 460MB. The next time you see an image that is several GB in size, don’t worry too much: it is likely that more than 80% of it is already on disk as a built-in base image. As the breakdown shows, the overall image size isn’t the main cause of a slow Windows container launch. Instead, it is the time the pull operation takes to download, extract, and make the additional layers available.
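
The breakdown above is simple arithmetic. As a minimal sketch, the following Python snippet reproduces it using only the figures quoted in this post; the constant and function names are mine and purely illustrative.

```python
# Figures quoted in this post; names are illustrative, not part of any tool.
BASE_IMAGE_GB = 5.27           # servercore:ltsc2019, already extracted on the AMI
TOTAL_IMAGE_GB = 5.73          # SIZE column reported by `docker image ls`
COMPRESSED_LAYERS_MB = 302.25  # what actually travels during a pull


def extra_disk_gb(total_gb: float, base_gb: float) -> float:
    """Disk space the application layers add on top of the cached base image."""
    return round(total_gb - base_gb, 2)


def cached_share(total_gb: float, base_gb: float) -> float:
    """Fraction of the reported image size that is already on disk."""
    return base_gb / total_gb


if __name__ == "__main__":
    extra = extra_disk_gb(TOTAL_IMAGE_GB, BASE_IMAGE_GB)
    share = cached_share(TOTAL_IMAGE_GB, BASE_IMAGE_GB)
    print(f"Application layers on disk: {extra} GB")
    print(f"Already cached as base image: {share:.0%}")
    print(f"Downloaded during pull: {COMPRESSED_LAYERS_MB} MB")
```

With these numbers, the base image accounts for roughly 92% of the reported size, which is why the SIZE column alone says little about how much work a pull has to do.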

Results of a container image cache strategy

To speed up the Windows container deployment, we will use Amazon EC2 Image Builder to pull container images from an Amazon ECR repository during the AMI build pipeline. The container images to pull should be the ones that are essential to the entire solution, for example, application images and sidecar containers like Fluentd, Fluent Bit, or any other container image your solution needs.

Using this approach, all the expensive I/O operations (file extraction) happen during the AMI build instead of at container launch. As a result, all the necessary image layers are already extracted on the AMI and ready to use, reducing the time it takes a Windows container to launch and start accepting traffic.

The output below contains the results of an ASP.NET application running on a Windows pod hosted by Amazon EKS. The environment is composed of two different node groups, one with a vanilla EKS Optimized Windows AMI and the second node group with a custom EKS Optimized AMI, built using the solution proposed in this blog.

In this first example, the kubectl describe pod output shows the events from when the pod is scheduled to the node until it reaches the “Started” status. As you can see, the “Pulled” event is reported at 54 seconds, and its message shows that the image was already present on the machine, so Docker only had to check the image metadata. In this example, the cache strategy is implemented on the Windows node.

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  70s   default-scheduler  Successfully assigned default/iis-dnn-ondemand-deployment-574c8789bd-f986k
 to ip-172-31-42-147.ec2.internal
  Normal  Pulled     *54s*   kubelet            Container image "010101010575.dkr.ecr.us-east-1.amazonaws.com/iis-dnn-a823
78d43adb" already present on machine
  Normal  Created    53s   kubelet            Created container iis-dnn-ondemand
  Normal  Started    16s   kubelet            Started container iis-dnn-ondemand

 

In the next example, the time taken to pull the image is 6 minutes: Docker identified that the image wasn’t present on the machine, pulled the additional layers from Amazon ECR, and extracted them to disk. The extraction is the most expensive operation and the most common root cause of delays in Windows container launches.

Comparing the two results, it is clear that the image cache strategy can cut the time it takes for the first pod to reach the “Started” status and start receiving traffic by up to 6x.

Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  7m16s  default-scheduler  Successfully assigned default/iis-dnn-ondemand-deployment-7f7545cf48-82gx
n to ip-172-31-1-96.ec2.internal
  Normal  Pulling    *7m4s*   kubelet            Pulling image "010101010575.dkr.ecr.us-east-1.amazonaws.com/iis-dnn-a8237
8d43adb"
  Normal  Pulled     64s    kubelet            Successfully pulled image "010101010575.dkr.ecr.us-east-1.amazonaws.com/i
is-dnn-a82378d43adb" in *6m0.7576396s*
  Normal  Created    63s    kubelet            Created container iis-dnn-ondemand
  Normal  Started    62s    kubelet            Started container iis-dnn-ondemand
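
To make the comparison concrete, here is a small Python sketch that turns the Age values from the two event listings into scheduled-to-started durations. It assumes the Age column reports how long ago each event was recorded, so the difference between two ages is the time elapsed between those events; the helper names are hypothetical, and the ages kubectl prints are rounded, so treat the result as approximate.

```python
import re

_AGE_PART = re.compile(r"(\d+)([hms])")
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def age_to_seconds(age: str) -> int:
    """Convert a kubectl-style age such as '7m16s' or '70s' to seconds."""
    parts = _AGE_PART.findall(age)
    if not parts:
        raise ValueError(f"unrecognized age: {age!r}")
    return sum(int(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def scheduled_to_started(scheduled_age: str, started_age: str) -> int:
    """Seconds elapsed between the Scheduled and Started events."""
    return age_to_seconds(scheduled_age) - age_to_seconds(started_age)


if __name__ == "__main__":
    # Ages copied from the two event listings in this post.
    cached = scheduled_to_started("70s", "16s")      # node with cached images
    uncached = scheduled_to_started("7m16s", "62s")  # node without the cache
    print(f"cached: {cached}s, uncached: {uncached}s, "
          f"ratio: {uncached / cached:.1f}x")
```

Under that reading, the pod on the cached node goes from scheduled to started in under a minute, while the pod on the uncached node needs a little over six minutes.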

Building a custom EKS/ECS Windows AMI to speed up Windows container launch

How are we going to achieve this?

Prerequisites and assumptions:

In this blog post, I’m using Amazon EKS as the orchestrator. Because the cache strategy is built at the AMI level, you can use the same approach for an Amazon ECS cluster.

In this blog, we are going to do the following tasks:

  1. Create an EC2 Image Builder custom component.
  2. Build the custom EKS/ECS Optimized Windows AMI pipeline.
  3. Create an EKS node group using the custom EKS Optimized Windows AMI.
  4. Check results.

1. Create an EC2 Image Builder custom component

Image Builder uses a component management application (AWSTOE) to orchestrate complex workflows, modify system configurations, and test your systems. There is no additional server setup required to use the Image Builder in the AWS Management Console or to use Image Builder commands that interact with AWSTOE on your behalf.

AWSTOE uses YAML documents to define the scripts that customize your image. The documents can include build, validate, and test phases. For more information about YAML documents, see document schema and definitions.

1.1 In the EC2 Image Builder section of the AWS Management Console, click Components and then Create component.

1.2 Create a Build type component that is compatible with Windows.

1.3 In the Definition document, add the following content, replacing the Amazon ECR registry URL and image URLs with the ones that match your environment.

name: DockerPull
description: DockerImageCacheStrategy.
schemaVersion: 1.0

phases:
  - name: build
    steps:
      - name: Dockerpull
        action: ExecutePowerShell
        inputs:
          commands:
            # Authenticate Docker against the Amazon ECR registry (replace with your registry URL).
            - (Get-ECRLoginCommand).Password | docker login --username AWS --password-stdin 01010101575.dkr.ecr.us-east-1.amazonaws.com
            # Pull every image that should be cached on the AMI (replace with your image URLs).
            - docker pull 01010101575.dkr.ecr.us-east-1.amazonaws.com/iis-dnn-a82378d43adb
            - docker pull 01010101575.dkr.ecr.us-east-1.amazonaws.com/fluentd-a729311dbs

1.4 Click Create component.
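
If the list of images to cache changes often, you may prefer to generate the definition document rather than edit it by hand. The following Python sketch renders the same document for any registry and list of repositories. The function name and its parameters are hypothetical, and the output mirrors the component from step 1.3 rather than any official template.

```python
from typing import Sequence


def build_component_document(registry: str, images: Sequence[str]) -> str:
    """Render a build document that logs in to ECR and pulls each image."""
    commands = [
        "(Get-ECRLoginCommand).Password | docker login "
        f"--username AWS --password-stdin {registry}"
    ]
    commands += [f"docker pull {registry}/{image}" for image in images]

    lines = [
        "name: DockerPull",
        "description: DockerImageCacheStrategy.",
        "schemaVersion: 1.0",
        "",
        "phases:",
        "  - name: build",
        "    steps:",
        "      - name: Dockerpull",
        "        action: ExecutePowerShell",
        "        inputs:",
        "          commands:",
    ]
    # One YAML list item per PowerShell command, in execution order.
    lines += [f"            - {command}" for command in commands]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_component_document(
        "01010101575.dkr.ecr.us-east-1.amazonaws.com",
        ["iis-dnn-a82378d43adb", "fluentd-a729311dbs"],
    ))
```

Generating the login and pull commands from one registry value keeps them in sync, so the login always targets the same registry the images are pulled from.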

2. Build the custom EKS/ECS Optimized Windows AMI pipeline

In this step, we’ll create an image pipeline to automatically build the custom EKS/ECS optimized Windows AMI. On the main page of EC2 Image Builder, click Create image pipeline.

2.1 Specify the pipeline name, description, and build schedule. In my example, the pipeline runs automatically every week to make sure the latest Windows updates are installed on my image. Click Next.

2.2 Click Create new recipe. Select the image type as Amazon Machine Image (AMI).

2.3 For the source image, select Windows and Quick start (Amazon-managed).

2.4 This is an important step. You must select the Windows Server image that will serve as the base image for your Windows nodes. In this example, I’ll select Windows Server 2004 English Core Base x86. For Amazon ECS, EC2 Image Builder already offers images with the ECS agent installed, which is not the case for EKS, so we’ll prepare the Windows Server 2004 image to include the EKS components.

2.5 Another great option is to select Use latest available OS version, which includes all the Windows updates available at the time AWS generated the image, as well as all new AMI features and performance improvements.

2.6 The Components panel is where we attach the component we created in step 1 to the image pipeline, but first let’s make sure we are adding core components to this AMI. In the search box, type EKS.

In the Sequence panel, you have two options: let the pipeline use the latest kubelet and Docker versions, or pin them to a specific version. Note that the component description mentions EKS version 1.16. In the Sequence panel, we select the latest available component version, meaning the latest version released by AWS, which at the time of this blog post is Kubernetes 1.20. You can learn more in the official documentation.

2.7 A good option is to add the update-windows component to have the latest security updates installed on the AMI. Search for the update-windows component.

2.8 It is time to add the cache strategy component we called Docker pull. Change the search filter to Owned by me and search for Docker pull or the name you chose in step 1.

2.9 You will end up having three components:

  • eks-optimized-ami-windows
  • update-windows
  • docker pull

2.10 The IAM instance profile generated by EC2 Image Builder already has the necessary policy to log in to Amazon ECR. In step 1.3, we used the following command to log in to the Amazon ECR registry:

(Get-ECRLoginCommand).Password | docker login --username AWS --password-stdin 01010101575.dkr.ecr.us-east-1.amazonaws.com

2.11 Follow all the necessary next steps until the pipeline and infrastructure creation completes. Distribute the image using the EC2 Image Builder distribution settings.

3. Create an Amazon EKS node group using the custom EKS Optimized Windows AMI

In order to test the results, you can use your favorite deployment tool to add a new node group using the new AMI or edit the existing launch template attached to an existing Auto Scaling group. At this point you should manually start your EC2 Image Builder pipeline in order to generate the AMI.

In this blog post, I’ll use eksctl to create a new Amazon EKS node group and specify the custom AMI.

3.1 Adjust your eksctl config file to add a new node group and specify the custom AMI. It is important to specify both the AMI ID and the AMI family. If you don’t specify amiFamily, the node will not join the cluster.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: eks-windows
  region: us-east-1
  version: '1.20'

availabilityZones:
  - us-east-1f
  - us-east-1b

nodeGroups:
  - name: windows-ng-sac2004-customami-test
    instanceType: c5.xlarge
    minSize: 1
    # Custom AMI produced by the EC2 Image Builder pipeline
    ami: ami-0556136c21149e0ac
    # Required with a custom AMI; without it the node does not join the cluster
    amiFamily: WindowsServer2004CoreContainer
.......
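
Because a missing amiFamily only shows up later as a node that never joins the cluster, it can be worth checking the node group definition before running eksctl. The following Python sketch does that on a plain dictionary that mirrors the node group above. The helper is hypothetical and is not part of eksctl; it only checks that both keys are present and non-empty.

```python
# Keys this post says must both be set when using a custom AMI.
REQUIRED_CUSTOM_AMI_KEYS = ("ami", "amiFamily")


def missing_custom_ami_keys(node_group: dict) -> list:
    """Return the custom-AMI keys that are absent or empty in a node group."""
    return [key for key in REQUIRED_CUSTOM_AMI_KEYS if not node_group.get(key)]


if __name__ == "__main__":
    # Values copied from the node group in the config file above.
    node_group = {
        "name": "windows-ng-sac2004-customami-test",
        "instanceType": "c5.xlarge",
        "minSize": 1,
        "ami": "ami-0556136c21149e0ac",
        "amiFamily": "WindowsServer2004CoreContainer",
    }
    missing = missing_custom_ami_keys(node_group)
    print("OK" if not missing else f"Missing: {', '.join(missing)}")
```

A check like this fits naturally in a CI step that runs before the node group is created.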

You can also let EC2 Image Builder create a new version of an existing launch template that references your latest Amazon Machine Image (AMI) and automatically update your EC2 Auto Scaling group.

Conclusion

In this blog post, I showed how you can use a container image cache strategy to speed up Windows container launches. You can also use the same approach to speed up any container workload, independent of OS, such as sidecar containers, CI build containers, and more.

Marcio Morales

Sr. Solution Architect - Microsoft Specialist