Automated Cleanup of Unused Images in Amazon ECR

Thanks to my colleague Anuj Sharma for a great blog on cleaning up old images in Amazon ECR using AWS Lambda.
—-

When you use Amazon ECR as part of a container build and deployment pipeline, a new image is created and pushed to ECR on every code change. As a result, repositories tend to quickly fill up with new revisions. Each Docker push command creates a version that may or may not be in use by Amazon ECS tasks, depending on which version of the image is actually used to run the task.

It’s important to have an automated way to discover the image that is in use by running tasks and to delete the stale images. You may also want to keep a predetermined number of images in your repositories for easy fallback on a previous version. In this way, repositories can be kept clean and manageable, and save costs by cleaning undesired images.

In this post, I show you how to query running ECS tasks, identify the images which are in use, and delete the remaining images in ECR. To keep a predetermined number of images irrespective of images in use by tasks, the solution should keep the latest n number of images. As ECR repositories may be hosted in multiple regions, the solution should also query either all available regions where ECR is available or only a specific region. You should be able to identify and print the images to be deleted, so that extra caution can be exercised. Finally, the solution should run at a scheduled time on a reoccurring basis, if required.

Cleaning solution

The solution contains two components:

A Python script, which identifies the images satisfying all the requirements and deletes the images; and
An AWS Lambda function, which executes the script using Amazon CloudWatch Events

Python script

The Python script is available from the AWS Labs repository. The logic coded is as follows:

Get a list of all the ECS clusters using ListClusters API operation.
For each cluster, get a list of all the running tasks using ListTasks.
Get the ARNs for each running task by calling DescribeTasks.
Get the container image for each running task using DescribeTaskDefinition.
Filter out only those container images that contain “.dkr.ecr.” and “:” . This ensures that you get a list of all the containers running on ECR that have an associated tag.
Get a list of all the repositories using DescribeRepositories.
For each repository, get the imagePushedAt value, tags, and SHA for every image using DescribeImages.
Ignore those images from the list that have a “latest” tag or which are currently running (as discovered in the earlier steps).
Delete the images that have the tags as discovered earlier, using BatchDeleteImage.
- The -dryrun flag prints out the images marked for deletion, without deleting them. Defaults to True, which only prints the images to be deleted.
- The -region flag deletes images only in the specified region. If not specified, all regions where ECS is available are queried and images deleted. Defaults to None, which queries all regions.
- The -imagestokeep flag keeps the latest n number of specified images and does not deletes them. Defaults to 100. For example, if –imagestokeep 20 is specified, the last 20 images in each repository are not deleted and the remaining images that satisfy the logic mentioned above are deleted.

Lambda function

The Lambda function runs the Python script, which is executed using CloudWatch scheduled events. Here’s a diagram of the workflow:

(1) The build system builds and pushes the container image. The build system also tags the image being pushed, either with the source control commit SHA hash value or the incremental package version. This gives full control over running a versioned container image in the task definition. The versioned images (images tagged in this way version the container images) are further used to run the tasks, using your own scheduler or by using the ECS scheduling capabilities.
(2) The Lambda function gets invoked by CloudWatch Events at the specified time and queries all available clusters.
(3) Based on the coded Python logic, the container image tags being used by running tasks are discovered.
(4) The Lambda function also queries all images available in ECR and, based on the coded logic, decides on the image tags to be cleaned.
(5) The Lambda function deletes the images as discovered by the coded logic.

Important: It’s a good practice to run the task definition with an image, which has a version in image tag, so that there is a better visibility and control over the version of running container. The build system can tag the image with the source code version or package version before pushing the image so that each image has the version tagged with each push, which can be used further to run the task. The script assumes that the task definitions have a versioned container specified. I recommend running the script first with the –dryrun flag to identify the images to be deleted.

Usage examples

Prints the images that are not used by running tasks and which are older than the last 100 versions, in all regions: python main.py

Deletes the images that are not used by running tasks and which are older than the last 100 versions, in all regions: python main.py -dryrun False

Deletes the images that are not used by running tasks and which are older than the last 20 versions (in each repository), in all regions: python main.py -dryrun False -imagestokeep 20

Deletes the images that are not used by running tasks and which are older than the last 20 versions (in each repository), in Oregon only: python main.py -dryrun False -imagestokeep 20 -region us-west-2

Assumptions

I’ve made the following assumptions for this solution:

Your ECR repository is in the same account as the running ECS clusters.
Each container image that has been pushed is tagged with a unique tag.

Docker tags and container image names are mutable. Multiple names can point to the same underlying image and the same image name can point to different underlying images at different points in time.

To trace the version of running code in a container to the source code, it’s a good practice to push the container image with a unique tag, which could be the source control commit SHA hash value or a package’s incremental version number; never reuse the same tag. In this way, you have better control over identifying the version of code to run in the container image, with no ambiguity.

Lambda function deployment

Now it’s time to create and deploy the Lambda function. You can do this through the console or the AWS CLI.

Using the AWS Management Console

Use the console to create a policy, role, and function.

Create a policy

Create a policy with the following policy test. This defines the access for the necessary APIs.

Sign in to the IAM console and choose Policies, Create Policy, Create Your Own Policy. Copy and paste the following policies, type a name, and choose Create.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1476919244000",
            "Effect": "Allow",
            "Action": [
                "ecr:BatchDeleteImage",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:DescribeImages"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "Stmt1476919353000",
            "Effect": "Allow",
            "Action": [
                "ecs:DescribeClusters",
                "ecs:DescribeTaskDefinition",
                "ecs:DescribeTasks",
                "ecs:ListClusters",
                "ecs:ListTaskDefinitions",
                "ecs:ListTasks"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Create a Lambda role

Create a role for the Lambda function, using the policy that you just created.

In the IAM console, choose Roles and enter a name, such as LAMBDA_ECR_CLEANUP. Choose AWS Lambda, select your custom policy, and choose Create Role.

Create a Lambda function

Create a Lambda function, with the role that you just created and the Python code available from the Lambda_ECR_Cleanup repository. For more information, see the Lambda function handler page in the Lambda documentation.

Schedule the Lambda function

Add the trigger for your Lambda function.

In the CloudWatch console, choose Events, Create rule, and then under Event selector, choose Schedule. For example, you can put the cron schedule as * 22 * * * to run the function every day at 10PM. For more information, see Scenario 2: Take Scheduled EBS Snapshots.

Using the AWS CLI

Use the following code examples to create the Lambda function using the AWS CLI. Replace values as needed.

Create a policy

Create a file with the following contents, called LAMBDA_ECR_DELETE.json. You reference this file when you create the IAM role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "lambda.amazonaws.com"
        ]
      },
      "Action": [
                "ecr:BatchDeleteImage",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:DescribeImages",
                "ecs:DescribeClusters",
                "ecs:DescribeTaskDefinition",
                "ecs:DescribeTasks",
                "ecs:ListClusters",
                "ecs:ListTaskDefinitions",
                "ecs:ListTasks"

            ],
      "Resource": ["*"]

    }
  ]
}

Create a Lambda role

aws iam create-role --role-name LAMBDA_ECR_DELETE --assume-role-policy-document file://LAMBDA_ECR_DELETE.json

Create a Lambda function

aws lambda create-function --function-name {NAME_OF_FUNCTION} --runtime python2.7 
    --role {ARN_NUMBER} --handler main.handler --timeout 15 
    --zip-file fileb://{ZIP_FILE_PATH}

Schedule a Lambda function

For more information, see Scenario 6: Run an AWS Lambda Function on a Schedule Using the AWS CLI.

Conclusion

The SDK for Python provides methods and access to ECS and ECR APIs, which can be leveraged to write logic to identify stale container images. Lambda can be used to run the code and avoid provisioning any servers. CloudWatch Events can be used to trigger the Lambda code on a recurring schedule, to keep your repositories clean and manageable and control costs.

If you have questions or suggestions, please comment below.

AWS Compute Blog