AWS Compute Blog

Optimizing Disk Usage on Amazon ECS

My colleague Jay McConnell sent a nice guest post that describes how to track and optimize the disk space used in your Amazon ECS cluster.

On October 4, Amazon ECS launched support for automated container and image cleanup. Read about it in the documentation.

Failure to monitor disk space utilization can cause problems that prevent Docker containers from working as expected. Amazon EC2 instance disks are used for multiple purposes, such as Docker daemon logs, containers, and images. This post covers techniques to monitor and reclaim disk space on the cluster of EC2 instances used to run your containers.

Amazon ECS is a highly scalable, high performance container management service that supports Docker containers and allows you to run applications easily on a managed cluster of Amazon EC2 instances. You can use ECS to schedule the placement of containers across a cluster of EC2 instances based on your resource needs, isolation policies, and availability requirements.

The ECS-optimized AMI stores images and containers in an EBS volume that uses the devicemapper storage driver in a direct-lvm configuration. As devicemapper stores every image and container in a thin-provisioned virtual device, free space for container storage is not visible through standard Linux utilities such as df. This poses an administrative challenge when it comes to monitoring free space and can also result in increased time troubleshooting task failures, as the cause may not be immediately obvious.
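
Although df doesn't report the thin pool's free space, you can check it by hand with docker info. For example, on an instance using the devicemapper driver (the numbers below are illustrative):

$ docker info | grep 'Space Available'
 Data Space Available: 21.23 GB
 Metadata Space Available: 25.16 MB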

Disk space errors can result in new tasks failing to launch with the following error message:

 Error running deviceCreate (createSnapDevice) dm_task_run failed

NOTE: The scripts and techniques described in this post were tested against the ECS 2016.03.a AMI. You may need to modify these techniques depending on your operating system and environment.

Monitoring

You can use Amazon CloudWatch custom metrics to track EC2 instance disk usage. After a CloudWatch metric is created, you can add a CloudWatch alarm to alert you proactively, before low disk space causes a problem on your cluster.

Step 1: Create an IAM role

The first step is to ensure that the EC2 instance profile for the EC2 instances in your ECS cluster includes a policy that allows the “cloudwatch:PutMetricData” action, as this is required to publish metrics to CloudWatch.
In the IAM console, choose Policies, Create Policy. Choose Create Your Own Policy, name it “CloudwatchPutMetricData”, and paste in the following JSON policy document:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudwatchPutMetricData",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

After you have saved the policy, navigate to Roles and select the role attached to the EC2 instances in your ECS cluster. Choose Attach Policy, select the “CloudwatchPutMetricData” policy, and choose Attach Policy.
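
If you prefer to script this step, you can create and attach the same policy with the AWS CLI. This is a sketch that assumes the JSON above is saved as cloudwatch-putmetricdata.json and that your container instances use a role named ecsInstanceRole; substitute your own file name, role name, and account ID:

aws iam create-policy \
    --policy-name CloudwatchPutMetricData \
    --policy-document file://cloudwatch-putmetricdata.json

aws iam attach-role-policy \
    --role-name ecsInstanceRole \
    --policy-arn arn:aws:iam::<account-id>:policy/CloudwatchPutMetricData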

Step 2: Push metrics to CloudWatch

Connect to each EC2 instance in the ECS cluster over SSH, then use a text editor to create the following bash script:

#!/bin/bash

### Get docker free data and metadata space and push to CloudWatch metrics
### 
### requirements:
###  * must be run from inside an EC2 instance
###  * docker with devicemapper backing storage
###  * aws-cli configured with instance-profile/user with the put-metric-data permissions
###  * local user with rights to run docker cli commands
###
### Created by Jay McConnell

# install aws-cli, bc and jq if required
if [ ! -f /usr/bin/aws ]; then
  yum -qy -d 0 -e 0 install aws-cli
fi
if [ ! -f /usr/bin/bc ]; then
  yum -qy -d 0 -e 0 install bc
fi
if [ ! -f /usr/bin/jq ]; then
  yum -qy -d 0 -e 0 install jq
fi

# Collect region and instanceid from metadata
AWSREGION=`curl -ss http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region`
AWSINSTANCEID=`curl -ss http://169.254.169.254/latest/meta-data/instance-id`

function convertUnits {
  # convert units back to bytes, as both the Docker API and CLI only provide friendly units
  if [ "$1" == "b" ] ; then
    echo $2
  elif [ "$1" == "kb" ] ; then
    awk 'BEGIN{ printf "%.0f\n", '$2' * 1000 }'
  elif [ "$1" == "mb" ] ; then
    awk 'BEGIN{ printf "%.0f\n", '$2' * 1000**2 }'
  elif [ "$1" == "gb" ] ; then
    awk 'BEGIN{ printf "%.0f\n", '$2' * 1000**3 }'
  elif [ "$1" == "tb" ] ; then
    awk 'BEGIN{ printf "%.0f\n", '$2' * 1000**4 }'
  else
    echo "Unknown unit $1"
    exit 1
  fi
}

function getMetric {
  # Get free space and split off the unit
  if [ "$1" == "Data" ] || [ "$1" == "Metadata" ] ; then
    docker info | awk '/'$1' Space Available/ {print tolower($5), $4}'
  else
    echo "Metric must be either 'Data' or 'Metadata'"
    exit 1
  fi
}

data=$(convertUnits `getMetric Data`)
aws cloudwatch put-metric-data --value $data --namespace ECS/$AWSINSTANCEID --unit Bytes --metric-name FreeDataStorage --region $AWSREGION
data=$(convertUnits `getMetric Metadata`)
aws cloudwatch put-metric-data --value $data --namespace ECS/$AWSINSTANCEID --unit Bytes --metric-name FreeMetadataStorage --region $AWSREGION

Next, set the script to be executable:

chmod +x /path/to/metricscript.sh
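
Before scheduling the script, you can run it once by hand and confirm that the metrics show up in CloudWatch. The list-metrics call below needs CloudWatch read permissions, so run it from a machine or role that has them; replace the placeholders with your instance ID and region:

sudo /path/to/metricscript.sh
aws cloudwatch list-metrics --namespace ECS/<instance-id> --region <region>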

Now, schedule the script to run every 5 minutes via cron. To do this, create the file /etc/cron.d/ecsmetrics with the following contents:

*/5 * * * * root /path/to/metricscript.sh

This pulls both free data and metadata space every 5 minutes and pushes them to CloudWatch under the ECS/<instance ID> namespace.
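
With the metrics flowing, you can add a CloudWatch alarm so that you are notified before the pool runs out of space. The following is a sketch that alarms when average free data space stays below 1 GB for two consecutive 5-minute periods; the namespace, threshold, and SNS topic ARN are placeholders to adapt to your environment:

aws cloudwatch put-metric-alarm \
    --alarm-name ecs-low-free-data-space \
    --namespace ECS/<instance-id> \
    --metric-name FreeDataStorage \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 1000000000 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:<region>:<account-id>:<topic-name> \
    --region <region>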

Disk cleanup

The next step is to clean up the disk, either automatically on a schedule or manually. This post covers cleanup of tasks and images; there is a great blog post, Send ECS Container Logs to CloudWatch Logs for Centralized Monitoring, that covers pushing log files to CloudWatch. Using CloudWatch Logs instead of local log files reduces disk utilization and provides a resilient and centralized place from which to manage logs.

Take a look at what you can do to remove unneeded containers and images from your instances.
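
If you need to reclaim space right away instead of waiting for the agent, standard Docker commands can remove exited containers and dangling images as a one-off. Run these on the container instance; they only affect containers that have already exited and images that no tag references:

# remove all containers that have exited
docker rm $(docker ps -aq --filter status=exited)
# remove dangling (untagged) images
docker rmi $(docker images -q --filter dangling=true)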

Delete unused containers

Stopped containers should be deleted if they are no longer needed. By default, the ECS agent waits 3 hours after a task stops before deleting its containers. This behavior can be customized by adding the following to /etc/ecs/ecs.config:

ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m

This reduces the wait time to 10 minutes, so stopped containers are removed sooner.

Delete unused images

By default, the ECS agent checks for images to delete every 30 minutes. An image is deleted only if it was pulled more than an hour earlier, and at most 5 images are deleted per check.

These defaults can be changed with the following configuration options:

ECS_IMAGE_CLEANUP_INTERVAL=10m
ECS_IMAGE_MINIMUM_CLEANUP_AGE=15m
ECS_NUM_IMAGES_DELETE_PER_CYCLE=10

With these settings, the agent checks every 10 minutes and deletes up to 10 unused images that are at least 15 minutes old.

Applying config changes

For these changes to take effect, the ECS agent needs to be restarted. On the ECS-optimized AMI, you can do this over SSH:

sudo stop ecs; sudo start ecs

To set this up for new instances, attach the following EC2 user data:

#!/bin/bash
cat /etc/ecs/ecs.config | grep -v 'ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION' > /tmp/ecs.config
echo "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=5m" >> /tmp/ecs.config
mv -f /tmp/ecs.config /etc/ecs/
stop ecs
start ecs

Conclusion

The techniques described in this post provide visibility into a critical resource for running Docker, the disk space used on the cluster's EC2 instances, along with ways to clean up unnecessary storage. If you have any questions or suggestions for other best practices, please comment below.