How do I resolve the "DockerTimeoutError" error in AWS Batch?

The jobs in my AWS Batch compute environment are failing and are returning the following error: "DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s." How do I troubleshoot "DockerTimeoutError" errors in AWS Batch?

Short description

If a docker create or docker start API call takes longer than four minutes to complete, then AWS Batch returns a DockerTimeoutError error.

Note: The default timeout limit set by the Amazon Elastic Container Service (Amazon ECS) container agent is four minutes.

The error can occur for a variety of reasons, but it's commonly caused by one of the following:

  • The Amazon Elastic Block Store (Amazon EBS) volumes of the ECS instances in your AWS Batch compute environment are under high I/O pressure from the other jobs in your queue. These jobs, which are created and run on the same ECS instance, can deplete the burst balance of the volumes. To resolve this issue, follow the steps in the Resolve any burst balance issues section of this article.
  • Stopped ECS containers aren't being cleaned up quickly enough to free up the Docker daemon. You can experience Docker issues if you're using a customized Amazon Machine Image (AMI) instead of the default AMI provided by AWS Batch. The default AMI for AWS Batch optimizes your Amazon ECS cleanup settings. To resolve this issue, follow the steps in the Resolve any Docker issues section of this article.

If neither of these issues is causing the error, then you can further troubleshoot the issue by doing the following:

  • Check your Docker logs to identify the source of the error.
  • Run the Amazon ECS logs collector script on the ECS instances in the ECS cluster associated with your AWS Batch compute environment.

Resolution

Resolve any burst balance issues

Check the burst balance of your ECS instance

1.    Open the Amazon ECS console.

2.    In the navigation pane, choose Clusters. Then, choose the cluster that contains your job.

Note: The name of the cluster starts with the name of the compute environment, followed by _Batch_ and a random hash of numbers and letters.

3.    Choose the ECS Instances tab.

4.    From the EC2 Instance column, choose your instance.

Note: To find the instance that ran the failed job, run the AWS Batch describe-jobs command. The container instance ARN, which ends with the container instance ID, appears in the output for containerInstanceArn.

5.    On the Description tab in the Amazon EC2 console, under Block devices, choose the link for your volume.

6.    On the block device pop-up window, for EBS ID, choose your volume.

7.    Choose the Monitoring tab. Then, choose Burst Balance to check your burst balance metrics. If your burst balance drops to 0 percent, then the volume's I/O credits are depleted and the volume is limited to its baseline performance.
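If you prefer to check from a terminal, the same lookup can be sketched in Python. This is a minimal sketch, not an official tool. It assumes that you saved the JSON output of aws batch describe-jobs and of aws cloudwatch get-metric-statistics (namespace AWS/EBS, metric BurstBalance, statistic Minimum) and loaded each with json.load. The function names and the sample values are hypothetical.

```python
def container_instance_arns(describe_jobs_output):
    """Map each job ID to the container instance ARN reported for it."""
    arns = {}
    for job in describe_jobs_output.get("jobs", []):
        arn = job.get("container", {}).get("containerInstanceArn")
        if arn:  # a job that never reached an instance has no ARN
            arns[job["jobId"]] = arn
    return arns


def depleted_timestamps(metric_output, threshold=0.0):
    """Return timestamps, oldest first, where the minimum BurstBalance
    (a percentage) was at or below the threshold."""
    points = sorted(metric_output.get("Datapoints", []),
                    key=lambda point: point["Timestamp"])
    return [p["Timestamp"] for p in points if p["Minimum"] <= threshold]


if __name__ == "__main__":
    # Made-up sample data in the shape that the two CLI commands return.
    jobs = {"jobs": [{
        "jobId": "example-job-id",
        "container": {
            "containerInstanceArn":
                "arn:aws:ecs:us-east-1:111122223333:"
                "container-instance/example-cluster/example-instance-id"
        },
    }]}
    metric = {"Datapoints": [
        {"Timestamp": "2024-01-01T00:00:00Z", "Minimum": 37.5},
        {"Timestamp": "2024-01-01T00:05:00Z", "Minimum": 0.0},
    ]}
    print(container_instance_arns(jobs))
    print(depleted_timestamps(metric))
```

Any timestamp that depleted_timestamps returns marks a period when the volume had no burst credits left, which is when docker create and docker start calls are most likely to stall.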

Create a launch template for your managed compute environment

Note: If you change the launch template, you must create a new compute environment.

1.    Open the Amazon EC2 console, and then choose Launch Templates.

2.    Choose Create launch template.

3.    For AMI ID, select the default Amazon ECS optimized AMI.

4.    In the Storage (Volumes) section, choose a volume type in the Volume type column. Then, enter an integer value in the Size(GiB) column.

Note: If you choose Provisioned IOPS SSD (io1) for your volume type, enter an integer value that's permitted for IOPS.

5.    Choose Create launch template.

6.    Use your new launch template to create a new managed compute environment.
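The storage settings from step 4 can also be written as launch template data for the aws ec2 create-launch-template command. The following is a minimal sketch under stated assumptions: the helper name is hypothetical, the root device name /dev/xvda is an assumption that depends on your AMI, and the io1 limits in the code (100 to 64,000 IOPS, at most 50 IOPS per GiB) should be verified against the current Amazon EBS documentation. The sketch handles only gp2 and io1 volumes.

```python
import json


def storage_launch_template_data(volume_type, size_gib, iops=None,
                                 device_name="/dev/xvda"):
    """Build the BlockDeviceMappings part of launch template data."""
    if volume_type not in ("gp2", "io1"):
        raise ValueError("this sketch handles only gp2 and io1 volumes")
    if size_gib < 1:
        raise ValueError("volume size must be a positive integer (GiB)")
    ebs = {"VolumeType": volume_type, "VolumeSize": size_gib}
    if volume_type == "io1":
        # io1 needs an explicit IOPS value within the permitted range.
        if iops is None:
            raise ValueError("io1 volumes need an IOPS value")
        if not 100 <= iops <= 64000 or iops > 50 * size_gib:
            raise ValueError("IOPS value isn't permitted for this size")
        ebs["Iops"] = iops
    elif iops is not None:
        raise ValueError("gp2 volumes don't take an IOPS value")
    return {"BlockDeviceMappings": [{"DeviceName": device_name, "Ebs": ebs}]}


if __name__ == "__main__":
    data = storage_launch_template_data("io1", 200, iops=5000)
    # Pass this JSON to --launch-template-data.
    print(json.dumps(data, indent=2))
```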

Create an AWS Batch compute environment with your AMI

Note: If you change the AMI, you must create a new compute environment because the AMI ID parameter can't be updated.

1.    Open the Amazon EC2 console.

2.    Choose Launch instance.

3.    Follow the steps in the setup wizard to create your instance.

Important: On the Add Storage page, modify the volume type or size of your instance. For General Purpose SSD (gp2) volumes, the larger the volume size, the greater the baseline performance is and the faster the volume replenishes its burst balance. To get better performance for high I/O loads, change the volume type to io1.

4.    Create a compute resource AMI from your instance.

5.    Create a compute environment for AWS Batch that includes your AMI ID.
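To estimate how much a larger volume helps, you can work through the gp2 burst arithmetic. The sketch below assumes the published gp2 model: a baseline of 3 IOPS per GiB (minimum 100, maximum 16,000), bursts of up to 3,000 IOPS, and a bucket of 5.4 million I/O credits that refills at the baseline rate. The function names are hypothetical, and the figures are estimates for a sustained, worst-case load.

```python
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full burst bucket
GP2_BURST_IOPS = 3_000         # maximum burst rate


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size."""
    return max(100, min(3 * size_gib, 16_000))


def seconds_to_deplete(size_gib):
    """Seconds until a full bucket is empty at a sustained 3,000 IOPS.

    Returns None when the baseline already covers the burst rate,
    because such a volume never draws on burst credits.
    """
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None
    return GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline)


def seconds_to_refill(size_gib):
    """Seconds for an idle volume to refill an empty bucket."""
    return GP2_CREDIT_BUCKET / gp2_baseline_iops(size_gib)


if __name__ == "__main__":
    for size in (30, 100, 500, 1000):
        print(size, gp2_baseline_iops(size),
              seconds_to_deplete(size), seconds_to_refill(size))
```

Under these assumptions, a 100 GiB volume sustains a full burst for about 33 minutes and needs 5 hours of idle time to refill, while a 500 GiB volume sustains the burst for an hour and refills in an hour.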

Resolve any Docker issues

By default, the Amazon ECS container agent automatically cleans up stopped tasks and Docker images that your container instances aren't using. If you run new jobs with new images, then your container storage might fill up with Docker images you aren't using.

1.    Use SSH to connect to the container instance for your AWS Batch compute environment.

2.    To inspect the Amazon ECS container agent, run the docker inspect ecs-agent command. Then, review the Env section in the output.

Note: You can reduce the values of the following variables to speed up task and image cleanup:

  • ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
  • ECS_IMAGE_CLEANUP_INTERVAL
  • ECS_IMAGE_MINIMUM_CLEANUP_AGE
  • ECS_NUM_IMAGES_DELETE_PER_CYCLE

You can also use tunable parameters for automated task and image cleanup.

3.    Create a new AMI with updated values.

-or-

Create a launch template with the user data that includes your new environment variables.
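For step 2, the Env section can be long. The following sketch filters it down to the four cleanup variables. It assumes that you saved the output of docker inspect ecs-agent as JSON text and that the agent's environment appears under Config.Env, which is where Docker reports container environment variables. The function name is hypothetical. A variable that doesn't appear in the result isn't set explicitly, so the agent uses its built-in default for it.

```python
import json

CLEANUP_VARIABLES = (
    "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION",
    "ECS_IMAGE_CLEANUP_INTERVAL",
    "ECS_IMAGE_MINIMUM_CLEANUP_AGE",
    "ECS_NUM_IMAGES_DELETE_PER_CYCLE",
)


def agent_cleanup_settings(inspect_json):
    """Return the cleanup variables that are explicitly set on the agent."""
    containers = json.loads(inspect_json)  # docker inspect prints a JSON array
    env = containers[0]["Config"]["Env"] or []
    settings = {}
    for entry in env:  # each entry looks like "NAME=value"
        name, _, value = entry.partition("=")
        if name in CLEANUP_VARIABLES:
            settings[name] = value
    return settings


if __name__ == "__main__":
    # Made-up sample in the shape that docker inspect returns.
    sample = json.dumps([{"Config": {"Env": [
        "ECS_CLUSTER=example-cluster",
        "ECS_IMAGE_CLEANUP_INTERVAL=30m",
    ]}}])
    print(agent_cleanup_settings(sample))
```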

To create a new AMI with updated values

1.    Set your agent configuration parameters in the /etc/ecs/ecs.config file.

2.    Restart your container agent.

3.    Create a compute resource AMI from your instance.

4.    Create a compute environment for AWS Batch that includes your AMI ID.
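When you edit /etc/ecs/ecs.config in step 1, it helps to replace an existing entry instead of appending a second line for the same variable. The sketch below shows one way to do that on the file's text. The function name and the sample values are hypothetical, and the sketch doesn't validate whether the agent accepts a given value.

```python
CLEANUP_VARIABLES = (
    "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION",
    "ECS_IMAGE_CLEANUP_INTERVAL",
    "ECS_IMAGE_MINIMUM_CLEANUP_AGE",
    "ECS_NUM_IMAGES_DELETE_PER_CYCLE",
)


def merge_ecs_config(existing_text, overrides):
    """Return ecs.config text with the cleanup overrides applied."""
    unknown = set(overrides) - set(CLEANUP_VARIABLES)
    if unknown:
        raise ValueError(f"not a cleanup variable: {sorted(unknown)}")
    remaining = dict(overrides)
    lines = []
    for line in existing_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if name in remaining:
            # Replace the existing entry in place.
            lines.append(f"{name}={remaining.pop(name)}")
        else:
            lines.append(line)  # keep unrelated settings untouched
    # Append the overrides that weren't in the file yet.
    lines.extend(f"{name}={value}" for name, value in remaining.items())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    current = "ECS_CLUSTER=example-cluster\nECS_IMAGE_CLEANUP_INTERVAL=30m\n"
    print(merge_ecs_config(current, {
        "ECS_IMAGE_CLEANUP_INTERVAL": "15m",
        "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION": "1h",
    }), end="")
```

Write the result back to /etc/ecs/ecs.config with root permissions, and then continue with step 2 so that the agent picks up the new values.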

To create a launch template with the user data that includes your new environment variables

1.    Create a launch template with user data.

For example, the user data in the following MIME multi-part file overrides the default Docker image cleanup settings for a compute resource:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config

--==MYBOUNDARY==--

2.    Use your new launch template to create a managed compute environment.
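If you generate user data from a script, Python's standard email package can build the MIME multi-part file so that the headers, blank lines, and boundaries are formatted correctly. This is a minimal sketch: the function names are hypothetical, and the order of the generated headers can differ from the example above. The Base64 step is included because the launch template API expects user data to be Base64 encoded.

```python
import base64
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_user_data(settings, boundary="==MYBOUNDARY=="):
    """Build MIME multi-part user data that appends settings to ecs.config."""
    lines = ["#!/bin/bash"]
    for name, value in settings.items():
        lines.append(f"echo {name}={value} >> /etc/ecs/ecs.config")
    script = "\n".join(lines) + "\n"
    message = MIMEMultipart("mixed", boundary=boundary)
    # text/x-shellscript marks the part as a script to run at boot.
    message.attach(MIMEText(script, "x-shellscript", "us-ascii"))
    return message.as_string()


def encode_for_launch_template(user_data):
    """Base64-encode user data for the UserData launch template field."""
    return base64.b64encode(user_data.encode("ascii")).decode("ascii")


def script_part(user_data):
    """Parse the user data back and return the shell script it carries."""
    parsed = message_from_string(user_data)
    return parsed.get_payload()[0].get_payload()


if __name__ == "__main__":
    user_data = build_user_data({
        "ECS_IMAGE_CLEANUP_INTERVAL": "60m",
        "ECS_IMAGE_MINIMUM_CLEANUP_AGE": "60m",
    })
    print(user_data)
    print(encode_for_launch_template(user_data))
```

Parsing the result back with script_part is a quick way to confirm that the shell script survived as its own MIME part before you create the launch template.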


Related information

AWS services that publish CloudWatch metrics

Compute resource AMIs

amazon-ecs-agent (AWS GitHub)

AWS OFFICIAL
Updated 2 years ago