How do I resolve the "DockerTimeoutError" error in AWS Batch?
Last updated: 2020-12-18
The jobs in my AWS Batch compute environment are failing due to the following error: "DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s."
You receive this error when the docker start and docker create calls take longer than four minutes. The default timeout limit set by the Amazon Elastic Container Service (Amazon ECS) container agent is four minutes.
The error can be caused by the following issues:
- The Amazon Elastic Block Store (Amazon EBS) volumes of the ECS instances in your AWS Batch compute environment are under high I/O pressure from the other jobs in your queue. These jobs, which are created and run on the same ECS instance, can deplete the burst balance of the volumes. To resolve this issue, follow the steps in the Resolve burst balance issues section.
- Stopped ECS containers aren't being cleaned up fast enough to free up the Docker daemon. You can experience Docker issues if you're using a customized Amazon Machine Image (AMI) instead of the default AMI provided by AWS Batch. The default AMI for AWS Batch optimizes your Amazon ECS cleanup settings. To resolve this issue, follow the steps in the Resolve Docker issues section.
If neither of these issues is causing the error, try the following:
- Check your Docker logs for the source of the error.
- Run the Amazon ECS logs collector script on the ECS instances in the ECS cluster associated with your AWS Batch compute environment.
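As a starting point for checking the logs, you can search a log file on the instance for the error string. The following is a minimal sketch, not an official tool: count_docker_timeouts is a hypothetical helper name, and the log path in the usage note is an assumption based on the layout of the default Amazon ECS-optimized AMI.

```shell
#!/bin/bash
# Hypothetical helper: count the lines in a log file that mention
# DockerTimeoutError. Prints 0 when there are no matches or when the
# file can't be read.
count_docker_timeouts() {
  local log_file="$1" count
  count=$(grep -c 'DockerTimeoutError' "$log_file" 2>/dev/null || true)
  echo "${count:-0}"
}
```

For example, count_docker_timeouts /var/log/ecs/ecs-agent.log would report how many timeout lines the current agent log contains, assuming the agent writes its log to that path on your AMI.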
Resolve burst balance issues
To check the burst balance of your ECS instance:
1. Open the Amazon ECS console.
2. In the navigation pane, choose Clusters, and then choose the cluster that contains your job.
Note: The name of the cluster starts with the name of the compute environment, followed by _Batch_ and a random hash of numbers and letters.
3. Choose the ECS Instances tab.
4. From the EC2 Instance column, choose your instance.
Note: To find the container instance that ran the failed job, run the aws batch describe-jobs --jobs awsExampleJobID command. The container instance ID is the last part of the containerInstanceArn value in the output.
5. On the Description tab in the Amazon EC2 console, choose the link for your volume from Block devices.
6. On the block device pop-up window, for EBS ID, choose your volume.
7. Choose the Monitoring tab, and then choose Burst Balance to check your burst balance metrics.
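Note: A burst balance of 0% means that the volume's I/O credits are depleted and the volume is limited to its baseline performance. If your burst balance is depleted, create a new compute environment with a larger or faster volume by using a launch template or a custom AMI, as described in the procedures below.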
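The same checks can also be run from the AWS CLI. The following is a hedged sketch rather than an official procedure: the function names are hypothetical, the date invocation assumes GNU date, and the 5-minute period and Minimum statistic are illustrative choices.

```shell
#!/bin/bash
# Hypothetical helper: print the ARN of the container instance that ran a job.
job_container_instance_arn() {
  local job_id="$1"
  aws batch describe-jobs \
    --jobs "$job_id" \
    --query 'jobs[0].container.containerInstanceArn' \
    --output text
}

# Hypothetical helper: print the minimum BurstBalance (percent) per 5-minute
# period for one EBS volume over the last N hours (default 3).
burst_balance_min() {
  local volume_id="$1" hours="${2:-3}"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS \
    --metric-name BurstBalance \
    --dimensions "Name=VolumeId,Value=${volume_id}" \
    --start-time "$(date -u -d "${hours} hours ago" +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Minimum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Minimum]' \
    --output text
}
```

Values that fall toward 0 around the time of the failed job suggest that the volume ran out of I/O credits.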
To create a launch template for your managed compute environment:
Note: If you change the launch template, you must create a new compute environment.
1. Open the Amazon EC2 console, and then choose Launch Templates.
2. Choose Create launch template.
3. For AMI ID, select the default Amazon ECS-optimized AMI.
4. In the Storage (Volumes) section, choose a volume type in the Volume type column, and then enter an integer value in the Size(GiB) column.
Note: If you choose Provisioned IOPS SSD (io1) for your volume type, enter an integer value that's permitted for IOPS.
5. Choose Create launch template.
6. Use your new launch template to create a new managed compute environment.
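The launch template can also be created with the AWS CLI. This is a minimal sketch under stated assumptions: create_batch_launch_template is a hypothetical helper name, the device name /dev/xvda assumes the root device of an Amazon Linux 2 based ECS-optimized AMI, and no AMI ID is set in the template.

```shell
#!/bin/bash
# Hypothetical helper: create a launch template that only overrides the root
# volume size and type. /dev/xvda is an assumed root device name; check the
# block device mapping of your AMI before using it.
create_batch_launch_template() {
  local name="$1" size_gib="$2" volume_type="${3:-gp2}"
  aws ec2 create-launch-template \
    --launch-template-name "$name" \
    --launch-template-data "{\"BlockDeviceMappings\":[{\"DeviceName\":\"/dev/xvda\",\"Ebs\":{\"VolumeSize\":${size_gib},\"VolumeType\":\"${volume_type}\"}}]}"
}
```

For example, create_batch_launch_template awsExampleTemplate 200 gp2 requests a 200 GiB gp2 root volume. For io1, the Ebs object also needs an Iops value.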
To create an AWS Batch compute environment with your AMI:
Note: If you change the AMI, you must create a new compute environment, because the AMI ID parameter can't be updated.
1. Open the Amazon EC2 console.
2. Choose Launch instance.
3. Follow the steps in the setup wizard to create your instance.
Important: On the Add Storage page, modify the volume type or size of your instance. The larger the volume size, the greater the baseline performance is and the faster it replenishes the burst balance. To get better performance for high I/O loads, change the volume type to io1.
4. Create a compute resource AMI from your instance.
5. Create a compute environment for AWS Batch that includes your AMI ID.
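To see how volume size affects baseline performance, the arithmetic for a General Purpose SSD (gp2) volume can be sketched as follows. The figures are assumptions taken from the Amazon EBS documentation for gp2 (3 IOPS per GiB, a 100 IOPS floor, a 16,000 IOPS ceiling, bursts of up to 3,000 IOPS, and a bucket of 5.4 million I/O credits), and both function names are hypothetical.

```shell
#!/bin/bash
# Assumed gp2 figures: 3 IOPS per GiB baseline, clamped to 100..16000 IOPS,
# bursts up to 3000 IOPS, and a full bucket of 5,400,000 I/O credits.

# Print the baseline IOPS for a gp2 volume of the given size in GiB.
gp2_baseline_iops() {
  local size_gib="$1"
  local iops=$(( size_gib * 3 ))
  if (( iops < 100 )); then iops=100; fi
  if (( iops > 16000 )); then iops=16000; fi
  echo "$iops"
}

# Print how many seconds a full credit bucket lasts at a sustained 3000 IOPS.
# Credits drain at (3000 - baseline) per second. Prints "n/a" when the
# baseline already meets or exceeds the burst rate.
gp2_full_burst_seconds() {
  local baseline
  baseline=$(gp2_baseline_iops "$1")
  if (( baseline >= 3000 )); then
    echo "n/a"
    return 0
  fi
  echo $(( 5400000 / (3000 - baseline) ))
}
```

Under these assumptions, a 100 GiB volume has a baseline of 300 IOPS and can sustain the full burst rate for 2000 seconds (about 33 minutes), while a 500 GiB volume has a baseline of 1500 IOPS and lasts 3600 seconds.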
Resolve Docker issues
By default, the Amazon ECS container agent automatically cleans up stopped tasks and Docker images that tasks on your container instances aren't using. If you run new jobs with new images, then your container storage might fill up with Docker images you aren't using.
1. Use SSH to connect to the container instance for your AWS Batch compute environment.
2. To inspect the Amazon ECS container agent, run the docker inspect ecs-agent command, and then view the Env section in the output.
Note: You can reduce the values of the following variables to speed up task and image cleanup: ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION, ECS_IMAGE_CLEANUP_INTERVAL, ECS_IMAGE_MINIMUM_CLEANUP_AGE, and ECS_NUM_IMAGES_DELETE_PER_CYCLE. Also, you can use tunable parameters for automated task and image cleanup.
3. Create a new AMI with updated values, or create a launch template with the user data that includes your new environment variables.
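For the inspection in step 2, the output can be narrowed to the cleanup variables only. This is a sketch, assuming the agent container is named ecs-agent as in the step above; show_cleanup_settings is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: list only the cleanup-related environment variables
# that are explicitly set on the ecs-agent container. Variables that are not
# set do not appear in the output.
show_cleanup_settings() {
  docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' ecs-agent \
    | grep -E '^ECS_(ENGINE_TASK_CLEANUP_WAIT_DURATION|IMAGE_CLEANUP_INTERVAL|IMAGE_MINIMUM_CLEANUP_AGE|NUM_IMAGES_DELETE_PER_CYCLE)=' \
    || true
}
```

An empty result suggests that none of the four variables were overridden on the instance.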
To create a new AMI:
1. Set your agent configuration parameters in the /etc/ecs/ecs.config file.
2. Restart your container agent.
3. Create a compute resource AMI from your instance.
4. Create a compute environment for AWS Batch that includes your AMI ID.
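Steps 1 and 2 above can be sketched in shell as follows. set_ecs_config is a hypothetical helper, the use of sed -i assumes GNU sed, and the values in the usage note are placeholders rather than recommendations.

```shell
#!/bin/bash
# Hypothetical helper: set KEY=VALUE in an ECS agent config file, replacing
# any existing line for the same key. The file defaults to
# /etc/ecs/ecs.config, which requires root privileges to modify.
set_ecs_config() {
  local key="$1" value="$2" file="${3:-/etc/ecs/ecs.config}"
  if [ -f "$file" ]; then
    # Remove a previous setting so the key appears only once.
    sed -i "/^${key}=/d" "$file"
  fi
  echo "${key}=${value}" >> "$file"
}
```

For example, in a root shell you might run set_ecs_config ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION 1h and set_ecs_config ECS_IMAGE_CLEANUP_INTERVAL 15m, and then restart the agent. On Amazon Linux 2 the restart command is sudo systemctl restart ecs. On the Amazon Linux AMI it is sudo stop ecs followed by sudo start ecs.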
To create a launch template:
1. Create a launch template with user data. For example, the user data in the following MIME multi-part file overrides the default Docker image cleanup settings for a compute resource:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config

--==MYBOUNDARY==--
2. Use your new launch template to create a managed compute environment.
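When the launch template is created through the AWS CLI, the user data must be base64 encoded. The following sketch shows one way to do that, assuming the MIME document from the example above; build_user_data and create_cleanup_launch_template are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helper: print the MIME multi-part user data from the example.
build_user_data() {
  cat <<'EOF'
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config

--==MYBOUNDARY==--
EOF
}

# Hypothetical helper: create a launch template whose only setting is the
# base64-encoded user data (newlines stripped from the encoded string).
create_cleanup_launch_template() {
  local name="$1" user_data
  user_data=$(build_user_data | base64 | tr -d '\n')
  aws ec2 create-launch-template \
    --launch-template-name "$name" \
    --launch-template-data "{\"UserData\":\"${user_data}\"}"
}
```

The resulting launch template name or ID can then be referenced when you create the managed compute environment.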