Why is my AWS Glue job failing with the error "Exit status: -100. Diagnostics: Container released on a *lost* node"?
Last updated: 2019-11-21
My AWS Glue job failed with the following error: "Exit status: -100. Diagnostics: Container released on a *lost* node."
If your data source is JDBC, see "My AWS Glue job fails with lost nodes while migrating a large dataset from Amazon RDS to Amazon S3." If your data source is Amazon Simple Storage Service (Amazon S3), use one or more of the following methods to resolve lost node errors.
Add more data processing units (DPUs)
By default, AWS Glue jobs have 10 DPUs and use the Standard worker type. One DPU is reserved for the application master. Before adding more DPUs, consider the following:
- Adding more DPUs helps only if the job workload is parallelized. If the data isn't partitioned appropriately, it won't be distributed to the new nodes, and you'll continue to receive lost node errors. For more information, see Determine the Optimal DPU Capacity.
- In some cases, large files can consume too many resources on a node, which reduces the effectiveness of parallelizing the job workload. To mitigate this problem, be sure that large files are in a splittable format. For more information, see Best practices to scale Apache Spark jobs and partition data with AWS Glue.
- The longer a job runs, the more logs the job writes to disk. If logs are filling the disk, adding more DPUs won't help. In some cases, lost node errors are caused by both excessive logging and disk spills.
- Check logs and the glue.ALL.jvm.heap.usage Amazon CloudWatch metric to identify memory-consuming executors. If some executors are consuming more memory than others, data skew might be causing the error.
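One way to act on the data skew check is to compare heap usage across executors and flag the outliers. The following is a minimal sketch, not an AWS-provided tool: the function name, the sample readings, and the 1.5x-median threshold are all assumptions chosen for illustration. It expects per-executor heap usage values that you have already collected, for example from CloudWatch.

```python
from statistics import median


def find_skewed_executors(heap_usage, ratio=1.5):
    """Return the IDs of executors whose heap usage exceeds `ratio` times the median.

    `heap_usage` maps an executor ID to the fraction of JVM heap in use.
    The 1.5x default is an arbitrary illustrative threshold, not an AWS guideline.
    """
    if not heap_usage:
        return []
    typical = median(heap_usage.values())
    return sorted(
        executor_id
        for executor_id, usage in heap_usage.items()
        if typical > 0 and usage > ratio * typical
    )


# Hypothetical readings: executor "3" uses far more heap than its peers.
sample = {"1": 0.31, "2": 0.29, "3": 0.88, "4": 0.33}
print(find_skewed_executors(sample))
```

If one or two executors stand out this way, repartitioning the data is likely to help more than adding DPUs.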
Use the Jobs API, the AWS CLI, or the AWS Glue console to add more DPUs:
- Jobs API: Set the NumberOfWorkers property when you run the CreateJob operation.
- AWS CLI: Set the --number-of-workers option when you run the create-job command.
- AWS Glue console: On the Configure the job properties page, under Security configuration, script libraries, and job parameters (optional), increase the value for Maximum capacity. This is the number of DPUs for the job.
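As a rough illustration of the Jobs API option, the following sketch builds a CreateJob request with a larger worker count and sends it with the AWS SDK for Python (boto3). The job name, role ARN, and script location are hypothetical placeholders. The sketch also assumes that you set WorkerType together with NumberOfWorkers, so adjust it to match your job definition.

```python
def build_create_job_request(name, role_arn, script_location,
                             number_of_workers, worker_type="G.1X"):
    """Build the keyword arguments for a Glue CreateJob call.

    Pairing WorkerType with NumberOfWorkers is an assumption of this sketch.
    """
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be a positive integer")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


def create_job(request):
    """Send the request to AWS Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above runs without AWS access

    return boto3.client("glue").create_job(**request)


# Hypothetical values for illustration only.
request = build_create_job_request(
    name="example-etl-job",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/job.py",
    number_of_workers=20,
)
print(request)
```

Calling create_job(request) creates the job. Building the request separately lets you review the capacity settings before anything is sent to AWS.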
Be sure that CloudWatch metrics are enabled for the job. These metrics can help you monitor job performance and determine if you need to add more DPUs. Use the Jobs API, the AWS CLI, or the AWS Glue console to enable metrics for a new or existing job:
- Jobs API: Use the --enable-metrics argument when defining DefaultArguments in the CreateJob or UpdateJob operations.
- AWS CLI: Include the --enable-metrics argument in the job's default arguments when you run the create-job or update-job command.
- AWS Glue console: Under Monitoring options, set Job metrics to Enabled.
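For the Jobs API option, the metrics flag is one more key in the job's DefaultArguments map. The following sketch shows one way to add it without losing arguments that the job already has. The helper name and the sample arguments are hypothetical, and the value "true" is used for readability on the assumption that the presence of the key is what enables metrics.

```python
def with_metrics_enabled(default_arguments=None):
    """Return a copy of the job's default arguments with metrics turned on.

    The input dictionary is not modified.
    """
    merged = dict(default_arguments or {})
    merged["--enable-metrics"] = "true"
    return merged


# Hypothetical existing arguments for a job.
existing = {"--job-language": "python", "--TempDir": "s3://example-bucket/tmp/"}
print(with_metrics_enabled(existing))
```

You would pass the returned dictionary as DefaultArguments in a CreateJob or UpdateJob request.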
Change the DPU worker type
Depending on the size of your dataset, the Standard worker type might not have enough resources to prevent the Spark application from running out of memory and spilling to disk. To resolve this issue, choose a worker type with more available resources:
- Standard (default): Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory), and has 50 GB of disk space.
- G.1X: Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory), and has 64 GB of disk space.
- G.2X: Each worker maps to 2 DPUs (8 vCPUs, 32 GB of memory), and has 128 GB of disk space.
Use the Jobs API, the AWS CLI, or the AWS Glue console to change the worker type:
- Jobs API: Set the WorkerType property when you run the CreateJob operation.
- AWS CLI: Set the --worker-type option when you run the create-job command.
- AWS Glue console: On the Configure the job properties page, under Security configuration, script libraries, and job parameters (optional), choose a different option for Worker type.
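To make the comparison between worker types concrete, the following sketch encodes the three worker types listed above and picks the smallest one that meets a per-worker memory and disk requirement. The function name and the requirement values are hypothetical. Estimating how much memory and disk your job needs per worker is up to you, for example from CloudWatch metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSpec:
    dpus: int
    vcpus: int
    memory_gb: int
    disk_gb: int


# Values taken from the worker type list in this article.
WORKER_SPECS = {
    "Standard": WorkerSpec(dpus=1, vcpus=4, memory_gb=16, disk_gb=50),
    "G.1X": WorkerSpec(dpus=1, vcpus=4, memory_gb=16, disk_gb=64),
    "G.2X": WorkerSpec(dpus=2, vcpus=8, memory_gb=32, disk_gb=128),
}


def smallest_worker_type(min_memory_gb, min_disk_gb):
    """Return the worker type with the fewest DPUs that meets both minimums."""
    candidates = [
        (spec.dpus, spec.disk_gb, name)
        for name, spec in WORKER_SPECS.items()
        if spec.memory_gb >= min_memory_gb and spec.disk_gb >= min_disk_gb
    ]
    if not candidates:
        raise ValueError("no listed worker type meets the requirements")
    return min(candidates)[2]


# Hypothetical requirement: 16 GB of memory and 60 GB of disk per worker.
print(smallest_worker_type(min_memory_gb=16, min_disk_gb=60))
```

The returned name is the value that you would set for WorkerType. If no listed type is large enough, the function raises an error, which suggests that you should repartition the data instead of scaling up.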