How do I troubleshoot primary node failure with error “502 Bad Gateway” or “504 Gateway Time-out” in Amazon EMR?

Last updated: 2023-01-06

My Amazon EMR primary node is failing with a "502 Bad Gateway" or "504 Gateway Time-out" error.

Short description

An EMR primary node might fail with one of the following errors:

The master failed: Error occurred: <html> <head><title>502 Bad Gateway</title></head> <body> <center><h1>502 Bad Gateway</h1></center> <hr><center>nginx/1.20.0</center> </body> </html>

-or-

The master failed: Error occurred: <html> <head><title>504 Gateway Time-out</title></head> <body> <center><h1>504 Gateway Time-out</h1></center> <hr><center>nginx/1.16.1</center> </body> </html>

The following are common reasons for these errors:

  • The instance-controller daemon is in the stopped state or is down on the primary node instance.
  • The primary node runs out of memory or disk space.
  • The Amazon Elastic Compute Cloud (Amazon EC2) instance status checks fail.

Resolution

Troubleshoot primary node instance-controller daemon failures

The primary node's instance controller (I/C) is the daemon that communicates with the EMR control plane and the rest of the cluster. If the instance controller can't communicate with the EMR control plane, then the primary node is classified as unhealthy and the cluster is terminated.

To resolve this, analyze the instance-controller logs to determine why the process failed. The instance-controller logs are located at /emr/instance-controller/log/.
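As a starting point for that analysis, the following is a minimal sketch of a helper that pulls the most recent error-like lines from the log directory. The function name ic_log_errors, the search pattern (error, exception, fatal), and the 20-line default are assumptions for illustration and aren't part of Amazon EMR. Adjust the pattern to match what you see in your own logs.

```shell
# Hypothetical helper: print the last N error-like lines found in the
# instance-controller log directory. The name, pattern, and defaults are
# illustrative assumptions.
ic_log_errors() {
  log_dir="${1:-/emr/instance-controller/log}"   # path from this article
  max_lines="${2:-20}"                           # how many matches to show

  if [ ! -d "$log_dir" ]; then
    echo "log directory not found: $log_dir" >&2
    return 1
  fi

  # -r: recurse, -I: skip binary (for example, compressed) files,
  # -h: omit file names, -i: ignore case, -E: extended regular expression
  grep -rIhiE 'error|exception|fatal' "$log_dir" | tail -n "$max_lines"
}
```

On the primary node, you might run ic_log_errors with no arguments to scan the default directory, or pass a different directory and line count, such as ic_log_errors /emr/instance-controller/log 50. Reading the log files might require sudo.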

If termination protection is turned on, use SSH to connect to the primary node, and then restart the instance-controller process.

In Amazon EMR 5.30.0 and later release versions:

1.    Use the following command to check the status of the I/C:

sudo systemctl status instance-controller.service

2.    Use the following command to restart the I/C if the status is down:

sudo systemctl start instance-controller.service
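The two steps above can be combined into one check-then-start sequence. The following is a minimal sketch for Amazon EMR 5.30.0 and later. The function name ensure_instance_controller is an assumption for illustration. It uses systemctl is-active, which returns a non-zero exit code when the unit isn't running.

```shell
# Hypothetical wrapper: start the instance controller only if it isn't active.
ensure_instance_controller() {
  svc="instance-controller.service"

  if systemctl is-active --quiet "$svc"; then
    echo "$svc is already active"
  else
    echo "$svc is not active; starting it"
    sudo systemctl start "$svc"
    # Print the resulting state; the exit code is non-zero if the
    # service still isn't active after the start attempt.
    systemctl is-active "$svc"
  fi
}
```

If the function reports that the service still isn't active after the start attempt, review the instance-controller logs before you try again.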

In Amazon EMR 4.x-2.x release versions:

1.    Use the following command to check the status of the I/C:

sudo /etc/init.d/instance-controller status

2.    Use the following command to restart the I/C if the status is down:

sudo /etc/init.d/instance-controller start

Analyze log files to troubleshoot memory and disk issues

  1. If termination protection is turned on, use SSH to connect to the primary node. Then, review the instance-state log file.
  2. Analyze the instance metrics, such as memory and disk usage, that are listed in the instance-state log. You can also check these metrics directly with Linux commands such as free -m and df -h.
  3. Use the log file results to determine why the primary node is using a high amount of disk or memory.
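To make the disk check repeatable, the following is a minimal sketch that filters df output for file systems at or above a usage threshold. The function name full_filesystems and the 90 percent default are assumptions for illustration. The sketch reads df -P (POSIX output format) rather than df -h so that the columns are stable, and it assumes that mount points don't contain spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every file system at or above the given threshold (percent).
full_filesystems() {
  threshold="${1:-90}"   # assumed default; choose a value that suits your cluster

  awk -v t="$threshold" '
    NR > 1 {                  # skip the header line
      use = $5                # Capacity column, for example "95%"
      sub(/%/, "", use)       # strip the percent sign
      if (use + 0 >= t) print $6 " " $5
    }'
}
```

On the primary node, you might run df -P | full_filesystems 90 to list the file systems that are nearly full, and then run free -m to compare available memory against the values in the instance-state log.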

Troubleshoot primary node EC2 instance status check failures

Troubleshoot primary nodes when termination protection is turned off and the cluster is already terminated

