How do I troubleshoot the failure of my Amazon EMR Spark job using Amazon Athena?

Last updated: 2021-03-11

My Spark job on Amazon EMR has failed. I want to troubleshoot the failure by querying the Spark logs using Amazon Athena.

Resolution

Applications that run on Amazon EMR produce log files. You can create a basic table over the EMR log files and then use Athena to query those logs. By querying the EMR logs, you can identify events and trends for your applications and clusters.

Run a command similar to the following to create a basic table myemrlogs based on the EMR log files saved to your Amazon S3 log location:

CREATE EXTERNAL TABLE `myemrlogs`(
  `data` string COMMENT 'from deserializer')
ROW FORMAT DELIMITED  
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-logs-111122223333-us-west-2/elasticmapreduce/j-1ABCDEEXAMPLE/containers/application_1111222233334_5555/';

Replace the following in the above query:

  • myemrlogs with the name of the table
  • 111122223333 with the AWS account number
  • j-1ABCDEEXAMPLE with the cluster ID
  • us-west-2 with your preferred Region
  • application_1111222233334_5555 with the application ID

Note: The S3 bucket mentioned in the example is the default bucket used by Amazon EMR. To verify your log bucket path, open the Amazon EMR console, choose your cluster, and then check the Log URI field in the Summary tab.
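You can also retrieve the log location from the command line with the AWS CLI. This assumes that the AWS CLI is configured with credentials for the account that owns the cluster; replace j-1ABCDEEXAMPLE with your cluster ID:

```
aws emr describe-cluster \
    --cluster-id j-1ABCDEEXAMPLE \
    --query 'Cluster.LogUri' \
    --output text
```

The command prints the S3 URI that the cluster writes its logs to, which you can use as the base of the LOCATION path in the CREATE TABLE statement.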

Then, run a command similar to the following to check for occurrences of FAIL, ERROR, WARN, EXCEPTION, FATAL, or CAUSE in myemrlogs:

SELECT *,"$PATH" FROM myemrlogs WHERE regexp_like(data, 'FAIL|ERROR|WARN|EXCEPTION|FATAL|CAUSE') LIMIT 100;

Note: Replace myemrlogs with the name of the table that you’ve created from the EMR log files.
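To narrow down where to look first, you can aggregate on the "$PATH" pseudocolumn to see which log files contain the most matches. This is a sketch that assumes the same myemrlogs table as above:

```
SELECT "$PATH",
       count(*) AS matches
FROM myemrlogs
WHERE regexp_like(data, 'FAIL|ERROR|WARN|EXCEPTION|FATAL|CAUSE')
GROUP BY "$PATH"
ORDER BY matches DESC
LIMIT 20;
```

The log files at the top of the result are usually the best starting point for deeper troubleshooting.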

You can query the EMR logs in different ways to find out where the Spark application failed. The following queries help you determine whether the application failed at the job, stage, task, or executor level.

Run a command similar to the following to get the exit code of the application:

SELECT *,"$PATH" FROM myemrlogs WHERE regexp_like(data, 'exitCode');

Run a command similar to the following to check which host the Spark executor is running on:

SELECT *,"$PATH" FROM myemrlogs WHERE regexp_like(data, 'executor ID');

Run a command similar to the following to track the mapping of tasks to stages:

SELECT *,"$PATH" FROM myemrlogs WHERE regexp_like(data, 'TID');

Run a command similar to the following to check the heap memory details of containers:

SELECT *,"$PATH" FROM myemrlogs WHERE regexp_like(data, 'space');

Run a command similar to the following to track the progress of each job/stage on the Directed Acyclic Graph (DAG) scheduler:

SELECT *,"$PATH" FROM myemrlogs WHERE regexp_like(data, 'DAGScheduler');

You can also create a partitioned table based on Amazon EMR logs and then use Athena to query these logs. For more information, see Creating and querying a partitioned table based on Amazon EMR logs.
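As a sketch of that approach, you can partition the table by log subfolder and map each partition to a prefix under the cluster's log path. The partition column name logtype and the partition value shown here are assumptions for illustration:

```
CREATE EXTERNAL TABLE `mypartitionedemrlogs`(
  `data` string COMMENT 'from deserializer')
PARTITIONED BY (logtype string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-logs-111122223333-us-west-2/elasticmapreduce/j-1ABCDEEXAMPLE/';

-- Map a partition to a specific subfolder of the cluster's logs:
ALTER TABLE mypartitionedemrlogs ADD
  PARTITION (logtype='containers')
  LOCATION 's3://aws-logs-111122223333-us-west-2/elasticmapreduce/j-1ABCDEEXAMPLE/containers/';
```

Queries that filter on the partition column (for example, WHERE logtype = 'containers') then scan only the matching S3 prefix, which reduces query cost and runtime.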

