When I run Apache Spark Streaming jobs, the logs take up the remaining disk space on the core and task nodes of my Amazon EMR cluster.

Spark allocates one YARN container for each executor in the Streaming job. Logpusher doesn't upload logs to Amazon Simple Storage Service (Amazon S3) until the container finishes or terminates. Because Streaming jobs run for long periods, their containers stay active and the logs accumulate locally: the longer a container runs, the more disk space its logs consume on the cluster's nodes. Eventually the logs can consume the remaining disk space on the nodes.

To resolve this problem, configure log rotation for Spark jobs by modifying the Log4j properties file, which is located in the /etc/spark/conf directory.

For Amazon EMR release versions earlier than 5.18.0, follow these steps to manually configure log rotation. (Amazon EMR release versions 5.18.0 and later automatically rotate Spark Streaming container logs hourly.)

1.    Connect to the master node using SSH.

2.    On each node in your Amazon EMR cluster (master, core, and task nodes), overwrite the contents of /etc/spark/conf/log4j.properties with the following configuration. This configuration uses the RollingFileAppender class to rotate container log files when they exceed 100,000 bytes. Each rotated file is named with the timestamp to prevent duplicate files from being uploaded to S3.

log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

3.    To prevent permission errors, run spark-submit with sudo.

The next time that you run a Spark Streaming job, the container logs are rotated when they exceed 100,000 bytes, and the rotated files are uploaded to S3. This prevents the container logs from consuming the remaining disk space on your EMR cluster's core and task nodes.
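
For reference, a fuller configuration that matches the description in step 2 might look like the following sketch. Only the ConversionPattern line comes from this article. Every other property is an assumption, based on the RollingFileAppender, TimeBasedRollingPolicy, and SizeBasedTriggeringPolicy classes in the Apache Log4j 1.2 extras package (org.apache.log4j.rolling) and on Spark's spark.yarn.app.container.log.dir property. Verify each setting against the Log4j version on your cluster before you use it.

```properties
# Sketch only: all lines except ConversionPattern are assumptions, not taken from this article.

# Send all INFO-level (and higher) messages to the appender named "file".
log4j.rootLogger=INFO,file

# Assumed appender class from the Log4j 1.2 extras package.
log4j.appender.file=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.file.encoding=UTF-8
log4j.appender.file.append=true

# Assumed rolling policy: name each rotated file with a timestamp
# and write it to the YARN container's log directory.
log4j.appender.file.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.file.rollingPolicy.FileNamePattern=${spark.yarn.app.container.log.dir}/spark-%d{yyyy-MM-dd-HH-mm-ss}.log

# Assumed triggering policy: rotate when the active file exceeds 100,000 bytes.
log4j.appender.file.triggeringPolicy=org.apache.log4j.rolling.SizeBasedTriggeringPolicy
log4j.appender.file.triggeringPolicy.MaxFileSize=100000

# Layout: the ConversionPattern line is the one shown in step 2.
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
```

In this sketch, the size-based triggering policy decides when a file is rotated, and the time-based rolling policy only supplies the timestamped file name, which is what keeps rotated files from colliding when they are uploaded to S3.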

Published: 2018-12-04