How do I reduce the amount of logs generated by my AWS Glue job?

Last updated: 2021-07-29

My AWS Glue job is generating too many logs in Amazon CloudWatch. I want to reduce the number of logs generated.

Resolution

With AWS Glue Spark ETL jobs, you can't control the verbosity of the logs generated by the instances that run the jobs. The logs are intentionally verbose so that they can be used to monitor internal failures and to help diagnose job failures. However, you can define the Spark logging levels in the following ways:

Choose the standard filter setting for continuous logging

If you've turned on continuous logging for your job, then choose the Standard filter for the Log filtering option. This filter prunes non-useful Apache Spark driver and executor messages and Apache Hadoop YARN heartbeat log messages. To change the log filter setting for your AWS Glue job, do the following:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Jobs.
  3. Select the job that you want to update.
  4. Choose Actions, and then choose Edit job.
  5. Expand the Monitoring options section.
  6. Select Continuous logging.
  7. Under Log filtering, select Standard filter.
  8. Choose Save.

To turn on these settings from the AWS Command Line Interface (AWS CLI), pass the following key-value pairs as job arguments:

'--enable-continuous-cloudwatch-log': 'true'
'--enable-continuous-log-filter': 'true'

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.

For more information, see Enabling continuous logging for AWS Glue jobs.
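As a sketch, you can merge these filter arguments into a job's existing default arguments before passing them to a create-job or update-job call. The helper below is hypothetical (not part of AWS Glue or the AWS SDK); it only prepares the argument dictionary:

```python
# Job arguments that enable continuous logging with the standard filter,
# as described above.
CONTINUOUS_LOGGING_ARGS = {
    "--enable-continuous-cloudwatch-log": "true",
    "--enable-continuous-log-filter": "true",
}

def with_standard_log_filter(default_arguments):
    """Return a copy of a job's DefaultArguments with the standard
    continuous-logging filter turned on (hypothetical helper)."""
    merged = dict(default_arguments)
    merged.update(CONTINUOUS_LOGGING_ARGS)
    return merged

# Example: arguments you might already pass to a Glue job
args = with_standard_log_filter({"--job-language": "python"})
```

You could then supply the merged dictionary as the job's DefaultArguments when creating or updating the job, for example through the AWS CLI or an AWS SDK.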

Important: Even with the standard filter setting, the application master logs for Spark jobs are still pushed to the /aws-glue/jobs/output and /aws-glue/jobs/error log groups.

Set the logging level using Spark context method setLogLevel

You can set the logging level for your job using the setLogLevel method of pyspark.context.SparkContext. Valid logging levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN. For more information, see Spark documentation for setLogLevel.

Use the following code to import the Spark context method and set the logging level for your job:

from pyspark.context import SparkContext
sc = SparkContext()
sc.setLogLevel("new-log-level")

Note: Be sure to replace new-log-level with the logging level that you want to set for your job.

For more information, see Spark documentation for Configuring logging.

Use a custom log4j.properties file to define the logging level

Spark uses log4j configurations for logging. You can include the logging preferences in a log4j.properties file, upload the file to Amazon Simple Storage Service (Amazon S3), and use the file in the AWS Glue job.
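For example, a minimal log4j.properties file that raises the default logging level to WARN might look like the following. This is a sketch modeled on Spark's log4j.properties.template; adjust the appenders and logger names to your needs:

```properties
# Log everything at WARN by default; send logs to the console appender
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Further quiet especially chatty loggers
log4j.logger.org.apache.spark=WARN
log4j.logger.org.spark_project.jetty=ERROR
```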

To reference the Amazon S3 file in the job, do the following:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Jobs.
  3. Select the job where you want to reference the file.
  4. Choose Actions, and then choose Edit job.
  5. Expand the Security configuration, script libraries, and job parameters (optional) section.
  6. For Referenced files path, paste the full Amazon S3 path where you stored the log4j.properties file.

For more information, see Providing your own custom scripts.

For a sample log4j.properties file, see log4j.properties.template in Apache Spark's GitHub repository.
