How do I troubleshoot "Error Code: 503 Slow Down" on s3-dist-cp jobs in Amazon EMR?

Last updated: 2020-02-10

My S3DistCp (s3-dist-cp) job on Amazon EMR fails due to Amazon Simple Storage Service (Amazon S3) throttling. I get an error message like this:

mapreduce.Job: Task Id : attempt_xxxxxx_0012_r_000203_0, Status : FAILED
Error: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: D27E827C847A8304; S3 Extended Request ID: XWxtDsEZ40GLEoRnSIV6+HYNP2nZiG4MQddtNDR6GMRzlBmOZQ/LXlO5zojLQiy3r9aimZEvXzo=), S3 Extended Request ID: XWxtDsEZ40GLEoRnSIV6+HYNP2nZiG4MQddtNDR6GMRzlBmOZQ/LXlO5zojLQiy3r9aimZEvXzo=
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)

Short Description

"Slow Down" errors occur when you exceed the Amazon S3 request rate (3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket). This often happens when your data uses Apache Hive-style partitions. For example, the following Amazon S3 paths use the same prefix (/year=2019/). This means that requests for all of these objects count against the same limit of 3,500 write requests or 5,500 read requests per second.

  • s3://awsexamplebucket/year=2019/month=11/day=01/mydata.parquet
  • s3://awsexamplebucket/year=2019/month=11/day=02/mydata.parquet
  • s3://awsexamplebucket/year=2019/month=11/day=03/mydata.parquet

If increasing the number of partitions isn't an option, reduce the number of reducer tasks or increase the EMR File System (EMRFS) retry limit to resolve Amazon S3 throttling errors.
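To check how many objects in a listing fall under the same prefix, you can group the paths by their leading path component. The following is a minimal sketch, not an AWS tool: count_by_prefix is a hypothetical helper, and it assumes that the prefix is the first path component after the bucket name, as in the example paths above.

```shell
# Print "<first-level prefix> <object count>" for S3 paths read from stdin.
# Assumption: the prefix is the first path component after the bucket name.
count_by_prefix() {
  sed -E 's#^s3://[^/]+/([^/]+)/.*$#\1#' | sort | uniq -c | awk '{ print $2, $1 }'
}

printf '%s\n' \
  's3://awsexamplebucket/year=2019/month=11/day=01/mydata.parquet' \
  's3://awsexamplebucket/year=2019/month=11/day=02/mydata.parquet' \
  's3://awsexamplebucket/year=2019/month=11/day=03/mydata.parquet' |
  count_by_prefix
# prints: year=2019 3
```

All three example objects land under year=2019, so requests for them share one request-rate limit.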

Resolution

Use one of the following options to resolve throttling errors on s3-dist-cp jobs.

Reduce the number of reduces

The mapreduce.job.reduces parameter sets the number of reduces for the job. Amazon EMR automatically sets mapreduce.job.reduces based on the number of nodes in the cluster and the cluster's memory resources. Run the following command to confirm the default number of reduces for jobs in your cluster:

$ hdfs getconf -confKey mapreduce.job.reduces

To set a new value for mapreduce.job.reduces, run a command similar to the following. This command sets the number of reduces to 10.

$ s3-dist-cp -Dmapreduce.job.reduces=10 --src s3://awsexamplebucket/data/ --dest s3://awsexamplebucket2/output/

Increase the EMRFS retry limit

By default, the EMRFS retry limit is set to 4. Run the following command to confirm the retry limit for your cluster:

$ hdfs getconf -confKey fs.s3.maxRetries

To increase the retry limit for a single s3-dist-cp job, run a command similar to the following. This command sets the retry limit to 20.

$ s3-dist-cp -Dfs.s3.maxRetries=20 --src s3://awsexamplebucket/data/ --dest s3://awsexamplebucket2/output/
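Both settings are passed as -D properties, so they can be combined in one s3-dist-cp command. The following sketch reuses the values from the examples above. The run_or_print helper and the RUN variable are hypothetical: they only print the command so that you can review it before you run it on the cluster.

```shell
# Dry-run helper: prints the command unless RUN=1 is set in the environment.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Fewer reduce tasks plus a higher EMRFS retry limit, in a single job.
run_or_print s3-dist-cp \
  -Dmapreduce.job.reduces=10 \
  -Dfs.s3.maxRetries=20 \
  --src s3://awsexamplebucket/data/ \
  --dest s3://awsexamplebucket2/output/
```

To run the job instead of printing it, set RUN=1 before you call the script on the master node.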

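To increase the retry limit on a new or running cluster, use a configuration object similar to the following. Supply it when you launch a new cluster, or use it to override the configuration of an instance group on a running cluster (Amazon EMR release versions 5.21.0 and later):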

[
    {
      "Classification": "emrfs-site",
      "Properties": {
        "fs.s3.maxRetries": "20"
      }
    }
]
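For a new cluster, one way to supply the configuration object is the --configurations option of the AWS CLI create-cluster command. The following is a sketch only: the file location, release label, instance type, and instance count are placeholder values, not recommendations. The script prints the launch command for review instead of running it.

```shell
# Write the emrfs-site configuration object to a file (the path is an arbitrary choice).
CONFIG_FILE="${TMPDIR:-/tmp}/emrfs-retries.json"
cat > "$CONFIG_FILE" <<'EOF'
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxRetries": "20"
    }
  }
]
EOF

# Print the launch command instead of running it.
# Release label, instance type, and instance count are placeholder values.
echo aws emr create-cluster \
  --release-label emr-5.29.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations "file://$CONFIG_FILE"
```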

When you increase the retry limit for the cluster, Spark and Hive applications can also use the new limit. You can also set the limit for a single application. Here's an example of a Spark shell session that sets the higher retry limit explicitly:

scala> sc.hadoopConfiguration.set("fs.s3.maxRetries", "20")
scala> val source_df = spark.read.csv("s3://awsexamplebucket/data/")
scala> source_df.write.save("s3://awsexamplebucket2/output/")
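As an alternative to calling sc.hadoopConfiguration.set inside the session, you can pass the property when you start the Spark shell. This sketch assumes the standard Spark behavior of copying spark.hadoop.* properties into the Hadoop configuration. It prints the launch command instead of running it.

```shell
# Print (rather than run) a spark-shell launch command that sets the retry limit up front.
# Assumption: Spark copies spark.hadoop.* properties into the Hadoop configuration,
# so this has the same effect as sc.hadoopConfiguration.set in the session.
RETRIES=20
SPARK_CMD="spark-shell --conf spark.hadoop.fs.s3.maxRetries=${RETRIES}"
echo "$SPARK_CMD"
```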