My Spark or Hive job on Amazon EMR fails with an HTTP 503 "Slow Down" AmazonS3Exception

Last updated: 2019-12-10

My Apache Spark or Apache Hive job on Amazon EMR fails with an HTTP 503 "Slow Down" AmazonS3Exception similar to the following:

java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: 2E8B8866BFF00645; S3 Extended Request ID: oGSeRdT4xSKtyZAcUe53LgUf1+I18dNXpL2+qZhFWhuciNOYpxX81bpFiTw2gum43GcOHR+UlJE=), S3 Extended Request ID: oGSeRdT4xSKtyZAcUe53LgUf1+I18dNXpL2+qZhFWhuciNOYpxX81bpFiTw2gum43GcOHR+UlJE=

Short Description

This error occurs when you exceed the Amazon Simple Storage Service (Amazon S3) request rate (3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket).

There are three ways to resolve this problem:

  • Reduce the number of Amazon S3 requests.
  • Add more prefixes to the S3 bucket.
  • Increase the EMR File System (EMRFS) retry limit.

Resolution

Before you can identify which requests are exceeding the rate, first configure Amazon CloudWatch request metrics.

Configure CloudWatch request metrics

To monitor Amazon S3 requests, enable CloudWatch request metrics for the bucket. Then, define a filter for the prefix. For a list of useful metrics to monitor, see Amazon S3 CloudWatch Request Metrics.

After you enable metrics, use the data in the metrics to determine which of the following resolutions is best for your use case.

Reduce the number of Amazon S3 requests

  • If multiple concurrent jobs (Spark, Apache Hive, or s3-dist-cp) are reading from or writing to the same Amazon S3 prefix: Reduce the number of concurrent jobs. If you configure cross-account access for Amazon S3, keep in mind that other accounts might also be submitting jobs to the prefix.
  • If the error happens when the job tries to write to the destination bucket: Reduce the parallelism of the job. For example, use the Spark .coalesce() or .repartition() operations to reduce the number of Spark output partitions before writing to Amazon S3 (see the sketch after this list). You can also reduce the number of cores per executor or reduce the number of executors.
  • If the error happens when the job tries to read from the source bucket: Reduce the number of files to reduce the number of Amazon S3 requests. For example, use s3-dist-cp to merge a large number of small files into a smaller number of large files.
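For example, the following Spark shell sketch coalesces a DataFrame before writing so that fewer tasks write to the destination prefix at the same time. The paths and the partition count of 10 are illustrative values, not values from this article; choose a count based on your data size and your CloudWatch request metrics.

spark> val source_df = spark.read.csv("s3://awsexamplebucket/data/")
spark> source_df.coalesce(10).write.save("s3://awsexamplebucket2/output/")

Keep in mind that coalescing to too few partitions can slow down the write itself, so tune the value incrementally.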

Add more prefixes to the S3 bucket

Another way to resolve "Slow Down" errors is to add more prefixes to the S3 bucket. There are no limits to the number of prefixes in a bucket. The request rate applies to each prefix, not the bucket.

For example, if you create three prefixes in a bucket like this:

  • s3://awsexamplebucket/images
  • s3://awsexamplebucket/videos
  • s3://awsexamplebucket/documents

then you can make 10,500 write requests (3 × 3,500) or 16,500 read requests (3 × 5,500) per second to that bucket.
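One way for a Spark job to spread its writes across prefixes is to partition the output by a column. This is a sketch, not a step from this article: it assumes your data has a column suitable for partitioning (here called media_type). Each distinct value becomes its own prefix, such as s3://awsexamplebucket/output/media_type=images/, and the request-rate limit applies to each of those prefixes separately.

spark> val df = spark.read.parquet("s3://awsexamplebucket/data/")
spark> df.write.partitionBy("media_type").save("s3://awsexamplebucket/output/")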

Increase the EMRFS retry limit

By default, the EMRFS retry limit is set to 4. Run the following command to confirm the current retry limit for your cluster:

$ hdfs getconf -confKey fs.s3.maxRetries

To increase the retry limit, use the emrfs-site configuration classification when you create a cluster or reconfigure a running cluster. For example, the following configuration raises the limit to 20:

[
    {
      "Classification": "emrfs-site",
      "Properties": {
        "fs.s3.maxRetries": "20"
      }
    }
]

When you increase the retry limit at the cluster level, Spark and Hive applications automatically use the new limit. You can also set the limit for a single session at runtime, as in the following Spark shell example:

spark> sc.hadoopConfiguration.set("fs.s3.maxRetries", "20")
spark> val source_df = spark.read.csv("s3://awsexamplebucket/data/")
spark> source_df.write.save("s3://awsexamplebucket2/output/")
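If you submit a Spark application instead of using the shell, one option (a sketch, not from this article) is to set the same property through Spark's spark.hadoop.* configuration passthrough, which copies the setting into the underlying Hadoop configuration used by EMRFS. The application name and paths are placeholders.

import org.apache.spark.sql.SparkSession

// Minimal sketch: spark.hadoop.fs.s3.maxRetries is passed through to the
// Hadoop configuration, so EMRFS retries throttled requests up to 20 times.
val spark = SparkSession.builder()
  .appName("emrfs-retry-example")
  .config("spark.hadoop.fs.s3.maxRetries", "20")
  .getOrCreate()

val source_df = spark.read.csv("s3://awsexamplebucket/data/")
source_df.write.save("s3://awsexamplebucket2/output/")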