My Spark or Hive job on Amazon EMR fails with an HTTP 503 "Slow Down" AmazonS3Exception

Last updated: 2020-04-20

My Apache Spark or Apache Hive job on Amazon EMR fails with an HTTP 503 "Slow Down" AmazonS3Exception like this:

java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: 2E8B8866BFF00645; S3 Extended Request ID: oGSeRdT4xSKtyZAcUe53LgUf1+I18dNXpL2+qZhFWhuciNOYpxX81bpFiTw2gum43GcOHR+UlJE=), S3 Extended Request ID: oGSeRdT4xSKtyZAcUe53LgUf1+I18dNXpL2+qZhFWhuciNOYpxX81bpFiTw2gum43GcOHR+UlJE=

Short Description

This error occurs when you exceed the Amazon Simple Storage Service (Amazon S3) request rate. The request rate is 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket.

There are three ways to resolve this problem:

  • Add more prefixes to the S3 bucket.
  • Reduce the number of Amazon S3 requests.
  • Increase the EMR File System (EMRFS) retry limit.

Resolution

Before you can identify which requests are exceeding the rate limit, first configure Amazon CloudWatch request metrics.

Configure CloudWatch request metrics

To monitor Amazon S3 requests, enable CloudWatch request metrics for the bucket. Then, define a filter for the prefix. For a list of useful metrics to monitor, see Amazon S3 CloudWatch Request Metrics.
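
As a hedged example, you can enable request metrics for a prefix with the AWS CLI instead of the console. The bucket name, metrics configuration ID, and prefix below are illustrative assumptions, not values from this article:

# Sketch: enable CloudWatch request metrics scoped to one prefix.
# "images-prefix-metrics" and "images/" are hypothetical example values.
aws s3api put-bucket-metrics-configuration \
    --bucket awsexamplebucket \
    --id images-prefix-metrics \
    --metrics-configuration '{"Id": "images-prefix-metrics", "Filter": {"Prefix": "images/"}}'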

After you enable metrics, use the data in the metrics to determine which of the following resolutions is best for your use case.

Add more prefixes to the S3 bucket

There is no limit to the number of prefixes in a bucket. The request rate limit applies to each prefix, not to the bucket as a whole. For example, if you create three prefixes in a bucket like this:

  • s3://awsexamplebucket/images
  • s3://awsexamplebucket/videos
  • s3://awsexamplebucket/documents

Then, you can make 10,500 write requests or 16,500 read requests per second to that bucket (3 x 3,500 write requests and 3 x 5,500 read requests).
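
As a hedged sketch of how a job can spread load across prefixes, a Spark job can partition its output on a column so that each distinct value is written under its own prefix. The S3 paths and the media_type column are illustrative assumptions:

// Sketch: spread writes across prefixes by partitioning the output.
// The S3 paths and the "media_type" column are hypothetical examples.
val df = spark.read.parquet("s3://awsexamplebucket/input/")

// Each distinct media_type value gets its own prefix, for example
// s3://awsexamplebucket/output/media_type=images/
df.write
  .partitionBy("media_type")
  .parquet("s3://awsexamplebucket/output/")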

Reduce the number of Amazon S3 requests

  • If multiple concurrent jobs (Spark, Apache Hive, or s3-dist-cp) are reading from or writing to the same Amazon S3 prefix: Reduce the number of concurrent jobs. Start with the most read/write-heavy jobs. If you configured cross-account access for Amazon S3, keep in mind that other accounts might also be submitting jobs to the prefix.
  • If the error happens when the job tries to write to the destination bucket: Reduce the parallelism of the job. For example, use the Spark .coalesce() or .repartition() operations to reduce the number of Spark output partitions before writing to Amazon S3 (see the first sketch after this list). You can also reduce the number of cores per executor or reduce the number of executors.
  • If the error happens when the job tries to read from the source bucket: Reduce the number of files in the source to reduce the number of Amazon S3 requests. For example, use s3-dist-cp to merge a large number of small files into a smaller number of large files (see the second sketch after this list).
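
Here's a hedged sketch of the first approach. The S3 paths and the partition count of 10 are illustrative assumptions:

// Sketch: cap the number of output files (and therefore PUT requests)
// by coalescing to fewer partitions before the write.
// The S3 paths and the partition count of 10 are hypothetical examples.
val df = spark.read.parquet("s3://awsexamplebucket/input/")

df.coalesce(10)   // fewer partitions means fewer files written to S3
  .write
  .parquet("s3://awsexamplebucket2/output/")

And a sketch of the second approach with s3-dist-cp. The paths, the --groupBy pattern, and the 128 MiB target size are illustrative assumptions:

# Sketch: merge many small .csv files into files of roughly 128 MiB.
# The paths, regular expression, and target size are hypothetical examples.
s3-dist-cp \
  --src s3://awsexamplebucket/small-files/ \
  --dest s3://awsexamplebucket/merged-files/ \
  --groupBy '.*(\.csv)' \
  --targetSize 128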

Increase the EMRFS retry limit

By default, the EMRFS retry limit is set to 4. You can increase the retry limit on a new cluster, on a running cluster, or at application runtime.

To increase the retry limit on a new cluster without EMRFS consistent view, add a configuration object similar to the following when you launch the cluster:

[
    {
      "Classification": "emrfs-site",
      "Properties": {
        "fs.s3.maxRetries": "20"
      }
    }
]
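
For example, assuming the configuration above is saved locally as emrfs-config.json, you could pass it to the AWS CLI when you create the cluster. The release label, instance settings, and file name are illustrative assumptions:

# Sketch: launch a cluster with the emrfs-site configuration above.
# The release label, instance type/count, and file name are hypothetical examples.
aws emr create-cluster \
    --release-label emr-5.29.0 \
    --applications Name=Spark Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations file://emrfs-config.json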

To launch a new cluster with EMRFS consistency and a higher retry limit:

Instead of enabling EMRFS consistent view under Additional Options on the Step 3: General Cluster Settings page, add a configuration object similar to the following when you launch the cluster. This configuration specifies all of the required properties for EMRFS consistent view and increases the retry limit to 20.

[
    {
      "Classification": "emrfs-site",
      "Properties": {
        "fs.s3.maxRetries": "20",
        "fs.s3.consistent.retryPeriodSeconds": "10",
        "fs.s3.consistent": "true",
        "fs.s3.consistent.retryCount": "5",
        "fs.s3.consistent.metadata.tableName": "EmrFSMetadata"
      }
    }
]

After the cluster is launched, Spark and Hive applications use the new limit.

To increase the retry limit on a running cluster:

1.    Open the Amazon EMR console.

2.    In the cluster list, under Name, choose the active cluster that you want to reconfigure.

3.    On the cluster details page, choose the Configurations tab.

4.    In the Filter drop-down list, select the instance group that you want to reconfigure.

5.    In the Reconfigure drop-down menu, choose Edit in table.

6.    In the configuration classification table, choose Add configuration, and then enter the following:

  • For Classification, enter emrfs-site.
  • For Property, enter fs.s3.maxRetries.
  • For Value, enter the new retry limit (for example, 20).

7.    Select Apply this configuration to all active instance groups, and then choose Save changes.

After the configuration is deployed, Spark and Hive applications use the new limit.
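
You can also reconfigure a running instance group with the AWS CLI (cluster reconfiguration requires Amazon EMR 5.21.0 or later). As a hedged sketch, with hypothetical cluster and instance group IDs:

# Sketch: apply a higher retry limit to a running instance group.
# j-2AXXXXXXGAPLF and ig-XXXXXXXXXXXX are hypothetical IDs; find your own
# with "aws emr describe-cluster --cluster-id <your-cluster-id>".
aws emr modify-instance-groups \
    --cluster-id j-2AXXXXXXGAPLF \
    --instance-groups '[
      {
        "InstanceGroupId": "ig-XXXXXXXXXXXX",
        "Configurations": [
          {
            "Classification": "emrfs-site",
            "Properties": { "fs.s3.maxRetries": "20" }
          }
        ]
      }
    ]'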

To increase the retry limit at runtime:

Here's an example of a Spark shell session that sets the new limit before reading from and writing to Amazon S3:

spark> sc.hadoopConfiguration.set("fs.s3.maxRetries", "20")
spark> val source_df = spark.read.csv("s3://awsexamplebucket/data/")
spark> source_df.write.save("s3://awsexamplebucket2/output/")

Here's an example of how to increase the retry limit at runtime for a Hive application:

hive> set fs.s3.maxRetries=20;
hive> select ....
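
You can also set the property when you submit a batch job, without changing any code. As a hedged sketch: Spark forwards any property prefixed with spark.hadoop. to the Hadoop configuration, and my_job.py below is a hypothetical script name:

# Sketch: raise the retry limit at submission time.
# Spark passes spark.hadoop.* properties through to the Hadoop configuration.
# my_job.py is a hypothetical example script.
spark-submit \
    --conf spark.hadoop.fs.s3.maxRetries=20 \
    my_job.py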