Why does my Spark or Hive job on Amazon EMR fail with an HTTP 503 "Slow Down" AmazonS3Exception?

Last updated: 2022-05-17

My Apache Spark or Apache Hive job on Amazon EMR fails with an HTTP 503 "Slow Down" AmazonS3Exception similar to the following:

java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: 2E8B8866BFF00645; S3 Extended Request ID: oGSeRdT4xSKtyZAcUe53LgUf1+I18dNXpL2+qZhFWhuciNOYpxX81bpFiTw2gum43GcOHR+UlJE=), S3 Extended Request ID: oGSeRdT4xSKtyZAcUe53LgUf1+I18dNXpL2+qZhFWhuciNOYpxX81bpFiTw2gum43GcOHR+UlJE=

Short description

This error occurs when your application's request rate to Amazon Simple Storage Service (Amazon S3) exceeds the rates that Amazon S3 typically sustains (more than 5,000 requests per second) while Amazon S3 is still internally optimizing performance for the new request rate.

To improve the success rate of your requests when accessing the S3 data using Amazon EMR, try the following approaches:

  • Modify the retry strategy for S3 requests.
  • Adjust the number of concurrent S3 requests.

Resolution

To help identify the issue with too many requests, it's a best practice to configure Amazon CloudWatch request metrics for the S3 bucket. Based on these CloudWatch metrics, you can determine the solution that works best for your use case.

Configure CloudWatch request metrics

To monitor Amazon S3 requests, turn on CloudWatch request metrics for the bucket. Then, define a filter for the prefix. For a list of useful metrics to monitor, see Monitoring metrics with Amazon CloudWatch.
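As a rough illustration, request metrics with a prefix filter can also be turned on programmatically. The following is a minimal sketch using the boto3 put_bucket_metrics_configuration call. The bucket name, prefix, and configuration ID are hypothetical placeholders, and the helper function names are not part of any AWS API.

```python
def build_metrics_request(bucket, prefix, config_id):
    """Build keyword arguments for the S3 PutBucketMetricsConfiguration call.

    The filter limits the request metrics to objects under one prefix.
    """
    return {
        "Bucket": bucket,
        "Id": config_id,
        "MetricsConfiguration": {
            "Id": config_id,
            "Filter": {"Prefix": prefix},
        },
    }


def enable_request_metrics(bucket, prefix, config_id):
    """Turn on CloudWatch request metrics for one prefix of a bucket."""
    # boto3 is imported here so that the request builder above can be
    # inspected without the SDK or credentials being available.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_metrics_configuration(
        **build_metrics_request(bucket, prefix, config_id)
    )


# Example call (hypothetical names; requires AWS credentials):
# enable_request_metrics("awsexamplebucket", "data/", "emr-data-prefix")
```

After the metrics start to appear in CloudWatch, metrics such as 5xxErrors and AllRequests for the filtered prefix can help show when the job is being throttled.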

Modify the retry strategy for S3 requests

By default, EMRFS uses an exponential backoff strategy to retry requests to Amazon S3. The default EMRFS retry limit is 15. However, you can increase the retry limit on a new cluster, on a running cluster, or at application runtime.

To increase the retry limit, change the value of the fs.s3.maxRetries parameter. If you set a very high value for this parameter, then your jobs might take longer to complete. Try setting this parameter to a high value (for example, 20), monitor the duration overhead of the jobs, and then adjust this parameter based on your use case.
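To see why a higher retry limit can lengthen job duration, it can help to model exponential backoff directly. The following is a minimal sketch of a generic capped exponential backoff with jitter. The base delay (0.1 seconds) and the cap (20 seconds) are illustrative assumptions, not the values that EMRFS uses internally.

```python
import random


def backoff_delays(max_retries, base_seconds=0.1, cap_seconds=20.0):
    """Return the upper-bound delay before each retry, without jitter.

    The delay doubles on every attempt until it reaches cap_seconds.
    """
    return [
        min(cap_seconds, base_seconds * 2 ** attempt)
        for attempt in range(max_retries)
    ]


def jittered_delay(attempt, base_seconds=0.1, cap_seconds=20.0, rng=random):
    """Pick a random delay between 0 and the upper bound for this attempt.

    Randomizing the delay spreads retries from many tasks over time so
    that they don't all hit Amazon S3 again at the same moment.
    """
    upper = min(cap_seconds, base_seconds * 2 ** attempt)
    return rng.uniform(0, upper)


# Compare the worst-case total wait for a single request under the
# assumed base delay and cap.
for retries in (15, 20):
    worst_case = sum(backoff_delays(retries))
    print(f"maxRetries={retries}: worst-case wait of about {worst_case:.1f}s")
```

Under these assumed values, each additional retry beyond the point where the cap is reached adds the full capped delay to the worst case, which is why the duration overhead is worth monitoring after you raise the limit.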

For a new cluster, you can add a configuration object similar to the following when you launch the cluster:

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxRetries": "20"
    }
  }
]

After the cluster is launched, Spark and Hive applications running on Amazon EMR use the new limit.

To increase the retry limit on a running cluster, do the following:

1.    Open the Amazon EMR console.

2.    In the cluster list, choose the active cluster that you want to reconfigure under Name.

3.    Open the cluster details page for the cluster and choose the Configurations tab.

4.    In the Filter dropdown list, select the instance group that you want to reconfigure.

5.    In the Reconfigure dropdown list, choose Edit in table.

6.    In the configuration classification table, choose Add configuration, and then enter the following:

For Classification: emrfs-site

For Property: fs.s3.maxRetries

For Value: the new value for the retry limit (for example, 20)

7.    Select Apply this configuration to all active instance groups.

8.    Choose Save changes.

After the configuration is deployed, Spark and Hive applications use the new limit.

To increase the retry limit at runtime, use a Spark shell session similar to the following:

scala> sc.hadoopConfiguration.set("fs.s3.maxRetries", "20")
scala> val source_df = spark.read.csv("s3://awsexamplebucket/data/")
scala> source_df.write.save("s3://awsexamplebucket2/output/")

Here's an example of how to increase the retry limit at runtime for a Hive application:

hive> set fs.s3.maxRetries=20;
hive> select ....

Adjust the number of concurrent S3 requests

  • If you have multiple jobs (Spark, Apache Hive, or s3-dist-cp) reading from and writing to the same S3 prefix, then you can adjust the concurrency. Start with the most read/write-heavy jobs and lower their concurrency to avoid excessive parallelism. If you configured cross-account access for Amazon S3, keep in mind that other accounts might also be submitting jobs to the same prefix.
  • If you see errors when the job tries to write to the destination bucket, then reduce excessive write parallelism. For example, use the Spark .coalesce() or .repartition() operations to reduce the number of Spark output partitions before writing to Amazon S3. You can also reduce the number of cores per executor or reduce the number of executors.
  • If you see errors when the job tries to read from the source bucket, then adjust the size of objects. You can aggregate smaller objects into larger ones to reduce the number of objects that the job must read. This lets your jobs read datasets with fewer read requests. For example, use s3-dist-cp to merge a large number of small files into a smaller number of large files.
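To make the write-parallelism advice above more concrete, the following is a minimal PySpark-style sketch that picks an output partition count from an estimated dataset size and then coalesces before writing. The 128 MiB target file size, the helper names, and the caller-supplied size estimate are all assumptions for illustration. Only DataFrame.coalesce() and DataFrame.write.save() are Spark APIs.

```python
import math


def target_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Return a partition count that yields files of roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_with_fewer_files(df, output_path, total_bytes):
    """Coalesce a Spark DataFrame before writing it to Amazon S3.

    df is expected to be a pyspark.sql.DataFrame. coalesce() only lowers
    the partition count and avoids a full shuffle, so fewer tasks write
    to the destination prefix at the same time.
    """
    num_partitions = target_partitions(total_bytes)
    df.coalesce(num_partitions).write.save(output_path)
    return num_partitions


# Example call (hypothetical DataFrame and size estimate of 10 GiB):
# write_with_fewer_files(source_df, "s3://awsexamplebucket2/output/", 10 * 1024 ** 3)
```

Because coalesce() can leave partitions unevenly sized, .repartition() might be a better fit when the data is heavily skewed, at the cost of a shuffle.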