How can I access S3 Requester Pays buckets from AWS Glue, Amazon EMR, or Athena?

Last updated: 2020-07-08

How can I access an Amazon Simple Storage Service (Amazon S3) Requester Pays bucket from AWS Glue, Amazon EMR, or Amazon Athena?

Short description

To access S3 buckets that have Requester Pays enabled, every request to the bucket must include the Requester Pays header (x-amz-request-payer: requester). Requests without this header fail with an Access Denied (403) error.
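To see what this header looks like at the API level: boto3's S3 client exposes it as the RequestPayer parameter on object-level operations. The following is a sketch only; the bucket and key are placeholders, not real objects.

```python
# Request arguments for an S3 GetObject call against a Requester Pays bucket.
# RequestPayer="requester" is sent as the x-amz-request-payer header.
get_object_kwargs = {
    "Bucket": "awsdoc-example-bucket",   # placeholder bucket name
    "Key": "path/to/source-object.csv",  # placeholder object key
    "RequestPayer": "requester",
}
```

You would pass these arguments to an S3 client as s3_client.get_object(**get_object_kwargs); most S3 object-level operations accept the same parameter.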

Resolution

AWS Glue

AWS Glue requests to Amazon S3 don't include the Requester Pays header by default. Without this header, an API call to a Requester Pays bucket fails with an AccessDenied exception. To add the header in an ETL script, use hadoopConfiguration().set() to set fs.s3.useRequesterPaysHeader to true on either the GlueContext variable or the Apache Spark session variable.

GlueContext:

glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

Spark session:

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

The following is an example of how to use the header in an ETL script. Replace these values:

your_database_name: the name of your database
your_table_name: the name of your table
s3://awsdoc-example-bucket/path-to-source-location/: the path to the source bucket
s3://awsdoc-example-bucket/path-to-target-location/: the path to the destination bucket

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
# glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

## AWS Glue DynamicFrame read and write
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your_database_name", table_name = "your_table_name", transformation_ctx = "datasource0")
datasource0.show()
datasink = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://awsdoc-example-bucket/path-to-target-location/"}, format = "csv")

## Spark DataFrame read and write
df = spark.read.csv("s3://awsdoc-example-bucket/path-to-source-location/")
df.show()
df.write.csv("s3://awsdoc-example-bucket/path-to-target-location/")

job.commit()

Amazon EMR

To add the Requester Pays header on Amazon EMR, set the following property in /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:

<property>
   <name>fs.s3.useRequesterPaysHeader</name>
   <value>true</value>
</property>
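Editing emrfs-site.xml applies the setting to an existing cluster. When you create a new cluster, you can supply the same property through the emrfs-site configuration classification instead; the following is a sketch of the configuration object you would pass in the cluster's Configurations field:

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.useRequesterPaysHeader": "true"
    }
  }
]
```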

Athena

To allow workgroup members to query Requester Pays buckets, choose Enable queries on Requester Pays buckets in Amazon S3 when you create the workgroup. For more information, see Create a workgroup.