I want to access an Amazon Simple Storage Service (Amazon S3) Requester Pays bucket from AWS Glue, Amazon EMR, or Amazon Athena.
Short description
To access S3 buckets that have Requester Pays turned on, all requests to the bucket must have the Requester Pays header.
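For example, outside of Glue, EMR, or Athena, the AWS CLI adds this header when you pass the --request-payer flag. The following is a sketch only; the bucket name and object key are placeholders:

```shell
# Download an object from a Requester Pays bucket.
# --request-payer requester adds the x-amz-request-payer header,
# and the requester (not the bucket owner) is billed for the request.
aws s3api get-object \
  --bucket awsdoc-example-bucket \
  --key path/to/object.csv \
  --request-payer requester \
  object.csv
```

Without the flag, the same request to a Requester Pays bucket returns a 403 Access Denied error.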
Resolution
AWS Glue
AWS Glue requests to Amazon S3 don't include the Requester Pays header by default. Without this header, an API call to a Requester Pays bucket fails with an AccessDenied exception. To add the Requester Pays header to an ETL script, use hadoopConfiguration().set() to turn on fs.s3.useRequesterPaysHeader on the GlueContext variable or the Apache Spark session variable.
GlueContext:
glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
Spark session:
spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
The following is an example of how to use the header in an ETL script. Replace the following values:

your_database_name: the name of your database
your_table_name: the name of your table
s3://awsdoc-example-bucket/path-to-source-location/: the path to the source bucket
s3://awsdoc-example-bucket/path-to-target-location/: the path to the destination bucket
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
# glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
## AWS Glue DynamicFrame read and write
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your_database_name", table_name = "your_table_name", transformation_ctx = "datasource0")
datasource0.show()
datasink = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path":"s3://awsdoc-example-bucket/path-to-target-location/"}, format = "csv")
## Spark DataFrame read and write
df = spark.read.csv("s3://awsdoc-example-bucket/path-to-source-location/")
df.show()
df.write.csv("s3://awsdoc-example-bucket/path-to-target-location/")
job.commit()
Amazon EMR
On each cluster node, set the following property in /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:
<property>
<name>fs.s3.useRequesterPaysHeader</name>
<value>true</value>
</property>
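Instead of editing the file on each node, you can apply the same property at cluster launch with the emrfs-site configuration classification. A minimal sketch of the configuration object:

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.useRequesterPaysHeader": "true"
    }
  }
]
```

Pass this JSON when you create the cluster (for example, through the --configurations option of aws emr create-cluster) so that every node starts with the header turned on.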
Athena
To allow workgroup members to query Requester Pays buckets, choose Enable queries on Requester Pays buckets in Amazon S3 when you create the workgroup. For more information, see Create a workgroup.
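You can also turn on this workgroup setting from the AWS CLI. The following is a sketch that assumes the RequesterPaysEnabled field of the workgroup configuration; the workgroup name and output location are placeholders:

```shell
# Create an Athena workgroup whose members can query Requester Pays buckets.
# RequesterPaysEnabled=true corresponds to the console option
# "Enable queries on Requester Pays buckets in Amazon S3".
aws athena create-work-group \
  --name requester-pays-workgroup \
  --configuration 'ResultConfiguration={OutputLocation=s3://awsdoc-example-bucket/athena-results/},RequesterPaysEnabled=true'
```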
Related information
Downloading objects in Requester Pays buckets
How do I troubleshoot 403 Access Denied errors from Amazon S3?