How can I access S3 Requester Pays buckets from AWS Glue, Amazon EMR, or Athena?

Last updated: 2020-07-08

How can I access an Amazon Simple Storage Service (Amazon S3) Requester Pays bucket from AWS Glue, Amazon EMR, or Amazon Athena?

Short description

To access S3 buckets that have Requester Pays enabled, every request to the bucket must include the Requester Pays header (x-amz-request-payer: requester). Requests without this header fail with an Access Denied (403) error.
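To see what this header looks like at the API level: boto3's S3 client exposes it as the RequestPayer parameter on object-level operations. The following is a sketch only; the bucket and key are placeholders, not real objects.

```python
# Request arguments for an S3 GetObject call against a Requester Pays bucket.
# RequestPayer="requester" is sent as the x-amz-request-payer header.
get_object_kwargs = {
    "Bucket": "awsdoc-example-bucket",   # placeholder bucket name
    "Key": "path/to/source-object.csv",  # placeholder object key
    "RequestPayer": "requester",
}
```

You would pass these arguments to an S3 client as s3_client.get_object(**get_object_kwargs); most S3 object-level operations accept the same parameter.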

Resolution

AWS Glue

AWS Glue requests to Amazon S3 don't include the Requester Pays header by default. Without this header, an API call to a Requester Pays bucket fails with an AccessDenied exception. To add the header in an ETL script, use hadoopConfiguration().set() to set fs.s3.useRequesterPaysHeader to true on either the GlueContext variable or the Apache Spark session variable.

GlueContext:

glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

Spark session:

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

The following is an example of how to use the header in an ETL script. Replace these values:

your_database_name: the name of your database
your_table_name: the name of your table
s3://awsdoc-example-bucket/path-to-source-location/: the path to the source bucket
s3://awsdoc-example-bucket/path-to-target-location/: the path to the destination bucket

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
# glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

## AWS Glue DynamicFrame read and write
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your_database_name", table_name = "your_table_name", transformation_ctx = "datasource0")
datasource0.show()
datasink = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://awsdoc-example-bucket/path-to-target-location/"}, format = "csv")

## Spark DataFrame read and write
df = spark.read.csv("s3://awsdoc-example-bucket/path-to-source-location/")
df.show()
df.write.csv("s3://awsdoc-example-bucket/path-to-target-location/")

job.commit()

Amazon EMR

To add the Requester Pays header on Amazon EMR, set the following property in /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:

<property>
   <name>fs.s3.useRequesterPaysHeader</name>
   <value>true</value>
</property>
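Editing emrfs-site.xml applies the setting to an existing cluster. When you create a new cluster, you can supply the same property through the emrfs-site configuration classification instead; the following is a sketch of the configuration object you would pass in the cluster's Configurations field:

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.useRequesterPaysHeader": "true"
    }
  }
]
```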

Athena

To allow workgroup members to query Requester Pays buckets, choose Enable queries on Requester Pays buckets in Amazon S3 when you create the workgroup. For more information, see Create a workgroup.