How do I access S3 Requester Pays buckets from AWS Glue, Amazon EMR, or Athena?

Last updated: July 8, 2020

How do I access an Amazon Simple Storage Service (Amazon S3) Requester Pays bucket from AWS Glue, Amazon EMR, or Amazon Athena?

Short description

To access an S3 bucket that has Requester Pays enabled, every request to the bucket must include the Requester Pays header.
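As an illustration of what "including the header" means outside the three services covered here, the following is a minimal sketch that uses boto3, which is not part of the original article. With boto3, passing RequestPayer="requester" causes the client to send the x-amz-request-payer header. The helper names and the bucket and key values are hypothetical.

```python
def requester_pays_get_object_args(bucket: str, key: str) -> dict:
    """Build GetObject arguments that acknowledge the requester is charged."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download_object(bucket: str, key: str) -> bytes:
    """Download one object from a Requester Pays bucket (needs AWS credentials)."""
    # Imported lazily so the helper above can be used without boto3 installed.
    import boto3

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_get_object_args(bucket, key))
    return response["Body"].read()


if __name__ == "__main__":
    # Hypothetical bucket and key names, used only to show the argument shape.
    print(requester_pays_get_object_args("awsdoc-example-bucket", "path/data.csv"))
```

If the RequestPayer argument is omitted, a request to a Requester Pays bucket is rejected with an access denied error, which is the same failure described for AWS Glue in the next section.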

Resolution

AWS Glue

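By default, AWS Glue requests to Amazon S3 don't include the Requester Pays header. Without the header, an API call to a Requester Pays bucket fails with an AccessDenied exception. To add the Requester Pays header to an ETL script, use hadoopConfiguration().set() on the GlueContext variable or the Apache Spark session variable to enable fs.s3.useRequesterPaysHeader.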

GlueContext:

glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

Spark session:

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

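The following example shows how to use the header in an ETL script. Replace these values:

your_database_name: the name of your database
your_table_name: the name of your table
s3://awsdoc-example-bucket/path-to-source-location/: the path to the source bucket
s3://awsdoc-example-bucket/path-to-target-location/: the path to the target bucket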

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

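# Enable the Requester Pays header on the Spark session...
spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
# ...or, alternatively, on the GlueContext:
# glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")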

## Option 1: AWS Glue DynamicFrame read and write
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your_database_name", table_name = "your_table_name", transformation_ctx = "datasource0")
datasource0.show()
# Write to the target location, not back into the source location
datasink = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path":"s3://awsdoc-example-bucket/path-to-target-location/"}, format = "csv")

## Option 2: Spark DataFrame read and write
## Use one option per target path: Spark's default save mode fails if the path already exists.
df = spark.read.csv("s3://awsdoc-example-bucket/path-to-source-location/")
df.show()
df.write.csv("s3://awsdoc-example-bucket/path-to-target-location/")

job.commit()

Amazon EMR

Set the following property in /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:

<property>
   <name>fs.s3.useRequesterPaysHeader</name>
   <value>true</value>
</property>
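If you prefer to script the change rather than edit the file by hand, the following is a minimal sketch that adds or updates the property in a Hadoop-style configuration document by using Python's standard xml.etree.ElementTree module. The function name set_hadoop_property is hypothetical, and the sketch assumes the file has a <configuration> root element. ElementTree does not preserve XML comments, so back up the original file first.

```python
import xml.etree.ElementTree as ET


def set_hadoop_property(xml_text: str, name: str, value: str) -> str:
    """Return xml_text with <property> `name` set to `value`, adding it if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already present: update its value in place.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not found: append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Stand-in for the contents of emrfs-site.xml.
    original = "<configuration></configuration>"
    print(set_hadoop_property(original, "fs.s3.useRequesterPaysHeader", "true"))
```

On a cluster, you would read the contents of emrfs-site.xml, pass them through the function, and write the result back with sufficient file permissions.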

Athena

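To allow workgroup members to query Requester Pays buckets, choose Enable queries on Requester Pays buckets in Amazon S3 when you create the workgroup. For more information, see Create a workgroup.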
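The same workgroup setting can also be applied programmatically. The following is a minimal sketch that uses the boto3 Athena client, which is not part of the original article. It sets RequesterPaysEnabled in the workgroup configuration passed to create_work_group. The helper names, the workgroup name, and the query results location are hypothetical.

```python
def requester_pays_workgroup_config(output_location: str) -> dict:
    """Build a workgroup configuration that allows Requester Pays queries."""
    return {
        "ResultConfiguration": {"OutputLocation": output_location},
        "RequesterPaysEnabled": True,
    }


def create_requester_pays_workgroup(name: str, output_location: str) -> None:
    """Create the workgroup (needs AWS credentials and Athena permissions)."""
    # Imported lazily so the helper above can be used without boto3 installed.
    import boto3

    athena = boto3.client("athena")
    athena.create_work_group(
        Name=name,
        Configuration=requester_pays_workgroup_config(output_location),
    )


if __name__ == "__main__":
    # Hypothetical results location, used only to show the configuration shape.
    print(requester_pays_workgroup_config("s3://awsdoc-example-bucket/query-results/"))
```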