How do I access Amazon S3 Requester Pays buckets from AWS Glue, Amazon EMR, or Amazon Athena?

Last updated: November 29, 2021

I want to access an Amazon Simple Storage Service (Amazon S3) Requester Pays bucket from AWS Glue, Amazon EMR, or Amazon Athena.

Short description

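To access an S3 bucket that has Requester Pays turned on, every request to the bucket must include the Requester Pays header (x-amz-request-payer).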
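
As a rough illustration of what the header looks like at the request level, the following Python sketch builds the header for a raw HTTP request and the equivalent parameters for an SDK call. The helper names with_requester_pays and get_object_kwargs are hypothetical and used only for this example. The sketch assumes the standard x-amz-request-payer header and the RequestPayer parameter that the AWS SDK for Python (Boto3) accepts on S3 object operations.

```python
# Name of the HTTP header that marks a request as "requester pays".
REQUEST_PAYER_HEADER = "x-amz-request-payer"


def with_requester_pays(headers=None):
    """Return a copy of `headers` with the Requester Pays header added."""
    merged = dict(headers or {})
    merged[REQUEST_PAYER_HEADER] = "requester"
    return merged


def get_object_kwargs(bucket, key):
    """Build keyword arguments for an S3 GetObject call that sends the header.

    Boto3 exposes the header through the RequestPayer parameter.
    """
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}
```

With Boto3 installed and credentials configured, these arguments could be passed as boto3.client("s3").get_object(**get_object_kwargs("awsdoc-example-bucket", "path/to/object")), where the bucket and key are placeholders.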

Resolution

AWS Glue

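By default, AWS Glue requests to Amazon S3 don't include the Requester Pays header. Without this header, API calls to a Requester Pays bucket fail with an Access Denied exception. To add the Requester Pays header to an ETL script, use hadoopConfiguration().set() on the GlueContext variable or the Apache Spark session variable to turn on fs.s3.useRequesterPaysHeader.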

GlueContext:

glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

Spark session:

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

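The following example shows how to use the header in an ETL script. Replace the following values:

your_database_name: the name of your database
your_table_name: the name of your table
s3://awsdoc-example-bucket/path-to-source-location/: the path to the source bucket
s3://awsdoc-example-bucket/path-to-target-location/: the path to the target bucket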

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
# glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

##AWS Glue DynamicFrame read and write
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your_database_name", table_name = "your_table_name", transformation_ctx = "datasource0")
datasource0.show()
datasink = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path":"s3://awsdoc-example-bucket/path-to-source-location/"}, format = "csv")

##Spark DataFrame read and write
df = spark.read.csv("s3://awsdoc-example-bucket/path-to-source-location/")
df.show()
df.write.csv("s3://awsdoc-example-bucket/path-to-target-location/")

job.commit()
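
To check that the setting took effect before a job reads from the bucket, the value can be read back from the same Hadoop configuration object. The following is a minimal sketch, not part of the original script. The helper name enable_requester_pays is hypothetical, and it assumes only that the object passed in exposes _jsc.hadoopConfiguration(), as spark and glueContext do in the script above.

```python
REQUESTER_PAYS_KEY = "fs.s3.useRequesterPaysHeader"


def enable_requester_pays(context):
    """Turn on the Requester Pays header and confirm the stored value.

    `context` is any object that exposes `_jsc.hadoopConfiguration()`,
    such as a SparkSession or a GlueContext.
    """
    conf = context._jsc.hadoopConfiguration()
    conf.set(REQUESTER_PAYS_KEY, "true")
    # Read the value back to confirm that the configuration holds it.
    return conf.get(REQUESTER_PAYS_KEY) == "true"
```

In a Glue job, this could be called as enable_requester_pays(spark) right after job.init(), before the first read from Amazon S3.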

Amazon EMR

Set the following property in /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:

<property>
   <name>fs.s3.useRequesterPaysHeader</name>
   <value>true</value>
</property>
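
As an alternative to editing the file on a running cluster, the same property can usually be supplied when the cluster is created, through the emrfs-site configuration classification. The following Python sketch builds that configuration object. The helper name emrfs_requester_pays_configuration is hypothetical, and the approach is an assumption based on how Amazon EMR configuration classifications generally work rather than a step described in this article.

```python
import json


def emrfs_requester_pays_configuration():
    """Build an EMR configuration list for the emrfs-site classification."""
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {"fs.s3.useRequesterPaysHeader": "true"},
        }
    ]


if __name__ == "__main__":
    # Print the JSON form, which can be saved to a file for cluster creation.
    print(json.dumps(emrfs_requester_pays_configuration(), indent=2))
```

With Boto3, the returned list could be passed as the Configurations argument of the EMR client's run_job_flow call, alongside the other cluster settings.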

Athena

To allow workgroup members to query Requester Pays buckets, choose Enable queries on Requester Pays buckets in Amazon S3 when you create the workgroup. For more information, see Create a workgroup.
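
The same workgroup option can also be set programmatically. The following Python sketch builds the request for the Athena CreateWorkGroup API with the RequesterPaysEnabled flag turned on. The helper name requester_pays_workgroup_request, the workgroup name, and the query result location are hypothetical placeholders.

```python
def requester_pays_workgroup_request(name, output_location):
    """Build CreateWorkGroup arguments that allow Requester Pays queries."""
    return {
        "Name": name,
        "Configuration": {
            # Where Athena writes query results for this workgroup.
            "ResultConfiguration": {"OutputLocation": output_location},
            # Lets workgroup members query Requester Pays buckets.
            "RequesterPaysEnabled": True,
        },
    }
```

With Boto3 installed and credentials configured, the request could be sent as boto3.client("athena").create_work_group(**requester_pays_workgroup_request("example-workgroup", "s3://awsdoc-example-bucket/query-results/")).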