How do I access Amazon S3 Requester Pays buckets from AWS Glue, Amazon EMR, or Amazon Athena?

I want to access an Amazon Simple Storage Service (Amazon S3) Requester Pays bucket from AWS Glue, Amazon EMR, or Amazon Athena.

Short description

To access an S3 bucket that has Requester Pays turned on, every request to the bucket must include the Requester Pays header.
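
As an illustration of what "including the header" means outside of these three services, the following is a minimal sketch that uses the AWS SDK for Python (boto3). It assumes that boto3 is installed and that credentials are configured. The function names, bucket name, and key are hypothetical and are not part of the original article. In boto3, the RequestPayer parameter is what causes the x-amz-request-payer header to be sent.

```python
def requester_pays_kwargs(bucket, key):
    """Build get_object arguments that include the Requester Pays flag.

    RequestPayer="requester" makes the SDK send the
    x-amz-request-payer: requester header with the request.
    """
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def fetch_object(bucket, key):
    """Download one object from a Requester Pays bucket (hypothetical helper)."""
    # Imported here so that the helper above can be used without boto3.
    import boto3

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_kwargs(bucket, key))
    return response["Body"].read()


if __name__ == "__main__":
    # Placeholder bucket and key; replace them with your own values.
    data = fetch_object("awsdoc-example-bucket", "path-to-source-location/file.csv")
    print(len(data))
```

If the RequestPayer argument is left out, a request to a Requester Pays bucket is expected to be rejected with an access denied error, which is the same failure mode that the service-specific settings below address.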

Resolution

AWS Glue

By default, AWS Glue requests to Amazon S3 don't include the Requester Pays header. Without this header, API calls to a Requester Pays bucket fail with an AccessDenied exception. To add the Requester Pays header to an ETL script, use hadoopConfiguration().set() to set fs.s3.useRequesterPaysHeader to true on either the GlueContext variable or the Apache Spark session variable.

GlueContext:

glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

Spark session:

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

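The following example shows how to use the header in an ETL script. Replace the following values:

your_database_name: the name of your database
your_table_name: the name of your table
s3://awsdoc-example-bucket/path-to-source-location/: the path to the source bucket
s3://awsdoc-example-bucket/path-to-target-location/: the path to the target bucket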

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")
# glueContext._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

##AWS Glue DynamicFrame read and write
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "your_database_name", table_name = "your_table_name", transformation_ctx = "datasource0")
datasource0.show()
datasink = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path":"s3://awsdoc-example-bucket/path-to-target-location/"}, format = "csv")

##Spark DataFrame read and write
df = spark.read.csv("s3://awsdoc-example-bucket/path-to-source-location/")
df.show()
df.write.csv("s3://awsdoc-example-bucket/path-to-target-location/")

job.commit()

Amazon EMR

Set the following property in /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:

<property>
   <name>fs.s3.useRequesterPaysHeader</name>
   <value>true</value>
</property>
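
If you prefer to apply the setting when the cluster is created rather than editing the file on a running cluster, the same property can be expressed as an emrfs-site configuration classification. The following is a minimal sketch that only builds and prints that JSON. The function name is hypothetical, and supplying the output as the cluster's software configuration is an assumption about your workflow, not a step from the original article.

```python
import json


def emrfs_requester_pays_configuration():
    """Return an EMR configuration list that turns on the Requester Pays header."""
    return [
        {
            "Classification": "emrfs-site",
            # Same property name and value as the emrfs-site.xml entry above.
            "Properties": {"fs.s3.useRequesterPaysHeader": "true"},
        }
    ]


if __name__ == "__main__":
    # Print JSON that can be supplied as the cluster's configurations.
    print(json.dumps(emrfs_requester_pays_configuration(), indent=2))
```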

Athena

To allow workgroup members to query Requester Pays buckets, choose Enable queries on Requester Pays buckets in Amazon S3 when you create the workgroup. For more information, see Create a workgroup.
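
The same workgroup setting can also be applied programmatically. The following is a minimal sketch that uses boto3, assuming that boto3 is installed and credentials are configured. The workgroup name, the query results location, and the function names are hypothetical. RequesterPaysEnabled is the workgroup configuration field that corresponds to the console option.

```python
def requester_pays_workgroup_config(output_location):
    """Build a workgroup configuration with Requester Pays queries allowed."""
    return {
        "ResultConfiguration": {"OutputLocation": output_location},
        # Corresponds to the console option for Requester Pays buckets.
        "RequesterPaysEnabled": True,
    }


def create_requester_pays_workgroup(name, output_location):
    """Create an Athena workgroup that can query Requester Pays buckets."""
    # Imported here so that the helper above can be used without boto3.
    import boto3

    athena = boto3.client("athena")
    return athena.create_work_group(
        Name=name,
        Configuration=requester_pays_workgroup_config(output_location),
        Description="Workgroup allowed to query Requester Pays buckets",
    )


if __name__ == "__main__":
    # Placeholder names; replace them with your own values.
    create_requester_pays_workgroup(
        "example-requester-pays-workgroup",
        "s3://awsdoc-example-bucket/athena-query-results/",
    )
```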