Enabling Amazon Simple Storage Service (Amazon S3) Access Points in Apache Hadoop S3A
We’re pleased to announce that Amazon Simple Storage Service (Amazon S3) Access Points can now be used in Apache Hadoop 3.3.2 and any framework consuming the S3A connector or relying on the Hadoop Distributed File System (such as Apache Spark, Apache Hive, Apache Pig, and Apache Flink).
In this post, we review what access point resources are, how they can help you manage data inside Amazon S3, and how to create and configure them.
Amazon S3 Access Points are a great way to simplify permissions for shared access to data inside your bucket. As an application grows, a single bucket policy that covers every consumer becomes harder to manage, and access points let you split those permissions up. With Access Points, you can create named network endpoints for a bucket, each with its own access control policy.
Access points can be configured to accept requests only from inside an Amazon Virtual Private Cloud (Amazon VPC). This is set when the access point is created, rather than by writing extra policies as you would for VPC-only access on an S3 bucket. As a result, you can keep sets of sensitive data in an S3 bucket that can be accessed only from an Amazon VPC.
There is no additional cost—in performance or otherwise—for creating and using access points, so there’s no need to worry about a Hadoop cluster runtime increasing if you choose to use them.
When running Hadoop clusters in or outside of Amazon Web Services (AWS), the main interaction between the Hadoop Distributed File System (HDFS) and Amazon S3 is handled by a connector called S3A. For a simplified and easy-to-maintain Hadoop setup, check out Amazon EMR, a platform for rapidly processing, analyzing, and applying machine learning to big data using open source frameworks.
Now let’s walk through how to create access points for an S3 bucket and use them directly in Hadoop.
Setting up Amazon S3 Access Points
Creating the resource
Creating an access point for an S3 bucket can be done through the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS SDK, or AWS CloudFormation. For this example, we will use the console to create one.
- In the AWS Management Console, navigate to the Amazon S3 service.
- Select Access Points.
- Select Create Access Point.
- Enter a name, for example, finance.
- Choose the S3 bucket for which you want the access point to be created.
- Select whether the access point should be restricted to VPC-only traffic or should also accept internet traffic, and choose the policy.
Once the access point is created, select Copy ARN so we can use it when configuring the Hadoop cluster.
Configuring S3A with access points
Open the Hadoop configuration file core-site.xml and add the following property, remembering to replace the ARN and name value with your own:
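As a sketch, the property could look like the following. It assumes the access point is named finance, as in the example above, and uses the per-bucket option `fs.s3a.bucket.<name>.accesspoint.arn` from the Hadoop S3A documentation. The Region (eu-west-1) and account ID (123456789012) in the ARN are placeholders; replace the whole value with the ARN you copied from the console.

```xml
<!-- Sketch only: the Region and account ID in the ARN are placeholder values. -->
<property>
  <name>fs.s3a.bucket.finance.accesspoint.arn</name>
  <value>arn:aws:s3:eu-west-1:123456789012:accesspoint/finance</value>
  <description>Route s3a://finance/ requests through this access point</description>
</property>
```

The segment between `fs.s3a.bucket.` and `.accesspoint.arn` is the name that appears in the bucket position of `s3a://` URIs.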
Now when you use the s3a://finance/data/in/bucket URI in Hadoop, S3A will automatically send requests to Amazon S3 through the access point and not to the S3 bucket directly.
Same name access points
Access point names do not have to be globally unique like S3 bucket names are. They only need to be unique per account, so you may run into a situation in which two access points with the same name were created for two different accounts. For example, let’s say you have the following:
- finance access point created for account
- finance access point created for account
You can then configure access points in the following way:
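One way this could look is sketched below. The account IDs (111111111111 and 222222222222) and Regions are placeholders, not values from this post. The alias finance-dublin matches the URI used in this section; finance-london is a hypothetical alias chosen purely for illustration.

```xml
<!-- Sketch only: account IDs and Regions are placeholders. -->
<!-- "finance-london" is a hypothetical alias; "finance-dublin" is the one used in the text. -->
<property>
  <name>fs.s3a.bucket.finance-london.accesspoint.arn</name>
  <value>arn:aws:s3:eu-west-2:111111111111:accesspoint/finance</value>
</property>
<property>
  <name>fs.s3a.bucket.finance-dublin.accesspoint.arn</name>
  <value>arn:aws:s3:eu-west-1:222222222222:accesspoint/finance</value>
</property>
```

Both ARNs end in `accesspoint/finance`; only the alias in the property name tells the two apart on the Hadoop side.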
Notice that neither the ARN nor the access point name has to change; only the property name used when configuring S3A differs. Now you can use URIs such as s3a://finance-dublin/key2/ and access will be redirected accordingly.
Access point only access
One benefit of using access points is that they can be used to increase the overall security of your data in Amazon S3 by creating them to accept only requests from a virtual private cloud (VPC). To take full advantage of this capability and raise the bar on security, a new configuration option was added to S3A to configure access to Amazon S3 only through access point resources:
This lets us configure entire clusters to use only configured access points to access data in Amazon S3, and also configure those access points to accept only VPC traffic. It also reduces the chance of accidentally mistyping an S3 bucket name and ingesting data from a different public S3 bucket.
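A minimal sketch of that setting, assuming the option name `fs.s3a.accesspoint.required` as documented for the Hadoop S3A connector:

```xml
<!-- Sketch: require an access point ARN for every bucket name used through S3A. -->
<property>
  <name>fs.s3a.accesspoint.required</name>
  <value>true</value>
</property>
```

With this enabled, an `s3a://` URI whose bucket name has no matching `fs.s3a.bucket.<name>.accesspoint.arn` property is expected to be rejected rather than resolved as a plain bucket.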
If you still want access to certain S3 buckets not to require access points, then you can configure them by using per-bucket overrides as such:
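As a sketch, a per-bucket override for a bucket named non-ap-bucket could look like this. It assumes the usual S3A per-bucket pattern, `fs.s3a.bucket.<name>.<option>`, applied to the `accesspoint.required` option:

```xml
<!-- Sketch: exempt one bucket from the cluster-wide access point requirement. -->
<property>
  <name>fs.s3a.bucket.non-ap-bucket.accesspoint.required</name>
  <value>false</value>
</property>
```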
This makes it so that any URI under s3a://non-ap-bucket will not use access points, but the S3 bucket name directly.
We’re happy to have contributed this feature to Hadoop, and we are excited to see developers simplify their Amazon S3 permission workflows and improve their cluster’s security by restricting access to VPC-only requests. To find out more about Access Points in Hadoop S3A, make sure you read the official Hadoop docs.
Thanks to the Apache Hadoop community for their support, and special thanks to Apache Hadoop committers Steve Loughran and Mukund Thakur, and their colleague at Cloudera, Mehakmeet Singh, for their reviews and guidance on the Access Point PR.
Support for Amazon S3 Access Points starts with Hadoop 3.3.2, so make sure you’re keeping your versions up to date.