AWS Big Data Blog
Securely Access Web Interfaces on Amazon EMR Launched in a Private Subnet
Ben Snively is a Solutions Architect with AWS
Private subnets allow you to limit access to deployed components, and to control security and routing of the system. You can also use a private subnet to connect an on-premises local network to AWS through a VPN or AWS Direct Connect. Amazon EMR allows customers to launch clusters in a private subnet in Amazon VPC. For more information, see Select an Amazon VPC Subnet for the Cluster in the Amazon EMR Developer Guide.
When you launch an EMR cluster in a public subnet, the master node of the cluster has a public DNS name, allowing you to create an SSH tunnel and securely access web applications like Hue, Apache Zeppelin, and other Hadoop web UIs running on that node. However, when you launch an EMR cluster in a private subnet, your master node doesn’t have a public DNS name, which can make it more difficult to access the web applications running on it.
In this post, I outline two possible solutions to securely access web UIs on an EMR cluster running in a private subnet. These options cover scenarios such as connecting through a local network to your VPC or connecting through the Internet if your private subnet is not directly accessible.
Networking requirements in a private subnet
EMR depends on resources on the Internet and regional endpoints within the AWS cloud. Many customers analyze data in Amazon S3, Amazon Kinesis, and Amazon DynamoDB using EMR. The communication routes to these services differ when your EMR cluster is in a private subnet.
Network routes can be configured from your private subnet to provide communication to these resources. A common setup is to use an S3 VPC endpoint for S3 and a NAT instance to access the public endpoints of other AWS services (for optional EMR functionality). Amazon VPC also offers a managed NAT gateway that can be configured to allow access to these public endpoints. The new VPC Subnets page in the EMR console makes setting up S3 endpoints and NAT instances very easy.
In the above screenshot of the Create Cluster – Advanced Options page, you can also see that the private subnet does not have a route to an S3 Endpoint or a NAT instance; choose Add S3 Endpoint and NAT Instance to add these components. This takes you to the Configure Subnet page (which you can also access from the VPC Subnets page), which gives you options to configure these components directly before creating your cluster.
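If you prefer the command line to the console, the same routing can be sketched with the AWS CLI. This is a minimal sketch, not what the console does behind the scenes: it uses the managed NAT gateway rather than a NAT instance, the function name `configure_private_subnet_routes` is hypothetical, and every ID and the Region are placeholders that you supply.

```shell
# Sketch: give a private subnet a route to S3 and to public AWS endpoints.
# Arguments (all placeholders): Region, VPC ID, private route table ID,
# public subnet ID, and the allocation ID of an existing Elastic IP.
configure_private_subnet_routes() {
  local region="$1" vpc_id="$2" private_rtb="$3" public_subnet="$4" eip_alloc="$5"
  local nat_gw_id

  # Gateway endpoint for S3; this adds an S3 route to the private route table.
  aws ec2 create-vpc-endpoint --region "$region" --vpc-id "$vpc_id" \
    --service-name "com.amazonaws.${region}.s3" \
    --route-table-ids "$private_rtb" >/dev/null || return 1

  # The NAT gateway itself lives in the public subnet.
  nat_gw_id=$(aws ec2 create-nat-gateway --region "$region" \
    --subnet-id "$public_subnet" --allocation-id "$eip_alloc" \
    --query 'NatGateway.NatGatewayId' --output text) || return 1

  # A route can only target the gateway once it is available.
  aws ec2 wait nat-gateway-available --region "$region" \
    --nat-gateway-ids "$nat_gw_id" || return 1

  # Default route from the private subnet through the NAT gateway.
  aws ec2 create-route --region "$region" --route-table-id "$private_rtb" \
    --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$nat_gw_id" >/dev/null || return 1

  echo "$nat_gw_id"
}
```

A call would look like `configure_private_subnet_routes us-east-1 vpc-... rtb-... subnet-... eipalloc-...`, with your own IDs in place of the elided ones.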
Connecting to a local network
When you connect a private subnet to your local network through a VPN connection or AWS Direct Connect, routing is configured so that communication spans the two networks. After the private connection is established, data scientists and engineers can connect directly from their local network to the private and public subnets in the VPC.
The diagram below shows an example networking flow in green:
Because the master node in the private subnet is directly addressable through the private connection, a client on the local network can follow the same mechanism as before to access the web UIs on EMR. There are multiple options to connect to the web interfaces outlined in the View Web Interfaces Hosted on Amazon EMR Clusters topic; this post builds upon the option to set up an SSH tunnel to the master node using dynamic port forwarding.
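As a minimal sketch of that option over a VPN or AWS Direct Connect link, the tunnel can be opened straight to the master node's private address. The function name `tunnel_to_master`, the key file path, and the port are assumptions for illustration; `hadoop` is the login user on EMR nodes.

```shell
# Sketch: SOCKS tunnel directly to the master node over the private connection.
# Arguments (placeholders): EC2 key file, local port, master private IP or DNS name.
tunnel_to_master() {
  local key_file="$1" local_port="$2" master_host="$3"
  # -N: do not run a remote command; -D: dynamic (SOCKS) port forwarding.
  ssh -i "$key_file" -N -D "$local_port" "hadoop@${master_host}"
}
```

For example, `tunnel_to_master ~/mykey.pem 8157 <master private IP>` listens on local port 8157, which is an arbitrary unused port chosen for this example.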
Connecting through the Internet
How can you access the EMR web interfaces through the Internet when the cluster is launched in a private subnet? In this section, I’ll focus on how to stand up a bastion host with a public IP address and configure dynamic port forwarding to enable this communication.
A bastion host provides a secure gateway between the public and private subnets. The bastion host has a public IP address that clients connect to and provides a proxy to the web interfaces on the master node of your EMR cluster in a private subnet.
In an earlier blog post, Mike Pope showed how to configure a bastion host and SSH on a Linux instance running in a private subnet. When the EMR cluster is launched with an EC2 key pair, you can use the same technique to gain SSH access to nodes of your EMR cluster.
The diagram below shows network flow through the bastion host to EMR.
Setting up the bastion host and dynamic port forwarding
The following section focuses on the steps needed to set up the dynamic port forwarding to enable the web interfaces running on your EMR cluster. The steps assume that a VPC has already been created with a private and public subnet. For more information, see Getting Started with Amazon VPC.
Launch the bastion host
An easy way to launch a Linux bastion host is to use the Amazon Linux AMI, and follow the instructions on how to launch a Linux EC2 instance. It’s a good idea to assign an Elastic IP address to the instance. Elastic IP addresses are persistent public IP addresses for an instance. This allows the public IP address to stay the same if the instance is stopped or started, or if the instance itself ever needs to be changed. For more information, see Step 4: Assign an Elastic IP Address to Your Instance.
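Assigning the Elastic IP address can also be scripted. The following is a sketch using the AWS CLI, assuming the bastion instance is already running; the function name `assign_elastic_ip` is hypothetical, and the Region and instance ID are placeholders.

```shell
# Sketch: allocate an Elastic IP address and attach it to the bastion host.
# Arguments (placeholders): Region, bastion instance ID.
assign_elastic_ip() {
  local region="$1" instance_id="$2"
  local alloc_id

  # Allocate a new address for use in a VPC.
  alloc_id=$(aws ec2 allocate-address --region "$region" --domain vpc \
    --query 'AllocationId' --output text) || return 1

  # Attach it to the instance.
  aws ec2 associate-address --region "$region" \
    --instance-id "$instance_id" --allocation-id "$alloc_id" >/dev/null || return 1

  # Print the public IP address that clients will connect to.
  aws ec2 describe-addresses --region "$region" --allocation-ids "$alloc_id" \
    --query 'Addresses[0].PublicIp' --output text
}
```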
Configure VPC security groups to allow incoming traffic from the bastion host
By default, EMR associates one security group with the master node and associates a different security group with core and task nodes for the respective node types in your cluster. For more information about how security groups work as a stateful firewall, see Security Groups for EC2-VPC.
- Create a security group within the VPC prior to launching your EMR cluster.
- Associate this security group with the bastion host. The security group rules need to be set to allow ingress SSH traffic on port 22 from your client’s IP CIDR ranges.
- Create another new security group that allows ingress TCP traffic from the bastion host’s security group specified as a source.
- When you create your EMR cluster, use Advanced Options. When you configure Security options, under EC2 security groups, add these security groups for both the Master and Core & Task node types by clicking the pencil under Additional security groups and selecting the security groups from the list.
In the following screenshot, the “Internal TCP Traffic” security group is applied to all the nodes in the EMR cluster, which allows TCP traffic between the bastion host and all nodes in your cluster.
In the following screenshot, the master node shows both security groups.
- If your EMR cluster has already launched, you can modify the existing security groups with a new rule that allows ingress TCP traffic from the bastion host’s security group.
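The two security groups described in the steps above can be sketched with the AWS CLI as follows. The function name and the group names (`bastion-ssh`, `internal-tcp-traffic`) are assumptions for illustration, and the Region, VPC ID, and client CIDR range are placeholders.

```shell
# Sketch: create the bastion security group and the internal TCP security group.
# Arguments (placeholders): Region, VPC ID, client CIDR range allowed to use SSH.
create_bastion_security_groups() {
  local region="$1" vpc_id="$2" client_cidr="$3"
  local bastion_sg internal_sg

  # Group for the bastion host: SSH (port 22) from the client CIDR range only.
  bastion_sg=$(aws ec2 create-security-group --region "$region" --vpc-id "$vpc_id" \
    --group-name bastion-ssh --description "SSH to bastion host" \
    --query 'GroupId' --output text) || return 1
  aws ec2 authorize-security-group-ingress --region "$region" --group-id "$bastion_sg" \
    --protocol tcp --port 22 --cidr "$client_cidr" >/dev/null || return 1

  # Group for the EMR nodes: TCP on any port, but only from the bastion's group.
  internal_sg=$(aws ec2 create-security-group --region "$region" --vpc-id "$vpc_id" \
    --group-name internal-tcp-traffic --description "TCP from bastion host" \
    --query 'GroupId' --output text) || return 1
  aws ec2 authorize-security-group-ingress --region "$region" --group-id "$internal_sg" \
    --protocol tcp --port 0-65535 --source-group "$bastion_sg" >/dev/null || return 1

  # Print both group IDs: bastion group first, internal group second.
  echo "$bastion_sg $internal_sg"
}
```

Using the bastion's security group as the source, rather than its IP address, keeps the rule valid even if the bastion instance is replaced. You may want to narrow the port range to only the web UI ports you actually use.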
Connect to the bastion host with dynamic port forwarding
- In the EMR console, note the DNS entry for the master node, which resolves to a private IP address because the cluster is running in a private subnet.
- Use SSH to connect to the bastion host running in the public subnet (through the bastion host’s Elastic IP address):
ssh -D <local_port> ec2-user@<EIP for bastion host>
The SSH session started by the above command may time out, because it both sets up the tunnel and starts an interactive shell. To create only the tunnel without starting a shell, use the following command:
ssh -ND <local_port> ec2-user@<EIP for bastion host>
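If idle tunnels are still dropped, one possible mitigation is to have the SSH client send keepalive messages. The following is a sketch rather than a required step; the function name `open_bastion_tunnel` and the interval values are assumptions, while the `-o` options are standard OpenSSH client settings.

```shell
# Sketch: tunnel-only connection to the bastion host with client keepalives.
# Arguments (placeholders): EC2 key file, local port, bastion Elastic IP address.
open_bastion_tunnel() {
  local key_file="$1" local_port="$2" bastion_eip="$3"
  # ServerAliveInterval/CountMax: probe the server every 60 seconds and give up
  # after 3 unanswered probes. ExitOnForwardFailure: exit instead of running
  # without the forwarding if the local port cannot be bound.
  ssh -i "$key_file" -N -D "$local_port" \
    -o ServerAliveInterval=60 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    "ec2-user@${bastion_eip}"
}
```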
View active web UI links
After the dynamic port forwarding is set up and a dynamic proxy is configured in the browser (see Configure Proxy Settings to View Websites Hosted on the Master Node), the links for various web UIs become active on the EMR Cluster Details page:
You can choose the links to open the web UIs in your browser, or enter the internal master node IP address with the relevant port for a specific web UI.
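Before configuring a browser, it can help to confirm that a web UI is reachable through the tunnel. Here is a small sketch using curl against the local SOCKS port; the function name `check_web_ui` is hypothetical, and the port numbers and IP address are placeholders.

```shell
# Sketch: request a web UI through the SOCKS tunnel and print the HTTP status code.
# Arguments (placeholders): local tunnel port, master private IP, web UI port.
check_web_ui() {
  local local_port="$1" master_ip="$2" ui_port="$3"
  # --socks5-hostname sends the request through the tunnel and lets the far end
  # resolve host names, which matters for private DNS names.
  curl --silent --output /dev/null --write-out '%{http_code}\n' \
    --socks5-hostname "localhost:${local_port}" \
    "http://${master_ip}:${ui_port}/"
}
```

For example, `check_web_ui 8157 <master private IP> 8088` targets port 8088, the default YARN ResourceManager port on recent EMR releases; check the port list for your release if it differs.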
Other patterns for access
Other patterns also support accessing the web interfaces running on an EMR cluster in a private subnet. These patterns have their trade-offs, but are important to mention.
HTTP(S) proxy in the public subnet
One option is to run a component that accepts HTTP(S) traffic in the public VPC subnet and proxies requests to your EMR cluster in the private subnet. An easy setup for this would be to launch an Elastic Load Balancing load balancer in the public subnets that points to web UI endpoints on your cluster’s master node. Other flavors of this include running an HTTP proxy (such as NGINX or HAProxy) on an EC2 instance in the public subnet. This is very easy to set up, and because users do not need an SSH key, it can be easier for them to connect. However, this approach can make fine-grained user access control more difficult, and it does not support all web UIs on your cluster.
Client/desktops within AWS
Generally, the closer you can be to the data and processing, the more you can do. This also applies to clients accessing the web interfaces. An excellent option is for customers to use Amazon WorkSpaces as a virtual desktop, and access the web interfaces on EMR from their WorkSpace, which can be configured to access a private subnet.
You can now create secure SSH tunnels to access the various web interfaces like Hue and the YARN Resource Manager UI running on your EMR cluster, even when that cluster is in a private subnet.
If you have questions or suggestions, please leave a comment below.