AWS Architecture Blog

Field Notes: Launch Amazon EMR with a Static Private IP in a Private Subnet

Organizations across every industry and sector are looking to easily and cost-effectively process vast amounts of data. Amazon EMR offers a way to instantly provision as much or as little capacity as needed to perform data-intensive tasks.

When you launch Amazon EMR, the IPs of the primary (master) and core nodes are assigned automatically. However, you may need to set up static private IPs for an Amazon EMR cluster so that it can connect to systems within your on-premises data center, for example when your on-premises firewall policies allow access only from specific IPs. In that case, you have to assign static private IPs to the primary (master) and core nodes so that Amazon EMR can reach the on-premises servers.

Most corporate security policies allow only a limited set of IPs through the firewall. Therefore, when launching an Amazon EMR cluster, you need to assign the cluster IPs that the enterprise has allowed.

This post explains how you can assign static private IPs to an Amazon EMR cluster to access servers within your on-premises data center. I also show one use case for this solution, in which you copy data from an on-premises Hadoop Distributed File System (HDFS) to Amazon S3. The copy uses the distcp command on an Amazon EMR cluster with static private IPs for the primary (master) and core nodes.

Overview of solution

For the highest level of security, place your Amazon EMR cluster in a private subnet. For an EMR cluster launched in a private subnet to communicate with networks outside its VPC, such as an on-premises data center or another VPC, you must use AWS Direct Connect, a VPN connection, or VPC peering. Corporate firewall policies generally allow minimal access, which keeps the number of permitted IPs small.

When launching an Amazon EMR cluster, this solution specifies the secondary private IP of the primary (master) and core nodes to ensure compliance with the corporate security policies. It uses a bootstrap action to automate the assignment of a static private IP to each node of the Amazon EMR cluster.

The following diagram shows how Amazon EMR with static private IPs communicates with on-premises servers that have a restricted access policy. VPC A simulates the on-premises data center.

 

EMR solution diagram

Walkthrough

This solution follows these steps in order:

  • Create VPC A as the source VPC, with a private subnet.
    • If you have an existing VPC, you can use it. To create a new VPC, refer to the AWS CloudFormation template in this GitHub repo. This post assumes that VPC A simulates the on-premises network that the Amazon EMR cluster accesses.
  • Create VPC B as the Amazon EMR VPC, with a private subnet. To create a new VPC, you can refer to the AWS CloudFormation template in this GitHub repo.
    • To configure VPC peering between the two VPCs, their CIDRs must not overlap.
  • Add specific IPs to the inbound rule of VPC A’s security group to allow access from them. These IPs will be the secondary private IPs of the primary (master) and core nodes of the Amazon EMR cluster.
  • Set up VPC peering between VPC A and VPC B.
    • To connect an on-premises data center to a VPC, you can use AWS Direct Connect or a VPN connection. However, this post uses two VPCs: one for Amazon EMR and one for the simulated network of your data center.
  • To test this solution, launch Amazon EC2 instances in the private subnet of VPC A, and create a Hadoop cluster on top of them. The HDFS on this Hadoop cluster is the data source that Amazon EMR accesses.
  • Download the following Python script and upload it to your Amazon S3 bucket.
  • Launch the Amazon EMR cluster in the private subnet of VPC B using an AWS CloudFormation stack.
  • When you launch the Amazon EMR cluster:
    • Enter the location of the Python script in the bootstrap action.
    • Pass the static private IPs in the optional arguments field of the bootstrap action to assign the secondary private IPs to Amazon EMR’s primary (master) and core nodes. This assumes that these secondary private IPs are already allowed to access VPC A, that is, the inbound rule of the security group is set up.
  • To launch the Amazon EMR cluster with this bootstrap action, you can use the following AWS CloudFormation template.
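
The requirement that the two VPC CIDRs do not overlap can be checked before you create the peering connection. The following is a minimal sketch using the ipaddress module from the Python 3 standard library. The CIDR values and the function name are hypothetical examples, not values taken from the CloudFormation templates.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share at least one address."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    return net_a.overlaps(net_b)


# Hypothetical CIDRs for VPC A (simulated on-premises) and VPC B (Amazon EMR).
VPC_A_CIDR = "10.0.0.0/16"
VPC_B_CIDR = "10.1.0.0/16"

if cidrs_overlap(VPC_A_CIDR, VPC_B_CIDR):
    print("CIDRs overlap; choose different ranges before peering")
else:
    print("CIDRs do not overlap")
```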

The following Python code (Python 2 syntax) is the bootstrap action for the Amazon EMR cluster.

#!/usr/bin/python
#
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

import sys, subprocess
import time

is_master = subprocess.check_output(['cat /emr/instance-controller/lib/info/instance.json | jq .isMaster'], shell=True).strip()
cluster_id = subprocess.check_output(['cat /mnt/var/lib/info/job-flow.json | jq -r .jobFlowId'], shell=True).strip()
core_count = subprocess.check_output(['aws emr describe-cluster --cluster-id %s | jq -r \'.[].InstanceGroups[] | select( .InstanceGroupType == "CORE") | .RequestedInstanceCount\'' % cluster_id], shell=True).strip()
instance_group_id = subprocess.check_output(['cat /mnt/var/lib/info/instance.json | jq -r .instanceGroupId'], shell=True).strip()
instance_group_type = subprocess.check_output(['cat /mnt/var/lib/info/job-flow.json | jq -r \'.instanceGroups | .[] | select( .instanceGroupId == \"%s\") | .instanceRole\' | tr a-z A-Z' % instance_group_id], shell=True).strip()
instance_id = subprocess.check_output(['/usr/bin/curl -s http://169.254.169.254/latest/meta-data/instance-id'], shell=True)
interface_id = subprocess.check_output(['aws ec2 describe-instances --instance-ids %s | jq .Reservations[].Instances[].NetworkInterfaces[].NetworkInterfaceId' % instance_id], shell=True).strip().strip('"')
current_tag_name = subprocess.check_output(['aws ec2 describe-tags --filters Name=resource-id,Values=%s | jq -r \'.Tags | .[] | select( .Key == \"Name\") | .Value\'' % instance_id], shell=True).strip().strip('"')

#Create Tags for EMR Master/Core Instances
subprocess.check_call(['aws ec2 create-tags --resources %s --tags Key=Name,Value=%s-%s' % (instance_id, current_tag_name, instance_group_type)], shell=True)


# sys.argv[1] = master node IP, sys.argv[2..n] = core node IPs (one per core node)
if len(sys.argv) != int(core_count)+2:
    print "Expected 1 master IP and %s core IPs as arguments" % core_count
    sys.exit(2)

if is_master == "true":
    print "This is the MASTER node"
    private_ip = str(sys.argv[1])
    
    #Assign private IP to the MASTER instance:
    subprocess.check_call(['aws ec2 assign-private-ip-addresses --network-interface-id %s --private-ip-addresses %s' % (interface_id, private_ip)], shell=True)
    subnet_id = subprocess.check_output(['aws ec2 describe-instances --instance-ids %s | jq .Reservations[].Instances[].NetworkInterfaces[].SubnetId' % instance_id], shell=True).strip().strip('"')
    subnet_cidr = subprocess.check_output(['aws ec2 describe-subnets --subnet-ids %s | jq .Subnets[].CidrBlock' % subnet_id], shell=True).strip().strip('"')
    cidr_prefix = subnet_cidr.split("/")[1]

    #Get CORE Instance ID in parameter store
    PARAMETERS = subprocess.check_output(['aws ssm describe-parameters --parameter-filters "Key=tag:Resource,Values=CORES" --query "Parameters[*]"|jq -r .[].Name'], shell=True).split()
    #Wait till all CORE instances get registered in parameter store. 
    while len(PARAMETERS) < int(core_count):
        time.sleep(1)
        PARAMETERS = subprocess.check_output(['aws ssm describe-parameters --parameter-filters "Key=tag:Resource,Values=CORES" --query "Parameters[*]"|jq -r .[].Name'], shell=True).split()
        print('Core nodes = %s'%len(PARAMETERS))

    # Write each CORE node's static IP into its parameter
    for i in range(len(PARAMETERS)):
        subprocess.check_call(['aws ssm put-parameter --name %s --value %s --type String --overwrite' % (PARAMETERS[i],sys.argv[i+2])], shell=True)

    #Add the private IP address to the default network interface:
    subprocess.check_call(['sudo ip addr add dev eth0 %s/%s' % (private_ip, cidr_prefix)], shell=True)

    #Configure iptables rules such that traffic is redirected from/to the secondary to/from the primary IP address:
    primary_ip = subprocess.check_output(['/sbin/ifconfig eth0 | grep \'inet \' | cut -d: -f2 | awk \'{ print $2}\''], shell=True).strip()
    subprocess.check_call(['sudo iptables -t nat -A PREROUTING -d %s -j DNAT --to-destination %s' % (private_ip, primary_ip)], shell=True)
    subprocess.check_call(['sudo iptables -t nat -A POSTROUTING -s %s -j SNAT --to %s' % (primary_ip, private_ip)], shell=True)

else:
    print "This is the CORE node"
    
    #Add system parameter store for ip of CORE nodes
    subprocess.check_call(['aws ssm put-parameter --name %s --value "blank" --type String --tags Key=Resource,Value=CORES' % instance_id], shell=True)
    private_ip = subprocess.check_output(['aws ssm get-parameters --names %s --query "Parameters[*]"|jq -r .[].Value' % instance_id], shell=True).strip().strip('"')
    print('Private IP = %s'%private_ip)

    #Wait until MASTER registers the IPs of all CORE nodes in parameter store
    while private_ip == 'blank':
        time.sleep(1)
        private_ip = subprocess.check_output(['aws ssm get-parameters --names %s --query "Parameters[*]"|jq -r .[].Value' % instance_id], shell=True).strip().strip('"')
        print('Private IP = %s'%private_ip)

    #Assign private IP to the CORE instance:
    subprocess.check_call(['aws ec2 assign-private-ip-addresses --network-interface-id %s --private-ip-addresses %s' % (interface_id, private_ip)], shell=True)
    subnet_id = subprocess.check_output(['aws ec2 describe-instances --instance-ids %s | jq .Reservations[].Instances[].NetworkInterfaces[].SubnetId' % instance_id], shell=True).strip().strip('"')
    subnet_cidr = subprocess.check_output(['aws ec2 describe-subnets --subnet-ids %s | jq .Subnets[].CidrBlock' % subnet_id], shell=True).strip().strip('"')
    cidr_prefix = subnet_cidr.split("/")[1]

    #Add the private IP address to the default network interface:
    subprocess.check_call(['sudo ip addr add dev eth0 %s/%s' % (private_ip, cidr_prefix)], shell=True)

    #Configure iptables rules such that traffic is redirected from/to the secondary to/from the primary IP address:
    primary_ip = subprocess.check_output(['/sbin/ifconfig eth0 | grep \'inet \' | cut -d: -f2 | awk \'{ print $2}\''], shell=True).strip()
    subprocess.check_call(['sudo iptables -t nat -A PREROUTING -d %s -j DNAT --to-destination %s' % (private_ip, primary_ip)], shell=True)
    subprocess.check_call(['sudo iptables -t nat -A POSTROUTING -s %s -j SNAT --to %s' % (primary_ip, private_ip)], shell=True)

The main idea of this solution is to use the Parameter Store of AWS Systems Manager, a centralized store for configuration data. The primary (master) node uses it to hand each core node its static private IP.

This Python script performs the following actions:

  • Read the specified private IPs from the sys.argv array, which holds the optional arguments of the bootstrap action.
  • Determine whether the node on which the script is running is the primary (master) node or a core node.
  • Put the specified private IPs into the Parameter Store of AWS Systems Manager.
  • Assign the specified private IPs as secondary private IPs to the Amazon EC2 instances of the primary (master) and core nodes of Amazon EMR.
  • Add the routing rules to iptables on each Amazon EC2 instance.
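
The coordination between nodes can be hard to follow from the script alone. The sketch below simulates the handshake in a single Python 3 process, using a plain dictionary as a stand-in for Parameter Store. The instance IDs, IPs, and function names are hypothetical, no AWS calls are made, and the parameter names are sorted only to make the sketch deterministic (the script uses whatever order describe-parameters returns).

```python
# In-memory stand-in for Parameter Store: parameter name -> value.
parameter_store = {}


def core_register(instance_id):
    """Core node, step 1: publish a placeholder keyed by its instance ID."""
    parameter_store[instance_id] = "blank"


def master_assign(core_ips, expected_core_count):
    """Master node: once every core node has registered, hand out the IPs.

    Returns False while some core nodes are still missing; the bootstrap
    script sleeps and polls again in that case.
    """
    registered = sorted(parameter_store)
    if len(registered) < expected_core_count:
        return False
    for name, ip in zip(registered, core_ips):
        parameter_store[name] = ip
    return True


def core_poll(instance_id):
    """Core node, step 2: return the assigned IP, or None while still 'blank'."""
    value = parameter_store[instance_id]
    return None if value == "blank" else value


# Hypothetical instance IDs and static IPs for two core nodes.
core_ips = ["10.1.1.11", "10.1.1.12"]
core_register("i-0aaa")
print(master_assign(core_ips, expected_core_count=2))  # only one core registered
core_register("i-0bbb")
print(master_assign(core_ips, expected_core_count=2))  # both cores registered
print(core_poll("i-0aaa"), core_poll("i-0bbb"))
```

After a core node reads a value other than "blank", it assigns that IP to its network interface, as in the else branch of the script.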

Prerequisites

  • An AWS account
  • Two VPCs with private and public subnets. You can use the sample AWS CloudFormation template in this GitHub repo to create a VPC.
  • A security group and inbound rule for VPC A. The source of the inbound rule is each static private IP of the Amazon EMR cluster’s primary (master) and core nodes.
  • A key pair and an Amazon EC2 instance in the public subnet of each VPC to serve as the bastion host.
  • An Elastic IP and a NAT gateway in the public subnet of each VPC.
  • An Amazon S3 VPC endpoint in the private subnet of VPC B.
  • A Hadoop cluster in the private subnet of VPC A (optional).
  • A VPC peering connection between VPC A and VPC B.
  • A Python script stored in Amazon S3 for the bootstrap action.

Create Amazon EMR with the bootstrap action

To launch Amazon EMR, you can use an AWS CloudFormation stack that assigns a static private IP to the primary (master) and core nodes of Amazon EMR.

1. Log in to the AWS Management Console using your AWS account.

2. To launch an Amazon EMR cluster with a static private IP, choose Launch Stack.

launch stack button

3. Select the Region where you want to run your Amazon EMR cluster.

4. Enter your parameter values.

The following screenshot shows an example of the AWS CloudFormation stack parameters.

parameters image
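
Before entering the static IPs as stack parameters, it may help to confirm that each one can be assigned in the private subnet of VPC B. The following Python 3 sketch checks that a candidate IP lies inside the subnet and avoids the addresses AWS reserves in every subnet (the first four and the last one). The subnet CIDR, candidate IPs, and function name are hypothetical, and the sketch does not check whether an address is already in use.

```python
import ipaddress


def usable_in_subnet(ip, subnet_cidr):
    """Return True if ip is inside subnet_cidr and not reserved by AWS.

    AWS reserves the first four addresses and the last address of every
    subnet, so those cannot be assigned as a secondary private IP.
    """
    subnet = ipaddress.ip_network(subnet_cidr)
    address = ipaddress.ip_address(ip)
    if address not in subnet:
        return False
    offset = int(address) - int(subnet.network_address)
    return 4 <= offset < subnet.num_addresses - 1


# Hypothetical private subnet of VPC B and candidate static IPs.
SUBNET_CIDR = "10.1.1.0/24"
for candidate in ["10.1.1.10", "10.1.1.3", "10.1.2.10"]:
    print(candidate, usable_in_subnet(candidate, SUBNET_CIDR))
```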

5. Verify the primary (master) and core nodes of the Amazon EMR cluster created from the stack.

Amazon EMR image

6. Confirm that the secondary private IP of each Amazon EMR primary (master) and core node is set to the IP that you entered as a parameter in the AWS CloudFormation stack.

security group image
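
The same check can be scripted instead of read from the console. The following Python 3 sketch extracts the secondary private IPs from a response shaped like the output of aws ec2 describe-instances. The function name and the sample data are hypothetical. In practice the dictionary would come from the AWS CLI or an SDK.

```python
def secondary_private_ips(describe_instances_response):
    """Map each instance ID to the list of its non-primary private IPs."""
    result = {}
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            ips = []
            for eni in instance.get("NetworkInterfaces", []):
                for entry in eni.get("PrivateIpAddresses", []):
                    if not entry.get("Primary", False):
                        ips.append(entry["PrivateIpAddress"])
            result[instance["InstanceId"]] = ips
    return result


# Hand-written sample in the shape of a describe-instances response;
# the instance ID and IPs are made up.
sample = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0123456789abcdef0",
                    "NetworkInterfaces": [
                        {
                            "PrivateIpAddresses": [
                                {"Primary": True, "PrivateIpAddress": "10.1.1.57"},
                                {"Primary": False, "PrivateIpAddress": "10.1.1.10"},
                            ]
                        }
                    ],
                }
            ]
        }
    ]
}

print(secondary_private_ips(sample))
```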

7. Set up the security group inbound rule of VPC A to allow access from the static private IPs of the Amazon EMR primary (master) and core nodes.

Hadoop security group image
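
As an alternative to the console, the inbound rule could be built programmatically. The sketch below (Python 3) only constructs the rule structure that the EC2 AuthorizeSecurityGroupIngress API expects. The function name and IPs are hypothetical, and port 9000 is an assumption taken from the NameNode address in the distcp command later in this post. A real Hadoop cluster would likely need its DataNode ports opened as well.

```python
def build_ingress_permissions(static_ips, port):
    """Build one TCP rule that allows each static IP as a /32 source."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": "%s/32" % ip} for ip in static_ips],
        }
    ]


# Hypothetical static private IPs of the primary (master) and core nodes.
permissions = build_ingress_permissions(["10.1.1.10", "10.1.1.11"], 9000)
print(permissions)

# With boto3 installed and credentials configured, the rule could be applied as:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="<security group ID in VPC A>",
#       IpPermissions=permissions,
#   )
```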

8. Set up the VPC peering between VPC A and VPC B.

Peering connection

Use case using this solution

You can now transfer data over the specified private IPs between servers in the on-premises data center (or another VPC) and the Amazon EMR cluster running in its own VPC.

1. Log in to the primary (master) node of the Amazon EMR cluster through the bastion host in the public subnet.

2. Copy data from the Hadoop HDFS to Amazon S3 through the Amazon EMR cluster with the static private IPs.

hadoop distcp hdfs://<source ip>:9000/<filesystem name>/ s3a://<bucket name>/

Hadoop distcp hdfs
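
If the copy needs to be scripted, the command can be assembled from its parts. The following Python 3 sketch builds the argument list for the distcp command shown above. The function name, source IP, path, and bucket name are hypothetical placeholders.

```python
def build_distcp_command(source_ip, source_path, bucket, port=9000):
    """Return the argument list for copying an HDFS path to an S3 bucket."""
    source = "hdfs://%s:%d/%s/" % (source_ip, port, source_path.strip("/"))
    target = "s3a://%s/" % bucket
    return ["hadoop", "distcp", source, target]


# Hypothetical NameNode IP in VPC A, HDFS path, and target bucket.
command = build_distcp_command("10.0.1.20", "data", "example-bucket")
print(" ".join(command))

# On the primary (master) node, the copy could then be started with:
#   import subprocess
#   subprocess.check_call(command)
```

Passing the arguments as a list avoids shell quoting issues if a path or bucket name contains special characters.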

3. Check the copied data in the Amazon S3 bucket.

Copied data in S3

Cleaning up

To avoid incurring future charges, delete the AWS CloudFormation stack and any additional resources.
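
Besides the stack, the bootstrap script itself creates one Parameter Store entry per core node (named after the instance ID and tagged Resource=CORES), which deleting the stack would presumably not remove. The following Python 3 sketch shows one way to clean them up. It assumes that a boto3 Systems Manager client is passed in. FakeSsm is a hypothetical stub used only so the example runs without AWS access.

```python
def delete_core_parameters(ssm_client):
    """Delete every parameter tagged Resource=CORES; return the deleted names."""
    names = []
    kwargs = {"ParameterFilters": [{"Key": "tag:Resource", "Values": ["CORES"]}]}
    while True:
        page = ssm_client.describe_parameters(**kwargs)
        names.extend(p["Name"] for p in page.get("Parameters", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    # delete_parameters accepts at most 10 names per call.
    for start in range(0, len(names), 10):
        ssm_client.delete_parameters(Names=names[start:start + 10])
    return names


class FakeSsm(object):
    """Hypothetical stand-in for boto3.client("ssm"), for a local dry run."""

    def __init__(self, names):
        self.names = list(names)
        self.deleted = []

    def describe_parameters(self, **kwargs):
        return {"Parameters": [{"Name": n} for n in self.names]}

    def delete_parameters(self, Names):
        self.deleted.extend(Names)
        return {"DeletedParameters": Names, "InvalidParameters": []}


fake = FakeSsm(["i-0aaa", "i-0bbb"])
print(delete_core_parameters(fake))
```

With boto3 installed and credentials configured, the call would be delete_core_parameters(boto3.client("ssm")).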

Conclusion

In this post, I showed how you can create an Amazon EMR cluster with a static private IP in a private subnet. This solution uses a bootstrap action to automatically assign a static private IP to both the primary (master) and core nodes of the Amazon EMR cluster at launch time. To learn more, I also recommend this post: Access web interfaces securely on Amazon EMR launched in a private subnet using an Application Load Balancer.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Incheol Roh

Incheol Roh is a Solutions Architect based in Seoul. With database and data analytics experience in various industries, he has been working with his customers to build efficient architectures to help them achieve data-driven business outcomes.