AWS Big Data Blog

Launching and Running an Amazon EMR Cluster inside a VPC

Daniel Garrison is a Big Data Support Engineer for Amazon Web Services

Introduction

With Amazon EC2 now firmly in the VPC-by-default model, it’s important to understand the ins and outs of running your Amazon EMR cluster successfully inside the Amazon VPC environment. In this post, we’ll explore the requirements for Hadoop to operate inside the EC2 VPC environment. Then we’ll create a new VPC environment and launch an EMR cluster. (Once you’re through reading this post, be sure to check out Part II which deals with custom DNS).

Hadoop Communication Requirements

According to the Hadoop wiki, “For Hadoop to work, all these machines must be able to find each other, to talk to each other, and indeed, simply identify themselves so that other machines in the cluster can find them.” In EMR, the ability for machines to find and talk to each other is provided by the DNS resolution and DNS hostname VPC settings. With these default settings enabled, your instances are automatically assigned hostnames derived from their public and private IP addresses, with the dots replaced by dashes (for example, ip-10-128-8-1.ec2.internal). DNS resolution to both internal and external resources is provided by the internal DNS service. Instances are able to communicate using EMR-managed security groups.
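That dash-for-dot naming convention is easy to reproduce. The following sketch is plain string manipulation based on the pattern above, not an AWS API call; it derives the internal hostname from a private IP and region:

```python
def internal_hostname(private_ip, region):
    """Derive the EC2-style internal hostname for a private IP.

    us-east-1 uses the ec2.internal suffix; other regions use
    <region>.compute.internal.
    """
    suffix = "ec2.internal" if region == "us-east-1" else region + ".compute.internal"
    return "ip-" + private_ip.replace(".", "-") + "." + suffix

print(internal_hostname("10.128.8.1", "us-east-1"))      # ip-10-128-8-1.ec2.internal
print(internal_hostname("192.168.128.6", "us-west-2"))   # ip-192-168-128-6.us-west-2.compute.internal
```

This is the same name you'll see the VPC DNS server return in the nslookup examples later in this post.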

Hadoop 2 Requirements

In Hadoop 1, the communication requirements are simple and (mostly) forgiving. For example, HDFS DataNodes in Hadoop 1 that can’t be resolved by the NameNode to a fully qualified domain name fall back to the IP address and continue to communicate. Unless there is an issue in the environment, there is not much that will stop a properly configured Hadoop 1 environment.

As the Hadoop project matured, so did the requirements for a more robust security model.  Enhancements across the product brought in features such as Kerberos authentication, network encryption, and items focused on preventing unwanted nodes from joining a cluster, such as the following HDFS configuration parameter:

dfs.namenode.datanode.registration.ip-hostname-check

In Hadoop 2, a DataNode that can't be resolved by the NameNode is prevented from communicating:

org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode denied communication with namenode

In essence, the technical requirements have not changed from Hadoop 1 to Hadoop 2. Nodes must still find and communicate with one another to function properly. However, in Hadoop 2, there are consequences to having an improperly configured environment. These consequences can range from HDFS not starting to log files full of UnknownHost, ConnectionRefused, and NoRouteToHost exceptions.
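The resolution requirement behind dfs.namenode.datanode.registration.ip-hostname-check boils down to a round-trip lookup: the DataNode's IP must reverse-resolve to a hostname that forward-resolves back to the same IP. The sketch below is an illustration in Python, not the NameNode's actual Java implementation; the resolver functions are injectable so the logic can be exercised without a live DNS server:

```python
import socket

def resolvable_both_ways(ip,
                         reverse=socket.gethostbyaddr,
                         forward=socket.gethostbyname):
    """Approximate Hadoop 2's registration check: the node's IP must
    reverse-resolve to a name that forward-resolves back to the same IP.
    """
    try:
        hostname = reverse(ip)[0]          # reverse lookup: IP -> name
        return forward(hostname) == ip     # forward lookup: name -> IP
    except (socket.herror, socket.gaierror):
        return False                       # NXDOMAIN or lookup failure
```

A node that fails this check in a real cluster is the one that triggers the DisallowedDatanodeException shown above.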

VPC-specific Settings

EMR sits outside of your VPC and must be able to communicate with your nodes at all times. The instances in your cluster need to locate (with DNS) and communicate (using permissive security groups and network ACLs) with the EMR and Amazon S3 service endpoints. This process facilitates cluster creation, Amazon CloudWatch monitoring, log synchronizing, and the step submission process. To accomplish this, EMR requires an Internet gateway.

NAT instances are not supported as the default route, nor are VGWs. Attempting to launch a cluster in either of these scenarios causes the cluster to terminate with an error:

"The subnet configuration was invalid: Cannot find route to InternetGateway in main RouteTable rtb-036bc666 for vpc vpc-a33690c6"

Additionally, EMR clusters launched in a subnet that uses a network ACL may unintentionally block this communication. Because network ACLs are stateless, the inbound and outbound rules for communication must be explicitly stated, unlike the stateful rules employed by the EMR-managed security groups. If an EMR cluster is launched in a subnet with an attached ACL and sufficient access is not provided, the cluster terminates. If you must use an ACL, use the EMR master security group as a guide for IP address and port ranges that must be able to communicate with the instance.

Understanding DNS Resolution and DNS Hostnames

Inside the VPC, the nodes need to be able to locate each other using DNS lookups. Otherwise, you will run into problems with crucial daemons such as the YARN ResourceManager and the HDFS NameNode. Because of the important role that name resolution holds in the cluster, any unnoticed DNS issues can take up a lot of administration time that would otherwise be spent elsewhere.

The DNS hostname setting in the VPC controls whether your instances receive a fully qualified DNS name automatically. If you are using a default configuration with AmazonProvidedDNS as your DNS server, disabling this setting also prevents your instances from resolving any hostname located in the CIDR range of your VPC.

Consider an instance launched with DNS hostnames enabled, doing a simple lookup against another instance in the VPC:

[ec2-user@ip-192-168-128-5 ~]$ nslookup 192.168.128.6
Server:		192.168.0.2
Address:	192.168.0.2#53
	
Non-authoritative answer:
6.128.168.192.in-addr.arpa	name = ip-192-168-128-6.us-west-2.compute.internal.

Authoritative answers can be found from:
	
[ec2-user@ip-192-168-128-5 ~]$ nslookup ip-192-168-128-6.us-west-2.compute.internal
Server:		192.168.0.2
Address:	192.168.0.2#53

Non-authoritative answer:
Name:	ip-192-168-128-6.us-west-2.compute.internal
Address: 192.168.128.6
	
[ec2-user@ip-192-168-128-5 ~]$

We disable DNS hostname support, reboot, and perform the same lookups again:

aws ec2 modify-vpc-attribute --vpc-id vpc-a33690c6 --no-enable-dns-hostnames
[ec2-user@ip-192-168-128-5 ~]$ nslookup 192.168.128.6
Server:		192.168.0.2
Address:	192.168.0.2#53

** server can't find 6.128.168.192.in-addr.arpa.: NXDOMAIN

[ec2-user@ip-192-168-128-5 ~]$ nslookup ip-192-168-128-6.us-west-2.compute.internal
Server:		192.168.0.2
Address:	192.168.0.2#53

** server can't find ip-192-168-128-6.us-west-2.compute.internal: NXDOMAIN

This time, our forward and reverse lookups fail. Note, however, that lookups to the outside world still succeed.

[ec2-user@ip-192-168-128-5 ~]$
[ec2-user@ip-192-168-128-5 ~]$ nslookup www.google.com
Server:		192.168.0.2
Address:	192.168.0.2#53
Non-authoritative answer:
Name:	www.google.com
Address: 216.58.216.132

Turning off DNS resolution now prevents us from resolving any queries.

aws ec2 modify-vpc-attribute --vpc-id vpc-a33690c6 --no-enable-dns-support

[ec2-user@ip-172-31-15-59 ~]$ nslookup google.com
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached

When crucial daemons have DNS issues, they are unable to start. Without all of the necessary systems running, EMR does not allow the cluster to begin working. This behavior shows up in your logs as follows:

SHUTDOWN_MSG: Shutting down ResourceManager at java.net.UnknownHostException: ip-192-168-128-13.hadoop.local: ip-192-168-128-13.hadoop.local: Name or service not known

SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: ip-192-168-128-13.hadoop.local: ip-192-168-128-13.hadoop.local

Issues stemming from DNS errors lead to clusters terminating with errors like the following:

"On the master instance (i-b3b1e3bf), after bootstrap actions were run Hadoop failed to launch"

Another example:

On 2 slave instances (including i-92f8aa9e and i-8cf8aa80), after bootstrap actions were run Hadoop failed to launch.

EMR assumes that your EC2 instance will be assigned the default internal hostname (ec2.internal or region.compute.internal), or an IP address (if you have DNS hostnames disabled in your VPC configuration). Hadoop 1 will not generate an error when starting up using a custom domain name because there is no requirement that nodes must employ DNS resolution before starting.

In Hadoop 2, use of a custom domain name causes the cluster configuration values to be populated with hostnames that can’t be properly resolved. This in turn causes EMR to terminate the cluster because key daemons failed to start. Hadoop 2 clusters can use a custom domain name if the cluster can resolve the custom values. This can be accomplished by using a custom DNS server with your VPC.

Create a VPC for EMR

Now that we have a solid understanding of Hadoop’s and EMR’s mutual requirements, we can create a VPC in which to launch a cluster. You can either use the VPC wizard in the console, or follow along with the CLI steps below.

Our VPC will use a small /24 range, with our subnet using a /28 netmask.

aws ec2 create-vpc --cidr-block 10.20.30.0/24
{
    "Vpc": {
        "InstanceTenancy": "default",
        "State": "pending",
        "VpcId": "vpc-055ef660",
        "CidrBlock": "10.20.30.0/24",
        "DhcpOptionsId": "dopt-a8c1c9ca"
    }
}

Let’s create a subnet to go along with the VPC, and specify the VPC ID and the range for the subnet. In this example, we will use 10.20.30.0/28.

aws ec2 create-subnet --vpc-id vpc-055ef660 --cidr-block 10.20.30.0/28

The subnet ID and number of IP addresses will be returned. Take note of these.

{
    "Subnet": {
        "VpcId": "vpc-055ef660",
        "CidrBlock": "10.20.30.0/28",
        "State": "pending",
        "AvailabilityZone": "us-west-2a",
        "SubnetId": "subnet-907af9f5",
        "AvailableIpAddressCount": 11
    }
}
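The AvailableIpAddressCount of 11 follows from simple CIDR arithmetic: a /28 contains 16 addresses, and AWS reserves 5 in every subnet (the network address, the next three host addresses, and the last address). A quick sanity check with Python's ipaddress module:

```python
import ipaddress

subnet = ipaddress.ip_network("10.20.30.0/28")
total = subnet.num_addresses   # 16 addresses in a /28

# AWS reserves 5 addresses per subnet: the network address, the
# router, the DNS server, one reserved for future use, and the
# last address in the range.
available = total - 5

print(total, available)   # 16 11
```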

We need a route table with a public Internet gateway. We will issue the create-route-table command with the VPC ID from earlier. Next, we will create the default route with the create-route command.

aws ec2 create-route-table --vpc-id vpc-055ef660

Take note of the route table ID that is returned.

{
    "RouteTable": {
        "Associations": [],
        "RouteTableId": "rtb-4640f623",
        "VpcId": "vpc-055ef660",
        "PropagatingVgws": [],
        "Tags": [],
        "Routes": [
            {
                "GatewayId": "local",
                "DestinationCidrBlock": "10.20.30.0/24",
                "State": "active",
                "Origin": "CreateRouteTable"
            }
]
    }
}

Because EMR requires one, we will create a new Internet gateway.

aws ec2 create-internet-gateway
{ 
    "InternetGateway": {
        "Tags": [],
        "InternetGatewayId": "igw-24469141",
        "Attachments": []
    }
}

Take note of the Internet gateway ID that is returned. Next, we attach the Internet gateway to the VPC:

aws ec2 attach-internet-gateway --internet-gateway-id igw-24469141 --vpc-id vpc-055ef660

We will make the Internet gateway the default route using the Internet gateway ID and route table ID from earlier.

aws ec2 create-route --route-table-id rtb-4640f623 --destination-cidr-block 0.0.0.0/0 --gateway-id igw-24469141

And associate the route table with our subnet:

aws ec2 associate-route-table --route-table-id rtb-4640f623 --subnet-id subnet-907af9f5

We need to check to see if DNS hostnames are enabled. If they are not, we will enable them.

aws ec2 describe-vpc-attribute --vpc-id vpc-055ef660 --attribute enableDnsHostnames
{                                                                      
    "VpcId": "vpc-055ef660",
    "EnableDnsHostnames": {
        "Value": false
    }
}
aws ec2 modify-vpc-attribute --vpc-id vpc-055ef660 --enable-dns-hostnames

Finally, we have everything in place to launch a cluster inside of the VPC successfully. We can use the following command to launch a test cluster running the word-count job, substituting your own S3 bucket name for the output location.

aws emr create-cluster --steps Type=STREAMING,Name='Streaming Program',ActionOnFailure=CONTINUE,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://mybucket/wordcount/output] --use-default-roles --ec2-attributes SubnetId=subnet-907af9f5 --ami-version 3.3.2 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

The cluster ID is returned.

{
    "ClusterId": "j-2TEFHMDR3LXWD"
}

After several minutes, check to see that your cluster finished the word count job.

aws emr describe-cluster --cluster-id j-2TEFHMDR3LXWD --query Cluster.Status.StateChangeReason.Message
"Steps completed" 

And there you have it! We have explored the requirements of Hadoop and seen how VPC settings work together with EC2 to make the EMR service more effective. By creating a new VPC, we saw how the default settings provide the communication and discovery requirements for the instances that comprise the Hadoop cluster, and how modifying these settings affects EMR.

Next Steps!

Please see Part II, where we explore how to configure a DNS server to launch an EMR cluster with a custom private domain name in a VPC.

If you have questions or suggestions, please leave a comment below.

—————————————————————

Using IPython Notebook to Analyze Data with EMR

Getting Started with Elasticsearch and Kibana on EMR

Strategies for Reducing your EMR Costs

———————————————————-

Love to work on open source? Check out EMR’s careers page.