AWS Big Data Blog

Launching and Running an Amazon EMR Cluster in your VPC – Part 2: Custom DNS

Daniel Garrison is a Big Data Support Engineer for Amazon Web Services

In Part 1 you learned how Amazon EMR uses Amazon VPC DNS hostname and DHCP settings to satisfy Hadoop's requirements. Because it's common to change the domain name setting in your DHCP options set to a custom internal domain name, this post explores how to configure several DNS services, such as BIND, Dnsmasq, and Amazon Route 53, to work successfully with EMR. We'll perform the following steps:

  1. Create a BIND server configuration for our VPC.
  2. Modify the created VPC to use a custom domain name and DNS server.
  3. Install and configure Dnsmasq to resolve hostname queries for our custom domain.
  4. Configure Route 53 to host our private zone.

As in the earlier post, we will then launch a simple test cluster to ensure that everything starts and runs appropriately.

A Quick Note on DNS Before Getting Started

Auto-generating DNS hostname records requires a DHCP server that performs dynamic DNS updates. The DHCP options in a VPC can't do this, so you must manually list every possible IP address record for each subnet you launch a cluster into. Additionally, to satisfy Hadoop's DNS requirements, both forward and reverse lookups must succeed. This means you need a forward zone for the hostname A records and a reverse lookup zone for the IP PTR records. For this blog post, we use a small subnet (11 usable host records); plan according to the size of your subnet and cluster footprint. Finally, if the VPC DNS service can't resolve the domain name in your DHCP options set (and unless it's the default region.compute.internal domain name or a zone hosted in Route 53, it can't!), something else has to resolve it.
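
For example, assuming a node at 10.20.30.6 in the hadoop.local domain used later in this post, both of the following lookups must return an answer before that node can successfully join a cluster:

dig +short ip-10-20-30-6.hadoop.local
dig +short -x 10.20.30.6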

OK, let’s get started!

Creating a BIND Server Configuration

If you have an existing DNS server, you can simply configure the necessary host and IP address records. This section shows an example BIND server configuration. Due to the tricky nature of setting up BIND servers, we won’t show a step-by-step install. We assume that you have an existing BIND server already configured and running some of your name resolution services.

If you are new to BIND, note that it is extremely sensitive to errors in zone files. If you wish to set this up from scratch, check out this step-by-step guide for setting up a private DNS server with Ubuntu. Otherwise, you can skip directly to the section below, where you learn how to configure Dnsmasq. Although we use a BIND server as an example, the same approach works with any DNS server.
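
Because of that sensitivity, it's worth validating the configuration and zone files before reloading BIND. Assuming the file names used in this post, the named-checkconf and named-checkzone tools that ship with BIND will catch most syntax errors:

named-checkconf /etc/named.conf
named-checkzone hadoop.local /var/named/hadoop.zone
named-checkzone 30.20.10.in-addr.arpa /var/named/reverse.zone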

This example BIND server is configured to listen for queries on both the local loopback address and the private internal IP address:

	listen-on port 53 { 127.0.0.1; 10.20.30.5; };

All IP addresses in the VPC are allowed to make recursive queries in order to discover unknown hostnames. Keep these settings as restrictive as possible to avoid creating an open resolver that could be abused in DNS amplification attacks:

recursion yes;
allow-recursion { 10.20.30.0/24; };
allow-query { 10.20.30.0/24; };

Two zones are configured in named.conf: “hadoop.local” and “30.20.10.in-addr.arpa”.

//
// named.conf
//
// Provided by Red Hat bind package to configure the ISC BIND named(8) DNS
// server as a caching-only nameserver (as a localhost DNS resolver only).
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//
options {
    listen-on port 53 { 127.0.0.1; 10.20.30.5; };
//  listen-on-v6 port 53 { ::1; };
    directory   "/var/named";
    dump-file   "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
    allow-query     { 10.20.30.0/24; };
    recursion yes;
    allow-recursion    { 10.20.30.0/24; };
    dnssec-enable no;
    dnssec-validation no;
//  dnssec-lookaside auto;

    /* Path to ISC DLV key */
    bindkeys-file "/etc/named.iscdlv.key";

    managed-keys-directory "/var/named/dynamic";
};

logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};

zone "." IN {
    type hint;
    file "named.ca";
};

zone "hadoop.local" IN {
    type master;
    file "hadoop.zone";
};

zone "30.20.10.in-addr.arpa" IN {
    type master;
    file "reverse.zone";
};

include "/etc/named.rfc1912.zones";
include "/etc/named.root.key";

The forward zone, hadoop.local, contains a nameserver (NS) record pointing back to the BIND server at 10.20.30.5; all of the possible hostnames are defined as A records as follows:

$TTL 5M
@   IN SOA  ns.hadoop.local.    root.hadoop.local. (
                    0   ; serial
                    1D  ; refresh
                    1H  ; retry
                    1W  ; expire
                    3H )    ; minimum
    IN NS   ns.hadoop.local.
;host records

ns  IN  A   10.20.30.5
ip-10-20-30-6   IN  A   10.20.30.6
ip-10-20-30-7   IN  A   10.20.30.7
ip-10-20-30-8   IN  A   10.20.30.8
ip-10-20-30-9   IN  A   10.20.30.9
ip-10-20-30-10  IN  A   10.20.30.10
ip-10-20-30-11  IN  A   10.20.30.11
ip-10-20-30-12  IN  A   10.20.30.12
ip-10-20-30-13  IN  A   10.20.30.13
ip-10-20-30-14  IN  A   10.20.30.14
;Alias for nameserver
ip-10-20-30-5   IN  CNAME   ns

The same is true of the reverse zone, 30.20.10.in-addr.arpa: every possible IP address in the subnet is mapped back, via a PTR record, to the corresponding hostname defined in the hadoop.local zone.

$TTL 5M
@   IN SOA  ns.hadoop.local.    root.hadoop.local. (
                    0   ; serial
                    1D  ; refresh
                    1H  ; retry
                    1W  ; expire
                    3H )    ; minimum
    IN  NS  ns.hadoop.local.
;ptr records
5   IN  PTR ip-10-20-30-5.hadoop.local.
6   IN  PTR ip-10-20-30-6.hadoop.local.
7   IN  PTR ip-10-20-30-7.hadoop.local.
8   IN  PTR ip-10-20-30-8.hadoop.local.
9   IN  PTR ip-10-20-30-9.hadoop.local.
10  IN  PTR ip-10-20-30-10.hadoop.local.
11  IN  PTR ip-10-20-30-11.hadoop.local.
12  IN  PTR ip-10-20-30-12.hadoop.local.
13  IN  PTR ip-10-20-30-13.hadoop.local.
14  IN  PTR ip-10-20-30-14.hadoop.local.
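
Typing every possible record by hand becomes tedious for anything larger than this example. Here is a minimal bash sketch, assuming the 10.20.30.0/24 subnet and hadoop.local domain from this post, that generates the A and PTR records for a range of host addresses so you can paste them into the zone files:

#!/bin/bash
# Generate forward (A) and reverse (PTR) records for hosts .6 through .14;
# adjust the range and address prefix for your own subnet.
for i in $(seq 6 14); do
    printf 'ip-10-20-30-%s\tIN\tA\t10.20.30.%s\n' "$i" "$i" >> forward.records
    printf '%s\tIN\tPTR\tip-10-20-30-%s.hadoop.local.\n' "$i" "$i" >> reverse.records
done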

Modifying Your VPC to Use a Custom Domain Name and DNS Server

In the last post, we showed you how to create a working VPC. The next steps are to build a DNS server that resolves the hadoop.local zone, and to create a new DHCP options set with custom domain-name and domain-name-servers values. This DHCP options set will point at a custom DNS server located at 10.20.30.5. (Remember that AWS reserves the first four addresses and the last address in any subnet, so the first address available to you is .4; we use .5 in this example.)

Start by launching an Amazon EC2 instance with a private IP address of 10.20.30.5. Make sure to enter a key pair for SSH access. This example uses the latest Amazon Linux HVM EBS-backed AMI in the Oregon region; look up the AMI ID for your own region on the Amazon Linux AMI page:

aws ec2 run-instances --image-id ami-dfc39aef --count 1 --key-name <your-key-pair> --instance-type t2.micro --subnet-id subnet-907af9f5 --private-ip-address 10.20.30.5 --associate-public-ip-address
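
The run-instances call returns before the instance is ready, so you may want to wait until it is running before connecting. One way, assuming the instance ID returned in the output is i-22d6792f (the ID we terminate later in this post):

aws ec2 wait instance-running --instance-ids i-22d6792f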

Issue the following command to create a new DHCP options set with a domain name of ‘hadoop.local’ and the DNS servers set to ‘10.20.30.5, AmazonProvidedDNS’:

$ aws ec2 create-dhcp-options --dhcp-configurations Key=domain-name,Values=hadoop.local Key=domain-name-servers,Values=10.20.30.5,AmazonProvidedDNS
{
    "DhcpOptions": {
        "DhcpConfigurations": [
            {
                "Values": [
                    {
                        "Value": "hadoop.local"
                    }
                ],
                "Key": "domain-name"
            },
            {
                "Values": [
                    {
                        "Value": "10.20.30.5"
                    },
                    {
                        "Value": "AmazonProvidedDNS"
                    }
                ],
                "Key": "domain-name-servers"
            }
        ],
        "DhcpOptionsId": "dopt-793fdf1c"
    }
}

Now, associate the DHCP options set with the ID of the VPC that the cluster will live in:

aws ec2 associate-dhcp-options --dhcp-options-id dopt-793fdf1c --vpc-id vpc-055ef660

Modify the security group that the instance launched into to allow inbound DNS queries from instances in the VPC CIDR range. If you did not provide a security group ID to the run-instances command, this is the default security group of the VPC.

aws ec2 authorize-security-group-ingress --group-id sg-08bf9b6d --protocol udp --port 53 --cidr 10.20.30.0/24
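
DNS queries normally travel over UDP, but resolvers retry over TCP when a response is too large for a single UDP datagram, so you may also want to allow TCP port 53 from the same range:

aws ec2 authorize-security-group-ingress --group-id sg-08bf9b6d --protocol tcp --port 53 --cidr 10.20.30.0/24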

Installing and Configuring Dnsmasq to Resolve Hostname Queries

If you are not running a DNS server inside of your VPC and want to run an EMR cluster without additional administration, a lightweight DNS server like Dnsmasq works just fine. Often found on home routers and other devices where resources are limited, Dnsmasq can provide name caching and forwarding services. Version 2.67 added a feature called synth-domain that allows you to provide artificial A/PTR records for any IP/hostname range that you choose.

--synth-domain=<domain>,<address range>[,<prefix>]

This allows you to respond to any possible IP address or hostname query without specifying each one, and without being restricted to a single subnet whose zone records have already been defined. The amazon-linux yum repo only provides version 2.48; to use this functionality, you must enable an outside repository or compile from source.

Use SSH to connect to the newly created server and submit the following command to install the development tools (including a compiler):

[ec2-user ~]$ sudo yum -y groupinstall "Development Tools"

Copy and paste the following commands into the terminal to download the Dnsmasq 2.72 source code, unpack and build it, and link the installed binary back to the expected system sbin path.

wget http://www.thekelleys.org.uk/dnsmasq/dnsmasq-2.72.tar.gz
gunzip dnsmasq-*.gz
tar -xf dnsmasq-*.tar
cd dnsmasq-2.72
sudo make install
sudo ln -s /usr/local/sbin/dnsmasq /usr/sbin/dnsmasq

Use the command line to start a DNS server on the default listener port. This command provides resolution for the entire VPC CIDR range, 10.20.30.0/24 (which you provided to the create-vpc command in the first post). The server parameter forwards queries that Dnsmasq can't resolve to the Amazon-provided DNS server, located at the base of your VPC network range plus two (10.20.30.2).

sudo dnsmasq --interface=eth0 --listen-address=127.0.0.1 --synth-domain=hadoop.local,10.20.30.0/24,ip- --server=10.20.30.2
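
Running Dnsmasq from the command line is convenient for testing, but the settings will not survive a restart. Each long option maps directly to a line in a configuration file, so a minimal /etc/dnsmasq.conf equivalent to the command above would look like this:

interface=eth0
listen-address=127.0.0.1
synth-domain=hadoop.local,10.20.30.0/24,ip-
server=10.20.30.2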

From a separate instance inside the VPC, test that queries to this server are answered properly. Issue an nslookup command against a fictional instance name in the ip- hostname format, and then issue the reverse lookup against its IP address:

[ec2-user@ip-10-20-30-6 ~]$ nslookup ip-10-20-30-40.hadoop.local 10.20.30.5
Server:		10.20.30.5
Address:	10.20.30.5#53

Name:	ip-10-20-30-40.hadoop.local
Address: 10.20.30.40
[ec2-user@ip-10-20-30-6 ~]$ nslookup 10.20.30.40 10.20.30.5
Server:		10.20.30.5
Address:	10.20.30.5#53

40.30.20.10.in-addr.arpa	name = ip-10-20-30-40.hadoop.local

Both lookups resolved successfully; only one thing is left to test. Ensure that you can resolve an IP address for s3.amazonaws.com; this is a good test of your ability to resolve public DNS entries, especially AWS service endpoints.

[ec2-user@ip-10-20-30-6 ~]$ nslookup s3.amazonaws.com 10.20.30.5
Server:		10.20.30.5
Address:	10.20.30.5#53

Non-authoritative answer:
s3.amazonaws.com	canonical name = s3.a-geo.amazonaws.com.
s3.a-geo.amazonaws.com	canonical name = s3-2.amazonaws.com.
Name:	s3-2.amazonaws.com
Address: 54.231.244.4

Now that everything is in order, fire up another wordcount job to test.

aws emr create-cluster --bootstrap-actions Path=s3://support.elasticmapreduce/bootstrap-actions/other/sethostforvpc.sh --ec2-attributes SubnetId=subnet-907af9f5 --steps Type=STREAMING,Name='Streaming Program',ActionOnFailure=CONTINUE,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://<your-bucket>/wordcount/output/] --ami-version 3.6.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

If everything is set up correctly, you will have a successful cluster in the next few minutes. If your termination reason is “Steps completed”, everything worked as expected.

aws emr describe-cluster --cluster-id <cluster-id> --query Cluster.Status.StateChangeReason.Message
"Steps completed"

If you plan to use Route 53 to resolve your host records, you can terminate the Dnsmasq server at this point; you won't need it anymore.

aws ec2 terminate-instances --instance-ids i-22d6792f

Configuring Route 53 to Host Your Private Zone

The final scenario we explore is using Route 53 private DNS zones to host the forward and reverse zone records. As discussed earlier, the same issue arises as with an existing DNS server: we cannot populate the host and address records automatically without running our own DHCP server. To keep this post simple, we use a small subnet and define every possible address. To make things even simpler, we take the example BIND zones above and import them directly into Route 53.

Route 53 private hosted zones require that the VPC's DNS hostnames attribute be enabled; issue this command to enable it if it was previously disabled:

aws ec2 modify-vpc-attribute --vpc-id vpc-055ef660 --enable-dns-hostnames "{\"Value\":true}"
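
Private hosted zones also depend on the VPC's DNS support attribute (enableDnsSupport), which is enabled by default; if you have disabled it, turn it back on the same way:

aws ec2 modify-vpc-attribute --vpc-id vpc-055ef660 --enable-dns-support "{\"Value\":true}"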

Create two zones, one for forward lookups and one for reverse lookups. Make sure to provide a unique --caller-reference value for each request.

aws route53 create-hosted-zone --name hadoop.local --vpc VPCRegion=us-west-2,VPCId=vpc-055ef660 --caller-reference Support01

aws route53 create-hosted-zone --name 30.20.10.in-addr.arpa --vpc VPCRegion=us-west-2,VPCId=vpc-055ef660 --caller-reference Support02

Take note of the hosted zone ID that is returned:
"Id": "/hostedzone/Z34WMQP8SCULM7 "

Issue the following command to associate your Route 53 private DNS zones with your VPC.

aws route53 associate-vpc-with-hosted-zone --hosted-zone-id /hostedzone/Z34WMQP8SCULM7 --vpc VPCRegion=us-west-2,VPCId=vpc-055ef660

One more time for the reverse zone as well.

aws route53 associate-vpc-with-hosted-zone --hosted-zone-id /hostedzone/Z1KTAQQMOLRO1Z --vpc VPCRegion=us-west-2,VPCId=vpc-055ef660

Now, import the zone records from the example BIND server above. On the Route 53 console, select the hadoop.local zone, choose Go to Record Sets, Import Zone File, and then paste the contents of your zone file into the Zone File field. Choose Import at the bottom of the screen. Repeat the operation for the 30.20.10.in-addr.arpa zone.
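
If you prefer to stay on the command line, the change-resource-record-sets command creates the same records without the console. A minimal sketch for a single A record, assuming a batch file named record.json:

aws route53 change-resource-record-sets --hosted-zone-id Z34WMQP8SCULM7 --change-batch file://record.json

Where record.json contains:

{
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "ip-10-20-30-6.hadoop.local",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [ { "Value": "10.20.30.6" } ]
      }
    }
  ]
}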

Create a new DHCP options set that points DNS for hadoop.local at AmazonProvidedDNS, removing the custom DNS server used in the Dnsmasq example.

aws ec2 create-dhcp-options --dhcp-configurations Key=domain-name,Values=hadoop.local Key=domain-name-servers,Values=AmazonProvidedDNS

Take note of the new DHCP options set ID.

{
    "DhcpOptions": {
        "DhcpConfigurations": [
            {
                "Values": [
                    {
                        "Value": "hadoop.local"
                    }
                ],
                "Key": "domain-name"
            },
            {
                "Values": [
                    {
                        "Value": "AmazonProvidedDNS"
                    }
                ],
                "Key": "domain-name-servers"
            }
        ],
        "DhcpOptionsId": "dopt-936484f6"
    }
}

Associate the new DHCP option set with the VPC.

aws ec2 associate-dhcp-options --dhcp-options-id dopt-936484f6 --vpc-id vpc-055ef660
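
Running instances pick up new DHCP options only when their lease renews. Rather than waiting, you can force a renewal on an instance (assuming eth0 is its primary interface), or simply reboot it:

sudo dhclient -r eth0 && sudo dhclient eth0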

Now launch one final cluster and verify that everything works properly. Ensure that you substitute your own values and use a different output location than in the previous examples, or the step will fail.

aws emr create-cluster --bootstrap-actions Path=s3://support.elasticmapreduce/bootstrap-actions/other/sethostforvpc.sh --ec2-attributes SubnetId=subnet-907af9f5 --steps Type=STREAMING,Name='Streaming Program',ActionOnFailure=CONTINUE,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://<your-bucket>/wordcount/output2/] --ami-version 3.6.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

Wait for several minutes and check the last state change reason:

aws emr describe-cluster --cluster-id j-2TEFHMDR3LXWD --query Cluster.Status.StateChangeReason.Message 
"Steps completed"

Note: If you would like to avoid DNS administration altogether, you can use a bootstrap action to set your cluster hostnames back to the defaults and let AWS services handle everything. Add the following custom bootstrap action to your cluster:

s3://support.elasticmapreduce/bootstrap-actions/other/sethostforvpc.sh

Conclusion

And that’s it! You’ve learned how to use a custom DNS server to launch an EMR cluster with a custom private domain name inside a VPC. As long as you are able to resolve all of the nodes in your cluster, you can use any method you like to launch an EMR cluster. If you have an advanced EMR administration topic you would like to see covered in a future post, please let us know in the comments below.

If you have questions or suggestions, please leave a comment below.

—————————————-

Related:

Using IPython Notebook to Analyze Data with EMR

Getting Started with Elasticsearch and Kibana on EMR

Strategies for Reducing your EMR Costs

—————————————————————-

Love to work on open source? Check out EMR’s careers page.