AWS Big Data Blog
Dynamically Create Friendly URLs for Your Amazon EMR Web Interfaces
Amazon EMR enables data analysts and scientists to deploy a cluster running popular frameworks such as Spark, HBase, Presto, and Flink of any size in minutes. When you launch a cluster, Amazon EMR automatically configures the underlying Amazon EC2 instances with the frameworks and applications that you choose for your cluster. This can include popular web interfaces such as Hue workbench, Zeppelin notebook, and Ganglia monitoring dashboards and tools.
These web interfaces are hosted on the EMR master node and must be accessed using the public DNS name of the master node (master public DNS value). The master public DNS value is dynamically created, not very user friendly and is hard to remember— it looks something like ip-###-###-###-###.us-west-2.compute.internal. Not having a friendly URL to connect to the popular workbench or notebook interfaces may impact the workflow and hinder your gained agility.
Some customers have addressed this challenge through custom bootstrap actions, steps, or external scripts that periodically check for new clusters and register a friendlier name in DNS. These approaches either put additional burden on the data practitioners or require additional resources to execute the scripts. In addition, there is typically some lag time associated with such scripts. They often don’t do a great job cleaning up the DNS records after the cluster has terminated, potentially resulting in a security risk.
The solution in this post provides an automated, serverless approach to registering a friendly master node name for easy access to the web interfaces.
AWS services
Inspired in part by our colleague’s post on Building a Dynamic DNS for Route 53 using CloudWatch Events and Lambda, this solution leverages Amazon CloudWatch Events, AWS Lambda, and Amazon Route 53 to dynamically register a CNAME with a friendly name in a Route 53 private hosted zone.
Before I dive deeper, I review these key services and how they are part of this solution.
CloudWatch Events
CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules, you can match events and route them to one or more target functions or streams. An event can be generated in one of four ways:
- From an AWS service when resources change state
- From API calls that are delivered via AWS CloudTrail
- From your own code that can generate application-level events
- Issued on cron-style scheduling
In this solution, I cover the first type of event, which is automatically emitted by EMR when the cluster state changes. Based on the state of this event, either create or update the DNS record in Route 53 when the cluster state changes to STARTING, or delete the DNS record when the cluster is no longer needed and the state changes to TERMINATED. For more information about all EMR event details, see Monitor CloudWatch Events.
Route 53 private hosted zones
A private hosted zone is a container that holds information about how to route traffic for a domain and its subdomains within one or more VPCs. Private hosted zones enable you to use custom DNS names for your internal resources without exposing the names or IP addresses to the internet.
Route 53 supports resource record sets with a wide range of record types. In this solution, you use a CNAME record that is used to specify a domain name as an alias for another domain (the ‘canonical’ domain). You use a friendly name of the cluster as the CNAME for the EMR master public DNS value.
You are using private hosted zones because an EMR cluster is typically deployed within a private subnet and is accessed either from within the VPC or from on-premises resources over VPN or AWS Direct Connect. To resolve domain names in private hosted zones from your on-premises network, configure a DNS forwarder, as described in How can I resolve Route 53 private hosted zones from an on-premises network via an Ubuntu instance?.
Lambda
Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda executes your code only when needed and scales automatically to thousands of requests per second. Lambda takes care of high availability, and server and OS maintenance and patching. You pay only for the consumed compute time. There is no charge when your code is not running.
Lambda provides the ability to invoke your code in response to events, such as when an object is put to an Amazon S3 bucket or as in this case, when a CloudWatch event is emitted. As part of this solution, you deploy a Lambda function as a target that is invoked by CloudWatch Events when the event matches your rule. You also configure the necessary permissions based on the Lambda permissions model, including a Lambda function policy and Lambda execution role.
Putting it all together
Now that you have all of the pieces, you can put together a complete solution. The following diagram illustrates how the solution works:
- Start with a user activity such as launching or terminating an EMR cluster.
- EMR automatically sends events to the CloudWatch Events stream.
- A CloudWatch Events rule matches the specified event, and routes it to a target, which in this case is a Lambda function. In this case, you are using the EMR Cluster State Change
- The Lambda function performs the following key steps:
- Get the clusterId value from the event detail and use it to call EMR. DescribeCluster API to retrieve the following data points:
- MasterPublicDnsName – public DNS name of the master node
- Locate the tag containing the friendly name to use as the CNAME for the cluster. The key name containing the friendly name should be The value should be specified as host.domain.com, where domain is the private hosted zone in which to update the DNS record.
- Update DNS based on the state in the event detail.
- If the state is RUNNING or WAITING, the function calls the Route 53 API to create or update a resource record set in the private hosted zone specified by the domain tag. This is a CNAME record mapped to MasterPublicDnsName.
- Conversely, if the state is TERMINATED, the function calls the Route 53 API to delete the associated resource record set from the private hosted zone.
- Get the clusterId value from the event detail and use it to call EMR. DescribeCluster API to retrieve the following data points:
Deploying the solution
Because all of the components of this solution are serverless, use the AWS Serverless Application Model (AWS SAM) template to deploy the solution. AWS SAM is natively supported by AWS CloudFormation and provides a simplified syntax for expressing serverless resources, resulting in fewer lines of code.
Overview of the SAM template
For this solution, the SAM template has 129 lines of code, vs. almost 200 lines without SAM resources (and writing the template in YAML would be even slightly smaller). The solution can be deployed using the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SAM Local.
CloudFormation transforms help simplify template authoring by condensing a multiple-line resource declaration into a single line in your template. To inform CloudFormation that your template defines a serverless application, add a line under the template format version as follows:
Before SAM, you would use the AWS::Lambda::Function resource type to define your Lambda function. You would then need a resource to define the permissions for the function (AWS::Lambda::Permission) and another resource to define a CloudWatch Events resource (Events::Rule) that triggers this function.
With SAM, you need to define just a single resource for your function, AWS::Serverless::Function. Using this single resource type, you can define everything that you need, including function properties such as function handler, runtime, and code URI, as well as the CloudWatch event.
A few additional things to note in the code example:
- CodeUri – Before you can deploy a SAM template, first upload your Lambda function code zip to S3. You can do this manually or use the aws cloudformation package CLI command to automate the task of uploading local artifacts to a S3 bucket, as shown later.
- CloudWatch Events rule – Instead of specifying a CloudWatch Events resource type, you are defining an event source object as a property of the function itself. When the template is submitted, CloudFormation expands this into a CloudWatch Events rule resource and automatically creates the Lambda resource-based permissions to allow the CloudWatch Events rule to trigger the function.
- Lambda execution role and permissions – The IAM policy for the Lambda execution role includes route53:ChangeResourceRecordSets, route53:GetHostedZone, route53:ListResourceRecordSets permissions for any hosted zone in your account (i.e. arn:aws:route53:::hostedzone/*). You can further scope down these permissions to specific hosted zones you plan to use with your EMR clusters (for example arn:aws:route53:::hostedzone/Z22AAAAABBBBB).
Deploying the solution using the console
1.) Log in to the CloudFormation console and choose Create stack.
2.) For Choose a template, select Specify an Amazon S3 template URL and enter the following URL:
NOTE: If you are trying this solution outside of us-east-1, then you should download the necessary files, upload them to the buckets in your region, edit the script as appropriate and then run it or use the CLI deployment method below.
3.) Choose Next.
4.) On the Specify Details page, keep or modify the stack name and choose Next.
5.) On the Options page, choose Next.
6.) On the Review page, acknowledge the three transform access capabilities. This allows the CloudFormation transform to create the required IAM resources. Choose Create Stack to deploy the stack.
Deploying the solution using the AWS CLI
- Download the Lambda function code emr-dns-setter.py and the SAM template emr-dns-setter-sam.json to your local machine.
- Modify the CodeUri property in emr-dns-setter-sam.json template to the local path of the function code:
- Use the aws cloudformation package CLI command to upload the package.
After the package is successfully uploaded, the output should look as follows:
The CodeUri property in serverless-output.template is now referencing the packaged artifacts in the S3 bucket that you specified:
- Use the aws cloudformation deploy CLI command to deploy the stack:
You should see the following output after the stack has been successfully created:
Validating results
To test the solution, launch an EMR cluster. The Lambda function looks for the cluster_name tag associated with the EMR cluster. Make sure to specify the friendly name of your cluster as host.domain.com where the domain is the private hosted zone in which to create the CNAME record.
Here is a sample CLI command to launch a cluster within a specific subnet in a VPC with the required tag cluster_name.
After the cluster is launched, log in to the Route 53 console. In the left navigation pane, choose Hosted Zones to view the list of private and public zones currently configured in Route 53. Select the hosted zone that you specified in the ZONE tag when you launched the cluster. Verify that the resource records were created.
You can also monitor the CloudWatch Events metrics that are published to CloudWatch every minute, such as the number of TriggeredRules and Invocations.
Now that you’ve verified that the Lambda function successfully updated the Route 53 resource records in the zone file, terminate the EMR cluster and verify that the records are removed by the same function.
Conclusion
This solution provides a serverless approach to automatically assigning a friendly name for your EMR cluster for easy access to popular notebooks and other web interfaces. CloudWatch Events also supports cross-account event delivery, so if you are running EMR clusters in multiple AWS accounts, all cluster state events across accounts can be consolidated into a single account.
I hope that this solution provides a small glimpse into the power of CloudWatch Events and Lambda and how they can be leveraged with EMR and other AWS big data services. For example, by using the EMR step state change event, you can chain various pieces of your analytics pipeline. You may have a transient cluster perform data ingest and, when the task successfully completes, spin up an ETL cluster for transformation and upload to Amazon Redshift. The possibilities are truly endless.
Additional Reading
If you found this post useful, be sure to check out Securely Access Web Interfaces on Amazon EMR Launched in a Private Subnet and Respond to State Changes on Amazon EMR Clusters with Amazon CloudWatch Events.
About the Authors
Ilya is a solutions architect with AWS. He helps customers to innovate on the AWS platform by building highly available, scalable, and secure architectures. He enjoys spending time outdoors and building Lego creations with his kids.
Roger is a solutions architect with AWS. He helps customers to implement cloud native architectures so they can focus more on what sets them apart as an organization. Outside of work, he can he found woodworking or cooking with his family.