Running an External Zeppelin Instance using S3 Backed Notebooks with Spark on Amazon EMR
Dominic Murphy is an Enterprise Solution Architect with Amazon Web Services.
Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and quickly visualize results. Zeppelin notebooks can be shared among several users, and visualizations can be published to external dashboards. Zeppelin uses the Spark settings on your cluster and can utilize Spark’s dynamic allocation of executors to let YARN estimate the optimal resource consumption.
With the 4.1.0 release, Amazon EMR introduced Zeppelin as an application that can be installed on an EMR cluster during setup. Zeppelin is installed on the master node of the EMR cluster and creates a Spark Context to run interactive Spark jobs on the EMR cluster where it’s installed. By default, Zeppelin notebooks are also stored on the master node.
In this blog post, I will show you how to set up Zeppelin running “off-cluster” on a separate EC2 instance. You will be able to submit Spark jobs to an EMR cluster directly from your Zeppelin instance. By setting up Zeppelin off cluster, rather than on the master node of an EMR cluster, you will have the flexibility to choose which EMR cluster to submit jobs to, and can interact with your Zeppelin notebooks when your EMR cluster isn’t active. Finally, I will demonstrate how to store your Zeppelin notebooks on Amazon S3 for durable storage.
Make sure you have these resources before beginning the tutorial:
- AWS Command Line Interface installed
- An SSH client
- A key pair in the region where you’ll launch the Zeppelin instance
- An S3 bucket in the same region to store your Zeppelin notebooks, and to transfer files from EMR to your Zeppelin instance
- IAM permissions to create S3 buckets, launch EC2 instances, and create EMR clusters
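Before starting, it can help to confirm the command-line prerequisites from a terminal. The following is a minimal sketch rather than part of the original procedure: `missing_tools` is a hypothetical helper name, and it only checks that the named commands are on your PATH. It does not verify your key pair, bucket, or IAM permissions.

```shell
# Hypothetical helper (not from the original post): print each named
# command that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools aws ssh` should print nothing if both the AWS CLI and an SSH client are installed.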
Create an EMR cluster
The first step is to set up an EMR cluster.
- On the Amazon EMR console, choose Create cluster.
- Choose Go to advanced options and enter the following options:
  - Vendor: Amazon
  - Release: emr-4.5.0
  - Applications: Ensure that Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1 are selected. Deselect Pig and Hue.
- In the Add steps section, for Step type, choose Custom JAR.
- Choose Configure and enter:
  - JAR location: command-runner.jar
  - Arguments: aws s3 cp /etc/hadoop/conf/ s3://<YOUR_S3_BUCKET>/hadoopconf --recursive
  - Action on failure: Continue
- Choose Add and add a second step by choosing Configure again.
  - JAR location: command-runner.jar
  - Arguments: aws s3 cp /etc/hive/conf/hive-site.xml s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml
  - Action on failure: Continue
- Choose Add, Next.
- On the Hardware Configuration page, select your VPC and the subnet where you want to launch the cluster, keep the default selection of one master and two core nodes of m3.xlarge, and choose Next.
- On the General Options page, give your cluster a name (e.g., Spark-Cluster) and choose Next.
- On the Security Options page, for EC2 key pair, select a key pair. Keep all other settings at the default values and choose Create cluster.
Your three-node cluster takes a few moments to start up. Your cluster is ready when the cluster status is Waiting.
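If you prefer the AWS CLI to the console, the same cluster can probably be created with a single `aws emr create-cluster` call. This is a sketch under assumptions, not the procedure the post walks through: the function name and the step names `CopyHadoopConf` and `CopyHiveConf` are made up for illustration, the three arguments are values you supply, and `--use-default-roles` assumes the default EMR roles already exist in your account.

```shell
# Sketch of the console steps as one AWS CLI call. Arguments are placeholders
# you supply: $1 = EC2 key pair name, $2 = subnet ID, $3 = S3 bucket name.
# Assumes the default EMR roles already exist in your account.
create_spark_cluster() {
  key_name=$1
  subnet_id=$2
  bucket=$3
  # Both steps use command-runner.jar to copy config files from the master to S3.
  # Step names (CopyHadoopConf, CopyHiveConf) are illustrative only.
  aws emr create-cluster \
    --name "Spark-Cluster" \
    --release-label emr-4.5.0 \
    --applications Name=Hadoop Name=Hive Name=Spark \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --ec2-attributes "KeyName=${key_name},SubnetId=${subnet_id}" \
    --use-default-roles \
    --steps \
      "Type=CUSTOM_JAR,Name=CopyHadoopConf,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[aws,s3,cp,/etc/hadoop/conf/,s3://${bucket}/hadoopconf,--recursive]" \
      "Type=CUSTOM_JAR,Name=CopyHiveConf,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[aws,s3,cp,/etc/hive/conf/hive-site.xml,s3://${bucket}/hiveconf/hive-site.xml]"
}
```

With an instance count of 3 and a single instance type, EMR launches one master and two core nodes, which matches the console defaults described above. The step arguments are quoted so that the shell passes the square brackets through unchanged.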
Note: You need the master public DNS, the subnet ID, the security groups for Master and for Core/Task, and the VPC ID in subsequent steps. You can retrieve the first three from the EMR console, and the VPC ID from the EC2 Instances page.
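These values can also be read with the AWS CLI instead of the console. The sketch below is one possible way to do it; the function names are hypothetical, and it assumes you already know the cluster ID shown in the EMR console.

```shell
# $1 = EMR cluster ID (the j-... value shown in the EMR console).
# Prints the master public DNS, the subnet ID, and the master and
# core/task security group IDs as plain text.
cluster_details() {
  aws emr describe-cluster --cluster-id "$1" --output text \
    --query 'Cluster.[MasterPublicDnsName, Ec2InstanceAttributes.Ec2SubnetId, Ec2InstanceAttributes.EmrManagedMasterSecurityGroup, Ec2InstanceAttributes.EmrManagedSlaveSecurityGroup]'
}

# $1 = subnet ID returned above. Prints the ID of the VPC that contains it.
subnet_vpc() {
  aws ec2 describe-subnets --subnet-ids "$1" --output text \
    --query 'Subnets[0].VpcId'
}
```

Looking up the VPC through the subnet avoids having to find a cluster instance on the EC2 Instances page first.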
Launch an EC2 instance with Apache Zeppelin
Launch an EC2 Zeppelin instance with a CloudFormation template.
- In the CloudFormation console, choose Create Stack.
- Choose Specify an Amazon S3 template URL, and enter the following
- Choose Next.
- In the next page, give your stack a name and enter the following parameters:
  - EMRMasterSecurityGroup: Security group of EMR master.
  - EMRSlaveSecurityGroup: Security group of EMR core & task.
  - Instance Type: I recommend m3.xlarge for this procedure.
  - KeyName: Your key pair.
  - S3HadoopConfFolder: Replace <mybucket> with an S3 bucket from your account.
  - S3HiveConfFolder: Replace <mybucket> with an S3 bucket from your account.
  - SSHLocation: CIDR block that will be allowed to connect using SSH into the Zeppelin instance.
  - ZeppelinAccessLocation: CIDR block that will be allowed to connect to Zeppelin Web over port 8080.
  - ZeppelinSubnetId: Subnet where your EMR cluster launched.
  - ZeppelinVPCId: VPC where your EMR cluster launched.
- Choose Next.
- Optionally, specify a tag for your instance. Choose Next.
- Review your choices, check the IAM acknowledgement, and choose Create.
- Your stack will take several minutes to complete as it creates the EC2 instance and provisions Zeppelin and its prerequisites. While you are waiting, navigate to the S3 console and create a bucket for Zeppelin notebook storage. Create a folder in S3 for your Zeppelin user, and then a subfolder under it called notebook.
In the screen shot below, the Zeppelin storage bucket is called “zeppelin-bucket,” the Zeppelin user is “zeppelin-user,” and the notebook subfolder is in the user folder.
- Return to the CloudFormation console. When the CloudFormation stack status returns CREATE_COMPLETE, your EC2 instance is ready.
- Open the EC2 console to view your EC2 instance. Note the IP address and security group, as you will use them in subsequent steps.
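The notebook bucket and folder, and the wait for the stack, can be scripted as well. This is a rough sketch with hypothetical function names; it assumes your AWS CLI default region is the region you are working in, and that the bucket name you pass (for example, zeppelin-bucket with the user folder zeppelin-user) is still available.

```shell
# $1 = notebook bucket, $2 = Zeppelin user folder (names are yours to choose).
# Creates the bucket and an empty "<user>/notebook/" folder marker inside it.
create_notebook_folder() {
  aws s3 mb "s3://$1"
  aws s3api put-object --bucket "$1" --key "$2/notebook/"
}

# $1 = CloudFormation stack name.
# Blocks until the stack reaches CREATE_COMPLETE (or the wait fails).
wait_for_stack() {
  aws cloudformation wait stack-create-complete --stack-name "$1"
}
```

An object whose key ends in a slash is what the S3 console displays as a folder, so this should produce the same layout as creating the folders by hand.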
Configure your EMR security group to allow traffic from Zeppelin instance
- In the EMR console, select your cluster and navigate to the Cluster Details page.
- For Security group for Master, select a security group. The default is ElasticMapReduce-master.
- On the Security Group page, choose Inbound, Edit, Add Rule, All TCP. For Source, choose Custom IP, and in the next field enter the EC2 Zeppelin instance’s security group.
- Repeat the above steps for Security groups for Core & Task.
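The same inbound rules can be sketched with the AWS CLI. The function name is hypothetical and the security group IDs are values you look up yourself; the rule mirrors the console choice of All TCP with the Zeppelin instance's security group as the source.

```shell
# $1 = security group ID of the Zeppelin instance; remaining arguments are
# the EMR security group IDs (master, then core & task).
# Adds an inbound "All TCP" rule to each EMR group with the Zeppelin group as source.
allow_zeppelin_ingress() {
  zeppelin_sg=$1
  shift
  for emr_sg in "$@"; do
    aws ec2 authorize-security-group-ingress \
      --group-id "$emr_sg" \
      --protocol tcp \
      --port 0-65535 \
      --source-group "$zeppelin_sg"
  done
}
```

For example, `allow_zeppelin_ingress <zeppelin-sg-id> <emr-master-sg-id> <emr-core-sg-id>` would add one rule to each of the two EMR groups.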
Finalize the Zeppelin instance configuration
- Connect to your Zeppelin EC2 instance using SSH. If you are using PuTTY, follow the instructions in the Connecting to Your Linux Instance from Windows Using PuTTY topic.
## SSH as ec2-user to your instance
ssh -i <your key pair file> ec2-user@<your EC2 instance IP address>
- Complete the zeppelin-env.sh settings with the S3 Bucket and S3 User Folder you entered earlier.
sudo nano /home/ec2-user/zeppelin/conf/zeppelin-env.sh
export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
export MASTER=yarn-client
export HADOOP_CONF_DIR=/home/ec2-user/hadoopconf
#
#
export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo
export ZEPPELIN_NOTEBOOK_S3_BUCKET=<myZeppelinBucket>
export ZEPPELIN_NOTEBOOK_USER=<myZeppelinUser>
- Edit your /home/ec2-user/zeppelin/conf/zeppelin-site.xml file. Navigate to the following section and replace the <myZeppelinUser> and <myZeppelinBucket> placeholders with your S3 folder and bucket:
<!--If you use S3 for storage, the following folder structure is necessary: bucket_name/username/notebook/-->
<property>
  <name>zeppelin.notebook.s3.user</name>
  <value><myZeppelinUser></value>
  <description>user name for S3 folder structure</description>
</property>
<property>
  <name>zeppelin.notebook.s3.bucket</name>
  <value><myZeppelinBucket></value>
  <description>bucket name for notebook storage</description>
</property>
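If you would rather not type the zeppelin-env.sh values by hand, a small helper can print them with your bucket and user folder filled in. This is only a sketch: `zeppelin_env_lines` is a hypothetical name, and the variable names are copied as they appear in the step above.

```shell
# $1 = notebook bucket, $2 = Zeppelin user folder.
# Prints the zeppelin-env.sh settings from the step above with both values filled in.
zeppelin_env_lines() {
  cat <<EOF
export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
export MASTER=yarn-client
export HADOOP_CONF_DIR=/home/ec2-user/hadoopconf
export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo
export ZEPPELIN_NOTEBOOK_S3_BUCKET=$1
export ZEPPELIN_NOTEBOOK_USER=$2
EOF
}
```

The file provisioned on the instance may already contain some of these lines, so it is probably safest to compare the output of `zeppelin_env_lines zeppelin-bucket zeppelin-user` against the existing file and merge by hand rather than appending blindly.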
Start and test Zeppelin
- Start your Zeppelin instance. From your /home/ec2-user/zeppelin directory, type:
sudo bin/zeppelin-daemon.sh start
You are now done with your SSH session.
- Switch to your client’s browser window. Test your instance by navigating to http://<yourZeppelinInstanceIP>:8080/#/
- On the Zeppelin homepage, choose Import Note and enter the location for the Zeppelin Tutorial JSON file as follows:
- Complete the import process and execute the notebook by choosing Run All Paragraphs.
After a few moments, you should see the Tutorial dashboard as follows:
In your Hadoop Resource Manager, you should see the Zeppelin application running on your EMR cluster.
Navigate to the S3 console and verify that you can see your notebook.json file in the following folder:
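The folder structure noted in zeppelin-site.xml is bucket_name/username/notebook/. As an alternative to the console, the same check can be sketched with the AWS CLI; the function name here is hypothetical, and the two arguments are the bucket and user folder you configured earlier.

```shell
# $1 = notebook bucket, $2 = Zeppelin user folder.
# Lists every object stored under <bucket>/<user>/notebook/.
list_notebooks() {
  aws s3 ls "s3://$1/$2/notebook/" --recursive
}
```

If the import and run succeeded, `list_notebooks zeppelin-bucket zeppelin-user` should show at least one notebook file under that prefix.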
You can now clean up your instances to stop incurring charges:
- Navigate to the CloudFormation console and choose Delete Stack.
- Navigate to the EMR console, select your cluster, and choose Terminate.
- Navigate to the S3 console and, from the Actions menu, choose Delete Bucket. Type the name of the S3 bucket you used for this exercise and choose Delete.
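The cleanup can also be done from the command line. The sketch below uses a hypothetical function name and takes the stack name, cluster ID, and bucket name as arguments you supply. Note that `aws s3 rb --force` deletes every object in the bucket before removing it, so double-check the bucket name first.

```shell
# $1 = CloudFormation stack name, $2 = EMR cluster ID, $3 = bucket to delete.
# Caution: "rb --force" deletes every object in the bucket before removing it.
clean_up() {
  aws cloudformation delete-stack --stack-name "$1"
  aws emr terminate-clusters --cluster-ids "$2"
  aws s3 rb "s3://$3" --force
}
```

If you used separate buckets for the configuration files and the notebooks, run the last command once per bucket.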
In this blog post, you learned how to create a Zeppelin instance on EC2 and configure it as a YARN client. You also configured Zeppelin to store notebooks durably in S3 rather than on a local disk, so you can shut down or even terminate your instance and still persist your notebook data.
In this example you first created an EMR cluster, and then configured Zeppelin to submit jobs to that cluster. In a future post, I will examine submitting jobs to multiple EMR clusters from Zeppelin.
If you have a question or suggestion, please leave a comment below.