AWS Big Data Blog

Run Jupyter Notebook and JupyterHub on Amazon EMR

by Tom Zeng

Tom Zeng is a Solutions Architect for Amazon EMR.

Jupyter Notebook (formerly IPython Notebook) is one of the most popular user interfaces for running Python, R, Julia, Scala, and other languages to process and visualize data, perform statistical analysis, and train and run machine learning models. Jupyter notebooks are self-contained documents that can include live code, charts, narrative text, and more. The notebooks can be easily converted to HTML, PDF, and other formats for sharing.

Amazon EMR is a popular hosted big data processing service that allows users to easily run Hadoop, Spark, Presto, and other Hadoop ecosystem applications, such as Hive and Pig.

Python, Scala, and R provide support for Spark and Hadoop, and running them in Jupyter on Amazon EMR makes it easy to take advantage of:

  • the big-data processing capabilities of Hadoop applications.
  • the large selection of Python and R packages for analytics and visualization.

JupyterHub is a multiple-user environment for Jupyter. You can use the following bootstrap action (BA) to install Jupyter and JupyterHub on Amazon EMR:

s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh
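
In a create-cluster command, the BA is attached through the --bootstrap-actions option. The following is a minimal sketch with placeholder key, instance, and region values; a complete example that also passes BA arguments appears later in this post:

aws emr create-cluster --release-label emr-5.2.0 \
  --name 'jupyter minimal example' \
  --applications Name=Hadoop Name=Spark \
  --ec2-attributes KeyName=<your-ec2-key>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge \
  --region us-east-1 \
  --bootstrap-actions Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh"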

These are the supported Jupyter kernels:

  • Python
  • R
  • Scala
  • Apache Toree (which provides the Spark, PySpark, SparkR, and SparkSQL kernels)
  • Julia
  • Ruby
  • JavaScript
  • CoffeeScript
  • Torch

The BA will install Jupyter, JupyterHub, and sample notebooks on the master node.

Commonly used Python and R data science and machine learning packages can be optionally installed on all nodes.

The following arguments can be passed to the BA:

--r Install the IRKernel for R.
--toree Install the Apache Toree kernel, which provides the Scala, PySpark, SparkSQL, and SparkR kernels for Apache Spark.
--julia Install the IJulia kernel for Julia.
--torch Install the iTorch kernel for Torch (machine learning and visualization).
--ruby Install the iRuby kernel for Ruby.
--ds-packages Install the Python data science-related packages (scikit-learn, pandas, statsmodels).
--ml-packages Install the Python machine learning-related packages (theano, keras, tensorflow).
--python-packages Install specific Python packages (for example, ggplot and nilearn).
--port Set the port for Jupyter notebook. The default is 8888.
--password Set the password for the Jupyter notebook.
--localhost-only Restrict Jupyter to listen on localhost only. The default is to listen on all IP addresses.
--jupyterhub Install JupyterHub.
--jupyterhub-port Set the port for JupyterHub. The default is 8000.
--notebook-dir Specify the notebook folder. This could be a local directory or an S3 bucket.
--cached-install Use some cached dependency artifacts on S3 to speed up installation.
--ssl Enable SSL. For production, make sure to use your own certificate and key files.
--copy-samples Copy sample notebooks to the notebook folder.
--spark-opts User-supplied Spark options to override the default values.
--s3fs Use s3fs instead of s3nb (the default) for storing notebooks on Amazon S3. This argument can cause slowness if the S3 bucket has lots of files. The Upload file and Create folder menu options do not work with s3nb.

By default (with no --password and --port arguments), Jupyter will run on port 8888 with no password protection; JupyterHub will run on port 8000.  The --port and --jupyterhub-port arguments can be used to override the default ports to avoid conflicts with other applications.
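
For example, to run Jupyter on port 8899 and JupyterHub on port 8002 with a password, you could pass BA arguments like the following (the port numbers and password here are hypothetical; pick values that don't conflict with other applications on your cluster):

Args=[--password,mypassword,--port,8899,--jupyterhub,--jupyterhub-port,8002]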

The --r option installs the IRKernel for R. It also installs SparkR and sparklyr for R, so make sure Spark is one of the selected EMR applications to be installed. You’ll need the Spark application if you use the --toree argument.

If you used --jupyterhub, use Linux users to sign in to JupyterHub. (Be sure to create passwords for the Linux users first.) The hadoop user, the default admin user for JupyterHub, can be used to set up other users. The --password option sets the password for Jupyter and for the hadoop user in JupyterHub.
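
For example, to add a JupyterHub user, you can create a Linux user on the master node and set its password. This is a minimal sketch; "analyst" is a hypothetical user name:

# Run on the master node (for example, over SSH as the hadoop user)
sudo adduser analyst    # create the Linux user
sudo passwd analyst     # set the password used on the JupyterHub sign-in page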

Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).

To store notebooks on S3, use:

--notebook-dir <s3://your-bucket/folder/>

To store notebooks in a directory different from the user’s home directory, use:

--notebook-dir <local directory>

The following example CLI command is used to launch a five-node (c3.4xlarge) EMR 5.2.0 cluster with the bootstrap action. The BA will install all the available kernels. It will also install the ggplot and nilearn Python packages and set:

  • the Jupyter port to 8880
  • the password to jupyter
  • the JupyterHub port to 8001

aws emr create-cluster --release-label emr-5.2.0 \
  --name 'emr-5.2.0 sparklyr + jupyter cli example' \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Pig Name=Tez Name=Ganglia Name=Presto \
  --ec2-attributes KeyName=<your-ec2-key>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.4xlarge \
    InstanceGroupType=CORE,InstanceCount=4,InstanceType=c3.4xlarge \
  --region us-east-1 \
  --log-uri s3://<your-s3-bucket>/emr-logs/ \
  --bootstrap-actions \
    Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",Args=[--r,--julia,--toree,--torch,--ruby,--ds-packages,--ml-packages,--python-packages,'ggplot nilearn',--port,8880,--password,jupyter,--jupyterhub,--jupyterhub-port,8001,--cached-install,--notebook-dir,s3://<your-s3-bucket>/notebooks/,--copy-samples]

Replace <your-ec2-key> with the name of your EC2 key pair and <your-s3-bucket> with the S3 bucket where you store notebooks. You can also change the instance types to suit your needs and budget.

If you are using the EMR console to launch a cluster, you can specify the bootstrap action as follows:

[Screenshot: specifying the bootstrap action in the EMR console]

When the cluster is available, set up the SSH tunnel and web proxy. The Jupyter notebook should be available at localhost:8880 (as specified in the example CLI command).
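
For example, local port forwarding over SSH is one way to set up the tunnel (dynamic port forwarding with a browser proxy, as described in the EMR documentation, also works); <master-public-dns> is a placeholder for your master node's public DNS name:

# Forward local port 8880 to the Jupyter notebook server on the master node,
# then browse to localhost:8880 (use https if you enabled --ssl)
ssh -i <your-ec2-key>.pem -N -L 8880:localhost:8880 hadoop@<master-public-dns>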

[Screenshot: Jupyter sign-in page at localhost:8880]

After you have signed in, you will see the home page, which displays the notebook files:

[Screenshot: Jupyter home page listing the notebook files]

If JupyterHub is installed, the Sign in page should be available at port 8001 (as specified in the CLI example):

[Screenshot: JupyterHub sign-in page on port 8001]

After you are signed in, you’ll see the JupyterHub and Jupyter home pages are the same. The JupyterHub URL, however, is /user/<username>/tree instead of /tree.

[Screenshot: JupyterHub home page]

The JupyterHub Admin page is used for managing users:

[Screenshot: JupyterHub Admin page]

You can install Jupyter extensions from the Nbextensions tab:

[Screenshot: Nbextensions tab]

If you specified the --copy-samples option in the BA, you should see sample notebooks on the home page. To try the samples, first open and run the CopySampleDataToHDFS.ipynb notebook to copy some sample data files to HDFS. In the CLI example, --python-packages,'ggplot nilearn' is used to install the ggplot and nilearn packages. You can verify those packages were installed by running the Py-ggplot and PyNilearn notebooks.
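
You can also check from a shell on the master node. This is a quick sketch that assumes the BA installed the packages into the default Python environment on the PATH:

# List installed packages and filter for the ones requested with --python-packages
pip freeze | grep -iE 'ggplot|nilearn'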

The CreateUser.ipynb notebook contains examples for setting up JupyterHub users.

The PySpark.ipynb and ScalaSpark.ipynb notebooks contain the Python and Scala versions of some machine learning examples from the Spark distribution (Logistic Regression, Neural Networks, Random Forest, and Support Vector Machines):

[Screenshot: PySpark and ScalaSpark sample notebooks]

PyHivePrestoHDFS.ipynb shows how to access Hive, Presto, and HDFS in Python. (Be sure to run the CreateHivePrestoS3Tables.ipynb notebook first to create the tables.) The %%time and %%timeit cell magics can be used to benchmark Hive and Presto queries (and other executable code):

[Screenshot: PyHivePrestoHDFS notebook with %%time benchmarks]

Here are some other sample notebooks for you to try:

  • SparkSQLParquetJSON.ipynb
  • plot_separating_hyperplane.ipynb
  • R-SVMLinearNonLinear.ipynb
  • plot_iris.ipynb
  • Julia-IrisPlot.ipynb
  • PyIrisPlot.ipynb
  • R-RandomForestVisualization.ipynb
  • GrangerCausality.ipynb
  • SQLite.ipynb
  • GraphvizDot.ipynb

Conclusion

Data scientists who run Jupyter and JupyterHub on Amazon EMR can use Python, R, Julia, and Scala to process, analyze, and visualize big data stored in Amazon S3. Jupyter notebooks can be saved to S3 automatically, so users can shut down and launch new EMR clusters, as needed. EMR makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets. This can greatly reduce the cost of data-science investigations.

If you have questions about using Jupyter and JupyterHub on EMR or would like to share your use cases, please leave a comment below.


Related

Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR
