AWS Big Data Blog

Run Jupyter Notebook and JupyterHub on Amazon EMR

NOTE: Please note that as of EMR 5.14.0, JupyterHub is an officially supported application. We recommend you use the most recent version of EMR if you would like to run JupyterHub on EMR. In addition, EMR Notebooks allow you to create and open Jupyter notebooks with the Amazon EMR console. We will not provide any additional updates to the content in this post.

Tom Zeng is a Solutions Architect for Amazon EMR

Jupyter Notebook (formerly IPython Notebook) is one of the most popular user interfaces for running Python, R, Julia, Scala, and other languages to process and visualize data, perform statistical analysis, and train and run machine learning models. Jupyter notebooks are self-contained documents that can include live code, charts, narrative text, and more. The notebooks can be easily converted to HTML, PDF, and other formats for sharing.

Amazon EMR is a popular hosted big data processing service that allows users to easily run Hadoop, Spark, Presto, and other Hadoop ecosystem applications, such as Hive and Pig.

Python, Scala, and R provide support for Spark and Hadoop, and running them in Jupyter on Amazon EMR makes it easy to take advantage of:

  • the big-data processing capabilities of Hadoop applications.
  • the large selection of Python and R packages for analytics and visualization.

JupyterHub is a multi-user environment for Jupyter. You can use the following bootstrap action (BA) to install Jupyter and JupyterHub on Amazon EMR:

s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh

These are the supported Jupyter kernels:

  • Python
  • R
  • Scala
  • Apache Toree (which provides the Spark, PySpark, SparkR, and SparkSQL kernels)
  • Julia
  • Ruby
  • JavaScript
  • CoffeeScript
  • Torch

The BA will install Jupyter, JupyterHub, and sample notebooks on the master node.

Commonly used Python and R data science and machine learning packages can optionally be installed on all nodes. Use the Python 2 notebook (or Python 3, if you use the --python3 option) to run PySpark code, use the R notebook to run SparkR code, and use the Toree Scala notebook to run Spark Scala code.
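
For example, here is a minimal PySpark sanity check you might run in the Python notebook. This is a sketch that assumes Spark 2.x and that the BA has configured PySpark for the notebook environment:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; on EMR this runs against YARN by default
spark = SparkSession.builder.appName("jupyter-example").getOrCreate()

# A tiny DataFrame to confirm that Spark jobs execute on the cluster
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()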

The following arguments can be passed to the BA:

--r Install the IRKernel for R.
--toree Install the Apache Toree kernel, which supports Scala, PySpark, SQL, and SparkR for Apache Spark.
--julia Install the IJulia kernel for Julia.
--torch Install the iTorch kernel for Torch (machine learning and visualization).
--ruby Install the iRuby kernel for Ruby.
--ds-packages Install the Python data science-related packages (scikit-learn pandas statsmodels).
--ml-packages Install the Python machine learning-related packages (theano keras tensorflow).
--bigdl Install Intel’s BigDL deep learning libraries.
--python-packages Install specific Python packages (for example, ggplot and nilearn).
--port Set the port for Jupyter notebook. The default is 8888.
--user Set the default user for JupyterHub. The default is jupyter.
--password Set the password for the Jupyter notebook.
--localhost-only Restrict Jupyter to listen on localhost only. The default is to listen on all IP addresses.
--jupyterhub Install JupyterHub.
--jupyterhub-port Set the port for JupyterHub. The default is 8000.
--notebook-dir Specify the notebook folder. This could be a local directory or an S3 bucket.
--cached-install Use some cached dependency artifacts on S3 to speed up installation.
--ssl Enable SSL. For production, make sure to use your own certificate and key files.
--copy-samples Copy sample notebooks to the notebook folder.
--spark-opts User-supplied Spark options to override the default values.
--python3 Install packages and applications for Python 3 instead of Python 2.
--s3fs Use s3fs instead of the default, s3contents for storing notebooks on Amazon S3. This argument can cause slowness if the S3 bucket has lots of files.

By default (with no --password and --port arguments), Jupyter will run on port 8888 with no password protection; JupyterHub will run on port 8000.  The --port and --jupyterhub-port arguments can be used to override the default ports to avoid conflicts with other applications.

The --r option installs the IRKernel for R. It also installs SparkR and sparklyr for R, so make sure Spark is one of the selected EMR applications to be installed. You’ll need the Spark application if you use the --toree argument.

If you used --jupyterhub, sign in to JupyterHub with Linux user accounts. (Be sure to create passwords for the Linux users first.) The default admin user for JupyterHub, jupyter, can be used to set up other users. The --password option sets the password for Jupyter and for the jupyter user for JupyterHub.
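
For example, you might create a hypothetical user named analyst on the master node like this (the user name is illustrative; the CreateUser.ipynb sample notebook described later covers this in more detail):

sudo adduser analyst   # create the Linux user
sudo passwd analyst    # set the password analyst will use to sign in to JupyterHub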

Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).

To store notebooks on S3, use:

--notebook-dir <s3://your-bucket/folder/>

To store notebooks in a directory different from the user’s home directory, use:

--notebook-dir <local directory>

The following example CLI command launches a five-node (c3.4xlarge) EMR 5.12.1 cluster with the bootstrap action. The BA will install all the available kernels. It will also install the ggplot and nilearn Python packages and set:

  • the Jupyter port to 8885
  • the password to jupyter
  • the JupyterHub port to 8005

aws emr create-cluster --release-label emr-5.12.1 \
  --name 'emr-5.12.1 jupyter cli example' \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Pig Name=Tez Name=Ganglia Name=Presto \
  --ec2-attributes KeyName=<your-ec2-key>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.4xlarge \
    InstanceGroupType=CORE,InstanceCount=4,InstanceType=c3.4xlarge \
  --region us-east-1 \
  --log-uri s3://<your-s3-bucket>/emr-logs/ \
  --bootstrap-actions \
    Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",Args=[--r,--julia,--toree,--torch,--ruby,--ds-packages,--ml-packages,--python-packages,'ggplot nilearn',--port,8885,--password,jupyter,--jupyterhub,--jupyterhub-port,8005,--cached-install,--notebook-dir,s3://<your-s3-bucket>/notebooks/,--copy-samples]

Replace <your-ec2-key> with the name of your EC2 key pair and <your-s3-bucket> with the S3 bucket where you store notebooks and logs. You can also change the instance types to suit your needs and budget.

NOTE: If you specify all or most of the options, the bootstrap action will take longer to run, and on smaller EC2 master instances it could time out after 60 minutes, causing the cluster launch to fail. If you'd like to try all the options, we recommend that you use 2xlarge instances (for example, m3.2xlarge) or larger.

If you are using the EMR console to launch a cluster, you can specify the bootstrap action as follows:

[Screenshot: specifying the bootstrap action in the EMR console]

Notice the difference in bootstrap action syntax between the AWS CLI and the EMR console: the CLI separates optional arguments with commas, while the EMR console uses spaces. Likewise, for the --python-packages argument, the CLI uses single quotes around multiple packages, while the EMR console uses double quotes.
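
For example, applying those rules, the arguments from the CLI example above would appear in the console's optional arguments field like this (a sketch; substitute your own bucket):

--r --julia --toree --torch --ruby --ds-packages --ml-packages --python-packages "ggplot nilearn" --port 8885 --password jupyter --jupyterhub --jupyterhub-port 8005 --cached-install --notebook-dir s3://<your-s3-bucket>/notebooks/ --copy-samples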

When the cluster is available, set up the SSH tunnel and web proxy. The Jupyter notebook should be available at localhost:8885 (as specified in the example CLI command).
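
For example, a simple local port forward (one way to set up the tunnel, in place of a full dynamic proxy; replace the key file and master public DNS name with your own) might look like this:

ssh -i ~/<your-ec2-key>.pem -N -L 8885:localhost:8885 hadoop@<master-public-dns>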

[Screenshot: Jupyter sign-in page]

After you have signed in, you will see the home page, which displays the notebook files:

[Screenshot: Jupyter home page listing notebook files]

If JupyterHub is installed, the Sign in page should be available at port 8005 (as specified in the CLI example):

After you are signed in, you'll see that the JupyterHub and Jupyter home pages are the same. The JupyterHub URL, however, is /user/<username>/tree instead of /tree.

[Screenshot: JupyterHub home page]

The JupyterHub Admin page is used for managing users:

[Screenshot: JupyterHub Admin page]

You can install Jupyter extensions from the Nbextensions tab:

[Screenshot: the Nbextensions tab]

If you specified the --copy-samples option in the BA, you should see sample notebooks on the home page. To try the samples, first open and run the CopySampleDataToHDFS.ipynb notebook to copy some sample data files to HDFS. In the CLI example, --python-packages,'ggplot nilearn' is used to install the ggplot and nilearn packages. You can verify those packages were installed by running the Py-ggplot and PyNilearn notebooks.

The CreateUser.ipynb notebook contains examples for setting up JupyterHub users.

The PySpark.ipynb and ScalaSpark.ipynb notebooks contain the Python and Scala versions of some machine learning examples from the Spark distribution (Logistic Regression, Neural Networks, Random Forest, and Support Vector Machines):

[Screenshot: sample machine learning notebooks]
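
As a flavor of what those notebooks cover, here is a minimal logistic regression sketch in PySpark, adapted from the standard Spark MLlib documentation examples (it assumes the SparkSession named spark from the earlier snippet):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# A tiny hand-made training set: (label, feature vector)
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

# Fit a logistic regression model and inspect the learned coefficients
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
print(model.coefficients)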

PyHivePrestoHDFS.ipynb shows how to access Hive, Presto, and HDFS in Python. (Be sure to run the CreateHivePrestoS3Tables.ipynb first to create tables.) The %%time and %%timeit cell magics can be used to benchmark Hive and Presto queries (and other executable code):
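
For instance, here is a minimal timing sketch using PyHive, assuming the pyhive package is available (the sample notebook covers the details) and a hypothetical table named sample_table; Presto's coordinator listens on port 8889 on EMR:

%%time
from pyhive import presto

# Connect to the Presto coordinator on the master node
conn = presto.connect(host='localhost', port=8889)
cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM sample_table')  # sample_table is hypothetical
print(cur.fetchall())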

[Screenshot: benchmarking Hive and Presto queries with the %%time and %%timeit cell magics]

Here are some other sample notebooks for you to try:

  • SparkSQL – SparkSQLParquetJSON.ipynb
  • Plot of SVM separating hyperplanes – plot_separating_hyperplane.ipynb
  • Linear vs. non-linear Support Vector Machines – R-SVMLinearNonLinear.ipynb
  • Plots of SVMs in Python – plot_iris.ipynb
  • Julia iris plot – Julia-IrisPlot.ipynb
  • Python iris plot – PyIrisPlot.ipynb
  • Random Forest tree plot – R-RandomForestVisualization.ipynb
  • Granger Causality test in R – GrangerCausality.ipynb
  • The %%sql cell extension for SQL – SQLite.ipynb
  • The %%dot extension for the Graphviz dot language – GraphvizDot.ipynb

Conclusion

Data scientists who run Jupyter and JupyterHub on Amazon EMR can use Python, R, Julia, and Scala to process, analyze, and visualize big data stored in Amazon S3. Jupyter notebooks can be saved to S3 automatically, so users can shut down and launch new EMR clusters, as needed. EMR makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets. This can greatly reduce the cost of data-science investigations.

If you have questions about using Jupyter and JupyterHub on EMR or would like to share your use cases, please leave a comment below.


Related

Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR
