How can I permanently install a Spark or Scala-based library on an Amazon EMR cluster?

Last updated: 2020-11-20

How can I permanently install a package on an Amazon EMR cluster and then access the package on an EMR notebook?


The following example installs GraphFrames from the GraphFrames website. Follow these steps to permanently install any Spark or Scala-based library that you want to access on an EMR notebook's PySpark kernel.

Prepare the bootstrap action script

1.    Download the JAR for your library.

2.    Upload the JAR to an Amazon Simple Storage Service (Amazon S3) bucket.

3.    Create a bootstrap action script similar to the following. This example script automatically installs the GraphFrames library on all nodes of an Amazon EMR cluster. Replace s3://doc-example-bucket/graphframes-0.8.0-spark2.4-s_2.11.jar with the path to the JAR in your S3 bucket.

#!/bin/bash
# The following two statements install the graphframes Python package on all nodes of the EMR cluster: the first for Python 3.6, the second for the base Python 2.7.
sudo pip-3.6 install graphframes
sudo pip install graphframes
# The following statement copies the GraphFrames JAR from the S3 bucket to the Spark JAR directory on every node of the EMR cluster.
sudo aws s3 cp s3://doc-example-bucket/graphframes-0.8.0-spark2.4-s_2.11.jar /usr/lib/spark/jars/
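
To adapt the script above to another library, only two values change: the S3 path of the JAR and the name of the Python package. The following Python sketch renders the same install and copy commands from those two values. The function `render_bootstrap_script` and its parameters are hypothetical names used for illustration and are not part of any AWS tooling; the `#!/bin/bash` first line is an assumption about how the script is run.

```python
def render_bootstrap_script(jar_s3_uri: str, pip_package: str) -> str:
    """Return bootstrap script text that installs a Python package and copies a JAR.

    Hypothetical helper: it only fills in the two values that change in the
    script above (the S3 path of the JAR and the pip package name).
    """
    if not jar_s3_uri.startswith("s3://") or not jar_s3_uri.endswith(".jar"):
        raise ValueError(f"expected an s3:// path to a .jar file, got {jar_s3_uri!r}")
    lines = [
        "#!/bin/bash",  # assumption: the script is executed with bash
        f"sudo pip-3.6 install {pip_package}",  # Python 3.6 environment
        f"sudo pip install {pip_package}",  # base Python 2.7 environment
        f"sudo aws s3 cp {jar_s3_uri} /usr/lib/spark/jars/",  # Spark JAR directory
    ]
    return "\n".join(lines) + "\n"


script = render_bootstrap_script(
    "s3://doc-example-bucket/graphframes-0.8.0-spark2.4-s_2.11.jar",
    "graphframes",
)
print(script)
```

The printed text can be saved to a file and uploaded to the S3 bucket in the same way as a hand-written script.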

4.    Upload the bootstrap action script to your S3 bucket.

5.    Open the Amazon EMR console.

6.    Choose Create cluster, and then choose Go to advanced options.

7.    In the Software configuration section, choose Hive, Livy, and Spark. These software packages are required to run EMR notebooks. For more information, see Cluster requirements.

8.    Continue creating the cluster. On the Step 3: General Cluster Settings page, enter the S3 path to your bootstrap action script. For more information, see Add custom bootstrap actions using the console.

9.    Finish creating the cluster.

10.   Create an EMR notebook using the cluster that you just created.

11.   When the notebook status is Ready, choose Open in JupyterLab.

12.   To test the installation, run a code snippet similar to the following in the PySpark kernel. If the code runs without errors, GraphFrames is installed correctly.

          from pyspark.sql import SparkSession
          from graphframes import GraphFrame

          # Reuse the notebook's Spark session, or create one if none exists.
          spark = SparkSession.builder.appName('fun').getOrCreate()

          # Vertices: one row per person, keyed by 'id'.
          vertices = spark.createDataFrame([('1', 'Carter', 'Derrick', 50),
                                            ('2', 'May', 'Derrick', 26),
                                            ('3', 'Mills', 'Jeff', 80),
                                            ('4', 'Hood', 'Robert', 65),
                                            ('5', 'Banks', 'Mike', 93),
                                            ('98', 'Berg', 'Tim', 28),
                                            ('99', 'Page', 'Allan', 16)],
                                           ['id', 'name', 'firstname', 'age'])

          # Edges: directed relationships from 'src' to 'dst'.
          edges = spark.createDataFrame([('1', '2', 'friend'),
                                         ('2', '1', 'friend'),
                                         ('3', '1', 'friend'),
                                         ('1', '3', 'friend'),
                                         ('2', '3', 'follows'),
                                         ('3', '4', 'friend'),
                                         ('4', '3', 'friend'),
                                         ('5', '3', 'friend'),
                                         ('3', '5', 'friend'),
                                         ('4', '5', 'follows'),
                                         ('98', '99', 'friend'),
                                         ('99', '98', 'friend')],
                                        ['src', 'dst', 'type'])

          g = GraphFrame(vertices, edges)

          ## Take a look at the DataFrames
          g.vertices.show()
          g.edges.show()

          ## Check the number of edges of each vertex
          g.degrees.show()
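
As a cross-check on the notebook output, the degree of each vertex can be worked out by hand from the edge list: a vertex's degree is the number of edges in which it appears as either `src` or `dst`. The plain-Python sketch below does this for the example edges without Spark, so the counts can be compared with what GraphFrames reports in the `degrees` DataFrame for the same graph. The helper name `expected_degrees` is hypothetical.

```python
from collections import Counter

# Same (src, dst, type) edges as in the notebook example above.
edges = [
    ("1", "2", "friend"),
    ("2", "1", "friend"),
    ("3", "1", "friend"),
    ("1", "3", "friend"),
    ("2", "3", "follows"),
    ("3", "4", "friend"),
    ("4", "3", "friend"),
    ("5", "3", "friend"),
    ("3", "5", "friend"),
    ("4", "5", "follows"),
    ("98", "99", "friend"),
    ("99", "98", "friend"),
]


def expected_degrees(edge_list):
    """Count, per vertex ID, the edges that touch it (in-degree plus out-degree)."""
    counts = Counter()
    for src, dst, _edge_type in edge_list:
        counts[src] += 1  # outgoing edge
        counts[dst] += 1  # incoming edge
    return dict(counts)


degrees = expected_degrees(edges)
for vertex_id in sorted(degrees, key=int):
    print(vertex_id, degrees[vertex_id])
```

For this edge list, vertex 3 has the highest degree (7), vertex 1 has degree 4, vertices 2, 4, and 5 each have degree 3, and vertices 98 and 99 each have degree 2. Because every edge contributes to two vertices, the degrees sum to twice the number of edges.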
