How can I resolve the ModuleNotFoundError on an Amazon SageMaker notebook that's running the Sparkmagic kernel?

Last updated: 2020-06-15

I'm trying to run an Amazon SageMaker notebook instance with the Sparkmagic (PySpark) kernel. I used pip to install the Python libraries, but I get this error: "ModuleNotFoundError: No module named my_module_name." What do I do to resolve this error?

Short Description

When you use the Sparkmagic kernel, the Amazon SageMaker notebook acts as an interface for the Apache Spark session that's running on a remote Amazon EMR cluster or an AWS Glue development endpoint.

When you use pip to install a Python library on the notebook instance, the library is available only to the local notebook instance, not to the remote Spark session where the cell's code runs. To resolve the ModuleNotFoundError, install the library on the AWS Glue development endpoint or on each node of the EMR cluster.

Note: If the code that uses the library doesn't need much computing power (for example, viewing results), you can use local mode (%%local) to run the cell on the local notebook instance only. When you do this, you don't have to install the library on the remote cluster or development endpoint.
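To see which side is missing a library, a small check like the following can help. It is a minimal sketch that uses only the Python standard library; the function name describe_module is hypothetical and isn't part of Sparkmagic or Amazon SageMaker. Run in a regular Sparkmagic cell, it reports on the remote Spark driver. Run in a cell that starts with %%local, it reports on the notebook instance.

```python
import importlib.util
import sys


def describe_module(module_name):
    """Report whether a top-level module is importable by this interpreter.

    describe_module is a hypothetical helper name used for illustration.
    """
    # find_spec returns None when a top-level module can't be found.
    spec = importlib.util.find_spec(module_name)
    return {
        "module": module_name,
        "available": spec is not None,
        "interpreter": sys.executable,
    }


# Replace "pandas" with the module named in the ModuleNotFoundError.
print(describe_module("pandas"))
```

If the module is available in local mode but not in a regular cell, the library is installed only on the notebook instance.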

Resolution

To install libraries on a remote AWS Glue development endpoint, see Loading Python Libraries in a Development Endpoint.

To install libraries on a remote EMR cluster, you can use a bootstrap action when you create the cluster. If you already connected an EMR cluster to the Amazon SageMaker notebook instance, then manually install the library on all cluster nodes:

1.    Connect to the master node using SSH.

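2.    Install the library for the Python interpreter that PySpark uses on the cluster (depending on the EMR release, this might be python3 rather than python). This example shows how to install pandas: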

sudo python -m pip install pandas

3.    Confirm that the module is installed successfully:

python -c "import pandas as pd; print(pd.__version__)"

4.    Open the Amazon SageMaker notebook instance, and then restart the kernel.

5.    To confirm that the library works as expected, run a command that requires the library. Example:

pdf = spark.sql("show databases").toPandas()

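6.    Connect to each of the other cluster nodes using SSH, and then repeat steps 2 and 3 on each node. Code that runs on a node where the library is missing fails with the same ModuleNotFoundError.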
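For new clusters, the bootstrap action mentioned before the manual steps avoids the per-node SSH work, because a bootstrap action runs on every node as the cluster starts. The following is a minimal sketch of how such an action might be described for an EMR RunJobFlow request (for example, through the BootstrapActions parameter of the boto3 run_job_flow call). The bucket name, script path, action name, and the bootstrap_action helper are all hypothetical placeholders. The script reuses the install command from step 2.

```python
# Hypothetical script contents. Upload this file to Amazon S3 before
# creating the cluster. It reuses the install command from step 2.
INSTALL_SCRIPT = """#!/bin/bash
set -e
sudo python -m pip install pandas
"""


def bootstrap_action(script_s3_path, name="install-python-libraries"):
    """Build one entry for the BootstrapActions list of a RunJobFlow request.

    bootstrap_action and its default name are illustrative assumptions.
    """
    if not script_s3_path.startswith("s3://"):
        raise ValueError("script_s3_path must be an s3:// URI")
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_s3_path, "Args": []},
    }


# The bucket and key below are placeholders, not real resources.
action = bootstrap_action("s3://my-example-bucket/bootstrap/install_libs.sh")
print(action)
```

A bootstrap action applies only when the cluster is created. For a cluster that is already running, the manual steps above still apply.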

If you don't need to run the code on the remote cluster or development endpoint, use the local notebook instance instead. For example, instead of installing matplotlib on each node of the Spark cluster, use local mode (%%local) to run the cell on the local notebook instance.

The following example shows how to export results to a local variable and then run code in local mode:

1.    Export the result to a local variable:

%%sql -o query1
SELECT 1, 2, 3

2.    Run the code locally:

%%local
print(len(query1))

You can also run a local Spark session on a notebook instance using the Amazon SageMaker Spark library. This allows you to use SageMakerEstimator estimators in a Spark pipeline. You can manipulate data through Spark using a local SparkSession. Then, use the Amazon SageMaker Spark library for training and inference. For more information, see the pyspark_mnist_kmeans example notebook on the AWS Labs GitHub repository. This example notebook uses the conda_python3 kernel and isn't backed by an EMR cluster. For jobs with heavy workloads, create a remote Spark cluster, and then connect it to the notebook instance.
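As a rough illustration of the local SparkSession mentioned above, the following sketch builds a session that runs entirely on the notebook instance. It assumes that pyspark is installed in the notebook's Python environment (for example, under the conda_python3 kernel). The helper names local_spark_settings and create_local_session are hypothetical. Any JAR files that the Amazon SageMaker Spark library needs would be passed in extra_jars; the example notebook shows how to obtain them.

```python
def local_spark_settings(extra_jars=()):
    """Return Spark settings for a session that runs on the notebook instance.

    extra_jars is an optional list of JAR paths to add to the driver classpath.
    """
    settings = {"spark.master": "local[*]"}  # use all local cores, no cluster
    if extra_jars:
        settings["spark.driver.extraClassPath"] = ":".join(extra_jars)
    return settings


def create_local_session(settings, app_name="local-notebook-session"):
    """Create (or reuse) a SparkSession from the given settings."""
    # Imported lazily so that this module loads even where pyspark is absent.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(local_spark_settings())
```

With pyspark available, calling create_local_session(local_spark_settings()) returns a session whose libraries come from the notebook instance itself, so a local pip installation is enough.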

