How can I resolve the ModuleNotFoundError on an Amazon SageMaker notebook that's running the Sparkmagic kernel?


I'm trying to run an Amazon SageMaker notebook instance with the Sparkmagic (PySpark) kernel. I used pip to install the Python libraries on the notebook instance, but I get the following error: "ModuleNotFoundError: No module named 'my_module_name'".

Short description

When you use the Sparkmagic kernel, the Amazon SageMaker notebook acts as an interface for the Apache Spark session that's running on a remote Amazon EMR cluster or an AWS Glue development endpoint.

When you use pip to install the Python library on the notebook instance, the library is available only to the local notebook instance. To resolve the ModuleNotFoundError, install the library on the AWS Glue development endpoint or on each node of the EMR cluster.

Note: If the code that uses the library isn’t compute intensive, you can use local mode (%%local). Local mode runs the cell on the local notebook instance only. When using local mode, you don't have to install the library on the remote cluster or development endpoint.
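One way to see which side is missing the library is to run the same availability check twice: once in a regular Sparkmagic cell, which runs on the remote cluster, and once in a %%local cell, which runs on the notebook instance. The following is a minimal sketch that uses only the Python standard library and assumes Python 3 on both sides. The helper names module_available and describe_environment are illustrative and aren't part of Sparkmagic or SageMaker.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if `name` can be found by the interpreter running this cell."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for dotted names whose parent package is missing,
        # or for modules that are loaded with a broken __spec__.
        return False


def describe_environment(name: str) -> str:
    """Report whether `name` is importable and which interpreter was checked."""
    status = "available" if module_available(name) else "missing"
    return f"{name} is {status} in {sys.executable}"


# Run this line in a normal cell (remote) and again under %%local (notebook).
print(describe_environment("pandas"))
```

If the library is reported as missing only in the regular cell, then it's installed on the notebook instance but not on the remote cluster or development endpoint.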

Resolution

To install libraries on a remote AWS Glue development endpoint, see Loading Python libraries in a development endpoint.

To install libraries on a remote EMR cluster, you can use a bootstrap action when you create the cluster. If you already connected an EMR cluster to the Amazon SageMaker notebook instance, then manually install the library on all cluster nodes:

1. Connect to the primary (master) node using SSH.

2. Install the library. This example shows how to install pandas:

sudo python -m pip install pandas

3. Confirm that the module is installed successfully:

python -c "import pandas as pd; print(pd.__version__)"

4. Open the Amazon SageMaker notebook instance, and then restart the kernel.

5. To confirm that the library works as expected, run a command that requires the library. Example:

pdf = spark.sql("show databases").toPandas()

6. Connect to the other cluster nodes using SSH, and then install the library on each node.
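On clusters with many nodes, step 6 becomes repetitive. As a rough sketch, a small script can generate the per-node install command. It assumes that each node is reachable over SSH from the machine where you run it and that the login user is hadoop, the usual SSH user on EMR nodes. The host names and key file path are hypothetical placeholders. The script only prints the commands as a dry run and doesn't run them.

```python
import os
import shlex


def build_install_command(host, package, key_file, user="hadoop"):
    """Build the ssh argument list that installs `package` with pip on `host`."""
    # Same install command as in step 2, with the package name shell-quoted.
    remote = f"sudo python -m pip install {shlex.quote(package)}"
    return ["ssh", "-i", os.path.expanduser(key_file), f"{user}@{host}", remote]


if __name__ == "__main__":
    # Hypothetical node addresses; replace with the nodes of your cluster.
    nodes = ["ip-10-0-0-11.ec2.internal", "ip-10-0-0-12.ec2.internal"]
    for host in nodes:
        cmd = build_install_command(host, "pandas", "~/.ssh/my-key.pem")
        # Dry run: print the command. To run it, pass `cmd` to
        # subprocess.run(cmd, check=True) instead.
        print(shlex.join(cmd))
```

Printing the commands first makes it easier to review them before anything runs on the cluster.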

If you don't need to run the code on the remote cluster or development endpoint, then use the local notebook instance instead. For example, instead of installing matplotlib on each node of the Spark cluster, use local mode (%%local) to run the cell on the local notebook instance.

The following example shows how to export results to a local variable and then run code in local mode:

1. Export the result to a local variable:

%%sql -o query1
SELECT 1, 2, 3

2. Run the code locally:

%%local
print(len(query1))

You can also run a local Spark session on a notebook instance using the Amazon SageMaker Spark library. With this library, you can use SageMakerEstimator estimators in a Spark pipeline: manipulate data through a local SparkSession, and then use the Amazon SageMaker Spark library for training and inference. For more information, see the pyspark_mnist_kmeans example notebook on the AWS Labs GitHub repository. This example notebook uses the conda_python3 kernel and isn't backed by an EMR cluster.

For jobs with heavy workloads, create a remote Spark cluster, and then connect it to the notebook instance.

Related information

Use Apache Spark with Amazon SageMaker

