如何解决 Amazon EMR 上的 Spark 中的“java.lang.ClassNotFoundException”异常?

上次更新日期:2022 年 3 月 2 日

我在 Amazon EMR 上使用 spark-submit 或 PySpark 任务中的自定义 JAR 文件时,收到 java.lang.ClassNotFoundException 错误。

简短描述

如果满足以下任一条件,则会发生此错误:

  • spark-submit 任务无法在类路径中找到相关文件。
  • 引导操作或自定义配置覆盖了类路径。发生此情况时,类加载器会仅选择您在配置中指定的位置中存在的 JAR 文件。

解决方法

检查堆栈跟踪以查找缺失类的名称。然后,将您的自定义 JAR 的路径(包含缺失的类)添加到 Spark 类路径中。您可以在集群运行、启动新集群或者提交任务时执行此操作。

正在运行的集群上

/etc/spark/conf/spark-defaults.conf 中,将您的自定义 JAR 的路径附加到错误堆栈跟踪中指定的类名称。在以下示例中,/home/hadoop/extrajars/* 是自定义 JAR 路径。

sudo vim /etc/spark/conf/spark-defaults.conf
spark.driver.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*
spark.executor.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*

在新集群上

在创建集群时提供配置对象,从而在 /etc/spark/conf/spark-defaults.conf 中将自定义 JAR 路径附加到现有类路径中。

注意:要使用此选项,您必须使用 Amazon EMR 发行版本 5.14.0 或更高版本创建集群。

对于 Amazon EMR 5.14.0 到 Amazon EMR 5.17.0,请包含以下内容:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*"
    }
  }
]

对于 Amazon EMR 5.17.0 到 Amazon EMR 5.18.0,请添加 /usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar 作为其他 JAR 路径:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]

对于 Amazon EMR 5.19.0 到 Amazon EMR 5.32.0,请按如下方式更新 JAR 路径:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]

对于 Amazon EMR 5.33.0 到 Amazon EMR 5.34.0,请按如下方式更新 JAR 路径:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/"
    }
  }
]

对于 Amazon EMR 发行版本 6.0.0 及更高版本,使用配置更新 JAR 路径不起作用。对于这些版本,.conf 文件包含多个 jar 文件路径。此外,您要更新的每个属性的配置长度不能超过 1024 个字符。但是,您可以添加引导操作以将自定义 JAR 位置传递给 spark-defaults.conf。有关更多信息,请参阅如何在引导阶段之后更新所有的 Amazon EMR 节点?

创建与以下类似的 bash 脚本:

注意:

  • 请务必将 s3://doc-example-bucket/Bootstraps/script_b.sh 替换为您选择的 Amazon Simple Storage Service(Amazon S3)路径。
  • 请务必将 /home/hadoop/extrajars/* 替换为您的自定义 JAR 文件路径。
  • 请确保 Amazon EMR 执行角色拥有访问此 S3 存储桶的权限。
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
  sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0

启动 EMR 集群并添加类似以下内容的引导操作

#!/bin/bash
pwd
aws s3 cp s3://doc-example-bucket/Bootstraps/script_b.sh .
chmod +x script_b.sh
nohup ./script_b.sh &

对于单个任务

使用 --jars 选项在您运行 spark-submit 时传入自定义 JAR 路径。

示例:

spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi --master yarn spark-examples.jar 100 --jars /home/hadoop/extrajars/*

注意:要防止类冲突时,请勿在使用 --jars 选项时包含标准 JAR。例如,请勿包含 spark-core.jar,因为它已经存在于集群中。

有关配置分类的更多信息,请参阅配置 Spark