How do I configure Amazon EMR to run a PySpark job using Python 3.4 or 3.6?

Python 3.4 or 3.6 is installed on my Amazon EMR cluster instances, but Spark is running Python 2.7. I want to upgrade Spark to use Python 3.4 or 3.6.

Short description

In most Amazon EMR release versions, cluster instances and system applications use different Python versions by default:

  • Amazon EMR release versions 4.6.0-5.19.0: Python 3.4 is installed on the cluster instances. Python 2.7 is the system default.
  • Amazon EMR release versions 5.20.0 and later: Python 3.6 is installed on the cluster instances. For 5.20.0-5.29.0, Python 2.7 is the system default. For Amazon EMR version 5.30.0 and later, Python 3 is the system default.

To upgrade the Python version that PySpark uses, set the PYSPARK_PYTHON environment variable for the spark-env classification to the path of the Python 3.4 or 3.6 executable (for example, /usr/bin/python3).
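
If you're not sure of the executable path on your instances, you can check it from an SSH session on the master node. For example:

# Print the location and version of the Python 3 executable on this node
which python3
python3 --version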

Resolution

On a running cluster

Amazon EMR release version 5.21.0 and later

Submit a reconfiguration request with a configuration object similar to the following:

[
  {
     "Classification": "spark-env",
     "Configurations": [
       {
         "Classification": "export",
         "Properties": {
            "PYSPARK_PYTHON": "/usr/bin/python3"
          }
       }
    ]
  }
]
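
You can submit the reconfiguration request from the Amazon EMR console or with the AWS CLI. The following is a minimal sketch for a cluster that uses uniform instance groups; the cluster ID, instance group ID, and file name are placeholders that you replace with your own values:

# Save the reconfiguration request to a file (placeholder IDs shown)
cat << 'EOF' > reconfiguration.json
{
  "ClusterId": "j-EXAMPLECLUSTERID",
  "InstanceGroups": [
    {
      "InstanceGroupId": "ig-EXAMPLEGROUPID",
      "Configurations": [
        {
          "Classification": "spark-env",
          "Configurations": [
            {
              "Classification": "export",
              "Properties": {
                "PYSPARK_PYTHON": "/usr/bin/python3"
              }
            }
          ]
        }
      ]
    }
  ]
}
EOF

# Submit the reconfiguration request
aws emr modify-instance-groups --cli-input-json file://reconfiguration.json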

Amazon EMR release version 4.6.0-5.20.x

1.    Connect to the master node using SSH.

2.    Run the following command to change the default Python environment:

sudo sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /etc/spark/conf/spark-env.sh
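
This appends export PYSPARK_PYTHON=/usr/bin/python3 to the end of spark-env.sh. To confirm that the line was added, you can print the last line of the file:

tail -n 1 /etc/spark/conf/spark-env.sh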

3.    Run the pyspark command to confirm that PySpark is using the correct Python version:

[hadoop@ip-X-X-X-X conf]$ pyspark

The output shows that PySpark is now using the same Python version that is installed on the cluster instances. Example:

Python 3.4.8 (default, Apr 25 2018, 23:50:36)

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.4.8 (default, Apr 25 2018 23:50:36)
SparkSession available as 'spark'.

Spark uses the new configuration for the next PySpark job.
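
If you want to confirm the version from a batch job rather than the interactive shell, you can submit a small test script. This is only a sketch; the file name is arbitrary, and the script prints the Python version used by the driver and by an executor task:

cat << 'EOF' > /home/hadoop/python_version_check.py
import sys
from pyspark import SparkContext

def executor_python_version(_):
    # Runs inside an executor task, so it reports the executor's interpreter
    import sys
    return sys.version

sc = SparkContext(appName="python-version-check")
print("Driver Python:   " + sys.version)
print("Executor Python: " + sc.parallelize([0]).map(executor_python_version).first())
sc.stop()
EOF

spark-submit /home/hadoop/python_version_check.py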

On a new cluster

Add a configuration object similar to the following when you launch a cluster using Amazon EMR release version 4.6.0 or later:

[
  {
     "Classification": "spark-env",
     "Configurations": [
       {
         "Classification": "export",
         "Properties": {
            "PYSPARK_PYTHON": "/usr/bin/python3"
          }
       }
    ]
  }
]
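
For example, if you launch the cluster with the AWS CLI, save the configuration object to a file and pass it with the --configurations option. This is a minimal sketch; the cluster name, release label, instance settings, and file name are example values that you adjust for your environment:

aws emr create-cluster \
  --name "spark-python3-example" \
  --release-label emr-5.29.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://spark-python3.json

Here, spark-python3.json contains the configuration object shown above.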

Related information

Configure Spark

Apache Spark

PySpark documentation
