How do I configure Amazon EMR to run a PySpark job using Python 3.4 or 3.6?

Last updated: 2019-05-09

Python 3.4 or 3.6 is installed on my Amazon EMR cluster instances, but Spark is running Python 2.7. How do I upgrade Spark to Python 3.4 or 3.6?

Short Description

In most Amazon EMR release versions, cluster instances and system applications use different Python versions by default:

  • Amazon EMR release versions 4.6.0-5.19.0: Python 3.4 is installed on the cluster instances. Python 2.7 is the system default.
  • Amazon EMR release versions 5.20.0 and later: Python 3.6 is installed on the cluster instances. Python 2.7 is the system default.

To upgrade the Python version that PySpark uses, set the PYSPARK_PYTHON environment variable for the spark-env classification to the path of the Python 3.4 or 3.6 interpreter (for example, /usr/bin/python3).


On a running cluster

Run the following command to change the default Python environment:

sudo sed -i -e '$a\export PYSPARK_PYTHON=/usr/bin/python3' /etc/spark/conf/spark-env.sh

Run the pyspark command to confirm that PySpark is using the correct version of Python:

[hadoop@ip-X-X-X-X conf]$ pyspark

The output shows that PySpark is now using the same Python version that is installed on the cluster instances. Example:

Python 3.4.8 (default, Apr 25 2018, 23:50:36)

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.4.8 (default, Apr 25 2018 23:50:36)
SparkSession available as 'spark'.

Spark will use the new configuration for the next PySpark job.
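PySpark resolves the worker interpreter from PYSPARK_PYTHON when a SparkSession starts, so the variable can also be set in the driver process before the session is created, which is convenient for scripts submitted outside the PySpark shell. A minimal sketch, assuming Python 3 is installed at /usr/bin/python3 as on the EMR instances above:

```python
import os

# Setting PYSPARK_PYTHON before the SparkSession is built has the same
# effect as exporting it in /etc/spark/conf/spark-env.sh: workers launch
# with this interpreter for subsequent jobs.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

print(os.environ["PYSPARK_PYTHON"])  # → /usr/bin/python3
```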

On a new cluster

Add a configuration object similar to the following when you launch a cluster using Amazon EMR release version 4.6.0 or later:

[
  {
     "Classification": "spark-env",
     "Configurations": [
       {
         "Classification": "export",
         "Properties": {
            "PYSPARK_PYTHON": "/usr/bin/python3"
          }
       }
    ]
  }
]