I know that Python 3.4.3 is installed on the instances of my Amazon EMR cluster, but the default Python version used by Spark and other programs is Python 2.7.10. How do I change the default Python version to Python 3 and run a pyspark job?

To change the default Python environment when launching the EMR cluster, update your EMR configuration file and set the PYSPARK_PYTHON environment variable to the following path:

/usr/bin/python3

After making the necessary changes, your EMR configuration file will contain JSON similar to the following:

[
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "/usr/bin/python3"
                }
            }
        ]
    }
]
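For example, assuming that you saved this JSON as myConfig.json (the file name, release label, and instance settings below are placeholders for illustration), you can apply the configuration when creating the cluster with the AWS CLI:

$ aws emr create-cluster --release-label emr-4.6.0 \
    --applications Name=Spark \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations file://myConfig.json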

To run a pyspark job with the Python 3 runtime without changing the Spark defaults, you can instead pass the PYSPARK_PYTHON environment variable when submitting the script (this example assumes that your script is located at s3://mybucket/mypath/myscript.py):

$ command-runner.jar spark-submit --deploy-mode cluster --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 s3://mybucket/mypath/myscript.py

Note that spark-submit ignores --conf properties that don't start with "spark.", so the environment variable must be set for the YARN application master through spark.yarn.appMasterEnv.PYSPARK_PYTHON, as shown above.
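Also note that command-runner.jar is normally invoked as an EMR step rather than typed directly into a shell. As a sketch (the cluster ID and step name are placeholders), the same command can be submitted as a step with the AWS CLI:

$ aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \
    --steps Type=CUSTOM_JAR,Name=SparkPython3Job,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,--conf,spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3,s3://mybucket/mypath/myscript.py]

You can then confirm which interpreter the job used by printing sys.version from within the script and checking the step's stdout logs.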


Published: 2016-10-26