AWS Big Data Blog
Build, deploy, and run Spark jobs on Amazon EMR with the open-source EMR CLI tool
Today, we’re pleased to introduce the Amazon EMR CLI, a new command line tool to package and deploy PySpark projects across different Amazon EMR environments. With the introduction of the EMR CLI, you now have a simple way to not only deploy a wide range of PySpark projects to remote EMR environments, but also integrate with your CI/CD solution of choice.
In this post, we show how you can use the EMR CLI to create a new PySpark project from scratch and deploy it to Amazon EMR Serverless in one command.
Overview of solution
The EMR CLI is an open-source tool to help improve the developer experience of developing and deploying jobs on Amazon EMR. When you’re just getting started with Apache Spark, there are a variety of options with respect to how to package, deploy, and run jobs that can be overwhelming or require deep domain expertise. The EMR CLI provides simple commands for these actions that remove the guesswork from deploying Spark jobs. You can use it to create new projects or alongside existing PySpark projects.
In this post, we walk through creating a new PySpark project that analyzes weather data from the NOAA Global Surface Summary of Day open dataset. We’ll use the EMR CLI to do the following:
- Initialize the project.
- Package the dependencies.
- Deploy the code and dependencies to Amazon Simple Storage Service (Amazon S3).
- Run the job on EMR Serverless.
For this walkthrough, you should have the following prerequisites:
- An AWS account
- An EMR Serverless application in the
- An S3 bucket for your code and logs in the
- An AWS Identity and Access Management (IAM) job role that can run EMR Serverless jobs and access S3 buckets
- Python version >= 3.7
If you don’t already have an existing EMR Serverless application, you can use the following AWS CloudFormation template or use the
emr bootstrap command after you’ve installed the CLI.
Install the EMR CLI
You can find the source for the EMR CLI in the GitHub repo, but it’s also distributed via PyPI. It requires Python version >= 3.7 to run and is tested on macOS, Linux, and Windows. To install the latest version, use the following command:
You should now be able to run the
emr --help command and see the different subcommands you can use:
If you didn’t already create an EMR Serverless application, the
bootstrap command can create a sample environment for you and a configuration file with the relevant settings. Assuming you used the provided CloudFormation stack, set the following environment variables using the information on the Outputs tab of your stack. Set the Region in the terminal to
us-east-1 and set a few other environment variables we’ll need along the way:
us-east-1 because that’s where the NOAA GSOD data bucket is. EMR Serverless can access S3 buckets and other AWS resources in the same Region by default. To access other services, configure EMR Serverless with VPC access.
Initialize a project
Next, we use the
emr init command to initialize a default PySpark project for us in the provided directory. The default templates create a standard Python project that uses
pyproject.toml to define its dependencies. In this case, we use Pandas and PyArrow in our script, so those are already pre-populated.
After the project is initialized, you can run
cd my-project or open the
my-project directory in your code editor of choice. You should see the following set of files:
Note that we also have a Dockerfile here. This is used by the
package command to ensure that our project dependencies are built on the right architecture and operating system for Amazon EMR.
If you use Poetry to manage your Python dependencies, you can also add a
--project-type poetry flag to the
emr init command to create a Poetry project.
If you already have an existing PySpark project, you can use
emr init --dockerfile to create the Dockerfile necessary to package things up.
Run the project
Now that we’ve got our sample project created, we need to package our dependencies, deploy the code to Amazon S3, and start a job on EMR Serverless. With the EMR CLI, you can do all of that in one command. Make sure to run the command from the
This command performs several actions:
- Auto-detects the type of Spark project in the current directory.
- Initiates a build for your project to package up dependencies.
- Copies your entry point and resulting build files to Amazon S3.
- Starts an EMR Serverless job.
- Waits for the job to finish, exiting with an error status if it fails.
You should now see the following output in your terminal as the job begins running in EMR Serverless:
And that’s it! If you want to run the same code on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), you can replace
--cluster-id j-11111111. The CLI will take care of sending the right
spark-submit commands to your EMR cluster.
Now let’s walk through some of the other commands.
PySpark projects can be packaged in numerous ways, from a single .py file to a complex Poetry project with various dependencies. The EMR CLI can help consistently package your projects without having to worry about the details.
For example, if you have a single .py file in your project directory, the
package command doesn’t need to do anything. If, however, you have multiple .py files in a typical Python project style, the
emr package command will zip these files up as a package that can later be uploaded to Amazon S3 and provided to your PySpark job using the
--py-files option. If you have third party dependencies defined in
emr package will create a virtual environment archive and start your EMR job with the
The EMR CLI also supports Poetry for dependency management and packaging. If you have a Poetry project with a corresponding
poetry.lock file, there’s nothing else you need to do. The
emr package command will detect your
poetry.lock file and automatically build the project using the Poetry Bundle plugin. You can use a Poetry project in two ways:
- Create a project using the
emr initcommand. The commands take a
--project-typepoetry option that create a Poetry project for you:
- If you have a pre-existing project, you can use the
emr init --dockerfileoption, which creates a Dockerfile that is automatically used when you run
Finally, as noted earlier, the EMR CLI provides you a default Dockerfile based on Amazon Linux 2 that you can use to reliably build package artifacts that are compatible with different EMR environments.
emr deploy command takes care of copying the necessary artifacts for your project to Amazon S3, so you don’t have to worry about it. Regardless of how the project is packaged,
emr deploy will copy the resulting files to your Amazon S3 location of choice.
One use case for this is with CI/CD pipelines. Sometimes you want to deploy a specific version of code to Amazon S3 to be used in your data pipelines. With
emr deploy, this is as simple as changing the
For example, let’s assume you’ve already packaged your project using the
emr package command. Most CI/CD pipelines allow you to access the git tag. You can use that as part of the
emr deploy command to deploy a new version of your artifacts. In GitHub actions, this is
github.ref_name, and you can use this in an action to deploy a versioned artifact to Amazon S3. See the following code:
In your downstream jobs, you could then update the location of your entry point files to point to this new location when you’re ready, or you can use the
emr run command discussed in the next section.
Let’s take a quick look at the
emr run command. We’ve used it before to package, deploy, and run in one command, but you can also use it to run on already-deployed artifacts. Let’s look at the specific options:
If you want to run your code on EMR Serverless, the
emr run command takes an
--job-role parameters. If you want to run on EMR on EC2, you only need the
Required for both options are
--entry-point is the main script that will be called by Amazon EMR. If you have any dependencies,
--s3-code-uri is where they get uploaded to using the emr deploy command, and the EMR CLI will build the relevant spark-submit properties pointing to these artifacts.
There are a few different ways to customize the job:
- –job-name – Allows you to specify the job or step name
- –job-args – Allows you to provide command line arguments to your script
- –spark-submit-opts – Allows you to add additional
--conf spark.jarsor others
- –show-stdout – Currently only works with single-file .py jobs on EMR on EC2, but will display
stdoutin your terminal after the job is complete
As we’ve seen before,
--build invokes both the
deploy commands. This makes it easier to iterate on local development when your code still needs to run remotely. You can simply use the same
emr run command over and over again to build, deploy, and run your code in your environment of choice.
The EMR CLI is under active development. Updates are currently in progress to support Amazon EMR on EKS and allow for the creation of local development environments to make local iteration of Spark jobs even easier. Feel free to contribute to the project in the GitHub repository.
To avoid incurring future charges, stop or delete your EMR Serverless application. If you used the CloudFormation template, be sure to delete your stack.
With the release of the EMR CLI, we’ve made it easier for you to deploy and run Spark jobs on EMR Serverless. The utility is available as open source on GitHub. We’re planning a host of new functionalities; if there are specific requests you have, feel free to file an issue or open a pull request!
About the author
Damon is a Principal Developer Advocate on the EMR team at AWS. He’s worked with data and analytics pipelines for over 10 years and splits his team between splitting service logs and stacking firewood.