How do I determine whether to use a bootstrap action or a step on an Amazon EMR cluster?
Last updated: 2020-05-11
What are the use cases for running a bootstrap action or running a step on an Amazon EMR cluster?
Use bootstrap actions to install additional software on an EMR cluster. Use steps to submit work to an EMR cluster, or to process data.
- Bootstrap actions run after an EMR cluster transitions from the STARTING state to the BOOTSTRAPPING state. Bootstrap actions execute before core services, such as Hadoop or Spark, are installed. If a bootstrap action fails, the cluster doesn't start. For more information, see Understanding the Cluster Lifecycle.
- Bootstrap actions run on all cluster nodes. Bootstrap actions are scripts that run as the Hadoop user by default, but they can also run with root privileges by using the sudo command. You can configure bootstrap actions to run commands conditionally, based on instance-specific values in the instance.json or job-flow.json file.
Note: On Amazon EMR 2.x and 3.x releases, bootstrap actions execute after core services are installed. Most predefined bootstrap actions for Amazon EMR AMI versions 2.x and 3.x aren't supported in later Amazon EMR releases. For more information, see Create Bootstrap Actions to Install Additional Software.
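The conditional logic described for bootstrap actions might look something like the following Python sketch, which checks the node's role before deciding what to run. It assumes that the node metadata file is located at /mnt/var/lib/info/instance.json and that the file contains a Boolean isMaster field. Confirm both on your release version. The function name, constant name, and printed messages are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a bootstrap action that branches on the node's role."""
import json
from pathlib import Path

# Assumption: Amazon EMR writes node metadata to this path.
INSTANCE_INFO = Path("/mnt/var/lib/info/instance.json")


def is_master_node(info_path=INSTANCE_INFO):
    """Return True if the metadata file marks this node as the master node."""
    with open(info_path) as f:
        info = json.load(f)
    # Assumption: the file has a Boolean "isMaster" key.
    return bool(info.get("isMaster", False))


if __name__ == "__main__":
    if not INSTANCE_INFO.exists():
        print(f"{INSTANCE_INFO} not found; is this an EMR node?")
    elif is_master_node():
        print("Master node: run master-only setup here")
    else:
        print("Core or task node: run worker setup here")
```

To try this, you would upload the script to Amazon S3 and reference it as a bootstrap action when you launch the cluster. Replace the print statements with the installation commands for each role.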
- A step is a unit of work that contains one or more Hadoop jobs. Steps are usually used to transfer or process data. One step might submit work to a cluster. Other steps might process the submitted data and then send the processed data to a particular location.
- Steps start after bootstrap actions and run only on the master node. For more information, see Running Steps to Process Data.
- In Amazon EMR release versions 5.28.0 and later, you can run multiple steps in parallel. In earlier Amazon EMR release versions, steps complete their work sequentially.
- When you configure a step, you can choose what happens if the step fails: continue to the next step, cancel the remaining steps and wait, or terminate the cluster.
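To illustrate how a step and its failure behavior can be defined, the following Python sketch builds a step definition and, optionally, submits it with the boto3 add_job_flow_steps call. It is a minimal example under several assumptions: the cluster runs a release version that provides command-runner.jar, the S3 path and step name are placeholders, and the helper names build_step and submit_step are hypothetical.

```python
# ActionOnFailure values used here. The API also accepts TERMINATE_JOB_FLOW
# for backward compatibility.
VALID_ACTIONS = {"CONTINUE", "CANCEL_AND_WAIT", "TERMINATE_CLUSTER"}


def build_step(name, args, action_on_failure="CONTINUE"):
    """Build one step definition in the shape that add_job_flow_steps expects."""
    if action_on_failure not in VALID_ACTIONS:
        raise ValueError(f"Unsupported ActionOnFailure: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # Assumption: command-runner.jar is available on the cluster.
            "Jar": "command-runner.jar",
            "Args": list(args),
        },
    }


def submit_step(cluster_id, step, region="us-east-1"):
    """Submit the step to a running cluster. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so build_step works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


# Placeholder step: the bucket and script path are illustrative only.
step = build_step(
    "Example Spark step",
    ["spark-submit", "s3://amzn-s3-demo-bucket/scripts/job.py"],
    action_on_failure="CANCEL_AND_WAIT",
)
```

With CANCEL_AND_WAIT, a failure of this step cancels the steps queued behind it and leaves the cluster running. To submit the step, call submit_step with your own cluster ID.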
For more information about steps, see Work with Steps Using the AWS CLI and Console.