How do I determine whether to use a bootstrap action or a step on an Amazon EMR cluster?
Last updated: 2020-03-16
What are the use cases for running a bootstrap action or running a step on an Amazon EMR cluster?
Bootstrap actions are usually used to install additional software on an EMR cluster. Steps are used to submit work to an EMR cluster, or to process data.
- Bootstrap actions are the first thing to run after an EMR cluster transitions from the STARTING state to the BOOTSTRAPPING state. Because bootstrap actions execute before core services, such as Hadoop or Spark, are installed, the cluster doesn't start if a bootstrap action fails. For more information, see Understanding the Cluster Lifecycle.
- Bootstrap actions run on all cluster nodes. They are scripts that run as the Hadoop user by default. To run commands as the root user, use sudo inside the script. You can configure bootstrap actions to run commands conditionally, based on instance-specific values in the instance.json or job-flow.json file.
Note: On Amazon EMR 2.x and 3.x releases, bootstrap actions execute after core services are installed. Most predefined bootstrap actions for Amazon EMR AMI versions 2.x and 3.x aren't supported in later Amazon EMR releases. For more information, see Create Bootstrap Actions to Install Additional Software.
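The conditional behavior mentioned above can be sketched as follows. This is a minimal illustration, not an official script: it assumes that instance.json is located at /mnt/var/lib/info/instance.json and contains a Boolean isMaster field, and the action names returned by choose_action are hypothetical placeholders.

```python
import json
from pathlib import Path

# Assumed default location of the instance metadata file on an EMR node.
INSTANCE_INFO = Path("/mnt/var/lib/info/instance.json")


def is_master_node(info_path=INSTANCE_INFO):
    """Return True when instance.json marks this node as the master node."""
    with open(info_path) as f:
        info = json.load(f)
    # "isMaster" is the field this sketch assumes; a missing key counts as False.
    return bool(info.get("isMaster", False))


def choose_action(info_path=INSTANCE_INFO):
    """Pick a hypothetical bootstrap task based on the node's role."""
    if is_master_node(info_path):
        return "install-master-only-tools"
    return "install-worker-tools"
```

A bootstrap script could call choose_action() on each node and branch on the result, so that the same script is safe to run on every node in the cluster.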
- A step is a unit of work that contains one or more Hadoop jobs. Steps are usually used to transfer or process data. One step might submit work to a cluster. Other steps might process the submitted data and then send the processed data to a particular location.
- Steps start after bootstrap actions finish and run only on the master node. By default, steps run sequentially, one at a time. For more information, see Running Steps to Process Data.
- When you configure a step, you can choose what happens if the step fails: continue to the next step, cancel the remaining steps and wait, or terminate the cluster.
For more information about steps, see Work with Steps Using the AWS CLI and Console.
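As a rough illustration of configuring failure behavior, the following sketch builds a step definition in the dictionary shape that the boto3 EMR client accepts for add_job_flow_steps. The helper name build_step, the step name, and the S3 path are hypothetical, and the use of command-runner.jar assumes a release that provides it (Amazon EMR 4.x or later).

```python
# Failure behaviors a step can request in its ActionOnFailure field.
VALID_FAILURE_ACTIONS = {"TERMINATE_CLUSTER", "CANCEL_AND_WAIT", "CONTINUE"}


def build_step(name, args, action_on_failure="CONTINUE"):
    """Build a step definition dictionary (hypothetical helper)."""
    if action_on_failure not in VALID_FAILURE_ACTIONS:
        raise ValueError(f"Unsupported ActionOnFailure: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar runs the given command on the master node.
            "Jar": "command-runner.jar",
            "Args": list(args),
        },
    }


# Example: cancel the remaining steps, but keep the cluster, if this step fails.
step = build_step(
    "process-data",
    ["spark-submit", "s3://example-bucket/job.py"],  # placeholder path
    action_on_failure="CANCEL_AND_WAIT",
)

# With boto3 installed and credentials configured, the step could be submitted with:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster-id>", Steps=[step])
```

Building the dictionary separately from the API call keeps the failure policy easy to review and test before anything is submitted to a cluster.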