What are the use cases for running a bootstrap action or running a step on my EMR cluster?

Both a bootstrap action and an EMR step are used to complete some work or task on an Amazon EMR cluster. The distinctions between them are determined by when and where they run during the life cycle of a cluster and the type of work that they do.

Bootstrap Action

As described in the diagram at Life Cycle of a Cluster, bootstrap actions are the first thing to run after an Amazon EMR cluster has been provisioned and has transitioned from the STARTING cluster state to the BOOTSTRAPPING cluster state. Bootstrap actions are scripts that run on all cluster nodes. They run as the hadoop user by default, but they can also run with root privileges by using the sudo command. You can specify up to 16 bootstrap actions per cluster by providing multiple bootstrap-action parameters from the console, AWS CLI, or API. Bootstrap actions can be used to install additional software on your cluster, and they can be configured to run commands conditionally based on instance-specific values in the instance.json or job-flow.json file. Because bootstrap actions execute before core services such as Hadoop or Spark are installed, the cluster will not start if a bootstrap action fails.
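The conditional behavior described above can be illustrated with a minimal sketch of a bootstrap action written in Python. It assumes that the instance metadata file is located at /mnt/var/lib/info/instance.json and that it contains a boolean isMaster key; both the path and the key name are assumptions to verify on your own cluster, and the function names are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical bootstrap action: run extra setup only on the master node."""
import json

# Assumed location of the instance metadata file on an EMR node.
INSTANCE_INFO_PATH = "/mnt/var/lib/info/instance.json"


def is_master_node(path=INSTANCE_INFO_PATH):
    """Return True if the instance metadata marks this node as the master."""
    with open(path) as handle:
        info = json.load(handle)
    # "isMaster" is the key assumed here; a missing key is treated as False.
    return bool(info.get("isMaster", False))


def main(path=INSTANCE_INFO_PATH):
    """Run the conditional setup and return a process exit code."""
    if is_master_node(path):
        print("master node: running master-only setup")
        # Placeholder for master-only work, such as installing a client tool.
    else:
        print("core or task node: skipping master-only setup")
    # Returning a non-zero code here would mark the bootstrap action as failed.
    return 0
```

To use this as a script, end the file with a call such as sys.exit(main()) (after importing sys) so that the return value becomes the exit code of the bootstrap action.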

Note: On AMI versions 2.x and 3.x of Amazon EMR, bootstrap actions execute after core services such as Hadoop or Spark are installed. Most predefined bootstrap actions for Amazon EMR AMI version 2.x and 3.x are not supported in Amazon EMR releases 4.x. For more information, see (Optional) Create Bootstrap Actions to Install Additional Software.

Step

A step is a distinct unit of work, comprising one or more Hadoop jobs that run only on the master node of an Amazon EMR cluster. Because a cluster does not start if a bootstrap action fails, steps always start after bootstrap actions have completed. Steps are primarily focused on transferring or processing data. One step might submit work to a cluster, and others might process the submitted data and send the processed data to a particular location. Steps complete their work sequentially, as depicted in the diagram at Steps. When configuring a step, you can choose what happens if the step fails (for example, continue to the next step, cancel the remaining steps and wait, or terminate the cluster), which provides a measure of fault tolerance. For more information about creating steps, see Add Steps Using the CLI and Console.
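As a rough illustration of choosing a failure behavior, the following sketch builds a step definition in the shape accepted by the boto3 EMR client's add_job_flow_steps call. The helper names build_step and submit_steps are hypothetical. The use of command-runner.jar assumes an Amazon EMR release 4.x or later cluster, and the set of failure actions shown is an assumption to check against the API reference for your release.

```python
"""Sketch: build an EMR step definition with an explicit failure policy."""

# Failure policies assumed from the EMR API's ActionOnFailure field.
VALID_ACTIONS = {"CONTINUE", "CANCEL_AND_WAIT", "TERMINATE_CLUSTER"}


def build_step(name, args, action_on_failure="CONTINUE"):
    """Return a step dictionary suitable for add_job_flow_steps."""
    if action_on_failure not in VALID_ACTIONS:
        raise ValueError("unsupported ActionOnFailure: %r" % action_on_failure)
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # Assumes command-runner.jar is available on the cluster.
            "Jar": "command-runner.jar",
            "Args": list(args),
        },
    }


def submit_steps(cluster_id, steps):
    """Submit steps to a running cluster and return the new step IDs."""
    # Imported lazily so build_step can be used without boto3 installed.
    import boto3

    client = boto3.client("emr")
    response = client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    return response["StepIds"]
```

For example, build_step("process-data", ["spark-submit", "s3://my-bucket/job.py"], "CANCEL_AND_WAIT") describes a step that, on failure, cancels the remaining steps and leaves the cluster waiting. The bucket and script names are placeholders.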


Published: 2016-10-28