AWS Big Data Blog

Using SaltStack to Run Commands in Parallel on Amazon EMR

Miguel Tormo is a Big Data Support Engineer in AWS Premium Support

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. Amazon EMR defines three types of nodes: master node, core nodes, and task nodes.

It’s common to run commands on each node by using SSH agent forwarding and running a loop on the master node to connect through SSH to every core or task node. However, there are cases in which you might want to run commands on select nodes only (for example, to generate a report on a particular instance type). For this reason, it helps to have an alternative approach for automating command execution on Amazon EMR clusters.
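The SSH loop described above can be sketched roughly as follows. This is a minimal illustration, not part of any AWS tooling: the function name run_on_nodes, the hosts file (one node hostname per line), and the SSH_CMD override used for dry runs are all assumptions introduced here.

```shell
#!/usr/bin/env bash
# Sketch: run one command on every host listed in a file, one host at a time.
# Assumptions (not from the original post): the hosts file holds one hostname
# per line, and SSH agent forwarding is already set up on the master node.
# SSH_CMD is a hypothetical override so the loop can be dry-run with a stub.

run_on_nodes() {
  local hosts_file="$1"; shift
  local host
  while IFS= read -r host || [ -n "$host" ]; do
    # Skip blank lines in the hosts file.
    [ -z "$host" ] && continue
    # -n stops ssh from reading stdin, which would otherwise consume
    # the remaining lines of the hosts file.
    ${SSH_CMD:-ssh} -n "$host" "$@"
  done < "$hosts_file"
}
```

For example, `run_on_nodes nodes.txt uptime` (where nodes.txt is a hypothetical host list) would run uptime on each listed node in turn. Because the loop is sequential and visits every host in the file, targeting a subset of nodes means maintaining separate host lists by hand.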

SaltStack is an open source project for automation and configuration management. It started as a remote execution engine designed to scale to many machines while delivering high-speed execution. SaltStack uses its own protocol, which is based on the ZeroMQ library.

SaltStack bootstrap action

You can use the new bootstrap action that installs SaltStack on Amazon EMR. It provides a basic configuration that enables selective targeting of the nodes based on instance roles, instance groups, and other parameters. When an instance group is resized, each new node runs the bootstrap action, which installs SaltStack and registers the node with the master.

After your Amazon EMR cluster is up and running and SaltStack is successfully deployed, you can use the SaltStack CLI to configure and run commands on your cluster nodes.

Here are some examples of salt commands:

To check connectivity to all registered nodes

sudo salt '*' test.ping

To restart the YARN NodeManager on every task instance

sudo salt -N task service.restart hadoop-yarn-nodemanager
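Restarting every NodeManager at once can briefly remove all task capacity, so it may be preferable to restart in batches. The sketch below wraps the command above with Salt's -b (batch size) option. The function name rolling_restart, the 25% default, and the SALT_BIN override are assumptions introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: restart a service on a Salt nodegroup a few minions at a time.
# SALT_BIN is a hypothetical override so the function can be dry-run
# with a stub instead of the real salt CLI.

rolling_restart() {
  local nodegroup="$1" service="$2" batch="${3:-25%}"
  # -N targets a nodegroup; -b limits how many minions run the job at once
  # (either an absolute count or a percentage of the matched minions).
  ${SALT_BIN:-sudo salt} -N "$nodegroup" -b "$batch" service.restart "$service"
}
```

For example, `rolling_restart task hadoop-yarn-nodemanager` would restart the NodeManager on roughly a quarter of the task nodes at a time, and `rolling_restart task hadoop-yarn-nodemanager 2` would do so two nodes at a time.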

To run a report of a YARN queue

If SaltStack is installed in external or syndicated mode, you can print a status report of the default YARN queue for every registered EMR cluster that matches a given criterion. The following example runs the report on the master node of each cluster running EMR version 4.7.2:

sudo salt -C 'G@emr:version:4.7.2 and G@emr:instance_role:master' cmd.run 'yarn queue -status default'

To execute a script located on the master

This command takes the script stored at /srv/salt/myscript on the SaltStack master and runs it on every node in the ig-FFFFFFFFFFFF instance group:

sudo salt -G 'emr:instance_group_id:ig-FFFFFFFFFFFF' cmd.script salt://myscript
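The original post does not show the contents of myscript. As a purely hypothetical illustration, a script suitable for cmd.script could be as small as the sketch below, which prints the node's hostname and how full its root filesystem is. The function name node_report and the choice of metrics are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical contents for /srv/salt/myscript (not from the original post).
# Prints a two-line report for the node the script runs on.

node_report() {
  # uname -n prints the node's network hostname.
  printf 'host=%s\n' "$(uname -n)"
  # df -P prints one POSIX-format line per filesystem; field 5 is the use%.
  df -P / | awk 'NR==2 {print "root_used=" $5}'
}

node_report
```

Salt returns the script's output grouped by minion, so the hostname line is mainly useful if the output is later copied elsewhere.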

You can find the bootstrap action with instructions and more examples here:

https://github.com/awslabs/emr-bootstrap-actions/tree/master/saltstack

Happy salting! If you have a question or suggestion, please leave a comment below.