Modify your cluster on the fly with Amazon EMR reconfiguration

April 2024: This post was reviewed for accuracy.

If you are a developer or data scientist using long-running Amazon EMR clusters, you face fast-changing workloads. These changes often require different application configurations to run optimally on your cluster.

With the reconfiguration feature, you can now change configurations on running EMR clusters. Starting with EMR release emr-5.21.0, this feature allows you to modify configurations without creating a new cluster or manually connecting by SSH into each node.

In this post, I go over the following topics:

Using reconfiguration
Instance group states, configuration versions, and events
Reconfiguration example use cases
Reconfiguration benefits

Using reconfiguration

The following tasks are updated in EMR release emr-5.21.0:

Submitting a reconfiguration
Modifying configurations
Defining configuration levels

Submitting a reconfiguration

You can submit a recognition through the EMR console, SDK, or AWS CLI. For more information, see submitting a reconfiguration and additional information.

Modifying configurations

When submitting a reconfiguration, you must include all of the configurations you want to apply to the cluster. The update only applies those items, removing all others. As you modify configurations, the EMR console also tracks your previous cluster configurations for you.

Defining configuration levels

Define cluster-level and instance-group-level configurations for your applications. Supply cluster-level configurations as you create a cluster. These configurations are then automatically applied to all your instance groups, even those added after the cluster’s up and running. After the configuration starts, you can’t modify your cluster-level configurations. But you can supplement or override those configurations on the instance-group level through reconfiguration requests. Whenever you submit a reconfiguration request for an instance group, these new instance-group-level configurations take precedence over inherited cluster-level configurations.

To better understand how cluster-level and instance-group-level configurations work together on an instance group, look at a simple demonstration in the EMR console:

Under the Configuration tab, select an instance group in the Filter drop-down list. Navigate to the desired instance group’s configuration table. The Source column of the configuration table indicates the level of your configurations.

This cluster starts with the cluster-level configuration set:

[
  {
    "Classification": "core-site",
    "Properties": {
      "Key-A": "Value-1",
      "Key-B": "Value-2"
    }
  }
]

As you can see in the console, the instance group ig-Y4E3MN8C4YBP automatically inherited the cluster-level configuration set. Now, reconfigure the instance group as follows:

[
  {
    "Classification": "core-site",
    "Properties": {
      "Key-A": "Value-a",
      "Key-C": "Value-3"
    }
  }
]

Once the request goes through, the value of configuration “Key-A” gets overridden by the instance-group-level configuration and changes from “Value-1” to “Value-a.” In contrast, the value of configuration “Key-B” remains unchanged. Meanwhile, your request introduces the new, supplemental configuration, “Key-C.” The configuration table in your console always displays these kinds of subtle changes.

For more information about how to customize cluster-level and instance-group-level configurations, see Supplying a Configuration when Creating a Cluster.

Instance group states, configuration versions, and events

The states of your reconfiguration requests appear in instance group state transitions, configuration version increases, and CloudWatch events. Understand how each works to keep from losing track of any reconfiguration request:

Instance group states: After an instance group receives a reconfiguration request, it transitions from the RUNNING state to the RECONFIGURING state. The RECONFIGURING state indicates the start of the reconfiguration process. After the process completes and the new configurations have taken effect, the instance group returns to the RUNNING state. Then, you can verify your configurations either via your application’s Web UI or application-specific commands.
Configuration versions: Every reconfiguration request you submit establishes a new configuration set, distinguished by a new version number. Configuration versions start from 0 and increase by 1 for each new configuration set that you submit. Each instance group keeps its respective configuration version number. Version numbers increase depending on the number of times that you reconfigure the different instance groups.
Events:EMR posts a state for each reconfiguration request as an Amazon CloudWatch event. These events list the exact times when the request is submitted, the reconfiguration operation starts, and when it completes. For easy tracking, each request is posted together with its associated configuration version. For example, the following event flow shows how EMR executes a typical reconfiguration request in an instance group:

For a complete list of EMR events and instance group state transitions for reconfiguration operation, see the EMR Management Guide.

Reconfiguration example use cases

Here are some use case examples of reconfiguration operations:

Reconfiguring HDFS blocksize
Configuring capacity-scheduler queues

Reconfiguring HDFS blocksize

You may deal with fluctuating workloads. These changes can call for new application configurations throughout the lifetime of a cluster.

For example, suppose that you’ve recently seen growth in the workload and filesize for your long-running cluster. You’d like to account for this growth without replacing your current cluster.

To increase your HDFS block size for better performance, take advantage of the new reconfiguration feature. HDFS NameNode tracks each data block in your cluster. Increasing this block size could increase HDFS performance by reducing the number of blocks watched by NameNode. In addition, this feature improves job performance by reducing the number of required mappers.

To increase the HDFS block size from the default of 128 GB to 256 GB, submit a reconfiguration request to the master instance group, which runs the same node:

$ aws emr modify-instance-groups --cli-input-json file://reconfiguration.json

Here’s the example reconfiguration.json file.

reconfiguration.json:
{
  "ClusterId": "j-MyClusterID",
  "InstanceGroups": [
  {
    "InstanceGroupId": "ig-MyMasterId",
    "Configurations": [
    {
      "Classification": "hdfs-site",
      "Properties": {
        "dfs.blocksize": "256m"
      },
    "Configurations": []
    }]
  }]
}

The EMR reconfiguration process then modifies the “dfs.blocksize” parameter to the provided “256 m” value within the hdfs-size.xml file. The reconfiguration process also automatically restarts NameNode, to pick up the new configuration. Any new blocks added to the cluster automatically use the new default blocksize of 256 MB. If you’d like any existing blocks to pick up this default, follow these steps:

Copy the blocks to a new location.
Delete the originals.
Copy the blocks back to their original location.

The restored blocks pick up the new default block size. NameNode is inactive during the short restart period.

Configuring capacity-scheduler queues

Do you want to change cluster resources sharing strategies among different Hadoop jobs? Modify YARN CapacityScheduler configurations on a running cluster? Add new queues on a large shared cluster that you manage with another organization? Alter the capacity allocation between different queues to meet your changing workloads?

Using the EMR reconfiguration feature, you can make changes by submitting a reconfiguration request to the master node. New configurations take effect on your queues in a few minutes. You don’t have to go through the hassle of logging into the master node, directly updating the configuration file, or manually refresh queues.

EMR clusters come with a single queue by default. To create two additional queues, alpha, and beta, and allocate each 30% of the total resource capacity of your cluster to each of them. Here’s a sample command that submits a reconfiguration request to accomplish the desired change:

$ aws emr modify-instance-groups --cli-input-json file://reconfiguration.json

Here’s the example reconfiguration.json file.

reconfiguration.json:
{
   "ClusterId":"j-MyClusterID",
   "InstanceGroups":[
      {
         "InstanceGroupId":"ig-MyMasterId",
         "Configurations":[
            {
               "Classification":"capacity-scheduler",
               "Properties":{
                  "yarn.scheduler.capacity.root.queues":"default,alpha,beta",
                  "yarn.scheduler.capacity.root.default.capacity":"40",
                  "yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity":"40",
                  "yarn.scheduler.capacity.root.alpha.capacity":"30",
                  "yarn.scheduler.capacity.root.alpha.accessible-node-labels":"*",
                  "yarn.scheduler.capacity.root.alpha.accessible-node-labels.CORE.capacity":"30",
                  "yarn.scheduler.capacity.root.beta.capacity":"30",
                  "yarn.scheduler.capacity.root.beta.accessible-node-labels":"*",
                  "yarn.scheduler.capacity.root.beta.accessible-node-labels.CORE.capacity":"30"
               },
               "Configurations":[]
            }
         ]
      }
   ]
}

Access to the “*” label was given to both queues so that each can access labeled core nodes. Additionally, the sum of capacities for all queues must be equal to 100. The capacity of the default queue decreases to 40%.

Finally, the capacity for each queue’s access to the core label matches the capacity of the queue itself. That means that the core partition splits between queues at the same ratio as the rest of the cluster.

After completing this step, go to the YARN ResourceManager Web UI to verify that your modifications have taken place.

EMR reconfiguration benefits

The following are EMR reconfiguration benefits:

Rolling reconfiguration process
Reconfiguration failure and reversion

Rolling reconfiguration process

One key benefit of EMR reconfiguration is a rolling reconfiguration process. From the documentation:

“Amazon EMR follows a ‘rolling’ process to reconfigure the instances in the Task and Core instance groups. Only 10 percent of the instances in an instance group are modified and restarted at a time. This process takes longer to finish but reduces the chance of potential application failure in a running cluster.”

Rolling reconfiguration protects against any HDFS downtime by allowing 90% of core nodes to stay running during reconfiguration. YARN on EMR additionally has NodeManager recovery enabled. NodeManager recovers containers running after the reconfiguration restart.

Because containers are always active, some MapReduce jobs can continue to run successfully during the reconfiguration process. However, not all applications can recover after a restart. For example, Spark on YARN (the EMR default) may encounter executor issues and job failure after NodeManager restart.

Test applications with the type of reconfiguration that you plan to do in a safe environment before reconfiguring in production.

Finally, the rolling reconfiguration process might result in a temporary mismatch of your instance group’s configuration state. While mismatched, some instances may have old configurations while others may have the newly requested ones. When reconfiguring your cluster, consider any possible side effects of this situation.

Reconfiguration failure and reversion

EMR can also recover your instance group from a reconfiguration failure.

To make your new configuration take effect, EMR restarts your reconfigured applications and ensures that they are running before declaring the reconfiguration operation complete.

However, if any application fails to restart successfully on any node, the reconfiguration operation fails and the instance group remains in the RECONFIGURING state. Such failures might result from problematic configuration values. For example, an invalid address for `yarn.resourcemanager.scheduler.address` can cause the YARN ResourceManager to fail to restart.

In such situations, EMR automatically triggers a configuration reversion. Reversion re-applies the previous working configuration set on the instance group. Reversion brings the instance group state back to the RUNNING state as soon as the reversion completes. Your instance group thus returns to a functioning state and maintains the availability of your applications on the cluster. Rolling reconfiguration continues throughout the process.

If applications still fail to start after the previous working configurations have been re-applied, EMR places the instance group in te ARRESTED state rather than make further reconfiguration attempts. To release the instance group from the ARRESTED state, submit a new reconfiguration request.

Summary

In this post, I showed you the basics of how to configure instance groups on running clusters using the new EMR cluster reconfiguration feature. I walked through the extra semantics of submitting reconfiguration requests, important configuration level concepts, and ways of reconfiguration tracking methods. I provided some real-world reconfiguration examples and covered two useful features of reconfiguration.

Try the new cluster reconfiguration feature and share your experience with us in the comments below!

About the Authors

Brandon Scheller is a software development engineer for Amazon EMR. His passion lies in developing and advancing the applications of the Hadoop ecosystem and working with the open source community. He enjoys mountaineering in the Cascades with his free time.

Junyang Li is a software development engineer for Amazon EMR. She works on cutting-edge features of EMR and is also involved in open source projects. Besides work, she enjoys arts and crafts, exercising and traveling.

Audit History

Last reviewed and updated in April 2024 by Uday Narayanan | Principal Solutions Architect