Introducing Amazon EMR Managed Scaling – Automatically Resize Clusters to Lower Cost
AWS is happy to announce the release of Amazon EMR Managed Scaling, a new feature that automatically resizes your cluster for best performance at the lowest possible cost. With EMR Managed Scaling, you specify the minimum and maximum compute limits for your clusters, and Amazon EMR automatically resizes them for best performance and resource utilization. EMR Managed Scaling continuously samples key metrics associated with the workloads running on your clusters. EMR Managed Scaling is supported for Apache Spark, Apache Hive, and YARN-based workloads on Amazon EMR versions 5.30.1 and above.*
Use cases and benefits
Before EMR Managed Scaling, you had to predict workload patterns in advance or write custom automatic scaling rules that required an in-depth understanding of the application framework (for example, Apache Spark or Apache Hive). Predicting your workload or writing custom rules can be difficult and error-prone. Incorrectly sized cluster resources often lead either to missed SLAs and unpredictable performance, or to underutilized resources and cost overruns.
EMR Managed Scaling solves this problem by automatically sizing cluster resources based on the workload for best performance and lowest cost. You don’t need to predict your workload patterns or write custom logic to scale your cluster. EMR Managed Scaling constantly monitors key metrics based on workload and optimizes the cluster size for best resource utilization. Amazon EMR can scale the cluster up during peaks and scale it down gracefully during idle periods, reducing your costs and optimizing cluster capacity for best performance. With a few clicks, you can set the compute limits for your cluster and Amazon EMR manages the rest. With EMR Managed Scaling, Amazon EMR also emits high-resolution metrics at 1-minute granularity, allowing you to visualize how Amazon EMR Managed Scaling is reacting to your incoming workload. For more information, see Understanding Managed Scaling Metrics.
To illustrate with an example, we configured an EMR cluster with EMR Managed Scaling to scale between 1 and 20 nodes, with 16 vCPUs per node. We left the cluster settings at their defaults, turned on EMR Managed Scaling, and submitted multiple parallel Spark jobs (from the TPC-DS benchmark) to the cluster at 30-minute intervals. The following Amazon CloudWatch dashboard shows how EMR Managed Scaling matched the cluster size to the cluster load, scaling it up during peaks and scaling down when idle. Enabling EMR Managed Scaling in this use case lowered costs by 60% compared to a fixed-size cluster.
EMR Managed Scaling vs. Auto Scaling
Amazon EMR offers two ways to scale your clusters: you can use either Amazon EMR's Auto Scaling support, released in 2016, or EMR Managed Scaling. If you're running Apache Spark, Apache Hive, or YARN-based applications and want a completely managed experience, we recommend EMR Managed Scaling. If you need to define custom rules involving custom metrics for applications running in the cluster, you should use Auto Scaling. The following table summarizes the differences between the two methods.
| | EMR Managed Scaling | Auto Scaling |
| --- | --- | --- |
| Scaling rules management | Amazon EMR-managed algorithm that constantly monitors key metrics based on the workloads and optimizes the cluster size for best resource utilization. | You choose custom metrics and apply scaling rules. |
| Cluster types supported | Instance groups and instance fleets | Instance groups only |
| Configuration granularity | Cluster-level minimum/maximum constraints | Instance group-level configuration |
| Minimum Amazon EMR release version | 5.30+ | 4.0+ |
| Metric collection frequency to aid scaling decisions | Every 1–5 seconds | Every 5 minutes |
| Evaluation frequency | Every 5–10 seconds | Every 5 minutes |
| Resize strategy | No configuration required. Amazon EMR follows a dynamic scaling strategy, computes the cluster's actual resource requirement, and reaches the correct scale directly, in both scale-up and scale-down cases. | You define a fixed count of instances to add or remove when a condition is breached. |
| Cooldowns between resizes | Not required; Amazon EMR automatically detects the need to scale up or down without specific cooldown periods. | You can define your own cooldown periods between consecutive resizes. |
| Scaling based on custom metrics | Not supported | You can define custom application or infrastructure metrics, as well as custom scaling actions and thresholds. |
EMR Managed Scaling now supports EMR instance fleets
Spot Instances are spare Amazon Elastic Compute Cloud (Amazon EC2) compute capacity in the AWS Cloud, available at discounts of up to 90% compared to On-Demand prices. Amazon EC2 can interrupt Spot Instances with a 2-minute notification when it needs the capacity back. A common use case is to run data processing workloads with Amazon EMR using Spot Instances, because of the fault-tolerant nature of Spark and other YARN-based workloads. For more information, see Best Practices for running Apache Spark applications using Amazon EC2 Spot Instances with Amazon EMR.
EMR Managed Scaling also introduces support for Amazon EMR instance fleets. You can seamlessly scale Spot Instances, On-Demand Instances, and instances that are part of a Savings Plan all within the same cluster.
Combining capacity from multiple Spot capacity pools (a combination of an EC2 instance type and an Availability Zone) across multiple instance families and sizes decreases the impact on your workload if Spot capacity is interrupted or unavailable. We call this Spot diversification. Amazon EMR automates this strategy by allowing you to configure instance fleets. If you're running large-scale data processing workloads with Amazon EMR and your workload is fault-tolerant, the recommended way to use Spot Instances is to scale the task fleet with Spot Instances, and use On-Demand Instances in the core fleet for the non-fault-tolerant parts of the cluster, such as the Spark driver or HDFS nodes.
You can configure EMR Managed Scaling (see the Node allocation strategy section below) so that EMR clusters scale only task nodes on Spot Instances. With instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within the cluster. You can specify up to five EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets, and you can select multiple subnets in different Availability Zones. When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled, looking across the provided subnets, Availability Zones, instance families, and sizes to provision the capacity that has the lowest chance of interruption, for the lowest cost. With EMR Managed Scaling, you can also resize instance fleets and set separate On-Demand and Spot limits for each instance fleet. For more information, see Configure Instance Fleets.
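As a rough sketch, an instance fleet configuration that keeps core capacity On-Demand and draws task capacity from diversified Spot pools might look like the following. The field names follow the EMR instance fleet API; the fleet names, instance types, target capacities, and weights are illustrative assumptions, not recommendations:

```json
{
  "InstanceFleets": [
    {
      "Name": "CoreFleet",
      "InstanceFleetType": "CORE",
      "TargetOnDemandCapacity": 2,
      "InstanceTypeConfigs": [
        { "InstanceType": "m5.xlarge", "WeightedCapacity": 1 }
      ]
    },
    {
      "Name": "TaskFleet",
      "InstanceFleetType": "TASK",
      "TargetSpotCapacity": 4,
      "InstanceTypeConfigs": [
        { "InstanceType": "m5.xlarge",  "WeightedCapacity": 1 },
        { "InstanceType": "m5a.xlarge", "WeightedCapacity": 1 },
        { "InstanceType": "r5.xlarge",  "WeightedCapacity": 2 }
      ]
    }
  ]
}
```

Listing several instance types in the task fleet is what gives Amazon EMR room to diversify across Spot capacity pools.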
Configuring EMR Managed Scaling
Configuring EMR Managed Scaling is simple: enable it and set the minimum and maximum limits on the number of instances or vCPUs (for instance groups) or capacity units (for instance fleets) for the cluster nodes. You can enable Managed Scaling on a running cluster or when provisioning the cluster. For more information, see Using EMR Managed Scaling in Amazon EMR.
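As a minimal sketch, the following builds the ManagedScalingPolicy payload that the EMR PutManagedScalingPolicy API expects, using the 1–20 node limits from the earlier example. The helper function name and the commented cluster ID are our own assumptions for illustration:

```python
def managed_scaling_policy(minimum, maximum, unit_type="Instances"):
    """Return a ManagedScalingPolicy payload for the given compute limits."""
    if not minimum <= maximum:
        raise ValueError("minimum must not exceed maximum")
    return {
        "ComputeLimits": {
            # "Instances", "VCPU", or "InstanceFleetUnits"
            "UnitType": unit_type,
            "MinimumCapacityUnits": minimum,
            "MaximumCapacityUnits": maximum,
        }
    }

policy = managed_scaling_policy(1, 20)

# Applying it to a running cluster (requires boto3 and AWS credentials;
# the cluster ID below is a placeholder):
# import boto3
# emr = boto3.client("emr")
# emr.put_managed_scaling_policy(ClusterId="j-XXXXXXXXXXXXX",
#                                ManagedScalingPolicy=policy)
```

The same payload shape can be passed at cluster creation or applied later to a running cluster.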
Node allocation strategy
EMR Managed Scaling lets you control the minimum and maximum capacity that the cluster can scale to. You control these boundaries with the following parameters:
- MinimumCapacityUnits – Lower boundary of the size of the cluster
- MaximumCapacityUnits – Upper boundary of the size of the cluster
- MaximumCoreCapacityUnits – Upper boundary of core node group
- MaximumOnDemandCapacityUnits – Upper boundary of capacity to be provisioned from the On-Demand market
The last two parameters let you choose whether to scale core nodes or task nodes (MaximumCoreCapacityUnits) and whether to use instances from the On-Demand or Spot market (MaximumOnDemandCapacityUnits). You can use these parameters to split capacity between core and task nodes, and between On-Demand and Spot. A simple way to think about them is as sliders on a line.
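As a hedged sketch, the following helper assembles the ComputeLimits structure from the four parameters above and checks that the boundaries are mutually consistent. The field names match the EMR API; the validation rules are our own assumptions about how the sliders must relate:

```python
def compute_limits(minimum, maximum, max_core=None, max_on_demand=None,
                   unit_type="Instances"):
    """Build a ComputeLimits dict; unset sliders default to the overall maximum."""
    max_core = maximum if max_core is None else max_core
    max_on_demand = maximum if max_on_demand is None else max_on_demand
    if not minimum <= maximum:
        raise ValueError("MinimumCapacityUnits must not exceed MaximumCapacityUnits")
    for name, value in (("MaximumCoreCapacityUnits", max_core),
                        ("MaximumOnDemandCapacityUnits", max_on_demand)):
        # Assumed constraint: each slider sits between the cluster min and max.
        if not minimum <= value <= maximum:
            raise ValueError(f"{name} must lie between the minimum and maximum")
    return {
        "UnitType": unit_type,
        "MinimumCapacityUnits": minimum,
        "MaximumCapacityUnits": maximum,
        "MaximumCoreCapacityUnits": max_core,
        "MaximumOnDemandCapacityUnits": max_on_demand,
    }

# Example: scale 5-100 nodes, cap core at 5 so only task nodes scale, all On-Demand.
limits = compute_limits(5, 100, max_core=5)
```

The scenarios in the following sections are all different settings of these two sliders.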
Scaling only core nodes
In this use case, your cluster has a minimum of 5 nodes and a maximum of 100 nodes. By setting both the maximum core capacity and the maximum On-Demand capacity to the cluster maximum, you instruct EMR Managed Scaling to scale only core nodes, and only on the On-Demand market. The following diagram illustrates this configuration.
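Expressed as a ComputeLimits payload (values from the 5–100 node example; the unit type is an assumption), this scenario could look like:

```json
{
  "UnitType": "Instances",
  "MinimumCapacityUnits": 5,
  "MaximumCapacityUnits": 100,
  "MaximumCoreCapacityUnits": 100,
  "MaximumOnDemandCapacityUnits": 100
}
```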
Scaling only task nodes (On-Demand)
If you instead set the maximum core capacity to the minimum value, the cluster scales only task nodes, and still only on the On-Demand market. The following diagram illustrates this configuration.
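The corresponding ComputeLimits payload (same illustrative 5–100 node limits) pins the core slider at the minimum:

```json
{
  "UnitType": "Instances",
  "MinimumCapacityUnits": 5,
  "MaximumCapacityUnits": 100,
  "MaximumCoreCapacityUnits": 5,
  "MaximumOnDemandCapacityUnits": 100
}
```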
Scaling only task nodes (Spot Instances)
If you set the maximum On-Demand capacity to the minimum, the cluster scales between the minimum and maximum capacity using Spot Instances. Because the maximum core capacity is also set to the minimum size of the cluster, scaling happens only through task nodes. The following diagram illustrates this configuration.
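With the same illustrative 5–100 node limits, this scenario pins both sliders at the minimum, so all scaling beyond the 5-node floor happens on Spot task nodes:

```json
{
  "UnitType": "Instances",
  "MinimumCapacityUnits": 5,
  "MaximumCapacityUnits": 100,
  "MaximumCoreCapacityUnits": 5,
  "MaximumOnDemandCapacityUnits": 5
}
```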
Best Practice: Scaling core nodes On-Demand and task nodes on Spot
As a best practice, run core nodes On-Demand (because they host HDFS) and task nodes on Spot Instances. Because core nodes contain HDFS, the sudden loss of a node can lead to degraded job performance or data loss. The following diagram shows the cluster scaling core nodes up to the maximum core capacity on the On-Demand market and scaling the rest using task nodes on the Spot market.
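To express this split as a ComputeLimits payload, set the On-Demand slider equal to the core slider so that all capacity above the core maximum comes from Spot task nodes. The core maximum of 20 here is an illustrative choice, not a recommendation:

```json
{
  "UnitType": "Instances",
  "MinimumCapacityUnits": 5,
  "MaximumCapacityUnits": 100,
  "MaximumCoreCapacityUnits": 20,
  "MaximumOnDemandCapacityUnits": 20
}
```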
This post discussed EMR Managed Scaling, which automatically resizes your cluster for best performance at the lowest possible cost. For more information, see Using EMR Managed Scaling in Amazon EMR, or view the EMR Managed Scaling demo.
*EMR Managed Scaling is not supported on Amazon EMR 6.0, but will be supported in subsequent releases (6.1 and later).
Amazon works closely with customers to test and improve new features. The Amazon EMR team would like to thank ZS Associates, a consulting and professional services firm headquartered in Illinois, which partnered with Amazon EMR for beta testing. ZS's feedback contributed to the enhanced performance of Managed Scaling today, and ZS also benefited greatly from adopting the feature. "We leverage Amazon EMR (mainly using Apache Spark) heavily for data analytics strategy and have had tremendous success over the years. Previously, tuning auto-scaling policies with appropriate parameters was a challenge. The data engineering team had to closely monitor workload changes and update the parameters. Managed Scaling made it really easy. The simplicity of this feature, coupled with our Cloud Center of Excellence's central governance, will allow us to adopt Managed Scaling across numerous EMR Spark workloads quickly. We are happy to see a 30% cost reduction, and plan to enable Managed Scaling for all future workloads," said Anirudh Vohra, Cloud Engineering Architect at ZS.
About the Authors
Abhishek Sinha is a Principal Product Manager at Amazon Web Services.
Joseph Marques is a principal engineer for EMR at Amazon Web Services.
Srinivas Addanki is a Software Development Manager at Amazon Web Services.
Vishal Vyas is a Software Development Engineer at Amazon Web Services.