Optimizing Amazon EC2 Spot Instance Usage with Qubole Data Platform

By Balaji Mohanam, Director of Product Management at Qubole
By Dhiraj Sehgal, Director of Product Marketing at Qubole
By Dilip Rajan, Partner Solutions Architect, Data and Analytics at AWS

Amazon EC2 Spot Instances let you reduce costs by taking advantage of unused capacity. You can further reduce costs by using the policy-based automation in Qubole Data Platform to balance performance, cost, and SLA requirements anytime you use Spot Instances.

Using Qubole Data Platform brings inner peace by minimizing Spot interruptions while reducing costs.

In this post, we explain how the Qubole Data Platform optimizes your Spot usage, and how it applies policy-based automation to balance your performance, cost, and SLAs whenever you use Amazon EC2 Spot Instances.

High-traffic applications generate hundreds of petabytes (PB) of data, whose cost can be difficult to contain. This is especially true in modern applications, where in addition to large volumes of data, traffic can be volatile and fluctuate with little warning.

Consider that in such a high-traffic environment, you are tasked with the following:

Manage a platform that enables hundreds or even thousands of users with self-service data democratization, executes hundreds of thousands of application jobs, and processes PBs of data.
Use an elastic, on-demand, and ephemeral infrastructure that can have a lifecycle from a few minutes to several days.
Monitor and prevent cost overruns, and maximize savings, without compromising SLA or availability.

Spot Instances enable you to take advantage of unused Amazon Elastic Compute Cloud (Amazon EC2) capacity at up to a 90 percent discount compared to On-Demand pricing. That inner peace can be disturbed, however, by Spot interruptions, which can occur with little warning when Amazon EC2 needs the capacity back.

By combining Qubole Data Platform with Spot Instances, you can reduce impromptu interruptions. This reduces resource usage and expenses, without affecting availability and your SLAs with customers.

Qubole, an AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in both Machine Learning and Data & Analytics, also applies heuristics and machine learning (ML) to continually analyze, learn, and then automatically manage workload demands based on underlying usage patterns.

How Qubole Optimizes Amazon EC2 Spot Instance Usage

A key to the effectiveness of the Qubole Data Platform is its Spot Instance type diversification strategy, in which the worker nodes belonging to a particular cluster can belong to multiple Amazon EC2 instance types. This diversification provides many benefits.

The information in Figure 1 highlights the results Qubole customers have experienced by using Qubole Open Data Lake Platform to manage their Amazon EC2 Spot Instances.

Qubole customers have saved more than $50 million a year annualized when using EC2 clusters with Qubole Open Data Lake Platform for their ad-hoc analytics, data exploration, streaming analytics, and ML workloads.

This is equivalent to consuming 1.5 billion compute hours, processing 7 exabytes of data, and executing 600 million commands over that time.

Qubole verified each of these benefits by creating an EC2 cluster and observing how Qubole Data Lake Platform optimized its use of Spot Instances.

Figure 1 – Results of using Qubole Data Lake Platform.

In the following sections of this post, we describe Qubole’s observations for each purported benefit.

Maximizing Spot Request Fulfillment

Achieving your desired scale with Spot Instances depends on the available capacity, since they are spare Amazon EC2 capacity. A lack of capacity can severely impact ongoing operations, as the cluster is unable to scale up to meet the demands of the workloads.

When a Spot request cannot be fulfilled, an obvious choice is to cancel the request and fall back to On-Demand nodes. However, this approach can negatively impact the total cost of ownership (TCO) of the cluster due to the use of a different instance type or a different AWS Availability Zone, neither of which may fit within the desired financial profile.

For example, in the cluster shown in Figure 2, the primary node type is r3.4xlarge. When Spot capacity is no longer available for this instance type, the requested Spot Instances are provisioned from other instance types, such as r3.xlarge and r4.8xlarge, rather than falling back to On-Demand.

Figure 2 – Selecting other Spot Instance types when a primary node type is no longer available.

In that cluster, we observed that over two-thirds of the nodes were provisioned on multiple Spot Instance types, as shown in Figure 3.

Figure 3 – Provisioning from multiple Spot Instance types instead of On-Demand.

Without instance type diversification, the cluster would have fallen back to On-Demand nodes of the r3.4xlarge instance type. This would have cost a total of $1,746.56 for the entire duration of this particular cluster.

However, with diversification, the required nodes are provisioned from other instance types at an 80 percent discount, saving $1,079.55 (about 62 percent of the On-Demand total) in costs.

Figure 4 – Cost savings from diversifying Spot Instance types.
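
As a quick check on the figures in this example, the arithmetic can be worked through directly. The sketch below uses only the two dollar amounts quoted above; it is plain arithmetic on those numbers, not additional data from Qubole.

```python
# Dollar figures quoted in this post for this particular cluster.
on_demand_cost = 1746.56  # total if every node had fallen back to On-Demand r3.4xlarge
savings = 1079.55         # savings reported with Spot Instance type diversification

# Cost actually incurred with diversification, and savings as a share of the On-Demand total.
diversified_cost = on_demand_cost - savings
savings_fraction = savings / on_demand_cost

print(f"Cost with diversification: ${diversified_cost:,.2f}")
print(f"Savings relative to On-Demand: {savings_fraction:.1%}")
```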

Minimizing Spot Loss Exposure

When AWS needs spare capacity for its On-Demand or other priority services, it reclaims resources from Spot Instances with a two-minute warning. This “Spot Instance Interruption” can happen at any time to any Spot Instance type, which can interrupt processing on all of the nodes in the cluster belonging to the reclaimed instance type.
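
To make the two-minute warning concrete: Amazon EC2 exposes the interruption notice to the affected instance through instance metadata (the spot/instance-action item), as a small JSON document with an action and a time. The following is a minimal sketch that parses such a notice and computes the time remaining. The sample payload and the helper name seconds_until_interruption are illustrative assumptions, and this is not a description of how Qubole handles notices internally.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_body: str, now: datetime) -> float:
    """Return the seconds left before the action named in an interruption notice.

    notice_body is the JSON text of the notice, for example
    {"action": "terminate", "time": "2017-09-18T08:22:00Z"}.
    """
    notice = json.loads(notice_body)
    # The notice time is expressed in UTC with a trailing "Z".
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Hypothetical sample notice, observed two minutes before the deadline.
sample_notice = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
observed_at = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample_notice, observed_at)
print(f"Seconds until interruption: {remaining:.0f}")
```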

The graph in Figure 5 shows the number of instance types (among r4.8xlarge, r3.xlarge, r4.2xlarge, r4.xlarge, r3.4xlarge, r3.8xlarge, m2.xlarge, m4.xlarge, r3.2xlarge) we observed experiencing Spot Instance interruptions during a typical day.

Figure 5 – Interruptions across multiple Spot Instance types.

In some cases, the interruptions affected two instance types, but most of the time interruptions only affected one instance type.

If the nodes provisioned in the cluster belong to this instance type and are running in the same AWS Availability Zone, there’s a chance that other instances in that capacity pool will also be interrupted, since Amazon EC2 needs the capacity back in that capacity pool for on-demand usage.

When Qubole Data Platform applied Spot Instance type diversification, spot loss exposure was minimized.

Figure 6 – Minimizing the number of nodes exposed to Spot loss.

Instead of using only one Spot Instance type (r3.4xlarge), Qubole Data Platform spread the workload across multiple instance types (r4.8xlarge, r3.xlarge, r4.2xlarge, r4.xlarge, r3.4xlarge, r3.8xlarge, and r3.2xlarge).

As a result, the cluster used a single instance type less than 40 percent of the time, reducing the number of cluster nodes exposed to potential Spot Instance interruption.
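
The effect of diversification on exposure can be illustrated with a small calculation. The sketch below assumes, for simplicity, that one interruption event reclaims every node of a single instance type, and it uses hypothetical node counts; neither the counts nor the worst_case_exposure helper come from Qubole.

```python
from collections import Counter


def worst_case_exposure(node_types):
    """Fraction of cluster nodes lost if the most heavily used instance type is reclaimed."""
    counts = Counter(node_types)
    return max(counts.values()) / len(node_types)


# Hypothetical 12-node clusters: one on a single type, one spread across several types.
single_type = ["r3.4xlarge"] * 12
diversified = (
    ["r3.4xlarge"] * 4
    + ["r4.8xlarge"] * 2
    + ["r3.xlarge"] * 2
    + ["r4.2xlarge"] * 2
    + ["r3.8xlarge"]
    + ["r3.2xlarge"]
)

print(f"Single type: {worst_case_exposure(single_type):.0%} of nodes exposed")
print(f"Diversified: {worst_case_exposure(diversified):.0%} of nodes exposed")
```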

Increasing Instance Duration

Spot Interruption decisions are based on available capacity, not price. In some cases, instance types that are highly constrained could cause Spot Instances to be interrupted after a short time period, not allowing jobs to finish. The frequent occurrence of this scenario leads to “spot thrashing,” where the instances are repeatedly provisioned but immediately interrupted.

Qubole Data Platform provisions instances from the Spot capacity pools with the deepest available capacity. If the requested number of Spot nodes is not available for a particular instance type, they are allocated across multiple instance types (based on the available capacity ranking) until the request is fulfilled.

By prioritizing the deepest capacity pools, Qubole Data Platform reduces both the frequency of interruption and the probability of immediate interruption. As a result, the Spot Instances can run for a longer duration.

This approach is well suited for clusters where the business cost of re-computation is high. With Spot Loss Aware Provisioning, when a new set of nodes is requested for upscaling the cluster, Qubole automatically provisions Spot Instances from the deepest capacity pools. This further optimizes the instance diversification to increase the Spot Instance duration.
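
The capacity-ranked allocation described in this section can be sketched as a simple greedy loop. This is an illustration under stated assumptions, not Qubole's implementation: the per-type capacity numbers are hypothetical (actual pool depth is not something a customer sees as a simple count), and the allocate_by_capacity helper is invented for this example.

```python
def allocate_by_capacity(requested, available):
    """Fill a request from the deepest capacity pool first.

    requested: number of Spot nodes wanted.
    available: mapping of instance type to a hypothetical available-node count.
    Returns (allocation, unfilled), where allocation maps instance type to nodes taken.
    """
    allocation = {}
    remaining = requested
    # Rank pools from deepest to shallowest.
    ranked = sorted(available.items(), key=lambda item: item[1], reverse=True)
    for instance_type, capacity in ranked:
        if remaining == 0:
            break
        take = min(capacity, remaining)
        if take > 0:
            allocation[instance_type] = take
            remaining -= take
    return allocation, remaining


# Hypothetical pool depths; the primary type r3.4xlarge has no spare capacity.
pools = {"r3.4xlarge": 0, "r4.8xlarge": 6, "r3.8xlarge": 3, "r4.2xlarge": 10}
allocation, unfilled = allocate_by_capacity(12, pools)
print(allocation, unfilled)
```

A real scheduler would also weigh node size, since instance types differ in vCPU and memory; the sketch counts nodes only to keep the ranking logic visible.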

Maximizing Savings

With a cost-optimized Spot diversification strategy, instance types that are available at the highest discount are provisioned first.

For example, if the m2.xlarge and r3.xlarge instances are not available, cost-optimized spot diversification automatically provisions the available instances in the order of:

Lowest price
Available capacity

Rather than provisioning the m4.xlarge instance, which is at 27 percent of On-Demand cost, Qubole would automatically provision r3.8xlarge, which is at 20 percent cost.

While Spot Instances, in general, provide cost reduction, a cost-optimized allocation strategy further maximizes cost avoidance. In this case, m4.xlarge would cost 35 percent more than r3.8xlarge relative to On-Demand pricing (27 percent versus 20 percent). This approach flattens the price curve even more, as shown in Figure 7.

Figure 7 – Flattening the price curve even more.

The price also adjusts more gradually over a period of time rather than fluctuating rapidly. When the price of a particular instance type increases (lower discount), it also takes more time to come back down.

A cluster without any Spot diversification will continue to run workloads on the same instance type, which costs more and results in higher TCO.
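
The cost-optimized ordering described in this section (lowest price first, then available capacity) can be sketched as a sort over candidate pools. The 27 percent and 20 percent figures come from the example above; the capacity counts, the prices for the unavailable types, and the SpotPool and cost_optimized_order names are hypothetical and only serve the illustration.

```python
from typing import NamedTuple


class SpotPool(NamedTuple):
    instance_type: str
    pct_of_on_demand: float  # Spot price as a percentage of that type's On-Demand price
    capacity: int            # hypothetical count of nodes currently available


def cost_optimized_order(pools):
    """Drop pools with no capacity, then rank by lowest price and, on ties, most capacity."""
    candidates = [pool for pool in pools if pool.capacity > 0]
    return sorted(candidates, key=lambda pool: (pool.pct_of_on_demand, -pool.capacity))


pools = [
    SpotPool("m2.xlarge", 15.0, 0),   # unavailable in the example; price is hypothetical
    SpotPool("r3.xlarge", 18.0, 0),   # unavailable in the example; price is hypothetical
    SpotPool("m4.xlarge", 27.0, 8),
    SpotPool("r3.8xlarge", 20.0, 5),
]

ordered = cost_optimized_order(pools)
print([pool.instance_type for pool in ordered])
```

Comparing prices as a percentage of each type's own On-Demand price follows the framing used in this post; comparing across instance sizes in practice would also need normalizing for vCPU and memory.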

Faster Spot Auto Scaling

By default, Amazon EC2 fulfills Spot Instance requests based on the availability of the spare instances. The request waits until the capacity is fulfilled or the configured timeout limit is reached.

This wait time adds to cluster scale-up time which, in turn, slows the response time of the workload. On the other hand, reducing the timeout period may improve performance but reduce savings, because the cluster will fall back and provision higher-cost On-Demand nodes.

With Spot Instance type diversification, Qubole Data Platform spreads the request across multiple instance types. When a particular instance type doesn’t have any spare capacity, the request can be immediately fulfilled by other instance types that can accommodate the requested capacity.

Qubole automatically requests Spot Instances synchronously using Amazon EC2 Fleet in instant mode, so the request is fulfilled immediately from whatever capacity is available. Therefore, clusters with Spot Instances can now scale up faster without compromising savings.
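
For readers unfamiliar with EC2 Fleet, the following sketch builds the parameters for a synchronous, diversified Spot request of the kind described here. It is not Qubole's code: the launch template ID is a placeholder, the build_instant_fleet_request helper is hypothetical, and the choice of the capacity-optimized allocation strategy is an assumption made for the example. The parameter names follow the EC2 CreateFleet API as exposed by boto3.

```python
def build_instant_fleet_request(launch_template_id, instance_types, spot_nodes):
    """Build CreateFleet parameters for a synchronous ("instant") Spot request
    that may be fulfilled from any of the listed instance types."""
    return {
        # "instant" makes the call synchronous: EC2 launches what it can right away
        # and reports the result in the response instead of keeping the request open.
        "Type": "instant",
        "SpotOptions": {"AllocationStrategy": "capacity-optimized"},  # assumed strategy
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",
                },
                # One override per instance type diversifies the request.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            }
        ],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": spot_nodes,
            "DefaultTargetCapacityType": "spot",
        },
    }


request = build_instant_fleet_request(
    "lt-0123456789abcdef0",  # placeholder launch template ID
    ["r3.4xlarge", "r4.8xlarge", "r3.8xlarge"],
    10,
)
print(request["Type"], len(request["LaunchTemplateConfigs"][0]["Overrides"]))

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("ec2").create_fleet(**request)
```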

Qubole provides the same benefits for Amazon EC2 Spot blocks, which can be reserved for a predefined duration of one to six hours. If the average cluster life cycle is under six hours, Spot blocks give users an option for uninterrupted workload execution at a lower price than On-Demand Instances and with greater stability than standard Spot Instances, along with all of the Qubole benefits outlined above.

Summary

Managing Amazon EC2 Spot Instances with the intelligent spot management capabilities in Qubole Data Platform allows organizations to optimize their use of Spot Instances. This results in substantial cost savings on their projects.

Using Qubole’s policy-based automation platform to balance performance, cost, and SLA requirements, you can at last have inner peace.

The content and opinions in this blog are those of the third party author and AWS is not responsible for the content or accuracy of this post.

Qubole – APN Partner Spotlight

Qubole is an AWS Competency Partner that simplifies the provisioning, management, and scaling of big data analytics workloads leveraging data stored on AWS.
