Future Proof Cost Optimization with Attribute-Based Instance Type Selection and Amazon EC2 Spot

By Ajay Nalawade, Sr. Manager, DevOps – Druva
By Abhishek Gupta, Staff Engineer, DevOps – Druva
By Ashwini Kumar, Sr. EC2 Spot Specialist Solutions Architect – AWS
By Navin Yadav, Sr. Solutions Architect – AWS

Druva, an AWS Competency Partner and cloud-based data protection company, serves 4,000+ customers across 20 countries, delivering cyber, data, and operational resilience via a single software-as-a-service (SaaS) platform running on Amazon Web Services (AWS).

As part of this SaaS platform, Druva runs containerized applications with microservices architecture using Auto Scaling groups and Amazon Elastic Container Service (Amazon ECS).

To ensure low-latency data transfers and compliance with data residency regulations for its fast-expanding global customer base, Druva auto scales its compute infrastructure in any AWS region closer to its customers in a cost-optimized manner with Amazon EC2 Spot instances.

For this purpose, Druva is currently using hundreds of Auto Scaling groups across 18 AWS regions. Refer to this case study for details on how Druva optimizes costs at high scale.

Spot instances use spare Amazon Elastic Compute Cloud (Amazon EC2) capacity. Because spare capacity in any single pool can be fairly limited, using as many Spot capacity pools as possible increases the chances of obtaining the aggregate desired capacity. It is therefore recommended to specify multiple instance types in Auto Scaling groups to access the largest amount of Spot capacity.

Until recently, the operations team at Druva had been manually selecting, maintaining, and updating lists of instance types in hundreds of Auto Scaling groups across regions, which posed an operational challenge and overhead for the team. Due to manual operations, the team overlooked many existing and newly-released instance types, missing out on many Spot capacity pools in its Auto Scaling group configurations.

To overcome these challenges, Druva decided to adopt attribute-based instance type selection (ABS) with its Auto Scaling groups. In this post, we walk you through Druva’s cost optimization journey using Spot instances with Auto Scaling groups, the challenges the team faced along the way, and how ABS helped address them.

Druva SaaS Platform Architecture on AWS

Druva’s SaaS platform provides services like data backup and restore, disaster recovery as a service (DRaaS), proactive compliance, and metadata search, as well as anomaly engine and event services for various data centers, endpoints, and cloud services.

Druva customers use the Druva Cloud Platform (DCP), part of the Druva Data Resiliency Cloud (DRC), as an admin portal that provides a single point of access to these services.

Figure 1 – Druva Data Resiliency Cloud on AWS (control and data plane).

Druva’s platform control plane is in a virtual private cloud (VPC) in a single AWS region (us-east-1) hosting the control plane services, while the data plane services are hosted in storage VPCs deployed across 18 different AWS regions close to its end users and customers.

In this control plane, containerized services including a session manager service and scaling service are running with Amazon ECS tasks on EC2 nodes in an ECS cluster.

All of the customer endpoints connect to a single regional endpoint (“cloud.druva.com”) in the control plane region. Customer endpoints get the configurations from the control plane region, and then clients execute their data backup and restore operations to the region closest to them—that is where all data transfer occurs. This ensures low latency and lower cost of data transfer in these operations.

The session manager service spins up ECS tasks for backup and restore jobs on the storage service nodes in Druva’s data plane. Storage service nodes run in ECS clusters in a region close to the client’s request. The session manager also tracks jobs, manages their health, and replaces unhealthy tasks.

Session manager service also manages capacity of the storage service nodes with the help of a scaling service and maintains the required capacity to complete storage service tasks. Storage service nodes run on Spot instances with Auto Scaling groups in ECS clusters in Druva’s data plane.

Cost Optimization Journey: Reserved Instances to EC2 Spot

Druva had initially relied on a mix of up to 80% Amazon EC2 Reserved Instances, 10% On-Demand Instances, and very few Spot instances.

Considering elasticity and flexibility of the workloads, Druva has been increasing usage of Spot instances in Auto Scaling groups, shifting away from committed usage.

Instance Diversification with Spot Instances

As Spot capacity fluctuates independently for each Spot capacity pool, it’s important that Auto Scaling groups have a diverse set of instance types to choose from when launching or scaling Spot capacity.

Subsequently, an allocation strategy determines how Auto Scaling groups fulfill capacity from Spot instance pools specified in the configuration.

For workloads that need to minimize interruptions and capacity challenges, a capacity-optimized allocation strategy should be used to launch Spot instances from the most-available pools.

Alternatively, a lowest-price allocation strategy (across N pools) can be used to launch Spot capacity from the cheapest N pools out of all the instance pools specified, but lowest-price strategy does not account for pool capacity depth as it deploys Spot instances.

Spot prices change slowly over time based on long-term trends in supply and demand, but capacity fluctuates in real time. Even if the cost is the highest priority, it’s recommended to use capacity-optimized strategy due to improved access to capacity and fewer interruptions, though you won’t get the absolute cheapest pools.

Druva’s workloads are instance type flexible with certain vCPU requirements and can run on multiple types of EC2 instances. Druva also implemented an in-house checkpointing mechanism, which makes service recovery possible without losing completed work when Spot interruptions occur. This helped Druva adopt Spot instances with diversification to reduce costs.

To control compute costs by using the lowest-priced Spot instance pools, Druva employed the lowest-price allocation strategy.

Let’s now look at why and how Druva moved away from manual selection of instance types and adopted the automated attribute-based instance type selection feature with its Auto Scaling groups.

Before: Instance Diversification with Manual Instance Types List

Previously, the cloud operations team at Druva was manually selecting and specifying multiple instance types in launch template overrides for their Auto Scaling group configurations. They faced the following challenges in the process:

Due to the manual process of finding and maintaining instance types in overrides lists, Druva was overlooking some of the Spot instance pools, at times missing out on the lowest-priced pools and pushing up costs.

Druva runs workloads on multiple Auto Scaling groups across 18 AWS regions. Because some regions lack certain EC2 instance types while others offer more, Druva had to maintain a different instance types list for each region.

Keeping track of newly-released instance types in different regions and adding them manually to the Auto Scaling groups’ overrides lists was an overhead for Druva’s teams. As a result, Druva missed out on new Spot instance pools, which could have been the lowest-priced at times.

Though Spot instance prices change infrequently and predictably over time, they are still dynamic in nature, and price changes depend upon long-term supply and demand trends of Spot capacity pools. For its interruption-tolerant workloads, Druva aimed to control costs by provisioning Spot instances from only a certain X% of lower-priced pools out of all the Spot pools available.

Druva tried the lowest-price allocation strategy (across N pools), but by design it works only on a best-effort basis: it launches capacity from the cheapest N Spot pools and, when Spot capacity is not available in those pools, goes on to provision from higher-priced pools to maintain target capacity. With no price protection cap, Druva was getting higher-priced Spot instances.

To understand the above challenges better, we will now walk you through a use case comparing manual instance types configuration vs. attribute-based instance type selection in an Auto Scaling group. To minimize costs for our interruption-tolerant workload, we intend to launch 100% Spot instances from the two lowest-priced instance pools per AWS Availability Zone (AZ) in an Auto Scaling group with the lowest-price Spot allocation strategy (across two pools) in the us-east-1 region.

In the first scenario, we’ll configure the Auto Scaling group with Spot pool diversification by specifying a manual list of instance types.

Figure 2 – A manual instance types list in an Auto Scaling group.

Let’s now discover the Spot prices of these instance pools in one of the AZs (us-east-1a) using the describe-spot-price-history API at the time of launching this Spot capacity in the Auto Scaling group. We will sort them from lowest to highest Spot price to find the two lowest-priced instance types and their Spot prices in us-east-1a.

aws ec2 describe-spot-price-history --instance-types m4.4xlarge m5.4xlarge m5d.4xlarge m5a.4xlarge m5ad.4xlarge m5n.4xlarge m5dn.4xlarge r5.4xlarge r5a.4xlarge r4.4xlarge c5.4xlarge --start-time 2022-03-08T05:00:00 --end-time 2022-03-08T06:00:00 --availability-zone us-east-1a --product-descriptions "Linux/UNIX"

Figure 3 – Spot prices of different instance types (sorted from lowest to highest).
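To make the ranking step concrete, here is a minimal Python sketch that takes output in the shape returned by describe-spot-price-history, keeps the most recent price per instance type, and returns the two cheapest pools. The sample records and their prices are made-up placeholders (they are not the values in Figure 3), and the helper name cheapest_pools is hypothetical.

```python
import json
from decimal import Decimal

# Trimmed, made-up sample in the shape returned by describe-spot-price-history.
# Prices are placeholders for illustration only, not the values in Figure 3.
SAMPLE_RESPONSE = json.loads("""
{
  "SpotPriceHistory": [
    {"AvailabilityZone": "us-east-1a", "InstanceType": "m5.4xlarge",
     "SpotPrice": "0.3400", "Timestamp": "2022-03-08T05:10:00+00:00"},
    {"AvailabilityZone": "us-east-1a", "InstanceType": "c5.4xlarge",
     "SpotPrice": "0.2800", "Timestamp": "2022-03-08T05:05:00+00:00"},
    {"AvailabilityZone": "us-east-1a", "InstanceType": "c5.4xlarge",
     "SpotPrice": "0.3105", "Timestamp": "2022-03-08T05:40:00+00:00"},
    {"AvailabilityZone": "us-east-1a", "InstanceType": "r5a.4xlarge",
     "SpotPrice": "0.2955", "Timestamp": "2022-03-08T05:20:00+00:00"},
    {"AvailabilityZone": "us-east-1a", "InstanceType": "m4.4xlarge",
     "SpotPrice": "0.3600", "Timestamp": "2022-03-08T05:15:00+00:00"}
  ]
}
""")


def cheapest_pools(response, n=2):
    """Return the n lowest-priced (instance_type, price) pairs.

    Only the most recent record per instance type is considered, because
    the history can contain several price points in the queried window.
    """
    latest = {}
    for record in response["SpotPriceHistory"]:
        itype = record["InstanceType"]
        # Timestamps here share one ISO 8601 format and offset,
        # so plain string comparison orders them correctly.
        if itype not in latest or record["Timestamp"] > latest[itype]["Timestamp"]:
            latest[itype] = record
    ranked = sorted(
        ((itype, Decimal(rec["SpotPrice"])) for itype, rec in latest.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:n]


if __name__ == "__main__":
    for itype, price in cheapest_pools(SAMPLE_RESPONSE):
        print(f"{itype}\t{price}")
```

Prices are parsed as Decimal rather than float because the API returns them as strings and exact comparison matters when pools differ by fractions of a cent.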

Due to automatic AZ balancing of capacity in Auto Scaling groups, with a desired capacity of 30 instances across three AZs, EC2 Auto Scaling will launch 10 instances per AZ. With lowest-price allocation strategy (across two pools) applied in each AZ, five Spot instances each will be launched from the two lowest-priced instance types in the AZ.

As shown in Figure 3, r5a.4xlarge and c5.4xlarge are the two lowest-priced Spot pools in us-east-1a at the time of launching Spot capacity in the Auto Scaling group. As expected, Spot instances in us-east-1a are launched from these two lowest-priced Spot instance types (see Figure 4).

Figure 4 – Spot instances launched from the two lowest-priced pools.
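The arithmetic behind this distribution (30 instances, three AZs, two pools per AZ) can be sketched in a few lines of Python. This is an idealized model, not how EC2 Auto Scaling is implemented: the real service balances on a best-effort basis. The function names, the us-east-1b and us-east-1c zone names, and the price ordering in those two zones are assumptions made for illustration; only the us-east-1a ordering comes from the example above.

```python
def split_evenly(total, buckets):
    """Split `total` units across `buckets` as evenly as possible.

    Any remainder is handed out one unit at a time to the first buckets.
    """
    base, extra = divmod(total, len(buckets))
    return {b: base + (1 if i < extra else 0) for i, b in enumerate(buckets)}


def plan_launch(desired, cheapest_by_az, pools_per_az=2):
    """Idealized lowest-price (across N pools) plan with AZ balancing.

    cheapest_by_az maps an AZ name to its instance types sorted from
    lowest to highest Spot price.
    """
    per_az = split_evenly(desired, list(cheapest_by_az))
    return {
        az: split_evenly(count, cheapest_by_az[az][:pools_per_az])
        for az, count in per_az.items()
    }


# us-east-1a ordering follows the example; the other two zones are hypothetical.
CHEAPEST_BY_AZ = {
    "us-east-1a": ["r5a.4xlarge", "c5.4xlarge", "m5.4xlarge"],
    "us-east-1b": ["c5.4xlarge", "m5a.4xlarge", "r5a.4xlarge"],
    "us-east-1c": ["m5a.4xlarge", "r5a.4xlarge", "c5.4xlarge"],
}

if __name__ == "__main__":
    for az, pools in plan_launch(30, CHEAPEST_BY_AZ).items():
        print(az, pools)
```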

Now, let’s automate our instance pools diversification with attribute-based instance type selection in our Auto Scaling group, in the same way as Druva implemented in its Auto Scaling group configurations.

After: Instance Diversification with ABS

As an alternative to manually choosing instance types for your instance type overrides list in Auto Scaling groups, attribute-based instance type selection (ABS) lets you express your instance type requirements as a set of attributes, such as vCPUs, memory, memory per vCPU, storage, CPU architecture, GPU count, and many other instance attributes. It also allows you to exclude certain EC2 instance families or instance types based upon your specific workload requirements and business objectives.

Your requirements are automatically translated to all matching instance types, simplifying the creation and maintenance of instance types lists. This allows you to automatically use newer generation instance types when they are released and access broader pools of compute capacity. Auto Scaling groups select and launch instances that fit the specified attributes, removing the need to manually pick instance types.

Druva adopted ABS for all of its Auto Scaling groups, moving away from manual lists of instance types. Druva’s cloud operations team updated Auto Scaling group configurations to replace instance types lists with InstanceRequirements to automate instance type selection. Below is an example of updating an Auto Scaling group with ABS using the update-auto-scaling-group API.

In this example, instance requirements like number of vCPUs, memory, memory per vCPU, processor manufacturers, and instance generation are specified along with an exclusion list to exclude certain instance families. To avoid picking up instances with GPUs, customers can set both the minimum and maximum accelerator count to zero.

aws autoscaling update-auto-scaling-group --cli-input-json file://asg-config-abs.json

“asg-config-abs.json”

{
  "AutoScalingGroupName": "MySpotASG",
  "DesiredCapacityType": "units",
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "NewTemplateForSpot",
        "Version": "3"
      },
      "Overrides": [
        {
          "InstanceRequirements": {
            "VCpuCount": {
              "Min": 16,
              "Max": 16
            },
            "MemoryMiB": {
              "Min": 32768,
              "Max": 131072
            },
            "MemoryGiBPerVCpu": {
              "Min": 2,
              "Max": 8
            },
            "CpuManufacturers": [
              "intel",
              "amd"
            ],
            "InstanceGenerations": [
              "current"
            ],
            "AcceleratorCount": {
              "Min": 0,
              "Max": 0
            },
            "ExcludedInstanceTypes": [
              "d*",
              "h*",
              "x*",
              "i*",
              "z*"
            ]
          }
        }
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 0,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "lowest-price",
      "SpotInstancePools": 2,
      "OnDemandAllocationStrategy": "lowest-price"
    }
  },
  "MinSize": 0,
  "MaxSize": 120,
  "DesiredCapacity": 30,
  "VPCZoneIdentifier": "subnet-9a87c5d7,subnet-8dfc4eeb,subnet-d73f86f6"
}

Figure 5 – Auto Scaling group with ABS (console).

As can be seen in the matching instance types preview below, ABS not only generated the instance types list automatically from all matching existing instance types, but also included many newly-released instance types (M6i, M6a, C6i, C6a, R6i) that were missing from the earlier, manually configured list.

This helped Druva automatically tap into more Spot capacity pools in its Auto Scaling groups across different AWS regions. In this example, the preview below shows 25 Spot instance pools (instance types) per Availability Zone.

Figure 6 – Preview matching instance types as per ABS configuration.
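To illustrate how a set of attributes turns into a list of instance types, the following Python sketch filters a small hand-written catalog against the vCPU, memory, memory-per-vCPU, and exclusion settings used in asg-config-abs.json. It is a simplified model of the idea, not the service’s matching logic: the catalog holds only a handful of types, the CpuManufacturers, InstanceGenerations, and AcceleratorCount attributes are left out, and the use of fnmatch for the wildcard patterns is an assumption.

```python
from fnmatch import fnmatch

# Small illustrative catalog: instance type -> (vCPUs, memory in GiB).
CATALOG = {
    "m5.4xlarge": (16, 64),
    "c5.4xlarge": (16, 32),
    "r5.4xlarge": (16, 128),
    "c6i.4xlarge": (16, 32),
    "m6a.4xlarge": (16, 64),
    "m5.2xlarge": (8, 32),     # too few vCPUs
    "m5.8xlarge": (32, 128),   # too many vCPUs
    "i3.4xlarge": (16, 122),   # fits the numbers but is excluded by "i*"
    "x1e.4xlarge": (16, 488),  # excluded by "x*" and over the memory limit
}

# Same values as the corresponding keys in asg-config-abs.json.
REQUIREMENTS = {
    "VCpuCount": {"Min": 16, "Max": 16},
    "MemoryMiB": {"Min": 32768, "Max": 131072},
    "MemoryGiBPerVCpu": {"Min": 2, "Max": 8},
    "ExcludedInstanceTypes": ["d*", "h*", "x*", "i*", "z*"],
}


def in_range(value, bounds):
    """True if value lies within the inclusive Min/Max bounds."""
    return bounds["Min"] <= value <= bounds["Max"]


def matching_types(catalog, req):
    """Return the sorted instance types that satisfy the requirements."""
    matched = []
    for itype, (vcpus, mem_gib) in catalog.items():
        if any(fnmatch(itype, pat) for pat in req["ExcludedInstanceTypes"]):
            continue
        if (in_range(vcpus, req["VCpuCount"])
                and in_range(mem_gib * 1024, req["MemoryMiB"])
                and in_range(mem_gib / vcpus, req["MemoryGiBPerVCpu"])):
            matched.append(itype)
    return sorted(matched)


if __name__ == "__main__":
    print(matching_types(CATALOG, REQUIREMENTS))
```

Because the filter is expressed in attributes rather than names, adding a newly released type to the catalog makes it eligible without touching the requirements, which is the property Druva relied on.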

We’ll now find out which two instance types EC2 Auto Scaling launched in us-east-1a from the above 25 instance types, as per the lowest-price Spot allocation strategy (across two pools). We want to see whether the Auto Scaling group launches the same instance types as before with the manual instance types list (r5a.4xlarge and c5.4xlarge) or different ones this time. We’ll then analyze and compare the Spot prices of the instance types launched in both scenarios.

Figure 7 – Spot instances launched with ABS from two lowest-priced pools.

As can be seen above, the Auto Scaling group provisioned the c6a.4xlarge and c6i.4xlarge instance types from the 25 Spot instance pools automatically provided by the ABS configuration. These should be the two lowest-priced instance types among the 25 Spot capacity pools available.

Let’s determine Spot prices of all 25 instance types from the list generated by ABS at the time of launching capacity with our Auto Scaling group. We’ll sort them from lowest to highest Spot prices to find the two lowest-priced instance types and their Spot prices.

aws ec2 describe-spot-price-history --instance-types m4.4xlarge m5.4xlarge m5d.4xlarge m5a.4xlarge m5ad.4xlarge m5n.4xlarge m5dn.4xlarge m6a.4xlarge m6i.4xlarge r5.4xlarge r5a.4xlarge r5ad.4xlarge r5b.4xlarge r5d.4xlarge r5dn.4xlarge r5n.4xlarge r6i.4xlarge r4.4xlarge c5.4xlarge c5a.4xlarge c5d.4xlarge c5ad.4xlarge c5n.4xlarge c6a.4xlarge c6i.4xlarge --start-time 2022-03-08T05:00:00 --end-time 2022-03-08T06:00:00 --availability-zone us-east-1a --product-descriptions "Linux/UNIX"

Figure 8 – Spot prices of 25 instance types from ABS (sorted from lowest to highest).

As can be seen above, c6a.4xlarge and c6i.4xlarge are the two lowest-priced Spot instance pools picked by the Auto Scaling group from the much wider set of instance types generated automatically by ABS. These instance types were missing from our manual list, so we used Spot instance pools that were more expensive at that point in time.

In the earlier scenario with the manual instance types list, r5a.4xlarge and c5.4xlarge were launched. Let’s now compare the Spot prices of these instances and do some cost analysis.

For example, at the time of launching this Spot capacity, the Spot price difference between c6i.4xlarge and c5.4xlarge is 0.0605 US$/hour, and between r5a.4xlarge and c6i.4xlarge it is 0.0455 US$/hour.

Because Spot instance prices vary slowly and infrequently without spikes, such differences can make a huge impact on overall costs when scaling thousands of instances across hundreds of Auto Scaling groups, as in the case of Druva. Automated, future-proof instance type diversification with ABS also helps Druva access more Spot capacity across AWS regions.
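As a rough back-of-the-envelope illustration of how a per-hour difference adds up, the short Python sketch below multiplies the 0.0605 US$/hour gap between c5.4xlarge and c6i.4xlarge by a fleet size and a number of hours. The fleet of 1,000 instances and the 730-hour month are assumptions chosen for the arithmetic, not Druva’s figures, and the sketch assumes the price gap stays constant over the period.

```python
from decimal import Decimal


def price_gap_cost(gap_per_hour, instances, hours):
    """Extra spend from running `instances` for `hours` on the pricier pool."""
    return gap_per_hour * instances * hours


# Price gap taken from the comparison above (US$/hour).
C5_VS_C6I_GAP = Decimal("0.0605")

# Hypothetical fleet size and duration, for illustration only.
ASSUMED_INSTANCES = 1000
ASSUMED_HOURS_PER_MONTH = 730

if __name__ == "__main__":
    extra = price_gap_cost(C5_VS_C6I_GAP, ASSUMED_INSTANCES, ASSUMED_HOURS_PER_MONTH)
    print(f"Extra cost over the assumed month: US$ {extra:,.2f}")
```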

This shows that automated instance pools diversification with attribute-based instance types selection not only helped Druva overcome Spot capacity challenges but also reduced compute costs further.

Price Protection Cap with Attribute-Based Instance Type Selection

Druva wanted to control costs by avoiding Spot instances with extreme price differences, and wanted to provision Spot instances from only a certain percentage of lower-priced instance types from all Spot pools available.

This was made possible by setting a price protection threshold for Spot instances, which by default is set to 100% by ABS. That means EC2 Auto Scaling groups configured with ABS can provision any instance type (matching specified attributes) with Spot prices ranging from X to 2X, where X is the baseline price (the price of the least expensive M/C/R instance type with the specified attributes).

While the default 100% price protection cap widens Spot pools diversification choices, Druva also wanted to avoid Spot instance types with extreme price differences at any point in time to control its Spot costs further. As Druva’s worker nodes in the data plane are processing chunks of data with a checkpointing algorithm in place, their workloads are resilient against Spot interruptions and can maintain service continuity even below target capacity at times.

Druva therefore decided to use a lower Spot price protection threshold, in the range of 60-70% (the example ABS configuration below uses 60%). This meant that only instance types with Spot prices between X and 1.6X were used while launching Spot capacity (X being the price of the least expensive M/C/R instance type with the specified attributes at the time of provisioning capacity).

…
"InstanceRequirements": {
"VCpuCount": {"Min": 16, "Max": 16},
"MemoryMiB": {"Min": 32768, "Max": 131072},
"MemoryGiBPerVCpu": {"Min": 2.0, "Max": 8.0},
"SpotMaxPricePercentageOverLowestPrice": 60,
"CpuManufacturers": ["intel","amd"],
"InstanceGenerations": ["current"],
"AcceleratorCount": {"Min": 0, "Max": 0},
"ExcludedInstanceTypes": ["d*", "h*", "x*", "i*","z*"]
}
….
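The effect of the threshold can be sketched as a simple filter: compute the cap as X × (1 + threshold/100) and keep the pools priced at or below it. The Python below is a minimal model of that description, not the service’s implementation. The prices are made-up placeholders, the function names are hypothetical, and treating the cap as inclusive is an assumption.

```python
from decimal import Decimal

# Made-up Spot prices (US$/hour) for a few M/C/R pools, for illustration only.
PRICES = {
    "c6a.4xlarge": Decimal("0.25"),
    "c6i.4xlarge": Decimal("0.26"),
    "m5.4xlarge": Decimal("0.34"),
    "r5.4xlarge": Decimal("0.42"),
    "r5dn.4xlarge": Decimal("0.48"),
}

# X: the least expensive M/C/R type matching the attributes.
BASELINE = min(PRICES.values())


def price_cap(baseline, pct_over_lowest):
    """Highest allowed price: baseline plus pct_over_lowest percent."""
    return baseline * (Decimal(100) + Decimal(pct_over_lowest)) / Decimal(100)


def pools_within_cap(prices, baseline, pct_over_lowest=100):
    """Return the sorted instance types priced at or below the cap."""
    cap = price_cap(baseline, pct_over_lowest)
    return sorted(itype for itype, price in prices.items() if price <= cap)


if __name__ == "__main__":
    for pct in (100, 60):
        print(pct, price_cap(BASELINE, pct), pools_within_cap(PRICES, BASELINE, pct))
```

With these placeholder prices, the default 100% threshold keeps all five pools (cap of 2X), while 60% narrows the choice to the three pools priced within 1.6X, trading some diversification for tighter cost control.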

Druva also maintained a balance between instance diversification and capping of instance prices in its Spot usage. Druva automated a way to modify/increase price protection threshold values in ABS configurations to include more Spot pools without any downtime in Auto Scaling groups in any region, whenever it faced persistent Spot capacity challenges.

Conclusion

This post demonstrated the ease and benefits of adopting attribute-based instance type selection (ABS) with Auto Scaling groups. We also showed how customers like Druva are able to automate and future-proof their instance diversification choices to maximize the cost optimization benefits of EC2 Spot instances for elastic, scalable, and flexible workloads.

Druva has been able to bring down compute costs for its EC2 Spot usage by 10-15%, by adopting ABS with features like price protection thresholds, lowest-price allocation strategy diversified across pools, implementation of robust interruption handling, work checkpointing, and service recovery mechanisms.

With ABS, Druva was able to reduce operational challenges and overhead for its teams by eliminating the regular need for manual efforts and intervention in maintaining and updating instance type diversification in hundreds of Auto Scaling groups across regions, with its fast-expanding global customer base. This helped Druva repurpose its infrastructure operations efforts into the company’s core business of innovating on products and services instead.

Druva – AWS Partner Spotlight

Druva is an AWS Competency Partner and cloud-based data protection company that delivers cyber, data, and operational resilience via a single SaaS platform running on AWS.
