AWS for Industries

Optimizing HPC deployments with EC2 Fleet and IBM Spectrum LSF

Introduction

High performance computing (HPC) workloads are becoming increasingly complex with the advent of big data, advanced node electronic design automation (EDA) for chip design, and high-precision verification. Enterprises are adopting Amazon Web Services (AWS) to meet the constantly growing compute demands of HPC. The Worldwide HPC in the Cloud Forecast 2020–2026 from Hyperion Research (June 2022) suggests that HPC cloud spend will outpace on-premises spend over the next five years.

AWS continues to lower the barrier to cloud adoption for our HPC customers. EC2 Fleet makes it possible to launch a group of Amazon Elastic Compute Cloud (Amazon EC2) instances, which offer secure and resizable compute capacity for virtually any workload, with a single API call. IBM Spectrum LSF, a popular HPC workload manager, has integrated EC2 Fleet into its resource connector module. With EC2 Fleet support in LSF, we can now realize flexible architectures with a single template configuration, reducing operational overhead. The EC2 Fleet API helps provision HPC clusters with tens of thousands of Amazon EC2 instances in minutes instead of hours.

This post details the steps to activate EC2 Fleet with LSF. We provide configurations for the following cluster patterns:

  • Cluster with heterogeneous Amazon EC2 instance types in multiple Availability Zones
  • Cluster with a combination of Amazon EC2 Spot Instances, which let you take advantage of unused Amazon EC2 capacity in the AWS Cloud, and Amazon EC2 On-Demand Instances, which let you pay for compute capacity by the hour or second with no long-term commitments
  • Cluster with instance prioritization capability

Life before EC2 Fleet–LSF integration

A high-level HPC workflow on AWS with LSF comprises the steps shown below:

Figure 1: A high-level HPC workflow on AWS with LSF

As shown in Figure 1:

1. The user logs in to a remote workstation running on Amazon EC2.
2. The user submits job(s), and the LSF management host handles the job scheduling request.
3. The LSF management host brings up the required Amazon EC2 instances and schedules the jobs on these instances based on parameters specified in the resource connector configuration files.

Prior to EC2 Fleet integration, LSF called the Amazon EC2 RunInstances and Spot Fleet APIs to launch instances on AWS. This mechanism had challenges. In heterogeneous clusters, Amazon EC2 RunInstances calls were made serially for each instance type, resulting in a significant runtime impact: it took hours to provision clusters with tens of thousands of cores. Custom scripts often implemented exponential back-off algorithms to acquire compute capacity, which also contributed to increased cluster setup times. Multiple templates were required to configure different instance types and Availability Zones, resulting in operational overhead. Finally, an inability to combine instance purchase options meant reduced flexibility for cost optimization.
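For illustration only, the sketch below shows the kind of custom provisioning loop this approach required (this is not code shipped with LSF; the Region, launch template name, instance types, and counts are assumptions):

import time
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed Region

# One RunInstances call per instance type, retried with exponential back-off
# when capacity is not immediately available. Instance types are examples.
for instance_type in ["c5.large", "c5.2xlarge"]:
    delay = 1
    while True:
        try:
            ec2.run_instances(
                LaunchTemplate={"LaunchTemplateName": "lsf-compute-template"},
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=100,
            )
            break
        except ClientError as e:
            if e.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                raise
            time.sleep(delay)           # back off before retrying
            delay = min(delay * 2, 60)  # exponential back-off, capped at 60s

Serial calls like these, each with their own retries, are what drove multi-hour provisioning times for large clusters.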

Configuring EC2 Fleet with LSF

With EC2 Fleet in the LSF resource connector, you can now launch tens of thousands of compute cores on AWS within minutes. With a single template, a heterogeneous cluster spanning multiple Availability Zones can be configured for provisioning. Furthermore, the Amazon EC2 On-Demand Instance, Spot Instance, and Reserved Instance (which provides a significant discount compared to On-Demand pricing) purchasing options can be used together in the LSF configuration. This makes configuring the cluster for cost optimization simpler and more flexible. The integration provides the ability to design cost-efficient, performant, and resilient HPC environments on AWS.

EC2 Fleet support in LSF is available in Fix Pack 13. It can be activated following the installation steps here. Once installed, create an EC2 launch template with the configuration information needed to launch an EC2 instance, and note its LaunchTemplateId; it will be used to configure the EC2 Fleet. The architecture patterns discussed above can then be defined in the configuration file located at $LSF_TOP/conf/resource_connector/aws/conf/ec2-fleet-config.json.
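As a minimal sketch (assuming boto3; the template name, AMI ID, key pair, and security group below are placeholder assumptions), a launch template can be created and its LaunchTemplateId captured as follows:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed Region

# Create a minimal launch template; the AMI ID, key pair, and security group
# below are placeholders and must be replaced with your own values.
response = ec2.create_launch_template(
    LaunchTemplateName="lsf-compute-template",
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",            # LSF compute node AMI
        "KeyName": "my-key-pair",
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
)

# Note the LaunchTemplateId; it is referenced in ec2-fleet-config.json.
print(response["LaunchTemplate"]["LaunchTemplateId"])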

(A) Cluster with heterogeneous Amazon EC2 instance types in multiple Availability Zones
LSF clusters can have hundreds to thousands of compute cores. An effective mechanism to achieve massive scale on AWS is to build a heterogeneous LSF cluster with different Amazon EC2 instance types, which maximizes the chances of obtaining compute capacity. LSF clusters can also span multiple Availability Zones within a single AWS Region for increased resiliency, in addition to gaining access to more compute capacity. A heterogeneous cluster across multiple Availability Zones can be configured in ec2-fleet-config.json as below:

{
   "LaunchTemplateConfigs":[
      {
         "LaunchTemplateSpecification":{
            "LaunchTemplateId": "LaunchTemplateId_Created_Earlier",
            "Version":"1"
         },

         // InstanceType lets you specify multiple instance types within
         // a single EC2 Fleet configuration. The example below specifies
         // c5.large and c5.2xlarge. SubnetId lets you define subnets in
         // different AWS Availability Zones.
         "Overrides":[
            {
               "InstanceType":"c5.large",
               "SubnetId":"subnet-0fe69d290ae026155",
               "WeightedCapacity":1
            },
            {
               "InstanceType":"c5.2xlarge",
               "SubnetId":"subnet-0dfee843e19bfeb52",
               "WeightedCapacity":1
            }
         ]
      }
   ],

   "TargetCapacitySpecification":{
      "TotalTargetCapacity": $LSF_TOTAL_TARGET_CAPACITY,
      "OnDemandTargetCapacity": $LSF_ONDEMAND_TARGET_CAPACITY,
      "SpotTargetCapacity": $LSF_SPOT_TARGET_CAPACITY,

      // DefaultTargetCapacityType can be 'on-demand' or 'spot'
      "DefaultTargetCapacityType": "on-demand"
   },

   // Type is 'instant' for On-Demand and 'request' for Spot
   "Type":"instant"
}

The above template defines two instance types. Without EC2 Fleet, we would have had to define two templates in LSF to achieve the same configuration. Moreover, the number of templates would grow linearly with instance types and Availability Zones, making configuration error-prone and maintenance complex. With EC2 Fleet, we can achieve this using a single template and a single API call.
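Although LSF issues the CreateFleet call itself through the resource connector, it can be useful to validate the configuration independently. Here is a minimal sketch using a DryRun request (assuming boto3, that the $LSF_* placeholders have been replaced with numbers, and that the explanatory // comments have been stripped so the file is valid JSON):

import json
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed Region

# Load the fleet configuration; placeholders such as $LSF_TOTAL_TARGET_CAPACITY
# must already be replaced with integers, and comments removed, for the file
# to parse as JSON.
with open("/opt/config/ec2-fleet-config.json") as f:
    fleet_config = json.load(f)

try:
    # DryRun checks permissions and request validity without launching anything.
    ec2.create_fleet(DryRun=True, **fleet_config)
except ClientError as e:
    # An error code of DryRunOperation means the request would have succeeded.
    print(e.response["Error"]["Code"], e.response["Error"]["Message"])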

(B) Cluster with a combination of Spot and On-Demand Instances
Spot Instances let us take advantage of unused Amazon EC2 capacity on AWS and are available at up to a 90 percent discount compared to On-Demand prices. With EC2 Fleet support in LSF, we can now maximize compute instance availability by combining Spot Instances with On-Demand Instances and by using different allocation strategies. This can be configured in ec2-fleet-config.json as below:

{
   "OnDemandOptions":{
      "AllocationStrategy": "lowest-price"
   },
   "SpotOptions":{
      "AllocationStrategy": "price-capacity-optimized",
      "InstanceInterruptionBehavior": "terminate"
   },
   "LaunchTemplateConfigs":[
      {
         "LaunchTemplateSpecification":{
            "LaunchTemplateId": "LaunchTemplateId_Created_Earlier",
            "Version":"1"
         },
         "Overrides":[
            {
               "InstanceType":"c5.2xlarge",
               "SubnetId":"subnet-06bd452f95ffd6193",
               "WeightedCapacity":4
            }
         ]
      }
   ],

   "TargetCapacitySpecification":{
      "TotalTargetCapacity": $LSF_TOTAL_TARGET_CAPACITY,
      "OnDemandTargetCapacity": $LSF_ONDEMAND_TARGET_CAPACITY,
      "SpotTargetCapacity": $LSF_SPOT_TARGET_CAPACITY,
      "DefaultTargetCapacityType": "spot"
   },
   "Type":"request"
}
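As a worked example of how WeightedCapacity interacts with the target capacity (the numbers below are illustrative assumptions, not values from the configuration above):

import math

# With a WeightedCapacity of 4, each c5.2xlarge launched counts as 4 units
# toward the target capacity. The total target here is an example value.
total_target_capacity = 100   # e.g., the value substituted for $LSF_TOTAL_TARGET_CAPACITY
weighted_capacity = 4         # from the override above

# EC2 Fleet launches enough instances to reach at least the target capacity.
instances_needed = math.ceil(total_target_capacity / weighted_capacity)
print(instances_needed)       # 25 c5.2xlarge instances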

(C) Cluster with instance prioritization capability
In some cases, customers want to prioritize certain instance types over others, for example a preference for higher-memory instances in CPU-based licensing scenarios, or better job packing on certain instance sizes. With the EC2 Fleet API and LSF, customers can assign a priority to each launch specification in ec2-fleet-config.json as below:


{
   "OnDemandOptions":{
      // Allocation strategy needs to be set to 'prioritized'
      "AllocationStrategy": "prioritized"
   },
   "SpotOptions":{
      // AllocationStrategy can be specified as below
      "AllocationStrategy": "price-capacity-optimized",
      "InstanceInterruptionBehavior": "terminate"
   },
   "LaunchTemplateConfigs":[
      {
         "LaunchTemplateSpecification":{
            "LaunchTemplateId": "LaunchTemplateId_Created_Earlier",
            "Version":"1"
         },
         "Overrides":[
            {
               "InstanceType":"c5.2xlarge",
               "SubnetId":"subnet-hgr84786",
               "WeightedCapacity":1,
               // The lower the number, the higher the priority for the instance type
               "Priority": 30
            }
         ]
      }
   ],

   "TargetCapacitySpecification":{
      "TotalTargetCapacity": $LSF_TOTAL_TARGET_CAPACITY,
      "OnDemandTargetCapacity": $LSF_ONDEMAND_TARGET_CAPACITY,
      "SpotTargetCapacity": $LSF_SPOT_TARGET_CAPACITY,
      "DefaultTargetCapacityType": "spot"
   },
   "Type":"request"
}
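Priority only takes effect when there is more than one launch specification to choose from. The sketch below (instance types and subnet ID are illustrative assumptions) shows an Overrides list with two prioritized entries, expressed as the equivalent Python structure; with the prioritized On-Demand allocation strategy, EC2 Fleet attempts the lowest Priority value first:

# Maps one-to-one to the Overrides array in ec2-fleet-config.json.
# r5.2xlarge (Priority 10) is attempted before c5.2xlarge (Priority 30).
overrides = [
    {
        "InstanceType": "r5.2xlarge",
        "SubnetId": "subnet-0fe69d290ae026155",
        "WeightedCapacity": 1,
        "Priority": 10,
    },
    {
        "InstanceType": "c5.2xlarge",
        "SubnetId": "subnet-0fe69d290ae026155",
        "WeightedCapacity": 1,
        "Priority": 30,
    },
]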

The configuration flexibility outlined in (B) and (C) above was not possible without EC2 Fleet support in LSF.

Once the patterns are defined in ec2-fleet-config.json, $LSF_TOP/conf/resource_connector/aws/conf/awsprov_templates.json needs to be updated with an ec2-fleet template definition as below:

{
   "templateId": "ec2-fleet",
   "maxNumber": 5000,
   "attributes": {
      "types": ["String", "X86_64"],
      "ncores": ["Numeric", "1"],
      "ncpus": ["Numeric", "1"],
      "mem": ["Numeric", "3750"],
      "awshost": ["Boolean", "1"]
   },
   // Specify the path to the ec2-fleet-config.json defined above
   "ec2FleetConfig": "/opt/config/ec2-fleet-config.json",

   // You can specify the ratio of On-Demand to Spot Instances
   "OnDemandTargetCapacityRatio": 0.5
}

Known limitations

As of the date of publication of this post, support for the following does not exist:

  • “Maintain” fleet type (click here for EC2 Fleet types)
  • InstanceRequirements and TargetCapacityUnitType attributes in EC2 Fleet configuration
  • Weighted capacity support for Spot Fleet request
  • Target capacity unit types in other dimensions (memory, GPUs, other resources, and so forth)

The EC2 Fleet API and the LSF configuration for AWS features are under constant development. We encourage readers to follow new feature announcements for increased coverage of the EC2 Fleet integration with LSF.

Conclusion

EC2 Fleet integration with LSF simplifies configuring complex HPC architectural patterns on AWS. Although a few patterns are covered here, more are outlined in this blog and AWS documentation for EC2 Fleet. We also encourage readers to refer to IBM documentation regarding configuring LSF on AWS.

EC2 Fleet API integration is available in LSF Fix Pack 13. The feature has the potential to significantly boost your productivity with shortened cluster provisioning times. It also offers the benefit of reduced operational overhead. We invite our customers to work with their respective AWS account teams to pioneer the technology with a proof of concept.

Click here to learn more about semiconductors and electronics on AWS and here for HPC on AWS. Also try an AWS EDA Workshop with IBM Spectrum LSF.

Kartik Gopal

Kartik Gopal is a Sr. Solutions Architect at AWS and specializes in helping semiconductor customers adopt AWS for their design needs. Kartik’s domain knowledge comes from a combination of authoring EDA tools for silicon design and helping enterprises adopt cloud in his 17 years of industry experience.

Bill McMillan

Bill McMillan is a principal product manager on the IBM Spectrum Computing team. With over 30 years of experience in HPC in a wide range of industries, McMillan has the overall product management responsibility for Spectrum Computing. His interests include the use of GPUs, containers, and cloud for high performance and high throughput computing. He is based in the United Kingdom.

Martin Gao

Martin Gao is a software engineer on the IBM Spectrum Computing team. He has over a decade of experience in HPC, with a specialization in integrating Spectrum Computing products with cloud infrastructure. Gao is based in Toronto, Canada.

Mayur Runwal

Mayur Runwal is a senior solutions architect at AWS and an EDA specialist. He works closely with semiconductor customers, building and architecting EDA solutions on AWS. Runwal worked at IT organizations in semiconductor companies building virtual desktop infrastructure for over 10 years. He has extensive experience in leading, architecting, and implementing IT solutions.