Scheduling Tasks on AWS with IBM Spectrum LSF and IBM Spectrum Symphony
By Geert Wenes, Partner Solutions Architect at AWS
Many high performance computing (HPC) and grid customers with large on-premises technical computing systems select IBM Spectrum LSF and IBM Spectrum Symphony for policy-driven control and scheduling.
Spectrum Symphony schedules tasks in milliseconds, where conventional schedulers take seconds. It also supports tens of thousands of compute nodes and hundreds of applications on a scalable, shared, and heterogeneous multi-site grid.
For you, this translates into better application performance, better throughput, better utilization, and the ability to respond quickly to business demands.
Many on-premises infrastructures are aging, however, and strained by expanding requirements.
In the financial services industry (FSI), these requirements are increasingly regulatory in nature. For example, satisfying the Fundamental Review of the Trading Book (FRTB) requirements will lead to more frequent risk calculations on a wider range of data.
In HPC segments such as manufacturing or electronic design automation (EDA), these requirements are shorter time-to-solution and higher-fidelity results, with larger, multi-scale models and more complex multi-physics.
To cope with increasing requirements without slowing down the pace of innovation, customers are testing and validating different cost-effective deployment models of Spectrum LSF and Spectrum Symphony.
Some start with hybrid deployments, but complexities associated with security, data movement, and predictable costs have driven many customers to simplify their deployments on Amazon Web Services (AWS).
Even so, deployments on the AWS Cloud can still benefit from the policy-driven control and scheduling capabilities of Spectrum LSF and Spectrum Symphony. Each has a plugin-based technology, called resource connector in Spectrum LSF and host factory in Spectrum Symphony, which allows you to define policies that make these environments automatically elastic based on workload demand.
In particular, both hybrid and cloud-native deployments can take advantage of Amazon EC2 On-Demand Instances as well as Spot Instances, which are unused Amazon EC2 capacity available at up to a 90 percent discount compared to On-Demand Instance prices.
In response to customer requests, IBM and AWS are on an ongoing journey to enable the IBM Spectrum Computing family of products on AWS.
AWS and IBM, an AWS Partner Network (APN) Select Technology Partner, have completed testing and validation for hybrid deployments of Spectrum LSF and Spectrum Symphony. Both provide enterprise workload management for distributed high performance computing and analytics, and have an established brand within their respective markets.
Today, you can reliably and cost-effectively deploy Spectrum LSF on AWS. IBM offers a deployment guide (including deployment options and steps) and best practices in Ansible Playbooks. Spectrum LSF conforms to best practices with respect to operations, security, cost-effectiveness, and backup and recovery.
Spectrum LSF can be deployed in two modes:
- Stretch cluster
- Multi-cluster
Stretch cluster mode assumes you have a cluster in another location, either on-premises or running on another cloud or cloud location. It’s defined as a single cluster stretched over a wide area network (WAN) so that compute nodes in the cloud communicate with a master scheduling host at the originating location.
In the LSF Stretch Cluster architecture, the on-premises cluster resources can be dynamically “stretched” over a WAN to include cloud resources to accommodate spikes in demand.
Though simpler in concept than the multi-cluster mode, this generally means all LSF daemon communication with the master scheduler happens over the WAN, which can be a source of extra cost or lowered reliability. The following diagram shows Spectrum LSF deployed with the stretch cluster configuration:
Figure 1 – IBM Spectrum LSF stretch cluster configuration.
The multi-cluster mode architecture adds a master scheduler running on AWS. This architecture simplifies communication and coordination between the on-premises and cloud-based clusters by reducing it to task metadata exchanges between master schedulers in a “job forwarding” model.
Hence, it eliminates all communication from the cloud compute instances to the on-premises master. In fact, in multi-cluster mode, all compute capacity can reside on AWS and none needs to reside on-premises.
The following diagram shows Spectrum LSF deployed with the multi-cluster configuration.
Figure 2 – IBM Spectrum LSF multi-cluster configuration.
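To make the difference between the two modes concrete, here is a minimal sketch that counts how many remote endpoints exchange scheduler traffic with the on-premises master over the WAN. It is a toy model built only from the descriptions above; the names Deployment and wan_peers_of_onprem_master are hypothetical and are not part of Spectrum LSF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Toy description of a hybrid deployment (hypothetical, not an LSF object)."""
    mode: str         # "stretch" or "multi-cluster"
    cloud_hosts: int  # compute instances running on AWS


def wan_peers_of_onprem_master(deployment: Deployment) -> int:
    """Count cloud-side endpoints that talk to the on-premises master over the WAN."""
    if deployment.mode == "stretch":
        # Every cloud compute host communicates with the on-premises master.
        return deployment.cloud_hosts
    if deployment.mode == "multi-cluster":
        # Only the master scheduler on AWS exchanges job metadata with
        # the on-premises master ("job forwarding").
        return 1
    raise ValueError(f"unknown mode: {deployment.mode!r}")


if __name__ == "__main__":
    for mode in ("stretch", "multi-cluster"):
        d = Deployment(mode=mode, cloud_hosts=500)
        print(mode, wan_peers_of_onprem_master(d))
```

In this model the WAN fan-in grows with the number of cloud hosts in stretch cluster mode and stays constant in multi-cluster mode, which is the trade-off the two diagrams illustrate.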
Both configurations offer certain advantages and trade-offs. Each is covered in detail in the deployment guide, which can be downloaded from GitHub.
Spectrum LSF includes an additional capability, the LSF resource connector, which enables policy-driven cloud bursting to AWS. In particular, this enables you to use either On-Demand or Spot Instances to request computing capacity. A Spot Instance request is fulfilled as long as capacity is available, and you can choose whether your Spot Instances are hibernated, stopped, or terminated when Amazon EC2 reclaims the capacity with two minutes of notice.
Spot Instances are a cost-effective choice if you can be flexible about when your applications run, and if your applications can be interrupted. If you use Spot Instances and your application does get interrupted, Spectrum LSF may be able to requeue your job within the two-minute termination notice, because it periodically checks for Spot Instances that are scheduled to be reclaimed and requeues their jobs.
However, if you have specified hibernation as the interruption behavior, you do not receive the two-minute warning because the hibernation process begins immediately, and Spectrum LSF may not be able to requeue the job.
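As a rough illustration of this timing, the sketch below parses an interruption notice and decides whether there is still time to requeue a job. It assumes the notice has the JSON shape returned by the EC2 instance metadata item spot/instance-action (an action plus a UTC time). The function name plan_for_interruption and the returned dictionary are hypothetical, and this is not a description of how Spectrum LSF itself is implemented.

```python
import json
from datetime import datetime, timezone


def plan_for_interruption(instance_action_json: str, now: datetime) -> dict:
    """Decide whether a job could still be requeued before the instance goes away.

    instance_action_json is assumed to look like:
        {"action": "terminate", "time": "2019-01-01T12:02:00Z"}
    where action is one of "hibernate", "stop", or "terminate".
    """
    notice = json.loads(instance_action_json)
    action = notice["action"]
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)

    if action == "hibernate":
        # Hibernation begins immediately, so there is no usable warning window.
        return {"action": action, "requeue": False, "seconds_left": 0.0}

    # For stop or terminate, whatever time remains can be used to requeue.
    seconds_left = max(0.0, (deadline - now).total_seconds())
    return {"action": action, "requeue": seconds_left > 0, "seconds_left": seconds_left}


if __name__ == "__main__":
    now = datetime(2019, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    doc = '{"action": "terminate", "time": "2019-01-01T12:02:00Z"}'
    print(plan_for_interruption(doc, now))
```

A scheduler-side poller would call a function like this on each check and requeue the affected job only when requeue is true.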
Using Spot Instances with Spectrum LSF, you can significantly reduce the cost of running your applications and still maintain capacity, even for hyper-scale workloads.
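The cost effect can be worked through with simple arithmetic. The sketch below uses a made-up On-Demand price of 1.00 per instance-hour and the "up to 90 percent" discount quoted earlier as an upper bound. Actual Spot prices vary by instance type, Region, and time, so every number here is an assumption for illustration only.

```python
def spot_hourly_price(on_demand_price: float, discount: float) -> float:
    """Hourly Spot price for a given fractional discount (0.90 means 90 percent off)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_price * (1.0 - discount)


def instance_hours(budget: float, hourly_price: float) -> float:
    """Instance-hours a fixed budget buys at a given hourly price."""
    return budget / hourly_price


def capacity_multiplier(discount: float) -> float:
    """How many times more instance-hours the same budget buys on Spot."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 / (1.0 - discount)


if __name__ == "__main__":
    budget = 1000.0           # hypothetical budget
    on_demand = 1.00          # hypothetical On-Demand price per instance-hour
    for discount in (0.50, 0.70, 0.90):
        spot = spot_hourly_price(on_demand, discount)
        print(
            f"discount={discount:.0%} "
            f"on_demand_hours={instance_hours(budget, on_demand):.0f} "
            f"spot_hours={instance_hours(budget, spot):.0f}"
        )
```

At the best-case 90 percent discount the same budget buys about ten times the instance-hours, while a 50 percent discount doubles them.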
Spectrum Symphony contains a similar framework to the LSF resource connector, called host factory. This enables your on-premises clusters to dynamically include compute hosts from AWS based on the resource demands of applications in your cluster and on AWS. You can control bursting for your cluster through policy configurations, which define when and how resource scale-out and scale-in requests are triggered.
With host factory, you can leverage the on-demand capabilities of the AWS infrastructure to provision as many resources as you need and pay only for what you use.
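A bursting policy of this kind can be thought of as a function from current demand to a scale-out or scale-in request. The following is a minimal sketch under assumed inputs. BurstPolicy, its fields, and both functions are hypothetical and do not reflect host factory's actual configuration format or API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstPolicy:
    """Hypothetical policy knobs, not host factory's real configuration."""
    slots_per_host: int   # tasks one cloud host can run concurrently
    min_cloud_hosts: int  # floor kept running on AWS
    max_cloud_hosts: int  # ceiling that caps spend


def desired_cloud_hosts(policy: BurstPolicy, pending_tasks: int, on_prem_free_slots: int) -> int:
    """Cloud hosts needed to absorb demand that on-premises slots cannot take."""
    overflow = max(0, pending_tasks - on_prem_free_slots)
    needed = math.ceil(overflow / policy.slots_per_host)
    return min(policy.max_cloud_hosts, max(policy.min_cloud_hosts, needed))


def scaling_request(
    policy: BurstPolicy,
    current_cloud_hosts: int,
    pending_tasks: int,
    on_prem_free_slots: int,
) -> int:
    """Positive result means scale out by that many hosts, negative means scale in."""
    target = desired_cloud_hosts(policy, pending_tasks, on_prem_free_slots)
    return target - current_cloud_hosts


if __name__ == "__main__":
    policy = BurstPolicy(slots_per_host=8, min_cloud_hosts=0, max_cloud_hosts=50)
    print(scaling_request(policy, current_cloud_hosts=10, pending_tasks=1000, on_prem_free_slots=200))
    print(scaling_request(policy, current_cloud_hosts=10, pending_tasks=100, on_prem_free_slots=200))
```

Keeping the target calculation separate from the delta makes the policy easy to test and lets the maximum host count act as a hard cost ceiling.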
Today, Spot enablement for host factory is released in limited availability mode. To enable Spot in Spectrum Symphony v7.1 and v7.2, engage with your IBM sales team for Engineering Feature Requests (EFR) or download IBM-supported patches. You may also engage with the AWS sales team and specialist solutions architects for custom enablement.
Spot enablement for host factory will be made generally available (GA) with the next revision release of Spectrum Symphony v7.3.
Using IBM Spectrum LSF and IBM Spectrum Symphony on the AWS Cloud offers the following key outcomes:
- Easier migration path.
- Flexible deployment options that include native AWS or hybrid cloud mode.
- Policy-driven scheduling capability for maximization of compute resources and optimal application performance.
If you’re a Spectrum LSF or Spectrum Symphony customer, you now have flexible options for running on AWS. IBM’s cloud-friendly pay-as-you-go (PAYG) licensing model, along with elastic scaling capabilities, makes AWS an ideal target for bursting workloads. You also have the option to bring your own licenses (BYOL) to AWS. IBM continues to deliver support directly, just as it does when those licenses are deployed on customer premises.
Spot Instances are currently enabled in Spectrum LSF, can be enabled in Spectrum Symphony upon request, and will soon be GA for Spectrum Symphony. As a result, you can significantly reduce the cost of running applications, grow your application compute capacity and throughput for the same budget, and enable new types of cloud computing applications.
To get started, visit GitHub, where the deployment guide takes you through the two deployment options for Spectrum LSF. You can watch the videos and find the Ansible Playbooks used in them. They are public and freely available for you to take and customize.
IBM – APN Partner Spotlight
IBM is an APN Select Technology Partner. Customers around the world rely on IBM’s advanced cloud technologies and on the deep industry and technology expertise of IBM services and solutions professionals and consultants.