AWS HPC Blog

Running FSI workloads on AWS with YellowDog

This post was contributed by Kirill Bogdanov, Principal Solutions Architect at AWS, and Alan Parry, CTO at YellowDog

Large compute grids are critical components of any financial services industry (FSI) organisation today. They’re used in daily operations from pricing products, to risk management and monitoring, and even regulatory reporting. Historically, though, compute grids have been complex to operate and expensive to run. Even worse, some organizations’ demand for compute regularly exceeds their limited on-premises capacity. All of this has led to FSI firms exploring cloud for modernisation and as a cost effective alternative.

YellowDog is a high performance computing (HPC) environment that specializes in large-scale workloads and hybrid-cloud deployments, allowing customers to use their on-premises capacity while simultaneously taking advantage of the scale and capabilities of AWS.

In this post, we’ll look at YellowDog’s performance and the operational characteristics that matter most to FSI customers, including scheduling latency, throughput, and integration with AWS services. This is the second post in a series – in our previous post, we showed you how YellowDog can scale up to 3.2 million vCPUs on AWS in just under 33 minutes.

The challenge familiar to FSIs

Today, new financial regulatory requirements like FRTB (the Fundamental Review of the Trading Book), RWA (Risk Weighted Assets), and ESG (Environmental, Social, and Governance) have resulted in a significant increase (up to 10x) in computational demand, compelling FSI firms to find additional compute capacity.

Furthermore, FSI workloads often involve large numbers of short-running tasks, requiring a variation of HPC known as high throughput computing (HTC). It’s not uncommon to process more than 100 million tasks per day, with a substantial portion of these tasks taking less than two seconds to run. These characteristics can make many ‘normal’ job schedulers ill-suited for the task. But the ability to perform more computations, sooner, translates directly into improved quality of service and market competitiveness for an institution.

These factors are driving FSI firms to reimagine their computing grids and to explore cloud-based modernization pathways in pursuit of improved functionality and cost efficiency.

YellowDog is a leading cloud provisioning and job scheduling stack that enables customers across FSI to optimise their compute infrastructure and run complex workloads across multiple AWS regions, simultaneously. Today we’re evaluating the suitability of running YellowDog on AWS for FSI HTC workloads, specifically.

Definitions

For the rest of this post, let’s standardize on some terms so we don’t get tangled up:

  • Task – A single unit of work submitted for processing.
  • Job – A logical group of tasks that are submitted and executed together, often with interdependencies.
  • Workload – A larger group of jobs that are executed to deliver a business outcome, for example, a complete nightly batch.
  • Worker – A process of some kind running on EC2 that can execute a task. An EC2 instance can host multiple workers, each consuming a share of its resources and launching processes or containers.
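To make the hierarchy concrete, here’s a rough sketch of how these terms nest. The field names are ours for illustration only – they are not YellowDog’s API.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    command: str                                       # a single unit of work to execute

@dataclass
class Job:
    name: str
    tasks: list[Task] = field(default_factory=list)    # tasks submitted and executed together

@dataclass
class Workload:
    name: str                                          # e.g. a complete nightly batch
    jobs: list[Job] = field(default_factory=list)      # jobs delivering one business outcome
```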

Scheduling overhead

Scheduling overhead (or scheduling latency) is the time it takes for a client application to submit a job to the grid and to receive a processing result, excluding the time for the actual computational processing itself.

Low scheduling overhead can let traders price instruments and assess risk exposure in close to real-time, which can be critical in many situations. Furthermore, when processing short-running jobs, low scheduling overhead ensures high grid utilisation, with most of the compute time spent on job processing rather than job scheduling and book-keeping.

Let’s look at how YellowDog serves these near real-time intraday workloads. Figure 1 illustrates the concept of scheduling overhead.

Figure 1: The timeline depicting events from the moment a task is submitted to a grid until results are received by the client. Scheduling overhead is the difference between the completion time perceived by the client and the actual task execution time.

This figure presents a timeline depicting the sequence of events from the moment a task is submitted to a grid until results are received by the client. For the sake of simplicity, we’ll consider a job consisting of a single task.

At time T1, a client application prepares a job and submits it to the grid at time T2. At this stage, the task enters the scheduling queue of the grid and waits to be assigned to a worker process. At time T3, a worker picks up the task and executes it until completion at T4.

Finally, at time T5, using push or pull mechanisms, the client application becomes aware that the job has completed, allowing the collection of results. At time T6, the client application collects the results (though sometimes, notification and result retrieval are combined).

It’s important to note that while the actual execution time of the task spans from T3 to T4, the perceived time from the client’s perspective is T2 to T6. For this post, we’ll refer to the time difference between execution time perceived by the client application and actual job execution time as the scheduling overhead.

Based on these definitions, a single client submitting jobs serially can complete at most 1 / (T6 – T2) tasks per second on the grid.
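As an illustration, here’s a minimal sketch of how a client could measure this overhead. The functions submit_job, wait_for_completion, and fetch_results are hypothetical placeholders for whatever SDK or REST calls your grid client uses, and we assume the worker reports its own execution time (T4 minus T3) alongside the results.

```python
import time

def measure_scheduling_overhead(job) -> float:
    t2 = time.monotonic()                    # T2: job submitted to the grid
    job_id = submit_job(job)                 # hypothetical client call
    wait_for_completion(job_id)              # T5: client learns the job has finished
    results = fetch_results(job_id)          # T6: results collected by the client
    t6 = time.monotonic()
    perceived = t6 - t2                      # latency as experienced by the client
    execution = results["execution_time"]    # T4 - T3, as reported by the worker
    return perceived - execution             # the scheduling overhead
```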

Testing YellowDog’s scheduling

We wanted to stress YellowDog’s scheduling components, so we used zero-work tasks (tasks that did literally nothing): the processing step simply returned a hard-coded output. That way we minimised the T3 to T4 time interval and focused on the scheduling overhead itself.

Figure 2 shows the cumulative distribution functions (CDFs) of the measured scheduling overhead. In this figure, the x-axis shows the end-to-end latency from the moment of submission of the job until all the results of the task are returned to the client, while the y-axis shows the percentile – the fraction of jobs that completed within the indicated latency.

We performed these measurements using jobs containing 1, 10, and 100 zero-work tasks, and plotted them using full, dashed, and dotted lines respectively in Figure 2.

We also considered two distinct job-completion notification mechanisms. The first mechanism relies on YellowDog’s task-level REST API. This API allows us to check the status of individual tasks.

The second mechanism is a custom extension we built with Amazon ElastiCache. This mechanism relies on the worker to mark task completion directly in the cache, with the client application pulling results from the cache. This is more consistent with how real-world tasks would usually be managed.
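A minimal sketch of that cache-based mechanism might look like the following, assuming a Redis-compatible ElastiCache endpoint. The hostname and key names below are hypothetical, not part of YellowDog’s API.

```python
import redis

r = redis.Redis(host="grid-cache.example.amazonaws.com", port=6379)

# Worker side: store the result, then signal completion on the job's done-list.
def mark_complete(job_id: str, task_id: str, result: bytes) -> None:
    r.set(f"result:{job_id}:{task_id}", result)
    r.rpush(f"done:{job_id}", task_id)

# Client side: block until the next task in the job completes, then fetch its result.
def wait_for_next_result(job_id: str, timeout: int = 60):
    item = r.blpop(f"done:{job_id}", timeout=timeout)
    if item is None:
        return None                                   # timed out; nothing completed yet
    _, task_id = item
    return r.get(f"result:{job_id}:{task_id.decode()}")
```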

Figure 2: End-to-end task completion time as perceived by a client. Median latency for different notification mechanisms is around 400ms.

The results indicated that YellowDog can provide an end-to-end scheduling overhead of a few hundred milliseconds, which is low enough to go unnoticed by a user, like a trader, who requires a near real-time response.

Efficiency with short duration tasks

For ‘long’ running tasks (that is, minutes to hours) a scheduling latency of one second might not be relevant. But this changes when processing time is measured in seconds: a low scheduling overhead directly translates into efficiency when processing short running tasks.

YellowDog’s Worker polls the YellowDog Scheduler for tasks to execute. If there are no tasks for a Worker to process, it sleeps for a randomly chosen (but configurable) duration, which defaults to between 0 and 60 seconds.

Once a task becomes available, the Worker will retrieve it, execute it, and will then attempt to retrieve the next task from the queue. If there are no further tasks waiting, the Worker will sleep again until another task becomes available.

The duration of the Worker sleep interval is only relevant when there are no tasks in the Worker’s queue, and it can be adjusted at a system-wide level if required, to reduce the average interval.
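In pseudocode, the Worker behaviour described above looks roughly like this. The poll_for_task and run calls are placeholders rather than YellowDog API calls, and the 0–60 second range mirrors the default sleep interval.

```python
import random
import time

MIN_SLEEP_S, MAX_SLEEP_S = 0, 60          # default sleep range, adjustable system-wide

def worker_loop():
    while True:
        task = poll_for_task()            # ask the Scheduler for work (placeholder call)
        if task is None:
            # Nothing queued: back off for a random interval before polling again.
            time.sleep(random.uniform(MIN_SLEEP_S, MAX_SLEEP_S))
            continue
        run(task)                         # execute, then immediately poll for the next task
```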

By using a prepopulated task queue, a single YellowDog Worker can efficiently process up to 25 tasks per second, implying a base scheduling overhead of only 40 milliseconds. This processing speed translates into an efficiency rate of 98% for tasks that complete within two seconds.
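To check that arithmetic: 25 tasks per second implies roughly 40 milliseconds of overhead per task, so a two-second task spends about 98% of its wall-clock time doing useful work.

```python
overhead_s = 1.0 / 25                        # ~40 ms base scheduling overhead per task
task_s = 2.0                                 # a short task running for two seconds
efficiency = task_s / (task_s + overhead_s)
print(f"{efficiency:.1%}")                   # -> 98.0%
```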

That means a single YellowDog Worker can efficiently consume short tasks from a queue and deliver results in a timely fashion.

Throughput

Throughput refers to the rate at which tasks can be processed. Throughput is particularly important for nightly batches when millions of tasks need to be processed as a part of a larger workload, as efficiently as possible. The ability to schedule a large volume of tasks quickly and efficiently is of paramount importance for a successful FSI grid scheduler.

More importantly, a fast scheduler can scale horizontally to take full advantage of the compute resources available at AWS, allowing workloads to complete faster, thereby providing an opportunity for additional computation or to re-run workloads that required correction.

To measure YellowDog’s throughput, we configured 2,000 YellowDog Workers (each with 1 vCPU and 4 GiB RAM), and we used 16 clients to submit zero-work jobs of 5,000 to 10,000 tasks each, totalling 10,000,000 tasks.

We measured the total time from the moment we initiated submission, until the last task was complete.
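A simplified sketch of that measurement driver is shown below. The submit_job, make_zero_work_job, and wait_until_all_tasks_complete helpers are hypothetical stand-ins for the real client calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TOTAL_TASKS = 10_000_000
TASKS_PER_JOB = 5_000                      # we varied this between 5,000 and 10,000
NUM_CLIENTS = 16

def client_submit(num_jobs: int) -> None:
    for _ in range(num_jobs):
        submit_job(make_zero_work_job(TASKS_PER_JOB))      # hypothetical helpers

start = time.monotonic()
jobs_per_client = TOTAL_TASKS // (TASKS_PER_JOB * NUM_CLIENTS)
with ThreadPoolExecutor(max_workers=NUM_CLIENTS) as pool:
    for _ in range(NUM_CLIENTS):
        pool.submit(client_submit, jobs_per_client)
wait_until_all_tasks_complete()            # hypothetical: poll until the last task finishes
elapsed = time.monotonic() - start
print(f"throughput: {TOTAL_TASKS / elapsed:,.0f} tasks per second")
```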

Using this approach, we measured the average throughput to be 3,000 tasks per second (TPS). That translates to more than 86 million tasks in a typical eight-hour overnight window.

Integration with AWS services

When it comes to running large scale grids on AWS, it’s critical to follow the best practices and use the correct APIs to obtain the necessary compute capacity quickly – and cost effectively.

YellowDog supports the Amazon Elastic Compute Cloud (Amazon EC2) Fleet API, including important options like EC2 Spot and allocation strategies (including the recommended ‘price-capacity-optimized’ strategy). It also supports attribute-based instance selection, which greatly simplifies finding suitable Amazon EC2 instances for your workload to include in the Spot mix.
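To make that concrete, here’s a minimal boto3 sketch (not YellowDog’s own code) of a single EC2 Fleet request combining Spot capacity, the price-capacity-optimized allocation strategy, and attribute-based instance selection. The launch template ID and target capacity are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# One "instant" fleet request: Spot capacity, price-capacity-optimized allocation,
# and attribute-based instance selection instead of a fixed list of instance types.
response = ec2.create_fleet(
    Type="instant",
    TargetCapacitySpecification={
        "TotalTargetCapacity": 1000,                 # placeholder capacity
        "DefaultTargetCapacityType": "spot",
    },
    SpotOptions={"AllocationStrategy": "price-capacity-optimized"},
    LaunchTemplateConfigs=[{
        "LaunchTemplateSpecification": {
            "LaunchTemplateId": "lt-0123456789abcdef0",   # placeholder template
            "Version": "$Default",
        },
        "Overrides": [{
            # Describe what a worker needs and let EC2 choose from every
            # instance type that satisfies these requirements.
            "InstanceRequirements": {
                "VCpuCount": {"Min": 1, "Max": 4},
                "MemoryMiB": {"Min": 4096},
            },
        }],
    }],
)
print(len(response["Instances"]), "instance groups launched")
```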

Using ‘dynamic templates’, YellowDog automatically selects the best source of compute based on your requirements and preferences, which is especially useful when your workload characteristics change. This automation is fed by YellowDog Insights, which measures Amazon EC2 instance types across price, performance, benchmarks, location, and energy efficiency, including a comparison of Spot and On-Demand pricing. YellowDog’s managed compute requirements also ensure that if you experience Spot interruptions, compute is re-provisioned and tasks are retried on other workers.

Conclusion

We saw a median scheduling overhead of 400ms, and a throughput greater than 3,000 tasks per second. In our previous post we showed YellowDog scaling to more than 3.2 million vCPUs, and maintaining a high utilization for that fleet. Taken together, these results indicate that YellowDog has the capability to handle a wide range of FSI workloads.

Based on these results, we think YellowDog has the necessary scale, efficiency, and low scheduling latency to handle both batch and intraday jobs effectively for financial services customers. It’s also well integrated with AWS services, which allows it to take full advantage of the power of AWS when provisioning compute resources at scale.

YellowDog’s existing FSI customers include large asset managers and hedge funds running complex hybrid-cloud workloads at scale. You can access YellowDog through AWS Marketplace, which will include the cost of the solution in your monthly AWS bill.

You can get in touch with YellowDog directly to arrange a discovery call, discuss your business requirements, and learn how they can support you. Alternatively, reach out to the FSI HPC experts at AWS by emailing us at ask-hpc@amazon.com.

Kirill Bogdanov

Kirill Bogdanov is a Principal Solutions Architect at Amazon Web Services (AWS) for Global Financial Services. He provides cloud-native architecture designs and prototype implementations to build highly reliable, scalable, secure, and cost-efficient solutions that support customers’ long-term business objectives and strategies. Kirill holds a Ph.D. in Computer Science from KTH Royal Institute of Technology, with expertise in distributed systems and High Performance Computing (HPC). He has many years of experience in R&D, cloud migration, developing large-scale innovative solutions leveraging cloud technologies, and driving digital transformation.

Alan Parry

Alan Parry is CTO at YellowDog, where he leads the engineering team, keeping the platform ahead of the curve with the latest features and technical innovation. He has 20 years’ experience in technology delivery and taking world-class R&D teams from concept to product. That includes eight years leading Hewlett-Packard’s Advanced Development Group, defining R&D strategy and providing a central function for skills, processes and tools in pursuit of new product development.