HTC-Grid – examining the operational characteristics of the high throughput compute grid blueprint
In a previous blog post, the first in a series on HTC-Grid, we outlined the challenges that financial services industry (FSI) organizations face when trying to balance increasingly volatile demands on their on-premises compute platforms with the challenges of reducing costs. We also introduced the HTC-Grid blueprint, which demonstrates how organizations that require a high throughput HPC scheduler can leverage AWS services to meet their needs.
This post goes into more detail on the operational characteristics (latency, throughput, and scalability) of the current HTC-Grid implementation to help you to understand if this solution meets your needs.
Scheduling latency

Scheduling latency is the round-trip time taken by a job to traverse the system, excluding the computation time: i.e., from the time a client submits a job (potentially comprised of multiple tasks), until it receives all the results for that submission.
Low scheduling latency provides fast response times from the grid. In the financial industry, fast response times are particularly important for intraday trade management and pricing, where computation results need to be produced and returned to a user as quickly as possible to respond to changing markets. More generally, low scheduling latency is especially important for jobs containing short-running tasks (e.g., execution times on the order of a second or less), as at these levels the scheduling overhead can start to make up a material proportion of the total response time and, as a result, reduce the efficiency of the system.
To measure scheduling latency, we used jobs of variable sizes comprised of zero-work tasks (tasks that did nothing). Each job was treated like any other job, but the executable returned a pre-defined, hardcoded output and then exited. This let us isolate and measure the cumulative delays associated with job processing by HTC-Grid separately from the computation time associated with task processing.
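The idea of a zero-work task can be sketched as follows. This is an illustration, not the actual HTC-Grid worker API: the handler name and payload shape are assumptions. The handler ignores its input and returns a hardcoded result immediately, so any measured round-trip delay is pure scheduling overhead rather than compute time.

```python
import json
import time

def zero_work_handler(event, context=None):
    """Return a pre-defined result immediately, doing no computation.

    Any latency a client observes for this task is scheduling overhead.
    """
    return json.dumps({"result": "ok", "finished_at": time.time()})
```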
We did three sets of measurements using batches of 1, 10, and 100 tasks. Figure 1 charts cumulative distribution functions (CDFs) of the measured scheduling overhead. In this figure, the X-axis shows end-to-end latency from the moment of submission of the job until all the results of the job are returned to the client, while the Y-axis shows the percentile – the fraction of jobs that completed within the indicated latency. The blue line in Figure 1 shows that the median scheduling latency for a single task was under 280ms, while the long-tail latency doesn’t exceed 420ms.
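A CDF like the one in Figure 1 can be derived from raw round-trip samples with a nearest-rank percentile. This sketch uses synthetic latency values (illustrative only, not the actual measurements):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: latency under which a fraction p of jobs completed."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p * len(ordered)) - 1)  # index of the nearest rank
    return ordered[k]

# Synthetic single-task round-trip latencies in milliseconds.
latencies_ms = [265, 270, 275, 280, 285, 300, 320, 350, 400, 420]
median = percentile(latencies_ms, 0.50)  # 285
p99 = percentile(latencies_ms, 0.99)     # 420 (the long tail)
```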
In production it’s common to submit tasks in batches, and often batches are also created by splitting a single task into multiple sub-tasks to parallelize the execution (when possible) and to reduce the overall latency of the job. That’s why we also introduced jobs of size 10 and 100 tasks in the measurement (red and green lines in Figure 1, respectively).
Relative to a single task, batch submissions show increased scheduling overhead. This is because the HTC-Grid API needs to register multiple tasks in the system, and then wait for the last and slowest task in the batch to complete (this is the long-tail latency). One of the key benefits of parallelism in the grid system is that the average per-task execution time in a batch isn’t significantly affected by the increased overhead of submitting a batch of tasks (instead of submitting a single task). As a result, the overall execution time of a batch is often much faster than the execution of individual tasks submitted sequentially.
HTC-Grid uses a simple greedy pull-based algorithm where workers of a particular type and size pull new tasks from a dedicated queue. This allows HTC-Grid to achieve low scheduling overhead and to scale linearly.
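The pull-based scheme can be modeled in a few lines. This is a simplified in-memory stand-in (the queue names and API are illustrative, not the real task-queue implementation): each worker type greedily drains its own dedicated queue, so there is no central matchmaker to become a bottleneck.

```python
from queue import Queue, Empty

# One dedicated queue per worker type/size (names are hypothetical).
task_queues = {"cpu-1x4": Queue(), "cpu-4x16": Queue()}

def worker_loop(worker_type, execute, max_idle_polls=3):
    """Greedily pull and execute tasks from this worker type's queue."""
    queue = task_queues[worker_type]
    results, idle = [], 0
    while idle < max_idle_polls:
        try:
            task = queue.get_nowait()  # pull: no central scheduler pushes work
        except Empty:
            idle += 1                  # a real worker would back off here
            continue
        results.append(execute(task))
    return results

for i in range(5):
    task_queues["cpu-1x4"].put(i)
done = worker_loop("cpu-1x4", execute=lambda t: t * 2)
```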
The overall end-to-end latency is dependent upon the client-side and grid-side configurations, which control the frequency at which clients poll for results, as well as the retry and back-off logic. Increasing these intervals reduces the load on the system, at the expense of increased scheduling latency. Different configurations would typically be used to optimize the HTC-Grid service for nightly batch jobs or for more real-time intraday jobs. Refer to the HTC-Grid documentation for more details.
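The polling trade-off described above can be sketched as a client-side loop with exponential back-off. Parameter names here are illustrative, not the actual HTC-Grid configuration keys; shorter intervals lower latency at the cost of more load on the grid.

```python
def poll_for_results(check, base_interval=0.05, backoff=2.0,
                     max_interval=1.0, max_polls=20):
    """Poll until check() returns a result, backing off between attempts.

    Returns the result and the total time spent waiting between polls.
    """
    interval, waited = base_interval, 0.0
    for _ in range(max_polls):
        result = check()
        if result is not None:
            return result, waited
        waited += interval                          # a real client would time.sleep(interval)
        interval = min(interval * backoff, max_interval)
    raise TimeoutError("results not ready")

# Simulate a job whose results arrive on the fourth poll.
attempts = iter([None, None, None, "done"])
result, waited = poll_for_results(lambda: next(attempts))
```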
Throughput

Our experience tells us that FSI workloads are usually characterized by a high volume of short-running tasks. For example, large FSI customers can process anywhere between a few million and hundreds of millions of tasks per night. Moreover, the compute times for a large portion of these tasks range from seconds to a few minutes.
Under these conditions, the scheduling component of the grid can become a bottleneck, leading to decreased throughput which consequently increases the time for the overall nightly batch. This situation can also ‘starve’ compute resources: tasks can’t be assigned for computation quickly enough, which leads to underutilization.
A common approach to deal with a high volume of short-running tasks is “task packing”, where we group multiple tasks and treat them as a single unit of work at the scheduling level. Unfortunately, while this improves efficiency, it can significantly increase response time for individual tasks, especially where compute time can’t be accurately predicted, making the technique ill-suited for tasks with strict latency requirements.
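A minimal sketch of task packing, with illustrative names and pack sizes: per-unit scheduling overhead is amortized across each pack, but every task in a pack now waits for the whole pack to finish, which is the latency cost noted above.

```python
def pack_tasks(tasks, pack_size):
    """Split a task list into packs treated as single units of work."""
    return [tasks[i:i + pack_size] for i in range(0, len(tasks), pack_size)]

def run_pack(pack, execute):
    """A worker executes an entire pack as one scheduling unit."""
    return [execute(t) for t in pack]

# 10 tasks become 3 scheduling units instead of 10.
packs = pack_tasks(list(range(10)), pack_size=4)
results = [run_pack(p, execute=lambda t: t + 1) for p in packs]
```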
As we mentioned, we measured HTC-Grid’s throughput using zero-work tasks. To measure the throughput, we scaled HTC-Grid to 12,000 workers and used 16 clients to continuously submit jobs of 200 zero-work tasks. To make our measurement representative (and comparable to the scale of the workloads in the FSI industry) we measured throughput using a batch of 10,000,000 tasks.
Figure 2 outlines the results we obtained. Using 12,000 workers a single HTC-Grid deployment was able to exceed the throughput of 30,000 tasks per second and finish the batch of 10,000,000 tasks in under six minutes. This performance is in line with the needs of our customers today.
Now, we achieved this using zero-work tasks, so this figure excludes real computational work. The actual throughput of the grid depends on the distribution of the tasks’ compute times and the number of workers (i.e., the scale of the grid), and can be estimated from the measurements above.
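A back-of-the-envelope model of that estimate: with N busy workers and a mean task compute time of T seconds, compute-bound throughput is roughly N / T, capped by the ~30,000 tasks/second scheduling limit measured above for this configuration. This is a simplification (it ignores queueing effects and scheduling overhead), not a formula from the HTC-Grid documentation.

```python
SCHEDULER_LIMIT_TPS = 30_000  # measured above with zero-work tasks

def expected_throughput(workers, mean_task_seconds):
    """Estimate sustainable tasks/second for a given grid scale and task size."""
    compute_bound = workers / mean_task_seconds
    return min(compute_bound, SCHEDULER_LIMIT_TPS)

tps_short = expected_throughput(12_000, 1.0)   # 1 s tasks: compute-bound at 12,000 tasks/s
tps_long = expected_throughput(12_000, 60.0)   # 1 min tasks: 200 tasks/s
```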
The demonstrated throughput of 30,000 tasks per second is not an absolute limit for HTC-Grid. Within the scope of this measurement, the throughput was bound by the configured provisioned capacity of Amazon DynamoDB which was set to 550,000 write units. We’ll discuss more benchmarking measurements, and in greater detail, in a future AWS whitepaper on this topic.
Scalability & Elasticity
Next, we tested the scale at which a single deployment of HTC-Grid can operate and measured the expected time that it took for the deployment to achieve that scale.
Figure 3 shows the scaling time from 1 worker running on a single EC2 instance, to 20,000 workers running on close to 600 Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances using Amazon Elastic Kubernetes Service (Amazon EKS). The scaling time was approximately 20 minutes.
Amazon EKS actively monitors the load on Control Plane instances and automatically scales them to ensure high performance. We saw consistent pod-scheduling performance as HTC-Grid scaled up to 600 nodes. You can find more details on the scaling logic in our first blog post and in our workshop. We used Amazon EC2 Spot Best Practices to maximally diversify the list of acceptable instance types. In all our experiments we configured workers to use 1 vCPU and 4 GB RAM.
HTC-Grid uses a pluggable architecture and supports different compute planes, including Amazon Elastic Container Service (Amazon ECS). This allows you to select the most appropriate compute backend depending on your workload characteristics and the technical skills of the team operating your grid.
Figure 4 shows that HTC-Grid backed by the Amazon ECS compute plane scaled to 76,000 workers within 20 minutes. We achieved this almost 4x improvement by configuring HTC-Grid to use multiple Amazon ECS clusters as part of its compute plane. That allowed us to parallelize and speed up the scaling process and achieve a higher scale, overall.
Next, we assessed the scalability of HTC-Grid with non-zero workloads. Using two batches of tasks configured to run for 10 and 60 seconds respectively, we scaled the HTC-Grid environment to 1,000, 5,000, and 20,000 workers. Figure 5 shows the result: the throughput of the system scales linearly with its size.
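The linear-scaling relationship in Figure 5 can be illustrated with an idealized makespan model (ignoring the scheduling overhead measured earlier): with fixed-length tasks, multiplying the worker count divides the batch wall-clock time by the same factor, so throughput grows linearly with grid size. The numbers below are illustrative, not the measured results.

```python
import math

def batch_makespan(num_tasks, workers, task_seconds):
    """Idealized wall-clock time: tasks run in waves of `workers` at a time."""
    return math.ceil(num_tasks / workers) * task_seconds

def throughput(num_tasks, workers, task_seconds):
    """Tasks completed per second over the whole batch."""
    return num_tasks / batch_makespan(num_tasks, workers, task_seconds)

# 100,000 ten-second tasks: 20x the workers gives ~20x the throughput.
t1k = throughput(100_000, 1_000, 10)
t20k = throughput(100_000, 20_000, 10)
```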
The future of HTC-Grid
HTC-Grid is a simple asynchronous architecture that leverages AWS cloud-native services. There are two differentiators behind HTC-Grid’s approach: (1) the assertion that “all compute looks like Lambda”; and (2) HTC-Grid’s runtime structure is determined by the composition of services at deployment, optimizing the performance and cost trade-offs per workload.
The HTC-Grid project continues to build and evolve these concepts. We’re currently working on several areas:
Support for inter-task dependencies – with options for both client-side & server-side management of a directed acyclic graph (DAG) of calculations.
Support for sticky sessions, or ‘soft affinity’ – a preference to run certain task types on certain resources. The data needed by a task might have already been pre-cached on some workers courtesy of previously hosted tasks, so it makes sense to use it again for tasks with similar needs.
Task-to-resource dependency management, or ‘hard affinity’ – some tasks must run on workers with a specific memory configuration.
Dynamic infrastructure – the ability for HTC-Grid to provision new compute resource types, or dedicated resources, in response to the hard affinity metadata attached to a job.
We think these enhancements will help HTC-Grid be an even more compelling high-performance, cloud-native solution for legacy HPC workloads that can be migrated with no algorithmic changes.
However, a low-latency, high-throughput distributed compute substrate is also a critical element in subsequent algorithmic modernization, so we’re also spending time on:
Optimizing quant libraries against the latest silicon – like AWS Graviton 3, GPUs, and FPGAs. This will improve performance and help customers reduce their CO2 footprint.
New mathematical techniques to significantly reduce calculation time, such as operator overloading and automatic adjoint differentiation (AAD) techniques.
These elements can dramatically shorten task execution times with corresponding demands for increased throughput and decreased latency on the grid substrate.
Finally, many organizations are evaluating the potential of AI/ML (with Amazon SageMaker) and quantum computing (with Amazon Braket). But in both cases a hybrid approach is needed, with high-throughput classical computing capability needed to quantitatively explore the sensitivities around the minima that these novel techniques identify.
In this post, we explored HTC-Grid’s performance and operational characteristics, to help you understand how you might use – and customize – the solution.
We are already seeing AWS customers in financial services and other industries embrace HTC-Grid’s “all compute looks like Lambda” approach, using it to enable the development of innovative cloud-native distributed systems that they might not usually classify as traditional Risk HPC.