AWS HPC Blog
Using the MAQAO framework to analyze application performance across instances
This post was contributed by H. Bolloré, Research Engineer, C. Valensi, Technical Team Leader, and W. Jalby, Professor, MAQAO team at the University of Versailles Saint-Quentin-en-Yvelines (UVSQ). It was coordinated by G. Tourpe, Principal HPC Business Development Manager at AWS.
AWS offers a variety of compute instance types built on a range of processor architectures. Different instances offer distinct features and characteristics, each potentially advantageous for specific applications. To choose the optimal instance for an application, it’s crucial to assess and compare its performance and behavior across multiple instance types.
Our team at UVSQ develops MAQAO (the Modular Assembly Quality Analyzer and Optimizer), a performance analysis and optimization framework operating at the binary level with a focus on core performance. Our main goal is to guide application developers through the optimization process using synthetic reports and hints.
In this post, we’ll show you how to use the MAQAO framework to inspect and compare application performance with a sample application, GROMACS – from the molecular dynamics community. We’ll use two different CPU variants of the AWS Graviton but could just as easily choose two completely different architectures, like Graviton and x86.
We’ll show you how you can make more informed decisions for your application – and do some compiler tuning, too.
Our experiment
MAQAO mixes both dynamic and static analyses based on its ability to reconstruct high level structures such as functions and loops from an application binary code. Since MAQAO operates at binary level, it is agnostic regarding the language used in the source code and does not require recompiling the application to perform analyses. MAQAO has also been designed to concurrently support multiple architectures.
For our purposes, we chose two different instance types from Amazon Elastic Compute Cloud (Amazon EC2): Hpc7g instances, which feature the AWS Graviton3E processor, and C7g instances, which use the earlier Graviton3 processor. Tailored specifically for HPC workloads, Hpc7g instances deliver “up to 35% higher vector instruction performance” than their C7g predecessors.
As we mentioned, we selected GROMACS because it uses vector operations intensively, making it a good candidate for benefiting from these kinds of improvements.
First, we used the MAQAO ONE View module to analyze and characterize GROMACS’ behavior on a single instance. ONE View automates execution and analysis processes, consolidating results from all MAQAO modules. To respond to different use cases, ONE View can execute various combinations of profiling, analysis, and comparison – generating specialized reports targeting specific issues.
These reports are accessible through any web browser, providing a user-friendly interface for navigating between multiple tabs to explore profiles, identify issues, and find actionable advice.
We used the Summary tab to obtain a comprehensive overview of the application workflow, the compilation and library environment, and the precision of the MAQAO profiling.
Summary tab
The Summary tab lists all the identified issues and common problems, assessed to give clear pathways for optimization. These items fall into three categories: first, the stylizer contains all elements necessary to perform a proper profiling of the application (Figure 1); second, the strategizer assesses structural code characteristics (Figure 3); and third, the optimizer dives into the details of individual loop issues (not represented here).
The user is then guided through the optimization process by a score representing how many of several key characteristics are satisfactory. The score in square brackets is also color-coded, from red (worst case, low score) to green (best case, high score) (Figures 1, 2, and 3). For the optimizer, potential optimizations are ordered by their coverage, and each one is associated with a synthetic score evaluating the cost of its resolution in terms of time and implementation complexity (lower is better).
In this example, compilation options were not available in our initial profiling, which is represented by the score of 0 in the top-left paragraph (highlighted in red); we resolved this issue by adding the -grecord-gcc-switches compilation option. Figure 2 presents the stylizer of the second profiling, performed after adding this option.
All indications now point to a reliable profiling. Only a very small fraction of the code could not be categorized (4). The profiling duration is sufficient to ensure precision (1). Tuning for the CPU-type has been enabled (3). Finally, options to obtain debugging information have been used (2).
Even though optimization is not our goal during this experiment, some analyses are particularly relevant to the study. The code spent an overwhelming amount of time in loops (1), particularly in innermost loops (2). This is indicative of an application spending most of its time in computation – precisely our focus.
With a high level of confidence in the profile and an application meeting our selection criteria, we were able to conduct a proper comparison.
Comparison: MAQAO global metrics
We compared the application’s behavior between AWS Graviton3 and AWS Graviton3E using the comparison mode of the MAQAO ONE View module to analyze GROMACS on a single node with 64 cores. We compiled the code using Arm Compiler for Linux (ACfL) 22.1 and used the Arm Performance Library (Arm PL). We used the same executable for both analyses and conducted the comparison using small datasets to assess any potential improvements before scaling up to larger datasets. You can find the detailed comparison report in our repository, GROMACS ONE View comparison report. Figure 4 provides an overview of these runs as depicted in MAQAO.
The overall performance exhibits a significant speedup of approximately 13%. However, it’s important to note that while the computational segments of the program demonstrate increased efficiency, the components that were not impacted by the optimization now take a proportionally larger share of the runtime. Evidence of this is a 5-6% decrease in the Time in analyzed loops metric. While this behavior was somewhat expected, it warrants further investigation to ensure that other aspects of the application have not been adversely affected.
From this point onward, our focus shifts to identifying which parts of the code are affected by the new processor capabilities. Our initial approach involves examining the impact on individual functions in the codebase.
Functions comparison
The comparison mode of ONE View incorporates a dedicated view for comparing function timings between runs. Figure 5 illustrates this view for our specific comparison.
In our specific case, we are interested in functions that exhibit either increased or decreased runtime on the Graviton3E compared to the Graviton3.
The first two functions in the GROMACS library (the nbnxm_kernel_* functions, first and third in the list, also highlighted in green) demonstrate notable improvements on Graviton3E, with respective reductions of 9 seconds and 1.30 seconds compared to Graviton3. With initial timings of 31.81 seconds and 4.92 seconds, these improvements represent significant speedups of 39% and 34%, respectively.
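As a sanity check on those percentages (which measure speedup, i.e. the ratio of old to new runtime, rather than the raw fraction of time removed), here is a minimal helper; the function name is ours, and the timings come from the first kernel above:

```c
#include <assert.h>

/* Speedup expressed as a percentage gain: 100 * (t_old / t_new - 1).
 * A 9 s reduction from 31.81 s leaves 22.81 s, i.e. roughly a 39% speedup. */
double speedup_pct(double t_old, double t_new)
{
    return (t_old / t_new - 1.0) * 100.0;
}
```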
When we examine the OpenMP wait function (kmp_flag_64&lt;false, true&gt;::wait, second in the list, also highlighted in orange), we can see that it now accounts for a larger proportion of the application’s runtime (the Coverage column), which aligns with our expectations if certain sections of the code have been accelerated. However, the report also indicates an increase in the absolute time spent in this function. A possible explanation is that the benefits of acceleration are not evenly distributed across all threads, adversely affecting synchronization: while computation sped up, some of the time gained was offset by waiting for slower threads to complete. Nonetheless, the overall profile demonstrated a substantial performance gain.
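The distinction between coverage and absolute time is worth spelling out: coverage is relative, so a function’s coverage rises whenever the rest of the program gets faster, even if its own runtime is unchanged. The absolute-time increase observed here is therefore a separate, stronger signal. A minimal sketch with made-up timings (our own illustration, not MAQAO output):

```c
/* Coverage of one function as a percentage of total runtime. */
double coverage_pct(double t_func, double t_total)
{
    return 100.0 * t_func / t_total;
}

/* A 10 s wait in a 100 s run covers 10%. If computation shrinks so the
 * run takes 80 s while the wait stays at 10 s, coverage rises to 12.5%
 * even though the wait itself did not get any slower. */
```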
MAQAO provides several ways to check imbalances. As an example, Figure 6 has a visual presentation of the time spent in various functions across all threads.
GROMACS did not exhibit any significant irregular pattern or work imbalance, validating our previous hypothesis. Finding out why the time spent in synchronization increased would require a more in-depth analysis beyond the scope of this blog.
Loops comparison
We expected that the changes brought by Graviton3E would primarily affect the computational aspects of the code, particularly loops. To delve deeper into this aspect, we conducted an investigation similar to the one we did for functions. We’ve depicted our findings for our experiment in Figure 7.
With this view, we can more easily detect differences in performance behavior at the loop level. For both loops, the compiler generated two different variants. For one of these variants, the Graviton3E showed a substantial performance gain: a remarkable 42% speedup for the largest assembly loop. Note that other metrics remained unaffected because this experiment used the same binary executable on both systems.
To delve further into this analysis, we can open the detailed report for loop 764. Figure 8 presents this tab, focusing on the details of the assembly loop.
The static analysis reveals excellent code vectorization, particularly within a computation-intensive loop containing a total of 2176 floating-point operations. A substantial number of these computations used fused multiply-add (FMA) operations in their vectorized form, which is recognized as the most efficient approach for such calculations. This code adeptly leveraged Arm vector instructions and can be expected to reap considerable benefits from the changes made in Graviton3E cores. Confirming the loop timings we saw earlier, it is evident that the Graviton3E processor enhances performance when executing floating-point vector operations.
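To illustrate the pattern the report is describing, here is a hypothetical FMA-dominated kernel (our own sketch, not the actual GROMACS nbnxm code): with optimization enabled and a target such as -O3 -mcpu=neoverse-v1 (the Graviton3 core family), compilers typically turn the loop body into vectorized fused multiply-add instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: one multiply and one add per element, a classic
 * candidate for vector fused multiply-add (fmla) instructions. */
void fma_kernel(size_t n, const float *a, const float *b, float *acc)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += a[i] * b[i];
}
```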
Scalability report
Finally, building on the observed performance gains, our objective was to assess the scalability characteristics on both architectures.
To carry out this test, we used ONE View scalability mode: this executes a given application across various parallel settings and evaluates the scalability of functions and loops for each scenario. Figure 9 depicts the parallel efficiency of loops when running on a C7g instance (using Graviton3) with the number of MPI ranks ranging from 1 to 64.
Loops scalability report
On the C7g instance, the critical sections of the GROMACS application began to experience diminishing returns when we used all 64 cores. The two loops we focused on exhibited parallel efficiency ratios of 68% and 67%, respectively. The same analysis on the Hpc7g instance showed that the Graviton3E maintained a higher level of performance, with both loops sustaining an efficiency ratio of 84%.
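Parallel efficiency here is the standard ratio of ideal to observed scaling, efficiency(N) = T(1) / (N * T(N)). A minimal sketch (our own helper; the timings below are hypothetical values chosen to land near a 68% ratio, not figures from the report):

```c
/* Parallel efficiency on n cores: 1.0 means perfect linear scaling. */
double parallel_efficiency(double t_serial, double t_parallel, int n)
{
    return t_serial / (n * t_parallel);
}

/* e.g. a loop taking 64 s serially and 1.47 s on 64 cores runs at
 * 64 / (64 * 1.47), i.e. about 0.68 efficiency. */
```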
The Graviton3E not only improved the performance of intensive computational workloads but also enabled a more efficient use of all its cores (up to 84%).
Conclusion
The AWS Graviton3E processor offers tangible advantages for developers of HPC applications, making it a highly desirable choice for compute-bound programs. However, for memory-intensive or balanced workloads, the benefits of these core differences may not be as pronounced, making the case for a more thorough case-by-case evaluation.
The experiment we described here underscores how useful MAQAO can be for analyzing and comparing application behavior across diverse architectures. Given the extensive range of instances available on AWS, it’s clear that MAQAO is a valuable tool for selecting the most efficient instance types for specific requirements. This work is currently being extended by integrating MAQAO into a complementary framework, QaaS (Quality as a Service): QaaS is an open-source project combining advanced benchmarking techniques with innovative performance analysis to perform a detailed application characterization across a wide space of parameters (compilers, options, and more). QaaS will allow users to submit jobs for comprehensive automatic analyses targeting better-performing hardware, compilers, and manual code changes via ONE View.
MAQAO can analyze performance and detect a wide range of issues, and its variety of profiling methods and analyses makes it a versatile tool suitable for many purposes. It can guide developers by providing detailed and dedicated views, and assist benchmarkers doing profiling of applications and libraries.
MAQAO is available for both x86-64 and aarch64 architectures at no cost, and has extensive documentation.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.