AWS Compute Blog

Using Amazon APerf to go from 50% below to 36% above performance target

This post is written by Tyler Jones, Senior Solutions Architect – Graviton, AWS.

Performance tuning the Renaissance Finagle-http benchmark

Sometimes software doesn’t perform the way it’s expected to across different systems. This can be due to a configuration error, a code bug, or differences in hardware performance. Amazon APerf is a powerful tool designed to help identify and address performance issues on AWS instances and other computers. APerf captures comprehensive system metrics simultaneously and then visualizes them in an interactive report. The report allows users to analyze metrics such as CPU usage, interrupts, memory usage, and CPU core performance counters (PMU) together. APerf is particularly useful for performance tuning workloads across different instance types, as it can generate side-by-side reports for easy comparison. APerf is valuable for developers, system administrators, and performance engineers who need to optimize application performance on AWS. From here on, we use the Renaissance benchmark as an example to demonstrate how APerf is used to debug and find performance bottlenecks.

The example

The Renaissance finagle-http benchmark was unexpectedly found to run 50% slower on a c7g.16xl Graviton3 instance than on a reference instance, both initially using the Linux-5.15.60 kernel. This is unexpected behavior: Graviton3 should perform as well as or better than our reference instance, as it does for other Java-based workloads, so it is likely that there is a configuration problem somewhere. The Renaissance finagle-http benchmark is written in Scala but compiles to Java bytecode, so our investigation focuses on the JVM as well as system-level configurations.

Overview

System performance tuning is an iterative process conducted in two main phases: the first focuses on overall system issues, and the second focuses on CPU core bottlenecks. APerf assists in both phases.

APerf can render data from several instances in one report, side by side: typically a reference system and the system to be tuned. The reference system provides values to compare against; in isolation, metrics are harder to evaluate. A metric may be acceptable in general, but the comparison to the reference system makes it easier to spot room for improvement.

APerf helps to identify unusual system behavior, such as high interrupt load, excessive I/O wait, unusual network layer patterns, and other such issues in the first phase. After adjustments are made to address these issues, for example by modifying JVM flags, the second phase starts. Using the system tuning of the first phase, fresh APerf data is collected and evaluated with a focus on CPU core performance metrics.

Any metric where the CPU cores of the system under test (SUT) compare unfavorably to the reference system holds potential for improvement. In the following sections we discuss the two phases in detail.
For more background on system tuning, refer to the Performance Runbook in the AWS Graviton Getting Started guide.

Initial data collection

Here is an example of how 240 seconds of system data is collected with APerf:

#enable PMU access
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
#APerf has to open more than the default limit of files.
ulimit -n 65536
#Usually APerf would be run in another terminal.
#For illustration purposes it is sent to the background here.
./aperf record --run-name finagle_1 --period 240 &
#With 64 CPUs it takes APerf a little less than 15s to report readiness 
#for data collection.
sleep 15
java -jar renaissance-gpl-0.14.2.jar -r 8 finagle-http

The APerf report is generated as follows:

./aperf report --run finagle_1

Then, the report can be viewed with a web browser:

firefox aperf_report_finagle_1/index.html

APerf can render several data sets side by side in the same report:

./aperf report --run finagle_1_c7g --run finagle_1_reference --run ...

Note that it is crucial to examine the CPU usage over time shown in APerf. Some data may be gathered while the system is idle, and metrics from idle periods have no significant value.

First phase: system level

In this phase we look for differences at the system level. To do this, APerf collects data during runs of finagle-http on c7g.16xl and on the reference instance. The reference system provides the target numbers to compare against; any large difference warrants closer inspection.

The first differences can be seen in the following figure.

The APerf CPU usage plot shows larger drops on Graviton3 (highlighted in red) at the beginning of each run than on the reference instance.

Figure 1: CPU usage. c7g.16xl on the left, reference on the right.


The log messages about the GC runtime hint at a possible reason, as they coincide with the dips in CPU usage.

====== finagle-http (web) [default], iteration 0 started ======
GC before operation: completed in 32.692 ms, heap usage 87.588 MB → 31.411 MB.
====== finagle-http (web) [default], iteration 0 completed (6534.533 ms) ======

The JVM tends to spend a significant amount of time in garbage collection, during which it has to suspend all threads, so choosing a different GC may have a positive impact.

The default GC on OpenJDK 17 is G1GC. Given that the instances have 64 CPUs, parallelGC is an alternative, since GC work can then be performed with a high degree of parallelism. The Graviton Getting Started guide also recommends checking the GC log when working on Java performance issues. A cross-check using the JVM’s -Xlog:gc option confirms the reduced GC time with parallelGC.
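For instance, the GC can be switched and the GC log enabled in a single run, reusing the jar and iteration count from the earlier collection step (a minimal sketch):

#switch to parallelGC and print GC events to stdout
java -Xlog:gc -XX:+UseParallelGC -jar renaissance-gpl-0.14.2.jar -r 8 finagle-http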

The second difference is evident in the following figure: the CPU-to-CPU interrupts (IPIs). There is more than 10x higher activity on Graviton3, which means additional IRQ work on c7g.16xl that the reference system does not have to spend CPU cycles on.

Figure 2: IPI0/RES Interrupts. c7g.16xl on the left, reference on the right.


Grepping through kernel commit messages can help find patches that address a particular issue, such as IPI inefficiencies.
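For example, assuming a checkout of the linux-stable tree with its release tags, the commits between two kernel releases that mention IPIs can be listed like this (a sketch):

#list commits between the two releases whose messages mention IPI
git log --oneline --grep=IPI v5.15.60..v5.15.61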

This scheduling patch improves performance by 19%. Another IPI patch provides an additional 1% improvement. Switching to Linux-5.15.61 allows us to use these IPI improvements. The following figure shows the effect in APerf.

Figure 3: IPI0 Interrupts (c7g.16xl)


Second phase: CPU core level

Now that the system-level issues are mitigated, the focus shifts to the CPU cores. The data collected by APerf shows PMU metrics where the reference instance and Graviton3 differ significantly.

PMU metric                                                                Graviton3   Reference system
Branch prediction misses/1000 instructions                                16          4
Instruction translation lookaside buffer (TLB) misses/1000 instructions   8.3         4
Instructions per clock cycle (IPC)                                        0.6         0.6*

*Note that the reference system has a 20% higher CPU clock than c7g.16xl. As a rule of thumb, instructions per clock (IPC) multiplied by clock rate equals work done by a CPU.
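To make that rule of thumb concrete: at an equal IPC of 0.6, a clock that is 20% higher means roughly 0.6 × 1.2 = 0.72 units of work per unit time on the reference system versus 0.6 × 1.0 = 0.6 on c7g.16xl. Matching or beating the reference at the lower clock therefore requires raising IPC, which is what the following tuning targets.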

Addressing CPU core bottlenecks

The first improvement in PMU metrics stems from the parallelGC option. Although the intention was to increase CPU usage, the following figure shows lowered branch miss counts as well. Limiting JIT tiered compilation to only use C2 mode helps branch prediction by reducing branch indirection and increasing the locality of executed code. Finally, enabling transparent huge pages helps the branch prediction logic and avoids lengthy address translation look-ups in DDR memory. The following graphs show the effects of the chosen JVM options; the combined invocation is sketched after the first options list.

Branch misses/1000 instructions (c7g.16xl)

Figure 4: Branch misses per 1k instructions


JVM Options from left to right:

  • -XX:+UseParallelGC
  • -XX:+UseParallelGC -XX:-TieredCompilation
  • -XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages
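Combined, the full invocation for the third option set might look like this (a sketch, with jar and iteration count as in the initial data collection):

java -XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages -jar renaissance-gpl-0.14.2.jar -r 8 finagle-http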

With the options listed under the preceding figure, APerf shows the branch prediction miss rate decreasing from the initial 16 to 11. Branch mispredictions incur significant performance penalties, as they result in wasted cycles spent computing results that ultimately need to be discarded. Furthermore, these mispredictions cause the prefetching and cache subsystems to fail to load the necessary subsequent instructions into cache. Consequently, costly pipeline stalls and frontend stalls occur, preventing the CPU from executing instructions.
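The same trend can be cross-checked outside of APerf with Linux perf counters, assuming perf is installed and PMU access is enabled as shown earlier (a sketch):

#count retired instructions and branch misses for the benchmark run
perf stat -e instructions,branch-misses java -XX:+UseParallelGC -XX:-TieredCompilation -jar renaissance-gpl-0.14.2.jar -r 8 finagle-http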

Code sparsity

Figure 5: Code sparsity


JVM Options from left to right:

  • -XX:+UseParallelGC
  • -XX:+UseParallelGC -XX:-TieredCompilation
  • -XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages

Code sparsity is a measure of how compactly the instruction code is packed and how closely related code is placed. This is where turning off tiered compilation shows its effect: lower sparsity helps branch prediction and the cache subsystem.

Instruction TLB misses/1000 instructions (c7g.16xl)

Figure 6: Instruction TLB misses per 1k Instructions


JVM Options from left to right:

  • -XX:+UseParallelGC
  • -XX:+UseParallelGC -XX:-TieredCompilation
  • -XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages

The big decrease in TLB misses is caused by the use of transparent huge pages, which increase the likelihood that a virtual address translation is present in the TLB, since fewer entries are needed. This avoids translation table walks that would otherwise have to traverse page table entries in DDR memory, costing hundreds of CPU cycles per read.
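Note that -XX:+UseTransparentHugePages relies on the kernel making THP available in madvise or always mode, which is worth verifying before a run (a quick check, independent of APerf):

#the bracketed value is the active mode; [madvise] or [always] is needed
cat /sys/kernel/mm/transparent_hugepage/enabled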

Instructions per clock cycle (IPC)

Figure 7: Instructions per clock cycle


JVM Options from left to right:

  • -XX:+UseParallelGC
  • -XX:+UseParallelGC -XX:-TieredCompilation
  • -XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages

The preceding figure shows IPC increasing from 0.58 to 0.71 as JVM flags are added.

Results

This table summarizes the measures taken for performance improvement and their results.

JVM option set                            baseline   1      2      3
IPC                                       0.6        0.6    0.63   0.71
Branch misses/1000 instructions           16         14     12     11
ITLB misses/1000 instructions             8.3        7.7    8.8    1.1
Benchmark runtime [ms]                    6000       3922   3843   3512
Execution time improvement vs baseline    -          35%    36%    41%

JVM option sets:
  1. +parallelGC
  2. +parallelGC -tieredCompilation
  3. +parallelGC -tieredCompilation +UseTransparentHugePages

Re-examining the setup of our testing environment: changing where the load generator lives

With the preceding efforts, c7g.16xl reaches 91% of the reference system’s performance. The c7g.16xl branch prediction miss rate, at 11, is still higher than the reference’s 4. As shown in the preceding figures, reduced branch prediction misses have a strong positive effect on performance. What follows is an experiment to achieve parity with, or better than, the reference system through a further reduction of branch prediction misses.

Finagle-http serves HTTP requests generated by wrk2, a load generator implemented in C. The expectation is that the c7g.16xl branch predictor works better with the native wrk2 binary than with the Renaissance load generator, which executes on the JVM. The following diagram illustrates the difference between the Renaissance and the experimental setup. The wrk2 load generator and finagle-http are assigned through taskset to separate sets of CPUs: 16 CPUs for wrk2 and 48 CPUs for finagle-http. The idea is to have the branch predictors on these CPU sets focus on a limited code set, as sketched below.
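A minimal sketch of this CPU split (the CPU ranges follow the text; the server jar name, port, connection count, and request rate are illustrative placeholders):

#pin the finagle-http server to 48 CPUs
taskset -c 0-47 java -XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages -jar finagle-http-server.jar &
#pin wrk2 to the remaining 16 CPUs; -R sets the constant request rate
taskset -c 48-63 ./wrk -t 16 -c 256 -d 240s -R 100000 http://localhost:8080/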

With this performance-tuned setup, c7g.16xl can now handle a 36% higher request load than the reference using the same configuration, at an average latency limit of 1.5 ms. This illustrates the impact that system tuning with APerf can have: the same system that scored 50% lower than the comparison system now exceeds it by 36%.
The following APerf data shows the improvement in the key PMU metrics that led to the performance jump.

Branch prediction misses/1000 instructions

Figure 8: Branch misses per 1k instructions

Left chart: Optimized Java-only setup. Right chart: Finagle-http with wrk2 load generator

The branch prediction miss rate is reduced from 11 with the Java-only setup to 1.5.

IPC

Figure 9: Instructions Per Clock Cycle


Left chart: Optimized Java-only setup. Right chart: Finagle-http with wrk2 load generator

The IPC steps up from 0.7 to 1.5 due to the improvement in branch prediction.

Code sparsity

Figure 10: Code Sparsity

Left chart: Optimized Java-only setup. Right chart: Finagle-http with wrk2 load generator


The code sparsity decreases to 0.014 from 0.21, a factor of 15.

Conclusion

AWS created the APerf tool to aid in root cause analysis and to help address performance issues for any workload. APerf is a standalone binary that simultaneously captures relevant data as time series: CPU usage, interrupt frequency, memory usage, and CPU core metrics (PMU counters). APerf can generate reports for multiple data captures, making it easy to spot differences between instance types. We used this data to analyze why Graviton3 was underperforming and to see the impact of our changes on performance. Using APerf, we successfully adjusted configuration parameters and went from 50% below our performance target to 36% more performant than our reference system and its associated performance target. Without APerf, collecting these metrics and visualizing them is a non-trivial task. With APerf, you can capture and visualize these metrics with two short commands, saving you time and effort so you can focus on what matters most: getting the most performance from your application.