Applying classical benchmarking methodologies to create a principled quantum benchmark suite
In this post, we will discuss the current landscape of quantum benchmarking and introduce SupermarQ, Super.tech’s suite of application-based benchmarks designed to overcome the limitations of existing approaches. SupermarQ uses Amazon Braket for device-agnostic access to gate-based quantum processing units (QPUs), so benchmarks can highlight the heterogeneity of quantum computers and their various strengths in the Noisy Intermediate-Scale Quantum (NISQ) era and beyond.
Benchmarking: past and present
Creating benchmarks is a foundational aspect of the computing industry, as the emergence of new architectures requires new ways to measure and define performance. Looking back, the growth of computing in the 1970s and 1980s led to the creation of LINPACK and SPEC for benchmarking supercomputers and workstations. The PARSEC benchmark suite was introduced in response to the proliferation of chip multiprocessors, and the explosion of machine learning applications led to the creation of MLPerf to benchmark performance between different models. With the introduction of various quantum computing architectures, new benchmarks must be developed and tailored to these systems.
You may have faced a challenge when deciding which QPU to run your application on, because the quantum industry has yet to adopt a de facto benchmark. Many attempts to describe quantum devices focus on metrics that are unrepresentative of holistic performance on real applications. Examples include (a) measurements of individual gate errors, qubit coherence times, or other low-level hardware properties, and (b) synthetic benchmarks that use random circuits to gauge hardware performance. While understanding the properties of individual gate operations is a critical component of characterizing a quantum computing (QC) system, it does not represent performance on applications. Similarly, synthetic benchmarks, while useful for providing insights into the theoretical computational power of a device, are limited in their scalability and in their similarity to real-world applications. Typical quantum applications do not take the form of random quantum circuits, so such benchmarks are not necessarily representative of useful workloads. In addition, the computation required to verify the output of these benchmarks becomes intractable as the number of qubits increases.
Finally, capturing the general performance of a computational system within a single number can be very challenging and can lead to unintended consequences. For instance, in the history of classical benchmarking, there have been examples of compilers and microarchitectures that were optimized for scoring well on a specific benchmark but neglected functionality outside the scope of what was being tested. Creating quantum computers that are effective at solving real-world problems is key to moving quantum computers out of the laboratory and into industry.
The case for application benchmarks
Application-level benchmarks characterize system performance on a specific application, which maps more directly to real use cases than circuit- or gate-level metrics do. Because they report application-level figures of merit, such as ground state energy or approximation ratio, application benchmarks make cross-platform comparisons between different quantum architectures and classical approaches more straightforward. This is an important attribute, since the crossover point between the best classical and quantum approaches shifts with every advance in algorithms, software, and hardware. Certain application benchmarks also have a level of scalability not present in benchmarks that rely on circuit simulation or other exponentially scaling classical computations to obtain the ideal output. When well designed, these applications are efficiently verifiable by classical computers, even at high qubit counts. For example, when applying an error correcting code, the input and output states are identical (in an ideal, noiseless scenario), so the fidelity of the execution can be measured without requiring expensive circuit simulation.
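The verification idea in that last example can be made concrete: when the ideal output is known in advance, scoring reduces to counting shots. This minimal sketch assumes a hypothetical counts dictionary as returned by a shot-based backend; the function name and representation are illustrative, not SupermarQ's API.

```python
# Sketch: scoring a benchmark whose ideal output is known without simulation.
# For an error-correction benchmark that encodes and decodes |0...0>, the ideal
# output is the input itself, so the score is simply the fraction of shots
# that return it.

def identity_benchmark_score(counts: dict[str, int], ideal: str) -> float:
    """Fraction of measurement shots matching the ideal bitstring."""
    total = sum(counts.values())
    return counts.get(ideal, 0) / total

# Hypothetical measurement counts from a noisy 3-qubit run:
counts = {"000": 920, "001": 35, "010": 25, "100": 20}
score = identity_benchmark_score(counts, ideal="000")
print(round(score, 3))  # 0.92
```

Because this check never simulates the circuit, it stays cheap at any qubit count, which is exactly the scalability property described above.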
Employing not one, but a suite of application benchmarks gives insight into performance across a range of potential use cases. Applications differ in the amount and kind of resources they require, so testing a basket of applications stresses different aspects of the hardware. Such an approach gives a more realistic representation of potential workloads for the QC.
Recent works within the quantum computer architecture community have taken the first steps towards application benchmarking. With the variety of QC architectures today – superconducting, trapped ion, cold atom, photonic, spin-based – initial architectural comparisons between these implementations have revealed the impact that qubit connectivity, native gate operations, and error rates can have on program execution. However, these attempts at cross-platform comparisons have been limited to a handful of applications that do not always represent the workloads we expect to run on QPUs in the near future.
SupermarQ: a robust suite of application benchmarks
SupermarQ was created by applying techniques from classical benchmarking to the quantum domain. Observing the pitfalls of quantum characterization led to the creation of design principles to overcome these limitations. These four principles are the cornerstones of SupermarQ:
- Scalability: A benchmark suite must be composed of applications whose size is parameterizable and performance is easily verifiable by classical machines.
- Meaningful and diverse: Benchmark applications should reflect workloads that will appear in practice. Incorporating applications from a range of domains – chemistry, finance, machine learning, etc. – will allow results to be relevant to the widest range of potential users and use cases.
- Full-system evaluation: In the NISQ era, many of the unique properties of different quantum implementations are realized at the compiler level, when the program is transpiled to a hardware-supported gate set. Since the compiler can make or break program execution, a benchmark suite should be specified at a shared level of abstraction to allow the compiler to play a role in overall system performance.
- Adaptivity: Any suite which aims to accurately measure performance must keep pace with the development of new algorithms, compilation optimizations, and hardware.
Guided by these principles, SupermarQ comprises the following benchmarks:

Quantum Approximate Optimization Algorithm: QAOA is a variational quantum-classical algorithm that can be trained to output bitstrings that solve combinatorial optimization problems. This benchmark indicates how well a machine can solve a certain class of problems, such as determining the shortest route between a group of cities (traveling salesman) or the lowest volatility stock portfolio.
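The natural application-level metric for a QAOA benchmark is the approximation ratio: the average objective value of the sampled bitstrings divided by the optimal value. This sketch computes it for a toy MaxCut instance; the graph, counts, and function names are illustrative, not SupermarQ's scoring code.

```python
# Sketch: approximation ratio for a QAOA MaxCut benchmark, computed directly
# from measurement counts -- no circuit simulation required.

def cut_value(bitstring: str, edges: list[tuple[int, int]]) -> int:
    """Number of edges crossing the partition defined by the bitstring."""
    return sum(bitstring[u] != bitstring[v] for u, v in edges)

def approximation_ratio(counts: dict[str, int], edges, optimal: int) -> float:
    total = sum(counts.values())
    avg = sum(cut_value(b, edges) * n for b, n in counts.items()) / total
    return avg / optimal

# Triangle graph: the optimal MaxCut value is 2.
edges = [(0, 1), (1, 2), (0, 2)]
counts = {"010": 700, "101": 200, "000": 100}  # hypothetical samples
print(round(approximation_ratio(counts, edges, optimal=2), 3))  # 0.9
```

A ratio of 1.0 would mean every sampled bitstring achieved the optimal cut; noise and imperfect training pull the score below that.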
Variational Quantum Eigensolver: VQE is a hybrid algorithm, like QAOA, whose goal is to find the lowest eigenvalue of a given problem matrix by computing a difficult cost function on the QPU and feeding this value into an optimization routine running on a CPU. This benchmark indicates how well a machine can solve a certain class of problems, such as determining chemical reaction rates or the ground state of a molecule.
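The hybrid QPU–CPU loop at the heart of VQE can be illustrated with a deliberately tiny problem. For the one-parameter ansatz Ry(theta)|0> and Hamiltonian Z, the expectation value is cos(theta), so the closed-form expression stands in for the QPU here; a coarse grid search stands in for the classical optimizer. All names are illustrative.

```python
import math

# Sketch: the VQE outer loop for a toy one-qubit problem. The "QPU" evaluation
# of <psi(theta)|Z|psi(theta)> is replaced by its closed form, cos(theta).

def energy(theta: float) -> float:
    """Expectation value <psi(theta)|Z|psi(theta)> for psi = Ry(theta)|0>."""
    return math.cos(theta)

# Classical optimizer stand-in: grid search over theta in [0, 2*pi).
theta = min((k * 0.01 for k in range(629)), key=energy)
print(round(energy(theta), 4))  # -1.0, matching the exact ground energy of Z
```

In a real VQE benchmark, each `energy` evaluation is a batch of circuit executions on the QPU, and the reported metric is how close the converged energy gets to the known ground state energy.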
Hamiltonian Simulation: This benchmark indicates how well a machine can simulate the time evolution of a system – be it chemical, physical or biological. Simulating the time evolution of a quantum system is one of the most promising applications of QC, as these algorithms possess exponential speedups over classical methods.
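A standard ingredient of Hamiltonian simulation circuits is Trotterization: approximating exp(-iHt) by alternating short evolutions under the terms of H. This sketch checks the first-order Trotter error for H = X + Z against the closed-form evolution; the 2x2 list-of-lists matrices and function names are a toy setup for illustration only.

```python
import cmath
import math

# Sketch: first-order Trotterization for H = X + Z, compared to the exact
# evolution exp(-iHt) = cos(rt) I - i sin(rt) (X+Z)/r with r = sqrt(2).

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def exp_x(dt):  # exp(-i * X * dt)
    c, s = math.cos(dt), math.sin(dt)
    return [[c, -1j * s], [-1j * s, c]]

def exp_z(dt):  # exp(-i * Z * dt)
    return [[cmath.exp(-1j * dt), 0], [0, cmath.exp(1j * dt)]]

def trotter(t, steps):
    dt = t / steps
    u = [[1, 0], [0, 1]]
    for _ in range(steps):
        u = matmul(exp_z(dt), matmul(exp_x(dt), u))
    return u

def exact(t):
    r = math.sqrt(2)
    c, s = math.cos(r * t), math.sin(r * t)
    return [[c - 1j * s / r, -1j * s / r], [-1j * s / r, c + 1j * s / r]]

def error(t, steps):
    u, v = trotter(t, steps), exact(t)
    return max(abs(u[i][j] - v[i][j]) for i in range(2) for j in range(2))

# First-order Trotter error shrinks as the step size shrinks:
print(error(1.0, 10) > error(1.0, 100) > error(1.0, 1000))  # True
```

On hardware, deeper Trotterizations are more accurate in principle but accumulate more gate error, so this benchmark probes exactly the tradeoff between algorithmic and physical error.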
GHZ: This benchmark measures how well a machine generates entanglement between qubits. This capability is one of the most important tasks in quantum computing and is key for domains such as quantum communication networks, signal sensing, and encryption.
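An n-qubit GHZ state should, ideally, yield the all-zeros and all-ones bitstrings with probability 1/2 each. One way to turn that into a score, used here as an illustrative stand-in for SupermarQ's exact scoring, is the Hellinger fidelity between the measured and ideal distributions.

```python
import math

# Sketch: scoring a GHZ benchmark as the Hellinger fidelity between the
# measured distribution and the ideal {|00...0>: 0.5, |11...1>: 0.5}.

def hellinger_fidelity(counts: dict[str, int], ideal: dict[str, float]) -> float:
    """(sum over outcomes of sqrt(p_measured * p_ideal)) squared."""
    total = sum(counts.values())
    return sum(math.sqrt((counts.get(b, 0) / total) * p)
               for b, p in ideal.items()) ** 2

ideal = {"000": 0.5, "111": 0.5}
counts = {"000": 480, "111": 460, "010": 40, "101": 20}  # hypothetical shots
print(round(hellinger_fidelity(counts, ideal), 3))  # 0.94
```

A perfect device would score 1.0; shots landing on any other bitstring (here "010" and "101") pull the fidelity down.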
Mermin-Bell: This benchmark indicates the degree of control the machine has over its quantum properties of superposition and entanglement. This test is an example of a Bell inequality test, which has been used at a small scale to demonstrate the quantumness of nature.
Error Correction: This benchmark indicates how much farther machines have to go on the road to fault-tolerance. Error correcting codes (ECCs) are the means by which Fault Tolerant (FT) quantum computers will be able to execute arbitrarily long programs.
Program profiling is a standard technique in classical computing, and the objective of SupermarQ is to do the same for quantum programs by introducing a set of hardware-agnostic feature vectors. The features describe how each of the benchmarks will stress the processor and to what degree, based on hardware-agnostic quantities that are related to the application’s resource requirements. For example, Program Communication (PC) measures the average degree of an application’s communication graph, and the Entanglement-Ratio (Ent) is given by the proportion of all gates in the circuit which are two-qubit entangling gates. These feature vectors are used to quantify the coverage of the selected benchmark applications. More information on these feature vectors, including their specific definitions, can be found in our paper.
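The two named features can be sketched directly from the definitions given above: Program Communication as the average degree of the qubit interaction graph normalized by the maximum possible degree (n - 1), and Entanglement-Ratio as the fraction of gates that act on two qubits. The gate-list circuit representation here is illustrative, not SupermarQ's actual API; see the paper for the precise definitions.

```python
# Sketch: two of SupermarQ's hardware-agnostic features, computed from a
# circuit given as a list of gates over integer qubit indices.

def program_communication(gates: list[tuple[int, ...]], n_qubits: int) -> float:
    """Average degree of the qubit interaction graph, normalized by n - 1."""
    neighbors = {q: set() for q in range(n_qubits)}
    for gate in gates:
        if len(gate) == 2:
            a, b = gate
            neighbors[a].add(b)
            neighbors[b].add(a)
    avg_degree = sum(len(v) for v in neighbors.values()) / n_qubits
    return avg_degree / (n_qubits - 1)

def entanglement_ratio(gates: list[tuple[int, ...]]) -> float:
    """Fraction of gates that are two-qubit (entangling) gates."""
    return sum(len(g) == 2 for g in gates) / len(gates)

# A 3-qubit GHZ circuit: H on qubit 0, then CNOT(0,1) and CNOT(1,2).
ghz = [(0,), (0, 1), (1, 2)]
print(program_communication(ghz, 3))  # 2/3: only qubit 1 touches both others
print(entanglement_ratio(ghz))        # 2/3: two of the three gates entangle
```

Stacking several such features into a vector per benchmark is what lets the suite visualize how thoroughly its applications cover the space of resource demands.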
To easily evaluate these benchmarks across a range of different QPUs, we use Amazon Braket, which allows SupermarQ to simultaneously submit the same quantum circuit instance to multiple available devices via the cloud. Moreover, Amazon Braket enables results from these backends to be accessed via a common interface, streamlining the process for collecting benchmark results.
Benchmarking for all
Interested in evaluating devices yourself? We have provided the source code used to generate, evaluate, and compute the results of this benchmark suite in a GitHub repository. Open-sourcing SupermarQ enables community contributions of additional benchmarks to keep pace with emerging applications.
Everyone should have the ability to assess and evaluate quantum hardware in the way that best suits them. That is why SupermarQ is specified at the level of OpenQASM, a popular intermediate representation for quantum circuits supported by Amazon Braket. Adhering to the principle of full-system evaluations, SupermarQ uses optimizations that are publicly available for the average quantum programmer. These include the transpilation of OpenQASM to native gates, noise-aware qubit mapping, SWAP insertions, reordering of commuting gates, and cancellation of adjacent gates. The optimizations included are those that are automatically applied when using cloud-based platforms like Amazon Braket. This matches the level of optimization that would be available to the typical user.
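One of the optimizations listed above, cancellation of adjacent gates, is simple enough to sketch: two identical self-inverse gates in a row (such as back-to-back CNOTs) multiply to the identity and can be dropped. The circuit representation and function name are illustrative, not the transpiler's actual implementation.

```python
# Sketch: peephole cancellation of adjacent, identical self-inverse gates.
# Circuits are lists of (gate_name, qubit_tuple) pairs.

SELF_INVERSE = {"h", "x", "z", "cx"}

def cancel_adjacent(gates):
    out = []
    for g in gates:
        if out and out[-1] == g and g[0] in SELF_INVERSE:
            out.pop()  # two identical self-inverse gates cancel to identity
        else:
            out.append(g)
    return out

circuit = [("h", (0,)), ("cx", (0, 1)), ("cx", (0, 1)), ("h", (0,))]
print(cancel_adjacent(circuit))  # [] -- the whole circuit reduces to identity
```

Note how cancellations cascade: once the CNOT pair is removed, the two H gates become adjacent and cancel too. This is why benchmarking at the OpenQASM level, with such passes enabled, measures the full system rather than the raw hardware alone.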
Periodically collecting new benchmark data is a practical concern for any quantum benchmark suite. SupermarQ is built on top of a write-once-target-all SaaS platform, SuperstaQ, which was designed with this specific purpose in mind and built using Amazon Lightsail. With SuperstaQ, you are able to specify the OpenQASM for a single application and execute it on multiple backends.
The SupermarQ approach leads to an important takeaway: matching the device architecture to the use case is key. Hardware design features can make certain machines naturally better or worse at a given operation, and therefore at entire algorithms. Solving an optimization task with a quantum computer requires a different configuration than simulating the time evolution of a system. Users in domains such as finance or logistics would care more about a device's results on the QAOA benchmark, whereas a pharmaceutical developer would prioritize performance on the Hamiltonian Simulation benchmark.
This architecture-to-application relationship is especially evident in the differing results of superconducting vs. trapped ion machines. For instance, trapped ion qubits have very long coherence times, making them more robust to mid-circuit measurement – a key requirement for error correction. However, the roughly 1000x faster gate speeds of superconducting devices make them preferable for variational benchmarks like QAOA, which can require millions of sequential circuit executions.
The benchmark results and feature maps reveal the variety of tradeoffs that are available to QC system designers and indicate that competitive advantages can be found by focusing on applications which play to a system’s strengths (e.g., faster gate speeds, higher fidelities, denser connectivity). For example, ion trap devices are able to make up for slower operation speeds with better connectivity, while the superconducting systems with sparser connectivities are still competitive due to their much faster operation times.
Specific attributes can make certain machines better at particular tasks than others, but no machine is superior in absolute terms for the time being. Users should therefore choose a device based on the problem they need to solve instead of an arbitrary rating. Though our results reflect the limitations of the machines we tested in the NISQ era, the SupermarQ methodology helps rein in hype while providing actionable insights for practitioners.
The emergence of quantum computers as a new computational paradigm has been accompanied by speculation concerning the scope and timeline of their anticipated revolutionary changes. While quantum computing is still in its infancy, the variety of architectures used to implement quantum computations makes it difficult to reliably measure and compare performance. This problem motivates our introduction of SupermarQ, a scalable, hardware-agnostic quantum benchmark suite which uses application-level metrics to measure performance. SupermarQ was created by systematically applying techniques from classical benchmarking methodology to the quantum domain. SupermarQ defines a set of feature vectors to quantify coverage, selects applications from a variety of domains to ensure the suite is representative of real workloads, and collects benchmark results by accessing QPUs through Amazon Braket. Use the benchmarks generated by SupermarQ to choose the right QPU for your application by visiting the SupermarQ website, and get more details on the benchmark suite by reading the whitepaper.
Dongarra, J. J., Luszczek, P., and Petitet, A. The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience, 15(9):803–820, 2003.
Henning, J. L. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006.
Bienia, C., Kumar, S., Singh, J. P., and Li, K. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, 2008.
Mattson, P., Reddi, V. J., Cheng, C., Coleman, C., Diamos, G., Kanter, D., Micikevicius, P., Patterson, D., Schmuelling, G., Tang, H., et al. MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance. IEEE Micro, 40(2):8–16, 2020.
Arute, F., Arya, K., Babbush, R., et al. Quantum Supremacy using a Programmable Superconducting Processor. Nature, 574:505–510, 2019.
Preskill, J. Quantum Computing in the NISQ Era and Beyond. Quantum, 2:79, 2018.
Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach. Elsevier, 2011.
Murali, P., Linke, N. M., Martonosi, M., Abhari, A. J., Nguyen, N. H., and Alderete, C. H. Full-stack, Real-system Quantum Computer Studies: Architectural Comparisons and Design Insights. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pages 527–540. IEEE, 2019.
Blinov, S., Wu, B., and Monroe, C. Comparison of cloud-based ion trap and superconducting quantum computer architectures. arXiv preprint arXiv:2102.00371, 2021.
Mooney, G. J., Hill, C. D., and Hollenberg, L. C. L. Entanglement in a 20-qubit superconducting quantum computer. Scientific Reports, 9(1):1–8, 2019.