AWS Quantum Technologies Blog

Accelerate hybrid quantum-classical algorithms on Amazon Braket using embedded simulators from Xanadu’s PennyLane featuring NVIDIA cuQuantum

In 2021, Amazon Braket, the fully managed quantum computing service from AWS, launched Amazon Braket Hybrid Jobs to provide customers with a convenient way to run hybrid algorithms without worrying about managing the underlying infrastructure. With features such as priority access to quantum processing units (QPUs), Amazon Braket Hybrid Jobs is designed for lower latency and faster runtimes for applications such as the variational quantum eigensolver (VQE), the quantum approximate optimization algorithm (QAOA), and quantum machine learning (QML). Hybrid algorithms that combine quantum and classical resources are particularly important because today’s quantum computers are highly sensitive to errors, which limits the size of useful computations that can be performed on these devices.

At this early stage of the technology, often dubbed the noisy intermediate-scale quantum (NISQ) era, researchers exploring potential future applications of quantum computing are rapidly turning to this hybrid approach. In this technique, inspired by machine learning, a quantum algorithm iteratively “learns” by adjusting its circuit parameters via a classical optimization loop, minimizing the detrimental impact of errors on the computation.
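To make this loop concrete, here is a minimal, self-contained sketch (plain NumPy, all names illustrative): the “quantum” step is a single-qubit RY rotation simulated classically, for which the Pauli-Z expectation value of RY(θ)|0⟩ is cos(θ), and the classical optimizer is gradient descent using the parameter-shift rule.

```python
import numpy as np

# "Quantum" step: expectation value of Pauli-Z after RY(theta) on |0>,
# which equals cos(theta) and is simulated here in closed form.
def expval_z(theta):
    return np.cos(theta)

# Classical optimization loop: gradient descent, with the gradient
# estimated by the parameter-shift rule (two extra circuit evaluations).
theta, lr = 0.1, 0.4
for _ in range(100):
    grad = (expval_z(theta + np.pi / 2) - expval_z(theta - np.pi / 2)) / 2
    theta -= lr * grad

print(round(expval_z(theta), 3))  # -> -1.0, the minimum of <Z>
```

Each pass through the loop plays the role of one “iteration” of a hybrid algorithm: circuits are executed to estimate gradients, and a classical optimizer updates the parameters.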

Quantum circuit simulators, which can simulate quantum devices in an idealized, noise-free setting, offer an alternative for customers interested in investigating the convergence of hybrid algorithms, innovating better compilation techniques, and identifying promising architectures to achieve the best performance in the NISQ era.

A fundamental challenge with this approach, however, is that as researchers explore more advanced algorithms that require hundreds of thousands of circuit iterations to achieve desired performance, algorithm runtimes can often stretch to hours. While simulators offer the benefit of being always available, customers need a way to experiment and iterate even faster.

Today, we are excited to announce that Amazon Braket Hybrid Jobs now supports five high-performance embedded circuit simulators, which run in the same container as your application code. As part of this launch, we support the high-performance lightning.qubit and lightning.gpu simulators from PennyLane, the latter powered by the NVIDIA cuQuantum SDK, with state-of-the-art features such as native GPU support for adjoint differentiation. These methods have significantly lower memory usage and can reduce the number of circuit executions required for your variational algorithm to converge by as much as 90%. Fewer circuit executions mean fewer round trips between your algorithm code and the simulator, and hence faster runtimes.
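To see why adjoint differentiation saves circuit executions, here is a toy single-qubit illustration (plain NumPy, not the PennyLane implementation): a chain of RY(θᵢ) rotations on |0⟩ measured in Pauli-Z. The adjoint method recovers every gradient from one forward pass plus one backward sweep, instead of the two evaluations per parameter that the parameter-shift rule requires.

```python
import numpy as np

def ry(t):   # RY gate matrix
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

def dry(t):  # derivative of RY with respect to its angle
    return 0.5 * np.array([[-np.sin(t / 2), -np.cos(t / 2)],
                           [ np.cos(t / 2), -np.sin(t / 2)]])

Z = np.diag([1.0, -1.0])  # Pauli-Z observable

def adjoint_grad(thetas):
    """All gradients of <Z> from ONE forward pass plus one backward sweep,
    versus the 2 * len(thetas) evaluations the parameter-shift rule needs."""
    psi = np.array([1.0, 0.0])
    for t in thetas:                  # single forward pass
        psi = ry(t) @ psi
    lam = Z @ psi                     # apply the observable once
    grads = np.zeros(len(thetas))
    for i in reversed(range(len(thetas))):
        psi = ry(thetas[i]).T @ psi   # undo gate i (RY is real orthogonal)
        grads[i] = 2 * np.real(lam @ (dry(thetas[i]) @ psi))
        lam = ry(thetas[i]).T @ lam   # carry the observable past gate i
    return grads

# For an RY chain, <Z> = cos(sum of angles), so every gradient is -sin(sum).
print(adjoint_grad(np.array([0.3, 0.5, 0.2])))  # each entry ≈ -sin(1.0) ≈ -0.841
```

The key point is the cost model: the work above scales with the circuit depth, not with twice the number of parameters, which is what makes adjoint differentiation attractive for variational algorithms with many trainable parameters.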

Furthermore, by allowing you to scale out your computation across several CPU or GPU instances, you can seamlessly take advantage of the elasticity of the cloud to run your algorithms, and in particular accelerate your simulations, while still paying only for what you use. For certain problems such as QML, PennyLane’s lightning.gpu simulator on NVIDIA GPUs can reduce algorithm runtimes by up to 100x compared to running on your local laptop. Figure 1 shows the user workflow with Amazon Braket Hybrid Jobs with QPUs, on-demand simulators, and the newly launched embedded simulators.

Figure 1. Amazon Braket Hybrid Jobs architecture: You can use the Amazon Braket console, a notebook instance, or a local integrated development environment (IDE) to call the Braket Hybrid Jobs API. Amazon Braket Hybrid Jobs spins up a Jobs container which hosts your algorithm code. In the top flow, the Jobs container then communicates with an on-demand simulator (SV1, DM1, TN1) or a QPU via a Braket API call. In the bottom flow, the simulator is embedded directly within the Jobs container, and can be distributed across multiple CPU or GPU instances to accelerate the job. Job results are stored in Amazon S3, while logs and metrics can be accessed via Amazon CloudWatch.

Embedded simulation with Amazon Braket Hybrid Jobs

Amazon Braket, NVIDIA, and PennyLane share the vision that accelerating research in quantum algorithms means providing customers with tools to simulate advanced algorithms faster, which requires a relentless focus on removing bottlenecks that can slow down algorithm convergence.

In March, NVIDIA released cuQuantum, a set of libraries for accelerating quantum circuit simulations on NVIDIA Tensor Core GPUs, and in collaboration with Xanadu, announced a new GPU-accelerated simulator for PennyLane called lightning.gpu. Under lightning.gpu’s hood is cuStateVec, cuQuantum’s state vector simulation library, which allows the flexible simulation of deep and highly entangled quantum circuits. Coupled with PennyLane’s lightning.gpu, and accessible through Amazon Braket, cuQuantum accelerates computationally intensive simulation workflows in quantum chemistry, quantum machine learning, and more, by orders of magnitude compared to CPU simulation. By seamlessly integrating machine learning frameworks in PennyLane with performant quantum circuit simulations on GPUs via cuQuantum, you can now experience high performance in quantum applications research.

A key benefit of the pay-for-what-you-use model is that you have the freedom to accelerate your workloads even more by taking advantage of distributed computing. This is particularly crucial for algorithms such as QML, where GPUs can compute gradients in parallel. With this launch, you only need to bring your algorithm code. You can now seamlessly parallelize your hybrid algorithms across multiple GPU instances such as p3.16xlarge by changing only a few lines of code. Amazon Braket handles the heavy lifting of setting up the distributed environment and managing node-to-node communication, speeding up the overall algorithm runtime by executing multiple circuits in parallel.
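As a sketch of what those few lines look like with the Amazon Braket SDK (the module name, entry point, and hyperparameters below are placeholders, and exact keyword arguments may vary across SDK versions), a data-parallel hybrid job using the embedded GPU simulator could be created roughly like this:

```python
from braket.aws import AwsQuantumJob
from braket.jobs.config import InstanceConfig

# Illustrative only: "qml_script" and its hyperparameters are placeholders
# for your own algorithm code and settings.
job = AwsQuantumJob.create(
    device="local:pennylane/lightning.gpu",  # embedded simulator in the Jobs container
    source_module="qml_script",
    entry_point="qml_script:train",
    hyperparameters={"n_qubits": "24", "epochs": "5"},
    instance_config=InstanceConfig(
        instanceType="ml.p3.16xlarge",       # 8 NVIDIA V100 GPUs per instance
        instanceCount=4,                     # scale out across four instances
    ),
    distribution="data_parallel",            # Braket sets up the distributed environment
)
```

Because this is a job-configuration fragment that submits work to AWS, treat it as a template to adapt rather than a drop-in script.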

Accelerate a QML workload with cuQuantum and PennyLane

To see the benefits of embedded simulators in action, let’s consider an example from QML, which is an active area of research today. Traditional machine learning is a two-step process where a model first learns the best set of parameters that fit the input data (training), and then makes decisions on unseen data using those parameters (inference). In QML, data is often encoded as amplitudes of a quantum state, and quantum circuits play the role of the model. While quantum circuits of n qubits can speed up data processing in principle by encoding 2^n amplitudes, the errors prevalent in NISQ-era devices limit researchers’ ability to explore advanced hybrid algorithms. With this launch, Amazon Braket enables customers to experiment with advanced QML algorithms through embedded simulators.
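To make the amplitude encoding concrete, here is a minimal sketch (plain NumPy, illustrative only) of how a classical feature vector can be packed into the 2^n amplitudes of an n-qubit state vector:

```python
import numpy as np

def amplitude_encode(features, n_qubits):
    """Pad a feature vector to length 2**n_qubits and normalize it,
    so it forms a valid (unit-norm) quantum state vector."""
    dim = 2 ** n_qubits
    if len(features) > dim:
        raise ValueError("need more qubits to hold this many features")
    state = np.zeros(dim)
    state[:len(features)] = features
    return state / np.linalg.norm(state)

# For example, 60 features fit into 6 qubits (2**6 = 64 amplitudes).
state = amplitude_encode(np.random.rand(60), n_qubits=6)
print(len(state), round(np.linalg.norm(state), 6))  # -> 64 1.0
```

The exponential relationship is the point: each additional qubit doubles the number of amplitudes available to hold data, which is why even modest qubit counts can represent large feature vectors.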

Let’s see this in action by considering the QML approach to binary classification, a canonical use case for machine learning spanning problems from spam email identification to credit card fraud detection. The dataset used for this demonstration is the SONAR classification dataset from the UCI repository of benchmark machine learning datasets. It consists of 208 rows with 60 features each and is modeled using a quantum neural network (QNN). The QNN is modeled as a quantum circuit (analogous to a hidden layer in classical machine learning) sandwiched between a dense classical input layer and a dense classical output layer. Running this algorithm on 24–26 qubits would require evaluating nearly 200K circuits per iteration, or epoch, using a method such as the parameter-shift rule. By contrast, the adjoint method available with the lightning simulators reduces this number to a few hundred circuits.
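The rough circuit counts quoted above can be reproduced with back-of-the-envelope arithmetic; the parameter count below is an assumption chosen to match the article’s figures, not a value from the actual experiment, while 208 is the SONAR row count.

```python
# Parameter-shift rule: two circuit evaluations per trainable parameter, per sample.
# Adjoint method: roughly one forward-plus-backward simulation per sample.
n_params = 500   # assumed number of trainable circuit parameters (illustrative)
n_samples = 208  # rows in the SONAR dataset

parameter_shift_circuits = 2 * n_params * n_samples  # 208,000 ~ "nearly 200K"
adjoint_circuits = n_samples                         # 208 ~ "a few hundred"
print(parameter_shift_circuits, adjoint_circuits)    # -> 208000 208
```

Under these assumptions the gap is three orders of magnitude, which is why the choice of differentiation method dominates the training cost for parameter-heavy variational models.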

Figure 2. Total training time per iteration/epoch (left panel) and cost per epoch (right panel) as a function of qubit count. Training was performed using c5.18xlarge CPU instances and p3.16xlarge instances, each with 8 GPUs. For parallel processing (labeled DP), 1 and 4 p3.16xlarge instances were used.

As shown in the left panel of Figure 2 (dotted red line), even with the performance benefits of the adjoint method, algorithm runtimes can stretch to several hours on a single instance as the qubit count increases. By leveraging the lightning.gpu simulator accelerated with the cuQuantum SDK (dashed-dotted line), you can speed up training by 10x. These results were obtained using a single p3.16xlarge instance, which is powered by NVIDIA V100 Tensor Core GPUs to deliver high throughput and low latency. By lowering algorithm runtimes, these GPU instances deliver improved price-performance compared to CPUs, allowing you to push the boundaries of research on hybrid algorithms at a cost of only a few US dollars. Furthermore, as the dashed and solid black lines show, Amazon Braket’s support for executing circuits in parallel allows you to take full advantage of multiple GPUs within a single instance, or across multiple instances, to reduce algorithm runtimes by another factor of 10 with little to no extra cost.


Research into hybrid quantum-classical algorithms is an important paradigm in the NISQ era of quantum computing, where real-world applications of quantum computers are being explored. With the launch of embedded simulators for Amazon Braket Hybrid Jobs, you can continue to push the boundaries of research and experiment more rapidly with hybrid quantum-classical algorithms by using performance-optimized, software-based simulators from PennyLane powered by NVIDIA. To reproduce the results above, refer to this blog and the example notebook in the Amazon Braket GitHub repository. To learn more about how to get started with running your hybrid quantum-classical workloads with embedded simulators on Amazon Braket, please refer to our documentation.