AWS Database Blog

Use AWS Nitro Enclaves to build Cubist CubeSigner, a secure and highly reliable key management platform for Ethereum validators and beyond

Validators are the fundamental building blocks of proof-of-stake (PoS) blockchain protocols like Ethereum. They maintain the history of the chain and run the consensus protocol that makes it possible to implement complex decentralized applications, from decentralized finance applications to NFT collectibles. To join the protocol, validators provide assets as collateral, which ensures they behave correctly in driving consensus. Unfortunately, because the protocol doesn’t distinguish between malicious validators and honest mistakes, operational mistakes and bugs in validator client software have put, and continue to put, assets at risk. This results in small economic penalties in the best case scenario and total loss of funds in the worst case.

In this post, we describe the high-level architecture of a secure key management platform designed to protect Ethereum validator operators from insider threats, validator compromise, operational mistakes, and client bugs. We cover the following topics:

  • The challenges of running Ethereum validators and the associated security and slashing risks.
  • The idea of tackling these challenges using a remote key management solution and the underlying goals a key manager must be built around to meaningfully reduce risk.
  • The high-level design of Cubist’s CubeSigner, a secure key management platform built on top of AWS Nitro Enclaves. CubeSigner supports staking and other applications, from end-user wallets to trading to key management for teams and operations.

Operational traps and pitfalls when running Ethereum validators

The Ethereum PoS protocol is a consensus protocol run across hundreds of thousands of validators. Every few seconds, validators run Ethereum transactions and, if the results of running these transactions agree with state changes proposed by a randomly chosen validator, they add these transactions to the blockchain and attest to this by signing validation messages. To incentivize validators to honestly report the state of the chain, the protocol requires them to put 32 ETH at stake. If validators behave as expected (for example, by not manipulating transactions or lying about the state of the chain), they earn a reward on their staked assets. If they don’t, they lose some of this stake.

If validators are caught manipulating the state of the chain, they are penalized for putting the network’s safety at risk through slashing. Getting slashed means losing a significant amount of stake and being ejected from the protocol. To a lesser degree, validators are also punished if they go offline or sign very slowly, because this puts the protocol’s availability at risk.

Unfortunately, validator mistakes are indistinguishable from malicious behavior. If a validator provides two conflicting reports on the state of the blockchain, it’s impossible to tell if the conflict resulted from intentional misbehavior or a bug in the validator software. As a result, operators who make security, correctness, or performance mistakes can suffer direct financial loss due to slashing. There are serious challenges in preventing each kind of mistake:

  • Security – A single validator staking key typically protects 32 ETH (at the time of writing, roughly $80,000 USD). However, these valuable validator staking keys can’t be kept in cold storage because they must sign validation messages every few minutes. As a consequence, node operators have to protect hot keys from system breaches. This is especially hard considering the complex environment of open source validator clients, the operational requirements to keep nodes alive 24/7, and the potential of insider threat vectors like rogue operators and social engineering.
    • Our goal – Create a key manager that prevents attackers or malicious insiders from stealing or misusing secret signing keys.
  • Correctness – Validators that sign two different, conflicting messages are deemed dishonest and are slashed, even if the conflicting signatures were a genuine mistake. In the past, honest operators have been slashed because of buggy validator client software, errors migrating validators across machines, and mistakes updating validator software.
  • Availability – Validators that don’t sign or sign too slowly (for example, don’t respond within the 12-second slot time) get reduced financial rewards or don’t get paid at all. As a result, node operators design their solutions to be highly available and reliable, and to respond with low latency (typically setting an upper bound of 500 milliseconds for signing).
    • Our goal – Create a key manager that helps operators achieve these latency, reliability, and availability goals.

Remote signing reduces risk due to bugs and operational mistakes

To participate in the Ethereum PoS Beacon chain, node operators use a validator client like Lighthouse or Prysm. The validator client contains any number of validator key pairs. Each pair corresponds to a validator: the public key is the validator’s identity, and the secret key is what the validator client uses to sign messages on behalf of the validator. The validator client then broadcasts these signed messages to the Beacon chain network, essentially asserting the validator’s view of the state of the chain.

Validator clients can store validator keys on their machines and sign locally, or they can use a remote signer. A remote signer stores keys remotely and exposes a signing API to the validator client. At each epoch, the validator client makes a signing request to the remote signer, and the signer sends back a signed validation message. Although network latency makes remote signing slower than local signing due to the additional network round trip, operators use remote signers for many reasons, including:

  • Security – By using a remote signer, the validator keys can be isolated from:
    • The validator client – This interfaces with the untrusted world—other Beacon chain nodes—and runs untrusted Ethereum Virtual Machine (EVM) code.
    • Insider threats – These include the operators keeping the nodes up and running, who don’t need access to the validator secret keys to do their jobs.
  • Correctness – Using a remote signer makes it possible to implement a reference monitor, by coupling the signer with an anti-slashing database, that ensures that a key never signs messages that could result in slashing.
  • Availability – Using a remote signer that handles all validator keys makes it possible for node operators to automatically scale the validator clients and spin up new ones when they crash.

The remote signer must be designed with these properties in mind—simply using an off-the-shelf signer like Web3Signer doesn’t give you these properties. For example, the remote signer must restrict who has access to keys—this means restricting access to only authorized users (and validator clients) and sealing the keys to the signer itself—and what they can do with the keys. Otherwise, a breach or insider threat can turn into key theft. The remote signer must itself be highly available and fault-tolerant, and it must use a global, highly available, anti-slashing database; otherwise, signer or OS updates, machine failures, or network interruptions could result in slashing events. Finally, although the remote signer doesn’t need to be as fast as local (unsafe) signing, the latency of the remote signer does matter: if it’s too slow (typically anything over a second), the validator will be late signing messages and lose the operator money.

In the following sections, we explain in detail how we achieved the goals of security, correctness, and availability by implementing our Nitro Enclaves-based CubeSigner remote signing solution.

Cubist CubeSigner – Architecture overview

To address the security, correctness, and availability challenges from the previous sections, we built CubeSigner, a new remote signing service that builds on AWS Key Management Service (AWS KMS), Amazon DynamoDB, and Nitro Enclaves.

At a high level, to process a validation message from a validator client, the signer completes the following steps:

  1. Accept the validation message via an HTTP interface based on Amazon API Gateway.
  2. Check whether that message is slashable using a signing history stored in DynamoDB.
  3. Sign the message, as long as it’s non-slashable, using keys secured by AWS KMS and Nitro Enclaves.

The following diagram illustrates the solution architecture.

In the following sections, we walk through these three steps in detail.

Authenticated, high-level signing API

The signer needs to expose an API that validator clients like Lighthouse and Prysm can use out of the box. Unlike the AWS KMS API, which exposes a simple but low-level signing endpoint, this API must handle high-level validator messages specific to the Ethereum Beacon chain. This high-level API is essential because it can enforce an anti-slashing policy on validator messages based on their contents. If the API instead ingested and signed raw hashes of messages, it would have no way to determine whether a given message was slashable.

In addition to preventing slashing, this high-level API also needs to ensure that validation requests are from authenticated validator clients—otherwise, compromising a single client would mean compromising all validator clients and all keys. CubeSigner goes a step further, using authentication with fine-grained scopes. A scope restricts an authenticated session so that it can only perform certain actions (for example, validation). By only giving validator clients the validation scope, node operators can prevent those validators from, for example, signing exit messages and exiting the validation protocol. This stops operators from accidentally exiting their validators, and protects them from attackers trying to exit on their behalf (costing rewards).
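
A scope check of this kind can be sketched in a few lines of Python. The scope names (`sign:validation`, `sign:exit`), the request kinds, and the `Session` type are hypothetical and chosen only for illustration; this post does not describe CubeSigner's actual scope vocabulary.

```python
from dataclasses import dataclass

# Hypothetical scope names; the real scope vocabulary is not given in this post.
SCOPE_VALIDATION = "sign:validation"
SCOPE_EXIT = "sign:exit"

# Which scope each kind of signing request requires (illustrative mapping).
REQUIRED_SCOPE = {
    "attestation": SCOPE_VALIDATION,
    "block": SCOPE_VALIDATION,
    "voluntary_exit": SCOPE_EXIT,
}


@dataclass(frozen=True)
class Session:
    """An authenticated session and the scopes it was granted."""
    client_id: str
    scopes: frozenset


def is_allowed(session: Session, request_kind: str) -> bool:
    """Return True only if the session holds the scope the request needs."""
    required = REQUIRED_SCOPE.get(request_kind)
    # Unknown request kinds are denied by default.
    return required is not None and required in session.scopes


# A validator client that was only granted the validation scope.
validator_session = Session("validator-client-1", frozenset({SCOPE_VALIDATION}))
print(is_allowed(validator_session, "attestation"))     # True
print(is_allowed(validator_session, "voluntary_exit"))  # False
```

Denying unknown request kinds by default means that adding a new message type never silently widens what an existing session can sign.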

CubeSigner uses API Gateway with AWS Lambda authorizers to implement a custom authorization scheme that determines the caller’s identity. Each authorized signing request is then sent to CubeSigner’s Amazon Elastic Compute Cloud (Amazon EC2) instances, which first check that the authorized scopes allow the request. Then, the signer computes the message signing root hash, applies the anti-slashing policy to ensure that signing the message (hash) won’t result in a slashable offense, and signs it, returning the signature back to the validator client. Next, we discuss anti-slashing and signing itself.
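
For reference, the Ethereum consensus specification defines the signing root as the SSZ hash tree root of a `SigningData` container that holds the message's own hash tree root and a 32-byte signing domain. Because that container has exactly two 32-byte fields, its root is a single SHA-256 hash over their concatenation. The sketch below assumes the caller has already computed the message's hash tree root (full SSZ Merkleization is out of scope), and the input values are placeholders rather than real chain data.

```python
import hashlib


def compute_signing_root(object_root: bytes, domain: bytes) -> bytes:
    """Hash a 32-byte object root together with a 32-byte signing domain."""
    if len(object_root) != 32 or len(domain) != 32:
        raise ValueError("object_root and domain must each be 32 bytes")
    # SigningData has exactly two 32-byte fields, so its Merkle root is a
    # single SHA-256 over their concatenation.
    return hashlib.sha256(object_root + domain).digest()


# Placeholder inputs, not real chain data.
object_root = bytes(32)
domain = bytes([1]) + bytes(31)
root = compute_signing_root(object_root, domain)
print(root.hex())
```

Because the signer derives this hash itself from the high-level message, a client cannot ask it to sign a hash whose contents the policy engine has not inspected.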

Anti-slashing

CubeSigner implements anti-slashing in the second tier, the policy engine. After authorizing a signing request, CubeSigner applies several policies, including built-in anti-slashing for Ethereum. The anti-slashing policy follows EIP-3076 and ensures that a validator never signs messages that would disagree with the chain history that the validator has already committed to, which means that the policy is stateful: it must save a record of what each key has signed. CubeSigner enforces this policy using DynamoDB as its database. It atomically and conditionally records the messages it signs for every validator key, ensuring, for example, that the signer only produces a single signature for any particular slot number.
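
The "check and record in one atomic step" idea can be illustrated with a small in-memory sketch. Here a lock stands in for the database's conditional write, and the class and method names are hypothetical. The rule shown is a conservative high-watermark simplification (only sign strictly newer blocks and attestations); it is stricter than what slashing protection minimally requires and is not presented as CubeSigner's actual policy.

```python
import threading


class SigningHistory:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_block_slot = {}   # validator pubkey -> highest signed slot
        self._last_attestation = {}  # validator pubkey -> (source, target) epochs

    def try_record_block(self, pubkey: str, slot: int) -> bool:
        """Record the slot only if it is strictly higher than any signed before."""
        with self._lock:  # the check and the write happen as one atomic step
            last = self._last_block_slot.get(pubkey)
            if last is not None and slot <= last:
                return False
            self._last_block_slot[pubkey] = slot
            return True

    def try_record_attestation(self, pubkey: str, source: int, target: int) -> bool:
        """Record the vote only if it cannot double-vote or surround an earlier one."""
        with self._lock:
            last = self._last_attestation.get(pubkey)
            if last is not None:
                last_source, last_target = last
                if source < last_source or target <= last_target:
                    return False
            self._last_attestation[pubkey] = (source, target)
            return True


history = SigningHistory()
print(history.try_record_block("0xabc", 100))           # True: first block for this key
print(history.try_record_block("0xabc", 100))           # False: same slot again
print(history.try_record_attestation("0xabc", 10, 11))  # True
print(history.try_record_attestation("0xabc", 9, 12))   # False: would surround (10, 11)
```

The signer should produce a signature only when the record call returns `True`, so two concurrent requests for the same slot can never both succeed.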

DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data import and export tools. Availability and consistent performance are particularly critical when running blockchain validators because downtime, latencies, or data inconsistencies can directly result in financial losses. Given these requirements, DynamoDB is well suited because of its consistent single-digit millisecond performance and up to 99.999% availability. For additional information on DynamoDB performance, costs, and scalability features, refer to Amazon DynamoDB features.

Safe signing

The third tier is the low-level signer itself—the CubeSigner virtual hardware security module (vHSM). This tier builds on AWS KMS to sign validation messages in HSMs, where keys are safe from attackers.

HSMs and AWS KMS

In security-sensitive business areas such as banking or domain certificate management, it’s common to outsource key management-related tasks (for example, key storage or signing) to HSMs. HSMs consist of purpose-built hardware that ties cryptographic secrets to physical hardware, which makes them very tough to steal. Key extraction attacks on standard HSMs, for example, require a laboratory with an electron microscope; otherwise, the HSM self-destruct circuit triggers, because it detects someone trying to steal keying material.

Using HSMs on their own is relatively difficult, however. Accessing them from within applications written in high-level languages typically requires a vendor-specific SDK, and the application itself needs to adhere to the PKCS #11 API standard. AWS provides an accessible, HSM-based key management service called AWS KMS. AWS KMS is a fully managed service that lets you create and control cryptographic keys protected by HSMs validated at FIPS 140-2 Security Level 3. Furthermore, it is accessible via the standard AWS SDK (for example, aws-sdk-kms for Rust).

Why not use AWS KMS on its own

Using AWS KMS alone is not enough to build a secure key manager for validators. The first problem is mechanical: Ethereum validators must sign using the Boneh-Lynn-Shacham (BLS) signature scheme, but HSMs don’t yet support BLS because of regulatory challenges aligning with the National Institute of Standards and Technology (NIST). HSMs primarily focus on providing cryptographic signature schemes that are more ubiquitous outside the blockchain industry, namely RSA and Elliptic Curve Digital Signature Algorithm (ECDSA). Extending support for newer signature schemes usually requires low-level firmware upgrades of the HSM itself.

Even if HSMs could produce BLS signatures, we would encounter a second set of problems: latency and throughput. Ethereum node operators run hundreds to thousands of validators, all of which must sign in under roughly a second (accounting for the actual cryptographic operation and network round trip). Most HSMs aren’t optimized for batching public key cryptography operations (for example, the compound latencies in AWS KMS for sending hundreds of ECDSA signing operations would exceed the one-second limit). Furthermore, HSMs are typically throughput limited (by default, AWS KMS can process 300 elliptic curve asymmetric operations per second). By combining AWS KMS HSMs with Nitro Enclaves, you can implement a vHSM that overcomes these problems while preserving security guarantees.
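
A back-of-the-envelope calculation shows why the default quota is a problem. It uses the 300 operations per second and one-second budget quoted above, and assumes, as a worst case, that every signature is requested in the same burst. The fleet sizes are illustrative.

```python
KMS_ASYMMETRIC_OPS_PER_SECOND = 300  # default quota quoted above
DEADLINE_SECONDS = 1.0               # rough signing budget quoted above


def seconds_to_sign(signatures: int, ops_per_second: int) -> float:
    """Lower bound on the time to produce `signatures` at a fixed rate."""
    return signatures / ops_per_second


for validators in (100, 1_000, 5_000):  # illustrative fleet sizes
    needed = seconds_to_sign(validators, KMS_ASYMMETRIC_OPS_PER_SECOND)
    verdict = "fits" if needed <= DEADLINE_SECONDS else "misses"
    print(f"{validators} signatures: at least {needed:.2f} s, {verdict} the deadline")
```

Under these assumptions, any burst of more than 300 signatures cannot finish within the budget, before even counting network round trips.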

Nitro Enclaves are separate, hardened, and highly constrained virtual machines. The AWS Nitro System, which is a combination of dedicated hardware and lightweight hypervisor, provides features such as strong isolation, hardware generated entropy via the Nitro Security Module, and cryptographic attestation—allowing you to verify the enclave’s identity and that only authorized code is running in the enclave.

How CubeSigner uses Nitro Enclaves with AWS KMS

The CubeSigner vHSM implements core signature schemes like BLS in software and runs this code in Nitro Enclaves. This lets it support arbitrary signature schemes and sign orders of magnitude faster than an HSM.

The vHSM encrypts all signing keys using an AWS KMS-based key wrapping key; whenever it gets a request to sign a transaction, it pulls the encrypted signing key and the key wrapping key into the enclave, decrypts the signing key, signs the transaction, and returns the signature to the user. Signing in the vHSM is fast enough for validating without penalty (that is, well under the 500 ms target, even accounting for network latency). This is because:
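
The data flow (unwrap the signing key, sign, return only the signature) can be sketched as follows. This is a toy illustration of the flow only: the SHA-256 counter-mode keystream is NOT a real cipher and merely stands in for the AWS KMS-backed key wrapping, HMAC stands in for BLS signing, and all function names are hypothetical.

```python
import hashlib
import hmac
import os


def _keystream(wrapping_key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode). NOT a real cipher."""
    out = b""
    counter = 0
    while len(out) < length:
        block = wrapping_key + nonce + counter.to_bytes(4, "big")
        out += hashlib.sha256(block).digest()
        counter += 1
    return out[:length]


def wrap(wrapping_key: bytes, signing_key: bytes) -> bytes:
    """Encrypt a signing key for storage (toy stand-in for key wrapping)."""
    nonce = os.urandom(16)
    stream = _keystream(wrapping_key, nonce, len(signing_key))
    return nonce + bytes(a ^ b for a, b in zip(signing_key, stream))


def unwrap(wrapping_key: bytes, wrapped: bytes) -> bytes:
    """Recover the signing key; in the real design this happens only in the enclave."""
    nonce, body = wrapped[:16], wrapped[16:]
    stream = _keystream(wrapping_key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def sign_in_enclave(wrapping_key: bytes, wrapped_key: bytes, message: bytes) -> bytes:
    """Unwrap the signing key, sign, and return only the signature."""
    signing_key = unwrap(wrapping_key, wrapped_key)
    # HMAC-SHA256 stands in for a BLS signature in this sketch.
    return hmac.new(signing_key, message, hashlib.sha256).digest()


wrapping_key = os.urandom(32)
stored = wrap(wrapping_key, os.urandom(32))  # what would sit in the database
signature = sign_in_enclave(wrapping_key, stored, b"validation message")
print(len(signature))  # 32
```

Only the wrapped key is ever stored outside the enclave, and the plaintext signing key exists only for the duration of one signing call.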

  • Signing a message requires a single signing key decryption, and HSMs are well-optimized for symmetric cryptographic operations. AWS KMS, for example, by default runs up to 50,000 combined symmetric encrypt/decrypt operations per second.
  • Nitro Enclaves can do asymmetric cryptographic operations (for example, produce the signature) at native speeds.

CubeSigner stores all encrypted signing keys—in our example, BLS validator keys, but in general keys for any signature scheme—in DynamoDB. This allows us to replicate keys, back up keys, and use DynamoDB for scale and performance.

The key wrapping keys are stored and managed using AWS KMS, with one key wrapping key per node operator. This key is sealed to the vHSM in the enclave and made inaccessible to the AWS root user, which means that no one can use or extract the key wrapping key. Sealing a KMS key to an enclave is an extremely powerful primitive. It guarantees that the vHSM is the only component that can decrypt the operator’s signing keys, which makes it possible for us to implement hardware abstractions like the HSM in software.
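
One way to express this kind of sealing is an AWS KMS key policy statement that allows `kms:Decrypt` only when the request carries an attestation document with the expected enclave image measurement, using the `kms:RecipientAttestation:ImageSha384` condition key. The following sketch builds such a statement. The role ARN and the measurement are placeholders, and a complete policy (including how administrative and root access is restricted) needs more than this single statement.

```python
import json

# Placeholder values: substitute a real role ARN and enclave image measurement.
ENCLAVE_ROLE_ARN = "arn:aws:iam::111122223333:role/example-enclave-host"
ENCLAVE_IMAGE_SHA384 = "0" * 96  # placeholder 384-bit measurement, hex encoded


def enclave_only_decrypt_statement(role_arn: str, image_sha384: str) -> dict:
    """Allow kms:Decrypt only for requests attested by the expected enclave image."""
    return {
        "Sid": "AllowDecryptFromAttestedEnclaveOnly",
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "kms:Decrypt",
        "Resource": "*",
        "Condition": {
            "StringEqualsIgnoreCase": {
                "kms:RecipientAttestation:ImageSha384": image_sha384
            }
        },
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [
        enclave_only_decrypt_statement(ENCLAVE_ROLE_ARN, ENCLAVE_IMAGE_SHA384)
    ],
}
print(json.dumps(policy, indent=2))
```

Because the condition is tied to the enclave image measurement, changing the enclave code changes the measurement, and the modified enclave can no longer decrypt until the policy is updated.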

Because only the code in the enclave can access secret keys, keeping keys safe reduces to keeping enclave code safe. To this end, that code has the following features:

  • Written in Rust – We wrote the signer in Rust because it gives us low-level control over how the code handles secrets, so we can avoid leaking secrets in memory or through side channels. Rust also makes it straightforward to use verification techniques for proving the safety of the code running in the enclave.
  • Minimal – We keep dependencies in check with cargo vet, and using Rust protects us from massive runtime systems (for example, the JVM), complex just-in-time compilers (as in JavaScript engines), and their resulting huge supply chains.
  • Least privileged – The interface to the code running in the enclave is its attack surface. Our vHSM implements a tiny, well-typed interface using the vsock communication channel—unlike other signers, which simply proxy arbitrary network requests across the enclave boundary.
  • Confined – The only outbound channel from the vHSM is to AWS KMS, following a secure communication flow. For more information, refer to How AWS Nitro Enclaves uses AWS KMS. Even here, however, the signer assumes that the channel may be compromised and relies on an ephemeral key pair associated with the particular enclave: the AWS KMS result is encrypted with the public key stored in the attestation document. This way, the AWS KMS result can only be decrypted and read inside the enclave where the AWS KMS decrypt call originated in the first place.
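
To illustrate what a tiny, well-typed interface means in practice, the following sketch parses a single request type and rejects everything else. The actual vHSM is written in Rust and its wire format is not described in this post; the JSON encoding, field names, and `SignRequest` type here are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SignRequest:
    """The only request shape this sketch accepts."""
    key_id: str
    signing_root: bytes  # exactly 32 bytes


def parse_request(raw: bytes) -> SignRequest:
    """Decode one request, rejecting anything that is not an exact match."""
    obj = json.loads(raw)  # malformed JSON raises ValueError
    if not isinstance(obj, dict) or set(obj) != {"type", "key_id", "signing_root"}:
        raise ValueError("unexpected fields")
    if obj["type"] != "sign":
        raise ValueError("unsupported request type")
    if not isinstance(obj["key_id"], str) or not isinstance(obj["signing_root"], str):
        raise ValueError("wrong field types")
    signing_root = bytes.fromhex(obj["signing_root"])  # non-hex raises ValueError
    if len(signing_root) != 32:
        raise ValueError("signing_root must be 32 bytes")
    return SignRequest(obj["key_id"], signing_root)


raw = json.dumps(
    {"type": "sign", "key_id": "validator-1", "signing_root": "00" * 32}
).encode()
print(parse_request(raw))
```

Anything that is not exactly this shape fails with an error before it reaches key material, in contrast to proxying arbitrary network requests across the enclave boundary.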

Conclusion

In this post, we covered the operational traps and pitfalls of running Ethereum validator nodes and, in particular, the security, correctness, and availability challenges node operators have to balance daily. Remote signing, in contrast to managing keys locally on the node, can address these challenges, but the design of the remote signer matters. To this end, we detailed the design of CubeSigner, a remote signing solution based on Nitro Enclaves, and showed how CubeSigner not only addresses the security and correctness (slashing) challenges operators face, but also makes it straightforward to eliminate operational costs and run highly available validator clusters.

We encourage you to sign up for a Cubist CubeSigner sandbox account, start up your Ethereum validator using an AWS Blockchain Node Runners template, and run your own Ethereum validator node.


About the Authors

Fraser Brown is a co-founder and the CTO of Cubist. She is also an Assistant Professor at Carnegie Mellon University’s School of Computer Science. Her research focuses on security and program correctness, from verifying (parts of) production systems to automatically finding exploitable bugs in real codebases; for example, her tools have found many zero-day bountied bugs and CVEs in the popular Chrome and Firefox browsers.

Deian Stefan is a co-founder and Chief Scientist of Cubist. He is also an Associate Professor of Computer Science and Engineering at UC San Diego. His research lies at the intersection of security and programming languages with a particular focus on building secure systems that are deployed in production. He was a co-founder of Intrinsic, a runtime security startup acquired by VMware in 2019.

David-Paul Dornseifer is a Blockchain Development Architect at AWS. He focuses on helping customers design, develop and scale end-to-end blockchain solutions. His primary focus is on digital asset custody and key management solutions.