AWS Startups Blog

How Coinbase Builds Its Blockchain Infrastructure

Coinbase is a marketplace to buy and sell digital currency. It’s one of the best-known portals for anyone hoping to approach the cryptocurrency market, because it’s the “easiest and most trusted place to buy, sell, and manage your digital currency,” according to Jack Kearney, a software engineer at Coinbase working on infrastructure and security. “Most trusted” is a phrase that gets thrown around a lot, but for Coinbase, it has a specific meaning. “It means we need to be reliable; we need our services to be up when people want to be trading. It also means that we need to have security in mind in everything we do.”

That also means their infrastructure has to be secure, scalable, and stable. And as their use cases have grown, so has their volume. Thinking about how to meet all their customers’ needs has driven Coinbase to develop an infrastructure that is “blockchain agnostic.” For Kearney, “that just means that we can more easily spin up new chains without having to do anything super custom per chain as much as possible.”

In specific terms, what does that demand of their infrastructure? There are four main tenets that Coinbase abides by when building out its architecture. The first is immutability. “Any configuration change requires a redeploy,” explains Kearney. “This ensures that all configuration changes are tracked in source control and also ensures that we’re always in a deployable state.” That prevents any one actor from making a change that Coinbase doesn’t know about. Instead, everything is immutable, “so anytime we change something, we entirely tear down and rebuild the application,” says Kearney.

The second tenet of the Coinbase architecture is ephemerality. That means every server in their infrastructure lives for at most 30 days. “This is a huge security win,” says Kearney. “This means that if any vulnerabilities were to come out, we’re constantly redeploying these servers, tearing down the old ones, and pulling in the updated packages that might have a vulnerability.” This constant turnover also means that if attackers were to ever “gain persistence, we’d constantly be pulling the rug out from under them,” Kearney says. Building this type of defense into the infrastructure itself keeps Coinbase more secure.
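To make the 30-day ceiling concrete, here is a minimal Python sketch of an age check that flags servers due for replacement. It is an illustration only, not Coinbase’s tooling: the function name, the server IDs, and the idea of tracking launch times in a dictionary are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# The 30-day maximum server lifetime described in the article.
MAX_SERVER_AGE = timedelta(days=30)


def servers_due_for_replacement(launch_times, now=None):
    """Return the sorted IDs of servers whose age has reached MAX_SERVER_AGE.

    launch_times is a hypothetical mapping of server ID to a
    timezone-aware launch time.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        server_id
        for server_id, launched_at in launch_times.items()
        if now - launched_at >= MAX_SERVER_AGE
    )
```

A scheduled job could run a check like this and hand the resulting IDs to whatever redeploy mechanism is in place, so that no server quietly outlives the limit.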

The next tenet is consensus, the more human side of things. In practice, this means that no one can make changes on any critical piece of the architecture without the active consent of enough other players on the team. “Typically, the number of people required to perform an action is proportional to the sensitivity of that action,” explains Kearney. This means that “no one person can perform sensitive actions individually, but any sufficiently large group of people can do anything, any scary action together.”
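The consensus rule can be sketched as a quorum check in which the required number of approvers grows with an action’s sensitivity. The sensitivity levels and approval counts below are hypothetical; the article does not say what thresholds Coinbase uses.

```python
# Hypothetical sensitivity levels mapped to the number of approvals required.
REQUIRED_APPROVALS = {"low": 1, "medium": 2, "high": 3}


def is_authorized(sensitivity, requester, approvers):
    """Return True if enough distinct people other than the requester approved.

    Excluding the requester means no one can approve their own action,
    and duplicate approvals from the same person count only once.
    """
    distinct_approvers = set(approvers) - {requester}
    return len(distinct_approvers) >= REQUIRED_APPROVALS[sensitivity]
```

Because the requester never counts toward the quorum, even the least sensitive action in this sketch needs a second person, which matches the idea that no single engineer acts alone.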

Lastly, they run Coinbase according to a tenet of automation. That means engineers don’t have to work directly with DevOps to make the changes that they need; tooling is largely self-service. In conjunction with consensus, that means that “all application deployments are completely engineer managed,” Kearney says. “Instead, the deploys and rollbacks are handled by the people who know those applications best, the engineers who built them.” Rather than handing the responsibility off to someone who doesn’t fully understand the application being modified, the experts in those applications shepherd the deploys. It’s an elegant way of democratizing and streamlining processes at the same time.

Using these architectural tenets, how does Coinbase actually operate? It uses the nodes that validate, detect, and relay the state updates throughout the blockchain network as its eyes and ears. Whenever somebody wants to send funds into Coinbase, they log into their app, see a deposit address, send their funds through a wallet they control, and Coinbase “detects that they did in fact send those funds to an address that we control by querying these nodes,” explains Kearney. “In the event that it did, we can credit their account balance. Similarly, on the other side, if somebody wants to send funds out of Coinbase, we create a transaction, sign that transaction, and broadcast it through one of these nodes.”

The nodes are critical to their operation. “We need these nodes to be reliable,” says Kearney. “We take security incredibly, incredibly seriously. And so we want to be able to frequently and rapidly redeploy these nodes into the infrastructure, into this ephemeral and immutable infrastructure.” That means, though, that Coinbase is constantly pulling down and redeploying the entire blockchain state, a large but not unheard-of amount of data. “The issue here is that we have concerns with network reliability. So, when our new node comes up, we have to pull down the entire blockchain from the network. And these networks aren’t necessarily the most reliable,” says Kearney. “That’s where Snapchain comes into play.”

Snapchain is, at a very high level, “a tool to blue-green deploy blockchain nodes,” explains Kearney. They want to build something as generic as possible without introducing vulnerabilities, so they implement a node health check. “Typically, we look at the time of the last synced block and compare that with right now. If that grows too far, we know that we’re pretty out of sync with a network and we know that we want to actually tear down one of these nodes and continue,” says Kearney. While blockchains produce blocks at different rates, they can check against third-party developers to see if any given lag is system-wide.
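The health check Kearney describes, comparing the time of the last synced block with the current time, might look roughly like the Python sketch below. The chain names and lag thresholds are invented for illustration; since chains produce blocks at different rates, the sketch assumes each chain gets its own tolerance.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-chain tolerances; real values would depend on each
# chain's block production rate.
MAX_LAG = {
    "fast-chain": timedelta(minutes=2),
    "slow-chain": timedelta(hours=1),
}


def is_node_healthy(chain, last_block_time, now=None):
    """Return True if the node's last synced block is recent enough.

    last_block_time is the timezone-aware timestamp of the most recent
    block the node has synced.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_block_time <= MAX_LAG[chain]
```

A node that fails this check would be a candidate for teardown and replacement, subject to whatever cross-check is used to rule out a network-wide stall.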

To minimize the time to sync from the network, Coinbase realized that they needed to separate blockchain data production from data consumption. “We needed one way to produce blockchain data, and then a completely separate setting in which to consume that data. From that, we sort of arrived at two configurations for Snapchain,” says Kearney. The first is what they call the snapshot configuration.

Snapshot configurations are the “nodes that we spin up, we sync that chain, we take a snapshot and are constantly producing this blockchain state. And then on the other hand we have long-lived configurations, which consume the state produced by snapshot nodes.” Kearney continues: “The way we manage these and know which snapshot refers to which protocol is using just AWS tags. We tag each snapshot with different tags: the implementation, the network, the version, and the specialization.”
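As a rough illustration of tag-based lookup, the Python sketch below picks the newest snapshot whose tags match what a node needs. It works on plain dictionaries rather than calling AWS, and the record layout, tag keys, and tag values are all assumptions made for the example.

```python
from datetime import datetime


def latest_matching_snapshot(snapshots, wanted_tags):
    """Return the newest snapshot carrying every wanted tag, or None.

    Each snapshot is a hypothetical record with an "id", a "created_at"
    datetime, and a "tags" dictionary.
    """
    matches = [
        snap
        for snap in snapshots
        if all(snap["tags"].get(key) == value for key, value in wanted_tags.items())
    ]
    return max(matches, key=lambda snap: snap["created_at"], default=None)


# Example data with made-up tag values.
snapshots = [
    {"id": "snap-1", "created_at": datetime(2024, 1, 1),
     "tags": {"implementation": "impl-a", "network": "mainnet"}},
    {"id": "snap-2", "created_at": datetime(2024, 1, 2),
     "tags": {"implementation": "impl-a", "network": "mainnet"}},
    {"id": "snap-3", "created_at": datetime(2024, 1, 3),
     "tags": {"implementation": "impl-a", "network": "testnet"}},
]
chosen = latest_matching_snapshot(
    snapshots, {"implementation": "impl-a", "network": "mainnet"}
)
```

Here `chosen` is the `snap-2` record: `snap-3` is newer but tagged for a different network, so it is filtered out before the newest match is taken.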

To consume the snapshots, they create two containers on the box: the node container and the control container. Then they take the snapshot, turn it into a volume on the fly, and only when the data is actually in place do they turn the node on. Though the nodes still find peers, they already have a full copy of the blockchain, so they don’t really have to sync anything with the network. The result is an elegant solution: “We have live nodes running over here entirely isolated, we have snapshot nodes running that requests aren’t going through, and the snapshot nodes actually won’t affect user data at all.”

Network Load Balancers are another unique part of Coinbase’s architecture. Their ephemeral infrastructure means that they don’t otherwise have any static IPs, which NLBs can provide. They run with cross-zone load balancing off, which gives them one static IP per node, behind which they can do blue-green deploys. “We’ve built a snapshot selection engine to intelligently select the right snapshot based on a particular node’s version,” says Kearney. And they’re not concerned about rolling back because they have consistent and predictable snapshots to roll back to. “We’ll keep every snapshot we produced for three days so we can roll back to a very granular point in time. And then, after that, we’re going to keep one per date-chain-protocol tuple for 50 days.”
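One way to read the retention policy in that quote is sketched below in Python. Several details are assumptions rather than facts from the article: that both windows are measured from a snapshot’s creation time, that the one snapshot retained per date, chain, and protocol is the newest from that day, and the record field names themselves.

```python
from datetime import datetime, timedelta, timezone

KEEP_ALL_FOR = timedelta(days=3)     # every snapshot is kept this long
KEEP_DAILY_FOR = timedelta(days=50)  # then one per (date, chain, protocol)


def snapshots_to_keep(snapshots, now):
    """Return the set of snapshot IDs that the retention policy retains.

    Each snapshot is a hypothetical record with "id", "chain", "protocol",
    and a timezone-aware "created_at".
    """
    keep = set()
    newest_per_day = {}
    for snap in snapshots:
        age = now - snap["created_at"]
        if age <= KEEP_ALL_FOR:
            keep.add(snap["id"])
        elif age <= KEEP_DAILY_FOR:
            key = (snap["created_at"].date(), snap["chain"], snap["protocol"])
            best = newest_per_day.get(key)
            if best is None or snap["created_at"] > best["created_at"]:
                newest_per_day[key] = snap
        # Anything older than KEEP_DAILY_FOR is not retained.
    keep.update(snap["id"] for snap in newest_per_day.values())
    return keep
```

Anything not in the returned set would be eligible for deletion, which keeps fine-grained rollback points for recent days and a thinner daily history further back.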

In the future, Kearney says, they’d like to create a dedicated AWS account where they can share snapshots or checkpoints of the blockchain “so that anybody can then grab that checkpoint, know that it was produced by Coinbase and is likely a reliable source, and stand up their own nodes pretty easily.” The abstractions they have to deal with in their user interface might need a better iteration to handle the wide array of new nodes that they’re going to have in the coming months and years. But for now, the fact that they built a way to spin up and tear down nodes easily through Snapchain is, if you’ll forgive Kearney’s understatement, “pretty cool.”

Michelle Kung

Michelle Kung currently works in startup content at AWS and was previously the head of content at Index Ventures. Prior to joining the corporate world, Michelle was a reporter and editor at The Wall Street Journal, the founding Business Editor at the Huffington Post, a correspondent for The Boston Globe, a columnist for Publisher’s Weekly and a writer at Entertainment Weekly.