AWS Database Blog
Notarize documents on the Ethereum Blockchain
Companies working in the public sector must often certify that their work complies with required standards and regulations. They must be auditable by external parties and accountable to the general public. Such companies often use a trusted third party that attests to certain attributes of documentation, such as its timestamp or authorship.
Blockchains can automate part of that: they can store values immutably so that they can be audited at a later point. If you need to prove that documents haven’t been modified since their original storage, you can notarize them on a blockchain. For each document, you store a digital fingerprint, which can no longer be modified and serves as proof of the original document. In this post, we share how Autostrade per l’Italia, a company responsible for highway maintenance in Italy, uses blockchain to verify certification events.
Autostrade per l’Italia has to periodically inspect the various assets of a highway, such as bridges, overpasses, and tunnels. Each bridge, for instance, is inspected on a quarterly basis. With approximately 4,000 bridges under their responsibility, this adds up to 16,000 inspections per year. Each inspection generates several event documents in XML format, which are then stored on their internal systems. As a new feature, they want to ensure that each event exists in its original form. They want to prove process compliance to the government by verifying the events making up a particular inspection. We have built a proof of concept using the public Ethereum blockchain as the ledger for verification of all events. In this post, we explain how you can store a large number of proofs with very few blockchain transactions using Amazon Managed Blockchain.
Blockchain storage with AWS
Although blockchain can store data immutably, it is very limited in the amount of data it can hold economically. Each byte stored on blockchain is fairly expensive (at a price of $2,600 USD per Ether as of February 2022, storing 1 KB of data costs approximately $50). Prices for storing data also vary with volatile transaction fees on public blockchains. The high transaction fees and their volatility led to two main insights for the project:
- We can’t store the event documents on blockchain directly. That would be cost prohibitive. Instead, we use blockchain as a notary service, storing proofs of documents. We can later retrieve the proofs and use them to verify the document. The document itself is stored off-chain.
- With the number of events, even storing proofs only is too expensive. Instead, we have to compress many proofs into one transaction so that we can reduce the number of transactions drastically. We want to create blockchain transactions on a monthly basis, ending up with 12 transactions per year.
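The order of magnitude of the storage price quoted above can be checked with a short calculation. The following sketch assumes that writing one new 32-byte word to contract storage costs 20,000 gas and that the gas price is 30 gwei. The gas price is an illustrative assumption, not a figure from this post, and real fees fluctuate considerably.

```python
# Rough storage-cost estimate. GAS_PRICE_GWEI is an assumed value chosen for
# illustration; only the Ether price comes from the text above.
GAS_PER_STORAGE_WORD = 20_000   # gas to write one new 32-byte storage slot
WORD_SIZE_BYTES = 32
GAS_PRICE_GWEI = 30             # assumption: varies widely with network load
ETH_PRICE_USD = 2_600           # February 2022 price quoted above


def storage_cost_usd(num_bytes: int) -> float:
    """Estimate the USD cost of writing num_bytes to contract storage."""
    words = -(-num_bytes // WORD_SIZE_BYTES)        # ceiling division
    gas = words * GAS_PER_STORAGE_WORD
    ether = gas * GAS_PRICE_GWEI / 1_000_000_000    # 1 gwei = 1e-9 Ether
    return ether * ETH_PRICE_USD


cost_1kb = storage_cost_usd(1024)
cost_one_hash = storage_cost_usd(32)
print(f"1 KB of storage:     ~${cost_1kb:.2f}")
print(f"One 32-byte hash:    ~${cost_one_hash:.2f}")
```

Under these assumptions, 1 KB comes out at roughly $50, while a single 32-byte hash costs a small fraction of that (before the base transaction fee). This is why storing one hash per month is affordable while storing whole documents is not.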
To achieve these goals, we do two things. First, we store all documents on Amazon Simple Storage Service (Amazon S3), sorted into folders for each month. Second, we generate the transaction that stores the proofs on blockchain with Amazon Managed Blockchain once per month. This monthly transaction balances transaction costs against the time between a document’s creation and its notarization on blockchain. An event document can stay unverified for at most 1 month. After that, the proof is on blockchain and can be verified.
To deal with the number of documents, we store the proofs in a Merkle tree data structure, which aggregates many hashes (the leaves of the tree) into one so-called root hash. The tree has all the proofs for the documents as its leaves. Bottom up, we hash the proofs pairwise until we end up with one hash only, which forms the root of the tree.
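The bottom-up pairwise hashing can be sketched in a few lines of Python. This is a minimal illustration, not the project's code. It uses hashlib.sha3_256 as a stand-in hash, because Ethereum's Keccak-256 is not part of the Python standard library and produces different digests. It also assumes that an unpaired node at the end of a level is carried up unchanged. The names hash_pair and merkle_root are hypothetical.

```python
import hashlib


def hash_pair(left: bytes, right: bytes) -> bytes:
    """Hash two child nodes into their parent node.

    sha3_256 is a stand-in: Ethereum uses Keccak-256, which is not in the
    Python standard library and produces different digests.
    """
    return hashlib.sha3_256(left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Reduce a non-empty list of leaf hashes to a single root hash."""
    if not leaves:
        raise ValueError("a Merkle tree needs at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                # Pairs are concatenated in positional order here.
                next_level.append(hash_pair(level[i], level[i + 1]))
            else:
                # Assumption: an unpaired node moves up unchanged.
                next_level.append(level[i])
        level = next_level
    return level[0]


# Four placeholder documents stand in for the event XML files.
documents = [b"<event id='1'/>", b"<event id='2'/>",
             b"<event id='3'/>", b"<event id='4'/>"]
leaves = [hashlib.sha3_256(doc).digest() for doc in documents]
root = merkle_root(leaves)
print(root.hex())
```

However many leaves go in, the result is always a single 32-byte value, which is the only data that has to be sent to the blockchain.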
Solution overview
The architecture on AWS consists of three main parts:
- The backend with document storage and logic to create the Merkle trees and store them on blockchain
- The blockchain node itself connecting to Ethereum mainnet
- A front-end component that can verify individual proofs with their data on blockchain
We receive the documents for storage and proofing through Amazon API Gateway. It forwards the documents to an AWS Lambda function, which validates the XML and then stores the documents on Amazon S3. The event documents bucket holds all the XML documents.
Periodically, the aggregator Lambda function is called to do four things:
- Take all documents for a specific month from Amazon S3 and hash them.
- Aggregate the individual hashes into a Merkle tree and send the root hash as a transaction to blockchain.
- Store the Merkle proof of each individual hash as Amazon S3 metadata with the document.
- Take the block number of the transaction and store it with each XML document as Amazon S3 metadata.
The front end is a static website based on React. It retrieves the documents for a particular month from Amazon S3 and verifies them against the root hash on blockchain.
Implementation
The challenging parts of the implementation are in the aggregator Lambda function and the smart contract.
Aggregator Lambda function
The aggregator function creates the Merkle tree of all documents for a specific month. Here it is written in Python, but the same functionality could be implemented in JavaScript or another suitable language. First, the function gets a list of all objects in Amazon S3 with a specific prefix:
xml_s3_keys then holds the list of objects. Now we can iterate through the list. For each object, we retrieve it from Amazon S3 and generate its Keccak hash:
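A minimal sketch of these two steps could look as follows. It assumes a boto3 S3 client is passed in and uses its list_objects_v2 paginator and get_object call. The function names, bucket name, and prefix are hypothetical, and hashlib.sha3_256 is only a stand-in for Keccak-256, which needs a third-party package in Python.

```python
import hashlib
from typing import Callable, Dict, List


def stand_in_hash(data: bytes) -> bytes:
    """Stand-in for Keccak-256; SHA3-256 yields different digests."""
    return hashlib.sha3_256(data).digest()


def list_xml_keys(s3_client, bucket: str, prefix: str) -> List[str]:
    """Return the keys of all objects under prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def hash_documents(
    s3_client,
    bucket: str,
    keys: List[str],
    hash_fn: Callable[[bytes], bytes] = stand_in_hash,
) -> Dict[str, bytes]:
    """Download each object and map its key to the hash of its payload."""
    hashes = {}
    for key in keys:
        payload = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        hashes[key] = hash_fn(payload)
    return hashes


# Usage against a real bucket (hypothetical bucket name and prefix):
#   import boto3
#   s3 = boto3.client("s3")
#   xml_s3_keys = list_xml_keys(s3, "event-documents", "2022-02/")
#   leaf_hashes = hash_documents(s3, "event-documents", xml_s3_keys)
```

Passing the client and the hash function as parameters keeps the sketch independent of AWS credentials and makes it easy to swap in a real Keccak-256 implementation.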
We create the Merkle tree as a pairwise hash tree. The event hashes generated as s3_object_payload_keccak form the tree’s leaves. mt_make_tree then builds the tree. The sort_pairs=True parameter determines the order of the input hashes. It has to match the order in the verification step. The following figure illustrates this Merkle tree.
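To make the effect of sorted pairs concrete, here is a minimal sketch of building the tree level by level and extracting the proof for one leaf. It is an illustration under assumptions, not the mt_make_tree implementation: the names build_levels and get_proof are hypothetical, sha3_256 stands in for Keccak-256, and an unpaired node is promoted unchanged.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    """Stand-in hash (SHA3-256); the original code uses Keccak-256."""
    return hashlib.sha3_256(data).digest()


def hash_sorted_pair(a: bytes, b: bytes) -> bytes:
    """Hash two nodes with the smaller value first, so order does not matter."""
    return h(a + b) if a <= b else h(b + a)


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return all tree levels, from the leaves (index 0) up to the root."""
    if not leaves:
        raise ValueError("a Merkle tree needs at least one leaf")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        parent = []
        for i in range(0, len(current), 2):
            if i + 1 < len(current):
                parent.append(hash_sorted_pair(current[i], current[i + 1]))
            else:
                parent.append(current[i])  # assumption: promote unpaired node
        levels.append(parent)
    return levels


def get_proof(levels: List[List[bytes]], index: int) -> List[bytes]:
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1          # neighbour within the same pair
        if sibling < len(level):
            proof.append(level[sibling])
        index //= 2
    return proof


leaves = [h(f"<event id='{i}'/>".encode()) for i in range(8)]
levels = build_levels(leaves)
root = levels[-1][0]
proof_for_leaf_2 = get_proof(levels, 2)
print(len(proof_for_leaf_2))  # 3 sibling hashes for an 8-leaf tree
```

Because each pair is sorted before hashing, the proof does not need to record whether a sibling sits on the left or on the right. The verifier only has to apply the same sorting rule.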
Merkle trees are very useful to prove that a particular data point is part of the data structure. We can recreate the branch that leads from the data point to the root of the tree, shown as the orange path in the preceding figure. In the tree, we can verify the existence of the orange XML document. We need two additional data points: first, we need the so-called proof (blue hashes) for an element. The proof contains the hashes to do the pairwise hashing without recreating the entire tree each time. With the blue hashes, verification boils down to four hash operations in the preceding tree:
hash(<ORANGE XML DOC>) = 0xf4d
hash(0xf4d, 0xfff) = 0x27f
hash(0x310, 0x27f) = 0xbbb
hash(0xbbb, 0xaaa) = 0xd27
With 0xd27, we have reached the root of the tree. Now we need our second data point, the actual root hash (red), which we can retrieve from the blockchain. If our calculated root matches the one retrieved from the blockchain, we have proven two things:
- The original document was part of the Merkle tree at its original creation
- The document existed when the root hash was stored on blockchain
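The check described above can be sketched as a fold over the proof. The example below builds a small four-leaf tree by hand, verifies an unmodified document, and shows that a modified document no longer reproduces the root. The document contents are made up for illustration, the function names are hypothetical, and sha3_256 again stands in for Keccak-256.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    """Stand-in hash (SHA3-256) used in place of Keccak-256."""
    return hashlib.sha3_256(data).digest()


def hash_sorted_pair(a: bytes, b: bytes) -> bytes:
    """Hash two nodes with the smaller value first."""
    return h(a + b) if a <= b else h(b + a)


def verify(root: bytes, leaf: bytes, proof: List[bytes]) -> bool:
    """Recreate the branch from leaf to root and compare with the known root."""
    computed = leaf
    for sibling in proof:
        computed = hash_sorted_pair(computed, sibling)
    return computed == root


# Hand-built four-leaf tree over illustrative documents.
docs = [b"<event>bridge A</event>", b"<event>bridge B</event>",
        b"<event>tunnel C</event>", b"<event>overpass D</event>"]
l0, l1, l2, l3 = (h(d) for d in docs)
n01 = hash_sorted_pair(l0, l1)
n23 = hash_sorted_pair(l2, l3)
root = hash_sorted_pair(n01, n23)   # in practice, read back from the blockchain

proof_for_doc_2 = [l3, n01]         # sibling leaf, then sibling subtree
original_ok = verify(root, h(docs[2]), proof_for_doc_2)
tampered_ok = verify(root, h(docs[2] + b" "), proof_for_doc_2)
print(original_ok, tampered_ok)     # True False
```

Even a single added space changes the leaf hash, so the recomputed branch ends in a different root and the verification fails.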
Therefore, to allow for verification at a later time, we need to store the Merkle proofs for each document. We do that by adding Amazon S3 metadata to the object in Amazon S3. That way, the proof can never be separated from the document itself. We can retrieve it by querying the object’s metadata.
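One way to attach a proof to an object is to serialize it into string metadata. The sketch below is an assumption about the encoding, not the project's format: it stores the proof as a JSON list of hex strings together with the block number, under the hypothetical keys merkle-proof and block-number. It also checks the result against the 2 KB size limit that Amazon S3 places on user-defined metadata.

```python
import json
from typing import Dict, List

S3_USER_METADATA_LIMIT_BYTES = 2048  # S3 caps user-defined metadata at 2 KB


def proof_to_metadata(proof: List[bytes], block_number: int) -> Dict[str, str]:
    """Encode a Merkle proof and block number as S3-style string metadata."""
    metadata = {
        "merkle-proof": json.dumps([node.hex() for node in proof]),
        "block-number": str(block_number),
    }
    # All keys and values are ASCII, so len() equals the size in bytes.
    size = sum(len(key) + len(value) for key, value in metadata.items())
    if size > S3_USER_METADATA_LIMIT_BYTES:
        raise ValueError(f"metadata is {size} bytes, above the S3 limit")
    return metadata


def metadata_to_proof(metadata: Dict[str, str]) -> List[bytes]:
    """Decode the proof stored by proof_to_metadata."""
    return [bytes.fromhex(node) for node in json.loads(metadata["merkle-proof"])]


# A 14-element proof is enough for a tree with up to 2**14 = 16,384 leaves.
example_proof = [bytes([i]) * 32 for i in range(14)]
metadata = proof_to_metadata(example_proof, block_number=14_200_000)
metadata_size = sum(len(k) + len(v) for k, v in metadata.items())
print(metadata_size, "bytes of metadata")
```

With boto3, such a dictionary could be passed as the Metadata argument of put_object, or of copy_object with MetadataDirective="REPLACE" when the object already exists, because the metadata of an existing S3 object cannot be edited in place.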
Smart contract
The smart contract on blockchain has two main functions:
- storeNewRootHash – Stores a root hash of a new Merkle tree on blockchain
- verify – Checks if a document is part of a Merkle tree with a specific root
To store a new root hash, we use the following code:
The function takes the root hash as a parameter and emits an event with the root hash. We can later retrieve the event from blockchain. Additionally, it stores the block number at which the most recent event was emitted. This is useful for traversing all events that the smart contract has ever emitted. It is not necessary for verification.
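The bookkeeping described here can be modeled off-chain in Python to show how the stored block number supports traversal. This is a model under assumptions, not the contract itself: it assumes that each event also carries the block number of the previous event, so that the history forms a backward-linked chain, and that at most one root is stored per block. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class RootStoredEvent:
    root_hash: bytes
    block_number: int
    previous_block_number: int   # assumption: link to the previous event


@dataclass
class NotaryContractModel:
    last_event_block: int = 0    # 0 means "no event emitted yet"
    events: List[RootStoredEvent] = field(default_factory=list)

    def store_new_root_hash(self, root_hash: bytes, current_block: int) -> None:
        """Emit an event for the root and remember the block it was emitted in."""
        if current_block <= self.last_event_block:
            raise ValueError("model assumes one root per strictly later block")
        self.events.append(
            RootStoredEvent(root_hash, current_block, self.last_event_block)
        )
        self.last_event_block = current_block

    def history(self) -> Iterator[RootStoredEvent]:
        """Walk events newest-first by following the block-number links."""
        by_block = {event.block_number: event for event in self.events}
        block = self.last_event_block
        while block in by_block:
            event = by_block[block]
            yield event
            block = event.previous_block_number


contract = NotaryContractModel()
contract.store_new_root_hash(b"\x01" * 32, current_block=100)
contract.store_new_root_hash(b"\x02" * 32, current_block=200)
blocks_newest_first = [event.block_number for event in contract.history()]
print(blocks_newest_first)  # [200, 100]
```

A client that knows only the latest block number can then jump from event to event instead of scanning every block for logs.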
To verify that a document is indeed part of the Merkle tree, we use the following code:
The function takes a root hash, a leaf hash, and the proof (array of hashes) as input. It then iterates through the proof to recreate the branch. The if...else ensures that the concatenation of the hashes happens in the right order each time. Here we assume a tree from sorted pairs. Finally, the function returns whether the recreated root hash matches the provided root.
Conclusion
Due to the immutability and transparency of blockchains, they can be a useful tool for notarizing documents. However, storage on blockchain is very expensive and transaction fees are volatile. Therefore, we have to make sure to reduce the number of transactions to a minimum. Merkle trees enable us to compress all digital fingerprints into a single root hash. Verification can then be done by recomputing the branch of the Merkle tree. If the original document is still the same as during Merkle tree creation, the verification step results in the same root hash.
With a novel application of Merkle trees for data compression, the number of input documents doesn’t affect the number of transactions. Instead, we can use a suitable time frame for aggregation. Transaction cost remains manageable, because it depends on the time frame only and not on the number of documents.
Finally, verifying that a document exists in the tree is a fairly simple computation. It only requires a sequence of hash operations, whose length is bounded by the height of the tree. In general, with x elements in the tree, we only need about log2(x) hash operations for verification.
With Merkle trees, you can make all the documents that you need to prove verifiable on blockchain. Try out the technique and leave a comment on what you have notarized.
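The logarithmic bound is easy to tabulate. The following sketch computes the maximum number of sibling hashes in a proof for a tree with a given number of leaves, assuming the pairwise tree construction described in this post. The function name proof_length is hypothetical.

```python
def proof_length(num_leaves: int) -> int:
    """Maximum number of sibling hashes in a proof, i.e. ceil(log2(num_leaves))."""
    if num_leaves < 1:
        raise ValueError("a Merkle tree needs at least one leaf")
    # (n - 1).bit_length() equals ceil(log2(n)) without floating-point rounding.
    return (num_leaves - 1).bit_length()


for n in (8, 1_000, 16_000, 1_000_000):
    print(f"{n:>9} leaves -> at most {proof_length(n)} proof hashes")
```

For the eight-leaf tree in the earlier figure, this gives three proof hashes, which together with hashing the document itself makes the four hash operations counted there. Even a million documents need no more than 20 proof hashes.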
About the Author
Christoph Niemann is a Senior Blockchain Architect with AWS Professional Services. He likes blockchains and helps customers design and build blockchain-based solutions. If he’s not building with blockchains, he’s probably drinking coffee.