AWS Web3 Blog
Analyze blockchain data with natural language using Amazon Bedrock
Data within public blockchain networks such as Bitcoin and Ethereum can be accessed by anyone. It holds a wealth of valuable insights that can drive business decisions, inform investment strategies, and uncover emerging trends. However, accessing and making sense of this information has traditionally been a complex and technical undertaking. Much of the data is encoded and stored as bytes, rather than in a human-readable format.
The data structures used in blockchains are optimized to provide tamper-evidence and immutability of the data, but not to perform queries and analytics. Before the data can be queried, it must first be processed through an extract, transform, and load (ETL) pipeline and converted into a format that can be used with common business intelligence (BI) tools and query languages.
The AWS Public Blockchain Open Data Set was created to solve these challenges, and is available for Bitcoin and Ethereum. These datasets provide historical data, allowing analysts to issue SQL queries to services like Amazon Athena and Amazon Redshift to glean insights. One of the key advantages of these datasets is the potential to aggregate and analyze activity across multiple blockchain networks. However, the process of querying and analyzing blockchain data still requires a comprehensive understanding of data schemas and the ability to construct appropriate queries.
With generative artificial intelligence (AI), you can now extend this analysis capability to support natural language queries. This enables users who may not be familiar with SQL to gain similar insights from blockchain data. In this post, we introduce a solution that demonstrates how you can chat with blockchain data using Amazon Bedrock and the AWS Public Blockchain datasets. We discuss Amazon Bedrock, review the solution architecture, provide example prompts, share interesting findings, and go over how you can extend the solution to integrate with different data sources.
Benefits of Amazon Bedrock
Recent advancements in large language models (LLMs) and generative AI have opened up new possibilities for interacting with data in more natural and intuitive ways. These models have demonstrated the ability to understand and generate human-like text, enabling natural language understanding and generation.
Amazon Bedrock is a fully managed service that makes it simple for customers to build generative AI applications and provides access to a variety of foundation models (FMs), agents, and knowledge bases for retrieval augmented generation (RAG) workflows.
Amazon Bedrock allows you to experiment, customize, and deploy FMs without having to manage the underlying infrastructure and training complexity.
A key capability within Amazon Bedrock is the ability to create autonomous agents using Agents for Amazon Bedrock that can assist users in completing tasks. These agents use the reasoning and language understanding capabilities of FMs to perform the following functions:
- Understand natural language requests from users
- Break down those requests into a series of smaller tasks to complete
- Gather additional information from users through conversation
- Take actions by invoking APIs or querying knowledge bases
- Provide responses back to users in natural language
Agents are well-suited for building generative AI applications that can automate tasks and engage with users in a natural and conversational way.
Solution overview
This solution is available as an automated AWS Cloud Development Kit (AWS CDK) deployment in the accompanying GitHub repository. At the core of this solution is an agent using Anthropic Claude 3 Haiku on Amazon Bedrock, an LLM that allows the agent to understand user requests based on a given set of instructions and take appropriate action. Docker is used to build components of the CDK application locally, while the CDK utilizes CloudFormation to deploy the solution to AWS.
Instead of having to understand complex data schemas and construct SQL queries, you can simply express queries in natural language, and the agent will interpret your intent and translate it into queries you can run on the AWS Public Blockchain datasets. This lowers the barrier to entry for querying and analyzing blockchain data, making it more accessible to a broader range of users.
The following architecture showcases how the solution simplifies the process of querying blockchain data and effectively handles error recovery.
The workflow includes the following steps:
- Users input queries in natural language, such as “What was the largest Bitcoin transaction yesterday?”
- The agent, using Anthropic Claude 3 Haiku on Amazon Bedrock, processes natural language input, understands the intent, and translates it into a structured SQL query.
- To run the generated SQL query, the agent uses an associated action group. An action group defines the specific actions that an agent can perform, and in this case, it is defined using an OpenAPI schema. The schema describes a POST operation, /athenaQuery, that accepts a request body containing a SQL query.
- The action group calls an AWS Lambda function, a serverless compute service that lets you run code without having to provision or manage servers, which runs the query on Athena, a serverless interactive query service. The generated SQL query is invoked against the relevant AWS Public Blockchain datasets that are stored in the Amazon Simple Storage Service (Amazon S3) AWS Public Blockchain Data bucket, aws-public-blockchain. The Lambda function returns the query results back to the agent’s action group, which are expected to be a ResultSet array containing the rows returned by the query, as defined in the OpenAPI schema.
- The agent formats the response and sends it back to the user.
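The Lambda step in this workflow can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: it assumes the OpenAPI request body exposes a single property named query, that the function returns only the first page of Athena results, and that the results bucket (EXAMPLE-RESULTS-BUCKET) is a placeholder you would replace. The Athena client is passed in as an argument so the helpers can be exercised without AWS access.

```python
import json
import time


def extract_query(event):
    """Pull the SQL string out of the agent's request body.

    Assumption: the OpenAPI request body has one property named "query".
    """
    properties = event["requestBody"]["content"]["application/json"]["properties"]
    for prop in properties:
        if prop["name"] == "query":
            return prop["value"]
    raise ValueError("request body has no 'query' property")


def run_athena_query(athena, sql, output_location, poll_seconds=1.0):
    """Start the query, wait for a terminal state, return the first page of rows."""
    execution_id = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=execution_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        # Surface Athena's error text so the agent can reformulate the query.
        raise RuntimeError(status.get("StateChangeReason", status["State"]))
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    return [[col.get("VarCharValue") for col in row["Data"]] for row in rows]


def build_response(event, status_code, body):
    """Wrap a body in the response envelope an OpenAPI action group expects."""
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": status_code,
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }


def handle(event, athena, output_location):
    try:
        rows = run_athena_query(athena, extract_query(event), output_location)
        return build_response(event, 200, {"ResultSet": rows})
    except Exception as exc:
        return build_response(event, 400, {"error": str(exc)})


def lambda_handler(event, context):
    import boto3  # imported lazily so the helpers above run without AWS installed

    # Placeholder output location; replace with your own results bucket.
    return handle(event, boto3.client("athena"), "s3://EXAMPLE-RESULTS-BUCKET/athena/")
```

Returning the Athena error text in the response body, rather than raising, is what gives the agent something to reason about when a query fails.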
The solution’s AWS CDK application additionally deploys an AWS CloudFormation stack that creates AWS Glue tables and partition definitions in the AWS Glue Data Catalog, which serves as a centralized repository for metadata about the blockchain data stored in Amazon S3. By defining the schema and partitioning structure of the datasets, the AWS Glue tables provide a logical abstraction layer that allows Athena to efficiently query and analyze the underlying data. The CloudFormation template also deploys a Lambda function that runs daily to update the partition definitions in the Data Catalog, so the data is up to date.
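One possible way to keep partitions current is to issue an ALTER TABLE statement per table each day. The sketch below only builds those statements. It assumes the datasets are laid out as s3://aws-public-blockchain/v1.0/&lt;chain&gt;/&lt;table&gt;/date=YYYY-MM-DD/ and partitioned by a date column; the function names are hypothetical, and the repository's Lambda function may use a different mechanism entirely.

```python
from datetime import date

# Assumed layout of the public dataset; verify against the bucket before use.
BUCKET_PREFIX = "s3://aws-public-blockchain/v1.0"


def add_partition_sql(database, table, day):
    """Build an Athena DDL statement that registers one daily partition."""
    iso_day = day.isoformat()
    location = f"{BUCKET_PREFIX}/{database}/{table}/date={iso_day}/"
    # Backticks guard against `date` being treated as a reserved word in DDL.
    return (
        f"ALTER TABLE {database}.{table} ADD IF NOT EXISTS "
        f"PARTITION (`date` = '{iso_day}') LOCATION '{location}'"
    )


def daily_partition_statements(tables, today=None):
    """Return one statement per (database, table) pair for the given day."""
    day = today or date.today()
    return [add_partition_sql(db, tbl, day) for db, tbl in tables]
```

A scheduled function could then pass each statement to Athena, for example `daily_partition_statements([("btc", "blocks"), ("eth", "blocks")])`.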
One of the key strengths of this solution is its robust error-handling mechanism. In the event that the initial SQL query fails due to syntax errors, missing tables, or other issues, the agent doesn’t simply return an error message to the user. Instead, it analyzes the error feedback, identifies the root cause, and autonomously reformulates the query to address the problem.
This iterative process continues until a valid query runs successfully or the retry limit is reached, so most failures are resolved without user intervention, even for complex or problematic queries. If the agent is unable to generate a valid query after multiple attempts, it informs the user that it cannot assist with the given prompt.
It’s important to note that accurately translating complex natural language queries to SQL can be a challenge for LLMs. To improve the accuracy and reliability of the solution, the agent references contextual information about the dataset schemas.
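In the solution itself, this loop is driven by the agent's orchestration and instruction rather than by hand-written code. Purely to illustrate the control flow, the sketch below expresses the same idea in Python. The callables generate_sql and run_query are hypothetical stand-ins for the model call and the Athena call, and the default of three attempts is an assumption.

```python
def answer_with_retries(question, generate_sql, run_query, max_attempts=3):
    """Generate a query, run it, and feed any error back into the next attempt."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        # The generator sees the previous error (if any) and can correct itself.
        sql = generate_sql(question, feedback)
        try:
            return {"sql": sql, "rows": run_query(sql), "attempts": attempt}
        except Exception as exc:
            feedback = f"Query failed: {sql!r}. Error: {exc}"
    # Retry budget exhausted: report the last error instead of looping forever.
    return {"sql": None, "rows": None, "attempts": max_attempts, "error": feedback}
```

Capping the number of attempts matters here because every failed attempt can still incur Athena and model invocation costs.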
Key components of the agent instruction
The agent instruction, which can be found in the GitHub repo, plays a crucial role in enabling the agent to generate accurate and optimized SQL queries tailored for the AWS Public Blockchain datasets. It covers several key aspects:
- Query generation – The agent analyzes the user’s natural language request and generates an appropriate SQL query based on the provided schemas from the Data Catalog for the Bitcoin (BTC) and Ethereum (ETH) databases. It identifies the relevant blockchain (Bitcoin or Ethereum) and constructs the query using the correct table names and field references.
- Error handling – If the initial SQL query generated by the agent results in an error when run on Athena, the agent is instructed to follow the resolution process described earlier. This process is repeated until a valid query is generated or the maximum number of retry attempts is reached.
- Result handling – The agent returns the results fetched from the SQL query. If the result set is empty, the agent specifies that there were no results. It also makes sure scientific notation values are properly returned.
- Schema awareness – The agent is aware of the specific schemas for the Bitcoin and Ethereum databases, including the table names and field structures. It uses the appropriate prefixes (btc for Bitcoin and eth for Ethereum) when referencing tables and fields.
- Advanced query handling – The agent is equipped with additional guidelines for handling specific query patterns and data structures. For example, it knows how to handle array structures in the Bitcoin database using the UNNEST keyword, how to perform date comparisons and time range calculations, and how to handle token addresses in the Ethereum database using the lower function.
- Schemas and sample queries – The agent instruction references the advanced orchestration prompt, which incorporates additional information about the Bitcoin and Ethereum schemas, as well as several sample queries. The prompt guides the agent through interpreting user input, invoking action groups, and generating responses, using the provided context to enhance the accuracy and relevance of generated SQL queries.
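To make a few of these guidelines concrete, the sketch below shows the kind of query the agent might produce for an Ethereum token question, plus a helper that renders scientific notation as a plain number. The table name eth.token_transfers, the columns token_address and date, and both function names are assumptions for illustration and should be checked against the Data Catalog schemas.

```python
import re
from datetime import date, timedelta
from decimal import Decimal


def token_transfer_query(token_address, days, today=None):
    """Count transfers of one token over the last `days` days."""
    # Reject anything that is not a 20-byte hex address before interpolating it.
    if not re.fullmatch(r"0x[0-9a-fA-F]{40}", token_address):
        raise ValueError("expected a 0x-prefixed 40-character hex address")
    start = (today or date.today()) - timedelta(days=days)
    return (
        "SELECT COUNT(*) AS transfers "
        "FROM eth.token_transfers "
        # lower() on both sides makes the match independent of checksum casing.
        f"WHERE lower(token_address) = '{token_address.lower()}' "
        f"AND date >= '{start.isoformat()}'"
    )


def plain_number(value):
    """Render a value such as '1.5E+3' without scientific notation."""
    return format(Decimal(value), "f")
```

Filtering on the date partition column also limits how much data Athena has to scan, which directly affects query cost.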
Prerequisites
Complete the following prerequisite steps to deploy the solution. It is recommended that you deploy the solution in a dedicated sandbox AWS account. AWS CloudTrail, which is enabled by default, provides monitoring and auditing capabilities for your account. Additionally, make sure that you have properly configured AWS Identity and Access Management (IAM) permissions, limiting access of the deployment to specific users with the necessary permissions.
- Install the latest version of the AWS CDK:
- Make sure Docker is installed and running.
- Clone the GitHub repo:
- Change into the project directory and then install the necessary dependencies with the following command:
- Configure your AWS profile:
- If this is your first time deploying an AWS CDK stack in this AWS account, you need to first bootstrap your environment by running the following command:
- Deploy the AWS CDK stack:
It takes approximately 2 minutes for the stack to be deployed.
Note: Due to the Glue Catalog synchronization process, it will take approximately 4-5 minutes for the Ethereum data to become available after the initial deployment.
Test the agent
If this is your first time using Amazon Bedrock, choose Model access in the navigation pane on the Amazon Bedrock console and enable access for Anthropic Claude 3 Haiku on Amazon Bedrock, as shown below:
Now that the AWS CDK stack has been deployed, you can test the solution.
- On the Amazon Bedrock console, choose Agents under Builder Tools in the navigation pane.
- Choose the newly created agent that starts with the name bedrock-agent-*.
- Enter a natural language question in the Test prompt window and choose Run.
The following screenshot illustrates an example prompt and the corresponding output generated by the model.
You can use the following sample questions as a starting point, but be sure to test the agent with your own questions.
The following are sample questions regarding Bitcoin:
- Find the total number of Bitcoin transactions that occurred in the last 24 hours.
- Identify the largest BTC transaction in the last hour.
- Calculate the average block size over the last day.
- What is Satoshi’s message that he stored in the genesis block?
The following are sample questions regarding Ethereum:
- Get the number of new Ethereum contracts created in the last week.
- Find the most active Ethereum address in the last 7 days.
- Calculate the total value of all token transfers in the Ethereum network in the last 30 days.
- Identify the largest Ethereum transaction in the last month.
Feel free to experiment with different queries and prompts to explore the full capabilities of the agent and the AWS Public Blockchain datasets.
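Beyond the console, you can send the same questions programmatically. The following is a minimal sketch that assumes you pass in a boto3 bedrock-agent-runtime client along with your own agent ID and alias ID; the function name ask_agent is hypothetical.

```python
import uuid


def ask_agent(client, agent_id, agent_alias_id, question, session_id=None):
    """Send one question to the agent and return the concatenated text answer."""
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        # Reuse a session ID across calls to keep conversational context.
        sessionId=session_id or str(uuid.uuid4()),
        inputText=question,
    )
    parts = []
    # The completion is an event stream; text arrives in "chunk" events.
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)


# Example usage (requires AWS credentials and a deployed agent):
#   import boto3
#   client = boto3.client("bedrock-agent-runtime")
#   print(ask_agent(client, "AGENT_ID", "AGENT_ALIAS_ID",
#                   "Identify the largest BTC transaction in the last hour."))
```

Passing enableTrace=True to invoke_agent additionally returns trace events in the stream, which this sketch ignores.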
Clean up
To avoid incurring future charges, delete the resources you created by running the following AWS CDK command from the root of the directory:
cdk destroy
Key observations
When developing this solution, we made several discoveries that highlight the capabilities enabled by natural language querying of blockchain data:
- The agent can automatically convert hexadecimal values in Bitcoin blocks to readable text. For example, the coinbase parameter can be used by miners to include a short message in a block. The agent can decode Satoshi Nakamoto’s famous “The Times 03/Jan/2009 Chancellor on brink of second bailout for banks” message that is embedded in the genesis block.
- Foundation models have inherent knowledge of popular smart contract addresses in the Ethereum network. When constructing a query, the agent is able to recognize and reference common addresses correctly without requiring an associated knowledge base.
- We found that the most effective way to handle errors and exceptions when querying the blockchain data was to continuously refine the agent instruction. This iterative process allowed the agent to recover from various issues, significantly enhancing the overall robustness and reliability of the solution. Each response from an agent is accompanied by a trace that details the steps orchestrated by the agent. The trace helps you follow the agent’s reasoning process, making it very useful for debugging or troubleshooting purposes.
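To show what the hexadecimal decoding in the first observation involves, here is a small Python sketch. It assumes the coinbase script is available as a hex string and that it contains only direct data pushes of 1 to 75 bytes, which is the case for the genesis block. The agent itself would do the equivalent inside a SQL query; the function names here are hypothetical.

```python
def script_pushes(script_hex):
    """Split a script into its pushed data items (direct pushes of 1-75 bytes only)."""
    data = bytes.fromhex(script_hex)
    pushes, i = [], 0
    while i < len(data):
        length = data[i]  # opcodes 0x01-0x4b mean "push the next N bytes"
        if not 1 <= length <= 75:
            raise ValueError(f"unsupported opcode 0x{length:02x} at byte {i}")
        pushes.append(data[i + 1 : i + 1 + length])
        i += 1 + length
    return pushes


def coinbase_message(script_hex):
    """Decode the last pushed item as ASCII text."""
    return script_pushes(script_hex)[-1].decode("ascii")


MESSAGE = "The Times 03/Jan/2009 Chancellor on brink of second bailout for banks"
# Genesis coinbase script: two short pushes, then the message itself.
# The hex is built from MESSAGE here to avoid transcription errors.
GENESIS_COINBASE = "04ffff001d0104" + f"{len(MESSAGE):02x}" + MESSAGE.encode("ascii").hex()

print(coinbase_message(GENESIS_COINBASE))
```

Parsing the push lengths, rather than simply filtering for printable bytes, avoids picking up the length byte 0x45, which happens to be the ASCII code for "E".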
These findings demonstrate the powerful capabilities of agents in understanding and interacting with blockchain data in a natural and intuitive way.
Extending the solution
Although this solution has demonstrated the power of natural language querying for the AWS Public Blockchain datasets, you can extend the capabilities further to integrate with additional data sources. Two promising avenues for expansion are Amazon Managed Blockchain (AMB) Query and The Graph.
AMB Query provides serverless access to historical token balances, transaction data, and more. During our testing, we found that retrieving balance information for a single Bitcoin address using an Athena query on the AWS Public Blockchain dataset required scanning 1.15 TB of data, which had a runtime of 40 seconds and an associated cost of approximately $6 USD. The reason for this high cost is that the AWS Public Blockchain dataset is stored in its raw form, without any indexing or optimizations for specific queries. As a result, Athena must scan the entire dataset to retrieve the requested information, leading to long runtimes and high costs, especially for queries that involve large amounts of data or complex computations.
In contrast, AMB Query can retrieve the same balance information in milliseconds, with a much lower cost of $0.000007 USD per request (or $7 USD per million requests). AMB Query uses specialized indexing to optimize access to blockchain data, resulting in significantly faster and more cost-effective retrieval of information.
It’s important to be aware of the potential costs associated with running complex or data-intensive queries on the raw dataset using Athena. If you plan to perform multiple balance or transaction queries, it may incur substantial costs due to the need to scan large portions of the dataset. In such cases, it is more cost-effective to consider alternative solutions like AMB Query.
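The cost comparison can be reproduced with a few lines of arithmetic. The sketch below assumes an Athena rate of $5 per TB scanned, which is consistent with the roughly $6 figure quoted for 1.15 TB but varies by Region, and uses the AMB Query price of $7 per million requests stated above.

```python
# Assumption: on-demand Athena price per TB scanned; check your Region's pricing.
ATHENA_USD_PER_TB = 5.00
# From the post: $7 USD per million AMB Query requests.
AMB_QUERY_USD_PER_REQUEST = 7 / 1_000_000


def athena_cost(tb_scanned):
    """Cost in USD of one Athena query that scans `tb_scanned` terabytes."""
    return tb_scanned * ATHENA_USD_PER_TB


single_lookup = athena_cost(1.15)
equivalent_requests = single_lookup / AMB_QUERY_USD_PER_REQUEST

print(f"Athena balance lookup: ${single_lookup:.2f}")
print(f"AMB Query requests for the same spend: {equivalent_requests:,.0f}")
```

Under these assumptions, a single full-scan balance lookup costs about as much as several hundred thousand AMB Query requests.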
This difference in latency and cost highlights the potential benefits of extending this solution to use AMB Query as an additional data source. This would allow for the seamless transition between querying the public blockchain datasets and the more optimized responses through AMB Query, all through the same natural language interface.
Another area of exploration is integrating with The Graph, a decentralized protocol for indexing and querying blockchain data. By integrating the agent with The Graph, users could ask natural language questions related to specific smart contracts and their associated data. For example, you could ask the agent questions about the various liquidity pools on the Uniswap decentralized exchange, and have it generate the appropriate GraphQL queries to retrieve the relevant information.
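As a rough idea of what such an integration would generate, the sketch below builds a GraphQL request for the largest liquidity pools. The field names (pools, token0, token1, totalValueLockedUSD) are assumptions modeled on a Uniswap v3 style subgraph, the endpoint is a placeholder, and the helper names are hypothetical. It uses only the Python standard library.

```python
import json
import urllib.request

# Assumed schema, modeled on a Uniswap v3 style subgraph; verify before use.
POOLS_QUERY = """
query TopPools($first: Int!) {
  pools(first: $first, orderBy: totalValueLockedUSD, orderDirection: desc) {
    id
    token0 { symbol }
    token1 { symbol }
    totalValueLockedUSD
  }
}
"""


def build_graphql_request(endpoint, query, variables):
    """Package a GraphQL query and its variables as an HTTP POST request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_graphql(request):
    """Send the request and return the "data" portion of the response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["data"]


# Example usage with a placeholder endpoint:
#   req = build_graphql_request("https://example.com/subgraphs/uniswap", POOLS_QUERY, {"first": 5})
#   print(run_graphql(req))
```

In an agent, this call would sit behind a second action group, with the model generating the query text in place of the fixed POOLS_QUERY string.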
By incorporating additional data sources, this solution can provide users with an even more comprehensive and cost-effective way to perform cross-chain analytics by chatting with blockchain data. The flexibility to integrate with various data providers further enhances the value and versatility of this approach.
Lastly, as you consider expanding on this solution, it is recommended to implement Guardrails for Amazon Bedrock and associate a guardrail with your agent. This feature allows you to establish safeguards, such as detecting and blocking potentially malicious user inputs that attempt to override or manipulate the agent instruction. Additionally, it is advisable to research best practices for prompt injection security. Together, these measures help mitigate the risks associated with prompt injection attacks and protect the integrity and reliability of the solution.
Conclusion
In this post, we covered how you can use Agents for Amazon Bedrock to enable natural language queries on the AWS Public Blockchain datasets. This solution allows you to gain insights from blockchain data in a natural and conversational manner, without the need for deep technical expertise. We discussed the key components of the solution and how it can be extended. As a next step, you can try deploying the GitHub repository in your AWS account. Let us know in the comments section if you have any questions.
About the Authors
Simon Goldberg is a Blockchain/Web3 Specialist Solutions Architect at AWS. Outside of work, he enjoys music production, reading, climbing, tennis, hiking, attending concerts, and researching Web3 technologies.
Emile Baizel is a Senior Blockchain Architect at AWS. He has been working with blockchain technology since 2018 when he participated in his first Ethereum hackathon. He didn’t win but he got hooked. He specializes in blockchain node infrastructure, digital custody and wallets, and smart contract security.
Autrin Abdi is an AI/ML Specialist Solutions Architect at AWS. He is dedicated to constantly learning more about the field to ensure the most up-to-date and optimal solutions are being provided for AWS customers. Outside of work he enjoys playing soccer and weightlifting.
Ayman Kazmi is an AI/ML Specialist Solutions Architect at AWS. Outside of work, he enjoys writing, playing soccer, and exploring the future of Generative AI.