AWS Storage Blog
Use generative AI to query your Amazon S3 data lake for insights
Businesses store large volumes of data in their data lakes and rely on this data to extract insights and make important business decisions. However, business stakeholders sometimes lack the technical skills required to run complex queries against their data lakes. Instead, they rely on data scientists or analysts to build reports and dashboards or to manually craft SQL queries. Additionally, as organizations store more data across a wide variety of domains (for example, marketing, supply chain, and finance), efficiently finding the most relevant data to answer an analytical question becomes key to delivering faster insights. This is particularly true for organizations with data spread across hundreds of thousands of tables.
Now, with generative artificial intelligence (AI), business stakeholders can analyze trends and unearth insights in their data lakes by asking questions in natural language. With enterprise-grade security and privacy, access to industry-leading FMs, and generative AI-powered applications, AWS makes it easy to build and scale generative AI customized for your data.
In this post, we walk through an example to demonstrate how you can gain insights from an Amazon S3 data lake using natural language prompts. Instead of spending time building custom reports, you can simply ask “what is the average star rating for the top 25 products by revenue?” or “how do user ratings and feedback differ across product categories?” Then, this solution uses your data lake to answer those questions, enabling data-driven decision-making without writing code.
Solution overview
First, we set up our own data lake using Amazon Simple Storage Service (Amazon S3) and process the sample text data using Amazon EMR Serverless. Then, we register the processed dataset as a table in the AWS Glue Data Catalog. We also use AWS Glue to perform data quality checks. Finally, we use a foundation model (FM) and agents to automatically convert analytical questions into SQL, query the data lake using Amazon Athena, and respond with the query results in natural language. For this last step, we use Amazon Bedrock, a fully managed service that offers a choice of high-performing FMs from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. Amazon Bedrock comes with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
The following figure shows the complete design. The GitHub repository contains CDK automation to deploy the infrastructure and services.
Prerequisites
You need an AWS account and access to the AWS Glue Data Catalog, Amazon SageMaker, Amazon EMR, Amazon S3, Amazon Athena, and Amazon Bedrock (with model access enabled for Anthropic's Claude). You also need the AWS CDK framework installed.
Building the data lake on Amazon S3
For this example, we use the Amazon Review v2 dataset, which has about 233 million user reviews covering the years 1996 to 2018. This dataset was originally described in the paper “Justifying recommendations using distantly-labeled reviews and fine-grained aspects” by Jianmo Ni, Jiacheng Li, and Julian McAuley. The dataset is a large gzip-compressed JSON lines (JSONL) file. To prepare the data lake, we decompress the file and upload it to a raw data staging area in Amazon S3. In the GitHub repository, these steps are in the notebook `dataset.ipynb`, which you can run in the SageMaker Studio domain created by the accompanying CDK stack. See the GitHub repository for detailed instructions. Decompressing the file significantly speeds up later processing steps, because gzip is not a splittable file format for Spark jobs.
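As a sketch of this staging step, decompression can be done with Python's `gzip` module, followed by a boto3 upload. The file, bucket, and key names here are illustrative, not taken from the post; the notebook `dataset.ipynb` performs the real steps.

```python
import gzip
import shutil

def decompress_gzip(src_path: str, dst_path: str) -> str:
    """Stream-decompress a .gz file to disk without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path

# Hypothetical usage (bucket and key names are placeholders):
#   decompress_gzip("All_Amazon_Review.json.gz", "All_Amazon_Review.json")
#   import boto3
#   boto3.client("s3").upload_file("All_Amazon_Review.json",
#                                  "my-data-lake-bucket",
#                                  "raw/reviews/All_Amazon_Review.json")
```

Streaming the decompression with `shutil.copyfileobj` matters here because the decompressed file is far too large to hold in memory.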
Next, we run a Spark data preparation job to make the data easier to query. This job drops columns that we do not need, converts the reviews from JSON to the Parquet storage format, and registers the processed dataset as a table in the AWS Glue Data Catalog. In the GitHub repository, these steps are in the notebook `review-analysis.ipynb`, which you can run in EMR Serverless through EMR Studio. The CDK stack creates the EMR Serverless application and EMR Studio workspace for you.
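At small scale, the core transformation the Spark job performs (keep only the columns we need, then write Parquet) can be sketched with pandas. The column names follow the Amazon Review v2 schema, but treat this as an illustration rather than the notebook's actual job.

```python
import io
import pandas as pd

# Columns we keep; names follow the Amazon Review v2 schema.
KEEP = ["asin", "overall", "reviewText", "summary", "unixReviewTime"]

def prepare_reviews(jsonl: str) -> pd.DataFrame:
    """Parse JSON-lines review records and drop every column not in KEEP."""
    df = pd.read_json(io.StringIO(jsonl), lines=True)
    return df[[c for c in KEEP if c in df.columns]]

# The real job then writes Parquet and registers the table in the Glue Data
# Catalog, roughly (needs pyarrow and s3fs; path is a placeholder):
#   df.to_parquet("s3://my-data-lake-bucket/processed/reviews/")
```

Parquet's columnar layout is what makes the later Athena queries fast and cheap: queries scan only the columns they reference instead of the full JSON records.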
With our table created in AWS Glue Data Catalog, we can now perform data quality checks.
Checking data quality
Checking the data quality is important for making well-informed decisions. Although various tools exist to evaluate data accuracy and freshness, they often need manual data discovery processes and expertise in data engineering and data science.
To address this, we use AWS Glue Data Quality, a feature of AWS Glue that measures and monitors the data quality of your Amazon S3 data lakes, data warehouses, and other data repositories. AWS Glue Data Quality generates recommendations based on the shape of your data, such as row counts, mean, and standard deviation, and generates a set of rules for you to use as a starting point.
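For illustration, a ruleset in AWS Glue's Data Quality Definition Language (DQDL) for a reviews table might look like the following. The column names and thresholds here are examples, not the recommended rules AWS Glue generates for this dataset.

```
Rules = [
    RowCount > 1000,
    IsComplete "asin",
    IsComplete "overall",
    ColumnValues "overall" between 0 and 6
]
```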
You can access AWS Glue Data Quality in AWS Glue Data Catalog and in AWS Glue ETL jobs. These checks can find anomalies and other data quality issues within a data set. Navigate to the AWS Glue console and select Data Catalog tables in the sidebar. Select the reviews table and go to the Data quality tab, as shown in the following image.
Next, select Recommend rules and launch the job. After a few minutes, AWS Glue comes back with a set of 13 recommended rules, as shown in the following image.
Select Insert rule recommendations and then Run the rules.
After a few minutes, the data quality check completes, as shown in the preceding image. In this case, our data is in good shape – all 13 rules passed! Most importantly, we validated the quality of our data without spending hours setting up manual data quality checks. For more information on AWS Glue Data Quality, see this earlier AWS Glue data catalog post. Now that we’ve validated the quality of our data, we conclude by setting up the generative AI-powered query interface.
Asking questions to our data lake to generate insights
Now we’re ready to use generative AI to answer questions about the data in the data lake. We could use a traditional query tool directly, but with generative AI we can ask our questions in natural language rather than SQL or Python. The solution uses Anthropic’s Claude 3 Sonnet model in Amazon Bedrock. This model excels at complex reasoning tasks such as translating natural language into appropriate SQL queries. Additionally, Anthropic’s Claude models have a 200,000-token context window, which helps with complex tasks that require processing very large volumes of information, such as analyzing extensive query results that contain thousands of rows.

To connect an FM to our data lake and execute multi-step tasks, we use the LangChain open source framework. As we ask questions, LangChain uses Claude to translate the analytical questions into SQL code. Then, LangChain invokes an agent that executes the query against Amazon Athena. Finally, LangChain invokes the model to assemble a final answer from the query results returned by Athena. The following figure shows the user journey for this experience.
The notebook `athena-nlp.ipynb` shows the solution and includes some sample queries. This notebook runs in SageMaker Studio. To build the LangChain SQL agent, we first define the Amazon Bedrock model we want to use.
import boto3
from langchain.llms import Bedrock  # langchain_community.llms in newer versions

# Amazon Bedrock runtime client used to invoke the model
boto3_bedrock = boto3.client(
    service_name="bedrock-runtime"
)

# Claude 3 models use the Bedrock Messages API, so they require a chat model
# class in newer LangChain versions; this snippet uses the Claude v2.1 text
# completion model ID.
modelId = "anthropic.claude-v2:1"
llm = Bedrock(
    model_id=modelId,
    client=boto3_bedrock,
    streaming=True,
    model_kwargs={"max_tokens_to_sample": 2048, "temperature": 0.75},
)
Next, we register Athena as a database engine for the FM to use.
from sqlalchemy import create_engine  # uses the PyAthena driver
from langchain.sql_database import SQLDatabase  # langchain_community.utilities in newer versions

connathena = "athena.us-east-1.amazonaws.com"  # Athena endpoint for your Region
portathena = "443"        # update if your port is different
schemaathena = "default"  # database (schema) name from Amazon Athena
s3stagingathena = "s3://bucket/athena/user/"   # S3 staging location for query results
wkgrpathena = "primary"   # Athena workgroup

connection_string = f"awsathena+rest://@{connathena}:{portathena}/{schemaathena}?s3_staging_dir={s3stagingathena}/&work_group={wkgrpathena}"
engine_athena = create_engine(connection_string, echo=False)
db = SQLDatabase(engine_athena)
Finally, we define our SQL agent.
from langchain.agents import AgentType, create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit

toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent_executor = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)
We are now ready to ask some analytical questions about our data. Here’s an example of a simple question about how many negative reviews we received during a certain month.
agent_executor.run("How many reviews came in during the month of October 2014 where the overall score was 3 or less?")
'599,787 reviews came in during the month of October 2014 where the overall score was 3 or less.'
We can also perform more complex analyses, such as identifying review trends by parsing through the actual review content. To do this, we can use the FM to inspect and summarize the review text.
agent_executor.run("Find reviews from October 2014 with an overall score of 1 or less. Then pick five examples and summarize the review text.")
'Very thin and sheer leggings, too large scrubs with small pockets, extra small size still too big, one review claiming no delivery, and another review saying scrubs too big compared to other brands. Most complaints related to sizing not meeting expectations and material being too thin/sheer.'
Depending on the type of problem you are trying to solve, you could consider using generative AI to moderate incoming review text before it is stored or displayed to other users. Additionally, you can use Guardrails for Amazon Bedrock, which allows you to define a set of topics to avoid within the context of your application. Guardrails detects and blocks user inputs and FM responses that fall into the restricted topics.
Cleaning up
To remove the AWS resources associated with this example and avoid unwanted charges to your account, delete the CDK stack.
Conclusion
In this post, we demonstrated how to empower business stakeholders to ask questions about a data lake using generative AI. We built a data lake in Amazon S3 using an Amazon Review dataset, performed some typical data preparation, inspected data quality, and connected an FM to the data. Users can then ask questions in natural language, and the FM uses Amazon Athena to run queries that answer those questions. The FM can also summarize and derive insights from the text in the data lake.
The true value of a data lake lies in its ability to empower data-driven decision-making and foster a culture of data literacy across an organization. Whether you are a marketer, data scientist, or product manager, this solution helps you obtain insights from your data lake without writing code or needing to have data engineering expertise. Download the sample code from the GitHub repository and start exploring your own data today. Embrace the power of generative AI, and let the data speak for itself.
Thank you for reading this post. Learn more about unlocking the value of your data as your differentiator. If you have any comments, leave them in the comments section.