AWS Big Data Blog
Enrich, standardize, and translate streaming data in Amazon Redshift with generative AI
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze your data. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.
Amazon Redshift ML is a feature of Amazon Redshift that enables you to build, train, and deploy machine learning (ML) models directly within the Redshift environment. Now, you can use pretrained publicly available large language models (LLMs) in Amazon SageMaker JumpStart as part of Redshift ML, allowing you to bring the power of LLMs to analytics. You can use pretrained publicly available LLMs from leading providers such as Meta, AI21 Labs, LightOn, Hugging Face, Amazon Alexa, and Cohere as part of your Redshift ML workflows. By integrating with LLMs, Redshift ML can support a wide variety of natural language processing (NLP) use cases on your analytical data, such as text summarization, sentiment analysis, named entity recognition, text generation, language translation, data standardization, data enrichment, and more. Through this feature, the power of generative artificial intelligence (AI) and LLMs is made available to you as simple SQL functions that you can apply on your datasets. The integration is designed to be simple to use and flexible to configure, allowing you to take advantage of the capabilities of advanced ML models within your Redshift data warehouse environment.
In this post, we demonstrate how Amazon Redshift can act as the data foundation for your generative AI use cases by enriching, standardizing, cleansing, and translating streaming data using natural language prompts and the power of generative AI. In today’s data-driven world, organizations often ingest real-time data streams from various sources, such as Internet of Things (IoT) devices, social media platforms, and transactional systems. However, this streaming data can be inconsistent, contain missing values, and arrive in non-standard formats, presenting significant challenges for downstream analysis and decision-making processes. By harnessing the power of generative AI, you can seamlessly enrich and standardize streaming data after ingesting it into Amazon Redshift, resulting in high-quality, consistent, and valuable insights. Generative AI models can derive new features from your data and enhance decision-making. This enriched and standardized data can then facilitate accurate real-time analysis, improved decision-making, and enhanced operational efficiency across various industries, including ecommerce, finance, healthcare, and manufacturing. For this use case, we use the Meta Llama-3-8B-Instruct LLM to demonstrate how to integrate it with Amazon Redshift to streamline the process of data enrichment, standardization, and cleansing.
Solution overview
The following diagram illustrates how you can use Redshift ML to integrate with LLMs to enrich, standardize, and cleanse streaming data. The process starts with raw streaming data coming from Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK), which is materialized in Amazon Redshift as raw data. User-defined functions (UDFs) are then applied to the raw data, which invoke an LLM deployed on SageMaker JumpStart to enrich and standardize the data. The enhanced, cleansed data is then stored back in Amazon Redshift, ready for accurate real-time analysis, improved decision-making, and enhanced operational efficiency.
To deploy this solution, we complete the following steps:
- Choose an LLM for the use case and deploy it using foundation models (FMs) in SageMaker JumpStart.
- Use Redshift ML to create a model referencing the SageMaker JumpStart LLM endpoint.
- Create a materialized view to load the raw streaming data.
- Call the model function with prompts to transform the data and view results.
Example data
The following code shows a representative example of raw order data from the stream (the values are illustrative, reconstructed from order 101 in the tables that follow):
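```json
{
  "orderid": 101,
  "email": "john. roe @example.com",
  "phone": "+44-1234567890",
  "address": "123 Elm Street, London",
  "comment": "please cancel if items are out of stock"
}
```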
The raw data has inconsistent formatting for email and phone numbers, the address is incomplete and doesn’t have a country, and comments are in various languages. To address the challenges with the raw data, we can implement a comprehensive data transformation process using Redshift ML integrated with an LLM in an ETL workflow. This approach can help standardize the data, cleanse it, and enrich it to meet the desired output format.
The following table shows an example of enriched address data.
| orderid | Address | Country (Identified using LLM) |
| --- | --- | --- |
| 101 | 123 Elm Street, London | United Kingdom |
| 102 | 123 Main St, Chicago, 12345 | USA |
| 103 | Musterstraße, Bayern 00000 | Germany |
| 104 | 000 main st, los angeles, 11111 | USA |
| 105 | 000 Jean Allemane, paris, 00000 | France |
The following table shows an example of standardized email and phone data.
| orderid | email | cleansed_email (Using LLM) | Phone | Standardized Phone (Using LLM) |
| --- | --- | --- | --- | --- |
| 101 | john. roe @example.com | john.roe@example.com | +44-1234567890 | +44 1234567890 |
| 102 | jane.s mith @example.com | jane.smith@example.com | (123)456-7890 | +1 1234567890 |
| 103 | max.muller @example.com | max.muller@example.com | 498912345678 | +49 8912345678 |
| 104 | julia @example.com | julia@example.com | (111) 4567890 | +1 1114567890 |
| 105 | roberto @example.com | roberto@example.com | +33 3 44 21 83 43 | +33 344218343 |
The following table shows an example of translated and enriched comment data.
| orderid | Comment | english_comment (Translated using LLM) | comment_language (Identified by LLM) |
| --- | --- | --- | --- |
| 101 | please cancel if items are out of stock | please cancel if items are out of stock | English |
| 102 | Include a gift receipt | Include a gift receipt | English |
| 103 | Bitte nutzen Sie den Expressversand | Please use express shipping | German |
| 104 | Entregar a la puerta | Leave at door step | Spanish |
| 105 | veuillez ajouter un emballage cadeau | Please add a gift wrap | French |
Prerequisites
Before you implement the steps in the walkthrough, make sure you have the following prerequisites:
- An AWS account.
- An Amazon Redshift Serverless workgroup or an Amazon Redshift provisioned cluster. For setup instructions, refer to Creating a workgroup with a namespace or Create a sample Amazon Redshift cluster, respectively.
- A Kinesis data stream. For this example, we create a data stream called `customer_orders` with on-demand capacity mode.
- Sample data in JSON format generated using the Amazon Kinesis Data Generator (KDG) utility and sent to the `customer_orders` data stream. Use the source records in the example code as a template to generate the data. For more information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.
Choose an LLM and deploy it using SageMaker JumpStart
Complete the following steps to deploy your LLM:
- On the SageMaker JumpStart console, choose Foundation models in the navigation pane.
- Search for your FM (for this post, `Meta-Llama-3-8B-Instruct`) and choose View model.
- On the Model details page, review the End User License Agreement (EULA) and choose Open notebook in Studio to start using the notebook in Amazon SageMaker Studio.
- In the Select domain and user profile pop-up, choose a profile, then choose Open Studio.
- When the notebook opens, in the Set up notebook environment pop-up, choose t3.medium or another instance type recommended in the notebook, then choose Select.
- Modify the notebook cell that has `accept_eula = False` to `accept_eula = True`.
- Select and run the first five cells (see the highlighted sections in the following screenshot) using the run icon.
- After you run the fifth cell, choose Endpoints under Deployments in the navigation pane, where you can see the endpoint created.
- Copy the endpoint name and wait until the endpoint status is In Service.
It can take 30–45 minutes for the endpoint to be available.
Use Redshift ML to create a model referencing the SageMaker JumpStart LLM endpoint
In this step, you create a model using Redshift ML and its bring your own model (BYOM) capability. After the model is created, you can use the model’s output function to run remote inference against the LLM. To create a model in Amazon Redshift for the LLM endpoint you created previously, complete the following steps:
- Log in to the Redshift endpoint using the Amazon Redshift Query Editor V2.
- Make sure you have the following AWS Identity and Access Management (IAM) policy added to the default IAM role. Replace <endpointname> with the SageMaker JumpStart endpoint name you captured earlier:
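A minimal policy sketch follows; the Resource ARN placeholders (Region and account ID) are stand-ins you need to fill in for your environment:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "arn:aws:sagemaker:<region>:<account-id>:endpoint/<endpointname>"
        }
    ]
}
```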
- In the query editor, run the following SQL statement to create a model in Amazon Redshift. Replace <endpointname> with the endpoint name you captured earlier. Note that the input and return data type for the model is the SUPER data type.
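A sketch of the statement follows; the model and function names (`llm_customer_orders`, `llm_generate_text`) are illustrative choices used in the examples for the rest of this post:

```sql
-- Create a Redshift ML model that references the SageMaker endpoint.
-- Input and return types are SUPER so the function can pass and
-- receive arbitrary JSON payloads.
CREATE MODEL llm_customer_orders
FUNCTION llm_generate_text(SUPER)
RETURNS SUPER
SAGEMAKER '<endpointname>'
IAM_ROLE default;
```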
Create a materialized view to load raw streaming data
Use the following SQL to create a materialized view for the data being streamed through the `customer_orders` stream. The materialized view is set to auto refresh and will be refreshed as data keeps arriving in the stream.
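The following is a minimal sketch, assuming streaming ingestion with the default IAM role; the schema name `kinesis_schema` is an illustrative choice:

```sql
-- Map the Kinesis data streams in this account to an external schema
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE default;

-- Materialize the stream; AUTO REFRESH YES keeps the view updated
-- as new records arrive
CREATE MATERIALIZED VIEW mv_customer_orders AUTO REFRESH YES AS
SELECT
    approximate_arrival_timestamp,
    partition_key,
    shard_id,
    sequence_number,
    JSON_PARSE(kinesis_data) AS payload  -- parse the raw record into SUPER
FROM kinesis_schema."customer_orders";
```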
After you run these SQL statements, the materialized view `mv_customer_orders` will be created and continuously updated as new data arrives in the `customer_orders` Kinesis data stream.
Call the model function with prompts to transform data and view results
Now you can call the Redshift ML LLM model function with prompts to transform the raw data and view the results. The input payload is a JSON document with the prompt and model parameters as attributes:
- Prompt – The prompt is the input text or instruction provided to the generative AI model to generate new content. The prompt acts as a guiding signal that the model uses to produce relevant and coherent output. Each model has unique prompt engineering guidance. Refer to the Meta Llama 3 Instruct model card for its prompt formats and guidance.
- Model parameters – The model parameters determine the behavior and output of the model. With model parameters, you can control the randomness, number of tokens generated, where the model should stop, and more.
In the Invoke endpoint section of the SageMaker Studio notebook, you can find the model parameters and example payloads.
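For example, a payload for the Llama 3 Instruct endpoint might look like the following; the prompt text and parameter values here are illustrative:

```json
{
  "inputs": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTranslate this comment to English: Entregar a la puerta<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
  "parameters": {
    "max_new_tokens": 100,
    "temperature": 0.1,
    "top_p": 0.9
  }
}
```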
The following SQL statement calls the Redshift ML LLM model function with prompts to standardize phone number and email data, identify the country from the address, and translate comments into English while identifying each comment’s original language. The output of the SQL is stored in the table `enhanced_raw_data_customer_orders`.
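A simplified sketch of the pattern for a single attribute follows; the prompt wording and parameter values are illustrative, and the full statement would repeat the same pattern for the phone, country, and comment columns:

```sql
CREATE TABLE enhanced_raw_data_customer_orders AS
SELECT
    payload.orderid::INT   AS orderid,
    payload.email::VARCHAR AS email,
    -- Build the JSON payload (prompt plus model parameters) as a string,
    -- parse it into SUPER, and send it to the LLM endpoint.
    -- Note: a production version should escape characters in the input
    -- that would break the JSON, such as double quotes.
    llm_generate_text(JSON_PARSE(
        '{"inputs": "Standardize this email address and return only the corrected address: '
        || payload.email::VARCHAR ||
        '", "parameters": {"max_new_tokens": 50, "temperature": 0.1}}'
    )) AS cleansed_email
FROM mv_customer_orders;
```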
Query the `enhanced_raw_data_customer_orders` table to view the data. The output of the LLM is in JSON format, with the result in the `generated_text` attribute. It’s stored in the SUPER data type and can be queried using PartiQL:
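A sketch of such a query follows, assuming the endpoint returns its result under a `generated_text` attribute as described above; depending on the endpoint’s response format, the result may instead be wrapped in an array, in which case you would navigate to it with an index such as `cleansed_email[0].generated_text`:

```sql
SELECT
    orderid,
    email,
    -- Navigate into the SUPER column with PartiQL dot notation
    cleansed_email.generated_text::VARCHAR AS cleansed_email
FROM enhanced_raw_data_customer_orders;
```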
The following screenshot shows our output.
Clean up
To avoid incurring future charges, delete the resources you created:
- Delete the LLM endpoint in SageMaker JumpStart by running the cell in the Clean up section in the Jupyter notebook.
- Delete the Kinesis data stream.
- Delete the Redshift Serverless workgroup or Redshift cluster.
Conclusion
In this post, we showed you how to enrich, standardize, and translate streaming data in Amazon Redshift with generative AI and LLMs. Specifically, we demonstrated the integration of the Meta Llama 3 8B Instruct LLM, available through SageMaker JumpStart, with Redshift ML. Although we used the Meta Llama 3 model as an example, you can use a variety of other pre-trained LLM models available in SageMaker JumpStart as part of your Redshift ML workflows. This integration allows you to explore a wide range of NLP use cases, such as data enrichment, content summarization, knowledge graph development, and more. The ability to seamlessly integrate advanced LLMs into your Redshift environment significantly broadens the analytical capabilities of Redshift ML. This empowers data analysts and developers to incorporate ML into their data warehouse workflows with streamlined processes driven by familiar SQL commands.
We encourage you to explore the full potential of this integration and experiment with implementing various use cases that integrate the power of generative AI and LLMs with Amazon Redshift. The combination of the scalability and performance of Amazon Redshift, along with the advanced natural language processing capabilities of LLMs, can unlock new possibilities for data-driven insights and decision-making.
About the authors
Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. She is passionate about data analytics and data science.