AWS Big Data Blog
Enrich, standardize, and translate streaming data in Amazon Redshift with generative AI
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze your data. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.
Amazon Redshift ML is a feature of Amazon Redshift that enables you to build, train, and deploy machine learning (ML) models directly within the Redshift environment. Now, you can use pretrained publicly available large language models (LLMs) in Amazon SageMaker JumpStart as part of Redshift ML, allowing you to bring the power of LLMs to analytics. You can use pretrained publicly available LLMs from leading providers such as Meta, AI21 Labs, LightOn, Hugging Face, Amazon Alexa, and Cohere as part of your Redshift ML workflows. By integrating with LLMs, Redshift ML can support a wide variety of natural language processing (NLP) use cases on your analytical data, such as text summarization, sentiment analysis, named entity recognition, text generation, language translation, data standardization, data enrichment, and more. Through this feature, the power of generative artificial intelligence (AI) and LLMs is made available to you as simple SQL functions that you can apply on your datasets. The integration is designed to be simple to use and flexible to configure, allowing you to take advantage of the capabilities of advanced ML models within your Redshift data warehouse environment.
In this post, we demonstrate how Amazon Redshift can act as the data foundation for your generative AI use cases by enriching, standardizing, cleansing, and translating streaming data using natural language prompts and the power of generative AI. In today’s data-driven world, organizations often ingest real-time data streams from various sources, such as Internet of Things (IoT) devices, social media platforms, and transactional systems. However, this streaming data can be inconsistent, contain missing values, and arrive in non-standard formats, presenting significant challenges for downstream analysis and decision-making processes. By harnessing the power of generative AI, you can seamlessly enrich and standardize streaming data after ingesting it into Amazon Redshift, resulting in high-quality, consistent, and valuable insights. Generative AI models can derive new features from your data and enhance decision-making. This enriched and standardized data can then facilitate accurate real-time analysis, improved decision-making, and enhanced operational efficiency across various industries, including ecommerce, finance, healthcare, and manufacturing. For this use case, we use the Meta Llama-3-8B-Instruct LLM to demonstrate how to integrate it with Amazon Redshift to streamline the process of data enrichment, standardization, and cleansing.
Solution overview
The following diagram illustrates how you can use Redshift ML to integrate with LLMs to enrich, standardize, and cleanse streaming data. The process starts with raw streaming data coming from Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK), which is materialized in Amazon Redshift as raw data. User-defined functions (UDFs) are then applied to the raw data, which invoke an LLM deployed on SageMaker JumpStart to enrich and standardize the data. The enhanced, cleansed data is then stored back in Amazon Redshift, ready for accurate real-time analysis, improved decision-making, and enhanced operational efficiency.
To deploy this solution, we complete the following steps:
- Choose an LLM for the use case and deploy it using foundation models (FMs) in SageMaker JumpStart.
- Use Redshift ML to create a model referencing the SageMaker JumpStart LLM endpoint.
- Create a materialized view to load the raw streaming data.
- Call the model function with prompts to transform the data and view results.
Example data
The following code shows a representative example of raw order data from the stream (the values are illustrative, reconstructed from order 101 in the tables that follow):
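```json
{
  "orderid": 101,
  "email": "john. roe @example.com",
  "phone": "+44-1234567890",
  "address": "123 Elm Street, London",
  "comment": "please cancel if items are out of stock"
}
```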
The raw data has inconsistent formatting for email and phone numbers, the address is incomplete and doesn’t have a country, and comments are in various languages. To address the challenges with the raw data, we can implement a comprehensive data transformation process using Redshift ML integrated with an LLM in an ETL workflow. This approach can help standardize the data, cleanse it, and enrich it to meet the desired output format.
The following table shows an example of enriched address data.
| orderid | Address | Country (Identified using LLM) |
| --- | --- | --- |
| 101 | 123 Elm Street, London | United Kingdom |
| 102 | 123 Main St, Chicago, 12345 | USA |
| 103 | Musterstraße, Bayern 00000 | Germany |
| 104 | 000 main st, los angeles, 11111 | USA |
| 105 | 000 Jean Allemane, paris, 00000 | France |
The following table shows an example of standardized email and phone data.
| orderid | email | cleansed_email (Using LLM) | Phone | Standardized Phone (Using LLM) |
| --- | --- | --- | --- | --- |
| 101 | john. roe @example.com | john.roe@example.com | +44-1234567890 | +44 1234567890 |
| 102 | jane.s mith @example.com | jane.smith@example.com | (123)456-7890 | +1 1234567890 |
| 103 | max.muller @example.com | max.muller@example.com | 498912345678 | +49 8912345678 |
| 104 | julia @example.com | julia@example.com | (111) 4567890 | +1 1114567890 |
| 105 | roberto @example.com | roberto@example.com | +33 3 44 21 83 43 | +33 344218343 |
The following table shows an example of translated and enriched comment data.
| orderid | Comment | english_comment (Translated using LLM) | comment_language (Identified by LLM) |
| --- | --- | --- | --- |
| 101 | please cancel if items are out of stock | please cancel if items are out of stock | English |
| 102 | Include a gift receipt | Include a gift receipt | English |
| 103 | Bitte nutzen Sie den Expressversand | Please use express shipping | German |
| 104 | Entregar a la puerta | Leave at door step | Spanish |
| 105 | veuillez ajouter un emballage cadeau | Please add a gift wrap | French |
Prerequisites
Before you implement the steps in the walkthrough, make sure you have the following prerequisites:
- An AWS account.
- An Amazon Redshift Serverless workgroup or an Amazon Redshift provisioned cluster. For setup instructions, refer to Creating a workgroup with a namespace or Create a sample Amazon Redshift cluster, respectively.
- A Kinesis data stream. For this example, we create a data stream called `customer_orders` with on-demand capacity mode.
- Sample data in JSON format generated using the Amazon Kinesis Data Generator (KDG) utility and sent to the `customer_orders` data stream. Use the source records in the example code as a template to generate the data. For more information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.
Choose an LLM and deploy it using SageMaker JumpStart
Complete the following steps to deploy your LLM:
- On the SageMaker JumpStart console, choose Foundation models in the navigation pane.
- Search for your FM (for this post, `Meta-Llama-3-8B-Instruct`) and choose View model.
- On the Model details page, review the End User License Agreement (EULA) and choose Open notebook in Studio to start using the notebook in Amazon SageMaker Studio.
- In the Select domain and user profile pop-up, choose a profile, then choose Open Studio.
- When the notebook opens, in the Set up notebook environment pop-up, choose t3.medium or another instance type recommended in the notebook, then choose Select.
- Modify the notebook cell that has `accept_eula = False` to `accept_eula = True`.
- Select and run the first five cells (see the highlighted sections in the following screenshot) using the run icon.
- After you run the fifth cell, choose Endpoints under Deployments in the navigation pane, where you can see the endpoint created.
- Copy the endpoint name and wait until the endpoint status is In Service.
It can take 30–45 minutes for the endpoint to be available.
Use Redshift ML to create a model referencing the SageMaker JumpStart LLM endpoint
In this step, you create a model using Redshift ML and its bring your own model (BYOM) capability. After the model is created, you can use the model’s output function to run remote inference against the LLM. To create a model in Amazon Redshift for the LLM endpoint you created previously, complete the following steps:
- Log in to the Redshift endpoint using the Amazon Redshift Query Editor V2.
- Make sure you have the following AWS Identity and Access Management (IAM) policy added to the default IAM role. Replace <endpointname> with the SageMaker JumpStart endpoint name you captured earlier:
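A minimal policy sketch follows; the Resource ARN placeholders (Region and account ID) are stand-ins you need to fill in for your environment:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "arn:aws:sagemaker:<region>:<account-id>:endpoint/<endpointname>"
        }
    ]
}
```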
- In the query editor, run the following SQL statement to create a model in Amazon Redshift. Replace <endpointname> with the endpoint name you captured earlier. Note that the input and return data type for the model is the SUPER data type.
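A sketch of the statement follows; the model and function names (`llm_customer_orders`, `llm_generate_text`) are illustrative choices used in the examples for the rest of this post:

```sql
-- Create a Redshift ML model that references the SageMaker endpoint.
-- Input and return types are SUPER so the function can pass and
-- receive arbitrary JSON payloads.
CREATE MODEL llm_customer_orders
FUNCTION llm_generate_text(SUPER)
RETURNS SUPER
SAGEMAKER '<endpointname>'
IAM_ROLE default;
```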
Create a materialized view to load raw streaming data
Use the following SQL to create a materialized view for the data being streamed through the `customer_orders` stream. The materialized view is set to auto refresh and will be refreshed as data keeps arriving in the stream.
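The following is a minimal sketch, assuming streaming ingestion with the default IAM role; the schema name `kinesis_schema` is an illustrative choice:

```sql
-- Map the Kinesis data streams in this account to an external schema
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE default;

-- Materialize the stream; AUTO REFRESH YES keeps the view updated
-- as new records arrive
CREATE MATERIALIZED VIEW mv_customer_orders AUTO REFRESH YES AS
SELECT
    approximate_arrival_timestamp,
    partition_key,
    shard_id,
    sequence_number,
    JSON_PARSE(kinesis_data) AS payload  -- parse the raw record into SUPER
FROM kinesis_schema."customer_orders";
```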
After you run these SQL statements, the materialized view `mv_customer_orders` will be created and continuously updated as new data arrives in the `customer_orders` Kinesis data stream.
Call the model function with prompts to transform data and view results
Now you can call the Redshift ML LLM model function with prompts to transform the raw data and view the results. The input payload is a JSON document with the prompt and model parameters as attributes:
- Prompt – The prompt is the input text or instruction provided to the generative AI model to generate new content. The prompt acts as a guiding signal that the model uses to produce relevant and coherent output. Each model has unique prompt engineering guidance. Refer to the Meta Llama 3 Instruct model card for its prompt formats and guidance.
- Model parameters – The model parameters determine the behavior and output of the model. With model parameters, you can control the randomness, number of tokens generated, where the model should stop, and more.
In the Invoke endpoint section of the SageMaker Studio notebook, you can find the model parameters and example payloads.
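For example, a payload for the Llama 3 Instruct endpoint might look like the following; the prompt text and parameter values here are illustrative:

```json
{
  "inputs": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTranslate this comment to English: Entregar a la puerta<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
  "parameters": {
    "max_new_tokens": 100,
    "temperature": 0.1,
    "top_p": 0.9
  }
}
```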
The following SQL statement calls the Redshift ML LLM model function with prompts to standardize phone number and email data, identify the country from the address, and translate comments into English while identifying each comment’s original language. The output of the SQL is stored in the table `enhanced_raw_data_customer_orders`.
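A simplified sketch of the pattern for a single attribute follows; the prompt wording and parameter values are illustrative, and the full statement would repeat the same pattern for the phone, country, and comment columns:

```sql
CREATE TABLE enhanced_raw_data_customer_orders AS
SELECT
    payload.orderid::INT   AS orderid,
    payload.email::VARCHAR AS email,
    -- Build the JSON payload (prompt plus model parameters) as a string,
    -- parse it into SUPER, and send it to the LLM endpoint.
    -- Note: a production version should escape characters in the input
    -- that would break the JSON, such as double quotes.
    llm_generate_text(JSON_PARSE(
        '{"inputs": "Standardize this email address and return only the corrected address: '
        || payload.email::VARCHAR ||
        '", "parameters": {"max_new_tokens": 50, "temperature": 0.1}}'
    )) AS cleansed_email
FROM mv_customer_orders;
```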
Query the `enhanced_raw_data_customer_orders` table to view the data. The output of the LLM is in JSON format, with the result in the `generated_text` attribute. It’s stored in the SUPER data type and can be queried using PartiQL:
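A sketch of such a query follows, assuming the endpoint returns its result under a `generated_text` attribute as described above; depending on the endpoint’s response format, the result may instead be wrapped in an array, in which case you would navigate to it with an index such as `cleansed_email[0].generated_text`:

```sql
SELECT
    orderid,
    email,
    -- Navigate into the SUPER column with PartiQL dot notation
    cleansed_email.generated_text::VARCHAR AS cleansed_email
FROM enhanced_raw_data_customer_orders;
```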
The following screenshot shows our output.
Clean up
To avoid incurring future charges, delete the resources you created:
- Delete the LLM endpoint in SageMaker JumpStart by running the cell in the Clean up section in the Jupyter notebook.
- Delete the Kinesis data stream.
- Delete the Redshift Serverless workgroup or Redshift cluster.
Conclusion
In this post, we showed you how to enrich, standardize, and translate streaming data in Amazon Redshift with generative AI and LLMs. Specifically, we demonstrated the integration of the Meta Llama 3 8B Instruct LLM, available through SageMaker JumpStart, with Redshift ML. Although we used the Meta Llama 3 model as an example, you can use a variety of other pre-trained LLM models available in SageMaker JumpStart as part of your Redshift ML workflows. This integration allows you to explore a wide range of NLP use cases, such as data enrichment, content summarization, knowledge graph development, and more. The ability to seamlessly integrate advanced LLMs into your Redshift environment significantly broadens the analytical capabilities of Redshift ML. This empowers data analysts and developers to incorporate ML into their data warehouse workflows with streamlined processes driven by familiar SQL commands.
We encourage you to explore the full potential of this integration and experiment with implementing various use cases that integrate the power of generative AI and LLMs with Amazon Redshift. The combination of the scalability and performance of Amazon Redshift, along with the advanced natural language processing capabilities of LLMs, can unlock new possibilities for data-driven insights and decision-making.
About the authors
Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. She is passionate about data analytics and data science.