AWS Big Data Blog

Use unsupervised training with K-means clustering in Amazon Redshift ML

June 2023: This post was reviewed and updated for accuracy.

Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price–performance. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies.

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. In previous posts, we covered how Amazon Redshift supports supervised learning that includes regression, binary classification, and multiclass classification, as well as training models using XGBoost and providing advanced options such as preprocessors, problem type, and hyperparameters.

In this post, we use Redshift ML to perform unsupervised learning on unlabeled training data using the K-means algorithm. This algorithm solves clustering problems where you want to discover groupings in the data. Unlabeled data is grouped and partitioned based on its similarities and differences. To perform this grouping, the K-means algorithm iteratively determines the best centroids and assigns each data point to the closest centroid. Data points nearest the same centroid belong to the same group. Members of a group are as similar as possible to other members in the same group, and as different as possible from members of other groups. To learn more about K-means clustering, see K-means clustering with Amazon SageMaker.

Solution overview

The following are some use cases for K-means:

  • Ecommerce and retail – Segment your customers by purchase history, stores they visited, or clickstream activity.
  • Healthcare – Group similar images for image detection. For example, you can detect patterns for diseases or successful treatment scenarios.
  • Finance – Detect fraud by detecting anomalies in the dataset. For example, you can detect credit card fraud by abnormal purchase patterns.
  • Technology – Build a network intrusion detection system that aims to identify attacks or malicious activity.
  • Meteorology – Detect anomalies in sensor data collection such as storm forecasting.

In our example, we use K-means on the Global Database of Events, Language, and Tone (GDELT) dataset, which monitors news coverage from around the world, with data recorded every second of every day. This information is freely available as part of the Registry of Open Data on AWS.

The data is stored as multiple files on Amazon Simple Storage Service (Amazon S3), with two different formats: historical, which covers the years 1979–2013, and daily updates, which cover the years 2013 and later. For this example, we use the historical format and bring in 1979 data.

For our use case, we use a subset of the data’s attributes:

  • EventCode – The raw CAMEO action code describing the action that Actor1 performed upon Actor2.
  • NumArticles – The total number of source documents containing one or more mentions of this event. You can use this to assess the importance of an event. The more discussion of that event, the more likely it is to be significant.
  • AvgTone – The average tone of all documents containing one or more mentions of this event. The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral.
  • Actor1Geo_Lat – The centroid latitude of the Actor1 landmark for mapping.
  • Actor1Geo_Long – The centroid longitude of the Actor1 landmark for mapping.
  • Actor2Geo_Lat – The centroid latitude of the Actor2 landmark for mapping.
  • Actor2Geo_Long – The centroid longitude of the Actor2 landmark for mapping.

Each row corresponds to an event at a specific location. For example, rows 53-57 in the file 1979.csv, which we use below, all seem to refer to interactions between FRA and AFR, dealing with consultation and diplomatic relations with a mostly positive tone. It is hard, if not impossible, for us to make sense of such data at scale. Clusters of events, whether sharing a similar tone, occurring in similar locations, or involving similar actors, are useful for visualizing and interpreting the data. Clustering can also reveal non-obvious structures, such as potential common causes for different events, the propagation of a root event across the globe, or the change in tone toward a common event over time. However, we don't know what makes two events similar: is it the location, the two actors, the tone, the time, or some combination of these? Clustering algorithms can learn from data and determine 1) what makes different data points similar, 2) which data points are related to which other data points, and 3) what the common characteristics of these related data points are.

Prerequisites

To get started, we need an Amazon Redshift cluster with version 1.0.33433 or higher and an AWS Identity and Access Management (IAM) role attached that provides access to Amazon SageMaker and permissions to an S3 bucket.

For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

To create a simple cluster, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose Create cluster.
  3. Provide the configuration parameters such as cluster name, user name, and password.
  4. For Associated IAM roles, on the Manage IAM roles menu, choose Create IAM role.

If you have an existing role with the required parameters, you can choose Associate IAM roles.

  5. Select Specific S3 buckets and choose a bucket for storing the artifacts generated by Redshift ML.
  6. Choose Create IAM role as default.

A default IAM role is created for you and automatically associated with the cluster.

  7. Choose Create cluster.

Prepare the data

Load the GDELT data into Amazon Redshift using the following SQL. You can use the Amazon Redshift Query Editor v2 or your favorite SQL tool to run these commands.

To create the table, use the following commands:

DROP TABLE IF EXISTS gdelt_data CASCADE;

CREATE TABLE gdelt_data (
GlobalEventId   bigint,
SqlDate  bigint,
MonthYear bigint,
Year   bigint,
FractionDate double precision,
Actor1Code varchar(256),
Actor1Name varchar(256),
Actor1CountryCode varchar(256),
Actor1KnownGroupCode varchar(256),
Actor1EthnicCode varchar(256),
Actor1Religion1Code varchar(256),
Actor1Religion2Code varchar(256),
Actor1Type1Code varchar(256),
Actor1Type2Code varchar(256),
Actor1Type3Code varchar(256),
Actor2Code varchar(256),
Actor2Name varchar(256),
Actor2CountryCode varchar(256),
Actor2KnownGroupCode varchar(256),
Actor2EthnicCode varchar(256),
Actor2Religion1Code  varchar(256),
Actor2Religion2Code varchar(256),
Actor2Type1Code varchar(256),
Actor2Type2Code varchar(256),
Actor2Type3Code varchar(256),
IsRootEvent bigint,
EventCode bigint,
EventBaseCode bigint,
EventRootCode bigint,
QuadClass bigint,
GoldsteinScale double precision,
NumMentions bigint,
NumSources bigint,
NumArticles bigint,
AvgTone double precision,
Actor1Geo_Type bigint,
Actor1Geo_FullName varchar(256),
Actor1Geo_CountryCode varchar(256),
Actor1Geo_ADM1Code varchar(256),
Actor1Geo_Lat double precision,
Actor1Geo_Long double precision,
Actor1Geo_FeatureID bigint,
Actor2Geo_Type bigint,
Actor2Geo_FullName varchar(256),
Actor2Geo_CountryCode varchar(256),
Actor2Geo_ADM1Code varchar(256),
Actor2Geo_Lat double precision,
Actor2Geo_Long double precision,
Actor2Geo_FeatureID bigint,
ActionGeo_Type bigint,
ActionGeo_FullName varchar(256),
ActionGeo_CountryCode varchar(256),
ActionGeo_ADM1Code varchar(256),
ActionGeo_Lat double precision,
ActionGeo_Long double precision,
ActionGeo_FeatureID bigint,
DATEADDED bigint 
) ;

To load data into the table, use the following command:

COPY gdelt_data FROM 's3://gdelt-open-data/events/1979.csv'
region 'us-east-1' iam_role default csv delimiter '\t';  
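
To verify the load, you can run a quick row count; the exact number depends on the contents of the 1979 file:

-- sanity check that the COPY populated the table
select count(*) from gdelt_data;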

Create a model in Redshift ML

When using the K-means algorithm, you must specify an input K that specifies the number of clusters to find in the data. The output of this algorithm is a set of K centroids, one for each cluster. Each data point belongs to one of the K clusters that is closest to it. Each cluster is described by its centroid, which can be thought of as a multi-dimensional representation of the cluster. The K-means algorithm compares the distances between centroids and data points to learn how different the clusters are from each other. A larger distance generally indicates a greater difference between the clusters.
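
Put more formally, K-means chooses the centroids μ_1, …, μ_K and the cluster assignments C_1, …, C_K that minimize the total within-cluster squared distance. In LaTeX notation (our own summary of the standard objective, not something Redshift ML exposes):

\min_{C,\,\mu} \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2

The algorithm alternates between assigning each data point to its nearest centroid and recomputing each centroid as the mean of the points assigned to it, until the assignments stop changing.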

Before we create the model, let’s examine the training data by running the following SQL code in Amazon Redshift Query Editor v2:

select AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long
from gdelt_data;

The following screenshot shows our results.

We create a model with seven clusters from this data (see the following code). You can experiment by changing the K value and creating different models. The SageMaker K-means algorithm can obtain a good clustering with only a single pass over the data, resulting in very fast training times.

CREATE MODEL news_data_clusters
FROM (select AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long
   from gdelt_data)
FUNCTION  news_monitoring_cluster
IAM_ROLE default
AUTO OFF
MODEL_TYPE KMEANS
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT (K '7')
SETTINGS (S3_BUCKET '<<your-amazon-s3-bucket-name>>');

For more information about model training, see Machine learning overview. For a list of other hyperparameters that K-means supports, see K-means Hyperparameters. For the full syntax of CREATE MODEL, see our documentation.

You can use the SHOW MODEL command to view the status of the model:

SHOW MODEL NEWS_DATA_CLUSTERS;

The results show that our model is in the READY state.
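
If you prefer querying a system view over running SHOW MODEL, Amazon Redshift also exposes model metadata through SVV_ML_MODEL_INFO. The following is a sketch; it assumes your Redshift version includes this view and its model_name column, so adjust if your cluster differs:

-- list metadata (including model state) for the K-means model created above
select *
from svv_ml_model_info
where model_name = 'news_data_clusters';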

We can now run the query to identify the clusters. The following query shows the cluster associated with each GlobalEventId:

select globaleventid, news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as cluster 
from gdelt_data;

We get the following results.
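
If you plan to explore the cluster assignments repeatedly, you can materialize them once instead of invoking the model function in every query. The following is a minimal sketch; the table name news_events_clustered is our own choice:

-- store each event's cluster assignment for reuse in later queries
create table news_events_clustered as
select globaleventid,
       news_monitoring_cluster(AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long) as events_cluster
from gdelt_data;

The remaining queries in this post continue to call the function inline, as in the original examples, but you could equally join against this table.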

Now let’s run a query to check the distribution of data across our clusters to see if seven is the appropriate cluster size for this dataset:

select events_cluster , count(*) as nbr_events  from   
(select globaleventid, news_monitoring_cluster( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster
from gdelt_data)
group by 1;

The results show that very few events are assigned to clusters 1 and 3.

Let’s try running the above query again after re-creating the model with nine clusters by changing the K value to 9.
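
Because a model name can't be reused while that model still exists, one way to do this is to drop the model and re-create it with K set to 9. This sketch reuses the same names and S3 bucket placeholder as the earlier CREATE MODEL statement:

DROP MODEL news_data_clusters;

CREATE MODEL news_data_clusters
FROM (select AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long
   from gdelt_data)
FUNCTION news_monitoring_cluster
IAM_ROLE default
AUTO OFF
MODEL_TYPE KMEANS
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT (K '9')
SETTINGS (S3_BUCKET '<<your-amazon-s3-bucket-name>>');

As before, run SHOW MODEL NEWS_DATA_CLUSTERS; and wait for the READY state before re-running the distribution query.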

Using nine clusters helps smooth out the cluster sizes. The smallest is now approximately 11,000 and the largest is approximately 117,000, compared to 188,000 when using seven clusters.
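
To make the relative sizes easier to compare, you can also express each cluster's count as a percentage of all events. This is a small variation on the distribution query above; the pct_events column name is our own:

-- per-cluster event counts and their share of the total
select events_cluster, count(*) as nbr_events,
       round(100.0 * count(*) / sum(count(*)) over (), 1) as pct_events
from (select news_monitoring_cluster(AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long) as events_cluster
      from gdelt_data)
group by 1
order by 1;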

Now, let’s run the following query to determine the centers of the clusters based on number of articles by event code:

select news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster, eventcode ,sum(numArticles) as numArticles from 
gdelt_data
group by 1,2 ;

Let’s run the following query to get more insights into the datapoints assigned to one of the clusters:

select news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster, eventcode, actor1name, actor2name, sum(numarticles) as totalarticles
from gdelt_data
where events_cluster = 5
and actor1name <> ' ' and actor2name <> ' '
group by 1,2,3,4
order by 5 desc;

Observing the data points assigned to the clusters, we see clusters of events corresponding to interactions between the US and China (probably due to the establishment of diplomatic relations), between the US and Russia (probably corresponding to the SALT II Treaty), and those involving Iran (probably corresponding to the Iranian Revolution). Thus, clustering can help us make sense of the data and show us the way as we continue to explore and use it.

Conclusion

Redshift ML makes it easy for users of all skill levels to use ML technology. With no prior ML knowledge, you can use Redshift ML to gain business insights from your data. You can take advantage of ML approaches such as supervised and unsupervised learning to classify your labeled and unlabeled data, respectively. In this post, we walked you through how to perform unsupervised learning with Redshift ML by creating an ML model that uses the K-means algorithm to discover groupings in your data.

For more information about building different models, see Amazon Redshift ML.


About the Authors

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Akash Gheewala is a Solutions Architect at AWS. He helps global enterprises across the high tech industry in their journey to the cloud. He does this through his passion for accelerating digital transformation for customers and building highly scalable and cost-effective solutions in the cloud. Akash also enjoys mental models, creating content and vagabonding about the world.

Murali Narayanaswamy is a principal machine learning scientist in AWS. He received his PhD from Carnegie Mellon University and works at the intersection of ML, AI, optimization, learning and inference to combat uncertainty in real-world applications including personalization, forecasting, supply chains and large scale systems.