Analyze US census data for population segmentation using Amazon SageMaker

August 2021: Post updated with changes required for SageMaker SDK v2, courtesy of Eitan Sela, Senior Startup Solutions Architect

In the United States, with the 2018 midterm elections approaching, people are looking for more information about the voting process. This blog post explores how we can apply machine learning (ML) to better integrate science into the task of understanding the electorate.

Typically for machine learning applications, clear use cases are derived from labelled data. For example, based on the attributes of a device, such as its age or model number, we can predict its likelihood of failure. We call this supervised learning because there is supervision or guidance towards predicting specific outcomes.

However, real-world data sets are often large and come with no particular outcome to predict and no clean labels. Use cases like this are exploratory: they seek to understand the makeup of a dataset and what natural patterns exist in it. This is known as unsupervised learning. One example is grouping similar individuals together based on a set of attributes.

The use case this blog post explores is population segmentation. We have taken publicly available, anonymized data from the US census on demographics by different US counties: https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml. (Note that this product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.) The outcome of this analysis is a set of natural groupings of similar counties in a transformed feature space. The cluster that a county belongs to can be leveraged to plan an election campaign, for example, to understand how to reach a group of similar counties by highlighting messages that resonate with that group. More generally, this technique can be applied by businesses in customer or user segmentation to create targeted marketing campaigns. This type of analysis can uncover similarities that may not be obvious at face value, such as the counties CA-Fresno and AZ-Yuma being grouped together. While intuitively they differ in commonly examined attributes such as population size and racial makeup, they are more similar than different when viewed along axes such as the mix of employment type.

You can follow along using the sample notebook where you can run the code and interact with the data while reading through the blog post.

There are two goals for this exercise:

1) Walk through a data science workflow using Amazon SageMaker for unsupervised learning using PCA and Kmeans modelling techniques.

2) Demonstrate how users can access the underlying models that are built within Amazon SageMaker to extract useful model attributes. Often, it can be difficult to draw conclusions from unsupervised learning, so being able to access the models for PCA and Kmeans becomes even more important beyond simply generating predictions using the model.

The data science workflow has 4 main steps:

1) Loading the data from Amazon S3
2) Exploratory data analysis (EDA) – Data cleaning and exploration
   a. Cleaning the data
   b. Visualizing the data
   c. Feature engineering
3) Data modelling
   a. Dimensionality reduction
   b. Accessing the PCA model attributes
   c. Deploying the PCA model
   d. Population segmentation using unsupervised clustering
4) Drawing conclusions from our modelling
   a. Accessing the KMeans model attributes

Step 1: Loading the data from Amazon S3

You need to load the dataset from an Amazon S3 bucket into the Amazon SageMaker notebook.

Launch an Amazon SageMaker notebook instance from the AWS console and open it. This example notebook can be found in the Introduction to Applying Machine Learning folder. Alternatively, you can launch a new notebook with a conda_mxnet_p36 kernel and copy the code from this blog post into the notebook to run it. Make note of the region that the SageMaker notebook instance is launched in, because you will want to create an S3 bucket in the same region to store the SageMaker model files that will be created later.

First, we’ll import the relevant libraries into our Amazon SageMaker notebook.

import os
import boto3
import io
import sagemaker

%matplotlib inline 

import pandas as pd
import numpy as np
import mxnet as mx
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
matplotlib.style.use('ggplot')
import pickle, gzip, urllib, json
import csv

Amazon SageMaker integrates seamlessly with Amazon S3. During the first step in creating the notebook, we specified an “AmazonSageMakerFullAccess” role for the notebook. That gives this notebook permission to access any Amazon S3 bucket in this AWS account with “sagemaker” in its name.

The get_execution_role function retrieves the IAM role you created at the time you created your notebook instance.

from sagemaker import get_execution_role
role = get_execution_role()

We can see our role is an AmazonSageMaker-ExecutionRole.

role

'arn:aws:iam::003161157487:role/service-role/AmazonSageMaker-ExecutionRole-20171217T230500'

Loading the dataset

I have previously downloaded and stored the data in a public S3 bucket that you can access. You can use the Python SDK to interact with AWS using a Boto3 client.

First, start the client.

s3_client = boto3.client('s3')
data_bucket_name='aws-ml-blog-sagemaker-census-segmentation'

You’ll get a list of the objects contained within the bucket. The listing shows the data file, ‘acs2015_county_data.csv’, along with a ‘counties/’ prefix.

obj_list=s3_client.list_objects(Bucket=data_bucket_name)
file=[]
for contents in obj_list['Contents']:
    file.append(contents['Key'])
print(file)
['acs2015_county_data.csv', 'counties/']
file_data=file[0]

Grab the data from the CSV file in the bucket.

response = s3_client.get_object(Bucket=data_bucket_name, Key=file_data)
response_body = response["Body"].read()
counties = pd.read_csv(io.BytesIO(response_body), header=0, delimiter=",", low_memory=False)

This is what the first 5 rows of our data look like:

counties.head()

	CensusId	State	County	TotalPop	Men	Women	Hispanic	White	Black	Native	…	Walk	OtherTransp	WorkAtHome	MeanCommute	Employed	PrivateWork	PublicWork	SelfEmployed	FamilyWork	Unemployment
0	1001	Alabama	Autauga	55221	26745	28476	2.6	75.8	18.5	0.4	…	0.5	1.3	1.8	26.5	23986	73.6	20.9	5.5	0.0	7.6
1	1003	Alabama	Baldwin	195121	95314	99807	4.5	83.1	9.5	0.6	…	1.0	1.4	3.9	26.4	85953	81.5	12.3	5.8	0.4	7.5
2	1005	Alabama	Barbour	26932	14497	12435	4.6	46.2	46.7	0.2	…	1.8	1.5	1.6	24.1	8597	71.8	20.8	7.3	0.1	17.6
3	1007	Alabama	Bibb	22604	12073	10531	2.2	74.5	21.4	0.4	…	0.6	1.5	0.7	28.8	8294	76.8	16.1	6.7	0.4	8.3
4	1009	Alabama	Blount	57710	28512	29198	8.6	87.9	1.5	0.3	…	0.9	0.4	2.3	34.9	22189	82.0	13.5	4.2	0.4	7.7

5 rows × 37 columns
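The loading step reads the whole S3 object into memory and hands the bytes to pandas through a file-like wrapper. As a minimal sketch of that pattern without any AWS calls, the snippet below uses a hypothetical byte string (sample_bytes, holding a few values from the table above) in place of response["Body"].read().

```python
import io

import pandas as pd

# Hypothetical stand-in for response["Body"].read(): raw CSV bytes.
sample_bytes = (
    b"CensusId,State,County,TotalPop\n"
    b"1001,Alabama,Autauga,55221\n"
    b"1003,Alabama,Baldwin,195121\n"
)

# BytesIO wraps the bytes in a file-like object that read_csv accepts.
sample_df = pd.read_csv(io.BytesIO(sample_bytes), header=0, delimiter=",")
print(sample_df.shape)
```

Because nothing is written to disk, this approach works well for files that fit comfortably in the notebook instance's memory.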

Step 2: Exploratory data analysis (EDA) – Data cleaning and exploration

a. Cleaning the data

We can do simple data cleaning and processing right in our notebook instance, using the compute instance of the notebook to execute these computations.

How much data are we working with?

There are 3220 rows and 37 columns.

counties.shape
(3220, 37)

Let’s just drop any incomplete data to make our analysis easier. We lose 2 rows of incomplete data, which leaves 3218 rows.

counties.dropna(inplace=True)
counties.shape
(3218, 37)

Let’s combine some of the descriptive reference columns such as state and county and leave the numerical feature columns.

We can now set the ‘state-county’ as the index and the rest of the numerical features become the attributes of each unique county.

counties.index=counties['State'] + "-" + counties['County']
counties.head()
drop=["CensusId" , "State" , "County"]
counties.drop(drop, axis=1, inplace=True)
counties.head()

	TotalPop	Men	Women	Hispanic	White	Black	Native	Asian	Pacific	Citizen	…	Walk	OtherTransp	WorkAtHome	MeanCommute	Employed	PrivateWork	PublicWork	SelfEmployed	FamilyWork	Unemployment
Alabama-Autauga	55221	26745	28476	2.6	75.8	18.5	0.4	1.0	0.0	40725	…	0.5	1.3	1.8	26.5	23986	73.6	20.9	5.5	0.0	7.6
Alabama-Baldwin	195121	95314	99807	4.5	83.1	9.5	0.6	0.7	0.0	147695	…	1.0	1.4	3.9	26.4	85953	81.5	12.3	5.8	0.4	7.5
Alabama-Barbour	26932	14497	12435	4.6	46.2	46.7	0.2	0.4	0.0	20714	…	1.8	1.5	1.6	24.1	8597	71.8	20.8	7.3	0.1	17.6
Alabama-Bibb	22604	12073	10531	2.2	74.5	21.4	0.4	0.1	0.0	17495	…	0.6	1.5	0.7	28.8	8294	76.8	16.1	6.7	0.4	8.3
Alabama-Blount	57710	28512	29198	8.6	87.9	1.5	0.3	0.1	0.0	42345	…	0.9	0.4	2.3	34.9	22189	82.0	13.5	4.2	0.4	7.7

5 rows × 34 columns

b. Visualizing the data

Now we have a dataset in which every remaining column is numerical. We can visualize the data for some of these columns and see what the distribution looks like.

import seaborn as sns

for a in ['Professional', 'Service', 'Office']:
    # plt.subplots returns a (figure, axes) pair; draw the histogram on the axes
    fig, ax = plt.subplots(figsize=(6,3))
    sns.distplot(counties[a], ax=ax)
    title="Histogram of " + a
    ax.set_title(title, fontsize=12)
    plt.show()

For example, from the figures above you can observe the distribution of counties that have a percentage of workers in Professional, Service, or Office occupations. Viewing the histograms can visually indicate characteristics of these features such as the mean or skew. The distribution of Professional workers for example reveals that the typical county has around 25-30% Professional workers, with a right skew, long tail and a Professional worker % topping out at almost 80% in some counties.
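If you want a number to go with the visual impression of skew, one option is the sample skewness (the mean cubed z-score). The sketch below is only an illustration: sample_skewness is a hypothetical helper, and toy_pct holds made-up percentages rather than census values.

```python
import numpy as np


def sample_skewness(values):
    """Return the (biased) sample skewness: the mean of the cubed z-scores."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


# Made-up percentages: most values sit near 25-30, one sits far out in the tail.
toy_pct = [22.0, 25.0, 26.0, 27.0, 28.0, 30.0, 31.0, 78.0]
print(sample_skewness(toy_pct))
```

A positive value indicates a right skew, which typically also shows up as a mean that is larger than the median.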

c. Feature engineering

Data Scaling – Distance-based analytical methods compare features by their numerical distance, so we need to put the numerical columns on a common scale before we can compare relative distances across feature columns. We can use MinMaxScaler to transform the numerical columns so that they all fall between 0 and 1.

from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
counties_scaled=pd.DataFrame(scaler.fit_transform(counties))
counties_scaled.columns=counties.columns
counties_scaled.index=counties.index

We can see that all of our numerical columns now have a min of 0 and a max of 1.

counties_scaled.describe()

	TotalPop	Men	Women	Hispanic	White	Black	Native	Asian	Pacific	Citizen	…	Walk	OtherTransp	WorkAtHome	MeanCommute	Employed	PrivateWork	PublicWork	SelfEmployed	FamilyWork	Unemployment
count	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	…	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000	3218.000000
mean	0.009883	0.009866	0.009899	0.110170	0.756024	0.100942	0.018682	0.029405	0.006470	0.011540	…	0.046496	0.041154	0.124428	0.470140	0.009806	0.760810	0.194426	0.216744	0.029417	0.221775
std	0.031818	0.031692	0.031948	0.192617	0.229682	0.166262	0.078748	0.062744	0.035446	0.033933	…	0.051956	0.042321	0.085301	0.143135	0.032305	0.132949	0.106923	0.106947	0.046451	0.112138
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	…	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.001092	0.001117	0.001069	0.019019	0.642285	0.005821	0.001086	0.004808	0.000000	0.001371	…	0.019663	0.023018	0.072581	0.373402	0.000948	0.697279	0.120861	0.147541	0.010204	0.150685
50%	0.002571	0.002591	0.002539	0.039039	0.842685	0.022119	0.003257	0.012019	0.000000	0.003219	…	0.033708	0.033248	0.104839	0.462916	0.002234	0.785714	0.172185	0.188525	0.020408	0.208219
75%	0.006594	0.006645	0.006556	0.098098	0.933868	0.111758	0.006515	0.028846	0.000000	0.008237	…	0.056180	0.048593	0.150538	0.560102	0.006144	0.853741	0.243377	0.256831	0.030612	0.271233
max	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	…	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000

8 rows × 34 columns
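For intuition, the transformation applied to each column is (x - min) / (max - min). The following NumPy sketch reproduces that formula on the TotalPop and Unemployment values of the first three counties shown earlier. The helper min_max_scale is hypothetical, and mapping constant columns to 0 is an assumption made here to avoid dividing by zero.

```python
import numpy as np


def min_max_scale(matrix):
    """Scale each column to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(matrix, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Assumption: a constant column (range 0) maps to 0 instead of dividing by zero.
    col_range[col_range == 0] = 1.0
    return (x - col_min) / col_range


# TotalPop and Unemployment for Autauga, Baldwin, and Barbour.
toy = np.array([[55221, 7.6], [195121, 7.5], [26932, 17.6]])
scaled = min_max_scale(toy)
print(scaled)
```

After scaling, a county's population and its unemployment rate contribute on comparable scales to any distance calculation, even though the raw values differ by several orders of magnitude.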

Step 3: Data modelling

a. Dimensionality reduction

We will be using principal component analysis (PCA) to reduce the dimensionality of our data. This method decomposes the data matrix into features that are orthogonal with each other. The resultant orthogonal features are linear combinations of the original feature set. You can think of this method as taking many features and combining similar or redundant features together to form a new, smaller feature set.
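To make the idea of orthogonal linear combinations concrete, here is a minimal NumPy sketch of PCA through a singular value decomposition. It is not the SageMaker implementation; the toy data, the variable names, and the choice to keep two components are all assumptions for illustration. Note also that np.linalg.svd returns singular values from largest to smallest.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 3 features, where feature 2 nearly duplicates feature 0.
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + 0.05 * rng.normal(size=200)])

# Center the data, then decompose it.
X_centered = X - X.mean(axis=0)
# Rows of vt are the principal directions; s holds the singular values (descending).
_, s, vt = np.linalg.svd(X_centered, full_matrices=False)

# The directions are orthonormal, so vt @ vt.T is the identity matrix.
gram = vt @ vt.T

# Projecting onto the top 2 directions gives the smaller feature set.
X_reduced = X_centered @ vt[:2].T

# Share of variance carried by each direction.
explained = s ** 2 / np.sum(s ** 2)
print(explained)
```

Because the third feature is almost a copy of the first, two components carry nearly all of the variance, which is the sense in which PCA combines redundant features.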

We can reduce dimensionality with the built-in Amazon SageMaker algorithm for PCA.

We first import and call an instance of the PCA SageMaker model. Then we specify different parameters of the model. These can be resource configuration parameters, such as how many instances to use during training, or what type of instances to use. Or they can be model computation hyperparameters, such as how many components to use when performing PCA. Documentation on the PCA model can be found here: http://sagemaker.readthedocs.io/en/latest/pca.html

You will use the tools provided by the Amazon SageMaker Python SDK to upload the data to a default bucket.

sess = sagemaker.Session()
bucket = sess.default_bucket()

from sagemaker import PCA

num_components = 33

pca_SM = PCA(
    role=role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    output_path="s3://" + bucket + "/counties/",
    num_components=num_components,
)

Next, we prepare data for Amazon SageMaker by extracting the numpy array from the DataFrame and explicitly casting to float32.

train_data = counties_scaled.values.astype('float32')

The record_set function in the Amazon SageMaker PCA model converts a numpy array into the record set format that the training input requires. This is a requirement for all Amazon SageMaker built-in models. Using this data type is one of the reasons that training within Amazon SageMaker can run faster on larger data sets than other implementations of the same models, such as the sklearn implementation.

We call the fit function on our PCA model, passing in our training data, and this spins up a training instance or cluster to perform the training job.

%%time
pca_SM.fit(pca_SM.record_set(train_data))

b. Accessing the PCA model attributes

After the model is created, we can also access the underlying model parameters.

Now that the training job is complete, you can find the job under Jobs in the Training subsection in the Amazon SageMaker console. You can find the job name listed in the training jobs. Use that job name in the following code to specify which model to examine.

Model artifacts are stored in Amazon S3 after they have been trained. This is the same model artifact that is used to deploy a trained model using Amazon SageMaker. Since many of the Amazon SageMaker algorithms use MXNet for computational speed, the model artifact is stored as an ND array. Under the output path that was specified during the training call, the model resides at “training job name”/output/model.tar.gz, which is a TAR archive file compressed with GNU zip (gzip) compression.

job_name = pca_SM.latest_training_job.name
model_key = "counties/" + job_name + "/output/model.tar.gz"

boto3.resource("s3").Bucket(bucket).download_file(model_key, "model.tar.gz")
os.system("tar -zxvf model.tar.gz")

After the model is unzipped and decompressed, we can load the ND array using MXNet.

import mxnet as mx
pca_model_params = mx.ndarray.load('model_algo-1')

Three groups of model parameters are contained within the PCA model.

mean: is optional and is only available if the “subtract_mean” hyperparameter is true when calling the training step from the original PCA SageMaker function.

v: contains the principal components (same as ‘components_’ in the sklearn PCA model).

s: the singular values of the components for the PCA transformation. This does not exactly give the % variance from the original feature space, but can give the % variance from the projected feature space.

explained-variance-ratio ~= square(s) / sum(square(s))

If you need the exact explained-variance-ratio vector, save the sum of squares of the original data (call that N) and compute explained-variance-ratio = square(s) / N.
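The difference between the approximate and the exact ratio can be seen in a small NumPy sketch. Everything here is hypothetical (toy data, keeping 4 of 6 components), and it assumes the data is mean-centered before the decomposition so that N is the sum of squares of the centered data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 100 rows, 6 features with decreasing spread.
X = rng.normal(size=(100, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
X_centered = X - X.mean(axis=0)

# All singular values, reordered smallest to largest as in the text.
s_all = np.linalg.svd(X_centered, compute_uv=False)[::-1]

# Suppose the model kept only the 4 largest components.
s_kept = s_all[-4:]

# Approximation: share of variance within the projected space (sums to 1).
approx_ratio = s_kept ** 2 / np.sum(s_kept ** 2)

# Exact: divide by N, the sum of squares of the (centered) original data.
N = np.sum(X_centered ** 2)
exact_ratio = s_kept ** 2 / N
print(approx_ratio.sum(), exact_ratio.sum())
```

The approximate ratios always sum to 1, while the exact ratios sum to less than 1 whenever some components were discarded. When almost all components are kept, the two are close.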

s=pd.DataFrame(pca_model_params['s'].asnumpy())
v=pd.DataFrame(pca_model_params['v'].asnumpy())

We can now calculate the variance explained by the largest n components that we want to keep. For this example, let’s take the top 5 components.

We can see that the largest 5 components explain ~72% of the total variance in our dataset:

s.iloc[28:,:].apply(lambda x: x*x).sum()/s.apply(lambda x: x*x).sum()
0    0.717983
dtype: float32

After we have decided to keep the top 5 components, we can take only the 5 largest components from our original s and v matrix.

s_5=s.iloc[28:,:]
v_5=v.iloc[:,28:]
v_5.columns=[0,1,2,3,4]

We can now examine the makeup of each PCA component based on the weightings of the original features that are included in the component. For example, the following code shows the first component. We can see that this component describes an attribute of a county that has high poverty and unemployment, low income and income per capita, and high Hispanic/Black population and low White population.

Note that this is v_5[4] or last component of the list of components in v_5, but is actually the largest component because the components are ordered from smallest to largest. So v_5[0] would be the smallest component. Similarly, change the value of component_num to cycle through the makeup of each component.

component_num=1

first_comp = v_5[5-component_num]
comps = pd.DataFrame(list(zip(first_comp, counties_scaled.columns)), columns=['weights', 'features'])
comps['abs_weights']=comps['weights'].apply(lambda x: np.abs(x))
ax=sns.barplot(data=comps.sort_values('abs_weights', ascending=False).head(10), x="weights", y="features", palette="Blues_d")
ax.set_title("PCA Component Makeup: #" + str(component_num))
plt.show()
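The index arithmetic behind the 28: slices and the 5-component_num lookup can be sketched in isolation. The singular values below are made up, and column_for_rank is a hypothetical helper; the only thing taken from the text is that components are ordered from smallest to largest.

```python
import numpy as np

num_components = 33
n_top = 5
start = num_components - n_top  # 28, matching the iloc[28:] slices in the text

# Made-up singular values in ascending order, standing in for s.
s_toy = np.sqrt(np.arange(1, num_components + 1, dtype=float))
top = s_toy[start:]


def column_for_rank(rank, n_top=5):
    """Map rank 1 (largest) through n_top (smallest kept) to a column of the slice."""
    return n_top - rank


print(start, column_for_rank(1), column_for_rank(5))
```

With this mapping, rank 1 lands on the last column of the slice and rank 5 on the first, which is why the largest component is v_5[4].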

Similarly, you can go through and examine the makeup of each PCA component and try to understand what the key positive and negative attributes are for each component. The following code names the components, but feel free to change them as you gain insight into the unique makeup of each component.

PCA_list=['comp_1', 'comp_2', 'comp_3', 'comp_4', 'comp_5']

#PCA_list=["Poverty/Unemployment", "Self Employment/Public Workers", "High Income/Professional & Office Workers", \
#         "Black/Native Am Populations & Public/Professional Workers", "Construction & Commuters"]

c. Deploying the PCA model

We can now deploy this model endpoint and use it to make predictions. This model is now live and hosted on an instance_type that we specify.

%%time
pca_predictor = pca_SM.deploy(initial_instance_count=1, 
                                 instance_type='ml.t2.medium')

We can also pass our original dataset to the model so that we can transform the data using the model we created. Then we can take the largest 5 components and this will reduce the dimensionality of our data from 34 to 5.

%%time
result = pca_predictor.predict(train_data)
# Collect one projection per county (DataFrame.append was removed in pandas 2.0)
counties_transformed=pd.DataFrame([list(a.label['projection'].float32_tensor.values) for a in result])
counties_transformed.index=counties_scaled.index
counties_transformed=counties_transformed.iloc[:,28:]
counties_transformed.columns=PCA_list

Now we have created a dataset where each county is described by the 5 principal components that we analyzed earlier. Each of these 5 components is a linear combination of the original feature space. We can interpret each of these 5 components by analyzing the makeup of the component shown previously.

counties_transformed.head()

	Poverty/Unemployment	Self Employment/Public Workers	High Income/Professional & Office Workers	Black/Native Am Populations & Public/Professional Workers	Construction & Commuters
Alabama-Autauga	-0.010824	0.120480	-0.088356	0.160527	-0.060274
Alabama-Baldwin	-0.068677	-0.023092	-0.145743	0.185969	-0.149684
Alabama-Barbour	0.093111	0.297829	0.146258	0.296662	0.506202
Alabama-Bibb	0.283526	0.011757	0.224402	0.190861	0.069224
Alabama-Blount	0.100738	-0.193824	0.022714	0.254403	-0.091030

d. Population segmentation using unsupervised clustering

Now, we’ll use the Kmeans algorithm to segment the population of counties by the 5 PCA attributes we have created. Kmeans is a clustering algorithm that identifies clusters of similar counties based on their attributes. Since we have ~3000 counties and 34 attributes in our original dataset, the large feature space may have made it difficult to cluster the counties effectively. Instead, we have reduced the feature space to 5 PCA components, and we’ll cluster on this transformed dataset.

train_data = counties_transformed.values.astype('float32')

First, we call and define the hyperparameters of our KMeans model as we have done with our PCA model. The Kmeans algorithm allows the user to specify how many clusters to identify. In this instance, let’s try to find the top 7 clusters from our dataset.

from sagemaker import KMeans

num_clusters = 7
kmeans = KMeans(
    role=role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    output_path="s3://" + bucket + "/counties/",
    k=num_clusters,
)

Then we train the model on our training data.

%%time
kmeans.fit(kmeans.record_set(train_data))
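For intuition about what a k-means training job is looking for, here is a minimal sketch of the textbook procedure (Lloyd's algorithm) in NumPy. This is not the algorithm that the SageMaker built-in KMeans runs; the function names, the toy blobs, and the choice of starting centroids are all assumptions made for illustration.

```python
import numpy as np


def assign(points, centroids):
    """Return the index of the nearest centroid (Euclidean) for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def lloyd_kmeans(points, init_centroids, n_iter=20):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    return centroids, assign(points, centroids)


rng = np.random.default_rng(42)
# Two made-up, well-separated blobs of 50 points each.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Assumption: start from one point in each blob to keep the example deterministic.
centroids, labels = lloyd_kmeans(points, init_centroids=[points[0], points[-1]])
print(centroids)
```

Each iteration assigns every point to its nearest centroid and then moves each centroid to the mean of its assigned points. The number of centroids plays the same role as the k hyperparameter above.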

Now we deploy the model and we can pass in the original training set to get the labels for each entry. This will give us which cluster each county belongs to.

%%time
kmeans_predictor = kmeans.deploy(initial_instance_count=1, 
                                 instance_type='ml.t2.medium')

%%time
result=kmeans_predictor.predict(train_data)
CPU times: user 204 ms, sys: 0 ns, total: 204 ms
Wall time: 438 ms

We can see the breakdown of cluster counts and the distribution of clusters.

cluster_labels = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]
pd.DataFrame(cluster_labels)[0].value_counts()
5.0    928
3.0    851
2.0    401
4.0    364
6.0    332
0.0    227
1.0    115
Name: 0, dtype: int64

# plt.subplots returns a (figure, axes) pair; draw the histogram on the axes
fig, ax = plt.subplots(figsize=(6,3))
sns.distplot(cluster_labels, kde=False, ax=ax)
title="Histogram of Cluster Counts"
ax.set_title(title, fontsize=12)
plt.show()
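As a quick sanity check on the breakdown, the cluster sizes should add up to the number of counties we kept after cleaning. The sketch below copies the counts printed by value_counts() into a Counter; the variable names are hypothetical.

```python
from collections import Counter

# Cluster sizes as printed by value_counts() (label -> number of counties).
cluster_sizes = Counter({5: 928, 3: 851, 2: 401, 4: 364, 6: 332, 0: 227, 1: 115})

total_counties = sum(cluster_sizes.values())
shares = {label: size / total_counties for label, size in cluster_sizes.items()}

largest_label, largest_size = cluster_sizes.most_common(1)[0]
print(total_counties, largest_label, round(shares[largest_label], 3))
```

The total matches the 3218 rows of the cleaned dataset, and the largest cluster holds a little under 29% of the counties.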

However, to improve explainability, we need to access the underlying model to get the cluster centers. These centers will help describe which features characterize each cluster.

Step 4: Drawing conclusions from our modelling

Explaining the result of the modelling is an important step in making use of our analysis. By combining PCA and Kmeans, and the information contained in the model attributes within an Amazon SageMaker trained model, we can form concrete conclusions based on the data.

a. Accessing the KMeans model attributes

First, we will go into the bucket where the kmeans model is stored and extract it.

job_name = kmeans.latest_training_job.name
model_key = "counties/" + job_name + "/output/model.tar.gz"

boto3.resource("s3").Bucket(bucket).download_file(model_key, "model.tar.gz")
os.system("tar -zxvf model.tar.gz")
Kmeans_model_params = mx.ndarray.load("model_algo-1")

There is 1 set of model parameters that is contained within the KMeans model.

Cluster Centroid Locations: The location of the centers of each cluster identified by the Kmeans algorithm. The cluster location is given in our PCA transformed space with 5 components, since we passed the transformed PCA data into the model.

cluster_centroids=pd.DataFrame(Kmeans_model_params[0].asnumpy())
cluster_centroids.columns=counties_transformed.columns
cluster_centroids

	Poverty/Unemployment	Self Employment/Public Workers	High Income/Professional & Office Workers	Black/Native Am Populations & Public/Professional Workers	Construction & Commuters
0	0.025268	0.018374	0.040258	-0.356033	0.317980
1	-0.023942	-0.369325	-0.181284	-0.240269	1.045534
2	-0.019848	0.059698	-0.354655	0.047747	-0.064816
3	-0.012187	-0.063721	-0.007848	0.035607	-0.248565
4	0.103763	0.269017	0.084220	0.241475	0.390270
5	-0.029940	-0.054422	0.132807	0.086195	-0.020860
6	0.029929	0.110788	0.084239	-0.411732	-0.217632

We can plot a heatmap of the centroids and their location in the transformed feature space. This gives us insight into what characteristics define each cluster. Often with unsupervised learning, results are hard to interpret. This is one way to make use of the results of PCA plus clustering techniques together. Since we were able to examine the makeup of each PCA component, we can understand what each centroid represents in terms of the PCA components that we interpreted previously.

For example, we can see that cluster 1 has the highest value in the “Construction & Commuters” attribute while it has the lowest value in the “Self Employment/Public Workers” attribute compared with other clusters. Similarly, cluster 4 has high values in “Construction & Commuters,” “High Income/Professional & Office Workers,” and “Self Employment/Public Workers.”

plt.figure(figsize = (16, 6))
ax = sns.heatmap(cluster_centroids.T, cmap = 'YlGnBu')
ax.set_xlabel("Cluster")
plt.yticks(fontsize = 16)
plt.xticks(fontsize = 16)
ax.set_title("Attribute Value by Centroid")
plt.show()

We can also map the cluster labels back to each individual county and examine which counties were naturally grouped together.

counties_transformed['labels']=list(map(int, cluster_labels))
counties_transformed.head()

	Poverty/Unemployment	Self Employment/Public Workers	High Income/Professional & Office Workers	Black/Native Am Populations & Public/Professional Workers	Construction & Commuters	labels
Alabama-Autauga	-0.010824	0.120480	-0.088356	0.160527	-0.060274	5
Alabama-Baldwin	-0.068677	-0.023092	-0.145743	0.185969	-0.149684	3
Alabama-Barbour	0.093111	0.297829	0.146258	0.296662	0.506202	4
Alabama-Bibb	0.283526	0.011757	0.224402	0.190861	0.069224	5
Alabama-Blount	0.100738	-0.193824	0.022714	0.254403	-0.091030	5

Now, we can examine one of the clusters in more detail, like cluster 1 for example. A cursory glance at the location of the centroid tells us that it has the highest value for the “Construction & Commuters” attribute. We can now see which counties fit that description.

cluster=counties_transformed[counties_transformed['labels']==1]
cluster.head(5)

	Poverty/Unemployment	Self Employment/Public Workers	High Income/Professional & Office Workers	Black/Native Am Populations & Public/Professional Workers	Construction & Commuters	labels
Arizona-Santa Cruz	-0.014149	-0.347113	-0.386305	-0.284937	0.753071	1
Arizona-Yuma	-0.019377	-0.260098	-0.200252	-0.188408	0.585572	1
California-Fresno	0.016950	-0.198805	-0.260822	-0.090927	0.590060	1
California-Imperial	-0.015831	-0.291125	-0.296619	-0.279273	0.885126	1
California-Merced	0.170347	-0.304941	-0.154338	-0.072953	0.644423	1
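The readings taken from the centroid table, and the labels themselves, can be checked with a few lines of NumPy. The sketch below copies the centroid coordinates and two county rows from the tables in this post. The helper closest_cluster is hypothetical and assumes that a county's label is the index of its nearest centroid by Euclidean distance.

```python
import numpy as np

components = [
    "Poverty/Unemployment",
    "Self Employment/Public Workers",
    "High Income/Professional & Office Workers",
    "Black/Native Am Populations & Public/Professional Workers",
    "Construction & Commuters",
]

# Centroid coordinates copied from the cluster_centroids table (rows = clusters 0 to 6).
centroids = np.array([
    [0.025268, 0.018374, 0.040258, -0.356033, 0.317980],
    [-0.023942, -0.369325, -0.181284, -0.240269, 1.045534],
    [-0.019848, 0.059698, -0.354655, 0.047747, -0.064816],
    [-0.012187, -0.063721, -0.007848, 0.035607, -0.248565],
    [0.103763, 0.269017, 0.084220, 0.241475, 0.390270],
    [-0.029940, -0.054422, 0.132807, 0.086195, -0.020860],
    [0.029929, 0.110788, 0.084239, -0.411732, -0.217632],
])

# Which cluster scores highest and lowest on each component?
highest = dict(zip(components, centroids.argmax(axis=0).tolist()))
lowest = dict(zip(components, centroids.argmin(axis=0).tolist()))


def closest_cluster(point, centroids):
    """Return the index of the centroid nearest to point (Euclidean distance)."""
    distances = np.linalg.norm(centroids - np.asarray(point), axis=1)
    return int(distances.argmin())


# Rows copied from the transformed dataset shown in this post.
imperial = [-0.015831, -0.291125, -0.296619, -0.279273, 0.885126]  # California-Imperial
barbour = [0.093111, 0.297829, 0.146258, 0.296662, 0.506202]  # Alabama-Barbour
print(highest["Construction & Commuters"], closest_cluster(imperial, centroids))
```

Under that assumption, California-Imperial falls nearest to centroid 1 and Alabama-Barbour nearest to centroid 4, in line with the labels shown in the tables.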

Conclusion

You have just walked through a data science workflow for unsupervised learning, specifically clustering a dataset using KMeans after reducing the dimensionality using PCA. By accessing the underlying models created within Amazon SageMaker, we were able to improve the explainability of our modelling and draw actionable conclusions. Using these techniques, we have been able to better understand the essential characteristics of different counties in the US and segment the electorate into groupings accordingly.

Because endpoints are persistent, let’s delete our endpoints now that we’re done to avoid any excess charges on our AWS bill.

pca_predictor.delete_endpoint()
kmeans_predictor.delete_endpoint()

About the Author

Han Man is a Data Scientist with AWS Professional Services. He has a PhD in engineering from Northwestern University and has several years of experience as a management consultant advising clients in manufacturing, financial services, and energy. Today he is passionately working with customers from a variety of industries to develop and implement machine learning & AI solutions on AWS. He enjoys following the NBA and playing basketball in his spare time.

Eitan Sela is a Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Eitan also helps customers build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.