AWS Machine Learning Blog
Analyze US census data for population segmentation using Amazon SageMaker
August 2021: Post updated with changes required for SageMaker SDK v2, courtesy of Eitan Sela, Senior Startup Solutions Architect
In the United States, with the 2018 midterm elections approaching, people are looking for more information about the voting process. This blog post explores how we can apply machine learning (ML) to better integrate science into the task of understanding the electorate.
Typically for machine learning applications, clear use cases are derived from labelled data. For example, based on the attributes of a device, such as its age or model number, we can predict its likelihood of failure. We call this supervised learning because there is supervision or guidance towards predicting specific outcomes.
However, in the real world, there are often large data sets where there is no particular outcome to predict, where clean labels are hard to define. It can be difficult to pinpoint exactly what the right outcome is to predict. This type of use case is often exploratory. It seeks to understand the makeup of a dataset and what natural patterns exist. This type of use case is known as unsupervised learning. One example of this is trying to group similar individuals together based on a set of attributes.
The use case this blog post explores is population segmentation. We have taken publicly available, anonymized data from the US census on demographics by different US counties: https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml. (Note that this product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.) The outcome of this analysis is a set of natural groupings of similar counties in a transformed feature space. The cluster that a county belongs to can be leveraged to plan an election campaign, for example, to understand how to reach a group of similar counties by highlighting messages that resonate with that group. More generally, this technique can be applied by businesses in customer or user segmentation to create targeted marketing campaigns. This type of analysis can uncover similarities that may not be obvious at face value, such as the counties CA-Fresno and AZ-Yuma being grouped together. While intuitively they differ in commonly examined attributes such as population size and racial makeup, they are more similar than different when viewed along axes such as the mix of employment type.
You can follow along using the sample notebook where you can run the code and interact with the data while reading through the blog post.
There are two goals for this exercise:
1) Walk through a data science workflow using Amazon SageMaker for unsupervised learning using PCA and Kmeans modelling techniques.
2) Demonstrate how users can access the underlying models that are built within Amazon SageMaker to extract useful model attributes. Often, it can be difficult to draw conclusions from unsupervised learning, so being able to access the models for PCA and Kmeans becomes even more important beyond simply generating predictions using the model.
The data science workflow has 4 main steps:
- Loading the data from Amazon S3
- Exploratory data analysis (EDA) – Data cleaning and exploration
  - Cleaning the data
  - Visualizing the data
  - Feature engineering
- Data modelling
  - Dimensionality reduction
  - Accessing the PCA model attributes
  - Deploying the PCA model
  - Population segmentation using unsupervised clustering
- Drawing conclusions from our modelling
  - Accessing the KMeans model attributes
Step 1: Loading the data from Amazon S3
You need to load the dataset from an Amazon S3 bucket into the Amazon SageMaker notebook.
Launch an Amazon SageMaker notebook instance from the AWS console and open the notebook instance. This example notebook can be found in the Introduction to Applying Machine Learning folder. Alternatively, you can launch a new notebook with a conda_mxnet_p36 kernel and copy the code from this blog post into the notebook to run it. Be sure to make note of the Region that the SageMaker notebook instance is launched in, because you will want to create an S3 bucket in the same Region to store the SageMaker model files that will be created later.
First, we’ll import the relevant libraries into our Amazon SageMaker notebook.
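Here is a minimal sketch of the imports this walkthrough relies on; the exact imports in the sample notebook may differ slightly.

```python
import os

import boto3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sagemaker
from sagemaker import get_execution_role
```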
Amazon SageMaker integrates seamlessly with Amazon S3. During the first step in creating the notebook, we specified an “AmazonSageMakerFullAccess” role for the notebook. That gives this notebook permission to access any Amazon S3 bucket in this AWS account with “sagemaker” in its name.
The get_execution_role function retrieves the IAM role you created at the time you created your notebook instance.
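For example:

```python
role = get_execution_role()
print(role)
```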
We can see our role is an AmazonSageMaker-ExecutionRole.
Loading the dataset
I have previously downloaded and stored the data in a public S3 bucket that you can access. You can use the Python SDK to interact with AWS using a Boto3 client.
First, start the client.
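A sketch of starting the client and listing the bucket contents follows; the bucket name is a placeholder for the public bucket referenced in the sample notebook.

```python
s3_client = boto3.client('s3')

# Placeholder: use the public data bucket name given in the sample notebook
data_bucket_name = '<public-census-data-bucket>'

obj_list = s3_client.list_objects(Bucket=data_bucket_name)
files = [contents['Key'] for contents in obj_list['Contents']]
print(files)
```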
You’ll get a list of objects that are contained within the bucket. You can see there is one file in the bucket, 'Census_Data_for_SageMaker.csv'.
Grab the data from the CSV file in the bucket.
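For example, reading the object directly into a pandas DataFrame (the variable name `counties` is illustrative):

```python
response = s3_client.get_object(Bucket=data_bucket_name,
                                Key='Census_Data_for_SageMaker.csv')
counties = pd.read_csv(response['Body'])
counties.head()
```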
This is what the first 5 rows of our data look like:
 | CensusId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | … | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment
0 | 1001 | Alabama | Autauga | 55221 | 26745 | 28476 | 2.6 | 75.8 | 18.5 | 0.4 | … | 0.5 | 1.3 | 1.8 | 26.5 | 23986 | 73.6 | 20.9 | 5.5 | 0.0 | 7.6 |
1 | 1003 | Alabama | Baldwin | 195121 | 95314 | 99807 | 4.5 | 83.1 | 9.5 | 0.6 | … | 1.0 | 1.4 | 3.9 | 26.4 | 85953 | 81.5 | 12.3 | 5.8 | 0.4 | 7.5 |
2 | 1005 | Alabama | Barbour | 26932 | 14497 | 12435 | 4.6 | 46.2 | 46.7 | 0.2 | … | 1.8 | 1.5 | 1.6 | 24.1 | 8597 | 71.8 | 20.8 | 7.3 | 0.1 | 17.6 |
3 | 1007 | Alabama | Bibb | 22604 | 12073 | 10531 | 2.2 | 74.5 | 21.4 | 0.4 | … | 0.6 | 1.5 | 0.7 | 28.8 | 8294 | 76.8 | 16.1 | 6.7 | 0.4 | 8.3 |
4 | 1009 | Alabama | Blount | 57710 | 28512 | 29198 | 8.6 | 87.9 | 1.5 | 0.3 | … | 0.9 | 0.4 | 2.3 | 34.9 | 22189 | 82.0 | 13.5 | 4.2 | 0.4 | 7.7 |
5 rows × 37 columns
Step 2: Exploratory data analysis (EDA) – Data cleaning and exploration
a. Cleaning the data
We can do simple data cleaning and processing right in our notebook instance, using the compute instance of the notebook to execute these computations.
How much data are we working with?
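A quick check of the DataFrame's dimensions (continuing the sketch above):

```python
counties.shape
```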
There are 3220 rows with 37 columns.
Let’s just drop any incomplete data to make our analysis easier. We can see that we lost 2 rows of incomplete data; we now have 3218 rows in our data.
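A sketch of that cleanup step:

```python
# Drop rows with any missing values, then re-check the dimensions
counties.dropna(inplace=True)
counties.shape
```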
Let’s combine some of the descriptive reference columns such as state and county and leave the numerical feature columns.
We can now set the ‘state-county’ as the index and the rest of the numerical features become the attributes of each unique county.
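A sketch, assuming the column names shown in the table above (CensusId, State, County):

```python
# Combine the descriptive columns into a single 'state-county' index
counties.index = counties['State'] + '-' + counties['County']

# Keep only the numerical feature columns
counties = counties.drop(['CensusId', 'State', 'County'], axis=1)
counties.head()
```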
State-County | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | … | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment
Alabama-Autauga | 55221 | 26745 | 28476 | 2.6 | 75.8 | 18.5 | 0.4 | 1.0 | 0.0 | 40725 | … | 0.5 | 1.3 | 1.8 | 26.5 | 23986 | 73.6 | 20.9 | 5.5 | 0.0 | 7.6 |
Alabama-Baldwin | 195121 | 95314 | 99807 | 4.5 | 83.1 | 9.5 | 0.6 | 0.7 | 0.0 | 147695 | … | 1.0 | 1.4 | 3.9 | 26.4 | 85953 | 81.5 | 12.3 | 5.8 | 0.4 | 7.5 |
Alabama-Barbour | 26932 | 14497 | 12435 | 4.6 | 46.2 | 46.7 | 0.2 | 0.4 | 0.0 | 20714 | … | 1.8 | 1.5 | 1.6 | 24.1 | 8597 | 71.8 | 20.8 | 7.3 | 0.1 | 17.6 |
Alabama-Bibb | 22604 | 12073 | 10531 | 2.2 | 74.5 | 21.4 | 0.4 | 0.1 | 0.0 | 17495 | … | 0.6 | 1.5 | 0.7 | 28.8 | 8294 | 76.8 | 16.1 | 6.7 | 0.4 | 8.3 |
Alabama-Blount | 57710 | 28512 | 29198 | 8.6 | 87.9 | 1.5 | 0.3 | 0.1 | 0.0 | 42345 | … | 0.9 | 0.4 | 2.3 | 34.9 | 22189 | 82.0 | 13.5 | 4.2 | 0.4 | 7.7 |
5 rows × 34 columns
b. Visualizing the data
Now we have a dataset with a mix of numerical and categorical columns. We can visualize the data for some of our numerical columns and see what the distribution looks like.
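A sketch of plotting a few of the occupation-mix columns as histograms; the column names here are assumed from the occupation categories discussed below.

```python
# Assumed column names for the occupation-mix features
for column in ['Professional', 'Service', 'Office']:
    counties[column].plot.hist(bins=30, title=column)
    plt.show()
```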
For example, from the figures above you can observe the distribution across counties of the percentage of workers in Professional, Service, or Office occupations. Viewing the histograms can visually indicate characteristics of these features, such as the mean or skew. The distribution of Professional workers, for example, reveals that the typical county has around 25-30% Professional workers, with a right skew and a long tail, topping out at almost 80% in some counties.
c. Feature engineering
Data scaling – We need to standardize the scaling of the numerical columns in order to use any distance-based analytical methods, so that we can compare the relative distances between different feature columns. We can use MinMaxScaler to transform the numerical columns so that they all fall between 0 and 1.
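A sketch using scikit-learn's MinMaxScaler (the variable name `counties_scaled` is illustrative):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
counties_scaled = pd.DataFrame(scaler.fit_transform(counties),
                               columns=counties.columns,
                               index=counties.index)
counties_scaled.describe()
```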
We can see that all of our numerical columns now have a min of 0 and a max of 1.
 | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | … | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment
count | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | … | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 | 3218.000000 |
mean | 0.009883 | 0.009866 | 0.009899 | 0.110170 | 0.756024 | 0.100942 | 0.018682 | 0.029405 | 0.006470 | 0.011540 | … | 0.046496 | 0.041154 | 0.124428 | 0.470140 | 0.009806 | 0.760810 | 0.194426 | 0.216744 | 0.029417 | 0.221775 |
std | 0.031818 | 0.031692 | 0.031948 | 0.192617 | 0.229682 | 0.166262 | 0.078748 | 0.062744 | 0.035446 | 0.033933 | … | 0.051956 | 0.042321 | 0.085301 | 0.143135 | 0.032305 | 0.132949 | 0.106923 | 0.106947 | 0.046451 | 0.112138 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | … | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.001092 | 0.001117 | 0.001069 | 0.019019 | 0.642285 | 0.005821 | 0.001086 | 0.004808 | 0.000000 | 0.001371 | … | 0.019663 | 0.023018 | 0.072581 | 0.373402 | 0.000948 | 0.697279 | 0.120861 | 0.147541 | 0.010204 | 0.150685 |
50% | 0.002571 | 0.002591 | 0.002539 | 0.039039 | 0.842685 | 0.022119 | 0.003257 | 0.012019 | 0.000000 | 0.003219 | … | 0.033708 | 0.033248 | 0.104839 | 0.462916 | 0.002234 | 0.785714 | 0.172185 | 0.188525 | 0.020408 | 0.208219 |
75% | 0.006594 | 0.006645 | 0.006556 | 0.098098 | 0.933868 | 0.111758 | 0.006515 | 0.028846 | 0.000000 | 0.008237 | … | 0.056180 | 0.048593 | 0.150538 | 0.560102 | 0.006144 | 0.853741 | 0.243377 | 0.256831 | 0.030612 | 0.271233 |
max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | … | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 34 columns
Step 3: Data modelling
a. Dimensionality reduction
We will be using principal component analysis (PCA) to reduce the dimensionality of our data. This method decomposes the data matrix into features that are orthogonal with each other. The resultant orthogonal features are linear combinations of the original feature set. You can think of this method as taking many features and combining similar or redundant features together to form a new, smaller feature set.
We can reduce dimensionality with the built-in Amazon SageMaker algorithm for PCA.
We first import and call an instance of the PCA SageMaker model. Then we specify different parameters of the model. These can be resource configuration parameters, such as how many instances to use during training, or what type of instances to use. Or they can be model computation hyperparameters, such as how many components to use when performing PCA. Documentation on the PCA model can be found here: http://sagemaker.readthedocs.io/en/latest/pca.html
You will use the tools provided by the Amazon SageMaker Python SDK to upload the data to a default bucket.
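A sketch of setting up the default bucket and the PCA estimator with the SageMaker SDK v2; the prefix, instance type, and `num_components=33` (one fewer than the 34 feature columns) are illustrative choices, not prescribed values.

```python
from sagemaker import PCA

sess = sagemaker.Session()
bucket = sess.default_bucket()   # default SageMaker bucket in this Region
prefix = 'counties'              # illustrative S3 prefix for model output

num_components = 33              # e.g., one fewer than the 34 feature columns

pca_SM = PCA(role=role,
             instance_count=1,
             instance_type='ml.c4.xlarge',
             output_path='s3://{}/{}/'.format(bucket, prefix),
             num_components=num_components)
```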
Next, we prepare data for Amazon SageMaker by extracting the numpy array from the DataFrame and explicitly casting it to float32.
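For example (continuing the sketch, with `train_data` as an illustrative name):

```python
train_data = counties_scaled.values.astype('float32')
```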
The record_set function in the Amazon SageMaker PCA model converts a numpy array into the record set format that is required for the input training data. This is a requirement for all Amazon SageMaker built-in models. The use of this data format is one of the reasons that training models within Amazon SageMaker is faster on larger data sets than other implementations of the same models, such as the sklearn implementation.
We call the fit function on our PCA model, passing in our training data, and this spins up a training instance or cluster to perform the training job.
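A sketch of wrapping the array in a record set and launching the training job:

```python
pca_SM.fit(pca_SM.record_set(train_data))
```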
b. Accessing the PCA model attributes
After the model is created, we can also access the underlying model parameters.
Now that the training job is complete, you can find the job under Jobs in the Training subsection in the Amazon SageMaker console. You can find the job name listed in the training jobs. Use that job name in the following code to specify which model to examine.
Model artifacts are stored in Amazon S3 after they have been trained. This is the same model artifact that is used to deploy a trained model using Amazon SageMaker. Since many of the Amazon SageMaker algorithms use MXNet for computational speed, the model artifact is stored in MXNet NDArray format. Under the output path that was specified during the training call, the model resides at <training job name>/output/model.tar.gz, which is a TAR archive compressed with GNU zip (gzip) compression.
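A sketch of downloading and unpacking the artifact, then loading the NDArrays with MXNet. The training job name is a placeholder to copy from the console, and the file name `model_algo-1` inside the archive is what the built-in algorithms typically produce; the orientation of `v` is an assumption to verify against your own output.

```python
import tarfile
import mxnet as mx

job_name = '<your-pca-training-job-name>'   # copy from the SageMaker console
model_key = '{}/{}/output/model.tar.gz'.format(prefix, job_name)

boto3.resource('s3').Bucket(bucket).download_file(model_key, 'model.tar.gz')
with tarfile.open('model.tar.gz') as tar:
    tar.extractall()   # typically yields a single file named 'model_algo-1'

pca_model_params = mx.ndarray.load('model_algo-1')
s = pca_model_params['s'].asnumpy()   # singular values, ordered smallest to largest
v = pca_model_params['v'].asnumpy()   # assumed: one row per feature, one column per component
```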
Three groups of model parameters are contained within the PCA model.
mean: is optional and is only available if the “subtract_mean” hyperparameter is true when calling the training step from the original PCA SageMaker function.
v: contains the principal components (same as ‘components_’ in the sklearn PCA model).
s: the singular values of the components for the PCA transformation. This does not exactly give the % variance from the original feature space, but can give the % variance from the projected feature space.
explained-variance-ratio ~= square(s) / sum(square(s))
To calculate the exact explained-variance-ratio vector if needed, it simply requires saving the sum of squares of the original data (call that N) and computing explained-variance-ratio = square(s) / N.
We can now calculate the variance explained by the largest n components that we want to keep. For this example, let’s take the top 5 components.
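A sketch of the calculation, using the singular values loaded above:

```python
# Fraction of (projected-space) variance explained by the 5 largest components
exp_var_top5 = np.square(s[-5:]).sum() / np.square(s).sum()
print(exp_var_top5)
```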
We can see that the largest 5 components explain ~72% of the total variance in our dataset:
After we have decided to keep the top 5 components, we can take only the 5 largest components from our original s and v matrix.
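For example (continuing the sketch above):

```python
s_5 = s[-5:]                      # 5 largest singular values
v_5 = pd.DataFrame(v[:, -5:])     # matching components, relabeled as columns 0-4
```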
We can now examine the makeup of each PCA component based on the weightings of the original features that are included in the component. For example, the following code shows the first component. We can see that this component describes an attribute of a county that has high poverty and unemployment, low income and income per capita, and high Hispanic/Black population and low White population.
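A sketch of inspecting a single component's weights against the original feature names:

```python
component_num = 4   # the largest component (see the note below on ordering)

comp = pd.DataFrame({'features': counties_scaled.columns,
                     'weights': v_5[component_num].values})
comp.sort_values('weights', ascending=False)
```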
Note that this is v_5[4] or last component of the list of components in v_5, but is actually the largest component because the components are ordered from smallest to largest. So v_5[0] would be the smallest component. Similarly, change the value of component_num to cycle through the makeup of each component.
Similarly, you can go through and examine the makeup of each PCA component and try to understand what the key positive and negative attributes are for each one. The following code names the components, but feel free to change the names as you gain insight into the unique makeup of each component.
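For example, names chosen after inspecting each component; these are the labels used in the tables that follow.

```python
PCA_list = ['Poverty/Unemployment',
            'Self Employment/Public Workers',
            'High Income/Professional & Office Workers',
            'Black/Native Am Populations & Public/Professional Workers',
            'Construction & Commuters']
```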
c. Deploying the PCA model
We can now deploy this model endpoint and use it to make predictions. This model is now live and hosted on an instance_type that we specify.
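A sketch of deploying the trained PCA model to a real-time endpoint (the instance type is only an example):

```python
pca_predictor = pca_SM.deploy(initial_instance_count=1,
                              instance_type='ml.t2.medium')
```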
We can also pass our original dataset to the model so that we can transform the data using the model we created. Then we can take the largest 5 components and this will reduce the dimensionality of our data from 34 to 5.
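A sketch of transforming the data through the endpoint and keeping the 5 largest components. Parsing the response via `r.label['projection']` and the column ordering applied here are assumptions; verify both against your own component inspection.

```python
result = pca_predictor.predict(train_data)

counties_transformed = pd.DataFrame(
    [r.label['projection'].float32_tensor.values for r in result])

# Keep the 5 largest components, largest first, and label them with the names above
# (ordering is illustrative; match it to how you interpreted each component)
counties_transformed = counties_transformed.iloc[:, -5:].iloc[:, ::-1]
counties_transformed.index = counties_scaled.index
counties_transformed.columns = PCA_list
counties_transformed.head()
```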
Now we have created a dataset where each county is described by the 5 principal components that we analyzed earlier. Each of these 5 components is a linear combination of the original feature space. We can interpret each of these 5 components by analyzing the makeup of the component shown previously.
State-County | Poverty/Unemployment | Self Employment/Public Workers | High Income/Professional & Office Workers | Black/Native Am Populations & Public/Professional Workers | Construction & Commuters
Alabama-Autauga | -0.010824 | 0.120480 | -0.088356 | 0.160527 | -0.060274 |
Alabama-Baldwin | -0.068677 | -0.023092 | -0.145743 | 0.185969 | -0.149684 |
Alabama-Barbour | 0.093111 | 0.297829 | 0.146258 | 0.296662 | 0.506202 |
Alabama-Bibb | 0.283526 | 0.011757 | 0.224402 | 0.190861 | 0.069224 |
Alabama-Blount | 0.100738 | -0.193824 | 0.022714 | 0.254403 | -0.091030 |
d. Population segmentation using unsupervised clustering
Now, we’ll use the Kmeans algorithm to segment the population of counties by the 5 PCA attributes we have created. Kmeans is a clustering algorithm that identifies clusters of similar counties based on their attributes. Since we have ~3000 counties and 34 attributes in our original dataset, the large feature space may have made it difficult to cluster the counties effectively. Instead, we have reduced the feature space to 5 PCA components, and we’ll cluster on this transformed dataset.
First, we call and define the hyperparameters of our KMeans model as we have done with our PCA model. The Kmeans algorithm allows the user to specify how many clusters to identify. In this instance, let’s try to find the top 7 clusters from our dataset.
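A sketch of defining the KMeans estimator with k=7; the instance type and output prefix are examples, not prescribed values.

```python
from sagemaker import KMeans

num_clusters = 7
kmeans = KMeans(role=role,
                instance_count=1,
                instance_type='ml.c4.xlarge',
                output_path='s3://{}/{}/'.format(bucket, prefix),
                k=num_clusters)
```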
Then we train the model on our training data.
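For example, continuing the sketch:

```python
train_data_kmeans = counties_transformed.values.astype('float32')
kmeans.fit(kmeans.record_set(train_data_kmeans))
```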
Now we deploy the model and we can pass in the original training set to get the labels for each entry. This will give us which cluster each county belongs to.
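A sketch of deploying the KMeans model and extracting a cluster label for each county; parsing the response via `r.label['closest_cluster']` is an assumption about the built-in algorithm's response format.

```python
kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.t2.medium')

result = kmeans_predictor.predict(train_data_kmeans)
cluster_labels = [r.label['closest_cluster'].float32_tensor.values[0] for r in result]
```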
We can see the breakdown of cluster counts and the distribution of clusters.
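For example, continuing the sketch:

```python
pd.DataFrame(cluster_labels)[0].value_counts()
```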
However, to improve explainability, we need to access the underlying model to get the cluster centers. These centers will help describe which features characterize each cluster.
Step 4: Drawing conclusions from our modelling
Explaining the result of the modelling is an important step in making use of our analysis. By combining PCA and Kmeans, and the information contained in the model attributes within an Amazon SageMaker trained model, we can form concrete conclusions based on the data.
a. Accessing the KMeans model attributes
First, we will go into the bucket where the kmeans model is stored and extract it.
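A sketch mirroring the PCA artifact steps above; the training job name is a placeholder to copy from the console.

```python
kmeans_job_name = '<your-kmeans-training-job-name>'   # copy from the SageMaker console
model_key = '{}/{}/output/model.tar.gz'.format(prefix, kmeans_job_name)

boto3.resource('s3').Bucket(bucket).download_file(model_key, 'kmeans_model.tar.gz')
with tarfile.open('kmeans_model.tar.gz') as tar:
    tar.extractall()   # again assumed to yield a file named 'model_algo-1'

kmeans_model_params = mx.ndarray.load('model_algo-1')
```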
Cluster Centroid Locations: The location of the centers of each cluster identified by the Kmeans algorithm. The cluster location is given in our PCA transformed space with 5 components, since we passed the transformed PCA data into the model.
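A sketch of turning the loaded parameters into a labeled DataFrame; it assumes the first NDArray in the loaded list holds the k x 5 centroid matrix.

```python
cluster_centroids = pd.DataFrame(kmeans_model_params[0].asnumpy())
cluster_centroids.columns = counties_transformed.columns
cluster_centroids
```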
Cluster | Poverty/Unemployment | Self Employment/Public Workers | High Income/Professional & Office Workers | Black/Native Am Populations & Public/Professional Workers | Construction & Commuters
0 | 0.025268 | 0.018374 | 0.040258 | -0.356033 | 0.317980 |
1 | -0.023942 | -0.369325 | -0.181284 | -0.240269 | 1.045534 |
2 | -0.019848 | 0.059698 | -0.354655 | 0.047747 | -0.064816 |
3 | -0.012187 | -0.063721 | -0.007848 | 0.035607 | -0.248565 |
4 | 0.103763 | 0.269017 | 0.084220 | 0.241475 | 0.390270 |
5 | -0.029940 | -0.054422 | 0.132807 | 0.086195 | -0.020860 |
6 | 0.029929 | 0.110788 | 0.084239 | -0.411732 | -0.217632 |
We can plot a heatmap of the centroids and their location in the transformed feature space. This gives us insight into what characteristics define each cluster. Often with unsupervised learning, results are hard to interpret. This is one way to make use of the results of PCA plus clustering techniques together. Since we were able to examine the makeup of each PCA component, we can understand what each centroid represents in terms of the PCA components that we interpreted previously.
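A sketch using seaborn:

```python
import seaborn as sns

plt.figure(figsize=(12, 6))
ax = sns.heatmap(cluster_centroids.T, cmap='YlGnBu')   # rows: PCA attributes, columns: clusters
ax.set_xlabel('Cluster')
plt.yticks(rotation=0)
plt.show()
```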
For example, we can see that cluster 1 has the highest value in the “Construction & Commuters” attribute while it has the lowest value in the “Self Employment/Public Workers” attribute compared with other clusters. Similarly, cluster 4 has high values in “Construction & Commuters,” “High Income/Professional & Office Workers,” and “Self Employment/Public Workers.”
We can also map the cluster labels back to each individual county and examine which counties were naturally grouped together.
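For example, continuing the sketch:

```python
counties_transformed['labels'] = list(map(int, cluster_labels))
counties_transformed.head()
```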
State-County | Poverty/Unemployment | Self Employment/Public Workers | High Income/Professional & Office Workers | Black/Native Am Populations & Public/Professional Workers | Construction & Commuters | labels
Alabama-Autauga | -0.010824 | 0.120480 | -0.088356 | 0.160527 | -0.060274 | 5 |
Alabama-Baldwin | -0.068677 | -0.023092 | -0.145743 | 0.185969 | -0.149684 | 3 |
Alabama-Barbour | 0.093111 | 0.297829 | 0.146258 | 0.296662 | 0.506202 | 4 |
Alabama-Bibb | 0.283526 | 0.011757 | 0.224402 | 0.190861 | 0.069224 | 5 |
Alabama-Blount | 0.100738 | -0.193824 | 0.022714 | 0.254403 | -0.091030 | 5 |
Now, we can examine one of the clusters in more detail, like cluster 1 for example. A cursory glance at the location of the centroid tells us that it has the highest value for the “Construction & Commuters” attribute. We can now see which counties fit that description.
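For example, continuing the sketch:

```python
cluster_1 = counties_transformed[counties_transformed['labels'] == 1]
cluster_1.head()
```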
State-County | Poverty/Unemployment | Self Employment/Public Workers | High Income/Professional & Office Workers | Black/Native Am Populations & Public/Professional Workers | Construction & Commuters | labels
Arizona-Santa Cruz | -0.014149 | -0.347113 | -0.386305 | -0.284937 | 0.753071 | 1 |
Arizona-Yuma | -0.019377 | -0.260098 | -0.200252 | -0.188408 | 0.585572 | 1 |
California-Fresno | 0.016950 | -0.198805 | -0.260822 | -0.090927 | 0.590060 | 1 |
California-Imperial | -0.015831 | -0.291125 | -0.296619 | -0.279273 | 0.885126 | 1 |
California-Merced | 0.170347 | -0.304941 | -0.154338 | -0.072953 | 0.644423 | 1 |
Conclusion
You have just walked through a data science workflow for unsupervised learning, specifically clustering a dataset using KMeans after reducing the dimensionality using PCA. By accessing the underlying models created within Amazon SageMaker, we were able to improve the explainability of our modelling and draw actionable conclusions. Using these techniques, we have been able to better understand the essential characteristics of different counties in the US and segment the electorate into groupings accordingly.
Because endpoints are persistent, let’s delete our endpoints now that we’re done to avoid any excess charges on our AWS bill.
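A sketch of the cleanup, assuming the predictor variables created earlier in the walkthrough:

```python
# Delete the hosted endpoints to stop incurring charges
pca_predictor.delete_endpoint()
kmeans_predictor.delete_endpoint()
```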
About the Author
Han Man is a Data Scientist with AWS Professional Services. He has a PhD in engineering from Northwestern University and has several years of experience as a management consultant advising clients in manufacturing, financial services, and energy. Today he is passionately working with customers from a variety of industries to develop and implement machine learning & AI solutions on AWS. He enjoys following the NBA and playing basketball in his spare time.
Eitan Sela is a Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Eitan also helps customers build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.