AWS Big Data Blog

Building a Binary Classification Model with Amazon Machine Learning and Amazon Redshift

Guy Ernest is a Solutions Architect with AWS

This post builds on Guy’s earlier posts Building a Numeric Regression Model with Amazon Machine Learning and Building a Multi-Class ML Model with Amazon Machine Learning.

Many decisions in life are binary, answered either Yes or No. Many business problems also have binary answers. For example: “Is this transaction fraudulent?”, “Is this customer going to buy that product?”, or “Is this user going to churn?” In machine learning, this is called a binary classification problem. Many business decisions can be enhanced by accurately predicting the answer to a binary question. Amazon Machine Learning (Amazon ML) provides a simple and low-cost option to answer some of these questions at speed and scale.

Like the previous posts (Numeric Regression and Multiclass Classification), this post uses a publicly available example from Kaggle. This time, you will use the Click-Through Rate Prediction example, which is from the online advertising field. In this example, you will predict the likelihood that a specific user will click on a specific ad.

Preparing the data to build the machine learning model

You’ll be getting the data for building the model from the competition site, but to make it more realistic, you will use Amazon Redshift as an intermediary. In many cases, historical event data required to build a machine learning model is already stored in the data warehouse. Amazon ML integrates with Amazon Redshift to allow you to query relevant event data and perform aggregation, join, or manipulation operations to prepare the data to train the machine learning model. You will see some examples for these operations in this post.

To be able to follow through this exercise, you need an AWS account, Kaggle account (to download the data set), Amazon Redshift cluster, and SQL client. If you don’t already have an Amazon Redshift cluster, you can get a two-month free trial for a dw2.large single-node cluster, which you can use for this demo.

Setting up an Amazon Redshift cluster

In the AWS Management Console, in the Supported Regions list, choose US East (N. Virginia), and then Amazon Redshift in the Database section. Choose Launch Cluster.

On the Cluster Details page, provide a name for the cluster (for example, ml-demo) and for the database (for example, dev), and then provide the master user name and a password.

Naming the cluster

On the Node Configuration page, define the layout of the cluster. For the amount of data in this example, a single dc1.large node is sufficient (and fits into the Amazon Redshift free tier).

Choose Continue, and on the following page review the settings and choose Launch Cluster. After a few minutes, the cluster is available. Choose the cluster name to see its configuration.

 

For now, you need to note the Endpoint value, to be able to connect to the cluster and ingest the data downloaded from the Kaggle site.

Downloading and storing the data

Download the training file from the competition site, and then upload it to Amazon Simple Storage Service (Amazon S3), using the AWS CLI to handle the large file upload in parts.

# Download the train data from:
http://www.kaggle.com/c/avazu-ctr-prediction/download/train.csv.gz
# upload the file to S3
aws s3 cp train.csv.gz s3:///click_thru/input/

You can use various SQL clients to connect to the cluster, such as SQL-Workbench or Aginity Workbench, or you can connect with psql in a terminal from a Linux-based EC2 instance.

ssh -i .pem ec2-user@ec2-.eu-west-1.compute.amazonaws.com
psql -h ml-demo..us-east-1.redshift.amazonaws.com -U  -d dev -p 5439
psql -h ml-demo.<CLUSTER_ID>.us-east-1.redshift.amazonaws.com -U <USER_NAME> -d dev -p 5439

From your SQL client, create a table to store the events from the competition site. Be sure to use the right data type for each column.

CREATE TABLE click_train (
  id varchar(25) not null,
  click boolean,
  -- the format is YYMMDDHH but defined it as string
  hour char(8),
  C1 varchar(20),
  banner_pos smallint,
  site_id varchar(10),
  site_domain varchar(10),
  site_category varchar(10),
  app_id varchar(10),
  app_domain varchar(10),
  app_category varchar(10),
  device_id varchar(10),
  device_ip varchar(10),
  device_model varchar(10),
  device_type integer,
  device_conn_type integer,
  C14 integer,
  C15 integer,
  C16 integer,
  C17 integer,
  C18 integer,
  C19 integer,
  C20 integer,
  C21 integer
);

In the SQL client, use the COPY command to copy the events into the cluster:

COPY click_train FROM 's3:///input/click_thru/train.csv.gz'
credentials 'aws_access_key_id=;aws_secret_access_key=' 
GZIP
DELIMITER ','
IGNOREHEADER 1;

If everything went okay, you should see that you have more than 40 million records, by using the following SELECT query:

dev=# SELECT count(*) FROM click_train;
  count
----------
 40428967
(1 row)

Building a machine learning model from data in Amazon Redshift

In the previous blog posts, you built machine learning models from data files in S3. Data files could also originate from SQL dumps from a database. Because using SQL dumps is common, Amazon ML integrates directly with two popular database sources, Amazon Relational Database Service (Amazon RDS) and Amazon Redshift. This integration makes it easier to train machine learning models directly on “live” data by speeding up the process of data ingestion.

To build an ML model from data in Amazon Redshift, allow Amazon ML to connect to Amazon Redshift, run the UNLOAD command of the relevant query to Amazon S3, and then start the training phase.

Note: Since the original post was published, the service has been simplified and many of the steps (IAM role creation, data shuffling as a pre-process, and binary values casting) were replaced by service logic. The following section reflects this change.

In the Amazon Machine Learning console, choose Create new… Datasource and ML model.

On the Data Input page, choose Redshift, and fill in the information of cluster name, database name, user name, and your password. You can create a new IAM role to allow Amazon ML to access Redshift and S3 on your behalf, or you can use an existing role for it. You also need to specify the SELECT query to use (included below) and the name of the S3 bucket and folder to be used as the staging location.

In the SQL query, you can use the full richness of SQL to JOIN tables, calculate statistics on WINDOW functions, and parse text or date fields. For this simple example, you will use a simple SELECT *:

SELECT * FROM click_train;

In the Amazon ML wizard, on the Schema page, you can see the definition of the schema that Amazon identifies automatically from the data. At this stage, it’s best to review the proposed values for each one of the attributes, and change the numeric values that represent category ID to ‘Categorical’ instead.

On the Target page, choose the click field as the target.

Continue with the wizard to define the row ID (id field). When you reach the Review page, choose the default settings to create the machine learning model. By default, Amazon ML splits the data so that 70% is used for model training and 30% is used for model evaluation.

Building the data source, ML model, and evaluation can take some time because there are so many records to process. You can monitor progress in the Amazon ML dashboard.

 

On the dashboard, you can see that the original datasource that you created is already “In progress,” splitting the datasource to 70% for training and 30% for evaluation is also “In progress,” and ML model creation and evaluation are ‘Pending’, waiting on the completion of datasource creation. After the process is done, check the model evaluation.

Evaluating the accuracy of the machine learning model

In the previous two blog posts, you saw how Amazon ML provides a prediction accuracy metric (a single number) and a graph that reports the accuracy of a model. For the numeric regression example, we reviewed the root-mean-square-error (RMSE) metric and the error distribution accuracy graph, and for the multiclass classification example, we reviewed the F1 score and the confusion matrix.

In this Binary Classification case, the prediction accuracy metric is called AUC (Area-Under-the-Curve). You can read more about the meaning of this overall score in Amazon ML documentation. In this case, the score is 0.74:

For more insight, see the performance visualization, where the effect of choosing the cut-off score can be better understood. The prediction score for each record is a numeric value between 0 and 1. The closer to 1, the more likely it should be set to Yes, and vice versa for No. With the combination of the cut-off score, the record can fall into one out of four cases:

  • true-positive (TP) – correctly classified as Yes
  • true-negative (TN) – correctly classified as No
  • false-positive (FP) – wrongly classified as Yes
  • false-negative (FN) – wrongly classified as No

If you set the cut-off score closer to 1, fewer records will be classified as false-positives, but at the same time more records will be classified as false-negatives. Where you set the cut-off value is a business decision. If each false-positive costs a penalty ($1 to present an ad, for example), you might set the value higher, but if each false-negative causes you to miss a big sale ($1,000 commission on a luxury car purchase, for example), you might set the value lower.

You can slide the cut-off value on the graph left or right. Sliding it to the left decreases the value, as well as the number of false-positive mistakes, but increases the number of false-negative mistakes. Increasing the value of the cut-off has the opposite effect. You can also use the four slides under Advance metrics on the bottom of the graph to control different aspects of the cut-off value. In line with the “No free lunch” theorem, modifying one slide modifies the values of the other slides.

  • Accuracy – The ratio of all correct classifications (yes and no) out of all predictions. This is a balanced view of both types of mistakes.
  • False Positive Rate – The ratio of actual negative that are predicted as positive out of all negative cases.
  • Precision – The ratio of correct positive classifications (yes) out of all positive predictions. This views the problem of saying yes too often (a waste of a costly bid or annoying users with irrelevant pop ups). In other words, how precise are you when you decide to send something to a person or how precisely do you spend your marketing budget? For more information and illustrations (such as the one below), see the Wikipedia entry on Precision and recall.

  • Recall – The ratio of correct positive classifications (yes) out of all positive records. This views the problem of saying no too often (loss of sale opportunities). In other words, how many of the people that you want to recall that you actually get. Recall of 0.06, in the example above, means that only 6% of the people that you want to see the ad (as they would click it) will see it.

For example, set the Recall slide to 0.5 to see what will happen if you want to make sure that you get to at least 50% of your target users with each ad.

As you can see, the accuracy didn’t drop dramatically (from 0.83 to 0.74), but the precision decreased significantly (from 0.6 to 0.33), which means that only 1 out of 3 will click the ad compared to 2 out of 3 in the previous settings. These changes are caused by changing the cut-off value, with no changes or improvements to the model itself.

You can improve a model by creating new datasources from Amazon Redshift that include additional relevant information, such as the customer’s week day and time based on the IP address (which is missing from this data set, but is usually available in real-life data), or whether it is morning, day, evening, or night. Here are some examples of manipulations you can play with when you use SELECT queries on the data from your Amazon Redshift datasource:

SELECT
    id,
    click::int,
    -- Calculating the date of the week from the Hour string
    date_part(dow, TO_DATE (hour, 'YYMMDDHH')) as dow,
    -- Creating bins of the hours of the day based on common behaviour
    case
        when RIGHT(Hour,2) >= '00' and RIGHT (Hour,2) <= '05' then 'Night'
        when RIGHT(Hour,2) >= '06' and RIGHT (Hour,2) <= '11' then 'Morning'
        when RIGHT(Hour,2) >= '12' and RIGHT (Hour,2) <= '17' then 'Afternoon'
        when RIGHT(Hour,2) >= '18' and RIGHT (Hour,2) <= '23' then 'Evening'
        else 'Unknown'
    end
        as day_period
...

Other types of information can include data about your users that can be deduced from clickstream analysis, such as gender or age, and other analytical queries that use JOIN statements on data from other tables in the Amazon Redshift data warehouse.

Summary

In this post, you read about when and how to use the binary classification type of machine learning model offered by Amazon ML. You learned how to use Amazon Redshift as the datasource for training data, select the data, cast the target data type to int to trigger binary classification, and use the RANDOM function to shuffle the data.

You also read about how to evaluate a binary classification model, including reviewing accuracy, precision, and recall metrics. This knowledge could help you build, evaluate, and modify binary classification models to solve your business problems.

If you have questions or suggestions, please leave a comment below.

—————————————————————-

Love to work on open source? Check out our careers page.

—————————————————————-

Do more with Amazon Machine Learning:

Building a Numeric Regression Model with Amazon Machine Learning