AWS Machine Learning Blog

Amazon SageMaker adds Scikit-Learn support

Amazon SageMaker now comes pre-configured with the Scikit-Learn machine learning library in a Docker container. Scikit-Learn is a popular choice for data scientists and developers because it provides efficient tools for data analysis and high-quality implementations of popular machine learning algorithms through a consistent Python interface and well-documented APIs. Scikit-Learn executes quickly and scales to most datasets and problems, making it an ideal choice when you need to iterate quickly on a machine learning problem. Unlike deep learning frameworks such as TensorFlow or MXNet, Scikit-Learn focuses on classical machine learning and data analysis. You can select from a range of supervised and unsupervised learning algorithms for clustering, regression, classification, dimensionality reduction, feature preprocessing, and model selection.

The newly added Scikit-Learn library is available in the Amazon SageMaker Python SDK. You can write your Scikit-Learn script and use the Amazon SageMaker training capabilities, including automatic model tuning. Once your model is trained, you can deploy it to a highly available endpoint that auto-scales to serve real-time predictions with low latency. You can also use the same model in large-scale batch transform jobs.

In this blog post, I show you how to use the pre-built Scikit-Learn library in Amazon SageMaker to build, train, and deploy a multi-class classification model.

Training and deploying a Scikit-Learn model

In this example, we’re going to train a decision tree classifier on the IRIS dataset. This example is based on the Scikit-Learn Decision Tree Classification example, and the full Amazon SageMaker notebook is available to try out; we’ll highlight the most important pieces here. The dataset contains 50 samples from each of three species of Iris flower and is commonly used to demonstrate machine learning techniques. The goal is to predict which of the three species a flower belongs to, based on a number of different properties (petal length, petal width, and so on). While we’re using decision trees to solve this problem, Scikit-Learn offers a number of other algorithms you could use instead.

Entry point script

The first step is to write the Scikit-Learn script. The script begins with the imports it needs; then, in the main guard, we use an argument parser to read the hyperparameters that we pass to our Amazon SageMaker Estimator when creating the training job. These hyperparameters are made available as arguments to our entry point script in the training container. In this example, we look for the maximum number of leaf nodes. We also parse a number of Amazon SageMaker-specific environment variables to get information about the training environment, such as the location of the input data and the location where we want to save the model.

import argparse
import os

import joblib  # older versions of Scikit-Learn imported this as "from sklearn.externals import joblib"
import pandas as pd
from sklearn import tree

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here. In this simple example we are just including one hyperparameter.
    parser.add_argument('--max_leaf_nodes', type=int, default=-1)

    # SageMaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

After we’ve parsed the arguments, we load the dataset. For this example, we read all of the CSV files in the training channel using the Pandas library.

    # Take the set of files and read them all into a single pandas dataframe
    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    train_data = pd.concat(raw_data)

In this example, we assume that the label (which species the flower belongs to) is stored in the first column. We separate our features and the label into two separate data frames.

    # labels are in the first column
    train_y = train_data.iloc[:, 0]
    train_X = train_data.iloc[:, 1:]

Now we’re ready to train our model. This is as simple as creating the right classifier and calling fit. A key benefit of Scikit-Learn is the simplicity and consistency of the interface that each algorithm exposes. If your model needs preprocessing of features or calculation of validation scores, you can also do those things in this step.

    # We determine the number of leaf nodes using the hyper-parameter above.
    max_leaf_nodes = args.max_leaf_nodes

    # Now use scikit-learn's decision tree classifier to train the model.
    clf = tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    clf = clf.fit(train_X, train_y)
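
For example, if we wanted to scale the features and compute a validation score in this step, a minimal, illustrative sketch (not part of the script used in this post) might look like the following:

    # Illustrative only: add feature scaling and compute a 5-fold cross-validation score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('tree', tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes))
    ])
    scores = cross_val_score(pipeline, train_X, train_y, cv=5)
    print('validation accuracy: {:.3f}'.format(scores.mean()))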

Finally, we save our model.

    # Save the decision tree model.
    joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))

Amazon SageMaker notebook – Setup

Now that we’ve written our Scikit-Learn script, we can run it on Amazon SageMaker using the pre-built Scikit-Learn container. We’re going to use a hosted Jupyter notebook to orchestrate the training process. Feel free to follow along interactively by running the notebook.

First, we set up the Amazon S3 bucket for storing the data and the model, as well as the AWS Identity and Access Management (IAM) role for the data and Amazon SageMaker permissions.

import sagemaker

# S3 prefix
prefix = 'scikit-iris'

# Get a SageMaker-compatible role used by this Notebook Instance.
role = sagemaker.get_execution_role()

Now we’ll import the Python libraries we’ll need and create an Amazon SageMaker session.

from sagemaker.sklearn.estimator import SKLearn

sagemaker_session = sagemaker.Session()

Next, we’ll download our dataset and upload it to Amazon S3. In this example, we use Scikit-Learn locally in our notebook since it provides convenience functions to download the IRIS dataset.

import numpy as np
import os
from sklearn import datasets

# Load Iris dataset, then join labels and features together
iris = datasets.load_iris()
joined_iris = np.insert(iris.data, 0, iris.target, axis=1)


# Create a temporary directory and write the dataset as CSV
os.makedirs('./data', exist_ok=True)
np.savetxt('./data/iris.csv', joined_iris, delimiter=',', fmt='%1.1f, %1.3f, %1.3f, %1.3f, %1.3f')

# Upload the dataset to S3.
train_input = sagemaker_session.upload_data('data', key_prefix="{}/{}".format(prefix, 'data'))

Amazon SageMaker notebook – Training

Now that we’ve prepared the training data and our Scikit-Learn script (which we’ll name scikit_learn_iris.py), the SKLearn class in the SageMaker Python SDK allows us to run that script as a training job on the Amazon SageMaker managed training infrastructure. We also pass the estimator our IAM role, the type of instance we want to use, and a dictionary of the hyperparameters that we want to pass to our script. Because Scikit-Learn runs on a single machine and does not use GPUs, the estimator only supports a training instance count of 1, and GPU instance types are not supported.

sklearn = SKLearn(
    entry_point='scikit_learn_iris.py',
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters={'max_leaf_nodes': 30})

After we’ve constructed our SKLearn estimator, we can fit it by passing in the data we uploaded to Amazon S3. Amazon SageMaker makes sure our data is available in the local filesystem of the training cluster, so our Scikit-Learn script can simply read the data from disk.

sklearn.fit({'train': train_input})
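
The same estimator can also be handed to the automatic model tuning capability mentioned earlier. The sketch below is illustrative only: it assumes the training script prints a line such as "validation accuracy: 0.97" (for example, as in the optional cross-validation sketch shown earlier) that the regular expression can capture, which the simple script in this post does not do by default.

# Illustrative only: tune max_leaf_nodes with SageMaker automatic model tuning
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator=sklearn,
    objective_metric_name='validation:accuracy',
    metric_definitions=[{'Name': 'validation:accuracy',
                         'Regex': 'validation accuracy: ([0-9\\.]+)'}],
    hyperparameter_ranges={'max_leaf_nodes': IntegerParameter(2, 100)},
    max_jobs=9,
    max_parallel_jobs=3)

tuner.fit({'train': train_input})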

Amazon SageMaker notebook – Deployment

After training, we can use the SKLearn estimator to create an Amazon SageMaker endpoint – a hosted and managed prediction service that we can use to perform inference.

To do this, our scikit_learn_iris.py script needs a model_fn() function that loads the saved model so that it can be used to make predictions.

def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

You can also optionally define other functions to customize how the input request is deserialized (input_fn()), how predictions are made (predict_fn()), and how the predictions are serialized (output_fn()). The defaults work for our current use case, so we don’t need to define them.
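
For example, if you wanted the endpoint to return class probabilities instead of predicted labels, a minimal, illustrative predict_fn() in the same script could look like this:

# Illustrative only: a custom predict_fn() that returns class probabilities instead of labels
def predict_fn(input_data, model):
    return model.predict_proba(input_data)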

For more details on the default implementations to customize the behavior, see the SageMaker Scikit-Learn Container GitHub Repository.

Now we can use our SKLearn estimator in the notebook to deploy the model. The deploy() function lets us set the number and type of instances used for the prediction endpoint; these don’t need to be the same values we used for the training job. Here we deploy the model to a single ml.m4.xlarge instance, but you can also deploy to multiple instances and set up Auto Scaling for the endpoint.

predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

Amazon SageMaker notebook – Prediction and evaluation

Now we can use this predictor to classify flowers from the IRIS dataset. Ideally, the evaluation dataset should be different from the training dataset, but for this example we’ll just reuse part of the training dataset to invoke the endpoint.

First, we create a test dataset.

import itertools
import pandas as pd

shape = pd.read_csv("data/iris.csv", header=None)

# Select the last 10 rows of each of the three classes (each class has 50 consecutive rows)
a = [50*i for i in range(3)]
b = [40+i for i in range(10)]
indices = [i+j for i,j in itertools.product(a,b)]

test_data = shape.iloc[indices[:-1]]
test_X = test_data.iloc[:,1:]
test_y = test_data.iloc[:,0]

Now we can use the endpoint to make predictions by calling the predict function with our features.

predictor.predict(test_X.values)
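
To turn this into a simple evaluation, we can compare the predictions with the labels we set aside. This sketch is illustrative and assumes the predictor returns a NumPy array of predicted labels, which is the default behavior for the SKLearn predictor:

# Illustrative only: compute a simple accuracy number against the held-out labels
predictions = predictor.predict(test_X.values)
accuracy = (predictions == test_y.values).mean()
print('accuracy: {:.3f}'.format(accuracy))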

Amazon SageMaker notebook – Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instances associated with it.

sklearn.delete_endpoint()

Amazon SageMaker notebook – Batch Transform

Amazon SageMaker also provides Batch Transform, a managed service for running large-scale inference against your trained model on a full dataset. It’s ideal for scenarios where you’re dealing with large batches of data, you don’t need sub-second latency, or you need to preprocess and transform the input data before getting predictions.

First, we create a transformer and specify the number and type of instances we want to use for the job. In this case, we specify two instances, but you can scale this number with the size of your dataset to reduce the time the job takes.

# Define a SKLearn Transformer from the trained SKLearn Estimator
transformer = sklearn.transformer(instance_count=2, instance_type='ml.m4.xlarge')

Now we start the transform job by providing the Amazon S3 location of our input data. The notebook example includes the steps followed to upload the dataset.
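
As a rough, illustrative sketch of what that upload might look like (the file name and prefix below are hypothetical; the notebook has the exact steps), we could write the feature columns, without labels, to a CSV file and upload it with the same session:

# Illustrative only: write the features (no labels) to CSV and upload them for batch scoring
test_X.to_csv('./data/batch_input.csv', header=False, index=False)
batch_input_s3 = sagemaker_session.upload_data('data/batch_input.csv',
                                               key_prefix="{}/{}".format(prefix, 'batch_input'))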

# Start a transform job and wait for it to finish
transformer.transform(batch_input_s3, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

After the transform job has completed, we can download the output data from Amazon S3. For every input file we had, we will have a corresponding output file.

# Download the output data from S3 to local filesystem
batch_output = transformer.output_path
!mkdir -p batch_data
!aws s3 cp --recursive $batch_output/ batch_data/

Conclusion

In this blog post, we showed you how to use the Amazon SageMaker built-in Scikit-Learn container to train and deploy a model on the IRIS dataset. But that’s just the beginning: Scikit-Learn offers a large selection of algorithms and transformers that you can use for your own machine learning use cases, and Amazon SageMaker lets you run your Scikit-Learn scripts while working seamlessly with its training, automatic model tuning, and deployment capabilities.

Citations

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


About the Author

Laurence Rouesnel is the Algorithms & Platforms Group Manager in Amazon AI Labs. He leads a team of engineers and scientists working on deep learning and machine learning research and products. In his spare time he is an avid traveler, and loves the outdoors whether it’s hiking, skiing, or windsurfing.

Eric Kim is an engineer in the Algorithms & Platforms Group of Amazon AI Labs. He helps support the AWS service SageMaker, and has experience in machine learning research, development, and application. Outside of work, he is an avid music lover and a fan of all dogs.