AWS Machine Learning Blog

Build PMML-based Applications and Generate Predictions in AWS

If you build machine learning (ML) models, you know that a key challenge is exporting them from one framework and importing them into another so that model generation and prediction can be handled separately. Many applications use PMML (Predictive Model Markup Language) to move ML models between frameworks. PMML is an XML representation of a data mining model.

In this post, I show how to build a PMML application on AWS. First, you build a PMML model in Apache Spark using Amazon EMR. You then import the model using AWS Lambda with JPMML, a PMML producer and consumer library for the Java platform. With this approach, you can export PMML models to an Amazon S3 bucket using a Spark application, and terminate the Amazon EMR cluster as soon as the export completes. You then use AWS Lambda to import the PMML model and generate predictions. The example uses the Iris dataset from UC Irvine, which contains three classes, one for each species of iris, with 50 instances per class.

For a list of PMML producer and consumer software, see the PMML Powered section of the Data Mining Group (DMG) website.

PMML application overview

The PMML application uses the MLlib K-means clustering algorithm in Apache Spark. MLlib K-means clustering is an unsupervised learning algorithm that tries to cluster data based on similarity. In K-means clustering, you have to specify the number of clusters that you want to group the data into. The Iris dataset contains three species, so you will configure the algorithm to group the data into three clusters. You then train the model.

Next, you export the model to a PMML document, which is stored in an S3 bucket. Spark has multiple options for exporting PMML. Starting with Spark 1.4, MLlib models that mix in the org.apache.spark.mllib.pmml.PMMLExportable trait can be exported directly. With Spark ML, you can also use the JPMML libraries to export PMML models.

Note: Spark doesn’t support exporting all models to PMML. For a list of models that can be exported, see PMML model export – RDD-based API.

Finally, you import the generated PMML model into AWS Lambda. Lambda allows you to build a cost-effective PMML application without provisioning or managing servers. You can also set up your application so that actions in other AWS services trigger it automatically, or call it directly from any web or mobile app.

Prerequisites

To build the Spark Scala application and the AWS Lambda function for Java, you need an AWS account, an S3 bucket for the build artifacts and data, Git to clone the demo project, sbt to build the Spark application, and Apache Maven with a Java 8 JDK to build the Lambda function.

Set up the EMR cluster and export the PMML document

The emr-pmml-demo GitHub project contains the Spark and Lambda code that you need for this post. The Spark application reads the Iris dataset, trains a model, and exports the model to a PMML document.

To set up the EMR cluster and export the model to a PMML document

  1. Clone the emr-pmml-demo GitHub project:
    git clone https://github.com/awslabs/emr-pmml-demo.git 

    This creates a directory called emr-pmml-demo, which contains all of the source code and files for the PMML application.

  2. To see what the Scala file contains, in the SparkMLTest directory, open the file:
    src/main/scala/com/aws/sparkml/sample/Main.scala
    The source code reads the Iris dataset, trains a K-means model, and then exports the generated model to a PMML document using the toPMML function, as follows:
package com.aws.sparkml.sample

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object Main {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).setAppName("spark-pmml-demo")
    val sc = new SparkContext(conf)
    val inputFile: String = args(0)   // location of iris.data
    val outputFile: String = args(1)  // S3 location for the PMML document

    // Each line of iris.data looks like "5.1,3.5,1.4,0.2,Iris-setosa".
    // Drop the species label and parse the four numeric features into vectors.
    val irisData = sc.textFile(inputFile)
    val parsedData = irisData
      .map(_.split(",").dropRight(1).map(_.toDouble))
      .map(Vectors.dense(_))
      .cache()

    // The dataset contains three species, so group the data into three clusters
    val numClusters = 3
    val numIterations = 40
    val clusters = KMeans.train(parsedData, numClusters, numIterations)

    // Export the trained model as a PMML document to the output location
    clusters.toPMML(sc, outputFile)
  }
}
  3. The SparkMLTest directory contains the .sbt build file. Create a JAR (Java ARchive) file by running the following command:
    sbt assembly

    It might take a while to download the libraries. After the JAR file has been built, upload it to an S3 bucket of your choice.

  4. Open the Amazon EMR console, and then choose Create cluster. Choose Advanced Options, and select Spark 2.1.0. Leave the other default options selected, and choose Next.


  5. On the Hardware, General Cluster Settings, and Security pages, choose Next, and then choose Create cluster. While the cluster is spinning up, upload iris.data from the emr-pmml-demo directory to your S3 bucket.
  6. When the cluster is up, submit the Spark job, as follows. The application reads input from, and writes output to, an S3 bucket that you specify.
    1. On the cluster detail page, choose Add Step.
    2. For Step Type, choose Spark Application.
    3. For Spark-submit options, type the following:
      --class com.aws.sparkml.sample.Main --master yarn 
    4. For Application location, choose the S3 location of the JAR file that you uploaded in Step 3.
    5. For Arguments, type the location of the iris.data file, which you uploaded in Step 5, and the location of the S3 bucket and folder where you want to save the PMML document.
    6. For Action on failure, choose Continue, and then choose Add.
  7. After the step completes successfully, you should see a new folder that contains the generated PMML document. Open it.

The PMML document, which is part of the Spark output (part-00000), should look similar to the following:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
    <Header description="k-means clustering">
        <Application name="Apache Spark MLlib" version="2.1.0"/>
        <Timestamp>2017-04-11T16:07:19</Timestamp>
    </Header>
    <DataDictionary numberOfFields="4">
        <DataField name="field_0" optype="continuous" dataType="double"/>
        <DataField name="field_1" optype="continuous" dataType="double"/>
        <DataField name="field_2" optype="continuous" dataType="double"/>
        <DataField name="field_3" optype="continuous" dataType="double"/>
    </DataDictionary>
    <ClusteringModel modelName="k-means" functionName="clustering" modelClass="centerBased" numberOfClusters="3">
        <MiningSchema>
            <MiningField name="field_0" usageType="active"/>
            <MiningField name="field_1" usageType="active"/>
            <MiningField name="field_2" usageType="active"/>
            <MiningField name="field_3" usageType="active"/>
        </MiningSchema>
        <ComparisonMeasure kind="distance">
            <squaredEuclidean/>
        </ComparisonMeasure>
        <ClusteringField field="field_0" compareFunction="absDiff"/>
        <ClusteringField field="field_1" compareFunction="absDiff"/>
        <ClusteringField field="field_2" compareFunction="absDiff"/>
        <ClusteringField field="field_3" compareFunction="absDiff"/>
        <Cluster name="cluster_0">
            <Array n="4" type="real">5.901612903225806 2.7483870967741932 4.393548387096774 1.433870967741935</Array>
        </Cluster>
        <Cluster name="cluster_1">
            <Array n="4" type="real">5.005999999999999 3.4180000000000006 1.4640000000000002 0.2439999999999999</Array>
        </Cluster>
        <Cluster name="cluster_2">
            <Array n="4" type="real">6.85 3.0736842105263147 5.742105263157893 2.071052631578947</Array>
        </Cluster>
    </ClusteringModel>
</PMML>


The PMML document consists of multiple sections that describe the model. The DataDictionary section lists all possible fields used by the model. The ClusteringModel section defines the data mining model and contains the statistics for each cluster.
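
If you want to inspect the document programmatically rather than by eye, you can parse it with the JPMML-Model library, which the JPMML stack in the next section builds on. The following is a minimal sketch, not part of the demo project, that assumes a local copy of the part-00000 file passed as the first argument:

import java.io.FileInputStream;
import java.io.InputStream;

import org.dmg.pmml.DataField;
import org.dmg.pmml.PMML;
import org.jpmml.model.PMMLUtil;

public class InspectPmml {

    public static void main(String[] args) throws Exception {
        // args[0] is a local copy of the Spark-generated part-00000 file
        try (InputStream in = new FileInputStream(args[0])) {
            PMML pmml = PMMLUtil.unmarshal(in);

            // List the fields declared in the DataDictionary section
            for (DataField field : pmml.getDataDictionary().getDataFields()) {
                System.out.println(field.getName().getValue() + ": " + field.getDataType());
            }

            // The ClusteringModel section is the first (and only) model in the document
            System.out.println(pmml.getModels().get(0).getModelName());
        }
    }
}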

Import the PMML file into Lambda and generate predictions

Now, create a Lambda function to import the generated PMML XML file using the Java JPMML library. The JPMML library allows you to import PMML models and then generate predictions, and it supports many versions of PMML. The library is available from the Maven repository.
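
The exact import code is in the demo project that you open in the next procedure. As a rough sketch of what the import side involves, a loader along the following lines can read the document from S3 and build an evaluator. The PmmlLoader class name, bucket, and key are illustrative, not the demo's actual code; the sketch assumes the JPMML-Model and JPMML-Evaluator libraries and the AWS SDK for Java:

import java.io.InputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.model.PMMLUtil;

public class PmmlLoader {

    // Read the Spark-generated PMML document from S3 and build a JPMML evaluator
    public static Evaluator loadEvaluator() {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        try (InputStream in = s3.getObject("BUCKET_NAME", "FOLDER_NAME/part-00000")
                                .getObjectContent()) {
            PMML pmml = PMMLUtil.unmarshal(in);
            // The factory inspects the document and returns the appropriate
            // evaluator; for this document, a clustering model evaluator
            return ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml);
        } catch (Exception e) {
            throw new RuntimeException("Failed to load the PMML model from S3", e);
        }
    }
}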

To import the PMML file into Lambda and test it

  1. To get the Lambda source code, in the emr-pmml-demo directory, open jpmml-demo-lambda.
  2. In src/main/java, open JpmmlPredictor.java and modify the following variables to specify the input location of the PMML document:
    private static String bucketName = "BUCKET_NAME";
    private static String key = "FOLDER_NAME/part-00000";
  3. To generate a JAR file, run a Maven build from the root directory, which contains pom.xml, the build file for Java applications. To limit the size of the JAR file, the build uses the Maven Shade plugin, which excludes unneeded libraries:
    mvn package shade:shade

    The build also downloads and bundles the JPMML JAR file.
  4. Upload the JAR file to an S3 bucket of your choice.
  5. Create a Lambda function.
    1. In the AWS Lambda console, choose Create a Lambda function. For the blueprint, choose Blank Function. The PMML application doesn't use triggers, so on the Configure triggers page, choose Next.
    2. On the Configure function page, for Runtime, choose Java 8, and upload the JAR file from the S3 bucket.
    3. For Handler, type aws.jpmml.demo.JpmmlPredictor::handleRequest, and then choose an IAM role that has read permissions to the S3 bucket that contains the PMML document.
    4. Review your settings, and choose Create function.
  6. Configure the Lambda function with a test event, and then test it.
    1. Choose Actions, and then choose Configure test event.
    2. Type the following text, and then choose Save and Test:
      {
        "sepalLength": "6.3",
        "sepalWidth": "2.7",
        "petalLength": "4.9",
        "petalWidth": "1.8"
      }
          

AWS Lambda creates an internal model by importing the PMML document. The Lambda function invokes the JPMML evaluator, which predicts the cluster based on the input arguments. For example:

{
  "clusterId": "cluster_0"
}
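
The exact handler implementation is in the jpmml-demo-lambda project. As a hypothetical sketch of the shape such a handler can take, the following reuses the PmmlLoader sketch from earlier; the field mapping and result handling are illustrative and may differ from the demo code:

package aws.jpmml.demo;

import java.util.LinkedHashMap;
import java.util.Map;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import org.dmg.pmml.FieldName;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.EvaluatorUtil;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;

public class JpmmlPredictor implements RequestHandler<Map<String, String>, Map<String, String>> {

    // Built once per container from the PMML document in S3
    private static final Evaluator evaluator = PmmlLoader.loadEvaluator();

    @Override
    public Map<String, String> handleRequest(Map<String, String> input, Context context) {
        // Spark exported the features as field_0..field_3 in the same column order
        // as iris.data: sepal length, sepal width, petal length, petal width.
        // This assumes getInputFields() returns them in that MiningSchema order.
        String[] requestKeys = {"sepalLength", "sepalWidth", "petalLength", "petalWidth"};

        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        int i = 0;
        for (InputField inputField : evaluator.getInputFields()) {
            Object raw = Double.valueOf(input.get(requestKeys[i++]));
            // prepare() validates and converts the raw value for this field
            arguments.put(inputField.getName(), inputField.prepare(raw));
        }

        // For a clustering model, the single result value identifies the
        // winning cluster; decode() unwraps it into a plain value
        Map<FieldName, ?> results = evaluator.evaluate(arguments);
        Object clusterId = EvaluatorUtil.decode(results.values().iterator().next());

        Map<String, String> response = new LinkedHashMap<>();
        response.put("clusterId", String.valueOf(clusterId));
        return response;
    }
}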

If you get a timeout error, in Advanced settings, set the timeout to 60 seconds.

Conclusion

In this post, I explained how to use Amazon EMR and AWS Lambda to build a PMML-based application. With this approach, you can use transient Amazon EMR clusters to generate PMML models from data in Amazon S3, and then use AWS Lambda to generate predictions. To provide a REST API layer for your application, you could integrate the Lambda function with Amazon API Gateway.


Additional Reading

Learn more about running BigDL, deep learning for Apache Spark, on AWS.



About the Author

Gitansh Chadha is a Solutions Architect at AWS. He lives in the San Francisco bay area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys the outdoors and spending time with his twin daughters.