Random Data Splitting and Cross-validation with Amazon Machine Learning

Posted on: Dec 3, 2015

You can now make your Amazon Machine Learning (Amazon ML) model evaluations more accurate by using a random splitting strategy, which trains and evaluates ML models on random subsets of your input data records. Random splitting may be the best way to make your evaluation data representative of your training data, which in turn makes the evaluation result more trustworthy. You can choose the splitting strategy through the Amazon ML console or API. Amazon ML alerts you when the training and evaluation data are not similar, so that you can select a different data splitting strategy for the next model iteration.
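As a minimal sketch of what a random split could look like through the API, the Python below builds two data rearrangement strings, one for training and one for evaluation. The field names (splitting, percentBegin, percentEnd, strategy, randomSeed) reflect our reading of the developer guide and should be checked against it; the helper name, the 70/30 ratio, and the seed value are illustrative assumptions. The code only builds strings and does not call Amazon ML.

```python
import json


def random_split_rearrangements(train_percent=70, seed="example-seed"):
    """Return (training, evaluation) rearrangement strings for a random split.

    Hypothetical helper. Field names follow our reading of the Amazon ML
    developer guide and are assumptions to verify before use.
    """
    def spec(begin, end):
        return json.dumps({
            "splitting": {
                "percentBegin": begin,
                "percentEnd": end,
                "strategy": "random",
                # Both specs share one seed so they describe the same shuffle
                # and therefore select non-overlapping records.
                "randomSeed": seed,
            }
        })

    return spec(0, train_percent), spec(train_percent, 100)


train_spec, eval_spec = random_split_rearrangements()
print(train_spec)
print(eval_spec)
```

Each string would be supplied as the data rearrangement of a separate datasource: one for training the model and one for evaluating it.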

You can now also create more accurate evaluations of your models by using cross-validation on your data. Cross-validation is particularly valuable if you are running many evaluations of slightly different ML models. It is enabled by a new feature in the Amazon ML API that selects the complement of a subset of your dataset. With it, you can create several pairs of ML models and evaluations on complementary data subsets, then average the quality metrics reported by all the evaluations to arrive at a more accurate overall quality metric for your model.
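The cross-validation procedure described here can be sketched as follows, assuming four folds. For each fold, the evaluation datasource takes one slice of the data and the training datasource takes its complement. The complement field and its string value "true" reflect our reading of the developer guide and should be verified; the helper names are hypothetical, and the metric values at the end are made up purely to show the averaging step.

```python
import json
from statistics import mean


def kfold_rearrangements(k=4):
    """Return a list of (training, evaluation) rearrangement strings.

    Hypothetical helper. The "complement" field name and value are
    assumptions based on the developer guide.
    """
    folds = []
    for i in range(k):
        # Integer percentage boundaries of the i-th evaluation slice.
        begin = i * 100 // k
        end = (i + 1) * 100 // k
        window = {"percentBegin": begin, "percentEnd": end}
        evaluation = json.dumps({"splitting": window})
        # Training uses everything outside the evaluation slice.
        training = json.dumps({"splitting": {**window, "complement": "true"}})
        folds.append((training, evaluation))
    return folds


def average_metric(per_fold_values):
    """Average the quality metric reported by each fold's evaluation."""
    return mean(per_fold_values)


folds = kfold_rearrangements(4)
for training, evaluation in folds:
    print(training, evaluation)

# Placeholder values for illustration only, not real evaluation results.
example_metrics = [0.5, 0.75, 0.75, 1.0]
print(average_metric(example_metrics))
```

In practice, each pair of strings would define one training datasource and one evaluation datasource, and the values passed to average_metric would come from the evaluations themselves.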

Please visit the GitHub repository for sample code that shows how to use cross-validation on your data, and see the Amazon ML developer guide for more information on random data splitting.