AWS Open Source Blog
Machine learning with AutoGluon, an open source AutoML library
If you work in data science, you might think that the hardest thing about machine learning is not knowing when you’ll be done. You start with a problem, a dataset, and an idea about how to solve it, but you never know whether your approach will work until much later, after you’ve already invested significant time. Part of what makes the machine learning process difficult is that there are many best practices that only experienced practitioners know to use. If you’re just getting started in data science, you may spend a significant amount of time on an approach you thought was right, one that an expert practitioner would have told you is a dead end.
What if you could codify these best practices into one simple and easy-to-use software package that any developer could use? A library that can automatically prepare your dataset, try different machine learning approaches, and combine their results to deliver high-quality models—and all of that with a few lines of code?
This is the idea behind automated machine learning (AutoML), and the thinking that went into designing AutoGluon, the AutoML library that Amazon Web Services (AWS) open sourced at re:Invent 2019. Using AutoGluon, you can train state-of-the-art machine learning models for image classification, object detection, text classification, and tabular data prediction with little to no prior experience in machine learning. You can run AutoGluon anywhere, from your laptop or workstation to a powerful Amazon Elastic Compute Cloud (Amazon EC2) instance, to take advantage of multiple cores and get results faster.
The AutoGluon team at AWS has released a paper detailing the inner workings of AutoGluon-Tabular, an open source AutoGluon capability that allows you to train machine learning models on tabular datasets from sources such as spreadsheets and database tables.
In the first half of this article, I will introduce AutoGluon-Tabular and summarize key innovations described in the paper and the magic that happens behind the scenes when you use AutoGluon-Tabular. In the second half of the article, I will walk through an end-to-end code example showing how you can use AutoGluon-Tabular to achieve a top 1% score in a data science competition with a few lines of code—no machine learning experience required.
If you want to jump ahead and start running the example, head over to the “Getting a head start on your next data science competition” section. The Jupyter notebook for the demo is available on my GitHub.
AutoGluon-Tabular’s approach to AutoML
While machine learning applications in images and videos get all the attention, people have been applying statistical techniques to tabular data (think rows and columns in a spreadsheet or a database) for decades, either to build predictive models or to gather summary statistics. A large number of data science problems fall into this category—for example, sales forecasting based on inventory and demand data, fraud detection from transaction data, and generating product recommendations from user preferences.
This article will focus on a subset of AutoGluon’s capabilities for dealing with tabular data, which we’ll refer to as AutoGluon-Tabular.
AutoGluon-Tabular gives you access to all the best practices used by expert data scientists through a user-friendly API and was designed with the following key principles:
- Simplicity: Users should be able to train classification and regression models and deploy them with a few lines of code.
- Robustness: Users should be able to provide raw data without any feature engineering or data manipulation.
- Predictable timing: Users should be able to specify a time budget and get the best model under that time constraint.
- Fault tolerance: Users should be able to resume training when interrupted and be able to inspect all intermediate steps.
If you’re already an expert data science practitioner and are wondering whether AutoGluon-Tabular is useful to you, the answer is yes. Even for an expert, AutoGluon-Tabular can save time by automating time-consuming manual steps—handling missing data, manual feature transformations, data splitting, model selection, algorithm selection, hyperparameter selection and tuning, ensembling multiple models, and repeating the process when data changes.
AutoGluon-Tabular also includes novel techniques for multi-layer stack ensembling that significantly boosts model accuracy. Because AutoGluon is fully open source, transparent, and extensible, you have complete visibility into what it’s doing at every stage of the process and you can even bring in your own algorithms and use them with AutoGluon.
AutoGluon API
AutoGluon-Tabular users only need to know how to use three Python functions: Dataset(), fit(), and predict(). Don’t let the simplicity of the API fool you—there’s a lot going on behind the scenes, and AutoGluon-Tabular works hard on your behalf to give you high-quality models. In the next section, we’ll unwrap these functions and talk about what makes them tick.
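Put together, a minimal end-to-end run looks something like this (a sketch; the file and column names are hypothetical placeholders):

from autogluon import TabularPrediction as task

train_data = task.Dataset('train.csv')             # load a tabular dataset
predictor = task.fit(train_data, label='target')   # preprocess, train, and ensemble models
predictions = predictor.predict(task.Dataset('test.csv'))  # predict on new data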
Step 0: Launch your Amazon EC2 instance; install and import AutoGluon.
AutoGluon can take advantage of multi-core CPUs for faster training. I recommend launching an Amazon EC2 instance from the C5 or M5 family; choose a higher vCPU count for faster performance. For a short guide on launching your instance and accessing it, read the Getting Started with Amazon EC2 documentation. Follow the instructions on the AutoGluon webpage to install AutoGluon. In most cases you should be able to just run pip install.
To run AutoGluon-Tabular, start by telling AutoGluon that the task of interest is to build a predictor for tabular data. Replace TabularPrediction with ImageClassification for image classification problems, ObjectDetection for object detection problems, or TextClassification for text classification problems. The rest of the API remains the same, which makes it easy to switch between problem types without having to re-learn the API.
from autogluon import TabularPrediction as task
Step 1: Load dataset.
If you’re a pandas user, you’ll feel at home using the Dataset function, which provides a pandas-like experience, so you can do things like drop variables or join multiple datasets. Because AutoGluon-Tabular manages data preprocessing automatically for you, you won’t need to do any data manipulation.
data = task.Dataset(DATASET_PATH)
Step 2: Fit models.
The fit() function does all the heavy lifting, as we’ll see in the next section. It does two things: it studies the dataset and prepares it for training, and it fits several models and combines them to produce a high-accuracy model.
predictor = task.fit(data_train, label=LABEL_COLUMN_NAME)
Step 3: Make predictions.
The predict function generates predictions from new data. Predictions can be classes and their probabilities for classification problems, or continuous numeric values for regression problems. When you run fit(), several models are generated and saved to disk. If you revisit them at a later date, you can simply load a predictor using the load command and run predictions with it.
prediction = predictor.predict(new_data)
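Reloading a saved predictor later looks something like this (a minimal sketch, assuming the task.load() helper from this release of the API; savedir stands for the output_directory that fit() saved models to, as in the example later in this article):

predictor = task.load(savedir)            # reload the predictor saved by fit()
prediction = predictor.predict(new_data)  # predict as usual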
The magic of the fit() function
When you pass your dataset to the task.fit() function, it does two things: data preprocessing and model fitting. Now let’s find out what happens behind the scenes.
Data preprocessing
AutoGluon-Tabular first checks the label column and determines whether you have a classification problem (predicting categories) or a regression problem (predicting continuous values). Then it initiates data preprocessing steps that transform the data into a form that will be consumed by many different machine learning algorithms during the fit() phase.
During the preprocessing step, AutoGluon-Tabular begins by categorizing each feature as numeric, categorical, text, or date/time. Columns that cannot be categorized—such as columns of non-numeric values that never repeat (and therefore cannot be treated as categorical), for example user IDs—are discarded.
Text columns are transformed into numeric vectors of n-gram (contiguous sequences of n items or words) features; date and time features are transformed into suitable numeric values. To deal with missing discrete variables, AutoGluon-Tabular creates an additional Unknown category rather than imputing (replacing the value with a proxy such as the average). In real-world datasets, values can be missing for various reasons, such as data corruption, sensor failures, and human error, and a missing value doesn’t imply there was nothing interesting there. Categorizing it as Unknown also allows AutoGluon-Tabular to handle previously unseen categories when generating predictions on new data. During the model fitting stage, AutoGluon-Tabular also performs additional, model-specific data preprocessing steps.
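As a simplified illustration of the Unknown-category idea (a sketch in pandas, not AutoGluon’s actual preprocessing code):

import pandas as pd

# Missing categorical values become an explicit "Unknown" category instead of
# being imputed with a proxy value such as the mode or the mean.
df = pd.DataFrame({'color': ['red', None, 'blue', 'green', None]})
df['color'] = (df['color'].astype('category')
                          .cat.add_categories(['Unknown'])
                          .fillna('Unknown'))
print(df['color'].tolist())  # ['red', 'Unknown', 'blue', 'green', 'Unknown']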
Model fitting
When you invoke the fit() function, AutoGluon-Tabular trains a series of machine learning models on the preprocessed data, then combines multiple models using ensembling and stacking.
AutoGluon-Tabular trains the individual models in a specifically chosen sequence. First it trains reliably performant models, such as random forests, and then progressively trains more computationally expensive but less reliable models, such as k-nearest neighbors. The benefit of this approach is that you can impose a time limit on the fit() function and it will return the best models it could train under that constraint. AutoGluon-Tabular gives you the flexibility to decide whether you want the best accuracy with no constraints, or the best accuracy under a specific cost or time budget.
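For example, here is a sketch of training under a one-hour budget, assuming the time_limits argument (specified in seconds) from this release of the TabularPrediction API:

# AutoGluon plans its training sequence so that the best models trained
# so far are returned when the one-hour budget runs out.
predictor = task.fit(train_data=data, label=LABEL_COLUMN_NAME, time_limits=3600)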
AutoGluon-Tabular currently supports the following algorithms and trains all of them if no time limit is imposed:
- Random Forests
- Extremely randomized trees
- k-nearest neighbors
- LightGBM boosted trees
- CatBoost boosted trees
- AutoGluon-Tabular deep neural networks
Novelty in the AutoGluon-Tabular deep neural network architecture
There’s a common misconception in the data science community that deep learning approaches don’t work well with tabular data. There’s a reason for that thinking: Convolutions were introduced to neural networks for their translation-invariance properties via weight sharing, and this works great for datasets of 1-D signals or 2-D or 3-D images and videos, where each signal sample or pixel value has low predictive power by itself. In many tabular dataset applications, each feature is uniquely important and has higher predictive power than an individual pixel in an image. In these situations, feedforward or convolutional neural network architectures tend to underperform compared to their decision tree-based cousins.
To address these issues, AutoGluon-Tabular employs the novel neural network architecture shown in the figure below. Empirical studies show that carefully architected neural networks can deliver significant accuracy boosts, especially when ensembled with other types of models, which we’ll discuss in the next section.
Unlike commonly used, purely feedforward network architectures, AutoGluon-Tabular introduces an embedding layer for each categorical feature, where the embedding dimension is selected proportionally to the number of unique categories in the feature. The advantage of the embedding layer is that it introduces a learnable component for each categorical feature before it’s consumed by subsequent feedforward layers. The embeddings of the categorical features are then concatenated with the numerical features into a large vector that is both fed into a three-layer feedforward network and directly connected to the output predictions via a linear skip connection, in the style of the residual family of networks.
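To make the architecture concrete, here is a minimal sketch of the idea in PyTorch. This is an illustration only, not AutoGluon’s actual implementation (which is written in MXNet Gluon); the layer widths and the embedding-size rule are illustrative assumptions.

import torch
import torch.nn as nn

class TabularNet(nn.Module):
    # Sketch: per-feature embeddings + 3-layer feedforward + linear skip connection.
    def __init__(self, cat_cardinalities, n_numeric, n_classes, hidden=128):
        super().__init__()
        # One embedding per categorical feature; dimension grows with cardinality.
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, min(100, 1 + card // 2)) for card in cat_cardinalities
        )
        emb_dim = sum(e.embedding_dim for e in self.embeddings)
        in_dim = emb_dim + n_numeric
        # Three-layer feedforward block.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )
        # Linear skip connection from the concatenated features to the output.
        self.skip = nn.Linear(in_dim, n_classes)

    def forward(self, x_cat, x_num):
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embs + [x_num], dim=1)
        return self.mlp(x) + self.skip(x)

# Example: two categorical features (10 and 4 categories) plus 5 numeric features.
net = TabularNet(cat_cardinalities=[10, 4], n_numeric=5, n_classes=3)
x_cat = torch.stack([torch.randint(0, 10, (8,)), torch.randint(0, 4, (8,))], dim=1)
x_num = torch.randn(8, 5)
logits = net(x_cat, x_num)  # shape: (8, 3)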
Ensembles and multi-layer stacking
The idea of combining multiple models to create an “ensemble” with higher predictive accuracy than any of its members is not new. The earliest implementations of ensemble techniques date back to the early 1990s, with the invention of boosting (and the AdaBoost algorithm) and bagging (bootstrap aggregation). These techniques create ensembles of decision trees that are individually weak learners (not much better than chance) and unstable (sensitive to changes in the dataset), but when many decision trees are combined, they produce models with high predictive power that are resilient to overfitting. These early works are foundational to popular machine learning packages, such as LightGBM, CatBoost, and scikit-learn’s RandomForest, all of which are employed by AutoGluon.
If you’re wondering whether you can combine the outputs of RandomForest, CatBoost, k-nearest neighbors, and others to further improve model accuracy, the answer is yes, you can. Experienced machine learning practitioners have been doing this for many years and are skilled at devising clever ways to combine multiple models. Check out the winning entry for the Otto Group Product Classification Challenge Kaggle competition: the first-place solution included 33 models, whose outputs were then used to train three more models (stacking), followed by a weighted average.
With AutoGluon-Tabular, you don’t have to be skilled at stacking and ensembling. AutoGluon-Tabular will automatically do it for you. AutoGluon-Tabular introduces a novel form of multi-layer stack ensemble, shown in the figure above. Here’s how it works:
- Base layer: Individually trains the multiple base models described in the model fitting section.
- Concat layer: The outputs of the base layer are concatenated along with the input features.
- Stacker layer: Multiple stacker models are trained on the concat layer output. The novelty introduced by AutoGluon-Tabular is that the stacker layer re-uses exactly the same models as the base layer, including their hyperparameters, as stacker models. Because the input features are concatenated with the output of the base layer, the stacker models also get an opportunity to look at the original input dataset.
- Weighting layer: An ensemble selection approach is implemented, in which stacker models are introduced into a new ensemble such that validation accuracy is maximized.
To ensure that every learner sees the entire dataset, AutoGluon-Tabular performs k-fold cross-validation. To further improve predictive accuracy and reduce overfitting, AutoGluon-Tabular repeats the k-fold cross-validation n times on n different random partitions of the input data, where n is chosen by estimating how many rounds can be completed within the time constraint specified when calling the fit() function.
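For intuition, here is a simplified sketch of the core idea—out-of-fold base-model predictions concatenated with the input features, then fed to a stacker—using scikit-learn. This is an illustration under simplifying assumptions, not AutoGluon’s implementation: it uses a single stacking layer, one round of 5-fold bagging, and a logistic-regression stacker.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy dataset standing in for a real tabular problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

base_models = [RandomForestClassifier(random_state=0),
               ExtraTreesClassifier(random_state=0)]

# Out-of-fold predictions: each row's base-model output comes from a model
# that never saw that row during training (the role of k-fold bagging above).
oof_preds = [cross_val_predict(m, X, y, cv=5, method='predict_proba')
             for m in base_models]

# Concat layer: base-model outputs joined with the original input features.
X_stack = np.hstack([X] + oof_preds)

# Stacker layer: a logistic regression here for simplicity; AutoGluon instead
# re-uses the same model types (and hyperparameters) as the base layer.
stacker = LogisticRegression(max_iter=1000).fit(X_stack, y)
print('stacker training accuracy:', stacker.score(X_stack, y))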
AutoGluon-Tabular and fault tolerance
In life and in data science, things may not go as planned. When using AutoGluon-Tabular, you may accidentally hit Ctrl+C, a power surge may shut down your computer, or you may shut down all running Amazon EC2 instances without realizing a training job was still running. Mistakes happen, and when things don’t go as planned, you don’t want to be stuck with lost data and lost progress. AutoGluon-Tabular has some built-in protections for situations such as these.
When you call the fit() function, AutoGluon-Tabular first estimates the required training time, and if training a layer would exceed the remaining time, it skips that layer and proceeds to the next one. To ensure that progress is not lost, each new model is saved to disk immediately after it is trained. If a failure does occur, AutoGluon-Tabular can still produce predictions as long as it has trained at least one model on at least one fold (out of the k folds) before the failure (or before the time limit is reached). For algorithms that support intermediate checkpointing during training, such as tree-based algorithms and neural networks, AutoGluon-Tabular can still generate predictions from those checkpoints. AutoGluon-Tabular can also anticipate when models may fail during training and skip to the next one.
Code example: Getting a head start on your next data science competition
You should now have a sense of how AutoGluon-Tabular works behind the scenes, but data science is a practical discipline and the best way to learn is by doing.
In this section, we will walk through an end-to-end example of using AutoGluon-Tabular to train a model on the dataset that was made available for the Otto Group Product Classification Challenge on Kaggle. By following the example below, you should be able to achieve a score that puts you in the top 1% of the leaderboard. The competition is no longer running, but you can still submit your models and get scores on the public and private leaderboards.
The competition dataset consists of 200k rows representing products and 93 columns representing product features. Products are categorized into 10 classes specified by the label column in the training dataset. The goal of the competition is to predict the product category given 93 features for a product. You can read more about the competition on the competition page.
Prerequisites
If you want to follow along and run examples as you read, the Jupyter notebook for the demo is available on GitHub.
To download the dataset and to submit your scores to Kaggle, make sure to head over to the competition page and click “Join Competition” and agree to their terms and conditions before proceeding.
I ran these examples on a c5.24xlarge EC2 instance on AWS, and the total training took 2 hours and 30 minutes. c5.24xlarge is a compute-optimized instance that offers 96 CPU cores. I chose an instance type with a large number of cores because many AutoGluon-Tabular algorithms are multi-threaded and can take advantage of all of them. Below is a screenshot showing CPU utilization from the htop command on the EC2 instance during the training phase (neural network training). Horizontal green bars indicate busy CPU cores; black bars indicate less than 100% utilization per core. The load average of 84 indicates the average load across all 96 CPUs at the moment the screenshot was taken.
If you choose a CPU instance with fewer cores, your training time will be longer. To save costs, consider running this on EC2 Spot Instances: you get a discount on the instance price, but the instances can be preempted. Because AutoGluon-Tabular is designed to be fault tolerant, you can always resume training when capacity becomes available again.
Download the Kaggle CLI
Follow the instructions on the Kaggle API GitHub page to download the Kaggle CLI. The CLI makes it easy to download the dataset and submit prediction results programmatically without leaving your Jupyter notebook.
Download AutoGluon
Follow the instructions on the AutoGluon webpage to install AutoGluon. In most cases you should be able to just install it using pip.
Getting started
The following steps are from the otto-kaggle-example.ipynb Jupyter notebook hosted on GitHub. Let’s take a look at what’s happening at each of these steps.
Step 1: Download dataset
This step assumes that you have Kaggle CLI installed and you’ve agreed to participate in the competition by visiting the competition page.
dataset = 'dataset'
!kaggle competitions download -p {dataset} -q otto-group-product-classification-challenge
!unzip -d {dataset} {dataset}/otto-group-product-classification-challenge.zip
!rm {dataset}/otto-group-product-classification-challenge.zip
Output:
Archive: dataset/otto-group-product-classification-challenge.zip
inflating: dataset/sampleSubmission.csv
inflating: dataset/test.csv
inflating: dataset/train.csv
Step 2: Import AutoGluon and inspect dataset
In this step, we import a TabularPrediction task. If you’re familiar with pandas, then you’ll feel right at home with the task.Dataset() function, which can read a variety of tabular data file formats and returns a pandas-like object. Because AutoGluon-Tabular does not expect you to do any preprocessing, you won’t be doing a lot of data manipulation, other than dropping variables you don’t need or joining multiple datasets.
from autogluon import TabularPrediction as task
train_data = task.Dataset(file_path=f'{dataset}/train.csv').drop('id', axis=1)
train_data.head()
Output:
Loaded data from: dataset/train.csv | Columns = 95 / 95 | Rows = 61878 -> 61878
feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 feat_10 ... feat_85 feat_86 feat_87 feat_88 feat_89 feat_90 feat_91 feat_92 feat_93 target
0 1 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 Class_1
1 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 Class_1
2 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 Class_1
3 1 0 0 1 6 1 5 0 0 1 ... 0 1 2 0 0 0 0 0 0 Class_1
4 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 1 0 0 0 Class_1
5 rows × 94 columns
Step 3: Fit an AutoGluon model
label_column = 'target'   # specifies which column we want to predict
savedir = 'otto_models/'  # where to save trained models
predictor = task.fit(train_data=train_data, label=label_column, output_directory=savedir, eval_metric='log_loss', auto_stack=True, verbosity=2, visualizer='tensorboard')
The mandatory parameters for the fit() function are train_data and label; the rest are optional. In this example, I’m also specifying the following options:
- output_directory: Location where you want all the models and intermediate steps saved.
- eval_metric: The metric AutoGluon will use to optimize your model(s) against; for a full list of supported metrics, see the documentation page.
- auto_stack: True if you want AutoGluon to utilize the stacking described in the “Ensembles and multi-layer stacking” section of this article.
- verbosity: A value of 0 means you don’t see any output; 4 is the highest verbosity level.
- visualizer: If you set this to tensorboard, you can monitor neural network training progress on TensorBoard.
Output:
Beginning AutoGluon training ...
AutoGluon will save models to otto_models/
Train Data Rows: 61878
Train Data Columns: 94
Preprocessing data ...
Here are the first 10 unique label values in your data: ['Class_1' 'Class_2' 'Class_3' 'Class_4' 'Class_5' 'Class_6' 'Class_7'
'Class_8' 'Class_9']
AutoGluon infers your prediction problem is: multiclass (because dtype of label-column == object)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Feature Generator processed 61878 data points with 93 features
Original Features:
int features: 93
Generated Features:
int features: 0
All Features:
int features: 93
Data preprocessing and feature engineering runtime = 0.36s ...
AutoGluon will gauge predictive performance using evaluation metric: log_loss
This metric expects predicted probabilities rather than predicted class labels, so you'll need to use predict_proba() instead of predict()
To change this, specify the eval_metric argument of fit()
AutoGluon will early stop models using evaluation metric: log_loss
/home/ubuntu/anaconda3/envs/autogluon/lib/python3.7/imp.py:342: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
return _load(spec)
Fitting model: RandomForestClassifierGini_STACKER_l0 ...
-0.5691 = Validation log_loss score
27.63s = Training runtime
0.03s = Validation runtime
…
…
You can monitor the neural network training performance with TensorBoard. Install TensorBoard and run:
tensorboard --logdir otto_models/models/
And point your browser to http://0.0.0.0:6006/.
Step 4: Get fit summary
results = predictor.fit_summary() # display detailed summary of fit() process
Output:
*** Summary of fit() ***
Number of models trained: 22
Types of models trained:
{'WeightedEnsembleModel', 'StackerEnsembleModel'}
Validation performance of individual models: {'RandomForestClassifierGini_STACKER_l0': -0.5691089791208548,...}
Best model (based on validation performance): weighted_ensemble_k0_l2
Hyperparameter-tuning used: False
Bagging used: True (with 10 folds)
Stack-ensembling used: True (with 1 levels)
User-specified hyperparameters:
{'NN': {'num_epochs': 500, 'visualizer': 'tensorboard'}, 'GBM': {'num_boost_round': 10000}, ...}
Plot summary of models saved to file: SummaryOfModels.html
*** End of fit() summary ***
To compare all the models in the ensemble, call leaderboard():
lboard = predictor.leaderboard()
lboard.sort_values(by='score_val', ascending=False)
Output: a leaderboard table of all trained models, sorted by validation score (score_val).
Step 5: Load test dataset and predict
dataset = 'dataset'
test_data_full = task.Dataset(file_path=f'{dataset}/test.csv')
test_data = test_data_full.drop('id', axis=1)
pred_probabilities = predictor.predict_proba(test_data, as_pandas=True)
Step 6: Submit results to Kaggle
submission_name = 'autogluon-submission.csv'
pred_probabilities.to_csv(submission_name, index=False)
!kaggle competitions submit otto-group-product-classification-challenge -f {submission_name} -m "autogluon {submission_name}"
Head over to the competition page and you should see your score. I received a score of 0.40708, which puts me at around 32nd position out of 3,511 submissions, within the top 1% of all submissions.
For instructions on how to use AutoGluon for other Kaggle competitions, check out the tutorial in the AutoGluon documentation “How to use AutoGluon for Kaggle competitions”.
Conclusion
In this article I introduced AutoGluon and AutoGluon-Tabular, and I explained how you can use it to accelerate your data science projects. If you’re interested in learning more about AutoGluon-Tabular, how it performs on popular AutoML and Kaggle benchmarks, and how it compares to alternative AutoML solutions, read the AutoGluon-Tabular paper “AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data”.
Data science is about evolving objective functions and continuous optimization. If you have questions or comments as you explore AutoGluon capabilities, let us know. Please reach out to me on Twitter at @shshnkp or LinkedIn, or contact the AutoGluon team on GitHub by filing an issue.
The AutoGluon project is actively accepting code contributions. If you are interested in contributing to AutoGluon, head over to the contribution page on GitHub for more information.
On behalf of the AutoGluon team, thank you for reading and happy automating machine learning with AutoGluon!