AWS Machine Learning Blog

Introducing the open-source Amazon SageMaker XGBoost algorithm container

XGBoost is a popular and efficient machine learning (ML) algorithm for regression and classification tasks on tabular datasets. It implements a technique known as gradient boosting on trees and performs remarkably well in ML competitions.

Since its launch, Amazon SageMaker has supported XGBoost as a built-in managed algorithm. For more information, see Simplify machine learning with XGBoost and Amazon SageMaker. As of this writing, you can take advantage of the open-source Amazon SageMaker XGBoost container, which offers improved flexibility, scalability, and extensibility, plus Managed Spot Training support. For more information, see the Amazon SageMaker sample notebooks and sagemaker-xgboost-container on GitHub, or see XGBoost Algorithm.

This post introduces the benefits of the open-source XGBoost algorithm container and presents three use cases.

Benefits of the open-source SageMaker XGBoost container

The new XGBoost container has the following benefits:

Latest version

The open-source XGBoost container supports the latest XGBoost 1.0 release and all improvements, including better performance scaling on multi-core instances and improved stability for distributed training.

Flexibility

With the new script mode, you can now customize or use your own training script. This functionality, which is also available for TensorFlow, MXNet, PyTorch, and Chainer users, allows you to add in custom pre- or post-processing logic, run additional steps after the training process, or take advantage of the full range of XGBoost functions (such as cross-validation support). You can still use the no-script algorithm mode (like other Amazon SageMaker built-in algorithms), which only requires you to specify a data location and hyperparameters.

Scalability

The open-source container has a more efficient implementation of distributed training, which allows it to scale out to more instances and reduces out-of-memory errors.

Extensibility

Because the container is open source, you can extend, fork, or modify the algorithm to suit your needs, beyond using the script mode. This includes installing additional libraries and changing the underlying version of XGBoost.

Managed Spot Training

You can save up to 90% on your Amazon SageMaker XGBoost training jobs with Managed Spot Training support. This fully managed option lets you take advantage of unused compute capacity in the AWS Cloud. Amazon SageMaker manages the Spot Instances on your behalf so you don’t have to worry about polling for capacity. The new version of XGBoost automatically manages checkpoints for you to make sure your job finishes reliably. For more information, see Managed Spot Training in Amazon SageMaker and Use Checkpoints in Amazon SageMaker.

Additional input formats

XGBoost now includes support for Parquet and RecordIO-protobuf input formats. Parquet is a standardized, open-source, self-describing columnar storage format for use in data analysis systems. RecordIO-protobuf is a common binary data format used across Amazon SageMaker for various algorithms, which XGBoost now supports for training and inference. For more information, see Common Data Formats for Training. Additionally, this container supports Pipe mode training for these data formats. For more information, see Using Pipe input mode for Amazon SageMaker algorithms.

Using the latest XGBoost container as a built-in algorithm

As an existing Amazon SageMaker XGBoost user, you can take advantage of the new features and improved performance by specifying the version when you create your training jobs. For more information about getting started with XGBoost or using the latest version, see the GitHub repo.

You can upgrade to the new container by specifying the framework version (1.0-1). This version specifies the upstream XGBoost framework version (1.0) and an additional Amazon SageMaker version (1). If you have an existing XGBoost workflow based on the legacy 0.72 container, this is the only change necessary to get the same workflow working with the new container. The container also supports XGBoost 0.90 if you specify the version as 0.90-1.

See the following code:

import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

# region, role, hyperparameters, and training_data are assumed to be defined earlier
container = get_image_uri(region, 'xgboost', '1.0-1')

estimator = sagemaker.estimator.Estimator(container,
                                          role,
                                          hyperparameters=hyperparameters,
                                          train_instance_count=1,
                                          train_instance_type='ml.m5.2xlarge',
                                          )

estimator.fit(training_data)

Using managed Spot Instances

You can also take advantage of managed Spot Instance support by enabling the train_use_spot_instances flag on your Estimator. For more information, see the GitHub repo.

When you train with managed Spot Instances, the training job may be interrupted, which can cause it to take longer to start or finish. If a training job is interrupted, it can resume from a checkpointed snapshot instead of restarting from the beginning, which saves training time (and cost). Use checkpoint_s3_uri to specify where your training job stores its snapshots so it can seamlessly resume after a Spot Instance interruption. See the following code:

estimator = sagemaker.estimator.Estimator(container,
                                          role,
                                          hyperparameters=hyperparameters,
                                          train_instance_count=1,
                                          train_instance_type='ml.m5.2xlarge',
                                          output_path=output_path,
                                          sagemaker_session=sagemaker.Session(),
                                          train_use_spot_instances=train_use_spot_instances,  # enable Managed Spot Training
                                          train_max_run=train_max_run,    # maximum training time, in seconds
                                          train_max_wait=train_max_wait,  # maximum time to wait, including Spot interruptions
                                          checkpoint_s3_uri=checkpoint_s3_uri  # S3 location for checkpoint snapshots
                                         )

estimator.fit({'train': train_input})
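The Spot-related variables passed to the Estimator above aren't shown in the snippet. The following is a minimal sketch of how you might define them; the parameter names match the SageMaker Python SDK v1 Estimator arguments, but the specific values and the checkpoint path (which reuses the bucket and prefix variables assumed to be defined in your notebook) are only illustrative:

import uuid

train_use_spot_instances = True
train_max_run = 3600    # maximum training time, in seconds
train_max_wait = 7200   # maximum total wait (training plus Spot interruptions); must be >= train_max_run
# S3 location where checkpoints are written so an interrupted job can resume
checkpoint_s3_uri = 's3://{}/{}/checkpoints/{}'.format(bucket, prefix, uuid.uuid4()) if train_use_spot_instances else None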

Towards the end of the job, you should see the following two lines of output:

  • Training seconds: X – The actual compute time your training job used
  • Billable seconds: Y – The time you are billed for after Spot discounting is applied

If you enabled train_use_spot_instances, you should see a notable difference between X and Y, which signifies the cost savings from using Managed Spot Training. The savings are summarized in the following line of output:

Managed Spot Training savings: (1-Y/X)*100 %

Using script mode

Script mode is a new feature with the open-source Amazon SageMaker XGBoost container. You can use your own training or hosting script to fully customize the XGBoost training or inference workflow. The following code example is a walkthrough of using a customized training script in script mode. For more information, see the GitHub repo.

Preparing the entry-point script

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an argparse.ArgumentParser instance.

Starting with the main guard, use a parser to read the hyperparameters passed to your Amazon SageMaker estimator when creating the training job. These hyperparameters are made available as arguments to your input script. You also parse several Amazon SageMaker-specific environment variables to get information about the training environment, such as the location of input data and where to save the model. See the following code:

import argparse
import logging
import os
import pickle as pkl

import xgboost as xgb

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here
    parser.add_argument('--num_round', type=int)
    parser.add_argument('--max_depth', type=int, default=5)
    parser.add_argument('--eta', type=float, default=0.2)
    parser.add_argument('--gamma', type=float, default=4)
    parser.add_argument('--min_child_weight', type=float, default=6)
    parser.add_argument('--subsample', type=float, default=0.8)
    parser.add_argument('--silent', type=int, default=0)
    parser.add_argument('--objective', type=str, default='reg:squarederror')

    # SageMaker-specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))

    args = parser.parse_args()

    train_hp = {
        'max_depth': args.max_depth,
        'eta': args.eta,
        'gamma': args.gamma,
        'min_child_weight': args.min_child_weight,
        'subsample': args.subsample,
        'silent': args.silent,
        'objective': args.objective
    }

    # Load the training data, and the validation data if that channel was provided
    dtrain = xgb.DMatrix(args.train)
    dval = xgb.DMatrix(args.validation) if args.validation else None
    watchlist = [(dtrain, 'train'), (dval, 'validation')] if dval is not None else [(dtrain, 'train')]

    # add_checkpointing is a helper (defined in the full sample) that configures
    # checkpoint saving and loads any checkpoint left by a previous run.
    # If a checkpoint is found, num_boost_round is reduced by the number of
    # iterations that previous run already completed.
    callbacks = []
    prev_checkpoint, n_iterations_prev_run = add_checkpointing(callbacks)

    bst = xgb.train(
        params=train_hp,
        dtrain=dtrain,
        evals=watchlist,
        num_boost_round=(args.num_round - n_iterations_prev_run),
        xgb_model=prev_checkpoint,
        callbacks=callbacks
    )

    # Save the trained model where Amazon SageMaker expects it so it can be hosted later
    model_location = args.model_dir + '/xgboost-model'
    pkl.dump(bst, open(model_location, 'wb'))
    logging.info("Stored trained model at {}".format(model_location))

Inside the entry-point script, you can optionally customize the inference experience when you use Amazon SageMaker hosting or batch transform. You can customize the following:

  • input_fn() – How the input is handled
  • predict_fn() – How the XGBoost model is invoked
  • output_fn() – How the response is returned

The defaults work for this use case, so you don’t need to define them.
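If you did need to override them, a minimal sketch could look like the following. The function signatures follow the standard SageMaker framework-container conventions; the CSV handling below is illustrative and assumes the request arrives as a plain CSV string, so treat it as a starting point rather than a copy of the container defaults:

import io

import numpy as np
import xgboost as xgb

def input_fn(request_body, request_content_type):
    # Deserialize the request payload into an XGBoost DMatrix
    if request_content_type == 'text/csv':
        array = np.loadtxt(io.StringIO(request_body), delimiter=',')
        return xgb.DMatrix(np.atleast_2d(array))
    raise ValueError('Unsupported content type: {}'.format(request_content_type))

def predict_fn(input_data, model):
    # Invoke the loaded XGBoost Booster on the deserialized input
    return model.predict(input_data)

def output_fn(prediction, accept):
    # Serialize the predictions back to the client as CSV
    return ','.join(str(p) for p in prediction)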

Training with the Amazon SageMaker XGBoost estimator

After you prepare your training data and script, the XGBoost estimator class in the Amazon SageMaker Python SDK allows you to run that script as a training job on the Amazon SageMaker managed training infrastructure. You also pass the estimator your IAM role, the type of instance you want to use, and a dictionary of the hyperparameters that you want to pass to your script. See the following code:

from sagemaker.session import s3_input
from sagemaker.xgboost.estimator import XGBoost

# role, hyperparameters, container, bucket, prefix, train_input, and the
# Spot-related settings are assumed to be defined earlier in the notebook
xgb_script_mode_estimator = XGBoost(
    entry_point="abalone.py",  # the entry-point script prepared above
    hyperparameters=hyperparameters,
    image_name=container,
    role=role,
    train_instance_count=1,
    train_instance_type="ml.m5.2xlarge",
    framework_version="1.0-1",
    output_path="s3://{}/{}/{}/output".format(bucket, prefix, "xgboost-script-mode"),
    train_use_spot_instances=train_use_spot_instances,
    train_max_run=train_max_run,
    train_max_wait=train_max_wait,
    checkpoint_s3_uri=checkpoint_s3_uri
)

xgb_script_mode_estimator.fit({"train": train_input})

Deploying the custom XGBoost model

After you train the model, you can use the estimator to create an Amazon SageMaker endpoint—a hosted and managed prediction service that you can use to perform inference. See the following code:

predictor = xgb_script_mode_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
test_data = xgboost.DMatrix('/path/to/data')
predictor.predict(test_data)

Training with Parquet input

You can now train the latest XGBoost algorithm with Parquet-formatted files or streams directly by using the open-source ML-IO library, which Amazon SageMaker supports. ML-IO is a high-performance data access library for ML frameworks with support for multiple data formats, and it is installed by default on the latest XGBoost container. For more information about importing a Parquet file and training with it, see the GitHub repo.
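As a rough sketch, assuming your Parquet data is already in Amazon S3 and reusing the built-in-mode estimator and the bucket and prefix variables from earlier, pointing a training channel at it might look like the following; the application/x-parquet content type is documented for the SageMaker XGBoost container, the Pipe input mode is optional, and the S3 prefix is a placeholder:

from sagemaker.session import s3_input

# Hypothetical S3 prefix; replace it with the location of your own Parquet files
train_input = s3_input(
    's3://{}/{}/train-parquet/'.format(bucket, prefix),
    content_type='application/x-parquet',
    input_mode='Pipe'  # optional: stream the data instead of downloading it first
)

estimator.fit({'train': train_input})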

Conclusion

The open-source XGBoost container for Amazon SageMaker provides a fully managed experience and additional benefits that save you money on training and give you more flexibility.


About the Authors

Rahul Iyer is a Software Development Manager at AWS AI. He leads the Framework Algorithms team, building and optimizing machine learning frameworks like XGBoost and Scikit-learn. Outside work, he enjoys nature photography and cherishes time with his family.

Rocky Zhang is a Senior Product Manager at AWS SageMaker. He builds products that help customers solve real world business problems with Machine Learning. Outside of work he spends most of his time watching, playing, and coaching soccer.

Eric Kim is an engineer in the Algorithms & Platforms Group of Amazon AI. He helps support the AWS service SageMaker, and has experience in machine learning research, development, and application. Outside of work, he is a music lover and a fan of all dogs.

Laurence Rouesnel is a Senior Manager in Amazon AI. He leads teams of engineers and scientists working on deep learning and machine learning research and products, like SageMaker AutoPilot and Algorithms. In his spare time he’s an avid fan of traveling, table-top RPGs, and running.