AWS Machine Learning Blog
Git integration now available for the Amazon SageMaker Python SDK
Git integration is now available in the Amazon SageMaker Python SDK. You no longer have to download scripts from a Git repository for training jobs and hosting models. With this new feature, you can use training scripts stored in Git repos directly when training a model in the Python SDK. You can also use hosting scripts stored in Git repos when hosting a model. The scripts can be hosted in GitHub, another Git-based repo, or an AWS CodeCommit repo.
This post describes in detail how to use Git integration with the Amazon SageMaker Python SDK.
Overview
When you train a model with the Amazon SageMaker Python SDK, you need a training script that does the following:
- Loads data from the input channels
- Configures training with hyperparameters
- Trains a model
- Saves the model
You specify the script as the value of the entry_point argument when you create an estimator object.
Previously, when you constructed an Estimator or Model object in the Python SDK, the script you provided as the entry_point value had to be a path in the local file system. This was inconvenient when your training scripts lived in Git repos because you had to download them locally.
If multiple developers were contributing to the Git repo, you had to keep track of any updates to the repo. If your local version was out of date, you needed to pull the latest version before every training job, which made scheduling periodic training jobs even more challenging.
With Git integration, you point the SDK at the repo and it uses the scripts there directly, so you no longer need to maintain and refresh a local copy.
Walkthrough
Enable the Git integration feature by passing a dict parameter named git_config when you create the Estimator or Model object. The git_config parameter provides information about the location of the Git repo that contains the scripts and the authentication for accessing that repo.
Locate the Git repo
To locate the repo that contains the scripts, use the repo, branch, and commit fields in git_config. The repo field is required; the other two fields are optional. If you only provide the repo field, the latest commit in the master branch is used by default:
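A minimal sketch of the repo-only form follows. The URL is a placeholder rather than a real repo, so substitute your own.

```python
# Minimal git_config: only the required 'repo' field is set.
# The URL below is a placeholder, not a real repository.
git_config = {
    "repo": "https://github.com/username/repo-with-training-scripts.git",
}
```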
To specify a branch, use both the repo and branch fields. The latest commit in that branch is used by default:
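As a sketch, adding a branch looks like the following. Both the URL and the branch name are placeholders.

```python
# git_config with a branch: the latest commit on that branch is used.
# 'branch1' and the URL are placeholders.
git_config = {
    "repo": "https://github.com/username/repo-with-training-scripts.git",
    "branch": "branch1",
}
```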
To specify a commit of a specific branch in a repo, use all three fields in git_config:
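A sketch with all three fields set follows. The commit hash is a made-up 40-character value used only for illustration, as are the URL and branch name.

```python
# git_config pinned to one commit on one branch.
# The hash is a hypothetical placeholder, not a real commit.
git_config = {
    "repo": "https://github.com/username/repo-with-training-scripts.git",
    "branch": "branch1",
    "commit": "4893e528afa4a790331e1b5286954f073b0f14a2",
}
```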
If only the repo and commit fields are provided, the commit is used as long as it is on the master branch. However, if the commit is not on the master branch, the repo is not found:
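The repo-plus-commit form can be sketched as follows, again with placeholder values. The `branch_searched` variable is only an illustration of the rule described above; it is not something the SDK reads.

```python
# git_config with a commit but no branch (placeholder URL and hash).
git_config = {
    "repo": "https://github.com/username/repo-with-training-scripts.git",
    "commit": "4893e528afa4a790331e1b5286954f073b0f14a2",
}

# Illustration only: with no 'branch' key, the text above says the commit
# must be on master for this configuration to work.
branch_searched = git_config.get("branch", "master")
```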
Get access to the Git repo
If the Git repo is private (all CodeCommit repos are private), you need authentication information to access it.
For CodeCommit repos, first make sure that you set up your authentication method. For more information, see Setting Up for AWS CodeCommit. The topic lists the following ways by which you can authenticate:
- SSH connections
- Git credentials
- AWS CLI Credential Helper
Authentication for SSH URLs
For SSH URLs, you must configure the SSH key pair. This applies to GitHub, CodeCommit, and other Git-based repos.
- For CodeCommit SSH key configuration, see:
  - Setup Steps for SSH Connections to AWS CodeCommit Repositories on Linux, macOS, or Unix
  - Setup Steps for SSH Connections to AWS CodeCommit Repositories on Windows
- For GitHub SSH key configuration, see Connecting to GitHub with SSH. The SSH key configuration for other Git-based VCSs is similar to that of GitHub.
Do not set an SSH key passphrase for the SSH key pairs. If you do, access to the repo fails.
After the SSH key pair is configured, Git integration works with SSH URLs without further authentication information:
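For instance, a git_config that uses an SSH URL might look like the sketch below. The URL and branch name are placeholders, and no credential fields appear because the SSH key pair handles authentication.

```python
# git_config with an SSH URL (placeholder repo and branch).
# No username, password, or token is needed once the key pair is configured.
git_config = {
    "repo": "git@github.com:username/repo-with-training-scripts.git",
    "branch": "branch1",
}
```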
Authentication for HTTPS URLs
For HTTPS URLs, there are two ways to deal with authentication:
- Have it configured locally.
- Configure it by providing extra fields in git_config, namely 2FA_enabled, username, password, and token. Things can be slightly different here between CodeCommit, GitHub, and other Git-based repos.
Authenticating using Git credentials
If you authenticate with Git credentials, you can do one of the following:
- Provide the credentials in git_config.
- Have the credentials stored in local credential storage. Typically, the credentials are stored automatically after you provide them with the AWS CLI. For example, macOS stores credentials in Keychain Access.
With the Git credentials stored locally, you can specify the git_config parameter without providing the credentials, to avoid showing them in scripts:
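The two options can be sketched side by side. The repo URL is a placeholder, and the environment variable names are hypothetical; they are used here only so that the explicit variant does not hard-code secrets.

```python
import os

# Placeholder CodeCommit HTTPS URL; substitute your Region and repo name.
repo_url = "https://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name"

# Option 1: pass the Git credentials explicitly in git_config.
# The environment variable names are hypothetical.
git_config_explicit = {
    "repo": repo_url,
    "username": os.environ.get("CODECOMMIT_GIT_USERNAME", ""),
    "password": os.environ.get("CODECOMMIT_GIT_PASSWORD", ""),
}

# Option 2: credentials already live in local credential storage,
# so git_config carries only the repo location.
git_config_stored = {"repo": repo_url}
```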
Authenticating using AWS CLI Credential Helper
If you follow the setup documentation mentioned earlier to configure AWS CLI Credential Helper, you don’t have to provide any authentication information.
For GitHub and other Git-based repos, check whether two-factor authentication (2FA) is enabled for your account. (Two-factor authentication is disabled by default and must be enabled manually.) For more information, see Securing your account with two-factor authentication (2FA).
If 2FA is enabled for your account, provide 2FA_enabled when specifying git_config and set it to True. Otherwise, set it to False. If 2FA_enabled is not provided, it is set to False by default. Usually, you can use either username+password or a personal access token to authenticate for GitHub and other Git-based repos. However, when 2FA is enabled, you can only use a personal access token.
To use username+password for authentication:
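A sketch of the username+password form follows. The URL is a placeholder and the environment variable names are hypothetical.

```python
import os

# Username+password authentication for an account without 2FA.
# GIT_USERNAME and GIT_PASSWORD are hypothetical variable names.
git_config = {
    "repo": "https://github.com/username/repo-with-training-scripts.git",
    "2FA_enabled": False,
    "username": os.environ.get("GIT_USERNAME", ""),
    "password": os.environ.get("GIT_PASSWORD", ""),
}
```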
Again, you can store the credentials in local credential storage to avoid showing them in the script.
To use a personal access token for authentication:
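The token form can be sketched with a small helper. The helper name `token_git_config`, the URL, and the `GIT_TOKEN` variable are all hypothetical; only the git_config keys come from the text above.

```python
import os


def token_git_config(repo_url, token, two_factor_enabled=True):
    """Build a git_config that authenticates with a personal access token."""
    if not token:
        raise ValueError("a personal access token is required")
    return {"repo": repo_url, "2FA_enabled": two_factor_enabled, "token": token}


# Read the token from the environment; fall back to a dummy value for the sketch.
git_config = token_git_config(
    "https://github.com/username/repo-with-training-scripts.git",
    token=os.environ.get("GIT_TOKEN") or "placeholder-token",
)
```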
Create the estimator or model with Git integration
After you correctly specify git_config, pass it as a parameter when you create the estimator or model object to enable Git integration. Then, make sure that the entry_point, source_dir, and dependencies are all relative paths under the Git repo.
As usual, if source_dir is provided, entry_point should be a relative path from the source directory. The same is true with Git integration.
For example, with the following structure of the Git repo ‘amazon-sagemaker-examples’ under branch ‘training-scripts’:
You can create the estimator object as follows:
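The sketch below collects the Git-related arguments and checks that the paths are relative. It assumes the repo lives under the awslabs GitHub organization, and `all_relative` is a hypothetical helper, not part of the SDK. With the SDK installed, you would pass these keyword arguments to an estimator class together with the role, instance settings, and framework version that your SDK version requires.

```python
import posixpath

git_config = {
    # Assumed URL for the 'amazon-sagemaker-examples' repo.
    "repo": "https://github.com/awslabs/amazon-sagemaker-examples.git",
    "branch": "training-scripts",
}

estimator_kwargs = {
    "entry_point": "train.py",            # relative to source_dir
    "source_dir": "char-rnn-tensorflow",  # relative to the repo root
    "git_config": git_config,
}


def all_relative(kwargs):
    """Return True if entry_point, source_dir, and dependencies are relative paths."""
    paths = [kwargs["entry_point"], kwargs.get("source_dir")]
    paths += list(kwargs.get("dependencies", []))
    return all(not posixpath.isabs(p) for p in paths if p is not None)
```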
In this example, source_dir 'char-rnn-tensorflow' is a relative path inside the Git repo, while entry_point 'train.py' is a relative path under ‘char-rnn-tensorflow’.
Git integration example
Now let’s look at a complete example of using Git integration. This example trains a multi-layer LSTM RNN model on a language modeling task, based on a PyTorch example. By default, the training script uses the Wikitext-2 dataset. We train a model on SageMaker, deploy it, and then use the deployed model to generate new text.
Run the commands in a Python script, except for those that start with a ‘!’, which are bash commands.
First let’s do the setup:
Next get the dataset. This data is from Wikipedia and is licensed CC-BY-SA-3.0. Before you use this data for any other purpose than this example, you should understand the data license, described at https://creativecommons.org/licenses/by-sa/3.0/:
Upload the data to S3:
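One way to sketch the upload is a small wrapper around the session's `default_bucket` and `upload_data` methods. The wrapper name, the local directory, and the key prefix are hypothetical.

```python
def upload_training_data(session, local_dir, key_prefix):
    """Upload local_dir to the session's default bucket and return the S3 URI.

    `session` is expected to be a sagemaker.Session instance.
    """
    bucket = session.default_bucket()
    return session.upload_data(path=local_dir, bucket=bucket, key_prefix=key_prefix)


# With the SDK installed (directory and prefix are placeholders):
#   import sagemaker
#   s3_uri = upload_training_data(sagemaker.Session(), "data/training", "demo-prefix")
```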
Specify git_config and create the estimator with it:
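A hedged sketch of this step follows. The repo URL and the source directory name `pytorch-rnn-scripts` are assumptions, and `build_estimator` is a hypothetical helper that takes the estimator class as an argument so that version-specific settings (instance count and type, framework version, hyperparameters) can be supplied through `extra`.

```python
git_config = {
    # Assumed URL for the examples repo.
    "repo": "https://github.com/awslabs/amazon-sagemaker-examples.git",
    "branch": "training-scripts",
}


def build_estimator(estimator_cls, role, git_config, **extra):
    """Create an estimator whose scripts come from the Git repo."""
    return estimator_cls(
        entry_point="train.py",
        source_dir="pytorch-rnn-scripts",  # assumed directory inside the repo
        role=role,
        git_config=git_config,
        **extra
    )


# With the SDK installed:
#   from sagemaker.pytorch import PyTorch
#   estimator = build_estimator(PyTorch, role, git_config, ...)
```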
Train the model:
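Training is started with the estimator's `fit` method. In this sketch, the wrapper name `train` and the channel name `training` are assumptions; use whichever channel name your training script reads from.

```python
def train(estimator, s3_input_uri, channel="training"):
    """Start a training job that reads its data from one input channel."""
    estimator.fit({channel: s3_input_uri})
    return estimator
```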
Next, let’s host the model. We are going to provide custom implementations of the model_fn, input_fn, output_fn, and predict_fn hosting functions in a separate file ‘generate.py’, which is in the same Git repo. The PyTorch model uses an npy serializer and deserializer by default. For this example, since we have a custom implementation of all the hosting functions and plan on using JSON instead, we need a predictor that can serialize and deserialize JSON:
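The SDK's predictor and serializer class names differ between SDK versions, so the sketch below uses only the standard library to show what such a predictor does on each side of a request. The function names are hypothetical.

```python
import json


def serialize_json(payload):
    """Encode a Python object as UTF-8 JSON bytes for the request body."""
    return json.dumps(payload).encode("utf-8")


def deserialize_json(body):
    """Decode a JSON response body (bytes or str) back into Python objects."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)
```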
Create the model object:
Create the hosting endpoint:
Now we are going to use our deployed model to generate text by providing a random seed, a temperature (higher values increase diversity), and the number of words we would like to get:
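A request with these three inputs could be built as in the sketch below. The key names `seed`, `temperature`, and `words` and the example values are assumptions; the actual names are whatever the input_fn in ‘generate.py’ expects. The positivity check on temperature is also an assumption, based on temperature usually acting as a divisor during sampling.

```python
import json


def build_request(seed, temperature, words):
    """Validate the generation inputs and return them as a request dict."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if words < 1:
        raise ValueError("words must be at least 1")
    return {"seed": int(seed), "temperature": float(temperature), "words": int(words)}


request = build_request(seed=111, temperature=2.0, words=100)
body = json.dumps(request)  # JSON body that a JSON-capable predictor would send
```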
You get the following results:
Finally, delete the endpoint after you are done using it:
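Cleanup relies on the predictor's `delete_endpoint` method. The hypothetical helper below wraps a prediction in try/finally, which is one way to make sure the endpoint is removed even if the request fails.

```python
def predict_then_cleanup(predictor, request):
    """Send one request, then delete the endpoint whether or not it succeeded."""
    try:
        return predictor.predict(request)
    finally:
        predictor.delete_endpoint()
```

If you plan to send several requests, call `predictor.delete_endpoint()` on its own after the last one instead.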
Conclusion
In this post, I walked through how to use Git integration with the Amazon SageMaker Python SDK. With Git integration, you no longer have to download scripts from Git repos for training jobs and hosting models. Now you can use scripts in Git repos directly, simply by passing an additional parameter git_config when creating the Estimator or Model object.
If you have questions or suggestions, please leave them in the comments.
About the Authors
Yue Tu is a summer intern on the AWS SageMaker ML Frameworks team. He works on Git integration for the SageMaker Python SDK during his internship. Outside of work, he likes playing basketball; his favorite basketball teams are the Golden State Warriors and the Duke basketball team. He also likes paying attention to nothing for some time.
Chuyang Deng is a software development engineer on the AWS SageMaker ML Frameworks team. She enjoys playing LEGO alone.