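Submitting and Managing Amazon SageMaker Training Jobs from a Self-Built Kubernetes Cluster
(Part 2) Submitting Training Jobs with Local Code and Data
August 2020
Abstract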
This post continues from Part 1 (https://aws.amazon.com/blogs/china/self-built-kubernetes-cluster-submission-and-management-of-amazon-sagemaker-training-tasks-1-sagemaker-operator-installation-and-task-submission/)
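Part 1 covered how to install SageMaker Operator and use it to submit jobs, mainly training jobs based on built-in images and data. Amazon SageMaker can run training jobs on custom images as well as built-in ones. In a custom scenario, you can use either one of AWS's preconfigured algorithms or your own algorithm. With a preconfigured algorithm, script parameters let you specify where the training data and the code are located. This post uses one such job as an example (AWS preconfigured TensorFlow algorithm + custom training data + local BERT code) to show how to submit a SageMaker job using data and code from an on-premises data center.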
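Use case
Some companies keep their algorithm code and training data in an on-premises data center. In this hybrid scenario, we can upload the local code and training data to S3 and use entry-point parameters to tell SageMaker where to load the code and data from. This makes it possible to submit SageMaker jobs from a K8S cluster running in the on-premises data center.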
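Technical points:
The scenario for submitting a job from on premises is as follows:
1. The example uses the BYOS (Bring Your Own Scripts) approach on top of a framework image
a) The code is stored in a local GitLab instance or another code hosting tool
b) The training data is stored locally
2. To keep the example simple, it uses TensorFlow with the BERT algorithm
To submit a cloud training job from on premises, pay attention to the following points
1. SageMaker reads training data from the cloud (S3, EFS, and so on are supported), so the local training data must be synchronized to S3. The files can be uploaded with the S3 SDK or the command line.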
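As a rough sketch of the upload in step 1, the helper below maps every file under a local data directory to the S3 key it would be stored under, and a second function uploads them with boto3. The function names `plan_uploads` and `upload_dataset` are hypothetical, the bucket and prefix are placeholders, and the upload assumes boto3 is installed and AWS credentials are configured.

```python
from pathlib import Path


def plan_uploads(local_dir, key_prefix):
    """Map every file under local_dir to the S3 key it would be uploaded to."""
    root = Path(local_dir)
    prefix = key_prefix.strip("/")
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            key = f"{prefix}/{relative}" if prefix else relative
            plan.append((str(path), key))
    return plan


def upload_dataset(local_dir, bucket, key_prefix):
    """Upload the planned files (assumes boto3 and AWS credentials are available)."""
    import boto3  # imported lazily so plan_uploads works without boto3 installed

    s3 = boto3.client("s3")
    for local_path, key in plan_uploads(local_dir, key_prefix):
        s3.upload_file(local_path, bucket, key)
```

For example, `upload_dataset("./mrpc", "your_bucket", "DataSet/TraingJobName")` would place the files under the prefix that the training channel in training.yaml points to.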
2. Check the BERT entry-point file to confirm which parameters the algorithm requires; see reference 2 for the full parameter list
if __name__ == "__main__":
    flags.mark_flag_as_required("data_dir")
    flags.mark_flag_as_required("task_name")
    flags.mark_flag_as_required("vocab_file")
    flags.mark_flag_as_required("bert_config_file")
    flags.mark_flag_as_required("output_dir")
    tf.app.run()
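3. training.yaml is the template for submitting SageMaker jobs from the K8S cluster. The location of the code (BERT) is passed in through hyperParameters, and SageMaker fetches the code from that location to run the job (for how to provide training information to SageMaker, see reference 4)
An example training.yaml file: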
apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  # Job name, in a format of your choosing
  name: tensorflow-1594280036
spec:
  # An IAM role that has SageMaker full access (in China Regions, role ARNs start with arn:aws-cn)
  roleArn: arn:aws:iam::XXXXX:role/sagemaker
  # Change this to a China Region (Beijing: cn-north-1; Ningxia: cn-northwest-1)
  region: cn-north-1
  algorithmSpecification:
    # Use an official image address; available image repositories are listed at https://aws.amazon.com/cn/releasenotes/available-deep-learning-containers-images/ . Choose the URI listed for the region the job runs in. Note that a "gpu" image requires a GPU training instance such as ml.p3.*.
    trainingImage: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.14-gpu-py3
    # Available options are File and Pipe
    trainingInputMode: File
  outputDataConfig:
    # The model is uploaded to this address after training
    s3OutputPath: s3://your_bucket/output/models/
  inputDataConfig:
    # Required. Describes the training channel and tells SageMaker where to load the training data from. Change s3Uri to a location in your own account.
    - channelName: training
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://your_bucket/DataSet/TraingJobName
          s3DataDistributionType: FullyReplicated
      compressionType: None
  # Number, type, and volume size of the instances used for training
  resourceConfig:
    instanceCount: 2
    instanceType: ml.p3.8xlarge
    volumeSizeInGB: 50
  # Set the hyperParameters according to the job's requirements
  hyperParameters:
    # Required by the entry point. Path of the BERT configuration file; the code is loaded from s3://your_code_bucket/bert-tf/sourcedir.tar.gz and stored under /opt/ml/code
    - name: bert_config_file
      value: "/opt/ml/code/bert_config.json"
    # Required by the entry point. SageMaker loads the data from s3://your_bucket/DataSet/TraingJobName and stores it in this path for training
    - name: data_dir
      value: "/opt/ml/input/data/training/"
    # Whether to run evaluation
    - name: do_eval
      value: "1"
    # Whether to run training
    - name: do_train
      value: "1"
    # Checkpoint path. The code is loaded from s3://your_code_bucket/bert-tf/sourcedir.tar.gz, so bert_model.ckpt must be included in your code package.
    - name: init_checkpoint
      value: "/opt/ml/code/bert_model.ckpt"
    - name: learning_rate
      value: "2e-05"
    - name: max_seq_length
      value: "128"
    - name: model_dir
      value: "/opt/ml/model"
    # Number of epochs to train for
    - name: num_train_epochs
      value: "100.0"
    - name: output_dir
      value: "/opt/ml/model"
    - name: sagemaker_container_log_level
      value: "20"
    - name: sagemaker_enable_cloudwatch_metrics
      value: "true"
    - name: sagemaker_mpi_custom_mpi_options
      value: "-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none"
    - name: sagemaker_mpi_enabled
      value: "true"
    - name: sagemaker_mpi_num_of_processes_per_host
      value: "4"
    # The entry-point file name
    - name: sagemaker_program
      value: "run_classifier_hvd.py"
    - name: sagemaker_region
      value: "cn-north-1"
    # Where SageMaker loads your code from
    - name: sagemaker_submit_directory
      value: "s3://your_code_bucket/bert-tf/sourcedir.tar.gz"
    - name: task_name
      value: "MRPC"
    - name: train_batch_size
      value: "32"
    # Required by the entry point. The file comes from the code loaded from s3://your_code_bucket/bert-tf/sourcedir.tar.gz.
    - name: vocab_file
      value: "/opt/ml/code/vocab.txt"
  # With managed spot training, both maxRuntimeInSeconds and maxWaitTimeInSeconds must be set, and maxWaitTimeInSeconds must be >= maxRuntimeInSeconds
  stoppingCondition:
    maxRuntimeInSeconds: 86400
    maxWaitTimeInSeconds: 172800
  # When using managed spot training, set this value and the checkpoint address
  enableManagedSpotTraining: true
  # The checkpoint address, located in S3
  checkpointConfig:
    s3Uri: s3://your_bucket/bert-hvd-tf/checkpoints/
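When training.yaml is generated from a script rather than edited by hand, the hyperParameters list can be built from a plain dictionary. The sketch below is only an illustration: the helper names are hypothetical, and it assumes every value should be rendered as a string, as the quoted values in the example file suggest. It also checks the stopping-condition rule noted in the file's comments.

```python
def to_operator_hyperparameters(params):
    """Convert a dict into the name/value list used under spec.hyperParameters.

    Every value is rendered as a string, matching the quoted values in training.yaml.
    """
    rendered = []
    for name in sorted(params):
        value = params[name]
        if isinstance(value, bool):
            # Render booleans the way the example file writes them.
            value = "true" if value else "false"
        rendered.append({"name": name, "value": str(value)})
    return rendered


def spot_stopping_condition_is_valid(max_runtime_seconds, max_wait_seconds):
    """Managed spot training needs maxWaitTimeInSeconds >= maxRuntimeInSeconds."""
    return max_wait_seconds >= max_runtime_seconds
```

The resulting list can be dumped under spec.hyperParameters with whichever YAML library you already use.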
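4. Package the code as sourcedir.tar.gz (the entry-point file run_classifier_hvd.py must be at the root of the archive) and upload the archive to the code bucket. The uploaded location must match the sagemaker_submit_directory setting in training.yaml.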
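One way to build the archive from step 4 so that the entry point sits at the archive root is sketched below with Python's standard tarfile module. The function name `package_source` and the directory layout are assumptions for illustration; the archive name and entry-point name come from training.yaml.

```python
import tarfile
from pathlib import Path


def package_source(source_dir, entry_point, archive_path="sourcedir.tar.gz"):
    """Pack source_dir into a gzip tarball with its files at the archive root."""
    root = Path(source_dir)
    if not (root / entry_point).is_file():
        raise FileNotFoundError(f"{entry_point} not found in {root}")
    target = Path(archive_path).resolve()
    # Collect the files first so the archive never tries to include itself.
    files = [
        path for path in sorted(root.rglob("*"))
        if path.is_file() and path.resolve() != target
    ]
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            # arcname is relative to source_dir, so the entry point
            # lands at the top level of the archive.
            archive.add(str(path), arcname=path.relative_to(root).as_posix())
    return archive_path
```

Calling `package_source("bert", "run_classifier_hvd.py")` fails early if the entry point is missing, which is cheaper than discovering the problem after the training job has started.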
5. Once the code and the data have been transferred to their respective S3 buckets, submit the job from the local K8S cluster by running kubectl apply -f training.yaml.
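6. The outputDataConfig in training.yaml specifies where SageMaker stores the trained model files. We can configure an event trigger on that S3 bucket so that, once SageMaker finishes training and uploads the result, the model files are pulled back to the data center.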
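For the event trigger in step 6, whatever receives the notification first has to find the model archive in the event payload. The sketch below assumes the standard S3 event notification JSON layout and assumes the model is written as model.tar.gz under the configured output path; the function name is hypothetical, and the download back to the data center is left out.

```python
import json
from urllib.parse import unquote_plus


def model_artifacts_from_event(event, suffix="model.tar.gz"):
    """Return (bucket, key) pairs for model archives named in an S3 event notification."""
    if isinstance(event, str):
        event = json.loads(event)
    artifacts = []
    for record in event.get("Records", []):
        # Only object-creation events signal a newly uploaded model.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.endswith(suffix):
            artifacts.append((bucket, key))
    return artifacts
```

Each returned pair can then be handed to the S3 SDK or command line to copy the archive to on-premises storage.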
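Further discussion:
1. Q: How are the official AWS SageMaker framework images built, what do they install, and how do I add a component that the official image does not include?
A: Beyond some basic components, the official SageMaker images install only the components SageMaker needs to run the framework; for an example, see (https://github.com/aws/deep-learning-containers/blob/master/tensorflow/training/docker/2.3.0/py3/Dockerfile.cpu). To simplify later operations and troubleshooting, we can therefore use the official AWS images. If extra packages are needed, install them with pip from a requirements.txt file; for details, see (https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#use-third-party-libraries)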
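2. Q: When SageMaker reads a dataset from S3, how do FILE mode and PIPE mode differ?
A: In FILE mode, SageMaker pulls all of the training data to the container's local storage before training starts, and only then begins training. When there are multiple epochs, the data does not have to be pulled from S3 again each time. PIPE mode streams the data while computing, so the job starts quickly and uses little local storage. However, the data is pulled from S3 again before every epoch, and the entry-point code must be adapted to PIPE mode (https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#training-with-pipe-mode-using-pipemodedataset). Overall, FILE mode suits jobs with many epochs and a modest dataset, while PIPE mode may be the better fit for jobs with few epochs and a large dataset.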
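The trade-off can be made concrete with a back-of-the-envelope estimate. The sketch below only encodes the behavior described in the answer (FILE downloads the dataset once, PIPE streams it again for every epoch) for a single instance; the function names are hypothetical, and real transfer volumes depend on the job configuration.

```python
def estimate_s3_transfer_gb(dataset_gb, epochs, input_mode):
    """Rough amount of data one instance reads from S3 over a whole job."""
    mode = input_mode.lower()
    if mode == "file":
        return dataset_gb  # downloaded once before training starts
    if mode == "pipe":
        return dataset_gb * epochs  # streamed again for every epoch
    raise ValueError(f"unknown input mode: {input_mode}")


def estimate_local_storage_gb(dataset_gb, input_mode):
    """Rough local disk space the dataset itself occupies on one instance."""
    mode = input_mode.lower()
    if mode == "file":
        return dataset_gb  # the full dataset is copied to local storage
    if mode == "pipe":
        return 0  # data is streamed rather than stored
    raise ValueError(f"unknown input mode: {input_mode}")
```

With a 10 GB dataset and 100 epochs, FILE mode reads about 10 GB from S3 while PIPE mode reads about 1000 GB, which is why FILE mode is attractive for small datasets trained over many epochs.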
References:
1. SageMaker framework image addresses:
https://aws.amazon.com/cn/releasenotes/available-deep-learning-containers-images/
2. BERT entry-point file:
https://github.com/google-research/bert/blob/master/run_classifier.py
3. BERT HVD entry-point file:
https://github.com/lambdal/bert/blob/master/run_classifier_hvd.py
4. How to build a SageMaker image:
https://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/build-container-to-train-script-get-started.html
5. How the official AWS images are built:
https://github.com/aws/deep-learning-containers