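Submitting and Managing Amazon SageMaker Training Jobs from a Self-Built Kubernetes Cluster (Part 2): SageMaker Operator Installation and Job Submission

Submitting and Managing Amazon SageMaker Training Jobs from a Self-Built Kubernetes Cluster

(Part 2) Submitting Training Jobs with Local Code and Data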

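August 2020
Abstract
Continued from the previous article (https://aws.amazon.com/blogs/china/self-built-kubernetes-cluster-submission-and-management-of-amazon-sagemaker-training-tasks-1-sagemaker-operator-installation-and-task-submission/)
The previous article covered how to install SageMaker Operator and use it to submit jobs, mainly training jobs that rely on built-in images and data. Amazon SageMaker accepts custom images as well as built-in ones. In the custom scenario, users can either use an AWS pre-configured algorithm or bring their own algorithm. With a pre-configured algorithm, the locations of the training data and of the code can be customized through script parameters. This article takes one such job as an example (the AWS TensorFlow pre-configured algorithm + custom training data + local BERT code) and walks through how to submit a SageMaker job using data and code kept in an on-premises data center.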

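Use Case
In some cases a company keeps its algorithm code and training data in an on-premises data center. In this hybrid setup, we can upload the local code and training data to S3 and tell SageMaker where to load them from through entry-point parameters. This makes it possible to submit SageMaker jobs from a K8S cluster running in the on-premises data center.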

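Technical Points:

For the whole on-premises submission scenario:
1. The example uses the BYOS (Bring Your Own Scripts) approach for the training image
a) The code lives in an on-premises GitLab or another code hosting tool
b) The training data is stored on-premises
2. For ease of illustration, the example uses TensorFlow + BERT
To submit cloud training jobs from on-premises, note the following technical points:
1. SageMaker reads training data from the cloud (S3, EFS, and so on are supported), so the local training data must be synced to S3. Files can be uploaded with the S3 SDK or the command line.
2. Check the BERT entry-point file for the parameters that must be passed in; see reference 2 for details.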

if __name__ == "__main__":
  flags.mark_flag_as_required("data_dir")
  flags.mark_flag_as_required("task_name")
  flags.mark_flag_as_required("vocab_file")
  flags.mark_flag_as_required("bert_config_file")
  flags.mark_flag_as_required("output_dir")
  tf.app.run()
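
For the first point above (syncing the local training data to S3), the upload could look roughly like the following. This is a minimal sketch rather than the article's own tooling: it assumes the `boto3` SDK is installed and AWS credentials are configured, and the helper names `plan_uploads` and `upload_dataset` are hypothetical.

```python
import os


def plan_uploads(local_dir, key_prefix):
    """Map every file under local_dir to the S3 key it should be uploaded to."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            # Keep the directory layout, but always use "/" in S3 keys.
            rel = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            plan.append((local_path, key_prefix.rstrip("/") + "/" + rel))
    return plan


def upload_dataset(local_dir, bucket, key_prefix):
    """Upload a directory to s3://bucket/key_prefix/ (assumes boto3 and credentials)."""
    import boto3  # imported lazily so the planning step works without the SDK

    s3 = boto3.client("s3")
    for local_path, key in plan_uploads(local_dir, key_prefix):
        s3.upload_file(local_path, bucket, key)
```

A call such as `upload_dataset("./MRPC", "your_bucket", "DataSet/TraingJobName")` would place the files under the prefix that the training channel reads from; the local path `./MRPC` is a placeholder.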

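3. training.yaml serves as the template for submitting SageMaker jobs from the K8S cluster. The location of the code (BERT) is passed in through hyperParameters, and SageMaker fetches the code from that location to run the job (see reference 4 for how to provide training information to SageMaker).
An example training.yaml file: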

apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  # Set a job name in your own format
  name: tensorflow-1594280036
spec:
  # An IAM role that has SageMaker full access
  roleArn: arn:aws:iam::XXXXX:role/sagemaker
  # Set this to a China region (Beijing: cn-north-1; Ningxia: cn-northwest-1)
  region: cn-north-1
  algorithmSpecification:
    # Use an official image address; available image repos are listed at https://aws.amazon.com/cn/releasenotes/available-deep-learning-containers-images/ . Note that a "gpu" image requires a GPU training instance such as ml.p3.*.
    trainingImage: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.14-gpu-py3
    # Available options are File or Pipe
    trainingInputMode: File
  outputDataConfig:
    # The model is uploaded to this address after training
    s3OutputPath: s3://your_bucket/output/models/
  inputDataConfig:
    # Describes the training channel, that is, where SageMaker loads the training data from. Change s3Uri to a location in your own account. This section is required.
    - channelName: training
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://your_bucket/DataSet/TraingJobName
          s3DataDistributionType: FullyReplicated
      compressionType: None
  # Number and type of instances used for training
  resourceConfig:
    instanceCount: 2
    instanceType: ml.p3.8xlarge
    volumeSizeInGB: 50
  # Maintain the hyperParameters according job requirement
  hyperParameters:
    # Required by the entry point: the BERT configuration file. The code is loaded from s3://your_code_bucket/bert-tf/sourcedir.tar.gz and stored under this path
    - name: bert_config_file
      value: "/opt/ml/code/bert_config.json"
    # Required by the entry point: SageMaker loads the data from s3://your_bucket/DataSet/TraingJobName and stores it under this path for training
    - name: data_dir
      value: "/opt/ml/input/data/training/"
    # Whether to run evaluation (1 = yes)
    - name: do_eval
      value: "1"
    # Whether to run training (1 = yes)
    - name: do_train
      value: "1"
    # The checkpoint path. The code is loaded from s3://your_code_bucket/bert-tf/sourcedir.tar.gz, so bert_model.ckpt must be packaged with your code.
    - name: init_checkpoint
      value: "/opt/ml/code/bert_model.ckpt"
    - name: learning_rate
      value: "2e-05"
    - name: max_seq_length
      value: "128"
    - name: model_dir
      value: "/opt/ml/model"
    # Number of training epochs
    - name: num_train_epochs
      value: "100.0"
    - name: output_dir
      value: "/opt/ml/model"
    - name: sagemaker_container_log_level
      value: "20"
    - name: sagemaker_enable_cloudwatch_metrics
      value: "true"
    - name: sagemaker_mpi_custom_mpi_options
      value: "-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none"
    - name: sagemaker_mpi_enabled
      value: "true"
    - name: sagemaker_mpi_num_of_processes_per_host
      value: "4"
    # The EntryPoint file name.
    - name: sagemaker_program
      value: "run_classifier_hvd.py"
    - name: sagemaker_region
      value: "cn-north-1"
    # Where SageMaker can load your code.
    - name: sagemaker_submit_directory
      value: "s3://your_code_bucket/bert-tf/sourcedir.tar.gz"
    - name: task_name
      value: "MRPC"
    - name: train_batch_size
      value: "32"
    # Required by the entry point; the file is loaded from s3://your_code_bucket/bert-tf/sourcedir.tar.gz.
    - name: vocab_file
      value: "/opt/ml/code/vocab.txt"
  # With managed spot training you must set both maxRuntimeInSeconds and maxWaitTimeInSeconds, and maxWaitTimeInSeconds must be >= maxRuntimeInSeconds
  stoppingCondition:
    maxRuntimeInSeconds: 86400
    maxWaitTimeInSeconds: 172800
  # For managed spot training, set this value and the checkpoint address
  enableManagedSpotTraining: true
  # The checkpoint address, located in S3
  checkpointConfig:
    s3Uri: s3://your_bucket/bert-hvd-tf/checkpoints/
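
The comments in the template state a few constraints that are easy to break when editing by hand. The sketch below checks them before submission. It assumes the `spec` section has already been loaded into a Python dict (for example with a YAML parser); the function name `check_training_spec` is hypothetical, and the checks mirror only what the template's comments ask for, not the full set of SageMaker validation rules.

```python
def check_training_spec(spec):
    """Return a list of problems found in a TrainingJob spec given as a dict."""
    problems = []

    # Every hyperparameter value in the template is written as a quoted string.
    for item in spec.get("hyperParameters", []):
        if not isinstance(item.get("value"), str):
            problems.append("hyperParameter %r must be a string" % item.get("name"))

    # Managed spot training: both time limits set, wait >= runtime,
    # and a checkpoint location in S3.
    if spec.get("enableManagedSpotTraining"):
        stop = spec.get("stoppingCondition", {})
        run = stop.get("maxRuntimeInSeconds")
        wait = stop.get("maxWaitTimeInSeconds")
        if run is None or wait is None:
            problems.append("spot training needs maxRuntimeInSeconds and maxWaitTimeInSeconds")
        elif wait < run:
            problems.append("maxWaitTimeInSeconds must be >= maxRuntimeInSeconds")
        checkpoint_uri = spec.get("checkpointConfig", {}).get("s3Uri", "")
        if not checkpoint_uri.startswith("s3://"):
            problems.append("spot training needs checkpointConfig.s3Uri in S3")

    return problems
```

An empty list means the spec passes these particular checks.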

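4. Package the code as sourcedir.tar.gz (the entry-point file run_classifier_hvd.py must be at the root of the archive) and upload the archive to the code bucket. The uploaded location must match the sagemaker_submit_directory setting in training.yaml.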
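
The packaging step can be sketched with Python's standard `tarfile` module. This is an illustrative sketch, not the article's own script: the helper names `package_source` and `entry_point_at_root` are hypothetical, and the code directory is whatever local checkout holds the BERT code.

```python
import os
import tarfile


def package_source(code_dir, entry_point, out_path="sourcedir.tar.gz"):
    """Pack the contents of code_dir so that entry_point sits at the archive root."""
    if not os.path.isfile(os.path.join(code_dir, entry_point)):
        raise FileNotFoundError(entry_point + " not found in " + code_dir)
    # List the files first so a freshly created archive is not picked up.
    names = sorted(os.listdir(code_dir))
    with tarfile.open(out_path, "w:gz") as tar:
        for name in names:
            path = os.path.join(code_dir, name)
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue  # never add the archive to itself
            # arcname drops the parent directory, so files land at the root.
            tar.add(path, arcname=name)
    return out_path


def entry_point_at_root(archive_path, entry_point):
    """Check that the archive has entry_point as a top-level member."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return entry_point in tar.getnames()
```

Because the template points `bert_config_file`, `vocab_file`, and `init_checkpoint` at `/opt/ml/code/`, those files would need to sit at the archive root as well.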
5. Once the code and data have been transferred to their respective S3 buckets, the job can be submitted from the on-premises K8S cluster with kubectl apply -f training.yaml.
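
When submission is scripted, a thin wrapper around `kubectl` may be enough. The sketch below is an assumption-laden illustration: it assumes `kubectl` is on the PATH and configured for the cluster, it treats the numeric suffix in the template's job name as a Unix timestamp (which is what it appears to be), and the helper names are hypothetical.

```python
import subprocess
import time


def make_job_name(prefix="tensorflow", now=None):
    """Build a job name such as tensorflow-1594280036 from a Unix timestamp."""
    ts = int(time.time() if now is None else now)
    return "%s-%d" % (prefix, ts)


def apply_command(manifest_path="training.yaml"):
    """Command line that submits the manifest to the cluster."""
    return ["kubectl", "apply", "-f", manifest_path]


def status_command(job_name):
    """Command line that shows the TrainingJob resource created by the operator."""
    return ["kubectl", "get", "trainingjob", job_name]


def run(command):
    """Execute a command and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)
```

For example, `run(apply_command())` submits the job, and `run(status_command("tensorflow-1594280036"))` queries it afterwards.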
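6. The outputDataConfig section of training.yaml specifies where SageMaker stores the trained model files. We can configure an event trigger on that S3 bucket so that, once SageMaker finishes training and uploads the result, the model files are automatically pulled back to the data center.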
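
One way to act on such an event trigger is to filter the S3 notification for the model archive and then download it. The sketch below assumes the standard S3 event notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`) and that the model is written as `model.tar.gz` under the output path; the helper names are hypothetical, and `download` additionally assumes `boto3` and AWS credentials.

```python
from urllib.parse import unquote_plus


def model_artifacts(event, suffix="model.tar.gz"):
    """Return (bucket, key) pairs for model archives named in an S3 event."""
    found = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key.endswith(suffix):
            found.append((bucket, key))
    return found


def download(bucket, key, dest_path):
    """Fetch one artifact to a local path (assumes boto3 and credentials)."""
    import boto3  # imported lazily so the filtering step works without the SDK

    boto3.client("s3").download_file(bucket, key, dest_path)
```

How the notification reaches the data center (a queue, a function, or polling) is left open here, since the article does not specify it.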
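Further Discussion:
1. Q: How are the official AWS SageMaker framework images built, what essential components do they include, and how do I add a component that the official image lacks?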

A: Besides some basic components, the official SageMaker images install only the components required to work with SageMaker; see, for example, https://github.com/aws/deep-learning-containers/blob/master/tensorflow/training/docker/2.3.0/py3/Dockerfile.cpu . To keep later operations and troubleshooting simple, we therefore recommend using the official AWS images. If extra packages are needed, they can be installed with pip from a requirements.txt file; see https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#use-third-party-libraries for details.
2. Q: When SageMaker reads a dataset from S3, how do File mode and Pipe mode differ?

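A: In File mode, SageMaker pulls all the training data into the container's local storage before training starts, and only then begins training. When there are multiple epochs, the data does not have to be pulled from S3 again each time. Pipe mode streams the data while computing, so the training job starts quickly and uses little local storage. However, the data is pulled from S3 again at the start of every epoch, and the entry-point code must be adapted to Pipe mode (https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#training-with-pipe-mode-using-pipemodedataset). Overall, File mode suits jobs with many epochs and a small dataset, while Pipe mode is likely the better fit for jobs with few epochs and a large dataset.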
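
The trade-off can be made concrete with a back-of-the-envelope estimate. The sketch below is a rough illustration only: the function name `compare_input_modes` is hypothetical, the figures are per training instance, and it ignores the space that code, checkpoints, and model output also take on the volume.

```python
def compare_input_modes(dataset_gb, epochs, volume_gb):
    """Estimate per-instance S3 transfer for File mode and Pipe mode."""
    return {
        # File mode: one full download before training starts.
        "file_transfer_gb": dataset_gb,
        # Pipe mode: the dataset is streamed again for every epoch.
        "pipe_transfer_gb": dataset_gb * epochs,
        # File mode also needs the whole dataset to fit on the training volume.
        "file_fits_on_volume": dataset_gb <= volume_gb,
    }
```

With the template's 100 epochs and 50 GB volume, a hypothetical 40 GB dataset would be transferred once in File mode but 100 times in Pipe mode, which is why File mode is the natural choice there.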

References:
1. SageMaker framework image addresses:
https://aws.amazon.com/cn/releasenotes/available-deep-learning-containers-images/
2. BERT entry-point file:
https://github.com/google-research/bert/blob/master/run_classifier.py
3. BERT HVD entry-point file:
https://github.com/lambdal/bert/blob/master/run_classifier_hvd.py
4. How to build a SageMaker image:
https://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/build-container-to-train-script-get-started.html
5. How the official AWS images are built:
https://github.com/aws/deep-learning-containers
