Why did my Amazon SageMaker pipeline execution fail?

Last updated: October 17, 2022

I want to troubleshoot why my Amazon SageMaker pipeline execution failed.

Resolution

To troubleshoot a failed pipeline execution in SageMaker, complete the following steps:

Note: If you receive errors when you run AWS CLI commands, make sure that you're using the most recent version of the AWS CLI.

1.    Run the AWS Command Line Interface (AWS CLI) command list-pipeline-executions:

Note: If you haven't configured the AWS CLI on your local machine, use AWS CloudShell instead.

$ aws sagemaker list-pipeline-executions --pipeline-name test-pipeline-p-wzx9cplzrvdk

The command returns a list of the executions for your pipeline. The output is similar to the following:

{
    "PipelineExecutionSummaries": [
        {
            "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b",
            "StartTime": "2022-09-27T12:56:44.646000+00:00",
            "PipelineExecutionStatus": "Failed",
            "PipelineExecutionDisplayName": "execution-1664283404791",
            "PipelineExecutionFailureReason": "Step failure: One or multiple steps failed."
        },
        {
            "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/acvref9y1f47",
            "StartTime": "2022-09-27T12:13:28.762000+00:00",
            "PipelineExecutionStatus": "Succeeded",
            "PipelineExecutionDisplayName": "execution-1664280808943"
        }
    ]
}
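
When a pipeline has many executions, it can help to narrow the list to the failed ones programmatically. The following is a minimal sketch, assuming that boto3 is installed and configured with credentials. The helper names failed_executions and fetch_failed_executions are hypothetical and aren't part of any AWS library. The sketch reads only the first page of results.

```python
from typing import Any, Dict, List


def failed_executions(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return only the execution summaries whose status is "Failed".

    `response` has the same shape as the list-pipeline-executions output.
    The order returned by the API is preserved.
    """
    summaries = response.get("PipelineExecutionSummaries", [])
    return [s for s in summaries if s.get("PipelineExecutionStatus") == "Failed"]


def fetch_failed_executions(pipeline_name: str) -> List[Dict[str, Any]]:
    """Call SageMaker and filter the first page of results (hypothetical helper)."""
    import boto3  # assumption: boto3 is installed and credentials are configured

    client = boto3.client("sagemaker")
    response = client.list_pipeline_executions(PipelineName=pipeline_name)
    return failed_executions(response)


# Example usage (requires AWS access, so it is left commented out):
# for execution in fetch_failed_executions("test-pipeline-p-wzx9cplzrvdk"):
#     print(execution["PipelineExecutionArn"],
#           execution.get("PipelineExecutionFailureReason"))
```

The filter is kept separate from the API call so that it can also be applied to JSON output saved from the AWS CLI.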

2.    Run the list-pipeline-execution-steps command to see which steps failed:

$ aws sagemaker list-pipeline-execution-steps --pipeline-execution-arn arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b

The output is similar to the following:

{
    "PipelineExecutionSteps": [
        {
            "StepName": "TrainAbaloneModel",
            "StartTime": "2022-09-27T13:00:49.235000+00:00",
            "EndTime": "2022-09-27T13:01:50.056000+00:00",
            "StepStatus": "Failed",
            "AttemptCount": 0,
            "FailureReason": "ClientError: ClientError: Please ensure the security group provided is valid",
            "Metadata": {
                "TrainingJob": {
                    "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
                }
            }
        },
        {
            "StepName": "PreprocessAbaloneData",
            "StartTime": "2022-09-27T12:56:45.595000+00:00",
            "EndTime": "2022-09-27T13:00:48.638000+00:00",
            "StepStatus": "Succeeded",
            "AttemptCount": 0,
            "Metadata": {
                "ProcessingJob": {
                    "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:processing-job/pipelines-lvejn1jl827b-preprocessabalonedat-6axq0kthyg"
                }
            }
        }
    ]
}

In this example, the training job step failed because the VpcConfig object of the job specifies a security group that doesn't exist.
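
To pull the failed steps and their failure reasons out of a longer step list, you can filter the response in code. The following is a minimal sketch that works on a dictionary with the same shape as the output above. The function name failed_steps and the keys of the returned dictionaries are hypothetical choices for illustration.

```python
from typing import Any, Dict, List, Optional


def failed_steps(response: Dict[str, Any]) -> List[Dict[str, Optional[str]]]:
    """Summarize each failed step: its name, failure reason, and job ARN.

    `response` has the same shape as the list-pipeline-execution-steps output.
    """
    results: List[Dict[str, Optional[str]]] = []
    for step in response.get("PipelineExecutionSteps", []):
        if step.get("StepStatus") != "Failed":
            continue

        # Metadata is keyed by job type, for example "TrainingJob" or
        # "ProcessingJob". Take the first entry that carries an ARN.
        job_arn = None
        for details in step.get("Metadata", {}).values():
            if isinstance(details, dict) and "Arn" in details:
                job_arn = details["Arn"]
                break

        results.append(
            {
                "StepName": step.get("StepName"),
                "FailureReason": step.get("FailureReason"),
                "JobArn": job_arn,
            }
        )
    return results


# Example usage (requires AWS access, so it is left commented out):
# import boto3
# client = boto3.client("sagemaker")
# response = client.list_pipeline_execution_steps(
#     PipelineExecutionArn="<your-pipeline-execution-arn>"
# )
# for step in failed_steps(response):
#     print(step["StepName"], "->", step["FailureReason"])
```

The job ARN is returned together with the failure reason because it identifies the underlying SageMaker job to inspect when the reason alone isn't enough.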

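If the FailureReason of the failed step isn't clear, then check Amazon CloudWatch Logs for the failed SageMaker job or endpoint to troubleshoot further. You can view the logs for training jobs in the CloudWatch log group /aws/sagemaker/TrainingJobs. The log stream name is similar to the following: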

example-training-job-name/algo-example-instance-number-in-cluster-example-epoch-timestamp
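
Because the log stream name starts with the training job name, the job ARN from the failed step can be used to look up the matching log events. The following is a minimal sketch, assuming that boto3 is installed and configured with credentials. The helper names are hypothetical. A job that failed before it started, such as the one in the example above, might not have written any log streams, in which case the result is empty.

```python
from typing import List

LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def training_job_name_from_arn(arn: str) -> str:
    """Extract the job name from a SageMaker training job ARN.

    Expected form: arn:aws:sagemaker:<region>:<account>:training-job/<name>
    """
    parts = arn.split(":", 5)
    if len(parts) != 6:
        raise ValueError(f"Not a valid ARN: {arn!r}")
    resource_type, _, name = parts[5].partition("/")
    if resource_type != "training-job" or not name:
        raise ValueError(f"Not a training job ARN: {arn!r}")
    return name


def fetch_training_job_log_messages(job_arn: str, limit: int = 50) -> List[str]:
    """Return up to `limit` log messages for the job (hypothetical helper)."""
    import boto3  # assumption: boto3 is installed and credentials are configured

    logs = boto3.client("logs")
    response = logs.filter_log_events(
        logGroupName=LOG_GROUP,
        # Stream names begin with the job name, so it serves as the prefix.
        logStreamNamePrefix=training_job_name_from_arn(job_arn),
        limit=limit,
    )
    return [event["message"] for event in response.get("events", [])]


# Example usage (requires AWS access, so it is left commented out):
# for message in fetch_training_job_log_messages("<your-training-job-arn>"):
#     print(message)
```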

