
STOmics Series 2 – A Deep Dive into Cromwell and Volcano Integration

The first article in this series introduced the core concepts of STOmics, Amazon EKS, Karpenter, and Volcano, and showed how to use these components and services to build a containerized HPC platform as a best practice for STOmics personalized-analysis scenarios. Several difficulties came up along the way. This article discusses the Cromwell and Volcano integration issues in detail and explores how to use Cromwell to submit a WDL-based bioinformatics pipeline to an Amazon EKS cluster for efficient job scheduling and computational analysis.

Introduction to Cromwell

Cromwell is a popular open-source engine for managing and executing workflow-based data-analysis jobs. It was originally developed for the Broad Institute's genomics platform, but has since become a tool of choice for many biomedical researchers and data scientists. Cromwell's main strengths are scalability and flexibility: it supports multiple compute environments, including local machines, the cloud, and high-performance computing (HPC) clusters. With Cromwell, users can easily define and run workflows, manage task dependencies, and track task status and outputs. Cromwell also supports multiple workflow languages, namely WDL (Workflow Description Language) and CWL (Common Workflow Language), although CWL is no longer supported in release 80 and later. These languages describe a workflow's tasks, inputs, outputs, and parameters, making it easier to build reproducible and scalable pipelines.

Introduction to Volcano

Volcano is a Kubernetes-based batch computing platform aimed at high-performance computing scenarios. It supplies a set of mechanisms that Kubernetes currently lacks but that high-performance workloads such as machine learning, big data, scientific computing, and visual-effects rendering commonly require. As a general-purpose batch platform, Volcano integrates with almost every mainstream computing framework, including Spark, TensorFlow, PyTorch, Flink, Argo, MindSpore, and PaddlePaddle. It also provides mixed scheduling for heterogeneous devices, including CPUs and GPUs across mainstream architectures. Volcano's design draws on fifteen years of experience running a wide range of high-performance workloads at scale on many systems and platforms, combined with the best ideas and practices from the open-source community.

Prerequisites

  • An AWS account
  • Permissions to create AWS resources (for example IAM roles, IAM policies, Amazon EC2 instances, and Amazon EKS clusters)
  • An EC2 instance for operations with an attached IAM role that has admin permissions, and with the same Amazon EFS file system as the EKS cluster mounted at /shared (see the mount sketch after this list)
  • An Amazon EKS cluster with the EFS file system mounted
  • Basic familiarity with Kubernetes and Linux shell commands
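For reference, mounting the shared Amazon EFS file system at /shared on the operations EC2 instance can look like the sketch below (assuming Amazon Linux 2; the file system ID fs-0123456789abcdef0 is a placeholder for your own):

sudo yum install -y amazon-efs-utils
sudo mkdir -p /shared
sudo mount -t efs fs-0123456789abcdef0:/ /shared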

Reference Architecture Diagram

Cromwell and Volcano Integration Issues

Installing Volcano

The Volcano documentation mentions Cromwell as part of its supported ecosystem, but the ecosystem section provides no detailed write-up for Cromwell. Cromwell is not alone here; several other frameworks are listed the same way.

This article focuses only on the Cromwell framework, so we first install Volcano and then install Cromwell for the integration. Installing Volcano on Amazon EKS is very simple; run the following command on the Amazon EKS cluster created earlier:

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
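The command above installs from the master branch. For a reproducible install, the same installer manifest is also published under Volcano's release tags; a hedged alternative (assuming the v1.7.0 tag; check the Volcano releases page for current versions):

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.7.0/installer/volcano-development.yaml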

Verify that Volcano installed successfully:

kubectl get all -n volcano-system

A successful installation produces output like the following:

NAME                                       READY   STATUS      RESTARTS   AGE
pod/volcano-admission-8cc65c757-dfv54      1/1     Running     0          49s
pod/volcano-admission-init-mqnsn           0/1     Completed   0          49s
pod/volcano-controllers-56588b7df6-gjzhz   1/1     Running     0          48s
pod/volcano-scheduler-58dc5bd8fb-27g2g     1/1     Running     0          48s

NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/volcano-admission-service   ClusterIP   10.100.119.77    <none>        443/TCP    49s
service/volcano-scheduler-service   ClusterIP   10.100.247.193   <none>        8080/TCP   48s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           49s
deployment.apps/volcano-controllers   1/1     1            1           48s
deployment.apps/volcano-scheduler     1/1     1            1           48s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-8cc65c757      1         1         1       49s
replicaset.apps/volcano-controllers-56588b7df6   1         1         1       48s
replicaset.apps/volcano-scheduler-58dc5bd8fb     1         1         1       48s

NAME                               COMPLETIONS   DURATION   AGE
job.batch/volcano-admission-init   1/1           6s         49s

Install the vcctl command-line tool. Building it from source requires the Go toolchain; run the following commands:

sudo yum install golang -y
cd ~
git clone https://github.com/volcano-sh/volcano.git
cd volcano
make vcctl
sudo cp _output/bin/vcctl /usr/local/bin/

Check that the installation succeeded:

vcctl version

If the installation succeeded, the output looks like this:

API Version: v1alpha1
Version: latest
Git SHA: 5302995a1d000df718fee5f3044df64bcb773d8d
Built At: 2023-07-14 07:51:44
Go Version: go1.18.9
Go OS/Arch: linux/amd64

Create a Volcano Queue:

cd ~
cat << EOF > volcano-queue.yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: cromwell
spec:
  weight: 1
  reclaimable: false
  capability:
    cpu: 1500
EOF
kubectl apply -f volcano-queue.yaml
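Since Queue is a Volcano CRD, kubectl can confirm that the queue was created and is in the Open state:

kubectl get queue cromwell -o yaml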

Installing Cromwell

Cromwell is written in Java, so we can download a prebuilt jar directly from GitHub: https://github.com/broadinstitute/cromwell/releases

As of July 2023, Cromwell has published 85 releases. Running Cromwell requires a Java 11 environment.

Install Java 11. The Amazon EC2 instance in this example uses the Amazon Linux 2 AMI; run the following commands:

sudo yum update -y
sudo amazon-linux-extras install java-openjdk11 -y
java -version

Download the cromwell-85.jar file:

cd /shared
wget https://github.com/broadinstitute/cromwell/releases/download/85/cromwell-85.jar
java -jar cromwell-85.jar

A successful installation prints the usage information below:

cromwell 85
Usage: java -jar /path/to/cromwell.jar [server|run|submit] [options] <args>...

  --help                   Cromwell - Workflow Execution Engine
  --version
Command: server
Starts a web server on port 8000.  See the web server documentation for more details about the API endpoints.
Command: run [options] workflow-source
Run the workflow and print out the outputs in JSON format.
  workflow-source          Workflow source file or workflow url.
  --workflow-root <value>  Workflow root.
  -i, --inputs <value>     Workflow inputs file.
  -o, --options <value>    Workflow options file.
  -t, --type <value>       Workflow type.
  -v, --type-version <value>
                           Workflow type version.
  -l, --labels <value>     Workflow labels file.
  -p, --imports <value>    A zip file to search for workflow imports.
  -m, --metadata-output <value>
                           An optional JSON file path to output metadata.
Command: submit [options] workflow-source
Submit the workflow to a Cromwell server.
  workflow-source          Workflow source file or workflow url.
  --workflow-root <value>  Workflow root.
  -i, --inputs <value>     Workflow inputs file.
  -o, --options <value>    Workflow options file.
  -t, --type <value>       Workflow type.
  -v, --type-version <value>
                           Workflow type version.
  -l, --labels <value>     Workflow labels file.
  -p, --imports <value>    A zip file to search for workflow imports.
  -h, --host <value>       Cromwell server URL.

According to the Cromwell documentation, integrating with Volcano appears relatively simple: you only need to configure a Cromwell backend.

On the Amazon EC2 instance where Cromwell is installed, run the following commands to generate the backend configuration file:

cd /shared
mkdir cromwell 
cd cromwell
cat << EOF > volcano.conf
include required(classpath("application"))
backend {
  default = Volcano

  providers {
    Volcano {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
            Int runtime_minutes = 600
            Int cpus = 2
            Int requested_memory_mb_per_core = 8000
            String queue = "cromwell"
        """

        # If an 'exit-code-timeout-seconds' value is specified:
        # - check-alive will be run at this interval for every job
        # - if a job is found to be not alive, and no RC file appears after this interval
        # - Then it will be marked as Failed.
        # Warning: If set, Cromwell will run 'check-alive' for every job at this interval

        # exit-code-timeout-seconds = 120

        submit = """
            vcctl job run -f ${script}
        """
        kill = "vcctl job delete -N ${job_id}"
        check-alive = "vcctl job view -N ${job_id}"
        job-id-regex = "(\\\d+)"
      }
    }
}
}
EOF

Create a simple test workflow file, simple-hello.wdl:

cat << EOF > simple-hello.wdl
task echoHello{
    command {
        echo "Hello AWS!"
    }
    runtime {
        dockerImage: "ubuntu:latest"
        memorys: "512Mi"
        cpus: 1
    }

}

workflow printHelloAndGoodbye {
    call echoHello
}
EOF
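Optionally, the WDL can be validated before submission with womtool, which the Broad Institute publishes alongside Cromwell on the same releases page (a quick check, assuming womtool-85.jar is downloaded next to cromwell-85.jar):

cd /shared
wget https://github.com/broadinstitute/cromwell/releases/download/85/womtool-85.jar
java -jar womtool-85.jar validate /shared/cromwell/simple-hello.wdl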

Test the Cromwell and Volcano integration:

cd /shared
java -Dconfig.file=/shared/cromwell/volcano.conf -jar /shared/cromwell-85.jar run /shared/cromwell/simple-hello.wdl

The run fails, with output like the following:

[2023-07-14 09:01:59,93] [info] Running with database db.url = jdbc:hsqldb:mem:af791aa3-a32f-4eb7-9485-cef928350377;shutdown=false;hsqldb.tx=mvcc
[2023-07-14 09:02:05,59] [info] Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
[2023-07-14 09:02:05,60] [info] [RenameWorkflowOptionsInMetadata] 100%
[2023-07-14 09:02:05,98] [info] Running with database db.url = jdbc:hsqldb:mem:dfdbe72d-a916-46a4-989d-a2b8f19138b0;shutdown=false;hsqldb.tx=mvcc
[2023-07-14 09:02:06,81] [info] Slf4jLogger started
[2023-07-14 09:02:07,10] [info] Workflow heartbeat configuration:
{
  "cromwellId" : "cromid-4a23a0e",
  "heartbeatInterval" : "2 minutes",
  "ttl" : "10 minutes",
  "failureShutdownDuration" : "5 minutes",
  "writeBatchSize" : 10000,
  "writeThreshold" : 10000
}
[2023-07-14 09:02:07,29] [info] KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
[2023-07-14 09:02:07,29] [info] Metadata summary refreshing every 1 second.
[2023-07-14 09:02:07,29] [info] No metadata archiver defined in config
[2023-07-14 09:02:07,29] [info] No metadata deleter defined in config
[2023-07-14 09:02:07,30] [info] WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
[2023-07-14 09:02:07,44] [info] CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
[2023-07-14 09:02:07,55] [info] JobRestartCheckTokenDispenser - Distribution rate: 50 per 1 seconds.
[2023-07-14 09:02:07,72] [info] JobExecutionTokenDispenser - Distribution rate: 20 per 10 seconds.
[2023-07-14 09:02:07,93] [info] SingleWorkflowRunnerActor: Version 85
[2023-07-14 09:02:07,97] [info] SingleWorkflowRunnerActor: Submitting workflow
[2023-07-14 09:02:08,16] [info] Unspecified type (Unspecified version) workflow c0db1d21-1327-4471-b08b-33d0bb43a0eb submitted
[2023-07-14 09:02:08,21] [info] SingleWorkflowRunnerActor: Workflow submitted c0db1d21-1327-4471-b08b-33d0bb43a0eb
[2023-07-14 09:02:08,23] [info] 1 new workflows fetched by cromid-4a23a0e: c0db1d21-1327-4471-b08b-33d0bb43a0eb
[2023-07-14 09:02:08,25] [info] WorkflowManagerActor: Starting workflow c0db1d21-1327-4471-b08b-33d0bb43a0eb
[2023-07-14 09:02:08,26] [info] WorkflowManagerActor: Successfully started WorkflowActor-c0db1d21-1327-4471-b08b-33d0bb43a0eb
[2023-07-14 09:02:08,26] [info] Retrieved 1 workflows from the WorkflowStoreActor
[2023-07-14 09:02:08,32] [info] WorkflowStoreHeartbeatWriteActor configured to flush with batch size 10000 and process rate 2 minutes.
[2023-07-14 09:02:08,52] [info] MaterializeWorkflowDescriptorActor [c0db1d21]: Parsing workflow as WDL draft-2
[2023-07-14 09:02:09,47] [info] MaterializeWorkflowDescriptorActor [c0db1d21]: Call-to-Backend assignments: printHelloAndGoodbye.echoHello -> Volcano
[2023-07-14 09:02:09,83] [warn] Volcano [c0db1d21]: Key/s [docker] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2023-07-14 09:02:11,04] [info] WorkflowExecutionActor-c0db1d21-1327-4471-b08b-33d0bb43a0eb [c0db1d21]: Starting printHelloAndGoodbye.echoHello
[2023-07-14 09:02:12,56] [info] Not triggering log of restart checking token queue status. Effective log interval = None
[2023-07-14 09:02:12,74] [info] Not triggering log of execution token queue status. Effective log interval = None
[2023-07-14 09:02:17,75] [info] Assigned new job execution tokens to the following groups: c0db1d21: 1
[2023-07-14 09:02:17,93] [warn] DispatchedConfigAsyncJobExecutionActor [c0db1d21printHelloAndGoodbye.echoHello:NA:1]: Unrecognized runtime attribute keys: docker
[2023-07-14 09:02:18,02] [info] DispatchedConfigAsyncJobExecutionActor [c0db1d21printHelloAndGoodbye.echoHello:NA:1]: echo "Hello AWS!"
[2023-07-14 09:02:18,05] [info] DispatchedConfigAsyncJobExecutionActor [c0db1d21printHelloAndGoodbye.echoHello:NA:1]: executing: vcctl job run -f
[2023-07-14 09:02:18,28] [info] WorkflowManagerActor: Workflow c0db1d21-1327-4471-b08b-33d0bb43a0eb failed (during ExecutingWorkflowState): java.lang.RuntimeException: Unable to start job. Check the stderr file for possible errors: /home/ec2-user/cromwell-executions/printHelloAndGoodbye/c0db1d21-1327-4471-b08b-33d0bb43a0eb/call-echoHello/execution/stderr.submit
	at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.$anonfun$execute$2(SharedFileSystemAsyncJobExecutionActor.scala:165)
	at scala.util.Either.fold(Either.scala:189)
	at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute(SharedFileSystemAsyncJobExecutionActor.scala:160)
	at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute$(SharedFileSystemAsyncJobExecutionActor.scala:155)
	at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.execute(ConfigAsyncJobExecutionActor.scala:215)
	at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$executeAsync$1(StandardAsyncExecutionActor.scala:751)
	at scala.util.Try$.apply(Try.scala:210)
	at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync(StandardAsyncExecutionActor.scala:751)
	at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync$(StandardAsyncExecutionActor.scala:751)
	at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeAsync(ConfigAsyncJobExecutionActor.scala:215)
	at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:1141)
	at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:1133)
	at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeOrRecover(ConfigAsyncJobExecutionActor.scala:215)
	at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:65)
	at cromwell.core.retry.Retry$.withRetry(Retry.scala:46)
	at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:61)
	at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:65)
	at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:88)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
	at akka.actor.Actor.aroundReceive(Actor.scala:539)
	at akka.actor.Actor.aroundReceive$(Actor.scala:537)
	at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.aroundReceive(ConfigAsyncJobExecutionActor.scala:215)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614)
	at akka.actor.ActorCell.invoke(ActorCell.scala:583)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
	at akka.dispatch.Mailbox.run(Mailbox.scala:229)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

This experiment shows that Cromwell and Volcano are not actually integrated out of the box. The Volcano backend example that Cromwell provides on GitHub is even missing a closing } ; after adding the brace to the backend block, the configuration parses, but that is not the key problem, because another error is reported next. The specific error log reads:

Error: flag needs an argument: 'f' in -f
Usage:
  vcctl job run [flags]

Flags:
  -f, --filename string     the yaml file of job
  -h, --help                help for run
  -i, --image string        the container image of job (default "busybox")
  -k, --kubeconfig string   (optional) absolute path to the kubeconfig file (default "/home/ec2-user/.kube/config")
  -L, --limits string       the resource limit of the task (default "cpu=1000m,memory=100Mi")
  -s, --master string       the address of apiserver
  -m, --min int             the minimal available tasks of job (default 1)
  -N, --name string         the name of job
  -n, --namespace string    the namespace of job (default "default")
  -r, --replicas int        the total tasks of job (default 1)
  -R, --requests string     the resource request of the task (default "cpu=1000m,memory=100Mi")
  -S, --scheduler string    the scheduler for this job (default "volcano")

Global Flags:
      --log-flush-frequency duration   Maximum number of seconds between log flushes (default 5s)

Digging further, vcctl job run -f expects a YAML file as its argument. A quick test confirms this:

[ec2-user@ip-172-31-30-173 execution]$ vcctl job run -f script
Failed to run job: only support yaml file

Searching the related issues in the Volcano GitHub repository also confirms that Cromwell is not yet supported.

After this round of trial and error, the workaround is clear: since vcctl job run -f needs a YAML file, we can create a YAML template by hand, point the backend at that template, and expose the parameters that change per job as variables. Following this approach, create the template below. Note that you must change efs-storage-claim to the PersistentVolumeClaim for your own Amazon EFS file system:

cd /shared/cromwell
cat << EOF > eks-template.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: JOB_NAME
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: VOLCANO_QUEUE
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: task
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          containers:
            - name: cromwell
              image: C_IMAGE
              command:
              - sh
              - EXECUTION_DIR/script
              resources:
                requests:
                  cpu: CPUS
                  memory: MEMORYS
                limits:
                  cpu: CPUS
                  memory: MEMORYS
              volumeMounts:
              - name: efs-pvc
                mountPath: /shared
          volumes:
          - name: efs-pvc
            persistentVolumeClaim:
              claimName: efs-storage-claim
          restartPolicy: Never
EOF
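Before wiring the template into the backend, confirm that the PersistentVolumeClaim referenced by claimName actually exists in the cluster:

kubectl get pvc efs-storage-claim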

You can add more parameters to the template to suit your needs. Next, modify the backend configuration:

cd /shared/cromwell
cat << EOF > volcano.conf
include required(classpath("application"))
backend {
  default = Volcano

  providers {
    Volcano {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
            Int runtime_minutes = 600
            Int cpus = 1
            String memorys = "512MB"
            String dockerImage = "ubuntu:latest"
            String queue = "cromwell"
            String job_template = "/shared/cromwell/eks-template.yaml"
        """
        submit = """
            name=\${"\`echo "+job_name+"|awk -F '_' '{print tolower(\$3)}'\`"}
            uid=\${"\`echo \$(cat /dev/urandom | od -x | head -1 | awk '{print \$2\$3\"\"\$4\$5\"\"\$6\$7\"\"\$8\$9}')\`"}
            jobName=\${"\`echo \$name-\$uid\`"}
            execDir=\$(cd \`dirname \$0\`; pwd)
            workflowDir=\${"\`dirname "+cwd+"\`"}
            cat \${job_template} > \${script}.yaml
                sed -i "s@JOB_NAME@\$jobName@g" \${script}.yaml && \\
                sed -i "s@WORKFLOW_DIR@\$workflowDir@g" \${script}.yaml && \\
                sed -i "s@EXECUTION_DIR@\$execDir@g" \${script}.yaml && \\
                sed -i "s@VOLCANO_QUEUE@\${queue}@g" \${script}.yaml && \\
                sed -i "s@CPUS@\${cpus}@g" \${script}.yaml && \\
                sed -i "s@C_IMAGE@\${dockerImage}@g" \${script}.yaml && \\
                sed -i "s@MEMORYS@\${memorys}@g" \${script}.yaml && \\
            vcctl job run -f \${script}.yaml
        """
        kill = "vcctl job delete -N \${job_id}"
        check-alive = "vcctl job view -N \${job_id}"
        job-id-regex = ".*run job (\\\S+) successfully.*"
      }
    }
}
}
EOF
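To sanity-check the submit logic outside of Cromwell, the same substitutions can be dry-run by hand (a sketch with hypothetical values; it renders the template to /tmp/test-job.yaml and validates it against the cluster without creating anything):

cp /shared/cromwell/eks-template.yaml /tmp/test-job.yaml
sed -i "s@JOB_NAME@hello-test@g" /tmp/test-job.yaml
sed -i "s@EXECUTION_DIR@/shared/tmp@g" /tmp/test-job.yaml
sed -i "s@VOLCANO_QUEUE@cromwell@g" /tmp/test-job.yaml
sed -i "s@CPUS@1@g" /tmp/test-job.yaml
sed -i "s@C_IMAGE@ubuntu:latest@g" /tmp/test-job.yaml
sed -i "s@MEMORYS@512Mi@g" /tmp/test-job.yaml
kubectl apply --dry-run=server -f /tmp/test-job.yaml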

Submit the workflow again:

cd /shared
java -Dconfig.file=/shared/cromwell/volcano.conf -jar /shared/cromwell-85.jar run /shared/cromwell/simple-hello.wdl

This time the run succeeds, with output like the following:

[2023-07-14 12:16:52,70] [info] Running with database db.url = jdbc:hsqldb:mem:6991c79a-dc66-46fb-8724-c54912be7130;shutdown=false;hsqldb.tx=mvcc
[2023-07-14 12:16:58,50] [info] Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
[2023-07-14 12:16:58,52] [info] [RenameWorkflowOptionsInMetadata] 100%
[2023-07-14 12:16:59,04] [info] Running with database db.url = jdbc:hsqldb:mem:5e0dacf6-bb11-4537-b2b0-8fa06e868fc1;shutdown=false;hsqldb.tx=mvcc
[2023-07-14 12:16:59,85] [info] Slf4jLogger started
[2023-07-14 12:17:00,13] [info] Workflow heartbeat configuration:
{
  "cromwellId" : "cromid-d51c166",
  "heartbeatInterval" : "2 minutes",
  "ttl" : "10 minutes",
  "failureShutdownDuration" : "5 minutes",
  "writeBatchSize" : 10000,
  "writeThreshold" : 10000
}
[2023-07-14 12:17:00,24] [info] Metadata summary refreshing every 1 second.
[2023-07-14 12:17:00,27] [info] No metadata archiver defined in config
[2023-07-14 12:17:00,27] [info] No metadata deleter defined in config
[2023-07-14 12:17:00,30] [info] KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
[2023-07-14 12:17:00,36] [info] WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
[2023-07-14 12:17:00,40] [info] CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
[2023-07-14 12:17:00,66] [info] JobRestartCheckTokenDispenser - Distribution rate: 50 per 1 seconds.
[2023-07-14 12:17:00,76] [info] JobExecutionTokenDispenser - Distribution rate: 20 per 10 seconds.
[2023-07-14 12:17:00,91] [info] SingleWorkflowRunnerActor: Version 85
[2023-07-14 12:17:00,93] [info] SingleWorkflowRunnerActor: Submitting workflow
[2023-07-14 12:17:01,06] [info] Unspecified type (Unspecified version) workflow 5beb4ec2-f019-479c-97c1-f20559a619a1 submitted
[2023-07-14 12:17:01,13] [info] SingleWorkflowRunnerActor: Workflow submitted 5beb4ec2-f019-479c-97c1-f20559a619a1
[2023-07-14 12:17:01,14] [info] 1 new workflows fetched by cromid-d51c166: 5beb4ec2-f019-479c-97c1-f20559a619a1
[2023-07-14 12:17:01,16] [info] WorkflowManagerActor: Starting workflow 5beb4ec2-f019-479c-97c1-f20559a619a1
[2023-07-14 12:17:01,19] [info] WorkflowManagerActor: Successfully started WorkflowActor-5beb4ec2-f019-479c-97c1-f20559a619a1
[2023-07-14 12:17:01,19] [info] Retrieved 1 workflows from the WorkflowStoreActor
[2023-07-14 12:17:01,23] [info] WorkflowStoreHeartbeatWriteActor configured to flush with batch size 10000 and process rate 2 minutes.
[2023-07-14 12:17:01,45] [info] MaterializeWorkflowDescriptorActor [5beb4ec2]: Parsing workflow as WDL draft-2
[2023-07-14 12:17:02,38] [info] MaterializeWorkflowDescriptorActor [5beb4ec2]: Call-to-Backend assignments: printHelloAndGoodbye.echoHello -> Volcano
[2023-07-14 12:17:02,86] [warn] Volcano [5beb4ec2]: Key/s [docker] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2023-07-14 12:17:04,05] [info] WorkflowExecutionActor-5beb4ec2-f019-479c-97c1-f20559a619a1 [5beb4ec2]: Starting printHelloAndGoodbye.echoHello
[2023-07-14 12:17:05,67] [info] Not triggering log of restart checking token queue status. Effective log interval = None
[2023-07-14 12:17:05,78] [info] Not triggering log of execution token queue status. Effective log interval = None
[2023-07-14 12:17:10,79] [info] Assigned new job execution tokens to the following groups: 5beb4ec2: 1
[2023-07-14 12:17:10,94] [warn] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: Unrecognized runtime attribute keys: docker
[2023-07-14 12:17:11,09] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: echo "Hello AWS!"
[2023-07-14 12:17:11,16] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: executing: Name=echo cromwell_5beb4ec2_echoHello|awk -F '_' '{print tolower($3)}'
Uid=echo $(cat /dev/urandom | od -x | head -1 | awk '{print $2$3""$4$5""$6$7""$8$9}')
JobName=echo $Name-$Uid
ExecDir=$(cd dirname $0; pwd)
workflowDir=dirname /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello
touch $ExecDir/stderr.qsub
cat ~/cromwell/eks-template.yaml > /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml
    sed -i "s@JOB_NAME@$JobName@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
    sed -i "s@WORKFLOW_DIR@$workflowDir@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
    sed -i "s@EXECUTION_DIR@$ExecDir@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
    sed -i "s@VOLCANO_QUEUE@cromwell@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
    sed -i "s@CPUS@1@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
    sed -i "s@MEMORYS@500Mi@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
    vcctl job run -f /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml
[2023-07-14 12:17:15,43] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: job id: echohello-bfd04a2494a176c8972298007d26c734
[2023-07-14 12:17:15,43] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: Cromwell will watch for an rc file but will *not* double-check whether this job is actually alive (unless Cromwell restarts)
[2023-07-14 12:17:15,43] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: Status change from - to Running
[2023-07-14 12:17:17,04] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: Status change from Running to Done
[2023-07-14 12:17:17,36] [info] WorkflowExecutionActor-5beb4ec2-f019-479c-97c1-f20559a619a1 [5beb4ec2]: Workflow printHelloAndGoodbye complete. Final Outputs:
{

}
[2023-07-14 12:17:20,42] [info] WorkflowManagerActor: Workflow actor for 5beb4ec2-f019-479c-97c1-f20559a619a1 completed with status 'Succeeded'. The workflow will be removed from the workflow store.
[2023-07-14 12:17:25,16] [info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
  "outputs": {

  },
  "id": "5beb4ec2-f019-479c-97c1-f20559a619a1"
}
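Because the execution directory lives on the shared EFS file system, the task's output files written by Cromwell can be inspected directly; using the workflow id from the run above:

cat /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/stdout

This should print Hello AWS!.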

Tuning the Volcano and Amazon EKS Integration

When a Pod requests more resources than the machines in the node group provide, the Job stays stuck in the Pending state, and Karpenter never launches a new Amazon EC2 instance to run it:

$ vcctl job list
Name                                         Creation       Phase       JobType     Replicas    Min   Pending   Running   Succeeded   Failed    Unknown     RetryCount
echohello-300cc1b83423792e68fe43eff0433995   2023-07-19     Pending     Batch       1           1     0         0         0           0         0           0

Inspecting the Job in more detail:

vcctl job view -N echohello-300cc1b83423792e68fe43eff0433995
Name:       	echohello-300cc1b83423792e68fe43eff0433995
Namespace:  	default
Labels:     	<none>
Annotations:	<none>
....
Events:
Type           	Reason                                  	Age                           	Form                                    	Message
-------        	-------                                 	-------                       	-------                                 	-------
Warning        	PodGroupPending                         	3m52s                         	vc-controller-manager                   	PodGroup default:echohello-300cc1b83423792e68fe43eff0433995 unschedule,reason: 1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable

The root cause is that Volcano defaults to gang scheduling, which schedules a group of tasks (Pods) onto a set of worker nodes at the same time so that they can cooperate and use resources efficiently. An error like 1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable usually means the Volcano scheduler cannot find enough available nodes to run the tasks. Because gang scheduling only places a PodGroup on nodes that already exist in the Amazon EKS cluster, no unschedulable Pod is ever created that would trigger Karpenter to scale out. To solve this, we modify Volcano's scheduling configuration.

Modify the Volcano scheduling configuration by running:

cat << EOF > volcano-scheduler-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: sla
        arguments:
          sla-waiting-time: 5s
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
EOF
kubectl apply -f volcano-scheduler-configmap.yaml
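The scheduler may not pick up the ConfigMap change immediately; restarting the scheduler Deployment forces a reload (a hedged step, as recent Volcano versions can also hot-reload the file), after which the active configuration can be confirmed:

kubectl -n volcano-system rollout restart deployment volcano-scheduler
kubectl -n volcano-system get configmap volcano-scheduler-configmap -o yaml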

Submitting a Bioinformatics Pipeline via Cromwell

This section demonstrates how to combine Cromwell with Amazon EKS to run GATK4 HaplotypeCaller. The HaplotypeCaller tool is one of the main steps in the GATK Best Practices pipeline.

Running HaplotypeCaller

  1. Download input data
cd /shared
mkdir -p workflow/haplotype-caller-workflow
cd workflow/haplotype-caller-workflow
aws s3 sync s3://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals hg38_wgs_scattered_calling_intervals
aws s3 cp s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam .
aws s3 cp s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bai .
aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict .
aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta .
aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai .
cat << EOF > hg38_wgs_scattered_calling_intervals.txt
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0003_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0004_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0005_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0006_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0007_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0008_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0009_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0010_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0011_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0012_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0013_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0014_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0015_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0016_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0017_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0018_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0019_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0020_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0021_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0022_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0023_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0024_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0025_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0026_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0027_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0028_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0029_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0030_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0031_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0032_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0033_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0034_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0035_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0036_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0037_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0038_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0039_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0040_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0041_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0042_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0043_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0044_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0045_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0046_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0047_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0048_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0049_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0050_of_50/scattered.interval_list
EOF
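Equivalently, the interval list above can be generated with a short loop instead of a 48-line heredoc:

for i in $(seq 3 50); do
  printf '/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_%04d_of_50/scattered.interval_list\n' "$i"
done > hg38_wgs_scattered_calling_intervals.txt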
  2. Workflow definition
cat << EOF > HaplotypeCallerWF.wdl
# WORKFLOW DEFINITION
workflow HaplotypeCallerGvcf_GATK4 {
  File input_bam
  File input_bam_index
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  File scattered_calling_intervals_list

  String gatk_docker

  String gatk_path

  Array[File] scattered_calling_intervals = read_lines(scattered_calling_intervals_list)

  String sample_basename = basename(input_bam, ".bam")

  String gvcf_name = sample_basename + ".g.vcf.gz"
  String gvcf_index = sample_basename + ".g.vcf.gz.tbi"

  # Call variants in parallel over grouped calling intervals
  scatter (interval_file in scattered_calling_intervals) {

    # Generate GVCF by interval
    call HaplotypeCaller {
      input:
        input_bam = input_bam,
        input_bam_index = input_bam_index,
        interval_list = interval_file,
        gvcf_name = gvcf_name,
        ref_dict = ref_dict,
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        docker_image = gatk_docker,
        gatk_path = gatk_path
    }
  }

  # Merge per-interval GVCFs
  call MergeGVCFs {
    input:
      input_vcfs = HaplotypeCaller.output_gvcf,
      vcf_name = gvcf_name,
      vcf_index = gvcf_index,
      docker_image = gatk_docker,
      gatk_path = gatk_path
  }

  # Outputs that will be retained when execution is complete
  output {
    File output_merged_gvcf = MergeGVCFs.output_vcf
    File output_merged_gvcf_index = MergeGVCFs.output_vcf_index
  }
}

# TASK DEFINITIONS

# HaplotypeCaller per-sample in GVCF mode
task HaplotypeCaller {
  File input_bam
  File input_bam_index
  String gvcf_name
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  File interval_list
  Int? interval_padding
  Float? contamination
  Int? max_alt_alleles

  String mem_size

  String docker_image
  String gatk_path
  String java_opt

  command {
      \${gatk_path} --java-options \${java_opt} \\
      HaplotypeCaller \\
      -R \${ref_fasta} \\
      -I \${input_bam} \\
      -O \${gvcf_name} \\
      -L \${interval_list} \\
      -ip \${default=100 interval_padding} \\
      -contamination \${default=0 contamination} \\
      --max-alternate-alleles \${default=3 max_alt_alleles} \\
      -ERC GVCF
  }

  runtime {
    dockerImage: docker_image
    memorys: mem_size
    cpus: 1
  }

  output {
    File output_gvcf = "\${gvcf_name}"
  }
}

# Merge GVCFs generated per-interval for the same sample
task MergeGVCFs {
  Array [File] input_vcfs
  String vcf_name
  String vcf_index

  String mem_size

  String docker_image
  String gatk_path
  String java_opt

  command {
    \${gatk_path} --java-options \${java_opt} \\
      MergeVcfs \\
      --INPUT=\${sep=' --INPUT=' input_vcfs} \\
      --OUTPUT=\${vcf_name}
  }

  runtime {
    dockerImage: docker_image
    memorys: mem_size
    cpus: 1
  }

  output {
    File output_vcf = "\${vcf_name}"
    File output_vcf_index = "\${vcf_index}"
  }
}
EOF
  3. Inputs definition
cat << EOF > HaplotypeCallerWF.inputs.json
{
  "##_COMMENT1": "INPUT BAM",
  "HaplotypeCallerGvcf_GATK4.input_bam": "/shared/workflow/haplotype-caller-workflow/NA12878_24RG_small.hg38.bam",
  "HaplotypeCallerGvcf_GATK4.input_bam_index": "/shared/workflow/haplotype-caller-workflow/NA12878_24RG_small.hg38.bai",

  "##_COMMENT2": "REFERENCE FILES",
  "HaplotypeCallerGvcf_GATK4.ref_dict": "/shared/workflow/haplotype-caller-workflow/Homo_sapiens_assembly38.dict",
  "HaplotypeCallerGvcf_GATK4.ref_fasta": "/shared/workflow/haplotype-caller-workflow/Homo_sapiens_assembly38.fasta",
  "HaplotypeCallerGvcf_GATK4.ref_fasta_index": "/shared/workflow/haplotype-caller-workflow/Homo_sapiens_assembly38.fasta.fai",

  "##_COMMENT3": "INTERVALS",
  "HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list": "/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals.txt",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.interval_padding": 100,

  "##_COMMENT4": "DOCKERS",
  "HaplotypeCallerGvcf_GATK4.gatk_docker": "broadinstitute/gatk:4.0.0.0",

  "##_COMMENT5": "PATHS",
  "HaplotypeCallerGvcf_GATK4.gatk_path": "/gatk/gatk",

  "##_COMMENT6": "JAVA OPTIONS",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.java_opt": "-Xms10000m",
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.java_opt": "-Xms10000m",

  "##_COMMENT7": "MEMORY ALLOCATION",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.mem_size": "12Gi",
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.mem_size": "30Gi",

  "##_COMMENT8": "DISK SIZE ALLOCATION",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.disk_size": 100,
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.disk_size": 100,

  "##_COMMENT9": "PREEMPTION",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.preemptible_tries": 3,
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.preemptible_tries": 3
}
EOF
  4. Run the workflow
cd /shared
nohup java -Dconfig.file=/shared/cromwell/volcano.conf -jar /shared/cromwell-85.jar run /shared/workflow/haplotype-caller-workflow/HaplotypeCallerWF.wdl --inputs /shared/workflow/haplotype-caller-workflow/HaplotypeCallerWF.inputs.json > haplotype_caller.log 2>&1 &
  5. Job log
tail -f haplotype_caller.log

The run in progress looks like this:

$ kubectl get pods
NAME                                                      READY   STATUS      RESTARTS   AGE
haplotypecaller-11af636d71eb3733da71ed5f4e38d7b5-task-0   1/1     Running     0          33m
haplotypecaller-2facecea05e77e637c89603ab532a979-task-0   0/1     Completed   0          33m
haplotypecaller-5440f17212b266d6ac852be054085f40-task-0   0/1     Completed   0          33m
haplotypecaller-6007c2c638f95646d151d46c2bec632d-task-0   0/1     Completed   0          33m
haplotypecaller-7c7e1421a759a1469a241d2bb92c36f4-task-0   0/1     Completed   0          33m
haplotypecaller-8ce6fbb2900581f12bd21f25b07af605-task-0   0/1     Completed   0          33m
haplotypecaller-a07962b5179f200dc1f4de9c9458e0a2-task-0   0/1     Completed   0          33m
haplotypecaller-b2fc938a406587a22361fd2adc8d0a74-task-0   0/1     Completed   0          33m
haplotypecaller-ebf38edbec0901f512c8f3e9724fc8f9-task-0   0/1     Completed   0          33m
haplotypecaller-f3377af2d05404d86be2a69028d91b2c-task-0   0/1     Completed   0          33m
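Once every scatter task and the merge task show Completed, the merged gVCF declared in the workflow outputs can be located under the call-MergeGVCFs execution directory on the shared file system (the workflow id differs per run):

find /shared/cromwell-executions/HaplotypeCallerGvcf_GATK4 -name 'NA12878_24RG_small.hg38.g.vcf.gz*'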

Summary

This article took a detailed look at Cromwell, a workflow framework widely used in bioinformatics analysis, solved the problem that Cromwell and Volcano cannot be integrated out of the box, resolved the scheduling conflict between Volcano and Karpenter, and finally demonstrated an end-to-end task run with a pipeline close to a real bioinformatics workload. Each step comes with the commands to execute and the corresponding command-line examples. By following these steps on top of the Amazon EKS cluster created in the first article, you can build your own containerized HPC cluster, then deploy your existing Cromwell pipelines to a Cromwell server and run cluster jobs efficiently.

References

Amazon EKS: https://aws.amazon.com/cn/eks/

Karpenter: https://karpenter.sh/

Volcano: https://volcano.sh/zh/docs/

Cromwell: https://cromwell.readthedocs.io/en/stable/

HaplotypeCallerWF: https://github.com/broadinstitute/cromwell/tree/develop/centaur/src/main/resources/integrationTestCases/germline/haplotype-caller-workflow

About the Author

Chen Enze

Chen Enze is a Solutions Architect in the public sector team at Amazon Web Services, responsible for HPC and bioinformatics software and hardware architecture design and optimization. He joined BGI in 2010 to work on the company's digitalization efforts, covering the entire process from sample intake to on-machine sequencing, which improved production efficiency and reduced production incidents. He began working with cloud computing in 2015 and led the development of the BGIOnline analysis platform, which completed more than 20,000 high-depth WGS analyses, reducing WGS cost while cutting analysis time to under 24 hours; the platform won the Intel genomics cloud solution best practice Bio-IT award.