AWS Official Blog
STOmics Series 2 – A Deep Dive into Cromwell and Volcano Integration
The first post in this series introduced the core concepts of STOmics, Amazon EKS, Karpenter, and Volcano, and showed best practices for building a containerized HPC platform from these components and services to support STOmics personalized analysis workloads. Along the way we ran into several difficulties. This post discusses the integration issues between Cromwell and Volcano in detail, and explores how to use Cromwell to submit a WDL-based bioinformatics pipeline to an Amazon EKS cluster for efficient job scheduling and computational analysis.
Introduction to Cromwell
Cromwell is a popular open-source tool for managing and executing workflow-based data analysis tasks. Originally developed for the Broad Institute's Genomics Platform, it has become a tool of choice for many biomedical researchers and data scientists. Cromwell's key strengths are scalability and flexibility: it supports a variety of compute environments, including local machines, the cloud, and high-performance computing (HPC) clusters. With Cromwell, users can easily define and run workflows, manage task dependencies, and track task status and outputs. Cromwell also supports multiple workflow languages, such as WDL (Workflow Description Language) and CWL (Common Workflow Language), although CWL is no longer supported as of version 80 and later. These languages describe the inputs, outputs, and parameters of workflows and tasks, making it easier to create reproducible and scalable pipelines.
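To give a sense of how Cromwell is driven, the same jar supports a one-shot run mode, a long-running server mode, and submitting workflows to a running server; the full usage text appears later in this post (the file names below are illustrative):
java -jar cromwell.jar run hello.wdl --inputs hello.inputs.json
java -jar cromwell.jar server
java -jar cromwell.jar submit hello.wdl --host http://localhost:8000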
Introduction to Volcano
Volcano is a Kubernetes-based batch computing platform aimed primarily at high-performance computing scenarios. It provides a set of mechanisms that Kubernetes currently lacks but that high-performance workloads such as machine learning, big data, scientific computing, and visual effects rendering typically need. As a general-purpose batch platform, Volcano integrates with almost all mainstream compute frameworks, including Spark, TensorFlow, PyTorch, Flink, Argo, MindSpore, and PaddlePaddle. It also supports mixed scheduling of heterogeneous devices, including CPUs and GPUs across mainstream architectures. Volcano's design draws on 15 years of experience running high-performance workloads at scale on a variety of systems and platforms, combined with the best ideas and practices from the open-source community.
Prerequisites
- An AWS account
- Permissions to create AWS resources (for example, IAM roles, IAM policies, Amazon EC2 instances, and Amazon EKS clusters)
- An EC2 instance to operate from, with an attached IAM role that has admin permissions, and with the same Amazon EFS file system as the EKS cluster mounted at /shared
- An Amazon EKS cluster with the EFS file system mounted
- Familiarity with basic Kubernetes and Linux shell commands (a few quick sanity checks are shown after this list)
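Before starting, it is worth sanity-checking the working environment. A minimal sketch, assuming the /shared mount path and the AWS CLI and kubeconfig set up in the first post:
df -h /shared
kubectl get nodes
aws sts get-caller-identity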
Reference Architecture Diagram
Cromwell and Volcano Integration Issues
Installing Volcano
The Volcano documentation mentions Cromwell among its supported ecosystem, yet the ecosystem section contains no detailed entry for Cromwell; nor is Cromwell alone, as some other frameworks are in the same situation, as shown below:
This post focuses only on the Cromwell framework, so we install Volcano first and then install Cromwell for the integration. Installing Volcano on Amazon EKS is straightforward; run the following command against the Amazon EKS cluster created earlier:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
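The manifest above tracks the master branch, so its contents can drift over time; for a reproducible install you may prefer pinning a release tag (v1.7.0 here is just an example):
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.7.0/installer/volcano-development.yaml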
Verify that the Volcano installation succeeded:
kubectl get all -n volcano-system
A successful installation produces output like the following:
NAME READY STATUS RESTARTS AGE
pod/volcano-admission-8cc65c757-dfv54 1/1 Running 0 49s
pod/volcano-admission-init-mqnsn 0/1 Completed 0 49s
pod/volcano-controllers-56588b7df6-gjzhz 1/1 Running 0 48s
pod/volcano-scheduler-58dc5bd8fb-27g2g 1/1 Running 0 48s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/volcano-admission-service ClusterIP 10.100.119.77 <none> 443/TCP 49s
service/volcano-scheduler-service ClusterIP 10.100.247.193 <none> 8080/TCP 48s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/volcano-admission 1/1 1 1 49s
deployment.apps/volcano-controllers 1/1 1 1 48s
deployment.apps/volcano-scheduler 1/1 1 1 48s
NAME DESIRED CURRENT READY AGE
replicaset.apps/volcano-admission-8cc65c757 1 1 1 49s
replicaset.apps/volcano-controllers-56588b7df6 1 1 1 48s
replicaset.apps/volcano-scheduler-58dc5bd8fb 1 1 1 48s
NAME COMPLETIONS DURATION AGE
job.batch/volcano-admission-init 1/1 6s 49s
Install the vcctl command-line tool. Building it requires the Go toolchain; run the following commands:
sudo yum install golang -y
cd ~
git clone https://github.com/volcano-sh/volcano.git
cd volcano
make vcctl
sudo cp _output/bin/vcctl /usr/local/bin/
Check that the installation succeeded:
vcctl version
If the installation succeeded, the output looks like this:
API Version: v1alpha1
Version: latest
Git SHA: 5302995a1d000df718fee5f3044df64bcb773d8d
Built At: 2023-07-14 07:51:44
Go Version: go1.18.9
Go OS/Arch: linux/amd64
Create a Volcano Queue:
cd ~
cat << EOF > volcano-queue.yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: cromwell
spec:
  weight: 1
  reclaimable: false
  capability:
    cpu: 1500
EOF
kubectl apply -f volcano-queue.yaml
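You can confirm that the queue was created; kubectl can read the Queue custom resource directly:
kubectl get queue cromwell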
Installing Cromwell
Cromwell is written in Java, and pre-built jar files can be downloaded directly from GitHub: https://github.com/broadinstitute/cromwell/releases
As of July 2023, Cromwell had published 85 releases. Running Cromwell requires a Java 11 environment.
Install Java 11. The Amazon EC2 instance in this example uses an Amazon Linux 2 image; run the following commands:
sudo yum update -y
sudo amazon-linux-extras install java-openjdk11 -y
java -version
Download the cromwell-85.jar file:
cd /shared
wget https://github.com/broadinstitute/cromwell/releases/download/85/cromwell-85.jar
java -jar cromwell-85.jar
If the installation succeeded, the output is as follows:
cromwell 85
Usage: java -jar /path/to/cromwell.jar [server|run|submit] [options] <args>...
--help Cromwell - Workflow Execution Engine
--version
Command: server
Starts a web server on port 8000. See the web server documentation for more details about the API endpoints.
Command: run [options] workflow-source
Run the workflow and print out the outputs in JSON format.
workflow-source Workflow source file or workflow url.
--workflow-root <value> Workflow root.
-i, --inputs <value> Workflow inputs file.
-o, --options <value> Workflow options file.
-t, --type <value> Workflow type.
-v, --type-version <value>
Workflow type version.
-l, --labels <value> Workflow labels file.
-p, --imports <value> A zip file to search for workflow imports.
-m, --metadata-output <value>
An optional JSON file path to output metadata.
Command: submit [options] workflow-source
Submit the workflow to a Cromwell server.
workflow-source Workflow source file or workflow url.
--workflow-root <value> Workflow root.
-i, --inputs <value> Workflow inputs file.
-o, --options <value> Workflow options file.
-t, --type <value> Workflow type.
-v, --type-version <value>
Workflow type version.
-l, --labels <value> Workflow labels file.
-p, --imports <value> A zip file to search for workflow imports.
-h, --host <value> Cromwell server URL.
According to the Cromwell documentation, integrating with Volcano looks relatively simple: you only need to configure a Cromwell backend, as follows:
On the Amazon EC2 instance where Cromwell is installed, run the following commands to generate the backend configuration file:
cd /shared
mkdir cromwell
cd cromwell
cat << EOF > volcano.conf
include required(classpath("application"))
backend {
  default = Volcano
  providers {
    Volcano {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        Int runtime_minutes = 600
        Int cpus = 2
        Int requested_memory_mb_per_core = 8000
        String queue = "cromwell"
        """
        # If an 'exit-code-timeout-seconds' value is specified:
        # - check-alive will be run at this interval for every job
        # - if a job is found to be not alive, and no RC file appears after this interval
        # - Then it will be marked as Failed.
        # Warning: If set, Cromwell will run 'check-alive' for every job at this interval
        # exit-code-timeout-seconds = 120
        submit = """
        vcctl job run -f ${script}
        """
        kill = "vcctl job delete -N ${job_id}"
        check-alive = "vcctl job view -N ${job_id}"
        job-id-regex = "(\\\d+)"
      }
    }
  }
}
EOF
Create a simple test workflow file, simple-hello.wdl:
cat << EOF > simple-hello.wdl
task echoHello {
  command {
    echo "Hello AWS!"
  }
  runtime {
    dockerImage: "ubuntu:latest"
    memorys: "512Mi"
    cpus: 1
  }
}
workflow printHelloAndGoodbye {
  call echoHello
}
EOF
Test the Cromwell and Volcano integration:
cd /shared
java -Dconfig.file=/shared/cromwell/volcano.conf -jar /shared/cromwell-85.jar run /shared/cromwell/simple-hello.wdl
The run fails, with the following output:
[2023-07-14 09:01:59,93] [info] Running with database db.url = jdbc:hsqldb:mem:af791aa3-a32f-4eb7-9485-cef928350377;shutdown=false;hsqldb.tx=mvcc
[2023-07-14 09:02:05,59] [info] Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
[2023-07-14 09:02:05,60] [info] [RenameWorkflowOptionsInMetadata] 100%
[2023-07-14 09:02:05,98] [info] Running with database db.url = jdbc:hsqldb:mem:dfdbe72d-a916-46a4-989d-a2b8f19138b0;shutdown=false;hsqldb.tx=mvcc
[2023-07-14 09:02:06,81] [info] Slf4jLogger started
[2023-07-14 09:02:07,10] [info] Workflow heartbeat configuration:
{
"cromwellId" : "cromid-4a23a0e",
"heartbeatInterval" : "2 minutes",
"ttl" : "10 minutes",
"failureShutdownDuration" : "5 minutes",
"writeBatchSize" : 10000,
"writeThreshold" : 10000
}
[2023-07-14 09:02:07,29] [info] KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
[2023-07-14 09:02:07,29] [info] Metadata summary refreshing every 1 second.
[2023-07-14 09:02:07,29] [info] No metadata archiver defined in config
[2023-07-14 09:02:07,29] [info] No metadata deleter defined in config
[2023-07-14 09:02:07,30] [info] WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
[2023-07-14 09:02:07,44] [info] CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
[2023-07-14 09:02:07,55] [info] JobRestartCheckTokenDispenser - Distribution rate: 50 per 1 seconds.
[2023-07-14 09:02:07,72] [info] JobExecutionTokenDispenser - Distribution rate: 20 per 10 seconds.
[2023-07-14 09:02:07,93] [info] SingleWorkflowRunnerActor: Version 85
[2023-07-14 09:02:07,97] [info] SingleWorkflowRunnerActor: Submitting workflow
[2023-07-14 09:02:08,16] [info] Unspecified type (Unspecified version) workflow c0db1d21-1327-4471-b08b-33d0bb43a0eb submitted
[2023-07-14 09:02:08,21] [info] SingleWorkflowRunnerActor: Workflow submitted c0db1d21-1327-4471-b08b-33d0bb43a0eb
[2023-07-14 09:02:08,23] [info] 1 new workflows fetched by cromid-4a23a0e: c0db1d21-1327-4471-b08b-33d0bb43a0eb
[2023-07-14 09:02:08,25] [info] WorkflowManagerActor: Starting workflow c0db1d21-1327-4471-b08b-33d0bb43a0eb
[2023-07-14 09:02:08,26] [info] WorkflowManagerActor: Successfully started WorkflowActor-c0db1d21-1327-4471-b08b-33d0bb43a0eb
[2023-07-14 09:02:08,26] [info] Retrieved 1 workflows from the WorkflowStoreActor
[2023-07-14 09:02:08,32] [info] WorkflowStoreHeartbeatWriteActor configured to flush with batch size 10000 and process rate 2 minutes.
[2023-07-14 09:02:08,52] [info] MaterializeWorkflowDescriptorActor [c0db1d21]: Parsing workflow as WDL draft-2
[2023-07-14 09:02:09,47] [info] MaterializeWorkflowDescriptorActor [c0db1d21]: Call-to-Backend assignments: printHelloAndGoodbye.echoHello -> Volcano
[2023-07-14 09:02:09,83] [warn] Volcano [c0db1d21]: Key/s [docker] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2023-07-14 09:02:11,04] [info] WorkflowExecutionActor-c0db1d21-1327-4471-b08b-33d0bb43a0eb [c0db1d21]: Starting printHelloAndGoodbye.echoHello
[2023-07-14 09:02:12,56] [info] Not triggering log of restart checking token queue status. Effective log interval = None
[2023-07-14 09:02:12,74] [info] Not triggering log of execution token queue status. Effective log interval = None
[2023-07-14 09:02:17,75] [info] Assigned new job execution tokens to the following groups: c0db1d21: 1
[2023-07-14 09:02:17,93] [warn] DispatchedConfigAsyncJobExecutionActor [c0db1d21printHelloAndGoodbye.echoHello:NA:1]: Unrecognized runtime attribute keys: docker
[2023-07-14 09:02:18,02] [info] DispatchedConfigAsyncJobExecutionActor [c0db1d21printHelloAndGoodbye.echoHello:NA:1]: echo "Hello AWS!"
[2023-07-14 09:02:18,05] [info] DispatchedConfigAsyncJobExecutionActor [c0db1d21printHelloAndGoodbye.echoHello:NA:1]: executing: vcctl job run -f
[2023-07-14 09:02:18,28] [info] WorkflowManagerActor: Workflow c0db1d21-1327-4471-b08b-33d0bb43a0eb failed (during ExecutingWorkflowState): java.lang.RuntimeException: Unable to start job. Check the stderr file for possible errors: /home/ec2-user/cromwell-executions/printHelloAndGoodbye/c0db1d21-1327-4471-b08b-33d0bb43a0eb/call-echoHello/execution/stderr.submit
at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.$anonfun$execute$2(SharedFileSystemAsyncJobExecutionActor.scala:165)
at scala.util.Either.fold(Either.scala:189)
at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute(SharedFileSystemAsyncJobExecutionActor.scala:160)
at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute$(SharedFileSystemAsyncJobExecutionActor.scala:155)
at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.execute(ConfigAsyncJobExecutionActor.scala:215)
at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$executeAsync$1(StandardAsyncExecutionActor.scala:751)
at scala.util.Try$.apply(Try.scala:210)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync(StandardAsyncExecutionActor.scala:751)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync$(StandardAsyncExecutionActor.scala:751)
at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeAsync(ConfigAsyncJobExecutionActor.scala:215)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:1141)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:1133)
at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.executeOrRecover(ConfigAsyncJobExecutionActor.scala:215)
at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:65)
at cromwell.core.retry.Retry$.withRetry(Retry.scala:46)
at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:61)
at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:65)
at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:88)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
at akka.actor.Actor.aroundReceive(Actor.scala:539)
at akka.actor.Actor.aroundReceive$(Actor.scala:537)
at cromwell.backend.impl.sfs.config.DispatchedConfigAsyncJobExecutionActor.aroundReceive(ConfigAsyncJobExecutionActor.scala:215)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614)
at akka.actor.ActorCell.invoke(ActorCell.scala:583)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
at akka.dispatch.Mailbox.run(Mailbox.scala:229)
at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
This experiment shows that Cromwell and Volcano were never actually integrated; the Volcano backend sample that Cromwell provides on GitHub is even missing a closing } (the configuration at least parses and runs once the missing } is added to the backend block). But that is not the key problem: the run then fails with a different error. Checking the detailed error log shows the following:
Error: flag needs an argument: 'f' in -f
Usage:
vcctl job run [flags]
Flags:
-f, --filename string the yaml file of job
-h, --help help for run
-i, --image string the container image of job (default "busybox")
-k, --kubeconfig string (optional) absolute path to the kubeconfig file (default "/home/ec2-user/.kube/config")
-L, --limits string the resource limit of the task (default "cpu=1000m,memory=100Mi")
-s, --master string the address of apiserver
-m, --min int the minimal available tasks of job (default 1)
-N, --name string the name of job
-n, --namespace string the namespace of job (default "default")
-r, --replicas int the total tasks of job (default 1)
-R, --requests string the resource request of the task (default "cpu=1000m,memory=100Mi")
-S, --scheduler string the scheduler for this job (default "volcano")
Global Flags:
--log-flush-frequency duration Maximum number of seconds between log flushes (default 5s)
Further analysis shows that vcctl job run -f must be followed by a YAML file. A quick test confirms this:
[ec2-user@ip-172-31-30-173 execution]$ vcctl job run -f script
Failed to run job: only support yaml file
Searching the related issues in the Volcano GitHub repository also confirms that Cromwell is not supported yet, as shown below:
After some trial and error, an approach emerges: since vcctl job run -f requires a YAML file, we can create a YAML job template by hand, point the backend at that template, and expose the dynamically changing values as placeholders. Following this idea, create a new template. Note that efs-storage-claim must be changed to the name of your own Amazon EFS PersistentVolumeClaim:
cd /shared/cromwell
cat << EOF > eks-template.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: JOB_NAME
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: VOLCANO_QUEUE
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: task
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - name: cromwell
              image: C_IMAGE
              command:
                - sh
                - EXECUTION_DIR/script
              resources:
                requests:
                  cpu: CPUS
                  memory: MEMORYS
                limits:
                  cpu: CPUS
                  memory: MEMORYS
              volumeMounts:
                - name: efs-pvc
                  mountPath: /shared
          volumes:
            - name: efs-pvc
              persistentVolumeClaim:
                claimName: efs-storage-claim
          restartPolicy: Never
EOF
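Before wiring the template into Cromwell, you can smoke-test it by hand. A minimal sketch with hypothetical substitution values, assuming the cromwell queue and the /shared EFS mount created earlier:
echo 'echo "template OK"' > /shared/cromwell/script
sed -e 's@JOB_NAME@template-test@g' \
    -e 's@VOLCANO_QUEUE@cromwell@g' \
    -e 's@C_IMAGE@ubuntu:latest@g' \
    -e 's@CPUS@1@g' \
    -e 's@MEMORYS@512Mi@g' \
    -e 's@EXECUTION_DIR@/shared/cromwell@g' \
    eks-template.yaml > /tmp/test-job.yaml
vcctl job run -f /tmp/test-job.yaml
vcctl job list
vcctl job delete -N template-test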
You can add more parameters to the template to suit your needs. Next, modify the backend configuration as follows:
cd /shared/cromwell
cat << EOF > volcano.conf
include required(classpath("application"))
backend {
  default = Volcano
  providers {
    Volcano {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        Int runtime_minutes = 600
        Int cpus = 1
        String memorys = "512MB"
        String dockerImage = "ubuntu:latest"
        String queue = "cromwell"
        String job_template = "/shared/cromwell/eks-template.yaml"
        """
        submit = """
        name=\${"\`echo "+job_name+"|awk -F '_' '{print tolower(\$3)}'\`"}
        uid=\${"\`echo \$(cat /dev/urandom | od -x | head -1 | awk '{print \$2\$3\"\"\$4\$5\"\"\$6\$7\"\"\$8\$9}')\`"}
        jobName=\${"\`echo \$name-\$uid\`"}
        execDir=\$(cd \`dirname \$0\`; pwd)
        workflowDir=\${"\`dirname "+cwd+"\`"}
        cat \${job_template} > \${script}.yaml
        sed -i "s@JOB_NAME@\$jobName@g" \${script}.yaml && \\
        sed -i "s@WORKFLOW_DIR@\$workflowDir@g" \${script}.yaml && \\
        sed -i "s@EXECUTION_DIR@\$execDir@g" \${script}.yaml && \\
        sed -i "s@VOLCANO_QUEUE@\${queue}@g" \${script}.yaml && \\
        sed -i "s@CPUS@\${cpus}@g" \${script}.yaml && \\
        sed -i "s@C_IMAGE@\${dockerImage}@g" \${script}.yaml && \\
        sed -i "s@MEMORYS@\${memorys}@g" \${script}.yaml && \\
        vcctl job run -f \${script}.yaml
        """
        kill = "vcctl job delete -N \${job_id}"
        check-alive = "vcctl job view -N \${job_id}"
        job-id-regex = ".*run job (\\\S+) successfully.*"
      }
    }
  }
}
EOF
Submit the job again:
cd /shared
java -Dconfig.file=/shared/cromwell/volcano.conf -jar /shared/cromwell-85.jar run /shared/cromwell/simple-hello.wdl
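While the workflow runs, you can watch the Volcano Job and its Pod from a second terminal:
vcctl job list
kubectl get pods -w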
This time the run succeeds, with the following output:
[2023-07-14 12:16:52,70] [info] Running with database db.url = jdbc:hsqldb:mem:6991c79a-dc66-46fb-8724-c54912be7130;shutdown=false;hsqldb.tx=mvcc
[2023-07-14 12:16:58,50] [info] Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
[2023-07-14 12:16:58,52] [info] [RenameWorkflowOptionsInMetadata] 100%
[2023-07-14 12:16:59,04] [info] Running with database db.url = jdbc:hsqldb:mem:5e0dacf6-bb11-4537-b2b0-8fa06e868fc1;shutdown=false;hsqldb.tx=mvcc
[2023-07-14 12:16:59,85] [info] Slf4jLogger started
[2023-07-14 12:17:00,13] [info] Workflow heartbeat configuration:
{
"cromwellId" : "cromid-d51c166",
"heartbeatInterval" : "2 minutes",
"ttl" : "10 minutes",
"failureShutdownDuration" : "5 minutes",
"writeBatchSize" : 10000,
"writeThreshold" : 10000
}
[2023-07-14 12:17:00,24] [info] Metadata summary refreshing every 1 second.
[2023-07-14 12:17:00,27] [info] No metadata archiver defined in config
[2023-07-14 12:17:00,27] [info] No metadata deleter defined in config
[2023-07-14 12:17:00,30] [info] KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
[2023-07-14 12:17:00,36] [info] WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
[2023-07-14 12:17:00,40] [info] CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
[2023-07-14 12:17:00,66] [info] JobRestartCheckTokenDispenser - Distribution rate: 50 per 1 seconds.
[2023-07-14 12:17:00,76] [info] JobExecutionTokenDispenser - Distribution rate: 20 per 10 seconds.
[2023-07-14 12:17:00,91] [info] SingleWorkflowRunnerActor: Version 85
[2023-07-14 12:17:00,93] [info] SingleWorkflowRunnerActor: Submitting workflow
[2023-07-14 12:17:01,06] [info] Unspecified type (Unspecified version) workflow 5beb4ec2-f019-479c-97c1-f20559a619a1 submitted
[2023-07-14 12:17:01,13] [info] SingleWorkflowRunnerActor: Workflow submitted 5beb4ec2-f019-479c-97c1-f20559a619a1
[2023-07-14 12:17:01,14] [info] 1 new workflows fetched by cromid-d51c166: 5beb4ec2-f019-479c-97c1-f20559a619a1
[2023-07-14 12:17:01,16] [info] WorkflowManagerActor: Starting workflow 5beb4ec2-f019-479c-97c1-f20559a619a1
[2023-07-14 12:17:01,19] [info] WorkflowManagerActor: Successfully started WorkflowActor-5beb4ec2-f019-479c-97c1-f20559a619a1
[2023-07-14 12:17:01,19] [info] Retrieved 1 workflows from the WorkflowStoreActor
[2023-07-14 12:17:01,23] [info] WorkflowStoreHeartbeatWriteActor configured to flush with batch size 10000 and process rate 2 minutes.
[2023-07-14 12:17:01,45] [info] MaterializeWorkflowDescriptorActor [5beb4ec2]: Parsing workflow as WDL draft-2
[2023-07-14 12:17:02,38] [info] MaterializeWorkflowDescriptorActor [5beb4ec2]: Call-to-Backend assignments: printHelloAndGoodbye.echoHello -> Volcano
[2023-07-14 12:17:02,86] [warn] Volcano [5beb4ec2]: Key/s [docker] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2023-07-14 12:17:04,05] [info] WorkflowExecutionActor-5beb4ec2-f019-479c-97c1-f20559a619a1 [5beb4ec2]: Starting printHelloAndGoodbye.echoHello
[2023-07-14 12:17:05,67] [info] Not triggering log of restart checking token queue status. Effective log interval = None
[2023-07-14 12:17:05,78] [info] Not triggering log of execution token queue status. Effective log interval = None
[2023-07-14 12:17:10,79] [info] Assigned new job execution tokens to the following groups: 5beb4ec2: 1
[2023-07-14 12:17:10,94] [warn] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: Unrecognized runtime attribute keys: docker
[2023-07-14 12:17:11,09] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: echo "Hello AWS!"
[2023-07-14 12:17:11,16] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: executing: Name=echo cromwell_5beb4ec2_echoHello|awk -F '_' '{print tolower($3)}'
Uid=echo $(cat /dev/urandom | od -x | head -1 | awk '{print $2$3""$4$5""$6$7""$8$9}')
JobName=echo $Name-$Uid
ExecDir=$(cd dirname $0; pwd)
workflowDir=dirname /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello
touch $ExecDir/stderr.qsub
cat ~/cromwell/eks-template.yaml > /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml
sed -i "s@JOB_NAME@$JobName@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
sed -i "s@WORKFLOW_DIR@$workflowDir@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
sed -i "s@EXECUTION_DIR@$ExecDir@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
sed -i "s@VOLCANO_QUEUE@cromwell@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
sed -i "s@CPUS@1@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
sed -i "s@MEMORYS@500Mi@g" /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml && \
vcctl job run -f /shared/cromwell-executions/printHelloAndGoodbye/5beb4ec2-f019-479c-97c1-f20559a619a1/call-echoHello/execution/script.yaml
[2023-07-14 12:17:15,43] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: job id: echohello-bfd04a2494a176c8972298007d26c734
[2023-07-14 12:17:15,43] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: Cromwell will watch for an rc file but will *not* double-check whether this job is actually alive (unless Cromwell restarts)
[2023-07-14 12:17:15,43] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: Status change from - to Running
[2023-07-14 12:17:17,04] [info] DispatchedConfigAsyncJobExecutionActor [5beb4ec2printHelloAndGoodbye.echoHello:NA:1]: Status change from Running to Done
[2023-07-14 12:17:17,36] [info] WorkflowExecutionActor-5beb4ec2-f019-479c-97c1-f20559a619a1 [5beb4ec2]: Workflow printHelloAndGoodbye complete. Final Outputs:
{
}
[2023-07-14 12:17:20,42] [info] WorkflowManagerActor: Workflow actor for 5beb4ec2-f019-479c-97c1-f20559a619a1 completed with status 'Succeeded'. The workflow will be removed from the workflow store.
[2023-07-14 12:17:25,16] [info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
"outputs": {
},
"id": "5beb4ec2-f019-479c-97c1-f20559a619a1"
}
Tuning the Volcano and Amazon EKS Integration
When a Pod requests more resources than the instances in the NodeGroup can provide, the Job stays stuck in the Pending state, and Karpenter never launches a new Amazon EC2 instance to run it:
$ vcctl job list
Name Creation Phase JobType Replicas Min Pending Running Succeeded Failed Unknown RetryCount
echohello-300cc1b83423792e68fe43eff0433995 2023-07-19 Pending Batch 1 1 0 0 0 0 0 0
Inspect the Job's details and events:
vcctl job view -N echohello-300cc1b83423792e68fe43eff0433995
Name: echohello-300cc1b83423792e68fe43eff0433995
Namespace: default
Labels: <none>
Annotations: <none>
....
Events:
Type Reason Age From Message
------- ------- ------- ------- -------
Warning PodGroupPending 3m52s vc-controller-manager PodGroup default:echohello-300cc1b83423792e68fe43eff0433995 unschedule,reason: 1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
The root cause is that Volcano defaults to gang scheduling, a mode that schedules a group of tasks (Pods) onto worker nodes simultaneously so they can cooperate and use resources efficiently. An error like 1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable usually means the Volcano scheduler could not find enough available nodes to run the tasks. Because this mode only considers nodes already running in the Amazon EKS cluster, the pending Pods never give Karpenter anything to scale from. To solve this, we can modify Volcano's scheduling configuration.
Modify the Volcano scheduler configuration by executing the following commands:
cat << EOF > volcano-scheduler-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: sla
        arguments:
          sla-waiting-time: 5s
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
EOF
kubectl apply -f volcano-scheduler-configmap.yaml
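Depending on the Volcano version, the scheduler can take a minute or so to pick up the updated ConfigMap mounted into its Pod; if the new settings do not seem to take effect, restarting the scheduler deployment forces a reload:
kubectl -n volcano-system rollout restart deployment/volcano-scheduler
kubectl -n volcano-system get pods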
Submitting a Bioinformatics Pipeline via Cromwell
This section demonstrates how to combine Cromwell with Amazon EKS to run GATK4 HaplotypeCaller. The HaplotypeCaller tool is one of the main steps in the GATK Best Practices pipelines.
Running HaplotypeCaller
- Download input data
cd /shared
mkdir -p workflow/haplotype-caller-workflow
cd workflow/haplotype-caller-workflow
aws s3 sync s3://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals hg38_wgs_scattered_calling_intervals
aws s3 cp s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam .
aws s3 cp s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bai .
aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict .
aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta .
aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai .
cat << EOF > hg38_wgs_scattered_calling_intervals.txt
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0003_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0004_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0005_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0006_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0007_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0008_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0009_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0010_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0011_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0012_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0013_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0014_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0015_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0016_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0017_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0018_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0019_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0020_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0021_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0022_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0023_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0024_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0025_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0026_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0027_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0028_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0029_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0030_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0031_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0032_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0033_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0034_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0035_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0036_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0037_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0038_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0039_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0040_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0041_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0042_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0043_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0044_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0045_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0046_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0047_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0048_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0049_of_50/scattered.interval_list
/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_0050_of_50/scattered.interval_list
EOF
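Equivalently, the interval list above (which starts at temp_0003) can be generated with a short loop:
for i in $(seq 3 50); do
  printf '/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals/temp_%04d_of_50/scattered.interval_list\n' "$i"
done > hg38_wgs_scattered_calling_intervals.txt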
- Workflow Definition
cat << EOF > HaplotypeCallerWF.wdl
# WORKFLOW DEFINITION
workflow HaplotypeCallerGvcf_GATK4 {
  File input_bam
  File input_bam_index
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  File scattered_calling_intervals_list
  String gatk_docker
  String gatk_path
  Array[File] scattered_calling_intervals = read_lines(scattered_calling_intervals_list)
  String sample_basename = basename(input_bam, ".bam")
  String gvcf_name = sample_basename + ".g.vcf.gz"
  String gvcf_index = sample_basename + ".g.vcf.gz.tbi"
  # Call variants in parallel over grouped calling intervals
  scatter (interval_file in scattered_calling_intervals) {
    # Generate GVCF by interval
    call HaplotypeCaller {
      input:
        input_bam = input_bam,
        input_bam_index = input_bam_index,
        interval_list = interval_file,
        gvcf_name = gvcf_name,
        ref_dict = ref_dict,
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        docker_image = gatk_docker,
        gatk_path = gatk_path
    }
  }
  # Merge per-interval GVCFs
  call MergeGVCFs {
    input:
      input_vcfs = HaplotypeCaller.output_gvcf,
      vcf_name = gvcf_name,
      vcf_index = gvcf_index,
      docker_image = gatk_docker,
      gatk_path = gatk_path
  }
  # Outputs that will be retained when execution is complete
  output {
    File output_merged_gvcf = MergeGVCFs.output_vcf
    File output_merged_gvcf_index = MergeGVCFs.output_vcf_index
  }
}
# TASK DEFINITIONS
# HaplotypeCaller per-sample in GVCF mode
task HaplotypeCaller {
  File input_bam
  File input_bam_index
  String gvcf_name
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  File interval_list
  Int? interval_padding
  Float? contamination
  Int? max_alt_alleles
  String mem_size
  String docker_image
  String gatk_path
  String java_opt
  command {
    \${gatk_path} --java-options \${java_opt} \\
      HaplotypeCaller \\
      -R \${ref_fasta} \\
      -I \${input_bam} \\
      -O \${gvcf_name} \\
      -L \${interval_list} \\
      -ip \${default=100 interval_padding} \\
      -contamination \${default=0 contamination} \\
      --max-alternate-alleles \${default=3 max_alt_alleles} \\
      -ERC GVCF
  }
  runtime {
    dockerImage: docker_image
    memorys: mem_size
    cpus: 1
  }
  output {
    File output_gvcf = "\${gvcf_name}"
  }
}
# Merge GVCFs generated per-interval for the same sample
task MergeGVCFs {
  Array [File] input_vcfs
  String vcf_name
  String vcf_index
  String mem_size
  String docker_image
  String gatk_path
  String java_opt
  command {
    \${gatk_path} --java-options \${java_opt} \\
      MergeVcfs \\
      --INPUT=\${sep=' --INPUT=' input_vcfs} \\
      --OUTPUT=\${vcf_name}
  }
  runtime {
    dockerImage: docker_image
    memorys: mem_size
    cpus: 1
  }
  output {
    File output_vcf = "\${vcf_name}"
    File output_vcf_index = "\${vcf_index}"
  }
}
EOF
- Inputs Definition
cat << EOF > HaplotypeCallerWF.inputs.json
{
  "##_COMMENT1": "INPUT BAM",
  "HaplotypeCallerGvcf_GATK4.input_bam": "/shared/workflow/haplotype-caller-workflow/NA12878_24RG_small.hg38.bam",
  "HaplotypeCallerGvcf_GATK4.input_bam_index": "/shared/workflow/haplotype-caller-workflow/NA12878_24RG_small.hg38.bai",
  "##_COMMENT2": "REFERENCE FILES",
  "HaplotypeCallerGvcf_GATK4.ref_dict": "/shared/workflow/haplotype-caller-workflow/Homo_sapiens_assembly38.dict",
  "HaplotypeCallerGvcf_GATK4.ref_fasta": "/shared/workflow/haplotype-caller-workflow/Homo_sapiens_assembly38.fasta",
  "HaplotypeCallerGvcf_GATK4.ref_fasta_index": "/shared/workflow/haplotype-caller-workflow/Homo_sapiens_assembly38.fasta.fai",
  "##_COMMENT3": "INTERVALS",
  "HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list": "/shared/workflow/haplotype-caller-workflow/hg38_wgs_scattered_calling_intervals.txt",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.interval_padding": 100,
  "##_COMMENT4": "DOCKERS",
  "HaplotypeCallerGvcf_GATK4.gatk_docker": "broadinstitute/gatk:4.0.0.0",
  "##_COMMENT5": "PATHS",
  "HaplotypeCallerGvcf_GATK4.gatk_path": "/gatk/gatk",
  "##_COMMENT6": "JAVA OPTIONS",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.java_opt": "-Xms10000m",
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.java_opt": "-Xms10000m",
  "##_COMMENT7": "MEMORY ALLOCATION",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.mem_size": "12Gi",
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.mem_size": "30Gi",
  "##_COMMENT8": "DISK SIZE ALLOCATION",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.disk_size": 100,
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.disk_size": 100,
  "##_COMMENT9": "PREEMPTION",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.preemptible_tries": 3,
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.preemptible_tries": 3
}
EOF
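Optionally, you can validate the WDL before submitting it, using womtool, which the Broad Institute publishes alongside each Cromwell release (version 85 here to match the Cromwell jar used in this post):
cd /shared/workflow/haplotype-caller-workflow
wget https://github.com/broadinstitute/cromwell/releases/download/85/womtool-85.jar
java -jar womtool-85.jar validate HaplotypeCallerWF.wdl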
- Running the workflow
cd /shared
nohup java -Dconfig.file=/shared/cromwell/volcano.conf -jar /shared/cromwell-85.jar run /shared/workflow/haplotype-caller-workflow/HaplotypeCallerWF.wdl --inputs /shared/workflow/haplotype-caller-workflow/HaplotypeCallerWF.inputs.json > haplotype_caller.log 2>&1 &
- Job log
tail -f haplotype_caller.log
While the workflow runs, the Pods look like this:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
haplotypecaller-11af636d71eb3733da71ed5f4e38d7b5-task-0 1/1 Running 0 33m
haplotypecaller-2facecea05e77e637c89603ab532a979-task-0 0/1 Completed 0 33m
haplotypecaller-5440f17212b266d6ac852be054085f40-task-0 0/1 Completed 0 33m
haplotypecaller-6007c2c638f95646d151d46c2bec632d-task-0 0/1 Completed 0 33m
haplotypecaller-7c7e1421a759a1469a241d2bb92c36f4-task-0 0/1 Completed 0 33m
haplotypecaller-8ce6fbb2900581f12bd21f25b07af605-task-0 0/1 Completed 0 33m
haplotypecaller-a07962b5179f200dc1f4de9c9458e0a2-task-0 0/1 Completed 0 33m
haplotypecaller-b2fc938a406587a22361fd2adc8d0a74-task-0 0/1 Completed 0 33m
haplotypecaller-ebf38edbec0901f512c8f3e9724fc8f9-task-0 0/1 Completed 0 33m
haplotypecaller-f3377af2d05404d86be2a69028d91b2c-task-0 0/1 Completed 0 33m
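When the workflow finishes, the merged gVCF lands under the Cromwell execution directory. The path follows the same pattern as in the logs shown earlier; the workflow UUID differs per run:
ls /shared/cromwell-executions/HaplotypeCallerGvcf_GATK4/*/call-MergeGVCFs/execution/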
Summary
This post took a detailed look at Cromwell, a workflow framework widely used in bioinformatics analysis, solved the problem that Cromwell and Volcano could not be integrated, resolved the scheduling conflict between Volcano and Karpenter, and finally demonstrated the full setup by running a reasonably realistic bioinformatics pipeline end to end. Each step includes the commands and operations to execute, along with example command-line code. By following these steps on top of the Amazon EKS cluster created in the first post, you can build your own containerized HPC cluster; deploy your existing Cromwell pipelines to a Cromwell server, and you can run cluster jobs efficiently.
References
Amazon EKS:https://aws.amazon.com/cn/eks/
Karpenter:https://karpenter.sh/
Volcano:https://volcano.sh/zh/docs/
Cromwell:https://cromwell.readthedocs.io/en/stable/
HaplotypeCallerWF:https://github.com/broadinstitute/cromwell/tree/develop/centaur/src/main/resources/integrationTestCases/germline/haplotype-caller-workflow