AWS HPC Blog

Encoding workflow dependencies in AWS Batch

Most users of HPC or Batch systems need to analyze data with multiple operations to get meaningful results. That’s really driven by the nature of scientific research or engineering processes – it’s rare that a single task generates the insight you need.

AWS Batch customers are no exception, of course, which is why Batch supports nested dependencies to create relationships between jobs. In today’s post, we’ll walk you through how to encode job dependencies in Batch, using a simple machine learning pipeline as an example.

Our scenario

Let’s take the following high-level diagram depicting a simple machine learning pipeline.

Figure 1 – A simplified diagram of a machine learning model training workflow, showcasing the serial dependencies between each operation on data to build the final trained model.

The diagram shows a straight-forward set of serial steps (or jobs) that transform data into a trained model that can be used for inference within some other system. Even for this simple example workflow there are a few things to note: (1) each step can have different requirements for CPU, memory, storage, or even a GPU, so it makes sense to encapsulate each as a separate job definition; (2) steps further in the chain depend on data from an earlier step; and because of that, (3) if an earlier step does not succeed, then it makes no sense to provision resources for the dependent processes that follow.
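To illustrate point (1), each step could be registered as its own job definition sized for the resources it needs, for example a CPU-only preprocessing step and a GPU-enabled training step. The following is a minimal sketch; the job definition names, container image URIs, and resource values are placeholders for illustration, not part of the workflow above.

# Preprocessing step: CPU and memory only (image URI is a placeholder)
aws batch register-job-definition \
    --job-definition-name jdPreprocess \
    --type container \
    --container-properties '{
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/preprocess:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "8192"}
        ]
    }'

# Training step: larger instance plus a GPU requirement
aws batch register-job-definition \
    --job-definition-name jdTrain \
    --type container \
    --container-properties '{
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/train:latest",
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32768"},
            {"type": "GPU", "value": "1"}
        ]
    }'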

When using AWS Batch, you have a choice for how to encode the dependencies between jobs:

  1. Use the SubmitJob API to define the dependencies as you are submitting the job requests. These job dependencies are directly encoded within, and managed by, Batch.
  2. Define the job dependencies outside of Batch using a workflow framework, such as Apache Airflow or AWS Step Functions.

I’ll describe both of these methods, starting with directly encoding job dependencies in the Batch API. I’ll use AWS Step Functions, which integrates natively with AWS Batch, to describe encoding dependencies outside of Batch.

Encoding job dependencies at runtime with AWS Batch

When you submit a job request to AWS Batch, you have the option of defining a dependency on a previously submitted job. This is what we mean by “runtime”: you make API requests, then refer to the job IDs they return in subsequent requests in an iterative manner. The job dependencies are not formally defined anywhere before the submission requests are acknowledged by the Batch service.

The following example shows how to submit a job using the AWS CLI and then using the returned job ID in a subsequent job request to define the dependency.

# Submit job A
aws batch submit-job --job-name jobA --job-queue myQueue --job-definition jdA

# Output 
{
    "jobName": "example",
    "jobId": "876da822-4198-45f2-a252-6cea32512ea8"
}

# Submit job B
aws batch submit-job --job-name jobB --job-queue myQueue --job-definition jdB --depends-on jobId="876da822-4198-45f2-a252-6cea32512ea8"
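If you are scripting a chain of submissions, you can capture the returned job ID with the CLI's --query option instead of copying it by hand. Here is a minimal sketch using the same placeholder queue and job definition names as above:

# Submit Job A and capture its job ID directly
JOB_A_ID=$(aws batch submit-job \
    --job-name jobA \
    --job-queue myQueue \
    --job-definition jdA \
    --query jobId --output text)

# Submit Job B with a dependency on Job A
aws batch submit-job \
    --job-name jobB \
    --job-queue myQueue \
    --job-definition jdB \
    --depends-on jobId="${JOB_A_ID}"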

As the following animation (Figure 2) shows, the Batch scheduler will keep the second job (Job B) in the PENDING state until the first job (Job A) completes successfully.  After Job A succeeds, Job B progresses from PENDING to RUNNABLE and eventually to the SUCCEEDED state.

Figure 2 – Animation showing job state changes of a primary job (Job A) and its dependent job (Job B) when the primary job succeeds.

If Job A fails, Job B also immediately transitions from PENDING to FAILED, as shown in Figure 3.

Figure 3 – Animation showing job state changes of a primary and dependent job, and how failure of the primary job triggers the failure of the dependent job.

Defining dependencies for array jobs

The previous example shows how dependencies work for basic jobs, which are individual units of work. AWS Batch also allows you to submit array jobs, which let you submit multiple units of work that share common parameters (such as the application, vCPUs, and memory) in a single API request.

When you submit an array job request with an array size of 1000, a single parent job gets created that spawns 1000 child basic jobs. Each child job has the same base job ID as the parent, but adds the array index as a suffix. The array index is also passed to the child job as the environment variable AWS_BATCH_JOB_ARRAY_INDEX.

Figure 4 – A Batch array job showing the parent (Job A) and indexed basic child jobs ([A:0, A:1, …])
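The following sketch shows how an array job like the one in Figure 4 might be submitted, and how a child job could use its index. The queue, job definition, and S3 path are placeholders for illustration.

# Submit an array job with 1000 child jobs
aws batch submit-job \
    --job-name arrayJobA \
    --job-queue myQueue \
    --job-definition jdArray \
    --array-properties size=1000

# Inside each child job's container, the index is available as an environment
# variable and can be used, for example, to select that child's slice of input:
#   INPUT_FILE="s3://my-bucket/inputs/part-${AWS_BATCH_JOB_ARRAY_INDEX}.csv"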

You can define two types of job dependencies for array jobs: SEQUENTIAL or N_TO_N. A SEQUENTIAL dependency makes each child job depend on the success of the previously indexed child job before it can start work. The job with index 1 will not start until index 0 succeeds. This is a way to define a sequence of operations within the elements of the array job. Figure 5 illustrates the SEQUENTIAL array job dependency. Failure of a child job will cascade down the line of remaining children and also transition the parent array job to FAILED.

Figure 5 – SEQUENTIAL job dependencies for an array job, showcasing within-array sequential execution.
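A SEQUENTIAL dependency is declared on the array job itself, with no job ID required. Here is a minimal sketch with placeholder names:

# Child 1 waits for child 0, child 2 waits for child 1, and so on
aws batch submit-job \
    --job-name seqArrayJob \
    --job-queue myQueue \
    --job-definition jdArray \
    --array-properties size=10 \
    --depends-on type=SEQUENTIAL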

The second array job dependency type, N_TO_N, is between array jobs. Given two array jobs, A and B, where B depends on A, each child job of B must wait for the corresponding child job of A to complete before it can begin. For example, Job B:1 can’t start until job A:1 finishes. Figure 6 illustrates the dependency between A and B, and the resulting child job dependencies.

Figure 6 – N_TO_N job dependencies between array jobs, showcasing that each child job of the dependent array job (Job B) depends on the child job with the same index in the primary array job (Job A).
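An N_TO_N dependency is declared on the dependent array job, referencing the parent ID of the primary array job. A minimal sketch, again with placeholder names:

# Submit array job A and capture its (parent) job ID
JOB_A_ID=$(aws batch submit-job \
    --job-name arrayJobA \
    --job-queue myQueue \
    --job-definition jdA \
    --array-properties size=10 \
    --query jobId --output text)

# Submit array job B with an N_TO_N dependency on A:
# B:0 waits for A:0, B:1 waits for A:1, and so on
aws batch submit-job \
    --job-name arrayJobB \
    --job-queue myQueue \
    --job-definition jdB \
    --array-properties size=10 \
    --depends-on jobId="${JOB_A_ID}",type=N_TO_N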

If you cancel or shut down a parent array job, all of the child jobs are cancelled or shut down with it. You can cancel or shut down individual child jobs (which moves them to the FAILED status) without affecting the other child jobs. However, if a child array job enters a FAILED status, the parent job also fails.

Finally, dependencies between basic jobs and array jobs behave like dependencies between basic jobs: a basic job that depends on an array job’s parent ID will only start once all of the array’s child jobs complete successfully.
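For example, continuing with the JOB_A_ID captured in the sketch above, a basic job submitted as follows would only start after every child job of array job A has succeeded (the job name and definition are placeholders):

# Basic job that waits on the entire array job A
aws batch submit-job \
    --job-name summarizeResults \
    --job-queue myQueue \
    --job-definition jdSummarize \
    --depends-on jobId="${JOB_A_ID}"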

Defining job dependencies outside of AWS Batch

Another way to define dependencies between jobs is to do so outside of AWS Batch using a workflow framework such as Apache Airflow or AWS Step Functions (and there are many more – Batch is widely supported by the community).

The main advantages of leveraging a workflow framework are:

Figure 7 – Diagram of a Step Functions state machine

  • Workflow check-pointing for manual reviews, or taking advantage of flexible compute options like EC2 Spot Instances
  • Restarting workflows from a checkpoint so that you don’t waste compute resources running jobs that already have results
  • Leveraging multiple compute resources or services depending on the task requirement
  • Advanced retry strategies, such as adding more job resources (e.g., more memory) or switching resource pools
  • Multiple error-handling routines that are triggered based on the type of error encountered

AWS Step Functions allows you to define a workflow as a state machine using the Amazon States Language, a JSON-based, structured language. Figure 7 shows the visual representation of a state machine for the earlier machine learning example. Each of the task states integrates with AWS Batch to perform work. By default, when a state reports an error, AWS Step Functions fails the execution entirely and no further tasks are sent to AWS Batch. However, Step Functions has advanced error-handling features that you can use to deal with different types of errors, such as runtime parameter, timeout, and permissions errors. Being able to define fallback states and advanced retry strategies is one of the main advantages of working with full-featured workflow frameworks such as Step Functions.
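To make this concrete, here is a hypothetical fragment of a state machine (not the full pipeline in Figure 7) that submits a single Batch job synchronously and uses Retry and Catch for error handling. The job, queue, and definition names are placeholders.

{
  "Comment": "Hypothetical fragment: one pipeline step submitted to AWS Batch",
  "StartAt": "TrainModel",
  "States": {
    "TrainModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "train-model",
        "JobQueue": "myQueue",
        "JobDefinition": "jdTrain"
      },
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 30,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "TrainingFailed"
        }
      ],
      "End": true
    },
    "TrainingFailed": {
      "Type": "Fail",
      "Cause": "The training job failed after retries"
    }
  }
}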

You can read about common workflow design patterns in HPC, and how to encode those patterns using Step Functions and Batch, in the Orchestrating high performance computing with AWS Step Functions and AWS Batch blog post by AWS solutions architects Dan Fox and Sabha Parameswaran.

Conclusion

We discussed two methods for encoding dependencies between jobs submitted to AWS Batch. The first was leveraging the features of the AWS Batch API for basic and array jobs as you submit work to the job queue. This has the advantage of working within a single service for all your workflow needs.

The second method described encoding dependencies outside of AWS Batch, using AWS Step Functions as an example. Use external dependency management when you want to take advantage of advanced error-handling capabilities, need to support multiple job schedulers and resources, or already have experience with a framework that fits into your current systems.

For more information on Batch and Step Functions, refer to our documentation. Finally, if you have other questions about AWS services for HPC, reach out to us on Twitter at @TechHpc.

Angel Pizarro

Angel is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.