Introducing GPU health checks in AWS ParallelCluster 3.6

GPU failures are relatively rare, but when they do occur they can have severe consequences for HPC and deep learning tasks. For example, they can disrupt long-running simulations and distributed training jobs. Amazon EC2 verifies GPU health before it launches an instance. It also runs periodic status checks that can detect and mitigate many failure modes. However, this approach can miss failures that arise after a GPU instance has been active for some time.

With AWS ParallelCluster 3.6, you can configure NVIDIA GPU health checks that run at the start of your Slurm jobs. If the health check fails, the job is re-queued on another instance. Meanwhile, the instance is marked as unavailable for new work and is de-provisioned once any other jobs running on it have completed. This helps increase the reliability of GPU-based workloads (NVIDIA-based ones, at least), and helps prevent unwanted spend resulting from unsuccessful jobs.

Using GPU Health Checks

To get started with GPU health checks, you’ll need ParallelCluster 3.6.0 or higher. You can follow this online guide to help you upgrade. Next, edit your cluster configuration as described in the examples below and in the AWS ParallelCluster documentation. Finally, create a cluster using the new configuration.

By default, GPU health checks are off on new clusters. You can enable or disable them at the queue level as well as on individual compute resources. To do this, add a HealthChecks stanza to Slurm queues and/or individual compute resources. If you set a value for HealthChecks:Gpu:Enabled at the compute resource level, it overrides the setting from the queue level.

Scheduling:
    SlurmQueues:
        - Name: <string>
          HealthChecks:
            Gpu:
                Enabled: <boolean>
          ComputeResources:
            - Name: <string>
              HealthChecks:
                Gpu:
                    Enabled: <boolean>
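
The override rule described above can be sketched in a few lines of Python. This is only an illustration of the precedence logic, not ParallelCluster's own code: the function name `gpu_health_check_enabled` and the queue and compute resource names are hypothetical, and the configuration is written as a plain dictionary that mirrors the YAML layout.

```python
from typing import Optional


def gpu_health_check_enabled(queue: dict, compute_resource: dict) -> bool:
    """Resolve the effective HealthChecks:Gpu:Enabled value for one compute resource."""

    def explicit_setting(section: dict) -> Optional[bool]:
        # Returns None when the section does not set HealthChecks:Gpu:Enabled.
        return section.get("HealthChecks", {}).get("Gpu", {}).get("Enabled")

    # The compute resource is consulted first, so it overrides the queue.
    for section in (compute_resource, queue):
        value = explicit_setting(section)
        if value is not None:
            return value
    return False  # GPU health checks are off by default


queue = {
    "Name": "gpu-queue",  # hypothetical queue name
    "HealthChecks": {"Gpu": {"Enabled": True}},
    "ComputeResources": [
        {"Name": "train-nodes"},  # no setting of its own: inherits the queue value
        {"Name": "debug-nodes", "HealthChecks": {"Gpu": {"Enabled": False}}},
    ],
}

effective = {
    cr["Name"]: gpu_health_check_enabled(queue, cr)
    for cr in queue["ComputeResources"]
}
print(effective)  # {'train-nodes': True, 'debug-nodes': False}
```

In this sketch the queue turns health checks on, one compute resource inherits that value, and the other opts out explicitly.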

When GPU health checks are enabled on a queue or compute resource, a Slurm prolog script is executed at the beginning of each job that runs on them. The prolog activates NVIDIA Data Center GPU Manager (DCGM) and the NVIDIA Fabric Manager, and temporarily sets the GPU persistence mode to active. Then, it runs the DCGM diagnostic at run level 2, which checks the functionality of the NVIDIA software stack, PCIe and NVLink status, GPU memory function, and GPU memory bandwidth. Once the diagnostic has run, the node is returned to its original state. We estimate that the whole check takes less than 3 minutes to complete.
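
If you want to run a comparable diagnostic by hand on a GPU instance, the sketch below builds and runs the `dcgmi diag -r 2` command from Python. It is not the prolog script that ParallelCluster ships: it does not start DCGM or the Fabric Manager, does not change the persistence mode, and it assumes the DCGM host engine is already running. The function names and the timeout are hypothetical, and treating a non-zero exit status as a failed check is an assumption.

```python
import subprocess
from typing import List


def dcgm_diag_command(run_level: int = 2) -> List[str]:
    """Build the argv for a DCGM diagnostic at the given run level."""
    if run_level < 1:
        raise ValueError("run level must be a positive integer")
    return ["dcgmi", "diag", "-r", str(run_level)]


def run_dcgm_diag(run_level: int = 2, timeout_s: int = 600) -> bool:
    """Run the diagnostic; True means exit status 0 (assumed here to mean healthy)."""
    result = subprocess.run(
        dcgm_diag_command(run_level),
        capture_output=True,
        text=True,
        timeout=timeout_s,  # hypothetical upper bound, well above the ~3 minute estimate
    )
    print(result.stdout)
    return result.returncode == 0


# Only the command is printed here; call run_dcgm_diag() on a GPU instance.
print(" ".join(dcgm_diag_command()))  # dcgmi diag -r 2
```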

If the GPU diagnostic test succeeds, ParallelCluster logs a success message and the job continues to run.

If it fails, ParallelCluster logs an error message and begins the mitigation process. First, the job is rescheduled onto another instance. Then, the failing instance is drained so no more work is allocated to it. Once all jobs running on it have completed, it’s terminated. If there is no GPU on an instance, this gets logged and the diagnostic test is skipped.
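
The three outcomes described above (success, failure, and no GPU present) can be summarized as a small lookup. This is a reading aid that restates the text, not ParallelCluster's implementation; the enum, the function name, and the action strings are all hypothetical.

```python
from enum import Enum
from typing import List


class CheckOutcome(Enum):
    PASSED = "passed"
    FAILED = "failed"
    NO_GPU = "no_gpu"


def actions_for(outcome: CheckOutcome) -> List[str]:
    """Return the ordered steps the text describes for each health check outcome."""
    if outcome is CheckOutcome.PASSED:
        return ["log success", "let the job continue"]
    if outcome is CheckOutcome.NO_GPU:
        return ["log that no GPU was found", "skip the diagnostic"]
    # FAILED: the job moves elsewhere and the instance is retired once it is idle.
    return [
        "log error",
        "requeue the job on another instance",
        "drain the failing instance",
        "terminate the instance after its remaining jobs complete",
    ]


for step in actions_for(CheckOutcome.FAILED):
    print(step)
```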

You can find logs for GPU health checks on the individual compute instances at /var/log/parallelcluster/slurm_health_check.log. However, once an instance is decommissioned by AWS ParallelCluster, you can no longer access it to read this file.

Health check logs are also stored persistently in your cluster’s Amazon CloudWatch log group. To find them, go to the AWS Console and navigate to CloudWatch, then choose Log groups. Search for your cluster name, then choose the log group that matches it.

Log group names are a combination of cluster name and a date stamp, so if you have more than one cluster with the same name, choose the one whose date stamp matches your intended cluster’s creation date.

Under Log streams, you can find health check logs named after the instance they ran on. For example, log stream ip-172-31-2-17.i-0dc17624a7862b835.slurm_health_check contains logs from an instance whose private IP was 172.31.2.17 and whose instance identifier was i-0dc17624a7862b835.
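
If you need to pull the IP address and instance identifier out of many log stream names, a small parser can help. The sketch below assumes every stream follows the same `ip-<address>.<instance-id>.<log-name>` pattern as the example above; the function and type names are hypothetical.

```python
import ipaddress
import re
from typing import NamedTuple

# Pattern inferred from the single example stream name; other streams may differ.
_STREAM_RE = re.compile(
    r"^ip-(?P<ip>\d{1,3}(?:-\d{1,3}){3})"
    r"\.(?P<instance_id>i-[0-9a-f]+)"
    r"\.(?P<log_name>.+)$"
)


class LogStream(NamedTuple):
    private_ip: str
    instance_id: str
    log_name: str


def parse_log_stream(name: str) -> LogStream:
    """Split a log stream name into private IP, instance ID, and log name."""
    match = _STREAM_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized log stream name: {name!r}")
    # Dashes in the host part stand in for the dots of the IPv4 address.
    ip = ipaddress.IPv4Address(match["ip"].replace("-", "."))
    return LogStream(str(ip), match["instance_id"], match["log_name"])


stream = parse_log_stream("ip-172-31-2-17.i-0dc17624a7862b835.slurm_health_check")
print(stream.private_ip)   # 172.31.2.17
print(stream.instance_id)  # i-0dc17624a7862b835
```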

You can use the instance identifier in correspondence with AWS Support should you wish to report a GPU failure.

Details to be aware of

GPU health checks are a straightforward feature. Activate them, and ParallelCluster will transparently monitor for and attempt to mitigate failed GPUs. However, there are a few details to keep in mind as you start using them.

Cost

Validating GPU health when a job starts minimizes the amount of time a GPU instance runs with a degraded GPU. However, you will still incur usage charges for at least the 2-3 minutes it takes to run the health check, and for as long as it takes other jobs on the instance to complete. Also, logs from GPU health checks are sent to an instance-specific stream in your cluster’s Amazon CloudWatch log group. You may incur charges for this additional log data.
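
To get a rough feel for the compute overhead, you can multiply the check duration by your instance price. The sketch below is back-of-the-envelope arithmetic only: the hourly price is a made-up placeholder rather than a real EC2 rate, the function name is hypothetical, and it assumes one 3-minute check per node per job. It ignores CloudWatch charges and the time a failing instance spends waiting for other jobs to finish.

```python
def health_check_overhead_usd(
    hourly_price_usd: float,
    check_minutes: float = 3.0,
    nodes: int = 1,
    jobs: int = 1,
) -> float:
    """Estimate the instance cost of running the health check before each job."""
    # Assumes the check runs once on every node of every job.
    return hourly_price_usd * (check_minutes / 60.0) * nodes * jobs


# Placeholder price of 30 USD per hour, 8 nodes per job, 20 jobs.
estimate = health_check_overhead_usd(hourly_price_usd=30.0, nodes=8, jobs=20)
print(f"{estimate:.2f}")  # 240.00
```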

Custom Prologs and Epilogs

If you explicitly set custom prologs or epilogs for your Slurm jobs, they may conflict with GPU health checks. This is explained in detail in the AWS ParallelCluster documentation. Briefly, ParallelCluster 3.6.0 points Slurm to a directory where it finds prolog and epilog scripts. Consult the documentation to understand how your prolog and epilog configurations interact with this. It may be as simple as putting your prolog and epilog scripts in the directories that ParallelCluster provides to have them run as part of your Slurm jobs.

Custom AMIs

You can use GPU health checks with any ParallelCluster AMI from version 3.6.0 on, as well as derivative custom AMIs. GPU health checks rely on the presence of NVIDIA DCGM. If it can’t be found, a log message will be generated and the job will continue to run. Therefore, when you are testing a new custom AMI, we recommend you inspect some health check logs on a GPU-based instance to ensure that health checks can run as expected.
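
As a quick first test on a custom AMI, you can check whether the NVIDIA command-line tools are on the PATH before you look at the health check logs. The sketch below looks for `dcgmi` (the DCGM command-line tool) and `nvidia-smi`; the function name is hypothetical. Finding the tools does not prove that health checks will run, so inspecting the logs as recommended above is still the real test.

```python
import shutil
from typing import Dict, Optional, Sequence


def find_tools(
    names: Sequence[str] = ("dcgmi", "nvidia-smi"),
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    status = path if path is not None else "NOT FOUND"
    print(f"{tool}: {status}")
```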

Conclusion

With AWS ParallelCluster 3.6.0, you can configure your cluster to detect and recover from GPU failures for any NVIDIA-based GPU instances. This can help minimize unwanted costs and lost time with GPU-intensive workloads. You’ll need to update your ParallelCluster installation, then add HealthChecks settings to your cluster configuration to use this new feature.

Try out GPU health checks and let us know how we can improve them, or any other aspect of AWS ParallelCluster. If they make your life quantifiably better, you can tell us about that, too. Reach us on Twitter at @TechHPC or by email at ask-hpc@amazon.com.

Suggested tags: HPC, ParallelCluster, NVIDIA, GPU, AI/ML

AWS HPC Blog