How do I monitor jobs stuck in the RUNNABLE state in AWS Batch?

Last updated: 2019-11-15

I want to monitor jobs stuck in the RUNNABLE state in AWS Batch. How can I do this?

Short Description

How long an AWS Batch job can remain in the RUNNABLE state before you consider it stuck varies with your compute environment. This article uses a threshold of 180 seconds as an example.

To monitor jobs stuck in the RUNNABLE state, you can create an Amazon CloudWatch Events rule that triggers an AWS Lambda function on a schedule. The Lambda function then checks every job queue for RUNNABLE jobs that have been waiting for more than 180 seconds, and publishes a notification to an Amazon SNS topic if it finds any.
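
The threshold check itself can be sketched as a pure function, independent of any AWS call. This is a minimal illustration rather than the Lambda code used in this article: the function name find_stuck_jobs and the sample job dictionaries are hypothetical, and the sketch assumes that each job summary carries a createdAt timestamp in milliseconds since the epoch.

```python
import time

def find_stuck_jobs(job_summaries, threshold_seconds, now_ms=None):
    """Return (jobId, age_in_seconds) for jobs older than the threshold.

    job_summaries: iterable of dicts with 'jobId' and 'createdAt' (epoch ms).
    now_ms: current time in epoch milliseconds; defaults to the system clock.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    stuck = []
    for job in job_summaries:
        age_seconds = (now_ms - job['createdAt']) // 1000
        if age_seconds > threshold_seconds:
            stuck.append((job['jobId'], age_seconds))
    return stuck

# Hypothetical sample data: one job is 10 minutes old, one is 60 seconds old.
now = 1_600_000_000_000
jobs = [
    {'jobId': 'job-a', 'createdAt': now - 600_000},
    {'jobId': 'job-b', 'createdAt': now - 60_000},
]
print(find_stuck_jobs(jobs, 180, now_ms=now))
```

With the sample data, only job-a exceeds the 180-second threshold, so the script prints [('job-a', 600)].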

To create a monitoring rule, choose one of the following options:

  • Create a monitoring rule manually
  • Create a monitoring rule with an AWS CloudFormation template

Resolution

Create a monitoring rule manually

1.    Create an Amazon Simple Notification Service (Amazon SNS) topic, and then subscribe to that topic.

Important: In your subscription, choose Email as your endpoint type, and then enter your email address. Be sure to note your SNS topic ARN.
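
If you prefer to script this step, the following is a minimal sketch that uses the boto3 SNS client. The helper name create_topic_with_email, the topic name, and the email address are hypothetical. The client is passed in as a parameter, so the function makes no AWS calls until you call it with a real client.

```python
def create_topic_with_email(sns_client, topic_name, email_address):
    """Create an SNS topic and subscribe an email endpoint to it.

    sns_client: a boto3 SNS client, for example boto3.client('sns').
    Returns the topic ARN, which the Lambda function needs later.
    """
    # create_topic returns the existing topic's ARN if the name is already in use
    topic_arn = sns_client.create_topic(Name=topic_name)['TopicArn']
    # The subscription stays pending until the recipient confirms it by email
    sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol='email',
        Endpoint=email_address,
    )
    return topic_arn

# Example call (requires AWS credentials; names are placeholders):
# import boto3
# arn = create_topic_with_email(boto3.client('sns'), 'BatchRunnableJobs', 'you@example.com')
# print(arn)
```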

2.    Create a Lambda function based on the following code.

Important: In the Lambda function that you created, add an environment variable with the key NotificationTopicARN. For the value, enter your SNS topic ARN (for example, arn:aws:sns:us-east-1:111122223333:MonitorBatchJobs-BatchJobSNSNotificationTopic-UZ11C85BGW8Q).

#!/usr/bin/env python
import datetime
import json
import os

import boto3

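def lambda_handler(event, context):
  """ Lambda Handler Function """
  # The threshold comes from the rule's constant input, for example {"seconds":180}
  seconds = int(event['seconds'])
  # Initialize result so that the return statement works even if the lookup fails
  result = []
  try:
    result = get_all_runnable_jobs(seconds)
    if result:
      notify_to_sns(seconds)
  except Exception as error:
    print("An exception occurred while calling get_all_runnable_jobs():", error)
  return result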

def notify_to_sns(sec):
  """ Notify through SNS """
  try:
    client = boto3.client('sns')
    runnableJobs = get_all_runnable_jobs(sec)
    # The topic ARN comes from the NotificationTopicARN environment variable
    topic = os.environ['NotificationTopicARN']
    response = client.publish(TopicArn=topic, Message=json.dumps(runnableJobs))
    print("Notification delivered", response)
  except Exception as error:
    print("An error occurred while sending SNS notification:", error)

def get_all_jobqueues():
  """ Get all the Job queues """
  try:
    client = boto3.client('batch')
    response = client.describe_job_queues()
    # Collect the ARN of each job queue in the response
    return [str(queue['jobQueueArn']) for queue in response['jobQueues']]
  except Exception as error:
    print("An error occurred while getting all job queues:", error)
    # Return an empty list so that callers can still iterate over the result
    return []

def get_all_runnable_jobs(sec):
  """ Get all the runnable jobs under each queue """
  try:
    client = boto3.client('batch')
    myjobqueues = get_all_jobqueues()
    allrunnablejobs = []
    for jobqueue in myjobqueues:
      queuejobs = client.list_jobs(jobQueue=jobqueue, jobStatus="RUNNABLE")
      for job in queuejobs['jobSummaryList']:
        # Build a new dictionary for each job so that entries don't overwrite each other
        myrunnablejob = {}
        myrunnablejob["jobQueueName"] = jobqueue
        myrunnablejob["status"] = job['status']
        myrunnablejob["jobId"] = job['jobId']
        # createdAt is the job creation time in milliseconds since the epoch
        created = datetime.datetime.fromtimestamp(job['createdAt']/1000.0)
        currenttime = datetime.datetime.now()
        # total_seconds() includes whole days; the .seconds attribute wraps every 24 hours
        myrunnablejob["JobStuckInSeconds"] = int((currenttime - created).total_seconds())
        if myrunnablejob["JobStuckInSeconds"] > sec:
          allrunnablejobs.append(myrunnablejob)
    return allrunnablejobs
  except Exception as error:
    print("Exception while calling get_all_runnable_jobs():", error)
    return []
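
Note that list_jobs returns results in pages, so a single call might not return every RUNNABLE job in a busy queue. The following is a minimal sketch of following the nextToken value until the listing is complete. The helper name list_all_runnable_jobs is hypothetical and is not part of the function above; the client is passed in as a parameter.

```python
def list_all_runnable_jobs(batch_client, job_queue):
    """Collect every RUNNABLE job summary in a queue, following nextToken.

    batch_client: a boto3 Batch client, for example boto3.client('batch').
    """
    jobs = []
    kwargs = {'jobQueue': job_queue, 'jobStatus': 'RUNNABLE'}
    while True:
        page = batch_client.list_jobs(**kwargs)
        jobs.extend(page.get('jobSummaryList', []))
        token = page.get('nextToken')
        if not token:
            # No token means that this was the last page
            return jobs
        kwargs['nextToken'] = token
```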

Note: The Lambda function requires an AWS Identity and Access Management (IAM) policy to work. If you create the policy manually, attach the policy to the Lambda role when you create the Lambda function. If you're using an AWS CloudFormation stack, the stack creates the policy for you.
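
As an illustration, the following sketch builds a policy document that covers the three API calls made by the function in step 2: describe_job_queues, list_jobs, and publish. This is an assumption based on that code, not the policy from the AWS CloudFormation template, and the helper name build_monitor_policy is hypothetical. The function's print output also needs permission to write to Amazon CloudWatch Logs, which is commonly granted through the AWSLambdaBasicExecutionRole managed policy.

```python
import json

def build_monitor_policy(topic_arn):
    """Return an IAM policy document (as a dict) for the monitoring function."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read-only Batch calls used to find RUNNABLE jobs
                "Effect": "Allow",
                "Action": ["batch:DescribeJobQueues", "batch:ListJobs"],
                "Resource": "*",
            },
            {
                # Publishing is limited to the notification topic
                "Effect": "Allow",
                "Action": "sns:Publish",
                "Resource": topic_arn,
            },
        ],
    }

# Placeholder ARN for illustration only
example_arn = "arn:aws:sns:us-east-1:111122223333:ExampleTopic"
print(json.dumps(build_monitor_policy(example_arn), indent=2))
```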

3.    Create an Amazon CloudWatch Events rule to trigger the Lambda function.

Important: For your target, choose the Lambda function that you created in step 2. For Configure input, choose Constant (JSON text), and then enter {"seconds":180} in the text box. This constant is the argument passed into the Python function.
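
The same rule can also be sketched with boto3. The helper names, the rule name, the target Id, and the 5-minute schedule below are assumptions for illustration; choose a schedule that suits your workload. Both clients are passed in as parameters, so nothing runs against your account until you call create_rule with real clients.

```python
import json

def build_rule_requests(rule_name, function_arn, seconds=180,
                        schedule='rate(5 minutes)'):
    """Build the put_rule and put_targets parameters for the schedule."""
    put_rule = {
        'Name': rule_name,
        'ScheduleExpression': schedule,
        'State': 'ENABLED',
    }
    put_targets = {
        'Rule': rule_name,
        'Targets': [{
            'Id': 'stuck-job-monitor',  # hypothetical target Id
            'Arn': function_arn,
            # Constant JSON input, passed to lambda_handler as the event
            'Input': json.dumps({'seconds': seconds}),
        }],
    }
    return put_rule, put_targets

def create_rule(events_client, lambda_client, rule_name, function_arn, **kwargs):
    """Create the rule, allow it to invoke the function, and attach the target."""
    put_rule, put_targets = build_rule_requests(rule_name, function_arn, **kwargs)
    rule_arn = events_client.put_rule(**put_rule)['RuleArn']
    # Without this permission, the rule cannot invoke the Lambda function
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=rule_name + '-invoke',
        Action='lambda:InvokeFunction',
        Principal='events.amazonaws.com',
        SourceArn=rule_arn,
    )
    events_client.put_targets(**put_targets)
    return rule_arn

# Example call (requires AWS credentials; names are placeholders):
# import boto3
# create_rule(boto3.client('events'), boto3.client('lambda'),
#             'MonitorRunnableJobs', '<your Lambda function ARN>')
```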

Create a monitoring rule with an AWS CloudFormation template

1.    Open the AWS CloudFormation console.

2.    Choose Create Stack.

3.    In the Amazon S3 template URL field, enter https://cf-templates-1ad3q9ggd0m6y-us-east-1.s3.amazonaws.com/2019312cso-Batch_lambda_python_newversion.yaml, and then choose Next.

4.    Complete the rest of the steps in the setup wizard, and then choose Create.

5.    After the stack is created and you receive an email from Amazon SNS, choose Confirm subscription in the email.

