AWS HPC Blog

Slurm REST API in AWS ParallelCluster

This post was contributed by Sean Smith, Sr HPC Solution Architect, and Ryan Kilpadi, SDE Intern, HPC

AWS ParallelCluster offers powerful compute capabilities for problems ranging from discovering new drugs, to designing F1 race cars, to predicting the weather. In all these cases there’s a need for a human to sit in the loop – maybe an engineer running a simulation or perhaps a scientist submitting their lab results for analysis.

In this post we’ll show you how to programmatically submit and monitor jobs using the open-source Slurm REST API. This allows ParallelCluster to be integrated into an automated system via API calls. For example, this could mean that whenever a genome sample is read from a sequencer, it’s automatically fed through a secondary analysis pipeline to align the individual reads, or when new satellite data lands in an Amazon S3 bucket, it triggers a job to create the latest weather forecast.

Today, we’ll show how to set this up with AWS ParallelCluster. We’ll also link to a GitHub repository with code you can use and show examples of how to call the API using both curl and Python.

Architecture

This diagram shows an example cluster architecture with the Slurm REST API. The REST API runs on the HeadNode and submits jobs to the compute queues. The credentials used to authenticate with the API are stored in AWS Secrets Manager. The compute queues shown are examples only: the cluster can be configured with any instance configuration you desire.

Figure 1 – The REST API runs on the HeadNode and submits jobs to the compute queues. The credentials used to authenticate with the API are stored in AWS Secrets Manager. The compute queues shown are examples only: the cluster can be configured with any instance configuration you desire.

For this tutorial, we’ll be using ParallelCluster UI to set up our cluster with the Slurm REST API enabled. To set up ParallelCluster UI, refer to our online documentation. If you’d rather use the ParallelCluster CLI, see the example YAML configuration in Step 4.

Step 1 – Create a Security Group to allow inbound API requests

By default, your cluster will not be able to accept incoming HTTPS requests to the REST API. You will need to create a security group to allow traffic from outside the cluster to call the API.

  1. Navigate to the EC2 Security Group console and choose Create security group.
  2. Under Security group name, enter Slurm REST API (or another name of your choosing).
  3. Ensure the VPC matches your cluster’s VPC.
  4. Add an Inbound rule, select HTTPS under Type, and change the Source to only the CIDR range that you want to have access. For example, you can use the CIDR associated with your VPC to restrict access to traffic from within your VPC.
  5. Choose Create security group.
Figure 2 – Create your security group, adding a VPC and an inbound rule to allow only HTTPS connections from a specific CIDR range

Step 2 – Add Additional IAM Permissions

If you’re using the AWS ParallelCluster UI, follow the instructions in the ParallelCluster UI tutorial under the section ‘Setup IAM Permissions’.

Step 3 – Configure your Cluster

  1. In your cluster configuration, go to the HeadNode section > Advanced options > Additional Security Groups and add the Slurm REST API security group you created in Step 1. Then, under Scripts > On node configured, add the following script:
https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/rest-api/postinstall.sh
Figure 4 – Add the script to be run after node configuration

  2. Under Additional IAM permissions, add the policy:
arn:aws:iam::aws:policy/SecretsManagerReadWrite
Figure 5 – Add IAM policy to allow updates to AWS SecretsManager. This is needed to automatically refresh the JSON Web Token (JWT).

  3. Create your cluster.

Step 4 – Validate the configuration

Your configuration file should look something like the text that follows. If you opted to use the CLI instead of the UI, you will need to replace the placeholder values (the subnet IDs, the additional security group ID, and the SSH key name) with your own:

Imds:
  ImdsSupport: v1.0
HeadNode:
  InstanceType: c5.xlarge
  Imds:
    Secured: true
  Ssh:
    KeyName: amzn2
  LocalStorage:
    RootVolume:
      VolumeType: gp3
  Networking:
    SubnetId: subnet-xxxxxxxxxxxxxx
    AdditionalSecurityGroups:
      - sg-slurmrestapixxxxxxxxxx
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/SecretsManagerReadWrite
  CustomActions:
    OnNodeConfigured:
      Script: >-
        https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/rest-api/postinstall.sh
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue-1
      ComputeResources:
        - Name: queue-1-cr-1
          Instances:
            - InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 4
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxxxxxxxx
Region: us-east-2
Image:
  Os: alinux2

Step 5 – Call the API

  1. Log in to a machine on the same network that you allowed via the Security Group in Step 1. Make sure this machine is able to talk to the HeadNode.
ssh username@ip
  2. Set the following environment variable:
export CLUSTER_NAME=[name of cluster]
  3. Gather the information needed to construct an API request. We’ll need a few pieces of information:
    • JWT token: The post-install script will have created a secret in AWS Secrets Manager under the name slurm_token_$CLUSTER_NAME. Use either the AWS console or the AWS CLI to find your secret based on the cluster name:
export JWT=$(aws secretsmanager get-secret-value --secret-id slurm_token_$CLUSTER_NAME | jq -r '.SecretString')
NOTE: Since the Slurm REST API script is not integrated into ParallelCluster, this secret will not be automatically deleted along with the cluster. You may want to remove it manually on cluster deletion.
    • Head node public IP: This can be found in your Amazon EC2 dashboard or by using the ParallelCluster CLI:
export HEADNODE_IP=$(pcluster describe-cluster-instances -n $CLUSTER_NAME | jq -r '.instances[0].publicIpAddress')
    • Cluster user: This depends on your AMI, but it will usually be either ec2-user, ubuntu, or centos.
export CLUSTER_USER=ec2-user
  4. Call the API using curl:
curl -H "X-SLURM-USER-NAME: $CLUSTER_USER" -H "X-SLURM-USER-TOKEN: $JWT" https://$HEADNODE_IP/slurm/v0.0.39/ping -k

You’ll get a response back like:

{
    "meta": {
        "plugin": {
            "type": "openapi\/v0.0.39",
            "name": "REST v0.0.39"
        },
        "Slurm": {
            "version": {
                "major": 23,
                "micro": 2,
                "minor": 2
            },
            "release": "23.02.2"
        }
    },
    ...
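When scripting against these endpoints, jq (already used above to retrieve the JWT) makes it easy to pull individual fields out of the responses. Here is a minimal, self-contained sketch that extracts the Slurm release; the JSON variable is a trimmed stand-in for the real /ping output:

```shell
# Parse the Slurm release out of a ping response.
# RESPONSE here is a trimmed stand-in for the actual API output;
# in practice you would capture the output of the curl command above.
RESPONSE='{"meta":{"Slurm":{"version":{"major":23,"micro":2,"minor":2},"release":"23.02.2"}}}'
echo "$RESPONSE" | jq -r '.meta.Slurm.release'
# prints: 23.02.2
```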
  • Submit a job using the API. Specify the job parameters using JSON. You may need to modify the standard output and error directories depending on the cluster user.
  • Post the job to the API:
curl -H "X-SLURM-USER-NAME: $CLUSTER_USER" -H "X-SLURM-USER-TOKEN: $JWT" -X POST https://$HEADNODE_IP/slurm/v0.0.39/job/submit -H "Content-Type: application/json" -d @testjob.json -k
  • Verify that the job is running:
curl -H "X-SLURM-USER-NAME: $CLUSTER_USER" -H "X-SLURM-USER-TOKEN: $JWT" https://$HEADNODE_IP/slurm/v0.0.39/jobs -k
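The testjob.json file referenced above might look like the following. This is a minimal sketch of a v0.0.39 job-submit payload: the job name, working directory, output paths, and script contents are placeholders you should adapt to your cluster user, and the exact schema varies between slurmrestd versions, so consult the OpenAPI spec served by your cluster if a field is rejected.

```json
{
    "job": {
        "name": "test-job",
        "ntasks": 1,
        "current_working_directory": "/home/ec2-user",
        "standard_output": "/home/ec2-user/test-job.out",
        "standard_error": "/home/ec2-user/test-job.err",
        "environment": ["PATH=/bin:/usr/bin:/usr/local/bin"]
    },
    "script": "#!/bin/bash\nsrun hostname\nsleep 30"
}
```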

Calling the API using the Python requests library

  1. Create a script called slurmapi.py with the following contents:
#!/usr/bin/env python3
import argparse
import boto3
import requests
import json

# Create argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--cluster-name', type=str, required=True)
parser.add_argument('-u', '--cluster-user', type=str, required=False)
subparsers = parser.add_subparsers(dest='command', required=True)

diag_parser = subparsers.add_parser('diag', help="Get diagnostics")
ping_parser = subparsers.add_parser('ping', help="Ping test")

submit_job_parser = subparsers.add_parser('submit-job', help="Submit a job")
submit_job_parser.add_argument('-j', '--job', type=str, required=True)

list_jobs_parser = subparsers.add_parser('list-jobs', help="List active jobs")

describe_job_parser = subparsers.add_parser('describe-job', help="Describe a job by id")
describe_job_parser.add_argument('-j', '--job-id', type=int, required=True)

cancel_parser = subparsers.add_parser('cancel-job', help="Cancel a job")
cancel_parser.add_argument('-j', '--job-id', type=int, required=True)

args = parser.parse_args()

# Get JWT token
client = boto3.client('secretsmanager')
boto_response = client.get_secret_value(SecretId=f'slurm_token_{args.cluster_name}')
jwt_token = boto_response['SecretString']

# Get cluster headnode IP
client = boto3.client('ec2')
filters = [{'Name': 'tag:parallelcluster:cluster-name', 'Values': [args.cluster_name]}]
boto_response = client.describe_instances(Filters=filters)
headnode_ip = boto_response['Reservations'][0]['Instances'][0]['PublicIpAddress']

# The REST API base URL. slurmrestd is served over HTTPS with a
# self-signed certificate, hence verify=False on the requests below.
url = f'https://{headnode_ip}/slurm/v0.0.39'
headers = {'X-SLURM-USER-TOKEN': jwt_token}
if args.cluster_user:
    headers['X-SLURM-USER-NAME'] = args.cluster_user

# Make request
if args.command == 'ping':
    r = requests.get(f'{url}/ping', headers=headers, verify=False)
elif args.command == 'diag':
    r = requests.get(f'{url}/diag', headers=headers, verify=False)
elif args.command == 'submit-job':
    with open(args.job) as job_file:
        job_json = json.load(job_file)
    r = requests.post(f'{url}/job/submit', headers=headers, json=job_json, verify=False)
elif args.command == 'list-jobs':
    r = requests.get(f'{url}/jobs', headers=headers, verify=False)
elif args.command == 'describe-job':
    r = requests.get(f'{url}/job/{args.job_id}', headers=headers, verify=False)
elif args.command == 'cancel-job':
    r = requests.delete(f'{url}/job/{args.job_id}', headers=headers, verify=False)

print(r.text)
  2. Submit a job (note that the top-level -n and -u flags must come before the subcommand):
./slurmapi.py -n [cluster_name] -u [cluster_user] submit-job -j testjob.json
  3. Get help and more information:
./slurmapi.py -h
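If you’d rather generate the submission payload programmatically than maintain a static testjob.json, a small helper can build it for you. This is a sketch under the same assumptions as the curl example: the field names follow the v0.0.39 job-submit schema, but schemas differ between slurmrestd versions, and the paths and script here are placeholders.

```python
#!/usr/bin/env python3
import json

def build_job_payload(name, script, workdir, ntasks=1):
    """Build a minimal Slurm REST v0.0.39 job-submit payload (sketch).

    Field names follow the v0.0.39 schema; check the OpenAPI spec
    served by your cluster's slurmrestd before relying on them.
    """
    return {
        "job": {
            "name": name,
            "ntasks": ntasks,
            "current_working_directory": workdir,
            # environment is a list of KEY=VALUE strings in v0.0.39
            "environment": ["PATH=/bin:/usr/bin:/usr/local/bin"],
        },
        "script": script,
    }

if __name__ == "__main__":
    payload = build_job_payload("hello", "#!/bin/bash\nsrun hostname", "/home/ec2-user")
    # Write the payload to a file suitable for `submit-job -j testjob.json`
    with open("testjob.json", "w") as f:
        json.dump(payload, f, indent=4)
```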

Conclusion

Setting up the Slurm REST API allows you to programmatically control the cluster, which makes it possible to build the cluster into an automated workflow. This enables use cases such as automated secondary analysis of genomics data, risk analysis in financial markets, and weather prediction, among a myriad of others. We’re excited to see what you build. Drop us a line on Twitter to showcase what you come up with.

Ryan Kilpadi

Ryan Kilpadi is a returning SDE intern on the HPC team working on AWS ParallelCluster. He worked on implementing the Slurm REST API on ParallelCluster as a summer internship project in 2022.