AWS HPC Blog

Slurm REST API in AWS ParallelCluster

This post was contributed by Sean Smith, Sr HPC Solution Architect, and Ryan Kilpadi, SDE Intern, HPC

AWS ParallelCluster offers powerful compute capabilities for problems ranging from discovering new drugs, to designing F1 race cars, to predicting the weather. In all these cases there’s a need for a human to sit in the loop – maybe an engineer running a simulation or perhaps a scientist submitting their lab results for analysis.

In this post we’ll show you how to programmatically submit and monitor jobs using the open-source Slurm REST API. This allows ParallelCluster to be integrated into an automated system via API calls. For example, this could mean that whenever a genome sample is read from a sequencer, it’s automatically fed through a secondary analysis pipeline to align the individual reads, or when new satellite data lands in an Amazon S3 bucket, it triggers a job to create the latest weather forecast.

Today, we’ll show how to set this up with AWS ParallelCluster. We’ll also link to a GitHub repository with code you can use and show examples of how to call the API using both curl and Python.

Architecture

This diagram shows an example cluster architecture with the Slurm REST API. The REST API runs on the HeadNode and submits jobs to the compute queues. The credentials used to authenticate with the API are stored in AWS Secrets Manager. The compute queues shown are examples only: the cluster can be configured with any instance configuration you desire.

Figure 1 – The REST API runs on the HeadNode and submits jobs to the compute queues. The credentials used to authenticate with the API are stored in AWS Secrets Manager. The compute queues shown are examples only; the cluster can be configured with any instance configuration you desire.

In this tutorial, we will be using ParallelCluster UI to set up our cluster with the Slurm REST API enabled. To set up ParallelCluster UI, refer to our online documentation. If you would rather use the ParallelCluster CLI, see the example YAML configuration in Step 5 and skip Steps 3 and 4 (you will still need the security group from Step 2).

Step 1 – Set up Slurm Accounting

Slurm Accounting is required to enable the Slurm REST API. Follow the instructions in our workshop to set up an accounting database but don’t start cluster creation just yet. If you would rather use the CLI, refer to our online documentation instead, where we also have more information on using Slurm accounting with AWS ParallelCluster.
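
Once the cluster is up (after Step 4), a quick way to sanity-check that accounting is working is to run Slurm's accounting commands on the head node; both should return without errors if slurmdbd can reach the database:

# Confirm the cluster is registered with the accounting database
sacctmgr show cluster
# List job accounting records (empty output is expected on a fresh cluster)
sacct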

Step 2 – Create a Security Group to allow inbound API requests

By default, your cluster will not be able to accept incoming HTTPS requests to the REST API. You will need to create a security group to allow traffic from outside the cluster to call the API.

  1. Navigate to the EC2 Security Group console and choose Create security group.
  2. Under Security group name, enter Slurm REST API (or another name of your choosing).
  3. Ensure the VPC matches your cluster’s VPC.
  4. Add an Inbound rule: select HTTPS under Type, then set the Source to the CIDR range that you want to allow access from. For example, you can use the CIDR associated with your VPC to restrict access to within your VPC.
  5. Choose Create security group.
Figure 2 – Create your security group, adding a VPC and an inbound rule to allow only HTTPS connections from a specific CIDR
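
If you prefer to script this step, a roughly equivalent setup using the AWS CLI looks like the sketch below; the VPC ID and CIDR range are placeholders you will need to replace with your own values:

# Create the security group in your cluster's VPC (placeholder VPC ID)
SG_ID=$(aws ec2 create-security-group --group-name "Slurm REST API" \
  --description "Allow HTTPS to the Slurm REST API" \
  --vpc-id vpc-xxxxxxxxxxxxxxxxx --query GroupId --output text)
# Allow inbound HTTPS only from the CIDR range you trust, e.g. your VPC's CIDR
aws ec2 authorize-security-group-ingress --group-id $SG_ID \
  --protocol tcp --port 443 --cidr 10.0.0.0/16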

Step 3 – Add Additional IAM Permissions

If you’re using the AWS ParallelCluster UI, follow the instructions in the ‘Setup IAM Permissions’ section of the ParallelCluster UI tutorial.

Step 4 – Configure your Cluster

  1. In your cluster configuration, return to the HeadNode section > Advanced options > Additional Security Groups and add the Slurm REST API security group you created in Step 2. Then, under Scripts > On node configured, add the following script:
https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/rest-api/postinstall.sh
Figure 4 – Add the script to be run after node configuration

  2. Under Additional IAM permissions, add the policy:
arn:aws:iam::aws:policy/SecretsManagerReadWrite
Figure 5 – Add IAM policy to allow updates to AWS Secrets Manager. This is needed to automatically refresh the JSON Web Token (JWT).

  3. Create your cluster.

Step 5 – Validate the configuration

Your configuration file should look something like this. Slurm accounting configuration is not included here, as there are a few different ways it can be set up (see Step 1). If you opted to use the CLI instead of the UI, you will need to replace the placeholder values (the subnet IDs, the SSH key name, and the additional security group from Step 2) with your own before creating the cluster.

Imds:
  ImdsSupport: v1.0
HeadNode:
  InstanceType: c5.xlarge
  Imds:
    Secured: true
  Ssh:
    KeyName: amzn2
  LocalStorage:
    RootVolume:
      VolumeType: gp3
  Networking:
    SubnetId: subnet-xxxxxxxxxxxxxx
    AdditionalSecurityGroups:
      - sg-slurmrestapixxxxxxxxxx
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/SecretsManagerReadWrite
  CustomActions:
    OnNodeConfigured:
      Script: >-
        https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/rest-api/postinstall.sh
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue-1
      ComputeResources:
        - Name: queue-1-cr-1
          Instances:
            - InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 4
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxxxxxxxx
Region: us-east-2
Image:
  Os: alinux2
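
If you're using the CLI instead of the UI, you can save this configuration to a file (for example, cluster-config.yaml; the file name is up to you) and create the cluster with something like:

# Create the cluster from the configuration above
pcluster create-cluster --cluster-name mycluster \
  --cluster-configuration cluster-config.yaml --region us-east-2
# Watch the cluster status until it reaches CREATE_COMPLETE
pcluster describe-cluster --cluster-name mycluster --region us-east-2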

Step 6 – Call the API

  1. Log in to a machine on the same network that you allowed via the Security Group in Step 2. Make sure this machine is able to talk to the HeadNode.
ssh username@ip
  2. Set the following environment variable:
export CLUSTER_NAME=[name of cluster]
  3. Find the information needed to call the API and construct an API request. To do this, we’ll need a few pieces of information (if you have trouble with the following commands, you may need to specify a region):
    • JWT token: The post-install script will have created a secret in AWS Secrets Manager under the name slurm_token_$CLUSTER_NAME. Note that the head node rotates this secret every 20 minutes, so a retrieved token will not remain valid indefinitely. Either use the AWS console or the AWS CLI to find your secret based on the cluster name:
export JWT=$(aws secretsmanager get-secret-value --secret-id slurm_token_$CLUSTER_NAME | jq -r '.SecretString')
NOTE: Since the Slurm REST API script is not integrated into ParallelCluster, this secret will not be automatically deleted along with the cluster. You may want to remove it manually on cluster deletion.
    • Head node public IP: This can be found in your Amazon EC2 dashboard or by using the ParallelCluster CLI:
export HEADNODE_IP=$(pcluster describe-cluster-instances -n $CLUSTER_NAME | jq -r '.instances[0].publicIpAddress')
    • Cluster user: This depends on your AMI, but it will usually be ec2-user, ubuntu, or centos.
export CLUSTER_USER=ec2-user
  4. Call the API using curl:
curl -H "X-SLURM-USER-NAME: $CLUSTER_USER" -H "X-SLURM-USER-TOKEN: $JWT" https://$HEADNODE_IP/slurm/v0.0.39/ping -k

You’ll get a response back like:

{
    "meta": {
        "plugin": {
            "type": "openapi\/v0.0.39",
            "name": "REST v0.0.39"
        },
        "Slurm": {
            "version": {
                "major": 23,
                "micro": 2,
                "minor": 2
            },
            "release": "23.02.2"
        }
    }...
  5. Submit a job using the API. Specify the job parameters using JSON (an example testjob.json is shown below). You may need to modify the standard output and error paths depending on the cluster user.
    • Post the job to the API:
curl -H "X-SLURM-USER-NAME: $CLUSTER_USER" -H "X-SLURM-USER-TOKEN: $JWT" -X POST https://$HEADNODE_IP/slurm/v0.0.39/job/submit -H "Content-Type: application/json" -d @testjob.json -k
    • Verify that the job is running:
curl -H "X-SLURM-USER-NAME: $CLUSTER_USER" -H "X-SLURM-USER-TOKEN: $JWT" https://$HEADNODE_IP/slurm/v0.0.39/jobs -k

Calling the API using the Python requests library

  1. Create a script called slurmapi.py with the following contents:
#!/usr/bin/env python3
import argparse
import boto3
import requests
import json

# Create argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--cluster-name', type=str, required=True)
parser.add_argument('-u', '--cluster-user', type=str, required=False)
parser.add_argument('-r', '--region', type=str, required=False)
subparsers = parser.add_subparsers(dest='command', required=True)

diag_parser = subparsers.add_parser('diag', help="Get diagnostics")
ping_parser = subparsers.add_parser('ping', help="Ping test")

submit_job_parser = subparsers.add_parser('submit-job', help="Submit a job")
submit_job_parser.add_argument('-j', '--job', type=str, required=True)

list_jobs_parser = subparsers.add_parser('list-jobs', help="List active jobs")

describe_job_parser = subparsers.add_parser('describe-job', help="Describe a job by id")
describe_job_parser.add_argument('-j', '--job-id', type=int, required=True)

cancel_parser = subparsers.add_parser('cancel-job', help="Cancel a job")
cancel_parser.add_argument('-j', '--job-id', type=int, required=True)

args = parser.parse_args()

if args.region:
    boto3.setup_default_session(region_name=args.region)

# Get JWT token
client = boto3.client('secretsmanager')
boto_response = client.get_secret_value(SecretId=f'slurm_token_{args.cluster_name}')
jwt_token = boto_response['SecretString']

# Get the cluster head node IP (filter on ParallelCluster tags so we don't pick up a compute node)
client = boto3.client('ec2')
filters = [
    {'Name': 'tag:parallelcluster:cluster-name', 'Values': [args.cluster_name]},
    {'Name': 'tag:parallelcluster:node-type', 'Values': ['HeadNode']},
]
boto_response = client.describe_instances(Filters=filters)
headnode_ip = boto_response['Reservations'][0]['Instances'][0]['PublicIpAddress']

url = f'https://{headnode_ip}/slurm/v0.0.39'
headers = {'X-SLURM-USER-TOKEN': jwt_token}
if args.cluster_user:
    headers['X-SLURM-USER-NAME'] = args.cluster_user

# Make request
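# Note: verify=False mirrors the -k flag in the curl examples above; the REST endpoint typically presents a self-signed certificate, so adjust this if you install a trusted one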
if args.command == 'ping':
    r = requests.get(f'{url}/ping', headers=headers, verify=False)
elif args.command == 'diag':
    r = requests.get(f'{url}/diag', headers=headers, verify=False)
elif args.command == 'submit-job':
    with open(args.job) as job_file:
        job_json = json.load(job_file)
    r = requests.post(f'{url}/job/submit', headers=headers, json=job_json, verify=False)
elif args.command == 'list-jobs':
    r = requests.get(f'{url}/jobs', headers=headers, verify=False)
elif args.command == 'describe-job':
    r = requests.get(f'{url}/job/{args.job_id}', headers=headers, verify=False)
elif args.command == 'cancel-job':
    r = requests.delete(f'{url}/job/{args.job_id}', headers=headers, verify=False)

print(r.text)
  2. Submit a job:
python3 slurmapi.py -n [cluster_name] -u [cluster_user] -r [region] submit-job -j testjob.json
  3. Get more information on the available subcommands:
python3 slurmapi.py -h
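
The other subcommands defined in the script follow the same pattern, for example:

# Quick connectivity check (equivalent to the curl ping call earlier)
python3 slurmapi.py -n [cluster_name] -r [region] ping
# List active jobs, then cancel one by its id
python3 slurmapi.py -n [cluster_name] -r [region] list-jobs
python3 slurmapi.py -n [cluster_name] -r [region] cancel-job -j [job_id]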

Conclusion

Setting up the Slurm REST API allows you to programmatically control the cluster, which makes it possible to build the cluster into an automated workflow. This enables new use cases such as automated secondary analysis of genomics data, risk analysis in financial markets, weather prediction, among a myriad of other use cases. We’re excited to see what you build. Drop us a line to tell us what you come up with.

Ryan Kilpadi

Ryan Kilpadi is a returning SDE intern on the HPC team working on AWS ParallelCluster. He worked on implementing the Slurm REST API on ParallelCluster as a summer internship project in 2022.