AWS HPC Blog
Slurm REST API in AWS ParallelCluster
This post was contributed by Sean Smith, Sr HPC Solution Architect, and Ryan Kilpadi, SDE Intern, HPC
AWS ParallelCluster offers powerful compute capabilities for problems ranging from discovering new drugs, to designing F1 race cars, to predicting the weather. In all these cases there’s a need for a human to sit in the loop – maybe an engineer running a simulation or perhaps a scientist submitting their lab results for analysis.
In this post we’ll show you how to programmatically submit and monitor jobs using the open-source Slurm REST API. This allows ParallelCluster to be integrated into an automated system via API calls. For example, this could mean that whenever a genome sample is read from a sequencer, it’s automatically fed through a secondary analysis pipeline to align the individual reads, or when new satellite data lands in an Amazon S3 bucket, it triggers a job to create the latest weather forecast.
Today, we’ll show how to set this up with AWS ParallelCluster. We’ll also link to a GitHub repository with code you can use and show examples of how to call the API using both curl and Python.
Architecture
This diagram shows an example cluster architecture with the Slurm REST API. The REST API runs on the HeadNode and submits jobs to the compute queues. The credentials used to authenticate with the API are stored in AWS Secrets Manager. The compute queues shown are examples only: the cluster can be configured with any instance configuration you desire.
In this tutorial, we will be using ParallelCluster UI to set up our cluster with the Slurm REST API enabled. To set up ParallelCluster UI, refer to our online documentation. If you would rather use the ParallelCluster CLI, see the example YAML configuration in Step 5 and skip the UI-specific Steps 3 and 4 (you will still need the security group from Step 2).
Step 1 – Set up Slurm Accounting
Slurm Accounting is required to enable the Slurm REST API. Follow the instructions in our workshop to set up an accounting database but don’t start cluster creation just yet. If you would rather use the CLI, refer to our online documentation instead, where we also have more information on using Slurm accounting with AWS ParallelCluster.
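For reference, the accounting settings end up under the Scheduling section of your cluster configuration. Here's a minimal sketch, assuming an Amazon RDS for MySQL endpoint and a Secrets Manager secret holding the database password; the endpoint, user name, and ARN are placeholders to substitute with your own:

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Database:
      Uri: slurm-accounting-db.xxxxxxxxxx.us-east-2.rds.amazonaws.com:3306
      UserName: clusteradmin
      PasswordSecretArn: arn:aws:secretsmanager:us-east-2:123456789012:secret:slurm-db-password-xxxxxx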
Step 2 – Create a Security Group to allow inbound API requests
By default, your cluster will not be able to accept incoming HTTPS requests to the REST API. You will need to create a security group to allow traffic from outside the cluster to call the API.
- Navigate to the EC2 Security Group console and choose Create security group.
- Under Security group name, enter Slurm REST API (or another name of your choosing).
- Ensure the VPC matches your cluster’s VPC.
- Add an Inbound rule: select HTTPS under Type, then change the Source to only the CIDR range that you want to have access. For example, you can use the CIDR associated with your VPC to restrict access to within your VPC.
- Choose Create security group.
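If you prefer to script this step, you can create the same security group with the AWS CLI. This is a sketch assuming your VPC ID and its CIDR range; substitute your own values:

# Create the security group in the cluster's VPC
aws ec2 create-security-group \
    --group-name "Slurm REST API" \
    --description "Allow inbound HTTPS to the Slurm REST API" \
    --vpc-id vpc-xxxxxxxxxxxxxxxxx

# Allow HTTPS (port 443) only from your VPC's CIDR range
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxxxxxxxxxxx \
    --protocol tcp --port 443 \
    --cidr 10.0.0.0/16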
Step 3 – Add Additional IAM Permissions
If you’re using the AWS ParallelCluster UI, please follow the instructions under the ParallelCluster UI Tutorial section ‘G’ – Setup IAM Permissions.
Step 4 – Configure your Cluster
- In your cluster configuration, return to the HeadNode section > Advanced options > Additional Security Groups and add the Slurm REST API security group you created in Step 2.
- Under Scripts > On node configured, add the following script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/rest-api/postinstall.sh
- Under Additional IAM permissions, add the policy: arn:aws:iam::aws:policy/SecretsManagerReadWrite
- Create your cluster.
Step 5 – Validate the configuration
Your configuration file should look something like this. Slurm accounting configuration is not included here as there are a few different ways this can be set up (see Step 1). If you opted to use the CLI instead of the UI, you will need to set:
- AdditionalSecurityGroups – this should contain an additional security group that allows connections to the Slurm REST API (Step 2).
- OnNodeConfigured – this should reference the post-install script: https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/rest-api/postinstall.sh
Imds:
ImdsSupport: v1.0
HeadNode:
InstanceType: c5.xlarge
Imds:
Secured: true
Ssh:
KeyName: amzn2
LocalStorage:
RootVolume:
VolumeType: gp3
Networking:
SubnetId: subnet-xxxxxxxxxxxxxx
AdditionalSecurityGroups:
- sg-slurmrestapixxxxxxxxxx
Iam:
AdditionalIamPolicies:
- Policy: arn:aws:iam::aws:policy/SecretsManagerReadWrite
CustomActions:
OnNodeConfigured:
Script: >-
https://raw.githubusercontent.com/aws-samples/aws-parallelcluster-post-install-scripts/main/rest-api/postinstall.sh
Scheduling:
Scheduler: slurm
SlurmQueues:
- Name: queue-1
ComputeResources:
- Name: queue-1-cr-1
Instances:
- InstanceType: c5.xlarge
MinCount: 0
MaxCount: 4
ComputeSettings:
LocalStorage:
RootVolume:
VolumeType: gp3
Networking:
SubnetIds:
- subnet-xxxxxxxxxxxxxxxxxx
Region: us-east-2
Image:
Os: alinux2
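If you're using the CLI, save the configuration to a file (we'll call it cluster-config.yaml here; the cluster name below is also illustrative) and create the cluster:

pcluster create-cluster \
    --cluster-name my-rest-api-cluster \
    --cluster-configuration cluster-config.yaml \
    --region us-east-2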
Step 6 – Call the API
- Log in to a machine on the same network that you allowed via the Security Group in Step 2. Make sure this machine is able to talk to the HeadNode.
ssh username@ip
- Set the following environment variable:
export CLUSTER_NAME=[name of cluster]
- Find the information needed to call the API and construct an API request. To do this, we’ll need a few pieces of information (if the following commands can’t find your resources, you may need to specify a region, for example by setting AWS_DEFAULT_REGION or passing --region):
- JWT token: The post install script will have created a secret in AWS Secrets Manager under the name slurm_token_$CLUSTER_NAME. Note that the head node rotates this secret every 20 minutes, so a retrieved token will not remain valid indefinitely. Either use the AWS console or the AWS CLI to find your secret based on the cluster name:
export JWT=$(aws secretsmanager get-secret-value --secret-id slurm_token_$CLUSTER_NAME | jq -r '.SecretString')
NOTE: Since the Slurm REST API script is not integrated into ParallelCluster, this secret will not be automatically deleted along with the cluster. You may want to remove it manually on cluster deletion.
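For example, after deleting the cluster you could remove the secret with the AWS CLI:

aws secretsmanager delete-secret --secret-id slurm_token_$CLUSTER_NAME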
- Head node public IP: This can be found in your Amazon EC2 dashboard or by using the ParallelCluster CLI:
export HEADNODE_IP=$(pcluster describe-cluster-instances -n $CLUSTER_NAME | jq -r '.instances[0].publicIpAddress')
- Cluster user: This depends on your AMI, but it will usually be ec2-user, ubuntu, or centos.
export CLUSTER_USER=ec2-user
- Call the API using curl:
curl -H "X-SLURM-USER-NAME: $CLUSTER_USER" -H "X-SLURM-USER-TOKEN: $JWT" https://$HEADNODE_IP/slurm/v0.0.39/ping -k
You’ll get a response back like:
{
"meta": {
"plugin": {
"type": "openapi\/v0.0.39",
"name": "REST v0.0.39"
},
"Slurm": {
"version": {
"major": 23,
"micro": 2,
"minor": 2
},
"release": "23.02.2"
}...
- Submit a job using the API. Specify the job parameters using JSON; you may need to modify the standard output and error directories depending on the cluster user (an example testjob.json is shown after these steps).
- Post a job to the API:
curl -H "X-SLURM-USER-TOKEN: $CLUSTER_USER" -H "X-SLURM-USER-TOKEN: $JWT" -X POST https://$IP/slurm/v0.0.39/job/submit -H "Content-Type: application/json" -d @testjob.json -k
- Verify that the job is running:
curl -H "X-SLURM-USER-NAME: $CLUSTER_USER" -H "X-SLURM-USER-TOKEN: $JWT" https://$HEADNODE_IP/slurm/v0.0.39/jobs -k
Calling the API using the Python requests library
- Create a script called slurmapi.py with the following contents:
#!/usr/bin/env python3
import argparse
import json

import boto3
import requests

# Create argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--cluster-name', type=str, required=True)
parser.add_argument('-u', '--cluster-user', type=str, required=False)
parser.add_argument('-r', '--region', type=str, required=False)
subparsers = parser.add_subparsers(dest='command', required=True)
diag_parser = subparsers.add_parser('diag', help="Get diagnostics")
ping_parser = subparsers.add_parser('ping', help="Ping test")
submit_job_parser = subparsers.add_parser('submit-job', help="Submit a job")
submit_job_parser.add_argument('-j', '--job', type=str, required=True)
list_jobs_parser = subparsers.add_parser('list-jobs', help="List active jobs")
describe_job_parser = subparsers.add_parser('describe-job', help="Describe a job by id")
describe_job_parser.add_argument('-j', '--job-id', type=int, required=True)
cancel_parser = subparsers.add_parser('cancel-job', help="Cancel a job")
cancel_parser.add_argument('-j', '--job-id', type=int, required=True)
args = parser.parse_args()

if args.region:
    boto3.setup_default_session(region_name=args.region)

# Get the JWT token from AWS Secrets Manager
client = boto3.client('secretsmanager')
boto_response = client.get_secret_value(SecretId=f'slurm_token_{args.cluster_name}')
jwt_token = boto_response['SecretString']

# Get the cluster head node IP by filtering on the ParallelCluster name tag
client = boto3.client('ec2')
filters = [{'Name': 'tag:parallelcluster:cluster-name', 'Values': [args.cluster_name]}]
boto_response = client.describe_instances(Filters=filters)
headnode_ip = boto_response['Reservations'][0]['Instances'][0]['PublicIpAddress']

url = f'https://{headnode_ip}/slurm/v0.0.39'
headers = {'X-SLURM-USER-TOKEN': jwt_token}
if args.cluster_user:
    headers['X-SLURM-USER-NAME'] = args.cluster_user

# Make the request (verify=False because the API uses a self-signed certificate)
if args.command == 'ping':
    r = requests.get(f'{url}/ping', headers=headers, verify=False)
elif args.command == 'diag':
    r = requests.get(f'{url}/diag', headers=headers, verify=False)
elif args.command == 'submit-job':
    with open(args.job) as job_file:
        job_json = json.load(job_file)
    r = requests.post(f'{url}/job/submit', headers=headers, json=job_json, verify=False)
elif args.command == 'list-jobs':
    r = requests.get(f'{url}/jobs', headers=headers, verify=False)
elif args.command == 'describe-job':
    r = requests.get(f'{url}/job/{args.job_id}', headers=headers, verify=False)
elif args.command == 'cancel-job':
    r = requests.delete(f'{url}/job/{args.job_id}', headers=headers, verify=False)

print(r.text)
- Submit a job (note that the cluster-level flags, including -u, come before the subcommand):
python3 slurmapi.py -n [cluster_name] -r [region] -u [cluster_user] submit-job -j testjob.json
- Get more information on the available commands:
python3 slurmapi.py -h
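If you want to build this into an automated pipeline, a natural next step is to submit a job and then poll its state until it finishes. Here's a minimal sketch; wait_for_job is a helper name we're introducing, it reuses url and headers built the same way as in slurmapi.py, and the job_state handling assumes the v0.0.39 response shape:

import time
import requests

def wait_for_job(url, headers, job_id, poll_seconds=15):
    """Poll the Slurm REST API until a job reaches a terminal state."""
    while True:
        r = requests.get(f'{url}/job/{job_id}', headers=headers, verify=False)
        r.raise_for_status()
        # In v0.0.39 the response contains a list of matching jobs, and
        # job_state is a plain string such as 'RUNNING' or 'COMPLETED'.
        state = r.json()['jobs'][0]['job_state']
        if state not in ('PENDING', 'CONFIGURING', 'RUNNING', 'COMPLETING'):
            return state
        time.sleep(poll_seconds)

# Example usage: print(wait_for_job(url, headers, 42))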
Conclusion
Setting up the Slurm REST API allows you to control the cluster programmatically, which makes it possible to build the cluster into an automated workflow. This enables use cases like automated secondary analysis of genomics data, risk analysis in financial markets, and weather prediction, among many others. We’re excited to see what you build. Drop us a line to tell us what you come up with.