AWS Open Source Blog
Amazon API Gateway for HPC job submission
AWS ParallelCluster simplifies the creation and the deployment of HPC clusters. Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale.
In this post we combine AWS ParallelCluster and Amazon API Gateway to allow an HTTP interaction with the scheduler. You can submit, monitor, and terminate jobs using the API, instead of connecting to the master node via SSH. This makes it possible to integrate ParallelCluster programmatically with other applications running on premises or on AWS.
The API uses AWS Lambda and AWS Systems Manager to execute the user commands without granting direct SSH access to the nodes, thus enhancing the security of the whole cluster.
VPC configuration
The VPC used for this configuration can be created using the VPC Wizard. You can also use an existing VPC that respects the AWS ParallelCluster network requirements.
In Select a VPC Configuration, choose VPC with Public and Private Subnets and then Select.
Before starting the VPC Wizard, allocate an Elastic IP Address. This will be used to configure a NAT gateway for the private subnet. A NAT gateway is required to enable compute nodes in the AWS ParallelCluster private subnet to download the required packages and to access the AWS services public endpoints. See AWS ParallelCluster network requirements.
You can find more details about the VPC creation and configuration options in VPC with Public and Private Subnets (NAT).
The example below uses the following configuration:
IPv4 CIDR block: 10.0.0.0/16
VPC name: Cluster VPC
Public subnet’s IPv4 CIDR: 10.0.0.0/24
Availability Zone: eu-west-1a
Public subnet name: Public subnet
Private subnet’s IPv4 CIDR: 10.0.1.0/24
Availability Zone: eu-west-1b
Private subnet name: Private subnet
Elastic IP Allocation ID: <id of the allocated Elastic IP>
Enable DNS hostnames: yes
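If you prefer to script this step instead of using the VPC Wizard, the following is a minimal boto3 sketch of the same layout. The CIDRs, Availability Zones, and Elastic IP allocation ID are the example values from the list above and should be adapted to your account.

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# VPC with DNS hostnames enabled
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})

# Public and private subnets
public_subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.0.0/24",
                                  AvailabilityZone="eu-west-1a")["Subnet"]["SubnetId"]
private_subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24",
                                   AvailabilityZone="eu-west-1b")["Subnet"]["SubnetId"]
ec2.modify_subnet_attribute(SubnetId=public_subnet, MapPublicIpOnLaunch={"Value": True})

# Internet gateway and default route for the public subnet
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
public_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=public_rt, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=public_rt, SubnetId=public_subnet)

# NAT gateway in the public subnet (uses the pre-allocated Elastic IP) and
# default route for the private subnet, so compute nodes can reach the internet
nat_id = ec2.create_nat_gateway(SubnetId=public_subnet,
                                AllocationId="<id of the allocated Elastic IP>")["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])
private_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=private_rt, DestinationCidrBlock="0.0.0.0/0", NatGatewayId=nat_id)
ec2.associate_route_table(RouteTableId=private_rt, SubnetId=private_subnet)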
AWS ParallelCluster configuration
AWS ParallelCluster is an open source cluster management tool to deploy and manage HPC clusters in the AWS cloud; to get started, see Installing AWS ParallelCluster.
After the AWS ParallelCluster command line has been configured, create the cluster template file below in .parallelcluster/config. The master_subnet_id parameter contains the ID of the created public subnet, compute_subnet_id contains the ID of the private one, and ec2_iam_role is the role that will be used for all the instances of the cluster. The steps to create this role are explained below.
[aws]
aws_region_name = eu-west-1
[cluster slurm]
scheduler = slurm
compute_instance_type = c5.large
initial_queue_size = 2
max_queue_size = 10
maintain_initial_size = false
base_os = alinux
key_name = AWS_Ireland
vpc_settings = public
ec2_iam_role = parallelcluster-custom-role
[vpc public]
master_subnet_id = subnet-01fc20e143543f8af
compute_subnet_id = subnet-0b1ae2790497d83ec
vpc_id = vpc-0cdee679c5a6163bd
[global]
update_check = true
sanity_check = true
cluster_template = slurm
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
IAM custom Roles for SSM endpoints
To allow ParallelCluster nodes to call Lambda and SSM endpoints, it is necessary to configure a custom IAM Role.
See AWS Identity and Access Management Roles in AWS ParallelCluster for details on the default AWS ParallelCluster policy.
From the AWS console:
- Access the AWS Identity and Access Management (IAM) service and click on Policies.
- Choose Create policy and paste the following policy into the JSON section. Be sure to modify <REGION> and <AWS ACCOUNT ID> to match the values for your account, and also update the S3 bucket name from pcluster-data to the bucket you want to use to store the input/output data of the jobs and the output of the SSM execution commands.
{
"Version": "2012-10-17",
"Statement": [
{
"Resource": [
"*"
],
"Action": [
"ec2:DescribeVolumes",
"ec2:AttachVolume",
"ec2:DescribeInstanceAttribute",
"ec2:DescribeInstanceStatus",
"ec2:DescribeInstances",
"ec2:DescribeRegions"
],
"Sid": "EC2",
"Effect": "Allow"
},
{
"Resource": [
"*"
],
"Action": [
"dynamodb:ListTables"
],
"Sid": "DynamoDBList",
"Effect": "Allow"
},
{
"Resource": [
"arn:aws:sqs:<REGION>:<AWS ACCOUNT ID>:parallelcluster-*"
],
"Action": [
"sqs:SendMessage",
"sqs:ReceiveMessage",
"sqs:ChangeMessageVisibility",
"sqs:DeleteMessage",
"sqs:GetQueueUrl"
],
"Sid": "SQSQueue",
"Effect": "Allow"
},
{
"Resource": [
"*"
],
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"autoscaling:SetDesiredCapacity",
"autoscaling:DescribeTags",
"autoScaling:UpdateAutoScalingGroup",
"autoscaling:SetInstanceHealth"
],
"Sid": "Autoscaling",
"Effect": "Allow"
},
{
"Resource": [
"arn:aws:dynamodb:<REGION>:<AWS ACCOUNT ID>:table/parallelcluster-*"
],
"Action": [
"dynamodb:PutItem",
"dynamodb:Query",
"dynamodb:GetItem",
"dynamodb:DeleteItem",
"dynamodb:DescribeTable"
],
"Sid": "DynamoDBTable",
"Effect": "Allow"
},
{
"Resource": [
"arn:aws:s3:::<REGION>-aws-parallelcluster/*"
],
"Action": [
"s3:GetObject"
],
"Sid": "S3GetObj",
"Effect": "Allow"
},
{
"Resource": [
"arn:aws:cloudformation:<REGION>:<AWS ACCOUNT ID>:stack/parallelcluster-*"
],
"Action": [
"cloudformation:DescribeStacks"
],
"Sid": "CloudFormationDescribe",
"Effect": "Allow"
},
{
"Resource": [
"*"
],
"Action": [
"sqs:ListQueues"
],
"Sid": "SQSList",
"Effect": "Allow"
},
{
"Effect": "Allow",
"Action": [
"ssm:DescribeAssociation",
"ssm:GetDeployablePatchSnapshotForInstance",
"ssm:GetDocument",
"ssm:DescribeDocument",
"ssm:GetManifest",
"ssm:GetParameter",
"ssm:GetParameters",
"ssm:ListAssociations",
"ssm:ListInstanceAssociations",
"ssm:PutInventory",
"ssm:PutComplianceItems",
"ssm:PutConfigurePackageResult",
"ssm:UpdateAssociationStatus",
"ssm:UpdateInstanceAssociationStatus",
"ssm:UpdateInstanceInformation"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2messages:AcknowledgeMessage",
"ec2messages:DeleteMessage",
"ec2messages:FailMessage",
"ec2messages:GetEndpoint",
"ec2messages:GetMessages",
"ec2messages:SendReply"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:*"
],
"Resource": [
"arn:aws:s3:::pcluster-data/*"
]
}
]
}
Choose Review policy and, in the next section, enter parallelcluster-custom-policy as the policy Name and choose Create policy.
Now you can create the Role. Choose Role in the left menu and then Create role.
Select AWS service as the type of trusted entity and EC2 as the service that will use this role, as shown here:
Choose Next: Permissions to proceed.
In the policy selection, select the parallelcluster-custom-policy that you just created.
Choose Next: Tags and then Next: Review.
In the Role name box, enter parallelcluster-custom-role and confirm by choosing Create role.
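If you prefer to perform these IAM steps programmatically, here is a minimal boto3 sketch of the same policy and role creation. It assumes the policy JSON above has been saved locally as parallelcluster-custom-policy.json; note that, unlike the console, it also creates the instance profile explicitly so that EC2 instances can use the role.

import json
import boto3

iam = boto3.client("iam")

# Customer-managed policy from the JSON document shown above
with open("parallelcluster-custom-policy.json") as f:
    policy_document = f.read()
policy_arn = iam.create_policy(
    PolicyName="parallelcluster-custom-policy",
    PolicyDocument=policy_document,
)["Policy"]["Arn"]

# Trust policy letting EC2 instances assume the role
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="parallelcluster-custom-role",
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)
iam.attach_role_policy(
    RoleName="parallelcluster-custom-role",
    PolicyArn=policy_arn,
)

# The console creates this instance profile automatically; with boto3 it is explicit
iam.create_instance_profile(InstanceProfileName="parallelcluster-custom-role")
iam.add_role_to_instance_profile(
    InstanceProfileName="parallelcluster-custom-role",
    RoleName="parallelcluster-custom-role",
)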
Slurm commands execution with AWS Lambda
AWS Lambda allows you to run your code without provisioning or managing servers. In this solution, Lambda is used to execute the Slurm commands on the Master node. The AWS Lambda function can be created from the AWS console as explained in the Create a Lambda Function with the Console documentation.
For Function name, enter slurmAPI.
For Runtime, select Python 2.7.
Choose Create function to create it.
The code below should be pasted into the Function code section, which you can see by scrolling further down the page. The Lambda function uses AWS Systems Manager to execute the scheduler commands, preventing any SSH access to the node. Please modify <REGION> appropriately, and update the S3 bucket name from pcluster-data to the name you chose earlier.
import boto3
import time
import json
import random
import string
def lambda_handler(event, context):
    # The Master node instance id and the requested function are passed as query string parameters
    instance_id = event["queryStringParameters"]["instanceid"]
    selected_function = event["queryStringParameters"]["function"]
    if selected_function == 'list_jobs':
        command = 'squeue'
    elif selected_function == 'list_nodes':
        command = 'scontrol show nodes'
    elif selected_function == 'list_partitions':
        command = 'scontrol show partitions'
    elif selected_function == 'job_details':
        jobid = event["queryStringParameters"]["jobid"]
        command = 'scontrol show jobs %s' % jobid
    elif selected_function == 'submit_job':
        # Copy the job script from S3 to the Master node with a random name
        script_name = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(10)])
        jobscript_location = event["queryStringParameters"]["jobscript_location"]
        command = 'aws s3 cp s3://%s %s.sh; chmod +x %s.sh' % (jobscript_location, script_name, script_name)
        s3_tmp_out = execute_command(command, instance_id)
        # Optional submission parameters are passed in the "submitopts" header
        submitopts = ''
        try:
            submitopts = event["headers"]["submitopts"]
        except Exception as e:
            submitopts = ''
        command = 'sbatch %s %s.sh' % (submitopts, script_name)
    body = execute_command(command, instance_id)
    return {
        'statusCode': 200,
        'body': body
    }

def execute_command(command, instance_id):
    bucket_name = 'pcluster-data'
    ssm_client = boto3.client('ssm', region_name="<REGION>")
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    username = 'ec2-user'
    # Run the command on the Master node as ec2-user through SSM Run Command
    response = ssm_client.send_command(
        InstanceIds=[
            "%s" % instance_id
        ],
        DocumentName="AWS-RunShellScript",
        OutputS3BucketName=bucket_name,
        OutputS3KeyPrefix="ssm",
        Parameters={
            'commands': [
                'sudo su - %s -c "%s"' % (username, command)
            ]
        },
    )
    command_id = response['Command']['CommandId']
    time.sleep(1)
    output = ssm_client.get_command_invocation(
        CommandId=command_id,
        InstanceId=instance_id,
    )
    # Poll until the command completes, fails, is cancelled, or times out
    while output['Status'] != 'Success':
        time.sleep(1)
        output = ssm_client.get_command_invocation(CommandId=command_id, InstanceId=instance_id)
        if (output['Status'] == 'Failed') or (output['Status'] == 'Cancelled') or (output['Status'] == 'TimedOut'):
            break
    # Read the command output back from the S3 bucket
    body = ''
    files = list(bucket.objects.filter(Prefix='ssm/%s/%s/awsrunShellScript/0.awsrunShellScript' % (command_id, instance_id)))
    for obj in files:
        key = obj.key
        body += obj.get()['Body'].read()
    return body
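To try the handler outside the Lambda console, a small local harness like the one below can be appended to the same file. The event mimics the shape that API Gateway's Lambda proxy integration passes to the function; the instance id and job script location are placeholders, and your local AWS credentials must be allowed to call Systems Manager against the Master node.

if __name__ == "__main__":
    # Hypothetical test event in the Lambda proxy integration format
    sample_event = {
        "queryStringParameters": {
            "instanceid": "i-0123456789abcdef0",
            "function": "submit_job",
            "jobscript_location": "pcluster-data/job_script.sh",
        },
        "headers": {
            "submitopts": "--job-name=TestJob --partition=compute",
        },
    }
    print(lambda_handler(sample_event, None))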
In the Basic settings section, set 10 seconds as Timeout.
Choose Save in the top right to save the function.
In the Execution role section, choose View the <role name> role on the IAM console (indicated by the red arrow in the image below); this opens the execution role that was created automatically for the function.
In the newly opened tab, choose Attach policies and then Create policy.
This last action opens a new tab in your browser. From this new tab, choose Create policy and then JSON.
Modify the <REGION> and <AWS ACCOUNT ID> appropriately, and also update the S3 bucket name from pcluster-data to the name you chose earlier.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ssm:SendCommand"
],
"Resource": [
"arn:aws:ec2:<REGION>:<AWS ACCOUNT ID>:instance/*",
"arn:aws:ssm:<REGION>::document/AWS-RunShellScript",
"arn:aws:s3:::pcluster-data/ssm"
]
},
{
"Effect": "Allow",
"Action": [
"ssm:GetCommandInvocation"
],
"Resource": [
"arn:aws:ssm:<REGION>:<AWS ACCOUNT ID>:*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:*"
],
"Resource": [
"arn:aws:s3:::pcluster-data",
"arn:aws:s3:::pcluster-data/*"
]
}
]
}
In the next section, enter ExecuteSlurmCommands as the policy Name and then choose Create policy.
Close the current tab and move to the previous one.
Refresh the list, select the ExecuteSlurmCommands policy and then Attach policy, as shown here:
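The same attachment can also be scripted with boto3, as in the sketch below. It assumes the policy JSON above has been saved locally as execute-slurm-commands.json and that <LAMBDA EXECUTION ROLE> is replaced with the role name shown in the function's Execution role section.

import boto3

iam = boto3.client("iam")

# Create the ExecuteSlurmCommands policy from the JSON document shown above
with open("execute-slurm-commands.json") as f:
    policy_document = f.read()
policy_arn = iam.create_policy(
    PolicyName="ExecuteSlurmCommands",
    PolicyDocument=policy_document,
)["Policy"]["Arn"]

# Attach it to the Lambda function's execution role
iam.attach_role_policy(
    RoleName="<LAMBDA EXECUTION ROLE>",
    PolicyArn=policy_arn,
)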
Execute the AWS Lambda function with Amazon API Gateway
Amazon API Gateway allows the creation of REST and WebSocket APIs that act as a “front door” for applications to access data, business logic, or functionality from backend services such as AWS Lambda.
Sign in to the API Gateway console.
If this is your first time using API Gateway, you will see a page that introduces you to the features of the service. Choose Get Started. When the Create Example API popup appears, choose OK.
If this is not your first time using API Gateway, choose Create API.
Create an empty API as follows and choose Create API:
You can now create the slurm resource by choosing the root resource (/) in the Resources tree and selecting Create Resource from the Actions dropdown menu as shown here:
The new resource can be configured as follows:
Configure as proxy resource: unchecked
Resource Name: slurm
Resource Path: /slurm
Enable API Gateway CORS: unchecked
To confirm the configuration, choose Create Resource.
In the Resource list, choose /slurm and then Actions and Create method as shown here:
Choose ANY from the dropdown menu, and choose the checkmark icon.
In the “/slurm – ANY – Setup” section, use the following values:
Integration type: Lambda Function
Use Lambda Proxy integration: checked
Lambda Region: eu-west-1
Lambda Function: slurmAPI
Use Default Timeout: checked
and then choose Save.
Choose OK when prompted with Add Permission to Lambda Function.
You can now deploy the API by choosing Deploy API from the Actions dropdown menu as shown here:
For Deployment stage choose [new stage], for Stage name enter slurm, and then choose Deploy:
Take note of the API’s Invoke URL – it will be required for the API interaction.
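For reference, the console steps above can also be scripted. The following boto3 sketch creates the API, the slurm proxy resource, the ANY method with Lambda proxy integration, the permission that lets API Gateway invoke the function, and the slurm stage. The API name is arbitrary, <AWS ACCOUNT ID> must be replaced, and the slurmAPI function is assumed to exist already.

import boto3

region = "eu-west-1"
account_id = "<AWS ACCOUNT ID>"

apigw = boto3.client("apigateway", region_name=region)
awslambda = boto3.client("lambda", region_name=region)

# ARN of the slurmAPI function created earlier
lambda_arn = awslambda.get_function(FunctionName="slurmAPI")["Configuration"]["FunctionArn"]

# Empty REST API and the /slurm resource (a new API only contains the root resource)
api_id = apigw.create_rest_api(name="slurm")["id"]
root_id = apigw.get_resources(restApiId=api_id)["items"][0]["id"]
slurm_id = apigw.create_resource(restApiId=api_id, parentId=root_id, pathPart="slurm")["id"]

# ANY method with Lambda proxy integration
apigw.put_method(restApiId=api_id, resourceId=slurm_id,
                 httpMethod="ANY", authorizationType="NONE")
apigw.put_integration(
    restApiId=api_id, resourceId=slurm_id, httpMethod="ANY",
    type="AWS_PROXY", integrationHttpMethod="POST",
    uri="arn:aws:apigateway:%s:lambda:path/2015-03-31/functions/%s/invocations" % (region, lambda_arn),
)

# Equivalent of the "Add Permission to Lambda Function" prompt
awslambda.add_permission(
    FunctionName="slurmAPI", StatementId="apigateway-slurm-any",
    Action="lambda:InvokeFunction", Principal="apigateway.amazonaws.com",
    SourceArn="arn:aws:execute-api:%s:%s:%s/*/*/slurm" % (region, account_id, api_id),
)

# Deploy to the "slurm" stage and print the Invoke URL of the /slurm resource
apigw.create_deployment(restApiId=api_id, stageName="slurm")
print("https://%s.execute-api.%s.amazonaws.com/slurm/slurm" % (api_id, region))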
Deploy the Cluster
The cluster can now be created using the following command line:
pcluster create -t slurm slurmcluster
-t slurm indicates which section of the cluster template to use.
slurmcluster is the name of the cluster that will be created.
For more details, see the AWS ParallelCluster Documentation. A detailed explanation of the pcluster command line parameters can be found in AWS ParallelCluster CLI Commands.
How to interact with the slurm API
The slurm API created in the previous steps requires some parameters:
- instanceid – the instance id of the Master node.
- function – the API function to execute. Accepted values are list_jobs, list_nodes, list_partitions, job_details, and submit_job.
- jobscript_location – the S3 location of the job script (required only when function=submit_job).
- submitopts – the submission parameters passed to the scheduler (optional, can be used when function=submit_job).
Here is an example of the interaction with the API:
#Submit a job
$ curl -s -X POST "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm?instanceid=i-062155b00c02a6c8e&function=submit_job&jobscript_location=pcluster-data/job_script.sh" -H 'submitopts: --job-name=TestJob --partition=compute'
Submitted batch job 11
#List of the jobs
$ curl -s -X POST "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm?instanceid=i-062155b00c02a6c8e&function=list_jobs"
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
11 compute TestJob ec2-user R 0:14 1 ip-10-0-3-209
#Job details
$ curl -s -X POST "https://966p4hvg04.execute-api.eu-west-1.amazonaws.com/slurm/slurm?instanceid=i-062155b00c02a6c8e&function=job_details&jobid=11"
JobId=11 JobName=TestJob
UserId=ec2-user(500) GroupId=ec2-user(500) MCS_label=N/A
Priority=4294901759 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-06-26T14:42:09 EligibleTime=2019-06-26T14:42:09
AccrueTime=Unknown
StartTime=2019-06-26T14:49:18 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-06-26T14:49:18
Partition=compute AllocNode:Sid=ip-10-0-1-181:28284
ReqNodeList=(null) ExcNodeList=(null)
NodeList=ip-10-0-3-209
BatchHost=ip-10-0-3-209
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/ec2-user/C7XMOG2hPo.sh
WorkDir=/home/ec2-user
StdErr=/home/ec2-user/slurm-11.out
StdIn=/dev/null
StdOut=/home/ec2-user/slurm-11.out
Power=
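Because the API is plain HTTP, the same calls can be issued from any language. Here is a short Python sketch using the requests package; the Invoke URL and the Master node instance id are placeholders to replace with your own values.

import requests

# Replace with the Invoke URL noted earlier and the Master node instance id
invoke_url = "https://<API ID>.execute-api.eu-west-1.amazonaws.com/slurm/slurm"
instance_id = "<Master node instance id>"

# Submit a job (equivalent to the first curl call above)
response = requests.post(
    invoke_url,
    params={
        "instanceid": instance_id,
        "function": "submit_job",
        "jobscript_location": "pcluster-data/job_script.sh",
    },
    headers={"submitopts": "--job-name=TestJob --partition=compute"},
)
print(response.text)

# List the jobs in the queue
response = requests.post(
    invoke_url,
    params={"instanceid": instance_id, "function": "list_jobs"},
)
print(response.text)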
Authentication to the API can be managed by following Controlling and Managing Access to a REST API in API Gateway in the documentation.
Teardown
When you have finished your computation, the cluster can be destroyed using the following command:
pcluster delete slurmcluster
The additional resources that were created (the API, the Lambda function, the IAM roles and policies, and the S3 bucket) can be removed by following the official AWS documentation for each service.
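As a shortcut for the API and the Lambda function, those two deletions can also be scripted with boto3, as in the hypothetical sketch below; the IAM roles, policies, and the S3 bucket are left to the console or to the linked documentation because their names depend on your setup.

import boto3

region = "eu-west-1"
apigw = boto3.client("apigateway", region_name=region)
awslambda = boto3.client("lambda", region_name=region)

# <API ID> is the id part of the Invoke URL noted earlier
apigw.delete_rest_api(restApiId="<API ID>")
awslambda.delete_function(FunctionName="slurmAPI")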
Conclusion
This post has shown you how to deploy a Slurm cluster using AWS ParallelCluster and integrate it with Amazon API Gateway.
This solution uses Amazon API Gateway, AWS Lambda, and AWS Systems Manager to simplify interaction with the cluster without granting access to the command line of the Master node, improving the overall security. You can extend the API by adding additional schedulers or interaction workflows, and it can be integrated with external applications.