AWS HPC Blog
Implementing e-mail and SMS notifications in AWS ParallelCluster with Slurm
AWS ParallelCluster allows customers to deploy HPC clusters using AWS. You can select the size you need for your cluster, the operating system (OS), file systems, and CPU or GPU architectures to meet your needs – and only pay for what you use. Slurm manages jobs on the cluster, and informs the autoscaling functions in ParallelCluster.
In this post we’ll demonstrate how you can use a post-install script to add e-mail notifications and SMS notifications to ParallelCluster for Slurm job events. And we’ll show how you can use this to interact with other AWS resources – to create your own workflows. This could be a “job completion” notification from Slurm that triggers a post-processing AWS Lambda function to run on output data that a job wrote to an Amazon Simple Storage Service (Amazon S3) bucket.
Enabling Slurm job notifications
Slurm has an option to send e-mails to users when a job changes state, for example when the job starts and ends. To do this, the server that’s running the slurmctld daemon must already be configured to send e-mails using SMTP.
In the example that follows, we’re instructing Slurm to send notifications via e-mail for all job events:
#SBATCH --mail-user=jdoe@example.com
#SBATCH --mail-type=ALL
When Slurm sends e-mails for these events, they don’t include an e-mail body – it packs everything it knows about the job in the subject of the e-mail, as we’ve shown in Figure 1. That empty e-mail body is just begging to be filled with additional job information.
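The same mail options can also be passed to sbatch on the command line instead of as #SBATCH directives. Here's a minimal Python sketch that builds such a command. The helper name build_sbatch_command and the script name job.sh are our own illustrations, and the set of mail types is only a subset of what Slurm documents for --mail-type.

```python
import shlex

# A subset of the event names Slurm documents for --mail-type.
KNOWN_MAIL_TYPES = {"NONE", "BEGIN", "END", "FAIL", "REQUEUE", "ALL"}


def build_sbatch_command(script, mail_user, mail_types=("ALL",)):
    """Return an sbatch argument list that requests e-mail notifications.

    Hypothetical helper for illustration; it only builds the command
    and does not submit anything.
    """
    unknown = [t for t in mail_types if t.upper() not in KNOWN_MAIL_TYPES]
    if unknown:
        raise ValueError(f"unsupported mail type(s): {unknown}")
    return [
        "sbatch",
        f"--mail-user={mail_user}",
        "--mail-type=" + ",".join(t.upper() for t in mail_types),
        script,
    ]


if __name__ == "__main__":
    cmd = build_sbatch_command("job.sh", "jdoe@example.com", ["END", "FAIL"])
    # Print the command so it can be inspected before running it on the headnode.
    print(shlex.join(cmd))
```

On the headnode you could hand the returned list to subprocess.run to submit the job.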
What is Slurm-Mail?
Slurm-Mail is an open source add-on for Slurm that allows HPC administrators to configure their Slurm installations to send HTML5 e-mails to their users about their jobs. The e-mails it generates contain additional information about users’ jobs from Slurm’s job accounting database. You can customize the e-mails using templates depending on your individual needs.
Slurm-Mail provides packages for RHEL-based operating systems, including Amazon Linux. It also supports Ubuntu and SLES.
How can we use Slurm-Mail with AWS ParallelCluster?
The Slurm-Mail package repository provides RPMs for Amazon Linux 2. We can therefore install Slurm-Mail on the headnode of our ParallelCluster. However, we need to tell Slurm-Mail where to deliver e-mails. For this we can use Amazon Simple Email Service (SES).
Amazon SES gives you the ability to send e-mails from your cloud-based resources, e.g. from Amazon Elastic Compute Cloud (Amazon EC2) hosts, Lambda functions, and more. This eliminates the need for you to set up and operate your own mail servers. You can send messages using Amazon SES with either the Amazon SES API, or an SMTP interface. In this blog post, we’ll make use of the SMTP interface to send e-mails from the headnode on our ParallelCluster, using Slurm-Mail.
In the Amazon SES console you can create an e-mail identity that will send the e-mails. To help prevent fraud and abuse, and to help protect your reputation as a sender, Amazon SES places all new accounts in the SES sandbox. In the sandbox, you can only send e-mails to and from identities that you have verified in Amazon SES. If you’re only sending e-mails to yourself or to addresses you are able to verify, then the sandbox will be enough for your needs. If not, you’ll need to open a request with AWS Support to move your account out of the Amazon SES sandbox.
Once you’ve created and verified your e-mail identity in the Amazon SES console you can create SMTP credentials (shown in Figure 1).
The process of creating SMTP credentials will create a new IAM (Identity and Access Management) user. Take note of the credentials displayed to you when this happens, because you’ll need these to configure Slurm-Mail. Also note your SMTP server settings – they’re available in the SES console (as shown in Figure 2).
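Before wiring these settings into Slurm-Mail, it can help to confirm that the SMTP credentials work. The following is a minimal sketch using Python's standard library, assuming a STARTTLS-capable SMTP port. The function names are our own, and the server, port, user, and password in the commented-out call are placeholders for the values shown in your SES console.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Create a small plain-text test e-mail."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Amazon SES SMTP test"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_via_smtp(msg, server, port, user, password):
    """Send msg through an SMTP endpoint, upgrading to TLS with STARTTLS."""
    context = ssl.create_default_context()
    with smtplib.SMTP(server, port, timeout=30) as smtp:
        smtp.starttls(context=context)
        smtp.login(user, password)
        smtp.send_message(msg)


if __name__ == "__main__":
    msg = build_test_message("jdoe@example.com", "jdoe@example.com")
    print(msg["Subject"])
    # Placeholder values: substitute the settings from your own SES console.
    # send_via_smtp(msg, "<your SES SMTP server>", 587, "<SMTP user>", "<SMTP password>")
```

While your account is in the sandbox, both the sender and the recipient need to be verified identities, which is why the example sends the message to the same address it comes from.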
With this information, you’re now ready to make use of the AWS ParallelCluster Slurm-Mail recipe, which is part of the HPC Recipes Library. This is a launchable AWS CloudFormation template which will ask you for a number of input values to define your cluster. These include the SMTP settings that Slurm-Mail will use. We’ve detailed the input parameters you need in Table 1, and Figure 3 shows an example of them in use.
During the creation of the headnode resource, we invoke the postinstall.sh script to install and configure Slurm-Mail. The script will create a yum configuration file on the headnode to help ensure that you can keep your Slurm-Mail installation up to date.
It will take approximately 30-40 minutes per cluster to provision the resources. Once the stack has finished deploying, you can check the output values to find the public IP address of the headnode (Figure 4).
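If you'd rather read the stack outputs from a script than from the console, here's a minimal sketch using the Boto3 CloudFormation client's describe_stacks call. The helper names are our own, and the output key used in the test data ("HeadNodeIp") is hypothetical. Check your stack's outputs for the actual key names.

```python
def outputs_as_dict(describe_stacks_response):
    """Flatten the Outputs list of the first stack into a {key: value} dict."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def fetch_stack_outputs(stack_name):
    """Look up a deployed stack by name and return its outputs as a dict."""
    import boto3  # imported here so outputs_as_dict can be used without boto3

    cfn = boto3.client("cloudformation")
    return outputs_as_dict(cfn.describe_stacks(StackName=stack_name))


if __name__ == "__main__":
    # Hand-written response in the shape describe_stacks returns (trimmed).
    sample = {
        "Stacks": [
            {"Outputs": [{"OutputKey": "HeadNodeIp", "OutputValue": "203.0.113.10"}]}
        ]
    }
    print(outputs_as_dict(sample))
```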
Using the headnode IP and your SSH key you will be able to log in to the headnode of your cluster to submit jobs. Slurm-Mail will then send you job notification e-mails. We’ve shown an example e-mail in Figure 5.
Adding e-mail notifications to an existing cluster
Before proceeding to update an existing ParallelCluster deployment, make sure you have an Amazon SES e-mail identity configured along with an SMTP user and password (see the previous section).
Now you need to log in to the headnode of your ParallelCluster and execute some commands to download the post-install script:
wget https://raw.githubusercontent.com/aws-samples/aws-hpc-recipes/main/recipes/pcluster/slurm_accounting_with_email/assets/postinstall.sh -O ./post-install.sh
chmod +x ./post-install.sh
The script takes command line options that configure Slurm-Mail on your headnode. When it runs, it:
- Determines the OS version
- Configures the yum or Ubuntu repository for Slurm-Mail based on the OS
- Installs the Slurm-Mail package
- Updates the Slurm-Mail configuration file to use your SMTP settings from SES
- Updates your Slurm configuration file to use Slurm-Mail for e-mail notifications
- Restarts your Slurm controller daemon (slurmctld)
You can then run the script as root to perform the installation and configuration – make sure you replace the $SES_* variables with the values for your set-up.
sudo ./post-install.sh -c /opt/slurm/etc/slurm.conf -d /opt/slurm/bin \
-e $SES_email_address -u $SES_user -p $SES_password \
-n $SES_port -s $SES_server
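After the script finishes, you may want to confirm that your Slurm configuration now hands e-mail off to Slurm-Mail. Slurm selects its mail program through the MailProg setting in slurm.conf. The sketch below reads that setting. The helper name is our own, and the expected value in the example (/usr/bin/slurm-spool-mail) is an assumption based on Slurm-Mail's documentation, so compare it against what your installation actually uses.

```python
def find_mail_prog(slurm_conf_text):
    """Return the MailProg value from slurm.conf text, or None if it is unset.

    Parameter names are matched case-insensitively, and the last
    occurrence in the file is the one returned.
    """
    value = None
    for raw_line in slurm_conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip().lower() == "mailprog":
            value = val.strip()
    return value


if __name__ == "__main__":
    example = "ClusterName=demo\nMailProg=/usr/bin/slurm-spool-mail\n"
    print(find_mail_prog(example))
    # On the headnode, you could read the real file instead:
    # with open("/opt/slurm/etc/slurm.conf") as fh:
    #     print(find_mail_prog(fh.read()))
```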
How can we take this further?
So far we’ve been able to receive notifications about our Slurm jobs running on AWS ParallelCluster via e-mail. But what if we wanted to receive an SMS text message – or even trigger another AWS service to perform an action? For example we could invoke an AWS Lambda function to post-process files that a Slurm job wrote to an Amazon S3 bucket.
It turns out we can do this, thanks to the ability of SES to send a notification that includes the e-mail headers to an Amazon Simple Notification Service (SNS) topic of our choosing when the e-mails are successfully delivered.
From here we can configure the SNS topic to invoke a Lambda function that uses the SMS sending facilities in Amazon SNS. Since we include the e-mail headers in the Amazon SNS notification, we can just extract “Subject” and include this in the text message that we want to send.
In the recipe repository we’ve included a recipe that demonstrates this set-up. The recipe makes use of a CloudFormation custom resource to configure SES to send successful e-mail delivery notifications to an SNS topic that we also create in that same template.
Figure 6 illustrates the flow of information all the way from Slurm on the headnode, through Slurm-Mail, to Amazon SES, through to Amazon SNS to send the SMS text message (you may need to read that sentence a couple of times, but we promise this makes sense). All the resources we show in that diagram are created by the recipe.
This code snippet shows how we can send an SMS message using SNS in a Lambda function, using the Python runtime:
import json
import os

import boto3


def handler(event, context):
    # The SES notification arrives as a JSON string inside the SNS record.
    message = json.loads(event['Records'][0]['Sns']['Message'])
    if "mail" in message:
        # send to phone number
        sns = boto3.client("sns")
        print(f'sending text message to {os.environ["SMS_NUMBER"]}')
        try:
            sns.publish(
                PhoneNumber=os.environ["SMS_NUMBER"],
                Message=message["mail"]["commonHeaders"]["subject"]
            )
        except Exception as err:
            print(f'Unable to send message: {err}')
            return
        print('message sent')
    return
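To try the subject extraction locally, without deploying anything, you can feed the same parsing logic a hand-built event. The sketch below is illustrative only: extract_subject and make_sample_event are our own helper names, and the sample event is trimmed to just the fields the handler reads, so it's much smaller than a real SES notification.

```python
import json


def extract_subject(event):
    """Return the e-mail subject from an SNS-wrapped SES notification, or None."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    mail = message.get("mail")
    if mail is None:
        return None
    return mail.get("commonHeaders", {}).get("subject")


def make_sample_event(subject):
    """Build a minimal stand-in for the event that SNS passes to Lambda."""
    ses_notification = {"mail": {"commonHeaders": {"subject": subject}}}
    return {"Records": [{"Sns": {"Message": json.dumps(ses_notification)}}]}


if __name__ == "__main__":
    event = make_sample_event("example job notification subject")
    print(extract_subject(event))
```

Keeping the parsing in its own function like this makes it straightforward to unit test separately from the call to sns.publish.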
Sending an SMS text message is only one possibility, however. By using Amazon SNS to trigger an AWS Lambda function, we can interact with many other AWS services to perform tasks for us, for example by using the Boto3 module for Python.
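As a rough sketch of how a Lambda function might choose between such tasks, the example below maps keywords in the e-mail subject to action names. The keywords ("began", "ended", "failed") and the action names are assumptions for illustration. The actual subject wording depends on your Slurm-Mail templates, so adjust the mapping to match what your cluster sends.

```python
# Checked in order, so "failed" takes priority over "ended" if both appear.
# Keywords and action names are hypothetical; adapt them to your own subjects.
SUBJECT_ACTIONS = (
    ("failed", "alert"),
    ("ended", "post_process"),
    ("began", "log_only"),
)


def choose_action(subject):
    """Pick an action name based on keywords in the notification subject."""
    text = subject.lower()
    for keyword, action in SUBJECT_ACTIONS:
        if keyword in text:
            return action
    return "ignore"


if __name__ == "__main__":
    for subject in ("Job 42: Ended", "Job 43: Failed", "Unrelated message"):
        print(subject, "->", choose_action(subject))
```

In a real handler, each action name would map to a Boto3 call of your choosing, such as starting a post-processing step on the job's output in Amazon S3.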
Conclusion
We’ve shown you how you can provision a working AWS ParallelCluster with just a few mouse clicks that – once deployed – is ready and waiting to send us job notification e-mails. We can submit our jobs without having to manually check up on them. The content provided in the e-mails is further enhanced, thanks to Slurm-Mail.
Finally, we’ve also shown you how to take this further and deploy a ParallelCluster capable of also sending us SMS text messages when our jobs change state. This is just one example of how Slurm job notifications can be used to trigger – and pass information to – other AWS services.