AWS HPC Blog
Introducing login nodes in AWS ParallelCluster
If you’re a user of a Slurm-based HPC cluster, it’s likely you interact with your cluster using a login node. It’s the portal through which you access your cluster’s vast computational resources. You’ve probably used one to browse your files, submit jobs (and check on them) and compile your code.
You can do all these things using the head node, too, but when a cluster is shared among multiple users in an enterprise or lab, someone compiling their code on the head node can hamper other users who are trying to submit jobs or just do their own work. Some AWS ParallelCluster customers have worked around this limitation by manually creating login nodes for their users, but that involved a lot of undocumented steps and forced their admins to learn about ParallelCluster’s internals.
So we’re happy to announce that AWS ParallelCluster 3.7 now supports adding login nodes to your cluster, out of the box. In this post we’ll show you an example of setting this up for a cluster, and highlight some of the more important tunable options for tweaking the experience.
Getting started with AWS ParallelCluster Login Nodes
Login nodes are specified in a similar way to compute nodes: as a ‘pool of nodes’ that, in this case, serves a single purpose. You can specify one pool of login nodes with as many instances as you would like to configure for your cluster.
If you want to try this feature out and explore how it works without having to first design a cluster, you can use our one-click launchable stack from the HPC Recipes Library (a super useful resource which we described in a recent post on this channel).
Or, you can follow the steps here to augment an existing configuration file you have on hand, to enable login nodes on a new cluster, or to make an update to an existing one.
Step 1: Configure ParallelCluster with login nodes
First, ensure that you’re using ParallelCluster 3.7.0 (or later), which introduces this new feature. You can then create a new cluster or update an existing one with login nodes.
Enabling login nodes starts by configuring them in the YAML configuration file that describes your ParallelCluster. You can always retrieve the YAML config for a running cluster using the pcluster describe-cluster command (there’s an example in our documentation).
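If you don’t have a copy of the config on hand, you can fetch it straight from the running cluster. Here’s a quick sketch (it assumes you have jq and curl installed, and that the config URL appears under clusterConfiguration.url in the describe-cluster output; the cluster name and output filename are placeholders):

pcluster describe-cluster --cluster-name your_cluster_name \
  | jq -r '.clusterConfiguration.url' \
  | xargs curl -s -o cluster-config.yaml    # download the cluster's YAML config locally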
When you defined your ComputeResources as part of SlurmQueues, you specified the instance type, security groups, and several other details.
It’s similar for login nodes, where you define these parameters in a new section called LoginNodes. There are a few settings unique to login nodes, like Count, which specifies the number of nodes, and GracetimePeriod, which lets you set a countdown timer for logged-in users on a node when ParallelCluster is planning to stop it. The full spread of options is in our LoginNodes documentation.
The following snippet (which you can add to the end of your config file) shows how to set up a cluster with three login nodes, as part of a pool named CFDCluster that acts as a front door for users to log in and submit fluid dynamics jobs on your cluster.
LoginNodes:
  Pools:
    - Name: CFDCluster
      Count: 3                    # Specify the number of login nodes you need
      InstanceType: t2.micro      # Choose an appropriate instance type
      Ssh:
        KeyName: CFSClusterKey    # The key pair setup in your AWS account
      Networking:
        SubnetIds:
          - subnet-XXXXXXXXX      # Specify the subnet for your login nodes
        SecurityGroups:
          - sg-XXXXXXXXX          # Security groups your EC2 login nodes will be within
You can customize the settings according to your requirements, including the instance type, subnet, and security groups. There are some additional settings you can specify, like IAM roles and policies, and additional security groups. You can also define custom AMIs if you want to be more opinionated about the experience for your users when they log in (for instance, by giving them access to compilers or licensed debuggers). You’ll find all these details in the ParallelCluster documentation on LoginNodes properties.
Step 2: Update your cluster
After you’ve finished editing your ParallelCluster config file, you’re ready to update your running cluster, or create a new one.
If you’re setting up a new cluster with login nodes, use this command along with the configuration file that contains your settings:
pcluster create-cluster --cluster-configuration your_config_file_name --cluster-name your_cluster_name
If you’re updating an existing cluster then it’s just:
pcluster update-cluster --cluster-configuration your_config_file_name --cluster-name your_cluster_name
Of course, replace your_config_file_name with the name of your configuration file and your_cluster_name with the name of your cluster.
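If you’d like a sanity check before anything gets created or changed, the CLI also has a dry-run option you can use with the same configuration file. For example:

pcluster update-cluster --cluster-configuration your_config_file_name \
  --cluster-name your_cluster_name --dryrun true    # validate the config without applying it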
Step 3: Access your login nodes
Once your cluster is ready, users with appropriate permissions can access the login nodes using SSH. But first they’ll need to find the address of the nodes they want to connect to.
Login nodes are provisioned with a single connection address to a Network Load Balancer (part of Elastic Load Balancing), specifically configured for the pool of login nodes. The exact address depends on the type of subnet you specified in the LoginNodes pool configuration. All connection requests are managed by the Network Load Balancer using round-robin routing.
To retrieve the address of the single connection provisioned to access the login nodes, you can run the pcluster describe-cluster command.
pcluster describe-cluster --cluster-name your_cluster_name
This command will also provide more information about the status of the login nodes. Here’s an example of what it returns for our login nodes:
"loginNodes":
{ "status": "active",
"address": "8af2145440569xyz.us-east-1.amazonaws.com",
"scheme": "internet-facing|internal",
"healthyNodes": 3,
"unhealthyNodes": 0 },
Now it’s just a simple matter of SSH’ing to that address:
ssh username@8af2145440569xyz.us-east-1.amazonaws.com
Users can submit and manage jobs, and they’ll have access to any shared storage the cluster uses, too, so they can manage their files and see the output from their jobs.
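For example, once you’re connected to a login node, the usual Slurm workflow applies (the job script name below is just a placeholder):

sbatch run_cfd.sbatch    # submit a batch job, just as you would from the head node
squeue -u $USER          # check on your queued and running jobs
sinfo                    # see the state of the cluster's partitions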
You’ll notice that the describe-cluster output also gave you some information on the status and health of your login nodes. You can get more granular information on the state, IP address, and launch time of individual login nodes in the pool by using the pcluster describe-cluster-instances command, specifying the node-type as LoginNode:
pcluster describe-cluster-instances --node-type LoginNode --cluster-name your_cluster_name
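If you just want a quick summary, you can filter that output with something like this (a sketch that assumes jq is installed and that these field names appear in the JSON response):

pcluster describe-cluster-instances --node-type LoginNode --cluster-name your_cluster_name \
  | jq -r '.instances[] | "\(.instanceId)  \(.privateIpAddress)  \(.state)"'    # instance ID, private IP, and state per node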
Check our documentation for more about this.
Controlling the population of login nodes
Once configured, your login nodes will keep running until you remove them from the pool.
Adding and removing login nodes
Adding and removing login nodes from a pool is straightforward. You set the Count parameter of the LoginNodes configuration in the ParallelCluster YAML file to your desired number. Then, update your running cluster configuration using the pcluster update-cluster command, as before:
pcluster update-cluster --cluster-configuration your_config_file_name --cluster-name your_cluster_name
To remove all login nodes, you can just set the Count to 0 and update the cluster.
Setting a grace-time message for users when terminating login nodes
When you remove login nodes from the pool using the Count parameter we just described, ParallelCluster will terminate the Amazon EC2 instances powering them. During a node’s termination, logged-in users will receive terminal notifications in their SSH sessions alerting them about the impending shutdown. The message will specify a grace-time period during which no new connections will be allowed, except for those from the cluster’s default user. The message is customizable by the cluster administrator, either from the head node or from a login node, by editing the file /opt/parallelcluster/shared_login_nodes/loginmgtd_config.json. Here’s what that looks like, by default:
{
"termination_script_path": "/opt/parallelcluster/shared_login_nodes/loginmgtd_on_termination.sh",
"termination_message": "The system will be terminated within 10 minutes.",
"gracetime_period": "10"
}
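For example, an administrator could swap in their own message with a one-liner like this (just a sketch; run it on the head node with root privileges and adjust the wording to taste):

sudo sed -i 's/The system will be terminated within 10 minutes./This login node is being retired - please save your work and log off./' \
  /opt/parallelcluster/shared_login_nodes/loginmgtd_config.json    # replaces the default termination_message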
Conclusion
AWS ParallelCluster 3.7 introduced login nodes, a powerful and flexible way to manage access and more carefully shape the user experience for your HPC cluster’s users.
By distributing interactive user sessions across a pool of dedicated login nodes, you can improve your cluster’s performance, expand the tools available to your end users, and streamline everyone’s data access. With careful configuration and management, login nodes can become an integral part of your cloud-based HPC infrastructure, enabling efficient and scalable computing for your organization.
If you want to get started right away, you can use our one-click launchable stack to get a complete test environment up and running quickly, which you can customize (or delete) later. It comes from the HPC Recipes Library, which can help you quickly achieve feature-rich, reliable HPC deployments that are ready to run a diverse range of workloads, regardless of where you’re starting from.