AWS HPC Blog

Introducing login nodes in AWS ParallelCluster

If you’re a user of a Slurm-based HPC cluster, it’s likely you interact with your cluster using a login node. It’s the portal through which you access your cluster’s vast computational resources. You’ve probably used one to browse your files, submit jobs (and check on them), and compile your code.

You can do all these things using the head node, too, but when a cluster is shared among multiple users in an enterprise or lab, someone compiling their code on the head node can hamper other users trying to submit jobs, or just doing their own work. Some AWS ParallelCluster customers have worked around this limitation by manually creating login nodes for their users, but this involved a lot of undocumented steps and forced their admins to learn about ParallelCluster’s internals.

So we’re happy to announce that AWS ParallelCluster 3.7 now supports adding login nodes to your cluster, out of the box. In this post we’ll show you an example of setting this up for a cluster, and highlight some of the more important tunable options for tweaking the experience.

Getting started with AWS ParallelCluster Login Nodes

Login nodes are specified in a similar way to compute nodes: as a ‘pool of nodes’, which in this case has a single purpose. You can specify one pool of login nodes with as many instances as you’d like to configure for your cluster.

If you want to try this feature out and explore how it works without having to first design a cluster, you can use our one-click launchable stack from the HPC Recipes Library (a super useful resource which we described in a recent post on this channel).

Or you can follow the steps here to augment a configuration file you already have on hand, to enable login nodes on a new cluster, or to update an existing one.

Step 1: Configure ParallelCluster with login nodes

First, ensure that you’re using ParallelCluster 3.7.0 (or later), which introduces this new feature. You can then create a new cluster or update an existing one with login nodes.

Enabling login nodes starts by configuring them in the YAML configuration file that describes your ParallelCluster. You can always retrieve the YAML config for a running cluster using the pcluster describe-cluster command (there’s an example in our documentation).

When you defined your ComputeResources as part of SlurmQueues, you specified the instance type, security groups, and several other details.

It’s similar for login nodes: you define these parameters in a new section called LoginNodes. There are a few settings unique to login nodes, like Count, which specifies the number of nodes, and GracetimePeriod, which lets you set a countdown timer for logged-in users on a node that ParallelCluster is planning to stop. The full spread of options is in our LoginNodes documentation.

The following snippet (which you can add to the end of your config file) shows how to set up a cluster with three login nodes in a pool named CFDCluster, which acts as a front door for users to log in and submit fluid dynamics jobs on your cluster.

LoginNodes:
  Pools:
    - Name: CFDCluster
      Count: 3  # Specify the number of login nodes you need
      InstanceType: t2.micro  # Choose an appropriate instance type
      Ssh:
        KeyName: CFSClusterKey  # The key pair set up in your AWS account
      Networking:
        SubnetIds:
          - subnet-XXXXXXXXX  # Specify the subnet for your login nodes
        SecurityGroups:
          - sg-XXXXXXXXX  # Security groups your login node EC2 instances will use

You can customize the settings according to your requirements, including the instance type, subnet, and security groups. There are additional settings you can specify, like IAM roles and policies, and additional security groups. You can also define custom AMIs if you want to be more opinionated about the experience your users get when they log in (for instance, by giving them access to compilers or licensed debuggers). You’ll find all these details in the ParallelCluster documentation on LoginNodes properties.
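
For example, here’s a sketch of what a more customized pool could look like, adding a custom AMI and an extra managed IAM policy for the login nodes. The structure below follows the LoginNodes schema described in the documentation, but treat it as a starting point and check the LoginNodes properties page for the exact supported keys; the AMI ID, policy ARN, and other values are placeholders.

LoginNodes:
  Pools:
    - Name: CFDCluster
      Count: 3
      InstanceType: t2.micro
      Image:
        CustomAmi: ami-XXXXXXXXX  # An AMI with your compilers or licensed debuggers pre-installed
      Ssh:
        KeyName: CFSClusterKey
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess  # Example extra policy for the login nodes
      Networking:
        SubnetIds:
          - subnet-XXXXXXXXX
        SecurityGroups:
          - sg-XXXXXXXXX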

Step 2: Update your cluster

After you’ve finished editing your ParallelCluster config file, you’re ready to update your running cluster, or create a new one.

If you’re setting up a new cluster with login nodes, use this command along with the configuration file that contains your settings:

pcluster create-cluster --cluster-configuration your_config_file_name --cluster-name your_cluster_name

If you’re updating an existing cluster then it’s just:

pcluster update-cluster --cluster-configuration your_config_file_name --cluster-name your_cluster_name

Of course, replace your_config_file_name with the name of your configuration file and your_cluster_name with the name of your cluster.
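
Cluster creation and updates can take a few minutes. If you’d like to validate the configuration before committing to the change, both commands accept a --dryrun flag, and you can check on progress afterwards with describe-cluster. Here’s a minimal sketch using the same placeholder names as above:

# Validate the configuration and update requirements without changing any resources
pcluster update-cluster --dryrun true --cluster-configuration your_config_file_name --cluster-name your_cluster_name

# Run the real update, then poll the cluster status until the update completes
pcluster update-cluster --cluster-configuration your_config_file_name --cluster-name your_cluster_name
pcluster describe-cluster --cluster-name your_cluster_name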

Step 3: Access your login nodes

Once your cluster is ready, users with appropriate permissions can access the login nodes using SSH. But first they’ll need to find the address of the nodes they want to connect to.

Login nodes are provisioned with a single connection address through a Network Load Balancer (part of Elastic Load Balancing), configured specifically for the pool of login nodes. The exact address depends on the type of subnet you specified in the LoginNodes pool configuration. All connection requests are managed by the Network Load Balancer using round-robin routing.

To retrieve the single connection address provisioned for access to the login nodes, you can run the pcluster describe-cluster command:

pcluster describe-cluster --cluster-name your_cluster_name

This command will also provide more information about the status of the login nodes. Here’s an example of what it returns for our login nodes:

"loginNodes": {
  "status": "active",
  "address": "8af2145440569xyz.us-east-1.amazonaws.com",
  "scheme": "internet-facing|internal",
  "healthyNodes": 3,
  "unhealthyNodes": 0
},

Now it’s just a simple matter of SSH’ing to that address:

ssh username@8af2145440569xyz.us-east-1.amazonaws.com

Users can submit and manage jobs, and they’ll have access to any shared storage the cluster uses, too, so they can manage their files and see the output from their jobs.
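
If you’d rather not copy the address by hand, you can pull it out of the describe-cluster output programmatically. Here’s a small sketch that uses jq (which isn’t part of ParallelCluster, so install it separately) and reads the loginNodes.address field shown above; the key path and username are placeholders:

# Grab the login node pool address from the describe-cluster JSON output
LOGIN_ADDRESS=$(pcluster describe-cluster --cluster-name your_cluster_name | jq -r '.loginNodes.address')

# Connect using the key pair configured for the pool
ssh -i ~/.ssh/your_key.pem username@"${LOGIN_ADDRESS}"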

You’ll notice that the describe-cluster output also gave you some information on the status and health of your login nodes. You can get more granular information on the state, IP address, and launch time of individual login nodes in the pool by using the pcluster describe-cluster-instances command, specifying the node-type as LoginNode:

pcluster describe-cluster-instances --node-type LoginNode --cluster-name your_cluster_name

Check our documentation for more about this.
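
As a quick sketch, you could also tabulate the login node instances with jq. The instances, instanceId, state, and privateIpAddress field names below are our assumption about the command’s JSON output, so adjust them to match what describe-cluster-instances returns in your environment:

# List the instance ID, state, and private IP address of each login node in the pool
pcluster describe-cluster-instances --node-type LoginNode --cluster-name your_cluster_name | jq -r '.instances[] | [.instanceId, .state, .privateIpAddress] | @tsv'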

Controlling the population of login nodes

Once configured, your login nodes will keep running until you remove them from the pool.

Adding and removing login nodes

Adding and removing login nodes from a pool is straightforward. Set the Count parameter of the LoginNodes configuration in the ParallelCluster YAML file to your desired number, then update your running cluster configuration using the pcluster update-cluster command, as before:

pcluster update-cluster --cluster-configuration your_config_file_name --cluster-name your_cluster_name

To remove all login nodes you can just set the Count to 0 and update the cluster.

Setting a grace-time message for users when terminating login nodes

When you remove login nodes from the pool using the Count parameter we just described, ParallelCluster will terminate the Amazon EC2 instances powering them. During a node’s termination, logged-in users will receive terminal notifications in their SSH sessions alerting them about the impending shutdown. The message specifies a grace-time period during which no new connections are allowed, except for those from the cluster’s default user.

The message is customizable by the cluster administrator from the head node or from a login node by editing the file /opt/parallelcluster/shared_login_nodes/loginmgtd_config.json. Here’s what that looks like, by default:

{
  "termination_script_path": "/opt/parallelcluster/shared_login_nodes/loginmgtd_on_termination.sh",
  "termination_message": "The system will be terminated within 10 minutes.",
  "gracetime_period": "10"
}
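
As an example, here’s one way you could change the warning text in place. This is a sketch that assumes jq is available on the node and that the file keeps the structure shown above, so make a backup first:

CONFIG=/opt/parallelcluster/shared_login_nodes/loginmgtd_config.json

# Keep a backup, then rewrite the termination message that logged-in users will see
sudo cp "${CONFIG}" "${CONFIG}.bak"
sudo sh -c "jq '.termination_message = \"This login node will be terminated in 10 minutes. Please save your work.\"' ${CONFIG}.bak > ${CONFIG}"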

Conclusion

AWS ParallelCluster 3.7 introduced login nodes: a powerful and flexible way to manage access and more carefully shape the experience for your HPC cluster’s users.

By distributing interactive user sessions across a pool of dedicated login nodes, you can improve your cluster’s performance, expand the tools available to your end users, and streamline everyone’s data access. With careful configuration and management, login nodes can become an integral part of your cloud-based HPC infrastructure, enabling efficient and scalable computing for your organization.

If you want to get started right away, you can use our one-click launchable stack from the HPC Recipes Library to launch a complete test environment quickly, which you can customize (or delete) later. The Recipes Library can help you quickly achieve feature-rich, reliable HPC deployments that are ready to run a diverse range of workloads, regardless of where you’re starting from.

Austin Cherian

Austin is a Senior Product Manager-Technical for High Performance Computing at AWS. Previously, he was a Senior Developer Advocate for HPC & Batch, based in Singapore. He's responsible for ensuring AWS ParallelCluster grows to provide a smooth journey for customers deploying their HPC workloads on AWS. Prior to AWS, Austin was the Head of Intel’s HPC & AI business for India, where he led the team that helped customers with a path to High Performance Computing on Intel architectures.

Brendan Bouffler

Brendan Bouffler is the head of Developer Relations in HPC Engineering at AWS. He’s been responsible for designing and building hundreds of HPC systems in all kinds of environments, and joined AWS when it became clear to him that cloud would become the exceptional tool the global research & engineering community needed to bring on the discoveries that would change the world for us all. He holds a degree in Physics and an interest in testing several of its laws as they apply to bicycles. This has frequently resulted in hospitalization.