AWS Open Source Blog

Managing AWS ParallelCluster SSH users with OpenLDAP

A common request from AWS ParallelCluster users is the ability to deploy multiple POSIX user accounts. The wiki on the project GitHub page documents a simple mechanism for achieving this, and a previous blog post, “AWS ParallelCluster with AWS Directory Services Authentication,” documents how to integrate AWS ParallelCluster with AWS Directory Service. However, some customers might prefer a traditional directory service hosted locally to the cluster, which allows for more convenient administration: there is no need to stop and restart the head node, or to learn the details of managing Active Directory.

A multi-user AWS ParallelCluster environment is ideally suited for cases in which a team of scientists or engineers is closely collaborating and needs to share resources, such as data or account budget. Using a single set of file systems makes cluster management easier for the administrator, particularly if they are new to AWS. When time-to-solution and total spend must be carefully balanced, having jobs from multiple users run within a single scheduler is also helpful.

In this post, we describe how to enable the OpenLDAP directory service on the cluster’s head node, allowing a local collection of users to be created and kept synchronized across all nodes of the cluster.

The process documented here applies to clusters deployed using the CentOS 7 operating system, for AWS ParallelCluster version 2.8.1. Other operating systems can follow a similar process, but will require minor changes to the commands used to install the relevant packages (e.g., using apt-get in lieu of yum).
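
For example, on an Ubuntu-based cluster the package installation steps later in this post might instead look something like the following sketch (the Debian/Ubuntu package names shown here are an assumption and may vary between releases):

# Hypothetical Debian/Ubuntu equivalents of the yum packages used later in this post
apt-get -y install slapd ldap-utils                  # OpenLDAP server and client tools
apt-get -y install libnss-ldapd libpam-ldapd nslcd   # NSS/PAM LDAP client integration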

Preparing a multi-user cluster

After installing and configuring the AWS ParallelCluster command-line tool, set the configuration parameters post_install and s3_read_resource (or s3_read_write_resource) to allow a script to be executed on the cluster’s head and compute nodes at boot time. The following is an example of a minimal configuration file to deploy a Slurm-based cluster using C5n.18xlarge instances, which are well suited for scaling HPC workloads:

[aws]
aws_region_name = eu-west-1
[global]
update_check = true
sanity_check = true
cluster_template = default

[cluster default]
key_name = <YOUR-SSH-KEY>
master_instance_type = c5.xlarge
compute_instance_type = c5n.18xlarge
enable_efa = compute
placement_group = DYNAMIC
placement = compute
disable_hyperthreading = true
post_install = s3://<YOUR-S3-BUCKET-NAME>/post-install.sh
s3_read_resource = arn:aws:s3:::<YOUR-S3-BUCKET-NAME>/*
initial_queue_size = 0
max_queue_size = 20
maintain_initial_size = false
scheduler = slurm
cluster_type = ondemand
master_root_volume_size = 340
compute_root_volume_size = 340
base_os = centos7
vpc_settings = std-vpc

[vpc std-vpc]
vpc_id = <YOUR-VPC-ID>
master_subnet_id = <YOUR-CONTROLLER-SUBNET-ID>
compute_subnet_id = <YOUR_COMPUTE-SUBNET-ID>

The single post-install script contains all installation and deployment steps associated with the OpenLDAP server:

#!/bin/bash
exec > >(tee /var/log/post-install.log|logger -t user-data -s 2>/dev/console) 2>&1

### Utility functions

install_server_packages() {
    yum -y install openldap compat-openldap openldap-clients openldap-servers openldap-servers-sql openldap-devel
}

prepare_ldap_server() {
    # Load environment variables from ParallelCluster
    source /etc/parallelcluster/cfnconfig
    # Start the server
    systemctl start slapd
    systemctl enable slapd
    # Generate a random string to use as the password
    echo $RANDOM | md5sum | awk '{ print $1 }' > /root/.ldappasswd
    chmod 400 /root/.ldappasswd
    # Use the password to generate a ldap password hash
    LDAP_HASH=$(slappasswd -T /root/.ldappasswd)
    # Initial LDAP setup specification
    cat <<-EOF > /root/ldapdb.ldif
dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcSuffix
olcSuffix: dc=${stack_name},dc=internal

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcRootDN
olcRootDN: cn=ldapadmin,dc=${stack_name},dc=internal

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcRootPW
olcRootPW: ${LDAP_HASH}
EOF
    # Apply LDAP settings
    ldapmodify -Y EXTERNAL -H ldapi:/// -f /root/ldapdb.ldif
    # Apply LDAP database settings
    cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG
    chown ldap:ldap /var/lib/ldap/*
    # Apply minimal set of LDAP schemas
    ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/cosine.ldif
    ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/nis.ldif
    # Specify a minimal directory structure
    cat <<-EOF > /root/struct.ldif
dn: dc=${stack_name},dc=internal
dc: ${stack_name}
objectClass: top
objectClass: domain

dn: cn=ldapadmin,dc=${stack_name},dc=internal
objectClass: organizationalRole
cn: ldapadmin
description: LDAP Admin

dn: ou=Users,dc=${stack_name},dc=internal
objectClass: organizationalUnit
ou: Users
EOF
    # Apply the directory structure
    ldapadd -x -D "cn=ldapadmin,dc=${stack_name},dc=internal" -f /root/struct.ldif -y /root/.ldappasswd
    # Save the controller hostname to a shared location for later use
    echo "ldap_server=$(hostname)" > /home/.ldap
}

install_client_packages() {
    yum -y install openldap-clients nss-pam-ldapd
}

prepare_ldap_client() {
    source /etc/parallelcluster/cfnconfig
    source /home/.ldap
    authconfig --enableldap \
               --enableldapauth \
               --ldapserver=${ldap_server} \
               --ldapbasedn="dc=${stack_name},dc=internal" \
               --enablemkhomedir \
               --update
    systemctl restart nslcd
}

### Main body

# Load environment variables from ParallelCluster
source /etc/parallelcluster/cfnconfig
# If the script is being executed on the controller, set up the LDAP server
if [ "${cfn_node_type}" == "MasterServer" ]; then
    install_server_packages
    prepare_ldap_server
fi

# For both controller and compute nodes, enable LDAP authentication
install_client_packages
prepare_ldap_client

When the script is executed on the controller instance, the install_server_packages and prepare_ldap_server functions execute first, followed by the install_client_packages and prepare_ldap_client functions. All of these steps occur during the initial cluster creation process. On subsequent instance launches (i.e., when the cluster scales up to meet the demands of submitted jobs), the post-install process on the compute nodes will execute only the install_client_packages and prepare_ldap_client steps.
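
For completeness, a minimal launch sequence might look like the following, assuming the script above is saved locally as post-install.sh, the AWS CLI is configured, and the configuration shown earlier is stored in the default location (~/.parallelcluster/config):

# Upload the post-install script to the bucket referenced by post_install
aws s3 cp post-install.sh s3://<YOUR-S3-BUCKET-NAME>/post-install.sh

# Create the cluster using the ParallelCluster 2.x CLI
pcluster create <YOUR-CLUSTER-NAME>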

That’s it! When the cluster launch completes, it will be ready to accommodate new users.

Post-install process

Before adding users, let’s walk through the post-install process step-by-step; skip ahead to the next section if you just want to get on with using your cluster.

Within the prepare_ldap_server function, we first start the slapd daemon—this is the OpenLDAP service—which will maintain a database of users and allow clients to authenticate.

Next, we generate a random string to use as the LDAP configuration password, and place it in the root user’s home directory (/root/.ldappasswd) so it can be reused later. We also generate an appropriate hash from the password to insert into LDAP. The initial LDAP configuration sets the olcSuffix, olcRootDN, and olcRootPW parameters to values we choose. In this case, the important variable assignments are cn=ldapadmin,dc=${stack_name},dc=internal and ${LDAP_HASH}.

We are free to modify the LDAP admin account common name from ldapadmin to anything else we choose, and likewise adjust the domain components from ${stack_name} and internal to suit our own preferences. The key point is to ensure that our chosen values are consistent throughout the script. The ldapmodify command applies these settings to the running LDAP service.
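
If you want to confirm that the new settings are in place, an optional check (run as root on the head node) is to query the configuration database directly; this is a verification step only, not part of the post-install script:

ldapsearch -Y EXTERNAL -H ldapi:/// -b "olcDatabase={2}hdb,cn=config" olcSuffix olcRootDN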

Next, we import a default database configuration from the files provided with the LDAP server package, and ensure the ldap service user owns all the relevant files. For this minimal example, we need only the COSINE and NIS schema definitions, which are provided as part of the LDAP installation; these describe the minimum requirements for adding a user to LDAP. The final server configuration step is to define and apply a simple user directory structure, which sets out how entries in the directory are related to each other and allows them to be grouped into organizational units. Our directory structure (specified in the struct.ldif file created inline) contains a single organizational unit for our AWS ParallelCluster users.
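
After the structure has been applied, an optional sanity check is to search the new suffix and confirm that the entries exist. A minimal example, run as root on the head node and binding as the ldapadmin account defined above:

source /etc/parallelcluster/cfnconfig
ldapsearch -x -D "cn=ldapadmin,dc=${stack_name},dc=internal" -y /root/.ldappasswd \
           -b "dc=${stack_name},dc=internal" dn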

Before progressing to the client configuration, we record the hostname of the LDAP server in a shared location that will be accessible to clients on compute instances. In this case, we can place a file in the /home directory, as this will be shared with the rest of the cluster via NFS.

The prepare_ldap_client function establishes the connection between clients, running on the controller instance as well as any future compute instances, and the LDAP server—in this case, using variables defined in both a standard AWS ParallelCluster location (/etc/parallelcluster/cfnconfig) and the /home/.ldap file we created during server preparation.
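
As an optional sanity check on any node, you can confirm that the client configuration was written and that the LDAP name service daemon is running; this assumes authconfig wrote its settings to /etc/nslcd.conf, as it does on CentOS 7 with nss-pam-ldapd installed:

# Check the LDAP server URI and search base picked up by nslcd
sudo grep -E '^(uri|base)' /etc/nslcd.conf
# Confirm the daemon is active
systemctl status nslcd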

Adding cluster users

Once the cluster controller instance boots and the LDAP configuration is complete, we need to add users to the directory service. Log into the cluster as the default user (centos in this case) with the command:

pcluster ssh <YOUR-CLUSTER-NAME>

Alternatively, use pcluster status <YOUR-CLUSTER-NAME> to obtain the public IP address, and connect to this instance via SSH using the command:

ssh -i <path-to-your-private-key> centos@<YOUR-CLUSTER-PUBLIC-IP>

If you used a locally generated public/private key pair stored in the default location (~/.ssh), you can omit the -i <path-to-your-private-key> section of the command.

Once connected, create a bash script with the following contents (we will refer to this script as add-user.sh):

#!/bin/bash

# Only take action if two arguments are provided and the UID is high enough
if [ $# -eq 2 ] && [ "$2" -eq "$2" ] 2>/dev/null && [ "$2" -gt 1000 ]; then
  USERNAME=$1
  USERID=$2
else
  echo "Usage: `basename $0` <user-name> <user-id>"
  echo "<user-name> must be a string"
  echo "<user-id> must be an integer greater than 1000"
  exit 1
fi

# Load env vars which identify instance type
source  /etc/parallelcluster/cfnconfig

# Write a minimal LDAP object configuration for a user
cat <<-EOF > /tmp/${USERNAME}.ldif
dn: uid=${USERNAME},ou=Users,dc=${stack_name},dc=internal
objectClass: top
objectClass: account
objectClass: posixAccount
objectClass: shadowAccount
cn: ${USERNAME}
uid: ${USERNAME}
uidNumber: ${USERID}
gidNumber: 100
homeDirectory: /home/${USERNAME}
loginShell: /bin/bash
EOF

# Add the user to LDAP
ldapadd -x -D "cn=ldapadmin,dc=${stack_name},dc=internal" -f /tmp/${USERNAME}.ldif -y /root/.ldappasswd

# Tidy up and verify the entry was successful
rm /tmp/${USERNAME}.ldif
getent passwd $1

Make the script executable (chmod +x add-user.sh). Adding a user to LDAP does not require root privileges in general, but in this case our only record of the ldapadmin password is in the file /root/.ldappasswd, so we will need to use sudo when executing the script. Pass your desired username and UID as arguments, for example:

sudo ./add-user.sh alice 3000

All users created in this manner will belong to the users group (GID 100). If necessary, you can modify this parameter to assign the users to different groups. Note that group creation is not automatic; any custom groups should be created prior to adding the users.
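
For example, a hypothetical project group could be created on the head node with an LDIF similar to the following sketch (the group name engineering and GID 5000 are placeholders, not values used elsewhere in this post), applied with the same ldapadd invocation used for users:

source /etc/parallelcluster/cfnconfig
cat <<-EOF > /tmp/group.ldif
dn: cn=engineering,ou=Users,dc=${stack_name},dc=internal
objectClass: top
objectClass: posixGroup
cn: engineering
gidNumber: 5000
EOF
sudo ldapadd -x -D "cn=ldapadmin,dc=${stack_name},dc=internal" -f /tmp/group.ldif -y /root/.ldappasswd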

As an alternative to the scripted process described previously, you can retrieve the LDAP password from the file and store it outside the cluster using your normal password protection mechanisms. You can then execute ldapadd from a command prompt as a regular (non-root) user. In this case, you will need to ensure that a valid LDAP object configuration is available, along with a file containing the password, and then execute:

source /etc/parallelcluster/cfnconfig
ldapadd -x -D "cn=ldapadmin,dc=${stack_name},dc=internal" -f <YOUR-LDIF-FILE> -y <PASSWORD-FILE>

Once a user is added, their account will be present within the system (confirmed using getent passwd or sudo getent shadow), but their home directory will not yet exist; the home directory is generally created during first login. At this point, your users will still not be able to connect to the cluster via SSH, as there are no entries in their ~/.ssh/authorized_keys file. Connecting using a password is also not possible, both because we have not configured a password for the user, and because password authentication is disabled by default in the AWS ParallelCluster SSH daemon configuration.
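
For example, after creating the alice account as shown earlier, the following quick checks on the head node illustrate this state: the account resolves through NSS, but its home directory has not yet been created.

# The LDAP account is visible to the system
getent passwd alice
# ...but the home directory does not exist yet
ls -d /home/alice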

Note that we can incorporate SSH keys into the LDAP configuration by using a more complex schema. This would be the more appropriate choice if implementing a separate LDAP server providing credentials for multiple clusters, or other instances that do not have shared home directories. To avoid additional complexity in this simple example, we will instead copy an SSH public key, provided by the prospective user, to the appropriate location.

Assuming we are still logged in as centos and have saved the public key for the new alice user to ~/alice.pub, save the following script as add-key.sh and execute via sudo ./add-key.sh alice alice.pub:

#!/bin/bash

# Only take action if two arguments are provided and the second is a local file
if [ $# -eq 2 ] && [ -f "$2" ] ; then
  USERNAME=$1
  KEYFILE=$2
else
  echo "Usage: `basename $0` <user-name> <key-file>"
  echo "<user-name> should be a user account"
  echo "<key-file> should be an SSH public key file"
  exit 1
fi

# Create the user home and .ssh directory, set up the authorized_keys file
# Note this will overwrite any existing keys if used multiple times
mkhomedir_helper $USERNAME
mkdir -p /home/$USERNAME/.ssh
cat $KEYFILE > /home/$USERNAME/.ssh/authorized_keys
chmod 600 /home/$USERNAME/.ssh/authorized_keys
chown -R $USERNAME:users /home/$USERNAME

The owner of the private key matching alice.pub can now SSH into the cluster using the username alice. Through repeated use of add-user.sh and add-key.sh, the cluster administrator can add as many additional accounts as needed, simply by adjusting the arguments to each script and ensuring that each combination of username, user ID, and public key file is unique.
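
If you have many accounts to create, the two scripts can be driven from a simple list. As a sketch, assuming a hypothetical users.txt file with one username, UID, and public key file per line:

# users.txt might contain lines such as:
#   alice 3000 alice.pub
#   bob   3001 bob.pub
while read -r name uid keyfile; do
    sudo ./add-user.sh "$name" "$uid"
    sudo ./add-key.sh  "$name" "$keyfile"
done < users.txt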

Users created in this manner will not be able to use sudo to assume root privileges; you should reserve administrative actions for the default centos user. For regular HPC cluster usage (submitting jobs to the scheduler, connecting to compute nodes via SSH from the controller, working with the contents of local and network filesystems), these accounts have essentially identical behavior to accounts created using the adduser command.

Conclusion

In this post, we have walked through a simple mechanism for provisioning multiple users within AWS ParallelCluster using a locally installed OpenLDAP server on the controller. With minimal complexity, we configured the directory structure to provide SSH access for HPC users.

Chris Downing

Chris Downing is a Senior Consultant in the HPC Global Specialty Practice of AWS Professional Services.