AWS Open Source Blog

Managing AWS ParallelCluster SSH users with AWS OpsWorks

In a previous article, we highlighted the potential for deploying a local LDAP server as a mechanism for managing a multi-user AWS ParallelCluster deployment with low administrator overhead. If we want our cluster users to access or manage other AWS resources, it’s preferable to control their access via AWS Identity and Access Management (IAM). Federation with a centralized directory service and the use of third-party tools can help, but the complexity of these solutions often presents a significant barrier, particularly where HPC users manage their own research computing environments.

In this article, we provide an additional stepping stone for users who need to progress beyond local cluster user management toward a more complete integration with IAM. AWS OpsWorks is primarily a configuration management service integrated with Chef and Puppet. For user management, we don’t need to make use of these configuration management tools directly. Instead, we leverage an AWS OpsWorks Stacks capability that allows IAM users to be assigned an SSH key and provisioned as POSIX user accounts on a registered instance.

Account setup

The summarized manual preparation steps we need to follow to enable OpsWorks-managed users within AWS ParallelCluster are:

1. Create an Amazon VPC and subnets for the cluster.
2. Create an IAM user for each cluster user.
3. Create an OpsWorks stack in the same VPC as the planned cluster.
4. Import the IAM users into OpsWorks and configure their SSH keys and per-stack access.

Additionally, you should ensure that account quotas for both Amazon Elastic Compute Cloud (Amazon EC2) and OpsWorks suit your needs, requesting quota increases where necessary.

This guide assumes you have full administrator access to IAM within the AWS account, as well as permissions to work with Amazon EC2 and OpsWorks services. If this is not the case, consult with your account administrator to determine which steps they need to undertake on your behalf.

To begin, we create an Amazon VPC and subnets for our cluster. To see how this can be achieved using the pcluster CLI tool, check out the AWS ParallelCluster documentation. Alternatively, you can create your own networking resources using your preferred method.
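For example, in recent AWS ParallelCluster 2.x releases, the interactive configuration wizard can create the networking resources on your behalf; a minimal sketch:

pcluster configure

When prompted, choose the automated VPC creation option; the wizard then provisions the VPC and subnets for you.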

Next, we need to add the desired user accounts to IAM. During user creation, IAM prompts us to grant either programmatic access or console access. These options relate to general IAM usage; neither is required for the integration between OpsWorks and AWS ParallelCluster to function.

Select programmatic access if you do not intend to grant your users any access to the AWS account other than SSH access to the deployed cluster. User accounts do not need any specific access permissions granted. Although we can arrange IAM user accounts into groups if desired, these groups will not be reflected in the POSIX groups within the cluster.

Once we create the user, we can optionally disable the automatically created API keys. Navigate to the user via the IAM dashboard, open the Security credentials tab, and delete or deactivate the access key.
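If you prefer to script user creation, the same steps are available via the AWS CLI. A sketch, using a hypothetical user name (alice) and a placeholder access key ID:

# Create the IAM user, list any access keys, and deactivate the unwanted key
aws iam create-user --user-name alice
aws iam list-access-keys --user-name alice
aws iam update-access-key --user-name alice \
    --access-key-id <ACCESS-KEY-ID> --status Inactive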

Once the IAM users are in place, we create a new OpsWorks stack in the same VPC we plan to use for the AWS ParallelCluster deployment. Most settings can retain their default values when using the default Chef 12 stack; we will not use OpsWorks to provision any instances or apply any Chef configuration management recipes.

Screenshot of settings when creating a new OpsWorks stack.
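The stack can also be created from the command line. A sketch, assuming the default OpsWorks service role and instance profile already exist in the account, and using placeholder values throughout:

aws opsworks create-stack --name parallelcluster-users \
    --stack-region eu-west-1 \
    --vpc-id <YOUR-VPC-ID> \
    --default-subnet-id <YOUR-CONTROLLER-SUBNET-ID> \
    --service-role-arn arn:aws:iam::<ACCOUNT-ID>:role/aws-opsworks-service-role \
    --default-instance-profile-arn arn:aws:iam::<ACCOUNT-ID>:instance-profile/aws-opsworks-ec2-role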

Once we create the stack, we are ready to import IAM users. On the OpsWorks console, navigate to Users (under OpsWorks Stacks) and then choose Import IAM users to <your region>. Select the IAM users to import, then click Import to OpsWorks. Now we are able to integrate those user accounts with any OpsWorks stacks created within that Region. (Note: If multiple Regions are used, we must import users to each Region.)
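Importing a user corresponds to creating an OpsWorks user profile from the IAM user's ARN, so this step can be scripted as well; a sketch for a hypothetical user alice:

aws opsworks create-user-profile --region eu-west-1 \
    --iam-user-arn arn:aws:iam::<ACCOUNT-ID>:user/alice \
    --ssh-username alice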

For each imported user, we next need to grant access to resources within a stack. In the table of imported users, choose the edit link for the first user. From the user dashboard, enter the SSH public key for the user. Additionally, use the check boxes under Instance access to control their access permissions and sudo privileges on a per-stack basis. The Permission level section grants the user additional IAM permissions for resources in the stack. This is not required, so leave the selection as IAM Policies Only. With this configuration, the user only has the IAM permissions already associated with their user account.

Screenshot of user settings within OpsWorks.
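These per-user settings map onto the OpsWorks UpdateUserProfile and SetPermission API operations, so they can be applied from the command line too. A sketch, again using hypothetical values:

# Attach the user's SSH public key to their OpsWorks profile
aws opsworks update-user-profile --region eu-west-1 \
    --iam-user-arn arn:aws:iam::<ACCOUNT-ID>:user/alice \
    --ssh-public-key "ssh-rsa AAAA... alice@example"

# Allow SSH without sudo on one stack, leaving IAM permissions unchanged
aws opsworks set-permission --region eu-west-1 \
    --stack-id <YOUR-OPSWORKS-STACK-ID> \
    --iam-user-arn arn:aws:iam::<ACCOUNT-ID>:user/alice \
    --allow-ssh --no-allow-sudo --level iam_only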

Repeat the user configuration steps for each IAM account we imported into OpsWorks. Once this setup is complete, we are ready to deploy AWS ParallelCluster.

Cluster deployment

To allow instances to register themselves with OpsWorks during the post-install phase, apply the AWS-managed IAM policy AWSOpsWorksInstanceRegistration to the controller and all compute instances. For deregistration, we need permissions provided by the AWSOpsWorksRegisterCLI_EC2 policy. Add these policies to the ParallelCluster configuration file via the additional_iam_policies parameter, as shown in the following example:

[aws]
aws_region_name = eu-west-1

[global]
update_check = true
sanity_check = true
cluster_template = default

[cluster default]
key_name = <YOUR-SSH-KEY>
additional_iam_policies = arn:aws:iam::aws:policy/AWSOpsWorksInstanceRegistration,arn:aws:iam::aws:policy/AWSOpsWorksRegisterCLI_EC2
master_instance_type = t3.large
post_install = s3://<YOUR-S3-BUCKET-NAME>/post-install.sh
s3_read_resource = arn:aws:s3:::<YOUR-S3-BUCKET-NAME>/*
post_install_args = "<YOUR-OPSWORKS-STACK-ID>"
scheduler = slurm
queue_settings = compute,standard,highmem
master_root_volume_size = 340
compute_root_volume_size = 340
base_os = centos7
fsx_settings = scratchfs
vpc_settings = std-vpc

[fsx scratchfs]
shared_dir = /fsx
storage_capacity = 1200
deployment_type = SCRATCH_2

[vpc std-vpc]
vpc_id = <YOUR-VPC-ID>
master_subnet_id = <YOUR-CONTROLLER-SUBNET-ID>
compute_subnet_id = <YOUR-COMPUTE-SUBNET-ID>

[queue compute]
compute_resource_settings = c5n.18xl
placement_group = DYNAMIC
enable_efa = true
disable_hyperthreading = true
compute_type = ondemand

[queue standard]
compute_resource_settings = m5n.24xl
placement_group = DYNAMIC
enable_efa = true
disable_hyperthreading = true
compute_type = ondemand

[queue highmem]
compute_resource_settings = r5n.24xl
placement_group = DYNAMIC
enable_efa = true
disable_hyperthreading = true
compute_type = ondemand

[compute_resource c5n.18xl]
instance_type = c5n.18xlarge
max_count = 10

[compute_resource m5n.24xl]
instance_type = m5n.24xlarge
max_count = 10

[compute_resource r5n.24xl]
instance_type = r5n.24xlarge
max_count = 10

Note the inclusion of a post_install parameter and corresponding post_install_args. The former should be a reference to the following script (uploaded to an Amazon Simple Storage Service [Amazon S3] bucket). The latter should include the ID of the intended OpsWorks stack. Obtain this ID from the OpsWorks dashboard by clicking on the stack name and copying the OpsWorks ID. Note that this is not the same as the stack name.
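Alternatively, the stack ID can be listed with the AWS CLI; a sketch:

aws opsworks describe-stacks --region eu-west-1 \
    | jq -r '.Stacks[] | [.Name, .StackId] | @csv'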

Registration of instances with OpsWorks is achieved using an AWS CLI command executed within the post-install script. Instances must also be deregistered from OpsWorks when no longer needed; to accomplish this, we install an additional script on the controller instance that performs periodic deregistration of terminated instances via a cron job.

#!/bin/bash

exec > >(tee /var/log/post-install.log|logger -t post-inst -s 2>/dev/console) 2>&1

### Utility functions

register_opsworks_client() {
    # Load ParallelCluster environment variables (provides $cfn_region)
    source /etc/parallelcluster/cfnconfig
    # Register this instance with the OpsWorks stack using its instance profile
    aws opsworks register --use-instance-profile \
                          --infrastructure-class ec2 \
                          --region "$cfn_region" \
                          --stack-id "$opsworks_stack_id" \
                          --local
}

configure_opsworks_deregistration() {
    cat <<-'EOF' > /root/deregister_instances.sh
#!/bin/bash

dereg_log="/var/log/opsworks-deregister.log"

exec > >(tee -a $dereg_log|logger -t owdereg) 2>&1

export aws_bin="/opt/parallelcluster/pyenv/versions/3.6.9/envs/cookbook_virtualenv/bin/aws"

source /etc/parallelcluster/cfnconfig

export opsworks_stack_id=$cfn_postinstall_args

stack_members=$($aws_bin opsworks describe-instances --region $cfn_region --stack-id $opsworks_stack_id | jq -r '.Instances[] | [.Ec2InstanceId, .InstanceId] | @csv')

echo "`date`: OpsWorks deregistration cycle started." >> $dereg_log

for instance in $stack_members; do
  ec2_id=$(echo $instance | awk -F',' '{print $1}' | tr -d '"')
  opsworks_id=$(echo $instance | awk -F',' '{print $2}' | tr -d '"')
  echo "Checking instance: $ec2_id"
  ec2_state=$($aws_bin ec2 describe-instances --instance-id $ec2_id --region $cfn_region | jq -r ".Reservations[].Instances[].State.Name")
  echo "Instance $ec2_id is in state \"$ec2_state\""
  if [ "$ec2_state" = "terminated" ]; then
    echo "Instance $ec2_id will be deregistered from OpsWorks"
    $aws_bin opsworks deregister-instance --instance-id $opsworks_id --region $cfn_region
  else
    echo "Instance $ec2_id still in use or transitioning state - not deregistering."
  fi
done

echo "`date`: OpsWorks deregistration cycle completed." >> $dereg_log
EOF

chmod +x /root/deregister_instances.sh

echo "*  *  *  *  * root /root/deregister_instances.sh" >> /etc/crontab
}

### Main body

# Load environment variables from ParallelCluster
source /etc/parallelcluster/cfnconfig

# Obtain the OpsWorks stack ID from the pcluster "post_install_args" parameter
export opsworks_stack_id=$cfn_postinstall_args

# If the script is being executed on the controller, set up a cron job to run OpsWorks deregistration
if [ "$cfn_node_type" == 'MasterServer' ]; then
    configure_opsworks_deregistration
fi

# For both controller and compute nodes, register the OpsWorks client
register_opsworks_client

The deregistration script (inserted via the configure_opsworks_deregistration function) obtains the EC2 instance ID and corresponding OpsWorks ID for all instances registered with the current stack. It then checks their state—if an instance is in the terminated state, it can be safely deregistered. The cron job created by the preceding example post-install script runs every minute. We can reduce this frequency depending on the typical runtime of the cluster jobs and the value of scaledown_idletime set within the cluster configuration. A log of deregistration activity can be found in /var/log/opsworks-deregister.log.
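For example, to run the deregistration script every 15 minutes instead of every minute, the crontab entry would be:

*/15  *  *  *  * root /root/deregister_instances.sh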

Once the cluster is deployed using the pcluster CLI tool, both the default system user (centos, in this example) and any users configured via OpsWorks can access the cluster via SSH and submit jobs to the batch scheduler. The SSH user name for imported user accounts is visible on the user configuration page within OpsWorks. Each imported user is a member of the opsworks POSIX group.
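For example, assuming an imported user whose OpsWorks SSH user name is alice, and substituting the controller's address:

ssh alice@<CONTROLLER-PUBLIC-IP>

The private key used must match the public key entered on the OpsWorks user configuration page; imported users do not authenticate with the cluster's key_name SSH key.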

Managing users after deployment

When an OpsWorks stack registers instances, an opsworks-agent service is installed on each of them. This agent keeps the list of users and their access/sudo permissions in sync with the stack configuration. To add users, we return to the OpsWorks console and import additional users from IAM into the stack (creating the IAM users first where necessary).

Warning: If we remove access to resources for a particular user, their user account is deleted from all instances within the stack and their home directory deleted as well. We can block SSH access for a user without deleting their account and home directory by replacing their SSH key with an invalid input (e.g., a random string) in the OpsWorks user management console. Note that the SSH key must not be empty; an empty key is equivalent to denying access via the stack configuration options, and results in the user account being deleted.
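Replacing a key in this way can also be scripted via the same UpdateUserProfile operation shown earlier; a sketch for a hypothetical user alice, with an arbitrary placeholder string as the key:

# Overwrite the stored public key with a non-empty, invalid value to block SSH
aws opsworks update-user-profile --region eu-west-1 \
    --iam-user-arn arn:aws:iam::<ACCOUNT-ID>:user/alice \
    --ssh-public-key "access-revoked"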

Remember that management of OpsWorks users is independent of stacks. If desired, multiple clusters can be registered with the same stack (for example, to give a single pool of users access to multiple clusters with different configurations). Alternatively, create a separate stack for each cluster to provide granular access control.

Conclusion

In this post, we walked through the process of creating an OpsWorks stack to manage SSH user access to AWS ParallelCluster. By using OpsWorks, account administrators gain a mechanism to associate IAM accounts with POSIX users on EC2 instances and to automate both key rotation and access control changes.

Chris Downing

Chris Downing is a Senior Consultant in the HPC Global Specialty Practice of AWS Professional Services.