AWS Open Source Blog

AWS ParallelCluster with AWS Directory Services Authentication

AWS ParallelCluster simplifies the creation and deployment of HPC clusters. In this post we combine ParallelCluster with AWS Directory Services to create a multi-user, POSIX-compliant system with centralized authentication and automated home directory creation.

To grant only the minimum permissions to the nodes in the cluster, no AD configuration parameters or permissions are stored directly on the cluster nodes. Instead, when a ParallelCluster node boots it automatically triggers an AWS Lambda function, which in turn uses AWS Systems Manager Parameter Store and AWS KMS to securely join the node to the domain. Users log in to the ParallelCluster nodes with their AD credentials.

VPC configuration for ParallelCluster

The VPC used for this configuration can be created using the “VPC Wizard” tool. You can also use an existing VPC that meets the AWS ParallelCluster network requirements.

In Select a VPC Configuration, choose VPC with Public and Private Subnets and then click Select.

Prior to starting the VPC Wizard, allocate an Elastic IP address; this will be used to configure a NAT gateway for the private subnet. A NAT gateway is required so that compute nodes in the AWS ParallelCluster private subnet can download the required packages and reach the public AWS service endpoints. See AWS ParallelCluster network requirements.
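
If you prefer the command line, the Elastic IP address can be allocated with the AWS CLI (a minimal sketch; note the AllocationId in the output, which the wizard asks for):

# Allocate an Elastic IP address for the NAT gateway
aws ec2 allocate-address --domain vpc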

Be sure to select two different Availability Zones for the public and private subnets. While this is not strictly required by ParallelCluster itself, we will later reuse these subnets for Simple AD, which requires subnets in two distinct Availability Zones.

You can find more details about VPC creation and configuration options in VPC with Public and Private Subnets (NAT).

AWS Directory Services configuration

For simplicity in this example, we will configure Simple AD as the directory service, but this solution will work with any Active Directory system.

Simple AD configuration is performed from the AWS Directory Service console. The required configuration steps are described in Getting Started with Simple AD.

For this example, set the Simple AD configuration as follows:

Directory DNS name: test.domain
Directory NetBIOS name: TEST
Administrator password: <Your DOMAIN password>

In the networking section, select the VPC and the two subnets created in the previous steps.
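
If you prefer to script this step, a directory with the same settings can be created with the AWS CLI (a sketch; substitute your own VPC ID, subnet IDs, and password):

# Create a Simple AD directory spanning the two subnets
aws ds create-directory \
    --name test.domain \
    --short-name TEST \
    --password '<Your DOMAIN password>' \
    --size Small \
    --vpc-settings VpcId=<your-vpc-id>,SubnetIds=<public-subnet-id>,<private-subnet-id>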

Once the directory has been created, open its entry in the Directory Service console to view the Directory details.

Make note of the DNS addresses listed in the directory details as these will be needed later (in this example, 10.0.0.92 and 10.0.1.215).
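
If you script the setup, the same DNS addresses can be retrieved with the AWS CLI (a sketch, assuming the directory name used above):

# Print the DNS servers of the test.domain directory
aws ds describe-directories \
    --query "DirectoryDescriptions[?Name=='test.domain'].DnsIpAddrs" \
    --output text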

DHCP options set for AD

In order for nodes to join the AD domain, a DHCP options set must be configured for the VPC, consistent with the domain name and DNS servers of the Simple AD service configured previously.

From the AWS VPC dashboard, set the following:

Name: custom DHCP options set
Domain name: test.domain eu-west-1.compute.internal
Domain name servers: 10.0.0.92, 10.0.1.215

The “Domain name” field must contain both the Simple AD domain and the AWS regional domain (of the region where the cluster and Simple AD are being configured), separated by a space.

You can now assign the new DHCP options set to the VPC.
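
The equivalent CLI steps look like this (a sketch; the dopt- and vpc- IDs are placeholders, and the quoting of the two-part domain name may need adjusting for your shell):

# Create the DHCP options set with the AD domain and the regional domain
aws ec2 create-dhcp-options \
    --dhcp-configurations \
    'Key=domain-name,Values="test.domain eu-west-1.compute.internal"' \
    'Key=domain-name-servers,Values=10.0.0.92,10.0.1.215'

# Associate it with the ParallelCluster VPC
aws ec2 associate-dhcp-options \
    --dhcp-options-id <your-dopt-id> \
    --vpc-id <your-vpc-id>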

How to manage users and groups in Simple Active Directory

See Manage Users and Groups in Simple AD. If you prefer to use a Linux OS for account management, see How to Manage Identities in Simple AD Directories for details.

Using AWS Key Management Service to secure AD Domain joining credentials

AWS Key Management Service is a secure and resilient service that uses FIPS 140-2 validated hardware security modules to protect your keys. This service will be used to generate a key and encrypt the domain joining password, as explained in the next section.

In the AWS Console, navigate to the AWS Key Management Service (KMS) and click on Create key.

In Display name for the key, write “SimpleADJoinPassword” and click Next, leaving the default settings for all other sections.

In Customer managed keys, take note of the created Key ID.
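
The key and alias can also be created from the CLI (a sketch; the KeyId is captured from the first command's output):

# Create a customer managed key for encrypting the join password
key_id=$(aws kms create-key \
    --description "SimpleADJoinPassword" \
    --query 'KeyMetadata.KeyId' --output text)

# Create the alias referenced by the Parameter Store step below
aws kms create-alias \
    --alias-name alias/SimpleADJoinPassword \
    --target-key-id "${key_id}"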

AWS Systems Manager Parameter Store

AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data management and secrets management. We will use it to securely store the domain join information: the domain name and the join password.

From the AWS console, access AWS Systems Manager and select Parameter Store. You need to create two parameters: DomainName, which contains the name of the domain, and DomainPassword, which contains the domain administrator password.

To create the first parameter, click on Create parameter and add the following information in the Parameter details section:

Name: DomainName
Type: String
Value: test.domain

Click on Create parameter to create the parameter.

You can now create the DomainPassword parameter with the following details:

Name: DomainPassword
Type: SecureString
KMS KEY ID: alias/SimpleADJoinPassword
Value: <your_ad_password>

Click on Create parameter to create it.
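
Both parameters can also be created from the CLI (a sketch, using the key alias created above):

# Plain-text parameter holding the domain name
aws ssm put-parameter --name DomainName --type String --value test.domain

# Encrypted parameter holding the domain administrator password
aws ssm put-parameter --name DomainPassword --type SecureString \
    --key-id alias/SimpleADJoinPassword --value '<your_ad_password>'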

Both parameters should now be listed in the Parameter Store console.

AWS ParallelCluster configuration

AWS ParallelCluster is an open source cluster management tool to deploy and manage HPC clusters in the AWS cloud; to get started, see Installing AWS ParallelCluster.

After the AWS ParallelCluster command line has been configured, create the cluster configuration file provided below in ~/.parallelcluster/config. The master_subnet_id contains the ID of the public subnet created earlier; compute_subnet_id contains the ID of the private one.

The ec2_iam_role is the role that will be used for all the instances of the cluster. The steps for creating this role will be explained in the next section.

[aws]
aws_region_name = eu-west-1

[cluster slurm]
scheduler = slurm
compute_instance_type = c5.large
initial_queue_size = 2
max_queue_size = 10
maintain_initial_size = false
base_os = alinux
key_name = AWS_Ireland
vpc_settings = public
ec2_iam_role = parallelcluster-custom-role
pre_install = s3://pcluster-scripts/pre_install.sh
post_install = s3://pcluster-scripts/post_install.sh

[vpc public]
master_subnet_id = subnet-01fc20e143543f8af
compute_subnet_id = subnet-0b1ae2790497d83ec
vpc_id = vpc-0cdee679c5a6163bd

[global]
update_check = true
sanity_check = true
cluster_template = slurm

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

The s3://pcluster-scripts bucket contains the pre- and post-install scripts required to configure the master and compute nodes inside the domain. Because bucket names are globally unique, create your own S3 bucket and replace s3://pcluster-scripts with your chosen name.
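
For example (a sketch; bucket names are globally unique, so choose your own):

# Create the bucket that will hold the pre- and post-install scripts
aws s3 mb s3://<your-bucket-name> --region eu-west-1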

The pre_install script installs the required packages and joins the node to the domain:

#!/bin/bash

# Install the packages required to join the domain
yum -y install sssd realmd krb5-workstation samba-common-tools
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
region=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//')
# Invoke the Lambda function that joins this Linux system to the domain
aws --region ${region} lambda invoke --function-name join-domain-function /tmp/out --payload '{"instance": "'${instance_id}'"}' --log-type None
# Wait until the node is visible in the realm
output=""
while [ -z "$output" ]
do
  sleep 5
  output=$(realm list)
done
# Allow users to log in without specifying the domain name
sed -i 's/use_fully_qualified_names = True/use_fully_qualified_names = False/g' /etc/sssd/sssd.conf
# Configure SSSD to create the home directories in the shared folder
mkdir -p /shared/home/
sed -i '/fallback_homedir/c\fallback_homedir = /home/%u' /etc/sssd/sssd.conf
sleep 1
service sssd restart
# Required so that AWS ParallelCluster parses the hostname correctly with the custom domain
sed -i "s/--fail \${local_hostname_url}/--fail \${local_hostname_url} | awk '{print \$1}'/g" /opt/parallelcluster/scripts/compute_ready

The post_install script configures the SSH service to accept password-based connections:

#!/bin/bash

# Enable password-based SSH logins for AD users
sed -i 's/PasswordAuthentication no//g' /etc/ssh/sshd_config
echo "PasswordAuthentication yes" >> /etc/ssh/sshd_config
sleep 1
service sshd restart

Copy the pre_install and post_install scripts into the S3 bucket created previously.
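
For example (a sketch, assuming the two scripts are in your current directory and the bucket name has been substituted):

aws s3 cp pre_install.sh s3://<your-bucket-name>/pre_install.sh
aws s3 cp post_install.sh s3://<your-bucket-name>/post_install.sh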

AD Domain join with AWS Lambda

AWS Lambda allows you to run code without provisioning or managing servers. Lambda is used in this solution to securely join the Linux node to the Simple AD domain.

You can create the function by following Create a Lambda Function with the Console.

For Function name, enter join-domain-function.

For Runtime, choose Python 2.7.

Choose Create function to create it.

Enter the following code in the Function code section, which you can find by scrolling down the page. Replace <REGION> with the correct value.

import json
import boto3

def lambda_handler(event, context):
    # Extract the instance ID sent by the node's pre_install script
    instance_id = event['instance']
    ssm_client = boto3.client('ssm', region_name="<REGION>")  # use the region in which you are working
    # Retrieve the domain name and the decrypted join password
    DomainName = ssm_client.get_parameter(Name='DomainName')
    DomainName_value = DomainName['Parameter']['Value']
    DomainPassword = ssm_client.get_parameter(Name='DomainPassword', WithDecryption=True)
    DomainPassword_value = DomainPassword['Parameter']['Value']
    # Join the node to the domain via Run Command, then remove the
    # orchestration output that would otherwise retain the password
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={
            'commands': [
                'echo "%s" | realm join -U Administrator@%s %s --verbose; '
                'rm -rf /var/lib/amazon/ssm/i-*/document/orchestration/*'
                % (DomainPassword_value, DomainName_value, DomainName_value)
            ]
        },
    )
    return {
        'statusCode': 200,
        'body': json.dumps('Command Executed!')
    }
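
Before relying on the cluster bootstrap, the function can be exercised by hand against a running, SSM-managed instance (a sketch; the instance ID is a placeholder):

# Invoke the join function manually and inspect the response
# (with AWS CLI v2 you may also need --cli-binary-format raw-in-base64-out)
aws lambda invoke \
    --function-name join-domain-function \
    --payload '{"instance": "i-0123456789abcdef0"}' \
    /tmp/out.json
cat /tmp/out.json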

In the Basic settings section, set the Timeout to 10 seconds.

Click on Save in the top right to save the function.

In the Execution role section, click the linked role to edit it in the IAM console.

In the newly opened tab, click on Attach policies and then Create policy.

The last action opens another new tab in your browser. In that tab, select the JSON tab.

Enter the following policy in the JSON editor, replacing <REGION>, <AWS ACCOUNT ID>, and <KEY ID> with the correct values.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter"
            ],
            "Resource": [
                "arn:aws:ssm:<REGION>:<AWS ACCOUNT ID>:parameter/DomainName"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter"
            ],
            "Resource": [
                "arn:aws:ssm:<REGION>:<AWS ACCOUNT ID>:parameter/DomainPassword"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:SendCommand"
            ],
            "Resource": [
                "arn:aws:ec2:<REGION>:<AWS ACCOUNT ID>:instance/*",
                "arn:aws:ssm:<REGION>::document/AWS-RunShellScript"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:<REGION>:<AWS ACCOUNT ID>:key/<KEY ID>"
            ]
        }
    ]
}

In the next section, enter “GetJoinCredentials” as the Name and click Create policy.

Close the current tab and move to the previous one to select the policy for the Lambda role.

Refresh the list, select the GetJoinCredentials policy, and click Attach policy.
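
If you prefer to script the IAM setup, the same policy can be created and attached with the CLI (a sketch; save the JSON above as GetJoinCredentials.json, and copy your Lambda execution role name from the console):

# Create the policy from the JSON document above
aws iam create-policy \
    --policy-name GetJoinCredentials \
    --policy-document file://GetJoinCredentials.json

# Attach it to the Lambda execution role
aws iam attach-role-policy \
    --role-name <lambda-execution-role-name> \
    --policy-arn arn:aws:iam::<AWS ACCOUNT ID>:policy/GetJoinCredentials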

IAM custom Roles for Lambda and SSM endpoints

To allow ParallelCluster nodes to call Lambda and SSM endpoints, you need to configure a custom IAM Role.

See AWS Identity and Access Management Roles in AWS ParallelCluster for details on the default AWS ParallelCluster policy.

From the AWS console:

  • Access the AWS Identity and Access Management (IAM) service and click on Policies.
  • Choose Create policy and, in the JSON section, paste the following policy. Be sure to modify <REGION> and <AWS ACCOUNT ID> to match the values for your account, and update the S3 bucket name from pcluster-scripts to the name you chose earlier.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "ec2:DescribeVolumes",
                "ec2:AttachVolume",
                "ec2:DescribeInstanceAttribute",
                "ec2:DescribeInstanceStatus",
                "ec2:DescribeInstances",
                "ec2:DescribeRegions"
            ],
            "Sid": "EC2",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "dynamodb:ListTables"
            ],
            "Sid": "DynamoDBList",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:sqs:<REGION>:<AWS ACCOUNT ID>:parallelcluster-*"
            ],
            "Action": [
                "sqs:SendMessage",
                "sqs:ReceiveMessage",
                "sqs:ChangeMessageVisibility",
                "sqs:DeleteMessage",
                "sqs:GetQueueUrl"
            ],
            "Sid": "SQSQueue",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:DescribeTags",
                "autoScaling:UpdateAutoScalingGroup",
                "autoscaling:SetInstanceHealth"
            ],
            "Sid": "Autoscaling",
            "Effect": "Allow"
        },
        {
            "Action": [
                "cloudformation:DescribeStacks",
                "cloudformation:DescribeStackResource"
            ],
            "Resource": [
                "arn:aws:cloudformation:<REGION>:<AWS ACCOUNT ID>:stack/parallelcluster-*/*"
            ],
            "Effect": "Allow",
            "Sid": "CloudFormation"
        },
        {
            "Resource": [
                "arn:aws:dynamodb:<REGION>:<AWS ACCOUNT ID>:table/parallelcluster-*"
            ],
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:Query",
                "dynamodb:GetItem",
                "dynamodb:DeleteItem",
                "dynamodb:DescribeTable"
            ],
            "Sid": "DynamoDBTable",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:s3:::<REGION>-aws-parallelcluster/*"
            ],
            "Action": [
                "s3:GetObject"
            ],
            "Sid": "S3GetObj",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:cloudformation:<REGION>:<AWS ACCOUNT ID>:stack/parallelcluster-*"
            ],
            "Action": [
                "cloudformation:DescribeStacks"
            ],
            "Sid": "CloudFormationDescribe",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "sqs:ListQueues"
            ],
            "Sid": "SQSList",
            "Effect": "Allow"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeAssociation",
                "ssm:GetDeployablePatchSnapshotForInstance",
                "ssm:GetDocument",
                "ssm:DescribeDocument",
                "ssm:GetManifest",
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:ListAssociations",
                "ssm:ListInstanceAssociations",
                "ssm:PutInventory",
                "ssm:PutComplianceItems",
                "ssm:PutConfigurePackageResult",
                "ssm:UpdateAssociationStatus",
                "ssm:UpdateInstanceAssociationStatus",
                "ssm:UpdateInstanceInformation"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:OpenDataChannel"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2messages:AcknowledgeMessage",
                "ec2messages:DeleteMessage",
                "ec2messages:FailMessage",
                "ec2messages:GetEndpoint",
                "ec2messages:GetMessages",
                "ec2messages:SendReply"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": "arn:aws:lambda:<REGION>:<AWS ACCOUNT ID>:function:join-domain-function"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::pcluster-scripts/*"
            ]
        }
    ]
}

Click Review policy, and in the next section enter “parallelcluster-custom-policy” as the Name string. Click Create policy.

Now you can finally create the role. Choose Roles in the left menu and then Create role.

Select AWS service as the type of trusted entity, and EC2 as the service that will use this role.

Choose Next to proceed with the creation process.

In the policy selection, select the parallelcluster-custom-policy that was just created.

Click through the Next: Tags and then Next: Review pages.

In the Role name box, enter “parallelcluster-custom-role” and confirm with the Create role button.
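
The equivalent CLI steps are sketched below; note that, unlike the console, the CLI does not create the matching instance profile for you, so it is added explicitly (an assumption worth verifying against your ParallelCluster version):

# Trust policy allowing EC2 to assume the role
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
    --role-name parallelcluster-custom-role \
    --assume-role-policy-document file://trust.json

aws iam attach-role-policy \
    --role-name parallelcluster-custom-role \
    --policy-arn arn:aws:iam::<AWS ACCOUNT ID>:policy/parallelcluster-custom-policy

# Instance profile with the same name, as the console would create
aws iam create-instance-profile \
    --instance-profile-name parallelcluster-custom-role
aws iam add-role-to-instance-profile \
    --instance-profile-name parallelcluster-custom-role \
    --role-name parallelcluster-custom-role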

Deploy ParallelCluster

The cluster can now be created using the following command line:

pcluster create -t slurm slurmcluster

-t slurm indicates which section of the cluster template to use. slurmcluster is the name of the cluster that will be created. For more details, see the AWS ParallelCluster Documentation. A detailed explanation of the pcluster command line parameters can be found in AWS ParallelCluster CLI Commands.

You can now connect to the Master node of the cluster with any Simple AD user and run the desired workload.
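
For example, for a hypothetical Simple AD user alice (the master node's public IP is printed at the end of pcluster create):

# Log in to the master node with AD credentials
ssh alice@<master-public-ip>

# Once logged in, verify that Slurm sees the compute nodes
sinfo
srun -N 2 hostname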

Teardown

When you have finished your computation, the cluster can be destroyed using the following command:

pcluster delete slurmcluster

The additional resources that were created can be destroyed by following the instructions in the AWS documentation for each service.

Conclusion

This blog post has shown you how to deploy and integrate Simple AD with AWS ParallelCluster, allowing cluster nodes to be securely and automatically joined to a domain to provide centralized user authentication. This solution encrypts and stores the domain joining credentials using AWS Systems Manager Parameter Store with AWS KMS, and uses AWS Lambda at node boot to join the AD Domain.

Dario La Porta

Dario La Porta is a Senior HPC Professional Services Consultant at Amazon Web Services. He started out working for Nice Software in 2011 and joined AWS in 2016 through the acquisition of Nice Software. He helps customers on their HPC migration journey to the AWS Cloud.