AWS Compute Blog

Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch

This post is contributed by Sean Smith, Software Development Engineer II, AWS ParallelCluster, and Arya Hezarkhani, Software Development Engineer II, AWS Batch and HPC.

 

On August 2, 2019, AWS Batch announced support for Elastic Fabric Adapter (EFA). This enables you to run highly performant, distributed high performance computing (HPC) and machine learning (ML) workloads by using AWS Batch’s managed resource provisioning and job scheduling.

EFA is a network interface for Amazon EC2 instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system (OS) bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling these applications. With EFA, HPC applications using the Message Passing Interface (MPI) and ML applications using NVIDIA Collective Communications Library (NCCL) can scale to thousands of cores or GPUs. As a result, you get the application performance of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS Cloud.

AWS Batch is a cloud-native batch scheduler that manages instance provisioning and job scheduling. AWS Batch automatically provisions instances according to job specifications, with the appropriate placement group, networking configurations, and any user-specified file system. It also automatically sets up the EFA interconnect to the instances it launches, which you specify through a single launch template parameter.

In this post, we walk through the setup of EFA on AWS Batch and run the NAS Parallel Benchmark (NPB), a benchmark suite that evaluates the performance of parallel supercomputers, using the open source implementation of MPI, OpenMPI.

 

Prerequisites

This walk-through assumes that you have an AWS account and that the AWS CLI is installed and configured.

 

Configuring your compute environment

First, configure your compute environment to launch instances with the EFA device.

Creating an EC2 placement group

The first step is to create a cluster placement group, a logical grouping of instances within a single Availability Zone. The chief benefit of a cluster placement group is non-blocking, non-oversubscribed, fully bisectional network connectivity. Use a Region that supports EFA; at the time of writing, those are us-east-1, us-east-2, us-west-2, and eu-west-1. Run the following command:

$ aws ec2 create-placement-group --group-name "efa" --strategy "cluster" --region [your-region]
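
Optionally, verify that the placement group exists before moving on; this check is not part of the original walk-through:

$ aws ec2 describe-placement-groups --group-names "efa" --region [your-region]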

Creating an EC2 launch template

Next, create a launch template that contains a user-data script to install EFA libraries onto the instance. Launch templates enable you to store launch parameters so that you do not have to specify them every time you launch an instance. This will be the launch template used by AWS Batch to scale the necessary compute resources in your AWS Batch Compute Environment.

First, encode the user data in base64. This example uses the base64 CLI utility to do so.

$ echo "MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii" 
cloud-init-per once yum_wget yum install -y wget
cloud-init-per once wget_efa wget -q --timeout=20 https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-latest.tar.gz -O /tmp/aws-efa-installer-latest.tar.gz

cloud-init-per once tar_efa tar -xf /tmp/aws-efa-installer-latest.tar.gz -C /tmp 
pushd /tmp/aws-efa-installer
cloud-init-per once install_efa ./efa_installer.sh -y 
pop /tmp/aws-efa-installer

cloud-init-per once efa_info /opt/amazon/efa/bin/fi_info -p efa

--==MYBOUNDARY==--" | base64

 

Save the base64-encoded output, because you need it to create the launch template.

Next, make sure that your default security group is configured correctly. On the EC2 console, select the default security group associated with your default VPC and edit the inbound rules to allow SSH and all traffic from the security group itself. The source must be set explicitly to the security group ID for EFA to work, as seen in the following screenshot.

SecurityGroupInboundRules

 

Then edit the outbound rules and add a rule that allows all outbound traffic to the security group itself, as seen in the following screenshot. This is also a requirement for EFA to work.

SecurityGroupOutboundRules

 
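If you prefer to script these rules instead of using the console, the following AWS CLI sketch does the same thing; sg-xxxxxxxx is a placeholder for your default security group ID:

# allow SSH (adjust the source CIDR to your needs)
$ aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 22 --cidr 0.0.0.0/0

# allow all inbound traffic from the security group itself (required for EFA)
$ aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=sg-xxxxxxxx}]'

# allow all outbound traffic to the security group itself (required for EFA)
$ aws ec2 authorize-security-group-egress --group-id sg-xxxxxxxx --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=sg-xxxxxxxx}]'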

Now, create ecsInstanceRole, the Amazon ECS instance profile that is applied to the Amazon EC2 instances in a Compute Environment. To create the role on the IAM console, follow these steps; an equivalent CLI sketch appears after the list.

  1. Choose Roles, then Create Role.
  2. Select AWS Service, then EC2.
  3. Choose Permissions.
  4. Attach the managed policy AmazonEC2ContainerServiceforEC2Role.
  5. Name the role ecsInstanceRole.
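
If you prefer the command line, here is a rough equivalent sketch. It assumes the standard EC2 trust policy and attaches the AWS managed AmazonEC2ContainerServiceforEC2Role policy; adjust as needed:

# create the role with an EC2 trust policy
$ aws iam create-role --role-name ecsInstanceRole \
    --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

# attach the managed policy
$ aws iam attach-role-policy --role-name ecsInstanceRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

# create the instance profile and add the role to it
$ aws iam create-instance-profile --instance-profile-name ecsInstanceRole
$ aws iam add-role-to-instance-profile --instance-profile-name ecsInstanceRole --role-name ecsInstanceRole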

 

You will create the launch template using the ID of the security group, the ID of a subnet in your default VPC, and the ecsInstanceRole that you created.

Next, choose an instance type that supports EFA, denoted by the "n" in the instance name. This example uses c5n.18xlarge instances.

You also need an Amazon Machine Image (AMI) ID. This example uses the latest ECS-optimized AMI based on Amazon Linux 2. Grab the AMI ID that corresponds to the Region that you are using.
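
One way to look up the latest ECS-optimized Amazon Linux 2 AMI ID is through its public SSM parameter; this is an optional convenience, not part of the original walk-through (substitute your Region):

$ aws ssm get-parameters \
    --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
    --region [your-region] --query "Parameters[0].Value" --output text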

This example uses UserData to install EFA. This adds 1.5 minutes of bootstrap time to the instance launch. In production workloads, bake the EFA installation into the AMI to avoid this additional bootstrap delay.

Now create a file called launch_template.json with the following content, making sure to substitute your account ID, security group ID, and subnet ID (add an ImageId and KeyName as well if you want to pin the AMI and SSH key in the template).

{
    "LaunchTemplateName": "EFA-Batch-LaunchTemplate",
    "LaunchTemplateData": {
        "InstanceType": "c5n.18xlarge",
        "IamInstanceProfile": {
            "Arn": "arn:aws:iam::<Account Id>:instance-profile/ecsInstanceRole"
        },
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                "Groups": [
                    "<Security Group>"
                ],
                "SubnetId": "<Subnet Id>",
                "InterfaceType": "efa",
                "Description": "NetworkInterfaces Configuration For EFA and Batch"
            }
        ],
        "Placement": {
            "GroupName": "efa"
        },
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [
                    {
                        "Key": "from-lt",
                        "Value": "networkInterfacesConfig-EFA-Batch"
                    }
                ]
            }
        ],
        "UserData": "TUlNRS1WZXJzaW9uOiAxLjAKQ29udGVudC1UeXBlOiBtdWx0aXBhcnQvbWl4ZWQ7IGJvdW5kYXJ5PSI9PU1ZQk9VTkRBUlk9PSIKCi0tPT1NWUJPVU5EQVJZPT0KQ29udGVudC1UeXBlOiB0ZXh0L2Nsb3VkLWJvb3Rob29rOyBjaGFyc2V0PSJ1cy1hc2NpaSIKCmNsb3VkLWluaXQtcGVyIG9uY2UgeXVtX3dnZXQgeXVtIGluc3RhbGwgLXkgd2dldAoKY2xvdWQtaW5pdC1wZXIgb25jZSB3Z2V0X2VmYSB3Z2V0IC1xIC0tdGltZW91dD0yMCBodHRwczovL3MzLXVzLXdlc3QtMi5hbWF6b25hd3MuY29tL2F3cy1lZmEtaW5zdGFsbGVyL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLU8gL3RtcC9hd3MtZWZhLWluc3RhbGxlci1sYXRlc3QudGFyLmd6CgpjbG91ZC1pbml0LXBlciBvbmNlIHRhcl9lZmEgdGFyIC14ZiAvdG1wL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLUMgL3RtcAoKcHVzaGQgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgpjbG91ZC1pbml0LXBlciBvbmNlIGluc3RhbGxfZWZhIC4vZWZhX2luc3RhbGxlci5zaCAteQpwb3AgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgoKY2xvdWQtaW5pdC1wZXIgb25jZSBlZmFfaW5mbyAvb3B0L2FtYXpvbi9lZmEvYmluL2ZpX2luZm8gLXAgZWZhCgotLT09TVlCT1VOREFSWT09LS0K"
    }
}

Create a launch template from that file:

$ aws ec2 create-launch-template --cli-input-json file://launch_template.json
{
    "LaunchTemplate": {
        "LatestVersionNumber": 1,
        "LaunchTemplateId": "lt-*****************",
        "LaunchTemplateName": "EFA-Batch-LaunchTemplate",
        "DefaultVersionNumber": 1,
        "CreatedBy": "arn:aws:iam::************:user/desktop-user",
        "CreateTime": "2019-09-23T13:00:21.000Z"
    }
}

Creating a compute environment

Next, create an AWS Batch compute environment. This uses the information from the EFA-Batch-LaunchTemplate launch template created earlier. Create a file called compute_environment.json with the following content.

{
    "computeEnvironmentName": "EFA-Batch-ComputeEnvironment",
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 2088,
        "desiredvCpus": 0,
        "instanceTypes": [
            "c5n.18xlarge"
        ],
        "subnets": [
            "<same-subnet-as-in-LaunchTemplate>"
        ],
        "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
        "launchTemplate": {
            "launchTemplateName": "EFA-Batch-LaunchTemplate",
            "version": "$Latest"
        }
    },
    "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole"
}

Now, create the compute environment:

$ aws batch create-compute-environment --cli-input-json file://compute_environment.json
{
    "computeEnvironmentName": "EFA-Batch-ComputeEnvironment",
    "computeEnvironmentArn": "arn:aws:batch:us-east-1:<Account Id>:compute-environment"
}
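
Before submitting jobs, you can optionally confirm that the compute environment is healthy; the state should be ENABLED and the status VALID:

$ aws batch describe-compute-environments \
    --compute-environments EFA-Batch-ComputeEnvironment \
    --query "computeEnvironments[0].[state,status]"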

Building the container image

To build the container image, first clone the repository that contains the Dockerfile used in this example:

$ git clone https://github.com/aws-samples/aws-batch-efa.git

In that repository, there are several files, one of which is the following Dockerfile.

FROM amazonlinux:1 
ENV USER efauser

RUN yum update -y
RUN yum install -y which util-linux make tar.x86_64 iproute2 gcc-gfortran openssh-server
RUN pip-2.7 install supervisor

RUN useradd -ms /bin/bash $USER 
ENV HOME /home/$USER

##################################################### 
## SSH SETUP
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
&& touch ${SSHDIR}/sshd_config \
&& ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
&& cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \ 
&& cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
&& echo "	IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \ 
&& echo "	StrictHostKeyChecking no" >> ${SSHDIR}/config \
&& echo "	UserKnownHostsFile /dev/null" >> ${SSHDIR}/config \ 
&& echo "	Port 2022" >> ${SSHDIR}/config \
&& echo 'Port 2022' >> ${SSHDIR}/sshd_config \
&& echo 'UsePrivilegeSeparation no' >> ${SSHDIR}/sshd_config \
&& echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \ && echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
&& chmod -R 600 ${SSHDIR}/* \
&& chown -R ${USER}:${USER} ${SSHDIR}/

# check if ssh agent is running or not, if not, run
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

################################################# 
## EFA and MPI SETUP
RUN curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-1.5.0.tar.gz \
&& tar -xf aws-efa-installer-1.5.0.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify

RUN wget https://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz \ 
&& tar xzf NPB3.3.1.tar.gz
COPY make.def_efa /NPB3.3.1/NPB3.3-MPI/config/make.def
COPY suite.def /NPB3.3.1/NPB3.3-MPI/config/suite.def

RUN cd /NPB3.3.1/NPB3.3-MPI \
&& make suite \
&& chmod -R 755 /NPB3.3.1/NPB3.3-MPI/

###################################################
## supervisor container startup

ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf
ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh
RUN chmod 755 supervised-scripts/mpi-run.sh

EXPOSE 2022
ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh 
RUN chmod 755 batch-runtime-scripts/entry-point.sh

CMD /batch-runtime-scripts/entry-point.sh

To build this Dockerfile, run the included Makefile with:

make

Now, push the container image to Amazon Elastic Container Registry (ECR) so that you can use it in your AWS Batch job definition.

From the AWS CLI, create an ECR repository called aws-batch-efa:

$ aws ecr create-repository --repository-name aws-batch-efa

{
    "repository": {
        "registryId": "<Account-Id>",
        "repositoryName": "aws-batch-efa",
        "repositoryArn": "arn:aws:ecr:us-east-2:<Account-Id>:repository/aws-batch-efa",
        "createdAt": 1568154893.0,
        "repositoryUri": "<Account-Id>.dkr.ecr.us-east-2.amazonaws.com/aws-batch-efa"
    }
}

 

Edit the top of the Makefile and add your AWS account ID and AWS Region.

AWS_REGION=<REGION>
ACCOUNT_ID=<ACCOUNT-ID>

To push the image to the ECR repository, run:

make tag
make push
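
The tag and push targets wrap standard Docker and ECR commands. If you prefer to run them by hand, the following is a rough sketch of the equivalent steps, assuming a recent AWS CLI that supports ecr get-login-password; <ACCOUNT-ID> and <REGION> are placeholders:

# authenticate Docker to your ECR registry
$ aws ecr get-login-password --region <REGION> | docker login --username AWS --password-stdin <ACCOUNT-ID>.dkr.ecr.<REGION>.amazonaws.com

# build, tag, and push the image
$ docker build -t aws-batch-efa .
$ docker tag aws-batch-efa:latest <ACCOUNT-ID>.dkr.ecr.<REGION>.amazonaws.com/aws-batch-efa:latest
$ docker push <ACCOUNT-ID>.dkr.ecr.<REGION>.amazonaws.com/aws-batch-efa:latest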

Run the application

To run the application using AWS Batch multi-node parallel jobs, follow these steps.

Setting up the AWS Batch multi-node job definition

Set up the AWS Batch multi-node job definition and expose the EFA device to the container by following these steps.

First, create a file called job_definition.json with the following contents. This file holds the configurations for the AWS Batch JobDefinition. Specifically, this JobDefinition uses the newly supported field LinuxParameters.Devices to expose a particular device—in this case, the EFA device path /dev/infiniband/uverbs0—to the container. Be sure to substitute the image URI with the one you pushed to ECR in the previous step. This is used to start the container.

{
    "jobDefinitionName": "EFA-MPI-JobDefinition",
    "type": "multinode",
    "nodeProperties": {
        "numNodes": 8,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:",
                "container": {
                    "user": "efauser",
                    "image": "<Docker Image From Previous Section>",
                    "vcpus": 72,
                    "memory": 184320,
                    "linuxParameters": {
                        "devices": [
                            {
                                "hostPath": "/dev/infiniband/uverbs0"
                            }
                        ]
                    },
                    "ulimits": [
                        {
                            "hardLimit": -1,
                            "name": "memlock",
                            "softLimit": -1
                        }
                    ]
                }
            }
        ]
    }
}

Now, register the job definition:

$ aws batch register-job-definition --cli-input-json file://job_definition.json
{
    "jobDefinitionArn": "arn:aws:batch:us-east-1:<account-id>:job-definition/EFA-MPI-JobDefinition",
    "jobDefinitionName": "EFA-MPI-JobDefinition",
    "revision": 1
}

 

Run the job

Next, create a job queue that points to the compute environment created earlier. When jobs are submitted to the queue, they wait until instances are available to run them. Create a file called job_queue.json with the following content.

{
    "jobQueueName": "EFA-Batch-JobQueue",
    "state": "ENABLED",
    "priority": 10,
    "computeEnvironmentOrder": [
        {
            "order": 1,
            "computeEnvironment": "EFA-Batch-ComputeEnvironment"
        }
    ]
}

$ aws batch create-job-queue --cli-input-json file://job_queue.json

Now that you’ve created all the resources, submit the job. The numNodes=8 override tells AWS Batch to run the job across eight nodes.

$ aws batch submit-job --job-name example-mpi-job --job-queue EFA-Batch-JobQueue --job-definition EFA-MPI-JobDefinition --node-overrides numNodes=8
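
After submission, you can optionally track the job from the CLI; substitute the job ID returned by submit-job:

# list jobs in the queue (use --job-status to filter by state)
$ aws batch list-jobs --job-queue EFA-Batch-JobQueue

# inspect a specific job's status
$ aws batch describe-jobs --jobs <job-id> --query "jobs[0].[status,statusReason]"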

NPB overview

NPB is a small set of benchmarks derived from computational fluid dynamics (CFD) applications. It consists of five kernels and three pseudo-applications. This example runs the 3D Fast Fourier Transform (FFT) benchmark, as it tests all-to-all communication. For this run, use c5n.18xlarge instances, as configured in the compute environment earlier. This instance type is an excellent choice for the workload because it has an Intel Skylake processor (72 hyperthreaded cores) and supports 100 Gbps of network throughput, which you can take advantage of with EFA.

 

This test runs the FT “C” benchmark across eight nodes * 72 vCPUs = 576 vCPUs.

NAS Parallel Benchmarks 3.3 -- FT Benchmark

No input file inputft.data. Using compiled defaults
Size                : 512x 512x 512
Iterations          : 20
Number of processes : 512
Processor array     : 1x 512
Layout type         : 1D

Initialization time = 1.3533580760000063

T = 1 Checksum = 5.195078707457D+02 5.149019699238D+02
T = 2 Checksum = 5.155422171134D+02 5.127578201997D+02 
T = 3 Checksum = 5.144678022222D+02 5.122251847514D+02 
T = 4 Checksum = 5.140150594328D+02 5.121090289018D+02 
T = 5 Checksum = 5.137550426810D+02 5.121143685824D+02 
T = 6 Checksum = 5.135811056728D+02 5.121496764568D+02 
T = 7 Checksum = 5.134569343165D+02 5.121870921893D+02 
T = 8 Checksum = 5.133651975661D+02 5.122193250322D+02 
T = 9 Checksum = 5.132955192805D+02 5.122454735794D+02 
T = 10 Checksum = 5.132410471738D+02 5.122663649603D+02 
T = 11 Checksum = 5.131971141679D+02 5.122830879827D+02 
T = 12 Checksum = 5.131605205716D+02 5.122965869718D+02 
T = 13 Checksum = 5.131290734194D+02 5.123075927445D+02 
T = 14 Checksum = 5.131012720314D+02 5.123166486553D+02 
T = 15 Checksum = 5.130760908195D+02 5.123241541685D+02 
T = 16 Checksum = 5.130528295923D+02 5.123304037599D+02 
T = 17 Checksum = 5.130310107773D+02 5.123356167976D+02 
T = 18 Checksum = 5.130103090133D+02 5.123399592211D+02 
T = 19 Checksum = 5.129905029333D+02 5.123435588985D+02 
T = 20 Checksum = 5.129714421109D+02 5.123465164008D+02
Result verification successful
class = C

FT Benchmark Completed.
Class           = C
Size            = 512x 512x 512
Iterations      = 20
Time in seconds = 1.92
Total processes = 512
Compiled procs  = 512
Mop/s total     = 206949.17
Mop/s/process   = 404.20
Operation type  = floating point
Verification    = SUCCESSFUL

Summary

In this post, we covered how to run MPI Batch jobs with an EFA-enabled elastic network interface using AWS Batch multi-node parallel jobs and an EC2 launch template. We used a launch template to configure the AWS Batch compute environment to launch an instance with the EFA device installed. We showed you how to expose the EFA device to the container. You also learned how to package an MPI benchmarking application, the NPB, as a Docker container, and how to run the application as an AWS Batch multi-node parallel job.

We hope you found the information in this post helpful and encouraging as to all the possibilities for HPC on AWS.