Elastic Fabric Adapter (EFA)
Q: Why should I use EFA?
EFA brings the scalability, flexibility, and elasticity of cloud to tightly-coupled HPC applications. With EFA, tightly-coupled HPC applications have access to lower and more consistent latency and higher throughput than traditional TCP channels, enabling them to scale better. EFA support can be enabled dynamically, on-demand on any supported EC2 instance without pre-reservation, giving you the flexibility to respond to changing business/workload priorities.
Q: What types of applications can benefit from using EFA?
High Performance Computing (HPC) applications distribute computational workloads across a cluster of instances for parallel processing. Examples of HPC applications include computational fluid dynamics (CFD), crash simulations, and weather simulations. HPC applications are generally written using the Message Passing Interface (MPI) and impose stringent requirements for inter-instance communication in terms of both latency and bandwidth. Applications using MPI and other HPC middleware that supports the libfabric communication stack can benefit from EFA.
Q: How does EFA communication work?
EFA devices provide all the functionality of ENA devices, plus a new OS-bypass hardware interface that allows user-space applications to communicate directly with the hardware-provided reliable transport functionality. Most applications will use existing middleware, such as the Message Passing Interface (MPI), to interface with EFA. AWS has worked with a number of middleware providers to ensure support for the OS-bypass functionality of EFA. Please note that communication using the OS-bypass functionality is limited to instances within a single subnet of a Virtual Private Cloud (VPC).
Q: Which instance types support EFA?
EFA is currently available on c5n.18xlarge, c5n.metal, i3en.24xlarge, i3en.metal, inf1.24xlarge, m5dn.24xlarge, m5n.24xlarge, r5dn.24xlarge, r5n.24xlarge, p3dn.24xlarge, p4d, m6i.32xlarge, m6i.metal, c6i.32xlarge, c6i.metal, r6i.32xlarge, r6i.metal, x2iezn.12xlarge, x2iezn.metal, and hpc6a.48xlarge instances.
Q: What are the differences between an EFA ENI and an ENA ENI?
An ENA ENI provides traditional IP networking features necessary to support VPC networking. An EFA ENI provides all the functionality of an ENA ENI, plus hardware support for applications to communicate directly with the EFA ENI without involving the instance kernel (OS-bypass communication) using an extended programming interface. Due to the advanced capabilities of the EFA ENI, EFA ENIs can only be attached at launch or to stopped instances.
Q: EFA and ENA Express both use SRD. How do they differ at the transport layer?
EFA and ENA Express both use the Scalable Reliable Datagram (SRD) protocol, built by AWS. EFA is purpose-built for tightly coupled workloads and exposes hardware-provided transport communication directly to the application layer. ENA Express is designed to use the SRD protocol transparently for traditional networking applications that use the TCP and UDP protocols.
Q: What are the prerequisites for enabling EFA on an instance?
EFA support can be enabled either at the launch of the instance or added to a stopped instance. EFA devices cannot be attached to a running instance.
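As a minimal sketch of enabling EFA at launch, the helper below assembles the request parameters for an EC2 RunInstances call with a single EFA network interface. The function name, the AMI ID, the subnet ID, and the security group ID are all hypothetical placeholders; the code only builds a dictionary and does not contact AWS.

```python
def build_efa_launch_params(image_id, instance_type, subnet_id,
                            security_group_ids, count=1):
    """Return keyword arguments for an EC2 RunInstances request that
    attaches one EFA interface at launch (hypothetical helper)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                # "efa" requests an EFA interface instead of a plain ENA one.
                "InterfaceType": "efa",
                # All instances share one subnet, since OS-bypass traffic
                # is limited to a single subnet of the VPC.
                "SubnetId": subnet_id,
                "Groups": list(security_group_ids),
            }
        ],
    }


# Placeholder identifiers; replace with values from your own account.
params = build_efa_launch_params(
    image_id="ami-EXAMPLE",
    instance_type="c5n.18xlarge",
    subnet_id="subnet-EXAMPLE",
    security_group_ids=["sg-EXAMPLE"],
    count=2,
)
print(params["NetworkInterfaces"][0]["InterfaceType"])
```

If boto3 is installed and credentials are configured, these parameters could be passed as `boto3.client("ec2").run_instances(**params)`. The security group would also need to allow traffic to and from itself for EFA communication to work.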
Q: Why should I use NICE DCV?
NICE DCV is a graphics-optimized streaming protocol that is well suited for a wide range of usage scenarios, from streaming productivity applications on mobile devices to HPC simulation visualization. On the server side, NICE DCV supports Windows and Linux. On the client side, it supports Windows, Linux, and macOS, and it provides a web client for HTML5 browser-based access across devices.
Q: Do I need to download a native client to use NICE DCV?
No. NICE DCV works with any HTML5 web browser. However, native clients support additional features such as multi-monitor support, and the Windows native client also supports USB devices such as 3D mice, storage devices, and smart cards. For workflows needing these features, you can download NICE DCV native clients for Windows, Linux, and macOS from the NICE DCV download page.
Q: What types of applications can benefit from using NICE DCV?
While NICE DCV's performance is application agnostic, customers observe a perceptible streaming performance benefit when using NICE DCV with 3D graphics-intensive applications that require low latency. HPC applications like seismic and reservoir simulations, computational fluid dynamics (CFD) analyses, 3D molecular modeling, VFX compositing, and Game Engine based 3D rendering are some examples of applications wherein NICE DCV's performance benefit is apparent.
Q: Which instance types support NICE DCV?
NICE DCV is supported on all Amazon EC2 x86-64 architecture based instance types. When used with NVIDIA GRID compatible GPU instances (such as G2, G3, and G4), NICE DCV will leverage hardware encoding to improve performance and reduce system load.
Enabling NICE DCV
Q: Do I need to install a NICE DCV license server when using NICE DCV on Amazon EC2?
No, you do not need a license server to install and use the NICE DCV server on an EC2 instance. However, you need to configure your instance to guarantee access to an Amazon S3 bucket. The NICE DCV server automatically detects that it is running on an Amazon EC2 instance and periodically connects to the Amazon S3 bucket to determine whether a valid license is available. For further instructions on NICE DCV license setup on Amazon EC2, refer to the document here.
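The S3 access described above is usually granted through an IAM policy attached to the instance role. The sketch below builds such a policy document; the bucket naming pattern `dcv-license.<region>` is an assumption recalled from the NICE DCV licensing guide and should be verified against that guide before use. The function name is hypothetical.

```python
import json


def dcv_license_policy(region):
    """Build an IAM policy document that lets an instance read the
    NICE DCV license objects for one region (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # Assumed bucket pattern; confirm it in the licensing guide.
                "Resource": f"arn:aws:s3:::dcv-license.{region}/*",
            }
        ],
    }


policy = dcv_license_policy("us-east-1")
print(json.dumps(policy, indent=2))
```

The resulting JSON can be attached to the IAM role used by the instance profile of the EC2 instance running the NICE DCV server.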
Q: Can I enable NICE DCV on a running instance?
Yes. NICE DCV is downloadable software that can be installed on running instances. It is available from the NICE DCV download page.
Q: What Windows and Linux distributions does NICE DCV server support?
NICE DCV server's OS support is documented here.
Using NICE DCV
Q: How can I monitor NICE DCV’s real-time performance?
NICE DCV clients display a toolbar ribbon at the top of the remote session when not in full-screen mode. Click Settings >> Streaming Mode. This opens a window that lets users choose between “Best responsiveness” (default) and “Best quality”. Click “Display Streaming Metrics” at the bottom of the pop-up window to monitor real-time performance: frame rate, network latency, and bandwidth usage.
Q: How do I manage the NICE DCV server?
The NICE DCV server runs as an operating system service. You must be logged in as the administrator (Windows) or root (Linux) to start, stop, or configure the NICE DCV server. For more information refer to the document here.
Q: What port does NICE DCV communicate on?
By default, the NICE DCV server is configured to communicate over port 8443. You can specify a custom TCP port after you have installed the NICE DCV server. The port must be greater than 1024.
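The port rule above can be expressed as a small check. This is an illustrative sketch only: the function and constant names are hypothetical, and the upper bound of 65535 is the general TCP port limit rather than something stated for NICE DCV specifically.

```python
DEFAULT_DCV_PORT = 8443  # default port stated above


def validate_dcv_port(port=DEFAULT_DCV_PORT):
    """Return the port if it is an acceptable custom TCP port for the
    NICE DCV server, otherwise raise an error (hypothetical helper)."""
    if isinstance(port, bool) or not isinstance(port, int):
        raise TypeError("port must be an integer")
    if port <= 1024:
        raise ValueError("port must be greater than 1024")
    if port > 65535:
        raise ValueError("port must not exceed 65535 (TCP port range)")
    return port


print(validate_dcv_port())      # default port passes the check
print(validate_dcv_port(9443))  # a custom port above 1024 also passes
```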
Q: How do I enable GPU sharing on Linux using NICE DCV?
GPU sharing enables you to share one or more physical GPUs between multiple NICE DCV virtual sessions. Using GPU sharing enables you to use a single NICE DCV server and host multiple virtual sessions that share the server's physical GPU resources. For more details on how to enable GPU sharing refer to the document here.
Q: Is NICE DCV's GPU sharing feature available in Windows?
No, NICE DCV GPU sharing is only available on Linux NICE DCV servers.
Q: What are virtual sessions and how do I manage them?
Virtual sessions are supported on Linux NICE DCV servers only. A NICE DCV server can host multiple virtual sessions simultaneously. Virtual sessions are created and managed by NICE DCV users. NICE DCV users can only manage sessions that they have created. The root user can manage all virtual sessions that are currently running on the NICE DCV server. For instructions on managing virtual sessions, refer to the document here.
Q: Why should I use AWS ParallelCluster?
You should use AWS ParallelCluster if you want to run High Performance Computing (HPC) workloads on AWS. You can use AWS ParallelCluster to rapidly build test environments for HPC applications as well as use it as the starting point for building HPC infrastructure in the Cloud. AWS ParallelCluster minimizes operational overhead of cluster management and simplifies the migration path to the cloud.
Q: What types of applications can benefit from using AWS ParallelCluster?
High performance computing applications that require a familiar cluster-like environment in the Cloud, such as MPI applications and machine learning applications using NCCL, are most likely to benefit from AWS ParallelCluster.
Q: How does AWS ParallelCluster relate to/work with other AWS services?
AWS ParallelCluster is integrated with AWS Batch, a fully managed AWS batch scheduler. AWS Batch can be thought of as a "cloud native" replacement for on-premises batch schedulers, with the added benefit of resource provisioning.
AWS ParallelCluster also integrates with Elastic Fabric Adapter (EFA) for applications that require low-latency networking between nodes of HPC clusters. AWS ParallelCluster is also integrated with Amazon FSx for Lustre, a high-performance file system with scalable storage for compute workloads, and Amazon Elastic File System.
Q: What does AWS ParallelCluster create when it builds a cluster?
AWS ParallelCluster provisions a head node for build and control, a cluster of compute instances, a shared filesystem, and a batch scheduler. You can also extend and customize your use cases using custom pre-install and post-install bootstrap actions.
Q: What batch schedulers work with AWS ParallelCluster?
AWS ParallelCluster supports AWS Batch, the fully managed, cloud-native batch scheduler from AWS, and is also compatible with Slurm.
Q: What Linux distributions are supported with AWS ParallelCluster?
AWS ParallelCluster is currently compatible with Amazon Linux 2, Ubuntu 18.04, CentOS 7, and CentOS 8. AWS ParallelCluster provides a list of default AMIs (one per compatible Linux distribution per region) for you to use. Note that Linux distribution availability is more limited in the GovCloud and China partitions. You can learn more about distribution compatibility by reviewing the AWS ParallelCluster User Guide at https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-definition.html#base-os.
Additionally, while your cluster runs on Amazon Linux, you can run the AWS ParallelCluster command line tool to create and manage your clusters from any computer capable of running Python and downloading the AWS ParallelCluster package.
Q: Can I use my own AMI with AWS ParallelCluster?
There are three ways in which you can customize AWS ParallelCluster AMIs. You can take and modify an existing AWS ParallelCluster AMI, you can take your existing customized AMI and apply the changes needed by AWS ParallelCluster on top of it, or you can use your own custom AMI at runtime. For more information, please visit https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/02_ami_customization.html.
Q: Does AWS ParallelCluster support Windows?
AWS ParallelCluster does not support building Windows clusters. However, you can run the AWS ParallelCluster command line tool on your Windows machine. For more information, please visit https://docs.aws.amazon.com/parallelcluster/latest/ug/install-windows.html.
Q: Does AWS ParallelCluster support Reserved Instances and Spot Instances?
Yes. AWS ParallelCluster supports On-Demand, Reserved, and Spot Instances. Please note that work done on Spot instances can be interrupted. We recommend that you only use Spot instances for fault-tolerant and flexible applications.
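One common way to make a job tolerant of Spot interruptions is to checkpoint progress so that a restarted job resumes where it left off. The sketch below illustrates the idea with a local JSON file; the function name, the checkpoint format, and the `stop_after` parameter (used here only to simulate an interruption) are all hypothetical, and a real job would typically write checkpoints to shared or durable storage.

```python
import json
import os
import tempfile


def run_with_checkpoints(items, checkpoint_path, process, stop_after=None):
    """Process items in order, saving progress after each one so that an
    interrupted run can resume from the last completed item."""
    start, results = 0, []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        start, results = state["next_index"], state["results"]

    for index in range(start, len(items)):
        if stop_after is not None and index - start >= stop_after:
            return None  # simulated interruption before the work finishes
        results.append(process(items[index]))
        # Write to a temporary file, then rename, so a crash mid-write
        # never leaves a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": index + 1, "results": results}, f)
        os.replace(tmp_path, checkpoint_path)
    return results


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    # First run is "interrupted" after two items.
    first = run_with_checkpoints([1, 2, 3, 4], path, lambda x: x * x, stop_after=2)
    # Second run resumes from the checkpoint and completes the rest.
    second = run_with_checkpoints([1, 2, 3, 4], path, lambda x: x * x)

print(first, second)
```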
Q: Can I have multiple instance types in my cluster’s compute nodes?
Yes. You can have multiple queues and multiple instance types per queue.
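As a rough illustration, the sketch below generates a scheduling section with two queues, one of which has two instance types. It assumes the AWS ParallelCluster version 3 configuration schema (`Scheduling`, `SlurmQueues`, `ComputeResources`); the key names should be checked against the User Guide for the version you run. The helper name, queue names, counts, and subnet ID are hypothetical.

```python
import json


def build_scheduling_section(queues, subnet_id):
    """Build a Slurm scheduling section with one entry per queue.

    `queues` maps a queue name to a list of (instance_type, max_count)
    pairs. Hypothetical helper; key names assume the version 3 schema.
    """
    slurm_queues = []
    for queue_name, resources in queues.items():
        compute_resources = [
            {
                # Derive a resource name without dots from the instance type.
                "Name": instance_type.replace(".", "-"),
                "InstanceType": instance_type,
                "MinCount": 0,
                "MaxCount": max_count,
            }
            for instance_type, max_count in resources
        ]
        slurm_queues.append(
            {
                "Name": queue_name,
                "ComputeResources": compute_resources,
                "Networking": {"SubnetIds": [subnet_id]},
            }
        )
    return {"Scheduling": {"Scheduler": "slurm", "SlurmQueues": slurm_queues}}


scheduling = build_scheduling_section(
    {
        "compute": [("c5n.18xlarge", 8), ("c6i.32xlarge", 4)],
        "gpu": [("p3dn.24xlarge", 2)],
    },
    subnet_id="subnet-EXAMPLE",
)
print(json.dumps(scheduling, indent=2))
```

The output is JSON for readability; an actual cluster configuration file would normally be written in YAML and would also need sections such as the head node and image settings.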
Q: How large of a cluster can I build with AWS ParallelCluster?
There is no built-in limit to the size of the cluster you can build with AWS ParallelCluster. There are, however, some constraints you should consider such as the instance limits that exist for your account. For some instance types, the default limits may be smaller than expected HPC cluster sizes and limit increase requests will be necessary before building your cluster. For more information on EC2 limits, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html.
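To estimate whether a limit increase is needed, the arithmetic can be sketched as follows, assuming your account's On-Demand limit is expressed in vCPUs. The quota value is made up for illustration, the function name is hypothetical, and 72 is the vCPU count of a c5n.18xlarge instance.

```python
def max_instances_within_quota(vcpu_quota, vcpus_per_instance, vcpus_in_use=0):
    """Return how many more instances of one type fit in a vCPU quota."""
    if vcpus_per_instance <= 0:
        raise ValueError("vcpus_per_instance must be positive")
    available = vcpu_quota - vcpus_in_use
    return max(available // vcpus_per_instance, 0)


# Hypothetical quota of 1152 vCPUs; c5n.18xlarge has 72 vCPUs.
print(max_instances_within_quota(1152, 72))                    # 16
print(max_instances_within_quota(1152, 72, vcpus_in_use=144))  # 14
```

If the planned cluster size exceeds the result, request a limit increase before building the cluster.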
Q: Does AWS ParallelCluster support the use of placement groups?
Yes. Although AWS ParallelCluster doesn't use a placement group by default, you can enable it by either providing an existing placement group to AWS ParallelCluster or allowing AWS ParallelCluster to create a new placement group at launch. You can also configure the whole cluster or only the compute nodes to use the placement group. For more information, please see https://cfncluster.readthedocs.io/en/latest/configuration.html#placement-group.
Q: What kind of shared storage can I use with AWS ParallelCluster?
By default, AWS ParallelCluster automatically configures an external 15 GB Amazon Elastic Block Store (EBS) volume attached to the cluster’s head node and exported to the cluster’s compute nodes via Network File System (NFS). You can learn more about configuring EBS storage at https://docs.aws.amazon.com/parallelcluster/latest/ug/ebs-section.html. The size of this shared volume can be configured to suit your needs.
AWS ParallelCluster is also compatible with Amazon Elastic File System (EFS), RAID, and Amazon FSx for Lustre file systems. It is also possible to configure AWS ParallelCluster with Amazon S3 object storage as the source of job inputs or as a destination for job output. For more information on configuring all of these storage options with AWS ParallelCluster, please visit https://docs.aws.amazon.com/parallelcluster/latest/ug/configuration.html.
Q: How much does AWS ParallelCluster cost?
AWS ParallelCluster is available at no additional charge, and you pay only for the AWS resources needed to run your applications.
Q: What regions is AWS ParallelCluster available in?
AWS ParallelCluster is available in the following regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), EU (Stockholm), EU (Paris), EU (London), EU (Frankfurt), EU (Ireland), EU (Milan), Africa (Cape Town), Middle East (Bahrain), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Hong Kong), AWS GovCloud (US-Gov-East), AWS GovCloud (US-Gov-West), China (Beijing), and China (Ningxia).
Q: How is AWS ParallelCluster supported?
You are responsible for operating the cluster, including required maintenance on EC2 instances and batch schedulers, security patching, user management, and MPI troubleshooting. AWS ParallelCluster support is limited to issues related to the build-out of the resources and AWS Batch integration. AWS Batch scheduler problems are supported by the AWS Batch service team. Questions regarding other non-AWS schedulers should be directed toward their own support communities. If you use a custom AMI instead of one of AWS ParallelCluster's default AMIs, please note that AWS ParallelCluster doesn't support any OS issues related to the use of a custom AMI.
Q: How is AWS ParallelCluster released?
AWS ParallelCluster is released via the Python Package Index (PyPI) and can be installed via pip. AWS ParallelCluster's source code is hosted in the Amazon Web Services organization on GitHub at https://github.com/aws/aws-parallelcluster.
Q: Why should I use EnginFrame?
You should use EnginFrame because it can increase the productivity of domain specialists (such as scientists, engineers, and analysts) by letting them easily extend their workflows to the cloud and reduce their time-to-results. EnginFrame reduces overhead for administrators in managing AWS resources, as well as your users’ permissions and access to those resources. These features will help save you time, reduce mistakes, and let your teams focus more on performing innovative research and development rather than worrying about infrastructure management.
Q: How do I enable EnginFrame in my on-premises environment?
EnginFrame AWS HPC Connector is supported in EnginFrame version 2021.0 or later. Once you install EnginFrame in your environment, administrators can begin defining AWS cluster configurations from the Administrator Portal.
Q: How can an EnginFrame administrator set up or configure AWS HPC environments?
EnginFrame administrators can use AWS ParallelCluster to create HPC clusters running on AWS ready to accept jobs from users. To do this within EnginFrame, administrators can start by creating, editing, or uploading a ParallelCluster cluster configuration. As part of the cluster creation step, administrators create a unique name for a given AWS cluster and specify whether it is accessible to all users, a specific set of users and/or user groups, or no users. Once an AWS cluster has been created, it remains available to accept submitted jobs until an administrator removes it. By default, an AWS cluster in the created state will use only the minimal set of required resources in order to be ready to accept submitted jobs and will scale up elastically as jobs are submitted.
Q: How can users choose between running their jobs on-premises or on AWS?
For EnginFrame services for which your administrator has enabled AWS as an option, you can use a drop-down menu to select from any of the available compute queues across on-premises and AWS. Administrators can include text descriptions to help you choose which of these queues is appropriate to run your workload.
Q: What job schedulers can I use with EnginFrame on AWS? Can I use a different job scheduler on-premises and on AWS?
EnginFrame supports Slurm for clusters that are created on AWS. You can also choose to use a different scheduler on-premises than on AWS (for example, use LSF on-premises and Slurm in AWS). In the case of EnginFrame services you set up to submit jobs both on-premises and in AWS using different job schedulers, administrators will need to ensure that any job submission scripts support submission through each of these schedulers.
Q: What operating systems can I use in AWS? Can I use a different operating system on-premises and on AWS?
EnginFrame supports the Amazon Linux 2, CentOS 7, Ubuntu 18.04, and Ubuntu 20.04 operating systems on AWS. You can choose to use a different operating system on-premises than the one you use on AWS. However, if you intend to use EnginFrame to run the same workload across both on-premises and AWS, we recommend using the same operating system to reduce environment differences and simplify the portability of your workloads.
Q: How much does EnginFrame cost?
There is no additional charge for using EnginFrame on AWS. You pay for any AWS resources used to store and run your applications.
When using EnginFrame on-premises, you will be asked for a license file. To obtain an evaluation license, or to purchase new production licenses, please reach out to one of the authorized NICE distributors or resellers who can provide sales, installation services, and support in your country.