The plumbing: best-practice infrastructure to facilitate HPC on AWS
If you want to build enterprise-grade high performance computing on AWS, what’s the best path to get started? Should you create a new AWS account and build from scratch?
To answer those questions, consider an analogy of someone washing their hands at a sink. Many focus on a person washing their hands, but that act is an outcome or result. We’ll argue that the plumbing – the water, the sink, and the drain pipes – is critical to facilitate that outcome.
The plumbing often fades to the background, its value under-appreciated until something fails. Yet it provides a best-practices foundation for countless use-cases. In similar fashion, enterprise-grade HPC depends on a lot of critical plumbing.
In this post, we’ll identify the hallmarks and best-practices for building a solid HPC foundation so you can get the plumbing right. We’ll assume you’re an HPC facilitator: a system admin, operator, or you work at an HPC center. Deep experience with AWS is not required.
Our hope is that you come away with an appreciation for shared responsibilities when building on AWS, and an awareness of solutions that make enterprise-grade HPC a good deal easier than you might have expected.
Many customers, especially individuals or small teams, assume that operating on the cloud requires bearing the entire burden of their architecture. The weight of compliance and security, the complexity of networking and identity – those burdens are much larger than one person or one team.
Even more HPC customers stumble because they fear that once they’ve entered their credit card to create an account, they own every single decision involved in building HPC on the cloud.
If you’re a small business – say a startup – you wouldn’t break ground for a new facility on day one, run network, water, and power lines, build an office building, and then populate it. Instead, you might rent an office – or go to a coffee shop table – where network, water, and power exist for you to tap into. Good plumbing allows you to focus first on building what matters most: your business.
At AWS, we often talk about the “shared responsibility model” as core to our Well-Architected Framework, a six-pillar framework for best-practices in building secure, high-performing, resilient and efficient infrastructures. Most customers understand shared responsibility regarding the security pillar: customer responsibility in the cloud vs. AWS responsibility of the cloud. But shared responsibility also applies elsewhere – like sustainability.
It can also be multi-layered and differentiate between roles inside your own organization. For example, you as an HPC provider (a facilitator) might focus on building an HPC cluster, while other teams manage incident response, account guardrails and boundaries, or identity and access management.
To make it concrete, let’s say you represent an HPC center, and you serve end-users campus-wide at a university or research institution. As an HPC provider, you might be embedded within the campus-wide IT team, or maybe you operate with autonomy while pieces of your infrastructure are tied back through the central IT services. On-premises, you already have the concept of “the plumbing” that you build on top of. You’re not operating alone. You focus effort on HPC, but your HPC system depends on enterprise shared services (like Active Directory where your users’ IDs and passwords live), your campus network, compliance reporting, and so on. This same pattern repeats itself whether you’re at a university, a national lab, or a non-profit organization. It’s true whether you’re in the public sector or at a private company.
As a best-practice, HPC customers (especially those intending to operate enterprise-wide HPC clusters or HPC-as-a-service offerings) leverage shared responsibility to build on top of good plumbing infrastructure on AWS. The most successful collaborate with their central IT services to layer HPC on top of a broader enterprise cloud strategy.
So what is foundational plumbing, and how does HPC attach to it?
Most enterprise customers establish an AWS Landing Zone. This is a well-architected, pre-defined architecture for enterprise infrastructure. A Landing Zone is the foundation for multi-account architecture, Identity and Access Management (IAM), governance, data security, network design, and logging. It’s typically defined by central IT within an enterprise, and helps align accounts to AWS best-practices so they can meet compliance frameworks. Any infrastructure built within an enterprise, including HPC, affixes to the Landing Zone.
Therefore, in your role as an HPC facilitator, step one should be to ask your central IT team whether a Landing Zone exists, and how to get started. Share responsibility with that team, and let them guide your HPC build.
The fastest path to establish a Landing Zone is through the Landing Zone Accelerator (LZA). This is an open source infrastructure as code solution that enables repeatable configuration and modification of a Landing Zone through low-code YAML files. AWS Professional Services and AWS Partners also offer fast-track options to build and support Landing Zones through LZA deployment and alternative solutions.
There is no one-size-fits-all Landing Zone configuration for HPC customers because HPC lives within every industry and compliance framework. However, Figure 1 shows a best-practices architecture pattern that’s repeated across reference LZA implementations from regulated industries like Health Care (e.g., HIPAA, C5), and country- or region-specific compliance requirements like State and Local Governments (e.g., FISMA), Education (e.g., NIST 800-53, NIST 800-171, ITAR), and US Federal and Department of Defense (e.g., ITAR, FedRAMP, CMMC). Other reference implementations for LZAs (including territories outside the USA) are inventoried here.
As a word of caution: linking to core infrastructure and adhering to constraints/guardrails imposed by a Landing Zone moves infrastructure closer to meeting compliance requirements, but may not be a complete solution for compliance. Just attaching an HPC cluster to a FedRAMP Landing Zone configuration doesn’t make it FedRAMP-compliant. Proper documentation of controls and internal audits (and in some cases, 3rd party audits) might be required. Here again, share responsibility with central IT services and your enterprise security and compliance office to satisfy enterprise requirements.
Commonly, a Landing Zone consists of a root, or management, organizational unit (OU) with an account that centrally steers the multi-account structure through mandatory and preventive guardrails and other requirements for how accounts are provisioned.
The root account may define Service Control Policies and/or a Permissions Boundary to govern how accounts/roles in the organization tree provision resources, call APIs, etc. While it’s possible to self-build and manage a Landing Zone, best-practices (including for Landing Zone Accelerator) leverage AWS Control Tower, which is a managed service purpose-built for this task.
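To make the idea of a guardrail concrete, here is a minimal sketch of what a Service Control Policy document might look like when built in Python. The function name, the statement ID, and the approved regions are our own illustrative assumptions, not part of any Landing Zone configuration. Real region-restriction policies usually exempt global services as well, which we've left out for brevity.

```python
import json


def build_region_guardrail_scp(allowed_regions):
    """Return an SCP-style policy document (hypothetical example).

    The single statement denies any request made to a region that is
    not in allowed_regions. Exemptions for global services are omitted.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegions",  # illustrative name
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(allowed_regions)
                    }
                },
            }
        ],
    }


# Assumed list of approved regions, for illustration only.
policy = build_region_guardrail_scp(["us-west-2", "us-east-1"])
print(json.dumps(policy, indent=2))
```

In a Landing Zone, a policy like this would be owned and attached by central IT, so an HPC team in a workload account inherits the guardrail rather than writing it.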
Below the top-level OU, a Security OU with Audit and Logging accounts manages organization-wide services for security (like threat detection) and centralized logs (such as archive and forensics).
A third OU dedicated to shared Infrastructure might contain services like Active Directory, as well as a core Network account for firewall and/or packet inspection appliances, and a shared Direct Connect attachment.
The final OU is dedicated to workload accounts – this is where an HPC team can operate to build dev, test, and production clusters and link to the shared infrastructure.
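The OU layout described above can be sketched as a small tree. This is only a model for reasoning about where accounts sit. The class, the helper function, and the account names (like hpc-dev or SharedServices) are hypothetical, and your own Landing Zone may be organized differently.

```python
from dataclasses import dataclass, field


@dataclass
class OrganizationalUnit:
    """A node in the (hypothetical) organization tree."""
    name: str
    accounts: list = field(default_factory=list)
    children: list = field(default_factory=list)


def find_account_path(ou, account_name, prefix=""):
    """Return the OU path that holds account_name, or None if absent."""
    path = f"{prefix}/{ou.name}" if prefix else ou.name
    if account_name in ou.accounts:
        return path
    for child in ou.children:
        found = find_account_path(child, account_name, path)
        if found is not None:
            return found
    return None


# Illustrative layout mirroring the four OUs described in the text.
landing_zone = OrganizationalUnit(
    name="Root",
    accounts=["Management"],
    children=[
        OrganizationalUnit("Security", accounts=["Audit", "Logging"]),
        OrganizationalUnit("Infrastructure", accounts=["Network", "SharedServices"]),
        OrganizationalUnit("Workloads", accounts=["hpc-dev", "hpc-test", "hpc-prod"]),
    ],
)

print(find_account_path(landing_zone, "hpc-prod"))
```

The point of the model is that the HPC accounts live in one branch, while security, logging, and networking are handled in sibling branches that other teams own.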
HPC within a Landing Zone
In practice, HPC teams build on top of a Landing Zone with minimal awareness of how the plumbing infrastructure is built. They only need to know when to tie in shared services and how to build HPC infrastructure for themselves.
For a deeper look at common HPC scenarios and key elements to ensure architectural best-practices, you can refer to the HPC Lens for the AWS Well-Architected Framework.
The most significant opportunity for HPC facilitators when they’re building on AWS is to augment their HPC design to exercise the elasticity and flexibility of the cloud. Compute nodes spin up when they’re needed, and spin down when they’re not. The compute node memory size, core-count, accelerators (and more) can be adjusted in minutes rather than the cluster hardware staying set in stone for a three- to five-year lifecycle.
Your HPC architecture might be a single monolithic HPC system, reproducing the on-premises cluster experience. Or you might be extending it (like in a hybrid HPC scenario) so end-users share a single scheduler and common storage resources. As an alternative, tools and services like AWS ParallelCluster and AWS Batch present an opportunity to build smaller HPC solutions right-sized for workloads on a per-project or per-team basis – or even for individual users.
Whether you build one cluster or dozens, elasticity and flexibility are critical to conserve cost and evolve the end-user experience from “what questions can I ask with the cores I’ve got?” (science limited by scale), to “what do I need to do to get more cores?” (science is priority). The latter implies greater productivity and ability to innovate.
A Landing Zone amplifies flexibility in the case of multiple HPC clusters. First, mapping users and workloads onto separate accounts in the multi-account structure is useful to partition data for residency and access constraints.
Second, a Landing Zone configuration dictates how new accounts are provisioned on creation, helping facilitators pre-warm dependencies for launching clusters (for example, required IAM roles, or a database that’s needed to enable accounting in the scheduler).
Third, Landing Zones can integrate AWS Service Catalog, which is a vending service for infrastructure as code. This creates a controlled mechanism for end-users to self-service, provision, and delete infrastructure in their account. Service Catalog Products define template configurations (a website, a database, or an HPC cluster) that are reviewed and approved by HPC facilitators or central IT services to satisfy compliance requirements or other guardrails. Optional parameters on Service Catalog Products give end-users control of facets of HPC relevant to their work, like instance-type selection, storage quota, or cost-center allocation. Through Service Catalog, facilitators distill the complexity of enterprise HPC design and management, but empower their users to innovate autonomously.
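As a rough illustration of how optional parameters give end-users bounded control, the sketch below checks a user's request against a set of facilitator-approved choices. The parameter names, instance types, quota limits, and cost centers are all assumptions made up for this example. The dictionary borrows the style of CloudFormation template parameters (Default, AllowedValues, MinValue, MaxValue), but the validation function is our own and is not a Service Catalog API.

```python
# Hypothetical parameters an HPC facilitator might approve for a
# cluster product. All names and values are illustrative.
PRODUCT_PARAMETERS = {
    "ComputeInstanceType": {
        "Default": "c5n.18xlarge",
        "AllowedValues": ["c5n.18xlarge", "hpc6a.48xlarge"],
    },
    "StorageQuotaGiB": {"Default": 1200, "MinValue": 1200, "MaxValue": 12000},
    "CostCenter": {"AllowedValues": ["chemistry", "genomics", "physics"]},
}


def resolve_parameters(requested):
    """Merge a user's request with defaults and enforce the limits."""
    unknown = set(requested) - set(PRODUCT_PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    resolved = {}
    for name, spec in PRODUCT_PARAMETERS.items():
        if name in requested:
            value = requested[name]
        elif "Default" in spec:
            value = spec["Default"]
        else:
            raise ValueError(f"{name} is required")

        if "AllowedValues" in spec and value not in spec["AllowedValues"]:
            raise ValueError(f"{name}={value!r} is not an approved choice")
        if "MinValue" in spec and value < spec["MinValue"]:
            raise ValueError(f"{name}={value} is below the minimum")
        if "MaxValue" in spec and value > spec["MaxValue"]:
            raise ValueError(f"{name}={value} is above the maximum")
        resolved[name] = value
    return resolved


resolved = resolve_parameters({"CostCenter": "genomics", "StorageQuotaGiB": 2400})
print(resolved)
```

The end-user picks only what matters to their work, and anything outside the approved choices is rejected before infrastructure is provisioned.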
Landing Zones also support HPC facilitators with cost and billing strategies. Typically, central IT services receive consolidated bills and reconcile costs with workload accounts through internal cost recovery and charge-back processes. With consolidated billing, organizations combine usage across all workload accounts to more effectively scale Savings Plans or volume discounts. Whereas one HPC cluster with episodic utilization might not benefit from a Savings Plan, utilization in the aggregate across all HPC accounts might look sustained – and therefore justify making a commitment with a Savings Plan, saving up to 72% on the organization’s overall compute bill.
The same logic applies when utilization is aggregated for HPC and non-HPC workloads.
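A toy calculation shows why aggregation matters. The hourly usage numbers below are invented, and we use the minimum hourly usage as a simple stand-in for "sustained" utilization. This is not how Savings Plans commitments are actually sized, so treat it only as an illustration of the reasoning.

```python
# Hypothetical hourly compute usage (arbitrary units) for three HPC
# workload accounts over the same six-hour window.
usage_by_account = {
    "hpc-chemistry": [40, 0, 0, 30, 0, 0],
    "hpc-genomics": [0, 35, 0, 0, 45, 0],
    "hpc-physics": [0, 0, 50, 0, 0, 32],
}


def sustained_baseline(hourly_usage):
    """Usage that is present in every hour (a simple proxy)."""
    return min(hourly_usage)


# Each account looks episodic when viewed on its own.
per_account = {
    name: sustained_baseline(series)
    for name, series in usage_by_account.items()
}

# Summing hour by hour across accounts gives the consolidated view.
aggregate = [sum(hour) for hour in zip(*usage_by_account.values())]
aggregate_baseline = sustained_baseline(aggregate)

print(per_account)
print(aggregate, aggregate_baseline)
```

In this made-up case every individual account has a baseline of zero, while the combined usage never drops below 30 units, which is the kind of pattern that could make a commitment worth evaluating.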
Consolidated billing doesn’t preclude charge-back to individuals (per-project, team, or user). Separate workload accounts provide a simple segmentation to track spend in the consolidated bill. Alternatively, cost allocation tags can be applied to resources within or across accounts, and differentiate between cost centers for utilization. Again, consult with your central IT services. They should be able to advise on how to best handle cost, billing, and tagging strategies for your HPC workload accounts.
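The two charge-back approaches can be sketched in a few lines. The line items, account names, and the CostCenter tag key below are hypothetical, and a real bill would come from your organization's cost reports rather than a hard-coded list.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical line items from a consolidated bill.
line_items = [
    {"account": "hpc-prod", "tags": {"CostCenter": "genomics"}, "cost": Decimal("120.50")},
    {"account": "hpc-prod", "tags": {"CostCenter": "physics"}, "cost": Decimal("80.25")},
    {"account": "hpc-dev", "tags": {}, "cost": Decimal("15.00")},
    {"account": "hpc-dev", "tags": {"CostCenter": "genomics"}, "cost": Decimal("9.50")},
]


def charge_back(items, tag_key=None):
    """Total cost per account, or per tag value when tag_key is given."""
    totals = defaultdict(Decimal)
    for item in items:
        if tag_key is None:
            key = item["account"]
        else:
            # Resources without the tag land in an "untagged" bucket.
            key = item["tags"].get(tag_key, "untagged")
        totals[key] += item["cost"]
    return dict(totals)


by_account = charge_back(line_items)
by_cost_center = charge_back(line_items, tag_key="CostCenter")
print(by_account)
print(by_cost_center)
```

Grouping by account needs no tagging discipline at all, while grouping by tag can cut across accounts but leaves an "untagged" remainder that someone has to chase down.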
Landing Zones are best-practice “plumbing” for enterprise-grade HPC and the most natural starting point for most HPC customers. Whether you intend to offer end-users flexible HPC-as-a-service, or build a single monolithic HPC cluster shared by all users, we recommend you exercise shared responsibility and collaborate with central IT services teams in your enterprise so you can layer HPC as a workload within the broader cloud strategy.