AWS HPC Blog

Securing HPC on AWS: implementing STIGs in AWS ParallelCluster

Today, we’ll discuss cloud-native methods that HPC customers can use to accelerate their process for creating Amazon Elastic Compute Cloud (Amazon EC2) images for AWS ParallelCluster that are compliant with Security Technical Implementation Guides (STIGs), a set of standards maintained by the US government.

In this post, we’ll walk you through the process of applying STIGs to your ParallelCluster environment, help you identify the decisions you need to make on the way, and show you some of the tools you can use to make it all easier.

What’s a STIG?

STIGs are maintained by the Defense Information Systems Agency (DISA), a US government organization, and are simply a set of security standards that can be applied to different environments, like Amazon EC2. Think of STIGs as a checklist of items to apply to your EC2 instances, where each checklist item has a corresponding severity level attached to it that says, “the risk of not doing x is a low, medium, or high security risk”. You’ll also see these severity levels referred to as Category Codes (CAT), where CAT 1 corresponds to a high security risk, CAT 2 to medium, and CAT 3 to low.

For example, one high-severity STIG checklist item for Red Hat Enterprise Linux (RHEL) 8 is to not allow accounts configured with blank or null passwords. To resolve this, an administrator can log in to the operating system and manually configure accounts to not have a blank or null password. With hundreds of checklist items, it’s easy to see why this can quickly become a burdensome task. The process described in this post automates up to 87% of the otherwise manual STIG remediation process.
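
As a minimal sketch of what that one fix involves, assuming the default RHEL 8 PAM configuration files, an administrator would strip the nullok option so PAM rejects empty passwords:

# Remove "nullok" so accounts with blank or null passwords can no longer authenticate
sudo sed -i 's/\bnullok\b//g' /etc/pam.d/system-auth /etc/pam.d/password-auth

Multiply a small fix like this across hundreds of findings and the case for automation makes itself.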

Why do customers want to implement STIGs?

In short, some want to, and some need to. Customers such as the U.S. Department of Defense (DoD) must adhere to stringent compliance standards for operating system hardening. Other customers may prefer to use STIGs as a benchmark to improve their security posture.

Customers like the DoD often operate in AWS without any access to the Internet. The reason is set by organizational policy, and it’s usually to reduce the risk of sensitive data going places it shouldn’t. We’ll address how customers with these network restrictions can accelerate STIG hardening using AWS cloud-native tools.

Once you’ve “STIG’d” your ParallelCluster instances, how can you verify which checklist items you have crossed off? This is where OpenSCAP, an open-source security and compliance tool, comes into play. OpenSCAP automates continuous monitoring, vulnerability management, and reporting of security policy compliance data. While OpenSCAP is primarily designed to align with DoD security standards, it’s used to establish security baselines across many industries.

This post will focus on three supported ParallelCluster operating systems (OS): RHEL8, Amazon Linux 2 (AL2), and Ubuntu 20.04. At the time of writing, the DISA STIG document library didn’t contain a benchmark for Ubuntu 22.04, which is why it’s not covered here.

We worked through the process defined in this post in the AWS GovCloud (US-West) Region, but you should be able to repeat it in other AWS Regions.

For HPC customers completely new to AWS, we recommend reviewing this blog post, which covers best practices for setting up a foundation in AWS on which to build your HPC workloads.

AMIs for AWS ParallelCluster

An Amazon Machine Image (AMI) is a template that contains a software configuration (for example, an OS, an application server, and applications). From an AMI, you launch an EC2 instance, which is a copy of the AMI running as a virtual server in the cloud. AMIs used for ParallelCluster are unique because they have software installed on them necessary for operating the cluster management tool.

Customers can choose to create custom AMIs for ParallelCluster using two methods, both of which can be used to achieve STIG compliance, depending on factors like Internet connectivity and OS choice.

The first option is the build image configuration process which you can trigger from a ParallelCluster CLI command: pcluster build-image. This process uses Amazon EC2 Image Builder to launch a build instance, apply the ParallelCluster cookbook, install the ParallelCluster software stack, and perform other necessary configuration tasks.
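
As a rough sketch of that first option, a minimal build-image configuration can chain the Amazon-managed STIG High Image Builder component into the build. The instance type and parent image below are placeholders, and you should verify the exact component ARN and partition available in your Region:

Build:
  InstanceType: c5.xlarge
  ParentImage: {your-parent-ami}
  Components:
    # Amazon-managed STIG High hardening component; x.x.x selects the latest version
    - Type: arn
      Value: arn:aws-us-gov:imagebuilder:us-gov-west-1:aws:component/stig-build-linux-high/x.x.x

You would then trigger the build with a single CLI call:

pcluster build-image --image-id {your-image-id} --image-configuration build-image.yaml --region us-gov-west-1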

The second option involves taking a baseline ParallelCluster AMI (one produced by the ParallelCluster team themselves) and customizing it by performing manual modifications through AWS Systems Manager (SSM).
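
A sketch of that second path, assuming the Amazon-managed AWSEC2-ConfigureSTIG command document (verify the document name and its parameters in your Region before relying on this), might look like:

# Apply the STIG High baseline to a running baseline ParallelCluster instance
aws ssm send-command \
  --document-name "AWSEC2-ConfigureSTIG" \
  --parameters '{"Level":["High"]}' \
  --instance-ids {your-instance-id}

Once the command finishes, you can capture the hardened instance as a new AMI with aws ec2 create-image.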

Process comparison

Should you take a baseline ParallelCluster image and then apply STIGs, or take an image that already has STIGs applied (a “golden image”), and then install ParallelCluster on top? The end result is fundamentally similar, but there are some trade-offs depending on which route you choose.

The benefit of applying STIGs after a ParallelCluster image is created is that you can minimize the permissions attached to the EC2 instance’s role: there are additional AWS Identity and Access Management (IAM) permissions required to trigger the build image process, and you can find them in our documentation. The tradeoff is that you would be standing up a new image build pipeline to accommodate security policy (STIG) enforcement, starting from a baseline ParallelCluster image.

An advantage of taking a golden image and installing ParallelCluster on it is that you can keep an already established image build pipeline, which may accelerate internal compliance processes. However, this requires a wider permissions boundary than the previous example, and there’s a chance that installing new software could affect how STIG compliant your images are. If you’re interested in trying this process on your own AMIs, you can follow along with any of the sections below, depending on your Internet connectivity and operating system requirements, as the process is the same. In either case, we recommend performing compliance scans on your images.

Accelerating RHEL8, AL2, and Ubuntu 20.04 STIG compliance

Apart from the OS your use cases require, the process to achieve STIG compliance is determined by whether or not your Amazon EC2 instances have Internet connectivity. If your compliance requirements give you the flexibility to choose, the process is easier with Internet connectivity.

For users with Internet connectivity who want to use the RHEL8 or AL2 operating systems, refer to the instructions in our GitHub repo, part of the HPC Recipes Library, which will guide you through the EC2 Image Builder process.

For users without Internet connectivity who want to use the RHEL8 or AL2 operating systems, refer to these instructions in the same repository. This connectivity scenario is perhaps more common amongst customers with STIG requirements. These customers can take advantage of AWS PrivateLink, a feature of Amazon Virtual Private Cloud (VPC) that allows private connectivity to AWS services. To use this technology to accelerate STIG compliance, ensure that you configure the required VPC endpoints to allow connectivity from your private subnet to SSM. You’ll also need the required VPC endpoints for ParallelCluster, which will be used to launch your cluster with the resulting AMI.
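
As a sketch, assuming placeholder VPC, subnet, and security group IDs, the three SSM-related interface endpoints can be created like this (the VPC endpoints ParallelCluster itself needs are additional to these):

# Create interface endpoints so the SSM agent can reach Systems Manager privately
for service in ssm ssmmessages ec2messages; do
  aws ec2 create-vpc-endpoint \
    --vpc-id {your-vpc-id} \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-gov-west-1.${service} \
    --subnet-ids {your-subnet-id} \
    --security-group-ids {your-security-group-id}
done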

The process for Ubuntu 20.04 includes an extra step compared to the RHEL8 and AL2 operating systems because there are a few findings that Systems Manager cannot rectify during its run command. Because of this, we launch a baseline ParallelCluster Ubuntu 20.04 EC2 instance with a user data script that resolves findings V-219166, V-238237, and V-238218. As with the RHEL8 and AL2 operating systems, customers without Internet connectivity should ensure they configure the required VPC endpoints to allow connectivity from their private subnet to SSM, as well as the required VPC endpoints for ParallelCluster. Instructions for Ubuntu 20.04 can be found in our HPC samples repository.
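
A minimal sketch of that launch step, assuming the remediation script is saved locally as fix-findings.sh (a hypothetical file name) and using placeholder identifiers:

# Launch the baseline Ubuntu 20.04 ParallelCluster AMI with the remediation user-data script
aws ec2 run-instances \
  --image-id {baseline-pcluster-ubuntu2004-ami} \
  --instance-type c5.xlarge \
  --subnet-id {your-subnet-id} \
  --key-name {your-keypair} \
  --user-data file://fix-findings.sh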

As previously mentioned, there are corresponding severity levels (high, medium, low) associated with STIG checklist items. Customers can choose which severity level they want to apply to their Amazon EC2 instances, as described in our SSM documentation. We used the STIG High baseline, which includes any vulnerability that can result in a loss of confidentiality, availability, or integrity. Customers can optionally make additional modifications to the AMIs after the STIG process of their choice has been performed. In any event, we recommend testing AMI compatibility with your application prior to deploying to a production environment.

Results

Customers may be interested to find out what effect running the EC2 Image Builder STIG High component or the Systems Manager STIG High document has on their respective operating systems.

We used OpenSCAP to perform compliance scanning to assess the security posture of our instances. OpenSCAP uses the concept of profiles to determine which checks it will run, and the appropriate profile can vary with mission requirements and OS.

To maintain a consistent benchmark for before-and-after assessments, we used the xccdf_mil.disa.stig_profile_MAC-2_Sensitive profile for the RHEL8 and Ubuntu 20.04 operating systems, and the stig-rhel7-disa profile for AL2.
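
If you want to reproduce one of these scans by hand, the oscap invocation looks roughly like the following, where the input file is a placeholder for whichever DISA XCCDF benchmark matches your OS:

# Evaluate the instance against the chosen profile; write XML results and an HTML report
sudo oscap xccdf eval \
  --profile xccdf_mil.disa.stig_profile_MAC-2_Sensitive \
  --results results.xml \
  --report report.html \
  {disa-benchmark}-xccdf.xml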

Each of the ‘Baseline’ AMIs in the figures that follow refers to the baseline ParallelCluster AMI. In other words, these are the AMIs you would find by running the CLI command: pcluster list-official-images. Note that the baseline and subsequent STIG High AMI results may change in future ParallelCluster releases.

Figure 1 – RHEL8 baseline ParallelCluster AMI OpenSCAP results from running the xccdf_mil.disa.stig_profile_MAC-2_Sensitive profile. This shows the EC2 instance as passing 83 checks and failing 148 for a result of being 35.93% compliant with this profile.

Figure 2 – RHEL8 ParallelCluster AMI after running the Amazon STIG High runbook. OpenSCAP results are from running the xccdf_mil.disa.stig_profile_MAC-2_Sensitive profile. This shows the EC2 instance as passing 201 checks and failing 30 for a result of being 87.01% compliant with this profile.

Figure 3 – Amazon Linux 2 baseline ParallelCluster AMI OpenSCAP results from running the stig-rhel7-disa profile. This shows the EC2 instance as passing 54 checks and failing 160 for a result of being 58.88% compliant with this profile.

Figure 4 – Amazon Linux 2 ParallelCluster AMI after running the Amazon STIG High runbook. OpenSCAP results are from running the stig-rhel7-disa profile. This shows the EC2 instance as passing 142 checks and failing 72 for a result of being 68.09% compliant with this profile.

Figure 5 – Ubuntu 20.04 baseline ParallelCluster AMI OpenSCAP results from running the xccdf_mil.disa.stig_profile_MAC-2_Sensitive profile. This shows the EC2 instance as passing 17 checks and failing 92 for a result of being 15.6% compliant with this profile.

Figure 6 – Ubuntu 20.04 ParallelCluster AMI after running the Amazon STIG High runbook. OpenSCAP results are from running the xccdf_mil.disa.stig_profile_MAC-2_Sensitive profile. This shows the EC2 instance as passing 42 checks and failing 67 for a result of being 38.53% compliant with this profile.

Running your own OpenSCAP scans

If you apply additional STIG remediations to ParallelCluster AMIs, you may want to run those images through the same OpenSCAP profiles we used for this blog post.

We’ve stored the scripts for RHEL8, AL2, and Ubuntu 20.04 in our GitHub repo. These scripts require Internet connectivity to run because they download a series of tools, like the AWS CLI and OpenSCAP, as well as the STIG benchmarks, to the EC2 instance being evaluated.

You’ll need to create an S3 bucket, and update the name of the bucket inside the script where it saves the results of the evaluations. The scripts use EC2 instance metadata to dynamically name the output files in Amazon S3 after the instance, so they’re not overwritten as new instances are tested.
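
That naming logic amounts to something like the following sketch, assuming IMDSv1 is reachable on the instance and using a placeholder bucket name:

# Name the report after this instance so concurrent test runs don't overwrite each other
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws s3 cp report.html s3://{your-bucket-name}/${INSTANCE_ID}-report.html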

To run these scripts with minimal effort, you can supply them as a user-data script at launch and have the HTML results automatically sent to your S3 bucket. Inputting a user-data script follows the same logic as described under step 3 of the Ubuntu 20.04 section. For the RHEL8 and Ubuntu 20.04 operating systems, it takes approximately 10 minutes from instance launch to see the results uploaded to your Amazon S3 bucket; AL2 takes approximately 20-25 minutes.

Using the resulting images

The STIG’d AMIs can be found in the EC2 section of the AWS Management Console and referenced in a ParallelCluster configuration file. You can create clusters using the ParallelCluster CLI or the UI. For the purposes of this post, we’ll show an example of placing the STIG’d AMI ID into the ParallelCluster configuration file for a cluster in GovCloud (US-West).

Region: us-gov-west-1
Image:
  Os: rhel8
HeadNode:
  InstanceType: c5a.4xlarge
  Networking:
    SubnetId: {your-subnet-id}
  Ssh:
    KeyName: {your-keypair}
  Image:
    CustomAmi: {your-AMI-id}
SharedStorage:
  - MountDir: /fsx  
    Name: FSxExtData
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: PERSISTENT_1
      PerUnitStorageThroughput: 50
      DeletionPolicy: Delete
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    QueueUpdateStrategy: DRAIN
  SlurmQueues:
  - Name: queue1
    ComputeResources:
    - Name: compute
      Instances:
      - InstanceType: hpc7a.48xlarge
      MinCount: 1
      MaxCount: 10
      Efa:
        Enabled: true
    Networking:
      SubnetIds:
      - {your-subnet-id}
      PlacementGroup:
        Enabled: true

You should edit the items enclosed in {} to include your identifiers.

Once you’ve created the file you can launch the cluster using the command:

pcluster create-cluster --cluster-name <name> --cluster-configuration <file-name>.yml

You should see validation warning messages because you are using a custom AMI; however, these messages can be ignored and will not impact the creation of the cluster. You can track the cluster creation status through the AWS CloudFormation console or by using the ParallelCluster CLI command:

pcluster list-clusters

Conclusion

Verifying levels of compliance for compute resources is a requirement in some industries, and desired in others. Throughout this post, we’ve discussed several different cloud-native methods HPC customers with compliance requirements can choose from to accelerate their STIG process in AWS ParallelCluster depending on their Internet connectivity (or lack thereof) and operating system choice.

We recommend validating that your output images work with your application in a development environment prior to running in production.

Alex Domijan

Alex Domijan is a Solutions Architect on Amazon Web Services (AWS) World Wide Public Sector’s (WWPS) Department of Defense (DoD) team. He works with defense customers covering a wide variety of use cases to optimize their workloads for security, cost, and performance.

Rick Kidder

Rick Kidder is a Senior Security Compliance and Risk Consultant at AWS, based in Fort Lauderdale, Florida. He specializes in guiding customers through the complexities of security and compliance in cloud environments. With a deep-rooted passion for automating compliance processes, Rick aims to streamline operations, allowing professionals to focus their energies on more critical challenges. His expertise facilitates efficient compliance management, ensuring that security remains a top priority without overwhelming resources.

Scott Sizemore

Scott Sizemore is a Senior Cloud Consultant on Amazon Web Services (AWS) World Wide Public Sector's (WWPS) Professional Services Department of Defense (DoD) Team. Prior to joining AWS, Scott was a DoD contractor supporting multiple agencies for over 20 years.

Kevin Sutherland

Kevin Sutherland is a Principal Domain Lead supporting HPC in specialized environments for AWS World Wide Public Sector’s Professional Services US Federal team. Prior to AWS, Kevin supported HPC in the public sector across various contracts deploying in AWS, spearheaded cloud migration for IT and cloud operations in the SatComm industry, and leveraged HPC for engineering, operations, and equipment design in the global minerals, mining, and equipment industry. He has led engineering and HPC transitions into the cloud, developed and deployed on-premises HPC approaches to support modeling- and simulation-driven mechanical equipment design, and holds multiple international patents.