Microsoft Workloads on AWS

Testing AWS Managed Microsoft AD Resilience using AWS Fault Injection Service

In this blog post I show how AWS Fault Injection Service (FIS) can be used to test how your deployed services respond to the loss of access to a Microsoft Active Directory domain controller from the AWS Managed Microsoft AD service.

To understand the behaviour of infrastructure deployed in AWS and prove that your configuration is working as expected during failures, organisations need to have robust tools and processes for simulating different conditions and situations. Previously, infrastructure testing required physical hardware manipulation – disconnecting cables, powering down servers, or removing components. However, when using managed services in the cloud, this direct physical interaction isn’t possible, requiring different approaches to resilience testing.

AWS Fault Injection Service (FIS) bridges this gap by providing controlled chaos engineering capabilities. Chaos engineering, as a discipline, focuses on experimenting with systems to build confidence in their ability to withstand turbulent conditions in production.

AWS Directory Service, Managed Microsoft AD

AWS Managed Microsoft AD implements an enterprise-grade architecture where domain controllers operate from AWS-managed accounts rather than within the customer’s account. When you create a directory through the AWS Management Console, CLI, or API, AWS provisions domain controllers across two Availability Zones (AZs) in the specified VPC. These domain controllers appear in your VPC as Elastic Network Interfaces (ENIs), while the actual server infrastructure is managed entirely by AWS. The service handles infrastructure management tasks including operating system patches, software updates, and backup operations. Each domain controller maintains its own storage volumes and networking configuration, all managed transparently by AWS. The multi-AZ deployment is designed to keep the directory available if a single domain controller or Availability Zone is lost.

To use FIS to test loss of the AD service, you first need to understand how the domain controllers of the AWS Managed Microsoft AD service are connected to your workloads.

The diagram illustrates an AWS architecture showing two VPCs - an AWS Managed VPC containing AWS Managed Microsoft AD, and a Customer VPC containing customer workloads in a private subnet. The connectivity between these VPCs is established through an endpoint, with AWS Fault Injection Service positioned to simulate disruptions to this connection. The components are arranged in a logical flow with numbered indicators (1-5), demonstrating the relationship between the managed AD service and customer workloads.

Figure 1: Example AWS Architecture of the connection between Customer VPC and FIS.

Figure 1 shows the architecture of the connection. (1) Workloads running in the customer VPC use the domain controllers of the AWS Managed Microsoft AD service, which run in (3) an AWS-managed VPC that is not visible in the customer’s AWS account. (4) The domain controllers connect through an endpoint into the VPC in the customer’s AWS account where the workloads are located. (5) AWS Fault Injection Service acts on the endpoint connection to disrupt its connectivity to the workloads in the customer VPC.

Testing Strategy with FIS

Identify domain controller endpoints in your VPC as the first step of testing. These endpoints appear as ENIs and are discoverable through AWS CloudShell or AWS CLI. The following command provides detailed interface information.

Note: Please replace <VPC-ID> in the example with your VPC ID.

aws ec2 describe-network-interfaces \
  --filters Name=vpc-id,Values=<VPC-ID> Name=description,Values='*AWS created*' \
  --query 'NetworkInterfaces[*].{ENI:NetworkInterfaceId,Description:Description,PrivateIP:PrivateIpAddress,AZ:AvailabilityZone,Status:Status}' \
  --output table
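If you prefer to process the result in a script rather than read the table, the same filtering can be sketched in Python. This is a minimal illustration, not part of the original walkthrough: the function name `domain_controller_ips` is hypothetical, and it assumes the response has the shape returned by the EC2 DescribeNetworkInterfaces API and that directory ENIs carry "AWS created" in their description, as in the command above.

```python
from typing import Any


def domain_controller_ips(response: dict[str, Any]) -> dict[str, str]:
    """Map each directory ENI's private IP address to its Availability Zone."""
    result: dict[str, str] = {}
    for eni in response.get("NetworkInterfaces", []):
        # Assumption: Directory Service ENIs contain "AWS created" in the description.
        if "AWS created" in eni.get("Description", ""):
            result[eni["PrivateIpAddress"]] = eni["AvailabilityZone"]
    return result
```

With boto3 installed and credentials configured, the input could come from `boto3.client("ec2").describe_network_interfaces(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])`, where `vpc_id` is your own VPC ID.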

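The FIS aws:network:disrupt-connectivity action takes a scope parameter that controls which traffic is denied. This post looks at three scope values: all, availability-zone and prefix-list.

  1. The “all” option denies all traffic entering and leaving the target subnet. Traffic between network interfaces inside the same subnet is not affected, because network ACLs are only evaluated at the subnet boundary.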
  2. The “availability-zone” isolation option specifically targets cross-AZ communication. This allows testing of your application’s behaviour when communication between AZs is disrupted, while maintaining intra-AZ connectivity.
  3. The “prefix-list” targeting option offers the most precise control. It requires creating a managed prefix list containing the specific IP addresses of your domain controllers using /32 CIDR notation. This approach allows for selective disruption of AD traffic while maintaining other network communications.
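The three scopes differ mainly in the parameters the action needs. The sketch below builds the parameters block for each of them; the helper name `disrupt_connectivity_parameters` is hypothetical, while the keys `duration`, `scope` and `prefixListIdentifier` are the ones used in the experiment template later in this post.

```python
from typing import Optional

SUPPORTED_SCOPES = {"all", "availability-zone", "prefix-list"}


def disrupt_connectivity_parameters(
    scope: str, duration: str = "PT10M", prefix_list_id: Optional[str] = None
) -> dict[str, str]:
    """Build the parameters block for aws:network:disrupt-connectivity."""
    if scope not in SUPPORTED_SCOPES:
        raise ValueError(f"scope must be one of {sorted(SUPPORTED_SCOPES)}")
    params = {"duration": duration, "scope": scope}
    if scope == "prefix-list":
        # Only the prefix-list scope needs a managed prefix list to act on.
        if not prefix_list_id:
            raise ValueError("prefix-list scope requires a prefix list ID")
        params["prefixListIdentifier"] = prefix_list_id
    return params
```

Failing early on a missing prefix list keeps a mistyped template from being submitted to FIS at all.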

Observability while Testing

End-to-end observability is essential during FIS experiments to understand how the application behaves while the fault is active. Amazon CloudWatch metrics should be configured to track authentication attempts per second, along with success and failure ratios. Additional metrics should include LDAP bind times, Kerberos ticket requests, and account lockout events, as these provide essential visibility into the authentication system’s health during experiments.

Monitoring network traffic is useful throughout the testing lifecycle, and detailed application logging for authentication events can be added alongside it. Security events also require specific attention in Windows event monitoring, for example Event ID 4625 for failed logon attempts, Event ID 4624 for successful logons, Event ID 5805 for Netlogon session setup failures, and Event ID 1000 for application errors.
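As an illustration of turning these events into a ratio you can track, the sketch below tallies the event IDs listed above and derives a logon failure ratio from Event IDs 4624 and 4625. It assumes you have already collected the event IDs into a list (for example from forwarded Windows event logs); the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Event IDs named in this post.
WATCHED_EVENT_IDS = {4624, 4625, 5805, 1000}


def count_watched_events(event_ids: Iterable[int]) -> Counter:
    """Count occurrences of the watched event IDs, ignoring all others."""
    return Counter(eid for eid in event_ids if eid in WATCHED_EVENT_IDS)


def logon_failure_ratio(event_ids: Iterable[int]) -> float:
    """Failed logons (4625) as a fraction of all logon attempts (4624 + 4625)."""
    counts = count_watched_events(event_ids)
    failed = counts[4625]
    total = failed + counts[4624]
    return failed / total if total else 0.0
```

Comparing this ratio before, during and after an experiment gives a simple signal of how far authentication degraded and how quickly it recovered.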

To visualise this data, create a CloudWatch dashboard that combines authentication metrics, network performance data, application health indicators, security event counts, and custom metrics for authentication timing. This consolidated view enables quick identification of issues during experiments and gives comprehensive oversight of the system during testing.
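A dashboard of this kind can be defined as JSON. The following is a minimal sketch that lays out one metric widget per custom metric on the 24-column dashboard grid. The namespace and metric names you pass in are assumptions: they must match whatever custom metrics your own workloads publish, as none are defined in this post.

```python
import json


def dashboard_body(region: str, namespace: str, metric_names: list[str]) -> str:
    """Return a CloudWatch dashboard body with one metric widget per metric."""
    widgets = []
    for index, name in enumerate(metric_names):
        widgets.append(
            {
                "type": "metric",
                # Two widgets per row on the 24-column dashboard grid.
                "x": (index % 2) * 12,
                "y": (index // 2) * 6,
                "width": 12,
                "height": 6,
                "properties": {
                    "title": name,
                    "metrics": [[namespace, name]],
                    "stat": "Sum",
                    "period": 60,
                    "region": region,
                },
            }
        )
    return json.dumps({"widgets": widgets})
```

The returned string can be supplied as the `DashboardBody` argument of the CloudWatch PutDashboard API, for example through boto3's `put_dashboard`.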

Ensure there is a Testing Plan

A complete testing plan requires baseline performance metrics, success criteria definition, detailed rollback procedures, and established communication routes. The testing process should be initiated in development environments and monitoring tools must be properly configured before any testing begins. Document the system’s current state and expected behaviour. Test and verify rollback procedures before starting experiments.

Throughout the testing process, it’s crucial to maintain comprehensive records documenting test scenarios, observed behaviours and responses, encountered issues, lessons learned, performance metrics, and recovery times. This documentation serves as a valuable resource for future testing and system improvements.
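One lightweight way to keep such records consistent is a small data structure per experiment run. The sketch below is only a suggestion: the class and field names are hypothetical, and it assumes you note the time the disruption ended and the time the service was observed to be healthy again.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ExperimentRecord:
    """Notes captured for a single experiment run."""

    scenario: str
    disruption_ended: datetime
    service_recovered: datetime
    observations: list[str] = field(default_factory=list)
    issues: list[str] = field(default_factory=list)
    lessons_learned: list[str] = field(default_factory=list)

    @property
    def recovery_time(self) -> timedelta:
        """Time between the end of the disruption and observed recovery."""
        return self.service_recovered - self.disruption_ended
```

Recording recovery time the same way for every run makes it possible to compare results as experiments grow in duration and scope.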

Begin with a progressive testing approach. Increase test durations as confidence grows, then incorporate additional services and dependencies into testing scenarios as the program matures. Move to production testing only after completing successful tests in development environments, establishing appropriate safeguards and documenting recovery procedures.

Testing a Single Domain Controller Disruption

The test begins by identifying the target ENI and creating an FIS experiment template. Your FIS template should specify the aws:network:disrupt-connectivity action targeting the subnet containing the selected domain controller. Then configure the duration parameter to allow sufficient time for monitoring authentication behaviour while maintaining a safety margin.

Once the target domain controller’s ENI is identified, create a managed prefix list containing the IP address of the target domain controller using /32 CIDR notation.
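The prefix list entries can be generated from the domain controller IP addresses. This is a minimal sketch with a hypothetical function name; it validates each address with Python's standard `ipaddress` module and emits entries in the `Cidr`/`Description` shape accepted by the EC2 CreateManagedPrefixList API. The description text is an arbitrary label.

```python
import ipaddress


def prefix_list_entries(dc_ips: list[str]) -> list[dict[str, str]]:
    """Turn domain controller IPv4 addresses into /32 prefix list entries."""
    entries = []
    for ip in dc_ips:
        # IPv4Address raises ValueError if the string is not a valid IPv4 address.
        address = ipaddress.IPv4Address(ip)
        entries.append(
            {
                "Cidr": f"{address}/32",
                "Description": "AWS Managed Microsoft AD domain controller",
            }
        )
    return entries
```

The result could be passed as `Entries` to boto3's `create_managed_prefix_list`, together with `AddressFamily="IPv4"`, a `PrefixListName` and a `MaxEntries` value of your choosing.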

Here is an example configuration. Items in angle brackets, such as <region>, <account-id>, <subnet-id> and <prefix-list-id>, should be replaced with values from your AWS account.

{
  "description": "Test AWS Managed AD resilience",
  "targets": {
    "dc-subnets": {
      "resourceType": "aws:ec2:subnet",
      "resourceArns": [
        "arn:aws:ec2:<region>:<account-id>:subnet/<subnet-id>"
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "disrupt-dc": {
      "actionId": "aws:network:disrupt-connectivity",
      "parameters": {
        "duration": "PT10M",
        "prefixListIdentifier": "<prefix-list-id>",
        "scope": "prefix-list"
      },
      "targets": {
        "Subnets": "dc-subnets"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:<region>:<account-id>:alarm:AuthenticationFailureAlarm"
    }
  ],
  "roleArn": "arn:aws:iam::<account-id>:role/service-role/FISExperimentRole",
  "tags": {},
  "experimentOptions": {
    "accountTargeting": "single-account",
    "emptyTargetResolutionMode": "fail"
  }
}
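The stop condition in the template refers to a CloudWatch alarm named AuthenticationFailureAlarm, which must exist before the experiment runs. The sketch below builds the arguments for such an alarm. The namespace, metric name and threshold are assumptions, since the post does not define which metric backs the alarm; substitute the custom metric your workloads actually publish.

```python
from typing import Any


def authentication_failure_alarm(
    namespace: str, metric_name: str, threshold: float
) -> dict[str, Any]:
    """Build arguments for an alarm that can act as the FIS stop condition."""
    return {
        "AlarmName": "AuthenticationFailureAlarm",
        "Namespace": namespace,      # assumption: your custom metric namespace
        "MetricName": metric_name,   # assumption: a failed-authentication count
        "Statistic": "Sum",
        "Period": 60,                # evaluate one-minute totals
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # Treat gaps in the metric as healthy so the alarm does not fire on silence.
        "TreatMissingData": "notBreaching",
    }
```

These keys match the parameters of the CloudWatch PutMetricAlarm API, so the dictionary could be unpacked into boto3's `put_metric_alarm`. When the alarm enters the ALARM state, FIS stops the experiment.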

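The experiment also requires an IAM role that FIS can assume, with permissions to create, modify and swap network ACLs on the target subnets and to read the managed prefix list:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkAcl",
                "ec2:CreateNetworkAclEntry",
                "ec2:CreateTags",
                "ec2:DeleteNetworkAcl",
                "ec2:DeleteNetworkAclEntry",
                "ec2:DescribeManagedPrefixLists",
                "ec2:DescribeNetworkAcls",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:GetManagedPrefixListEntries",
                "ec2:ReplaceNetworkAclAssociation"
            ],
            "Resource": "*"
        }
    ]
}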
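The role also needs a trust policy that allows the FIS service to assume it. A minimal sketch of that trust policy, generated in Python so it can be reused in scripts (the function name is hypothetical):

```python
import json


def fis_trust_policy() -> str:
    """Return a trust policy document that lets AWS FIS assume the role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

The returned document is what you would supply as the role's trust relationship, for example as `AssumeRolePolicyDocument` when creating the role through the IAM CreateRole API.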

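During the experiment, FIS temporarily clones the network ACL associated with each target subnet, adds deny rules for the addresses in the prefix list to the clone, and associates the clone with the subnet. This blocks traffic to the specified domain controller. When the action ends, the original network ACL association is restored.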

For a complete Directory Service isolation test, add every subnet where a domain controller is connected to the subnet target list, and make sure the managed prefix list contains the /32 address of each domain controller. The following snippet shows how the target list can be extended, where <subnet-id-1> and <subnet-id-2> are the subnets of the two domain controllers.

      "resourceArns": [
        "arn:aws:ec2:<region>:<account-id>:subnet/<subnet-id-1>",
        "arn:aws:ec2:<region>:<account-id>:subnet/<subnet-id-2>"
      ],
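If you generate the template from a script, the ARN list can be built from the subnet IDs rather than typed by hand. A minimal sketch, with a hypothetical function name, that follows the ARN format used in the template above and drops duplicate subnet IDs:

```python
def subnet_arns(region: str, account_id: str, subnet_ids: list[str]) -> list[str]:
    """Build subnet ARNs for the FIS target list, preserving order without duplicates."""
    unique_ids = list(dict.fromkeys(subnet_ids))
    return [
        f"arn:aws:ec2:{region}:{account_id}:subnet/{subnet_id}"
        for subnet_id in unique_ids
    ]
```

For example, `subnet_arns("eu-west-2", "111122223333", ["subnet-aaa", "subnet-bbb"])` yields two ARNs ready to place in `resourceArns`; the region, account ID and subnet IDs here are placeholders.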

Conclusion

AWS Fault Injection Service enables controlled testing of AWS Managed Microsoft AD domain controller connectivity, allowing organisations to validate application behaviour during AD disruptions. While initial testing should be conducted in non-production environments, regular controlled failure testing should be integrated into disaster recovery planning. This proactive approach helps teams validate system resilience, identify potential issues early, and develop effective incident response procedures.

If you would like to get started with AWS Fault Injection Service (FIS), you can access it through the AWS Resilience Hub, where you can create controlled chaos engineering experiments on your AWS workloads.

Stephen Glasgow

Stephen Glasgow is a Senior Partner Solutions Architect at AWS. He supports AWS partners as they develop capability in AWS technologies to create effective and transformative solutions for their customers, with particular focus on public sector initiatives. Stephen brings more than 20 years of experience in delivering and architecting enterprise-scale transformations, with deep domain expertise in resilience and business continuity.