AWS for Industries
DORA scenario testing with AWS Fault Injection Service
The Digital Operational Resilience Act (DORA) is a pan-European legislative framework on operational resilience and cyber resilience in the financial sector. A key component of DORA is the requirement for regulated entities to perform digital operational resilience testing in order to help improve the resilience of their workloads. This blog post outlines how you can use AWS Fault Injection Service (FIS) to support the DORA requirements around scenario-based testing through a structured, iterative process of identifying failure scenarios, planning and executing chaos engineering experiments, reporting on the results, and using the information learned to improve operational resilience.
What is chaos engineering?
Chaos engineering is a methodical approach for testing and validating the resilience of complex systems by deliberately introducing failures and disruptions. Its purpose goes beyond checking known conditions: it uncovers unknown weaknesses and vulnerabilities that only manifest under complex, real-world scenarios. This approach can reveal flaws that conventional testing methods, such as functional and integration tests, often miss.
Introduction to DORA
The overall goal of DORA is to ensure the financial sector can withstand, respond to, and recover from Information and Communication Technology (ICT)-related disruptions and cyber threats, in order to maintain the stability and integrity of the financial system. In 2024, AWS published the AWS User Guide to the Digital Operational Resilience Act and the AWS guide to building and operating financial services workloads for DORA (Level 2). DORA requires financial entities (FEs) to implement digital operational resilience testing to ensure the firm’s overall operational resilience in the face of ICT-related disruptions. This testing validates the ability of systems, tools, processes, and people to continue operating under stress or in degraded conditions.
Chaos engineering to support DORA scenario-based testing requirements
DORA requires FEs to identify their critical or important functions and to implement a comprehensive digital operational resilience testing program for these functions. This testing will include scenario-based testing as described in Article 25 of the DORA regulation. From a system resilience perspective, FEs will need to evaluate whether the workloads that support critical or important functions can continue operating under failure conditions. Systems should fail gracefully to alternate technical or manual capabilities while ensuring that end users of these systems are not adversely affected.
Some FEs have existing testing mechanisms in place that can help validate the resilience of their systems under the impact of failures, such as table-top exercises that simulate and analyse failure scenarios. Chaos engineering complements these exercises with a more practical approach that tests how systems react and respond to failure under real-world conditions. Chaos experiments introduce failures affecting both the technical and people aspects of organisations, as well as the processes guiding them. Organisations typically include mitigations for such failures in the design and implementation of their systems, and chaos engineering helps validate the effectiveness and comprehensiveness of these mitigations.
Applying the AWS resilience lifecycle framework for DORA scenario testing
AWS developed a resilience lifecycle framework that captures resilience lessons and best practices from years of working with customers and internal teams designing, building and operating resilient systems. The framework outlines five key stages that are illustrated in the following diagram. At each stage, you can use strategies, services, and mechanisms to improve your resilience posture. We recommend customers operationalise the resilience lifecycle framework for critical and important functions and their supporting systems. This integrates resilience as an ongoing process throughout the software development lifecycle (SDLC), rather than a one-time activity.
Figure 1: Resilience lifecycle framework
Figure 2: Evaluate and test phase
The Evaluate and test phase of the framework verifies that the system meets its resilience requirements. We recommend the following iterative, five-stage, failure scenario-based approach to perform this validation. Each iteration increases your understanding of the system’s resilience posture by validating or disproving design assumptions about resilience.
By repeatedly going through these stages using testing tools like FIS, financial entities can systematically identify and address resilience gaps in systems supporting critical or important functions to help meet DORA requirements.
Stage 1: Identify failure scenarios
To start the process, identify a set of failure scenarios, each structured as a hypothesis about how the system reacts to specific injected failures. FEs need to test the resilience of systems against failures such as loss of critical infrastructure or degraded operating conditions. Most large, distributed, complex systems could have hundreds of failure scenarios in scope, and it’s impractical to identify and test them all. The DORA regulation does not explicitly call out specific scenarios to test. You need to identify which scenarios to test based on their likelihood, impact, cost, and feasibility of testing, and the potential to gain insights and make improvements based on the test results. AWS recommends using the following mechanisms to identify and prioritise failure scenarios to test and report for DORA compliance.
1. Identify critical functionalities – DORA requires FEs to identify their business-critical functionality, which helps prioritise what to target. For example, in an online digital banking application, focus on failures that can affect the processing of payments or the critical databases hosting the payments data.
2. Consider failure modes – You can’t predict everything. To avoid wasting time, consider the following key failure modes that systems can exhibit.
- Bimodal behaviour – Bimodal behaviour refers to a system or component that exhibits two distinct modes of behaviour under different conditions. For example, a system that uses caching in front of a database can be bimodal: it behaves one way while the cache is serving requests and another when requests fall through to the database. The system’s behaviour, response time, and resource utilisation can differ significantly between these two modes.
- Dependencies – Dependencies refer to the relationships between different components or modules within a system, where one component relies on the presence or functionality provided by another component or service. Managing dependencies is crucial because changes or updates to one component can potentially break or impact the components that depend on it. Failure scenarios that test the impact of these dependencies, whether they are partially or fully degraded, will provide valuable insights into the resilience of the system. Reviewing system design documents can help identify these dependencies.
- Fault isolation boundaries – Fault isolation boundaries are design techniques used to prevent faults or errors from propagating and affecting other components or subsystems. In simpler terms, they create isolated boundaries within a system, limiting the impact of failures to a specific area and preventing them from affecting the entire system. These boundaries are implemented through mechanisms such as process isolation, virtual machine isolation, or hardware-based isolation, ensuring that if one component fails, the rest of the system can continue to operate normally. Include failure scenarios that validate the effectiveness of your system’s fault isolation mechanisms.
- Use formal mechanisms such as Failure Mode Effects Analysis (FMEA) – FMEA can help FEs identify potential single points of failure, design mitigations for them, and prioritise which failure scenarios to test. Review the AWS Well-Architected Framework, which includes prescriptive guidance and detailed implementation steps for using FMEA to identify failure scenarios.
3. Review previous incidents – Review Root Cause Analyses (RCAs) and post-mortems from previous incidents. Most organisations use such mechanisms to comprehensively analyse and document incidents, their business and technical impact, and what caused them to occur. Tagging and classifying these incidents can reveal common failures that are likely to recur and should therefore be included as part of the failure testing exercise for DORA.
4. Review operational runbooks – Operational runbooks include recovery procedures and incident response processes that chaos engineering can help prove work as expected.
It’s important to involve key stakeholders from business, technical, and operational teams when prioritising failure scenarios. This collaborative approach ensures a comprehensive assessment of each scenario’s impact on system resilience and aligns testing efforts with diverse organisational perspectives; one way to capture the resulting prioritised scenarios is sketched below.
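To make the output of this stage concrete, the following minimal sketch shows one way a prioritised failure-scenario hypothesis could be captured as structured data before any experiments are planned. The field names, scoring scale, and scenario content are illustrative assumptions, not a prescribed DORA or AWS format.

```python
# Illustrative sketch: capture failure-scenario hypotheses as structured data
# and prioritise them. Field names and 1-5 scoring scales are assumptions.
failure_scenarios = [
    {
        "id": "FS-001",
        "critical_function": "Payment processing",
        "failure_mode": "Dependency degradation",
        "hypothesis": (
            "If the payments database in one AZ becomes unreachable, the API "
            "continues serving requests from the standby AZ with p99 latency "
            "below 2 seconds."
        ),
        "business_metrics": ["payment_success_rate", "p99_latency_ms"],
        "likelihood": 3,        # 1 (rare) to 5 (frequent), based on incident history
        "impact": 5,            # 1 (negligible) to 5 (critical function unavailable)
        "test_feasibility": 4,  # 1 (impractical) to 5 (easy to test safely)
    },
]

# Test the highest-risk scenarios that are practical to test first.
prioritised = sorted(
    failure_scenarios,
    key=lambda s: s["likelihood"] * s["impact"] * s["test_feasibility"],
    reverse=True,
)
for scenario in prioritised:
    print(scenario["id"], scenario["hypothesis"])
```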
Stage 2: Plan failure testing
This stage involves setting up the target environment for chaos engineering and making the system modifications needed to enable chaos experiments. Ensure you can measure the system’s performance and reliability through observable metrics; you may need to improve your system’s observability so that these metrics are captured reliably. You might also have to change your system’s security setup if you intend to use chaos engineering tooling, such as FIS, for injecting failures, and you need to ensure access controls and permissions are set up for the teams responsible for executing chaos experiments. Test environments should be similar to live environments in terms of infrastructure configuration and user traffic. Finally, have a solid rollback plan and mitigation strategies prepared so you can quickly revert experiments if unexpected consequences arise.
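As one example of this preparation, the sketch below uses boto3 to create a CloudWatch alarm on a hypothetical business metric (the PaymentsApp namespace and PaymentSuccessRate metric are assumptions for illustration). The same alarm can later be referenced as an FIS stop condition so that an experiment halts automatically if the metric breaches its threshold.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on a business metric that the workload is assumed to emit. The
# namespace, metric name, and threshold are placeholders; use the metrics
# your own system publishes.
cloudwatch.put_metric_alarm(
    AlarmName="payments-success-rate-low",
    Namespace="PaymentsApp",
    MetricName="PaymentSuccessRate",
    Statistic="Average",
    Period=60,                 # evaluate the metric every minute
    EvaluationPeriods=3,       # alarm after three consecutive breaching periods
    Threshold=99.0,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # treat missing data as a failure signal
)
```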
Stage 3: Execute failure scenarios
When executing failure scenarios through chaos engineering experiments, it is crucial to take a controlled and gradual approach, particularly in production environments. Start in a pre-production environment, or isolate experiments from customer traffic as much as possible by targeting off-peak hours or using techniques like A/B testing or canary releases. Start with a small blast radius, such as a single host, and gradually increase the scope as you gain confidence. Introduce randomness in timing, experiment types, and parameters to uncover a wide variety of issues. Ensure teams have proper training and tools for monitoring and logging, and are prepared to intervene if needed. Prioritise fixing high-risk issues discovered before moving to the next chaos experiment, and make chaos experiments a continuous practice.
FIS is a fully managed service for running fault injection experiments. For a detailed explanation of core concepts related to chaos engineering and how to use FIS for conducting chaos experiments, see Verify the resilience of your workloads using chaos engineering.
FIS provides a pre-built library of typical failure scenarios that can help speed up this stage. These scenarios address common resilience testing needs, such as simulating Availability Zone (AZ) impairment and cross-Region connectivity issues.
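The following minimal sketch shows how a controlled, small-blast-radius experiment might be defined with the FIS API using boto3. The role ARN, alarm ARN, account ID, Region, and tag values are placeholder assumptions; the action stops a single tagged EC2 instance for ten minutes so the scope stays limited while you observe recovery behaviour.

```python
import boto3

fis = boto3.client("fis")

# Sketch of an experiment template with a deliberately small blast radius:
# stop one EC2 instance tagged as part of the payments service. The ARNs and
# tag values are placeholders to replace with your own resources.
response = fis.create_experiment_template(
    clientToken="payments-single-host-experiment-001",
    description="Stop one payments instance and observe recovery",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:payments-success-rate-low",
        }
    ],
    targets={
        "one-payments-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Service": "payments"},
            "selectionMode": "COUNT(1)",  # limit the blast radius to a single host
        }
    },
    actions={
        "stop-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},
            "targets": {"Instances": "one-payments-instance"},
        }
    },
    tags={"Purpose": "DORA-scenario-testing"},
)
print(response["experimentTemplate"]["id"])
```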
Stage 4: Report on scenario testing
After executing chaos engineering experiments, generate a failure scenario test report. The report will provide a comprehensive data-driven audit of testing activities and include the following elements:
1. Successful outcomes. Highlight aspects of the system that showed resilient behaviour during testing.
2. Identified resilience gaps. Detail areas where the systems fell short or unexpected behaviours occurred.
3. Next steps and recommendations. Outline specific actions for improvement based on test results, prioritised by criticality, potential blast radius, and level of effort.
4. Metrics and Key Performance Indicators (KPIs). Include quantitative measures of system performance during tests, aligned with key business metrics.
Share these reports with relevant stakeholders, including engineering teams, operations staff, business leaders, and potentially regulators or auditors. They serve as living documents, updated regularly to reflect ongoing resilience efforts and changing system dynamics.
The following AWS guidance describes a report format that can be used to quantitatively demonstrate validated system resilience.
Scenario-based failure testing – report template
The test report summarises the key objectives of the scenario-based test, the failure scenarios tested, and the test results, including audit evidence. We recommend including the following key elements in your test report:
1. Test objective. Key resilience goal to achieve.
2. Failure scenario. Comprehensive description of the failure scenario: the actions to be executed against people, process, and technology components, and the business metrics with their expected and observed impact. Each scenario can include multiple failures, and the report details every failure introduced, the technical or functional component affected by each failure, and the business and technical metrics to be observed, together with the expected and observed impact for each metric.
3. Test constraints. Key technical or business factors that affected the scope, impact, or execution of the failure scenario.
4. Result summary. Summary of goals achieved/not achieved and next steps.
5. Test evidence. Evidence of the failure scenario testing. This could come from the tools used to perform failure scenario testing, such as FIS, and from AWS governance and compliance services, such as AWS CloudTrail, that provide an operational audit of activities in the customer’s AWS account. FIS experiment reports can provide evidence of chaos testing and the system’s recovery response; the report summarises experiment actions and optionally captures the application response from a CloudWatch dashboard that you specify. A sketch of collecting such evidence programmatically follows this list.
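Assuming an experiment has already been run, the sketch below illustrates one way such evidence could be collected programmatically: retrieving the final state of an FIS experiment and pulling the corresponding CloudTrail events as an operational audit trail. The experiment ID is a placeholder for one of your own runs.

```python
from datetime import datetime, timedelta, timezone

import boto3

fis = boto3.client("fis")
cloudtrail = boto3.client("cloudtrail")

# Retrieve the final state and timing of a completed experiment.
experiment = fis.get_experiment(id="EXP123456789abcdef")["experiment"]
print(experiment["state"]["status"], experiment.get("startTime"), experiment.get("endTime"))

# Look up the CloudTrail events recorded for FIS API calls over the testing
# window, as an audit of who started and stopped experiments.
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "fis.amazonaws.com"}
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
    EndTime=datetime.now(timezone.utc),
)
for event in events["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username"))
```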
Example test report
Below is an example of the preceding report structure, populated for a critical payment processing system for reference. The system is composed of APIs that perform payment processing and uses Amazon API Gateway, Amazon EKS, and Amazon RDS as its core technical components. System components are deployed across multiple AZs on AWS to address resilience requirements. The goal is to validate the resilience of the system when a single AZ is impaired.
- Test objective – For a payment system deployed in multiple AZs, validate that the system can continue operating with no impact on end users or their ability to execute payments if one AZ is impaired.
- Failure scenario tested – In this example, you use FIS to conduct the failure scenario testing, using the AZ availability: Power interruption scenario from the FIS scenario library. This scenario induces the expected symptoms of a complete interruption of power in an AZ; a sketch of starting such an experiment run is shown after this list.
- Test constraints – The testing only covers functional components of the system hosted on AWS and does not include introducing failures into on-premises system components.
- Result summary – Scenario testing showed that system performance degrades with one Availability Zone impaired. Failures in business services hosted on EKS, triggered by the AZ impairment, caused upstream systems to queue payments. The blast radius of the failure was contained to components operating in the impaired AZ, with customer traffic diverted within X secs to an alternate AZ. There was a visible impact on end users of the system, with some in-flight payments affected by the switchover. End users had to retry failed payments, and although most payments were processed, some payments failed to process.
- Evidence of testing – FIS-generated experiment reports can be attached to the test report to provide test evidence for the failure scenario. The report includes:
- Experiment details such as date and time of the execution, duration, and status of the test
- Details of components targeted in the experiment
- Timeline of execution for each failure action introduced into the system
- Graphs for key metrics observed during the experiment
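For reference, the minimal sketch below shows how such an experiment run might be started from a template created from the scenario library. The template ID and tag values are placeholder assumptions; the template itself would typically be created from the AZ availability: Power interruption scenario and parameterised for your workload.

```python
import boto3

fis = boto3.client("fis")

# Start an experiment run from an existing experiment template. The template
# ID and tags are placeholders for your own resources.
experiment = fis.start_experiment(
    clientToken="az-power-interruption-run-001",
    experimentTemplateId="EXT123456789abcdef",
    tags={"Scenario": "az-power-interruption", "System": "payments"},
)["experiment"]
print(experiment["id"], experiment["state"]["status"])
```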
We provide screenshots below from an example FIS experiment report for reference.
Stage 5: Remediate resilience issues
Perform post-experiment analyses to learn from failures and integrate these lessons into operational processes, such as operational readiness reviews for incident preparedness or runbooks for incident response. A scenario-based testing approach helps confirm or challenge assumptions about a system’s resilience design, and conducting these experiments helps teams uncover hidden resilience gaps and develop targeted design improvements that enhance overall system reliability. After implementing resilience design mitigations and enhancing observability and operational capabilities, iterate the failure scenario testing process to validate the improvements.
Conclusion
Using AWS FIS and the iterative chaos engineering process outlined in this blog post enables FEs to systematically validate system resilience against DORA’s scenario-based testing requirements. By methodically identifying, planning, executing, and reporting on failure scenarios, organisations can uncover hidden issues and assess and improve their operational resilience posture. Making chaos engineering a regular practice integrated into the software delivery lifecycle helps organisations proactively identify and address resilience gaps across their critical workloads. This data-driven approach to resilience testing provides the empirical evidence needed to support your DORA compliance.