AWS Partner Network (APN) Blog

Assessing Application Resilience: A getting started guide for AWS Partners

By: Anshu Kapoor, Sr. PSA – AWS
By: Diego Dalmolin, WW Resilience Principal PSA – AWS

 

Assessing resilience posture is an important step in building reliable and highly available applications in the cloud. A resilient application is one that can respond to, withstand, and recover from failures. Evaluating the resilience posture of a business-critical application helps ensure that the right resilience strategies, services, and mechanisms are implemented and that they align with the organization’s broader business resilience objectives and continuity planning. In this blog, we explore four AWS resilience assessment mechanisms and provide guidance on how to choose the appropriate approach based on your customer’s needs and maturity level.

Before we dive into the four resilience assessment mechanisms, let’s first establish a common understanding of what we mean by “cloud resilience”. Cloud resilience refers to the ability of an application to resist or recover from disruptions, including those related to infrastructure, dependent services, misconfigurations, transient network issues, and load spikes.

It’s important to note that cloud resilience is a shared responsibility between AWS and the customer. AWS has made significant investments in building and operating the world’s most resilient cloud, with a focus on global infrastructure, robust service design and deployment, and a strong operational culture. According to the resilience shared responsibility model, customers are responsible for designing, deploying, and operating their applications in a resilient manner, leveraging the resilience capabilities provided by AWS. This shared responsibility model is a key consideration when assessing the resilience posture of a critical application, and it is where AWS Partners can play an important role in helping customers.

Assessing Resilience

Based on years of working with customers and internal teams, AWS has developed the resilience lifecycle framework, which captures resilience learnings and best practices. The framework outlines five key stages, illustrated in Figure 1. At each stage, you can use strategies, services, and mechanisms to improve your resilience posture.

Figure 1: Resilience lifecycle framework

AWS provides multiple mechanisms to assess the resilience posture of a workload: the AWS Well-Architected Framework, AWS Resilience Hub, the Resilient Application Readiness Assessment (RA2), and the Resilience Core Program (RCP). These mechanisms align with different stages of the resilience lifecycle framework. Resilience modeling (used by RCP) and the AWS Well-Architected Framework are recommended at the design and implement stage. AWS Resilience Hub is recommended as a post-deployment activity in the evaluate and test stage. RA2 aligns with the design and implement stage for new applications and is also a post-deployment activity in the evaluate and test stage. In the operate stage, we recommend that you iterate on the review of your application’s resilience posture.

By using these assessment mechanisms, you can quickly identify risks and opportunities to improve the resilience of an application. This builds trust with the customer and allows you to recommend the appropriate resilience strategies and services to mitigate those risks. Resilience assessment is a key step in ensuring that the application is designed, deployed, and operated in a way that protects the business from the impact of unexpected events.

We will now dive into each of the four mechanisms.

Well-Architected Framework

The AWS Well-Architected Framework is a foundational mechanism for assessing the resilience of applications. This framework provides a structured way to evaluate the design and implementation of applications across six key pillars (Reliability, Performance Efficiency, Security, Cost Optimization, Operational Excellence, and Sustainability).

We recommend focusing on all six pillars when building resilient applications, but when assessing resilience we recommend increased scrutiny of the Reliability and Operational Excellence pillars. The Reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently. This includes understanding availability needs, fault tolerance, and Disaster Recovery (DR) planning. The Operational Excellence pillar looks at the processes and procedures for maintaining application health, refining operations procedures frequently, anticipating failure, and learning from all operational failures.

Reviewing applications with the Well-Architected Framework helps you identify gaps in resilience practices and get guidance on how to close them. We recommend it as a starting point for any customer looking to build resilient systems on AWS.

Figure 2: Pillars of AWS Well-Architected Framework

AWS Resilience Hub

AWS Resilience Hub is a service that helps you assess, manage, and improve the resilience of applications running on AWS. It allows customers to define resilience goals, assess applications against those goals, and get actionable recommendations, with concrete steps to improve the application architecture, observability, and operational procedures. Key features include the ability to describe applications, define resilience policies (expressed as a Recovery Point Objective, or RPO, and a Recovery Time Objective, or RTO), assess resilience, get recommendations for improvement, and track resilience improvement over time using a scoring mechanism (Figure 3). AWS Resilience Hub provides a centralized place to continuously strengthen the resilience posture of applications. To learn more about how AWS Resilience Hub can be used to improve resilience posture, see the Introduction to AWS Resilience Hub training.

Figure 3: AWS Resilience Hub service console
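
The idea of assessing an application against a resilience policy can be sketched in a few lines of Python. This is only an illustration of comparing RTO and RPO targets with estimated values; it does not call AWS Resilience Hub, and the `Objective` class, the `find_breaches` function, the disruption names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """Recovery targets or estimates, in seconds (hypothetical structure)."""
    rto_secs: int  # Recovery Time Objective: how long recovery may take
    rpo_secs: int  # Recovery Point Objective: how much data loss is tolerable


def find_breaches(policy, estimates):
    """Return one message per disruption type whose estimate misses the policy."""
    breaches = []
    for disruption, target in policy.items():
        estimate = estimates.get(disruption)
        if estimate is None:
            breaches.append(f"{disruption}: no estimate available")
            continue
        if estimate.rto_secs > target.rto_secs:
            breaches.append(
                f"{disruption}: estimated RTO {estimate.rto_secs}s "
                f"exceeds target {target.rto_secs}s"
            )
        if estimate.rpo_secs > target.rpo_secs:
            breaches.append(
                f"{disruption}: estimated RPO {estimate.rpo_secs}s "
                f"exceeds target {target.rpo_secs}s"
            )
    return breaches


# Hypothetical policy targets and assessment estimates, for illustration only.
policy = {
    "availability_zone": Objective(rto_secs=300, rpo_secs=60),
    "region": Objective(rto_secs=3600, rpo_secs=900),
}
estimates = {
    "availability_zone": Objective(rto_secs=120, rpo_secs=30),
    "region": Objective(rto_secs=7200, rpo_secs=900),
}

breaches = find_breaches(policy, estimates)
for message in breaches:
    print(message)
```

In a delivery pipeline, a check of this kind could fail the build whenever the list of breaches is not empty, which is one way to make resilience assessment a regular activity.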

Resilient Application Readiness Assessment (RA2)

The Resilient Application Readiness Assessment (RA2) is a deep-dive resilience assessment built by AWS Professional Services. RA2 is an objective assessment with a consistent delivery model. It helps customers adhere to AWS best practices for high availability and disaster recovery of their applications in the AWS Cloud.

RA2 allows AWS Partners to engage customers with a comprehensive technical and operational assessment of the resilience posture of their critical applications across eight resilience principles (Figure 4): Disaster Recovery, Observability, Change Management, Redundancy, Durability, Operations, Testing, and Scalability. The RA2 assessment consists of 85 questions across these areas, each with a maturity guide and rating to enable guided self-identification of current approaches for the target application. RA2 is available to all AWS Partners via the AWS Assessment Tool (A2T). To learn more about RA2, refer to the sales training, delivery training, and demo, or contact your AWS Partner alliance team.

Figure 4: RA2 resilience principles
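
To make the idea of per-question maturity ratings concrete, the following Python sketch averages ratings per resilience principle and lists the weakest areas first. It is a hypothetical illustration only: the 1 to 5 rating scale, the `summarize` function, and the sample answers are assumptions, and the sketch does not reflect the actual RA2 question set or the A2T data format.

```python
from collections import defaultdict
from statistics import mean

# The eight RA2 resilience principles named in this post.
PRINCIPLES = [
    "Disaster Recovery", "Observability", "Change Management", "Redundancy",
    "Durability", "Operations", "Testing", "Scalability",
]


def summarize(answers):
    """Average the ratings per principle and return them lowest first."""
    by_principle = defaultdict(list)
    for principle, rating in answers:
        if principle not in PRINCIPLES:
            raise ValueError(f"unknown principle: {principle}")
        by_principle[principle].append(rating)
    averages = {name: mean(ratings) for name, ratings in by_principle.items()}
    return sorted(averages.items(), key=lambda item: item[1])


# Hypothetical answers as (principle, rating) pairs on an assumed 1-5 scale.
answers = [
    ("Disaster Recovery", 2), ("Disaster Recovery", 3),
    ("Observability", 4), ("Observability", 4),
    ("Testing", 1), ("Testing", 2),
]

summary = summarize(answers)
for principle, average in summary:
    print(f"{principle}: {average}")
```

Sorting by average rating puts the principles that most need attention at the top, which is a natural starting point for prioritizing recommendations.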

RA2 helps partners build their resilience practice on their journey towards an AWS Resilience Competency. Existing AWS Resilience Competency Partners use RA2 to strengthen their resilience practice. RA2 provides insights into risks and opportunities, along with guidance on best practices for improving the resilience of a customer’s application. The value of RA2 lies in its detailed recommendations and guidance for improving the resilience posture of applications. The follow-on opportunities generated from RA2 outweigh the partner’s investment in conducting these assessments and offer a profitable business model for AWS Partners.

Resilience Core Program (RCP)

RCP takes the resilience best practices that Amazon uses to build systems, distills them down to the core resilience principles that generally apply to all customers, and uses a structured walk-through of a customer journey to identify critical components, dependencies, and potential failure modes of systems.

RCP uses a systems-based resilience modeling approach. This approach starts with business goals, drives down through the business processes, and continues into the technology layer. It also reviews supporting functions, such as monitoring/observability and incident response, to identify potential areas of concern.

RCP produces resilience recommendations as tangible outcomes that partners can leverage as follow-on opportunities to engage their customers and address resilience gaps. To learn more about RCP, refer to the Partner Led RCP – Guide or contact your AWS Partner alliance team.

Figure 5: RCP system-based resilience modeling

What is the right mechanism for my customer’s application?

The availability of multiple assessment mechanisms raises the question of which mechanism is right for your customers. The right choice depends on the customer’s journey, their maturity, and the criticality of the workload.
Here are some key highlights to help you pick the most applicable mechanism for your customer:

Well-Architected Framework
  • A suitable starting point for resilience
  • Holistic view across all pillars, for example Reliability versus Cost Optimization
  • Incentives to engage a Well-Architected partner

AWS Resilience Hub
  • Investigate technical resiliency risks
  • Assess resilience regularly
  • Build into CI/CD pipeline
  • Easily incorporate new recommendations and best practices

Resilient Application Readiness Assessment (RA2)
  • Consultant-led offering
  • Deep dive into application resiliency
  • Question-based discussion
  • Includes recommendations
  • Automated report generation

Resilience Core Program (RCP)
  • Consultant-led offering
  • Works backwards from business outcomes to prevent loss
  • Reviews the resilience controls of a critical application, business need, or business function
  • Analysis framework uses a systems-based resilience threat modeling approach
  • Produces a resilience-focused roadmap for customer-critical applications

Summary

Evaluating the resilience posture of applications is an essential step in developing reliable and highly available cloud-based systems. For customers, resilience assessment helps to identify risks and implement the appropriate resilience strategies, services, and mechanisms. For AWS Partners, these resilience assessment mechanisms help to build and mature their resilience practice, earn trust with customers, and find new opportunities to implement resilience recommendations.

It is important to select the appropriate assessment mechanism to match your customer’s needs and maturity level. Start with the Well-Architected Framework and AWS Resilience Hub, and then consider the more in-depth options based on the complexity and criticality of the application.