AWS Architecture Blog

IT Resilience Within AWS Cloud, Part I: Mindset and Culture

As customers migrate to the cloud, many struggle to adapt business continuity and operational plans from their on-premises environments. This affects the resilience of critical business applications and can stall cloud adoption. This two-part blog series will provide guidance on implementing IT resilience strategies. In Part I, we’ll review challenges commonly experienced by executive builders. We’ll also explore the definition of resilience in the cloud and key considerations for adapting mindsets and organizational culture. In Part II, we’ll explore the technical considerations related to architecture and patterns.

Customer challenges

In a discussion about a data center exit strategy, a senior IT executive at a global financial services organization said, “I feel we’re more mature today to consider a business case for a data center exit […] to AWS. But my main concern is how I can ensure resilience in this hybrid environment while still meeting our regulatory compliance requirements.” After we asked follow-up questions, it became clear that their concern was how to implement a two-site data center disaster recovery (DR) model in the cloud.

In the past 12 months, we’ve increasingly observed that business leaders identify, directly or indirectly, resilience as a primary area of concern. Often, this concern comes up in discussions about business continuity due to the COVID-19 pandemic. It’s also sometimes mixed with concerns about highly publicized security or outage events. Other times, it’s expressed during discussions around cloud technology due diligence. For the executive builders in these organizations, it has taken years to get to an IT “compliance equilibrium” with regulations. Driving digital transformation efforts while simultaneously migrating legacy workloads to the cloud is operationally challenging. This challenge can disrupt the compliance equilibrium and stall cloud adoption.

Considerations to get started

Cloud adoption is often central to any large-scale digital transformation in a customer’s business. But transformations are disruptive, and disruption is unnerving. Building a resilience strategy and resilient infrastructure can give your organization peace of mind. But resilient systems need resilient organizations; the two go hand-in-hand. The following considerations will help executive builders get started.

1. Understanding resilience in the cloud

How should the executive builder think about resilience in the cloud? Resilience is a measure of how an infrastructure, workload, or platform can protect itself against disruption caused by adverse events and conditions. Like other architecture attributes, resilience is measured on a scale (that is, a degree to which a system is resilient). It is not measured as a binary feature (in other words, resilient vs. non-resilient).

Furthermore, resilience is an overarching attribute that is tied to other architecture attributes such as availability, security, and performance. Due to its business affinity (that is, business continuity), non-technology leaders often use the term “resilience” more broadly to mean any number of related architecture attributes. But for executive builders, we recommend centering your resilience strategies around availability, performance, and disaster recovery.
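To make the idea of a scale concrete, availability is commonly expressed as a percentage of time a system is up, and each percentage maps to a yearly downtime budget. The following is a minimal sketch of that arithmetic; the function name and the 365-day year are assumptions chosen for illustration, not a definition from this post.

```python
# Illustrative only: converts an availability target into a yearly downtime budget.
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year that a given availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few targets side by side to show resilience as a degree, not a yes/no.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% availability allows {allowed_downtime_hours(target):.2f} hours of downtime per year")
```

Comparing targets this way can help business and technology leaders agree on how much resilience a given workload actually needs.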

2. Practice and automate resilience strategies

How do executive builders ensure investment in resilience will pay off? Our answer: continuously subject your systems to conditions that build an organizational “muscle” that will support these systems during normal and abnormal times.

In the following subsections, we share effective practices we have observed from long-time cloud adopters to help you go beyond designing for resilience and employ an increasing degree of complexity and automation.

Architecture reviews 

The AWS Well-Architected Framework guides leaders through building and maintaining resilient infrastructures, applications, and data. At a minimum, we suggest incorporating AWS Well-Architected Reviews frequently in your lifecycle management and using the AWS Well-Architected Tool to sustain and improve resilience over time. We also suggest using the various AWS Well-Architected Lenses to consider critical workloads and technology domains such as analytics stacks or high performance computing (HPC) clusters. This practice will push resilience questions to the top of each discussion, such as “what happens when this fails?” where “this” is any critical component of your environment.

Table-top incident simulations 

Just like routine fire drills, executive builders must periodically test their operational plan to respond to an incident.

Your operational recovery scenarios related to workloads, infrastructure, or data should be tested at a cadence that matches your organization’s development lifecycle pace. We suggest starting with quarterly tests and working towards only testing during major lifecycle milestones.

For full disaster recovery scenarios, we suggest starting with an annual review because it may be required by compliance regulations. From there, we suggest performing a quarterly review to identify ways to strengthen your resilience.

Chaos engineering

Over time, cloud adopters can invest in automating many of the anticipated events and incidents that would challenge their systems’ resilience. Principles of chaos engineering can be adopted to build these capabilities within your environment. For example, AWS Fault Injection Simulator can be deployed to make it easier for teams to discover weaknesses in their environments at scale. This practice will help your team adopt an “everything fails eventually” mindset, which will help them prioritize resilient design patterns.
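The core idea of fault injection can be illustrated without any cloud service. The following is a minimal, in-process sketch, assuming a hypothetical dependency call and hypothetical helper names (`with_fault_injection`, `call_with_retries`, `run_experiment`). It does not use or represent the AWS Fault Injection Simulator API; it only shows how deliberately injected failures reveal whether a resilience pattern such as retries is doing its job.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised by the injector to simulate a failing dependency."""


def with_fault_injection(call: Callable[[], str], failure_rate: float,
                         rng: random.Random) -> Callable[[], str]:
    """Wrap a call so that it fails with the given probability."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return call()
    return wrapped


def call_with_retries(call: Callable[[], str], attempts: int) -> str:
    """Try the call up to `attempts` times and re-raise the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for remaining in range(attempts - 1, -1, -1):
        try:
            return call()
        except InjectedFault:
            if remaining == 0:
                raise
    raise AssertionError("unreachable")


def run_experiment(failure_rate: float, attempts: int, trials: int, seed: int = 0) -> float:
    """Return the fraction of requests that succeed while faults are being injected."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_dependency = with_fault_injection(lambda: "ok", failure_rate, rng)
    successes = 0
    for _ in range(trials):
        try:
            call_with_retries(flaky_dependency, attempts)
            successes += 1
        except InjectedFault:
            pass
    return successes / trials


if __name__ == "__main__":
    # Hypothesis: retries raise the success rate when half of all calls fail.
    print("no retries:", run_experiment(failure_rate=0.5, attempts=1, trials=2000))
    print("3 attempts:", run_experiment(failure_rate=0.5, attempts=3, trials=2000))
```

A real experiment would target actual infrastructure under controlled conditions, but the structure is the same: state a hypothesis, inject a fault, and measure the outcome.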

3. Think big. Start small

Transitioning traditional IT infrastructure models to the cloud and then building in resilient processes is difficult, especially if you aim to do it all at once. However, it can be done. We have seen the most success when executive builders start with a manageable scope, iterate, then scale, as follows:

  • First, classify your technology assets according to business criticality. A technology asset may be a single application or a vital system like a customer relationship management solution that applications depend on. We see many customers use terms like “Tier 0,” “Red,” or “Mission Critical” to describe their critical assets.
  • Next, implement a resilience plan for a single critical asset or a small set of related assets.
    • You’ll need a cross-functional team to agree on the availability and performance requirements and help translate these requirements into a work backlog.
    • The team should analyze the technology asset for weaknesses using the principles of chaos engineering.
    • Business stakeholders should help capture business metrics like a reduction in unprocessed orders or an improvement in customer satisfaction.
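The first two steps above can be sketched as a small inventory exercise. This is a hypothetical illustration only: the `Asset` structure, the numeric tiers (0 as most critical), and the `first_scope` helper are assumptions, not a prescribed AWS method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Asset:
    """A technology asset with a business-criticality tier (0 = most critical)."""
    name: str
    tier: int
    depends_on: List[str] = field(default_factory=list)


def first_scope(assets: List[Asset], max_size: int = 3) -> List[str]:
    """Pick the most critical asset plus its most critical direct dependencies."""
    if not assets:
        return []
    by_name = {asset.name: asset for asset in assets}
    # The lead asset is the one with the lowest tier; ties break alphabetically.
    lead = min(assets, key=lambda asset: (asset.tier, asset.name))
    scope = [lead.name]
    related = sorted(
        (by_name[name] for name in lead.depends_on if name in by_name),
        key=lambda asset: (asset.tier, asset.name),
    )
    for asset in related:
        if len(scope) >= max_size:
            break
        scope.append(asset.name)
    return scope


if __name__ == "__main__":
    # Hypothetical inventory: an order API that depends on a CRM and a reporting system.
    inventory = [
        Asset("reporting", tier=2),
        Asset("order-api", tier=0, depends_on=["crm", "reporting"]),
        Asset("crm", tier=1),
    ]
    print(first_scope(inventory, max_size=2))
```

Capping the scope with `max_size` reflects the advice to start with a manageable set of related assets before scaling out.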

The team that implements resilience in your first asset (or set of related assets) will form the core of a new resilience center. This team will likely be eager to share their knowledge and best practices across the organization. We suggest giving them a platform, such as a quarterly resilience review, to celebrate their success and encourage other teams to follow their example.

Conclusion

Executive builders are responsible for assuring business leaders that their IT assets are resilient and for leading their teams to achieve resilience. In this blog post, we provided guidance to help these leaders align with how business stakeholders express resilience concerns, and to lead builders to approach resilience design differently in the cloud. In Part II, we’ll dive into more technical resilience considerations.

Figure 1. Three considerations for building resilience strategy and practices

Randy DeFauw

I’m an electrical engineer by training who’s been working in technology for 23 years at companies ranging from start-ups to large defense firms. A fascination with distributed consensus systems led me into the Big Data space, where I discovered a passion for analytics and machine learning. I started using AWS in my Hadoop days, where I saw how easy it was to set up large complex infrastructure, and then realized that the cloud solved some of the challenges I saw with Hadoop. I picked up an MBA so I could learn how business leaders think and talk, and found that the ‘soft skill’ classes were some of the most interesting ones I took. Lately I’ve been dabbling with reinforcement learning as a way to tackle optimization problems, and re-reading Martin Kleppmann’s book on data intensive design.

Amine Chigani

Amine Chigani is an Enterprise Technologist/Strategist at Amazon Web Services (AWS). In this role, Amine works with enterprise customers to share experiences and strategies on cloud adoption, agile organizations, and innovation through AI/ML. He brings the AWS cloud platform and programs to help his customers drive product quality, reduce technology risk, and deliver digital transformation value. Prior to AWS, Amine held senior technology leadership roles at Sentient Science and General Electric. Amine has a Ph.D. in Computer Science from Virginia Tech.

Nigel Harris

Nigel Harris is an Enterprise Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on AWS architectures.