AWS Cloud Enterprise Strategy Blog

A Culture of Resilience

There’s something problematic about planning for resilience.

On the one hand, we know it’s necessary. If the COVID pandemic taught us nothing else, it taught us that unexpected major disruptions can and will occur. And those disruptions will involve not just technology but, in some cases, business processes, personnel, communication, and even the fundamental market characteristics of supply and demand. Then there are the smaller (or seemingly smaller) disruptions: boats blocking canal shipping lanes, ransomware-related outages, and factory fires.

The next disruptive event may be a very different sort of pandemic. Or it might not be a pandemic at all. It might be the consequence of global warming, a war, a trade war, or a zombie apocalypse. Resilience isn’t about planning for one of these. It is about building the resilience to handle any unknown disruption—or at least an exceptionally broad range of possibilities. Given the pace at which disruptions, large and small, occur today, we know we must design resilience into everything we do. We know this, and we have the tools today to come closer to what is admittedly an impossible goal.

But how do you justify investing in resilience? I recently wrote a blog post about priorities, in which I tried to show that ranking investment priorities and choosing only the top subset of them for execution is problematic. There is always too long a list of functional priorities—the things a company needs to do to continue operating or to grow and improve in ways that increase the bottom line. It is a given in governance processes that there will always be more demand than supply for IT work. As the thinking goes, these must be prioritized based on expected return.

Now let’s be honest: there is no “return” from an investment in resilience (or agility). There is a possible return. I know. There’s supposed to be an expected return because you multiply the probability of the disruption by the business value the resilience maintains if that disruption occurs—the increment to the bottom line if a disaster occurs. But I’m talking about designing for the truly unexpected disruption, the one you can’t reasonably assign a probability to. I’m talking about building in resilience as a general principle or long-term strategy. So how do you know where it fits in the ranking of priorities?
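To make the conventional calculation concrete, here is a minimal sketch of the expected-return arithmetic described above. The figures, the function names, and the step of subtracting the investment cost are my own assumptions for illustration; they are not taken from any real business case.

```python
def expected_return(probability: float, value_preserved: float) -> float:
    """Probability of the disruption times the business value that
    resilience would preserve if the disruption occurs."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * value_preserved


def net_expected_return(probability: float, value_preserved: float, cost: float) -> float:
    """Assumption: net the expected return against the investment cost."""
    return expected_return(probability, value_preserved) - cost


# Hypothetical figures, chosen only to show the shape of the problem.
VALUE_PRESERVED = 10_000_000  # value kept if the disruption happens
COST = 500_000                # cost of the resilience investment

# The same investment under three different guesses at the probability.
for p in (0.01, 0.05, 0.10):
    net = net_expected_return(p, VALUE_PRESERVED, COST)
    print(f"probability={p:.2f}  net expected return={net:,.0f}")
```

With these made-up numbers, the same investment looks like a loss, a wash, or a win depending entirely on which probability you guess. For a disruption you cannot reasonably assign a probability to, the calculation gives you nothing to rank.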

I think there is actually a much deeper problem when planning for resilience. Today’s enterprises are data-driven, focused on quantifiable targets and measurement against those targets. Employees are incentivized or at least motivated to deliver on specific, concrete goals. Almost always those goals are aligned to profitability growth. What company sets targets for its employees around increased resilience? And we know the targets will drive employee behavior; that’s why we set targets in the first place.

Another problem is that adding resilience into legacy systems seems like an admission that things were done wrong in the past. Why weren’t the systems built to be resilient in the first place? Who screwed up? Why are we suddenly forced to spend more money on things we’ve already completed?

Resilience’s benefits unroll over the longish term, but companies are strongly incentivized to deliver short-term results. In the short term, the effect of investments in resilience will probably be negative as dollars go out, yet economic fundamentals don’t change. Investments in resilience are defensive.

It’s true that managers and employees may be taking personal risks by not ensuring resilience. But people generally don’t manage or assess risk well. And chances are that the risk is not really personal. When things go haywire, it will be blamed on unavoidable and unexpected circumstances, a rare combination of factors that couldn’t be predicted.

In short, the only thing standing in the way of resilience is everything we know about corporate governance and human motivation.

I’m exaggerating. There are ways to make investments in resilience. We just have to think a bit differently about the prioritization process.

I like to think of resilience as an aspect of quality. In the early days of IT, our standards and needs for resilience were lower, so it was acceptable to deliver IT capabilities that were not truly resilient. Today it isn’t. Just as it has become unacceptable to deliver IT capabilities with security vulnerabilities or insufficient scaling capabilities, delivering systems that can’t survive small or large disruptions is no longer acceptable. In this case the cost of resilience is simply built into the delivery of new IT capabilities; happily it will also be lower because building in resilience is less expensive than bolting it on later. Our quality bar has simply gone up. Resilience is no longer a project that must be prioritized alongside functional priorities.

Resilience then becomes a cultural matter. A culture of resilience entails everyone making it part of their job, considering anything nonresilient as poor quality, and collectively addressing potential failure conditions and working to mitigate them. It is not a goal as much as an uncompromisable standard for how things are done, which is reinforced by the work team and reviewed by the manager. Resilience is the subject of ongoing risk assessments, and its absence is considered a defect.

But that doesn’t address the question of legacy systems. How can we invest in raising their resilience? In some cases it is straightforward: many legacy systems still in use are actively updated. And any new updates that are deployed need to meet this new quality bar. Logically speaking, that requires some refactoring (improvements to the legacy code). It will have a cost and sometimes slow the delivery of new features, but today’s best practices include deploying new code with zero known defects (i.e., code that passes all its tests). Resilience can simply become one aspect of that quality threshold.

A second technique for increasing resilience is lowering its cost. The cloud is your friend, as is the automation that comes with good DevOps practices. Figure out the resilience architectures and design patterns that will work best for you (e.g., multiple availability zones, clustered microservices, data replication, and backups into low-cost storage). Many techniques you use to increase your agility and availability will double as resilience boosters.

While you’re at it, there may be low-hanging fruit in decommissioning systems that are no longer used or used infrequently. Doing so can reduce costs and maintenance efforts as well as increase resilience. Why not?

A third way to think about prioritizing resilience-related activities is in terms of risk. Today the organization bears risks everywhere that its resilience standards are unmet. Some of these are material risks to the business, whether they must be disclosed to shareholders or not. The audit committee of the board, risk officers, and the CFO should be informed of these risks and should consider actively investing to reduce them. Governance for risk reduction is (potentially) a separate process from the governance that ranks projects by expected return.

For this to work, the IT organization must be good at framing and explaining risks so that good decisions can be made. Not every risk needs to be addressed immediately, but risks accumulate. An organization with many sources of resilience risk is, in effect, poorly positioned in the market, and some risks will need to be addressed quickly. An IT leader should make their organization aware of the risks in a balanced way, showing their potential impacts and the scenarios in which they might be activated, along with a good plan for addressing them. In my experience, aside from some references to technical debt that can sound like IT whining about its inadequate budget, this conversation does not happen often enough.

The risk discussion is an opportunity to go beyond technical dangers. One risk that many established enterprises bear today is their dependence on one or two experts in an old technology or system and the impossibility of finding new employees who can take over for them. That seemingly small risk can have tremendous business implications in certain scenarios. Another overlooked risk category is associated with layers of business management becoming unavailable. Yes, there might be a continuity plan where others step into those management roles. But do the substitute managers have access to the data they will need and the ability to commit funds when necessary?

A talented CIO working in partnership with a talented CFO can add substantial long-term value for their organizations by identifying these risks to resilience, deciding which to address, and finding ways to commit resources to the effort.

Now for the good news—investments in increasing resilience will, as a side effect, often increase agility and nimbleness as well.