Resilience, Part One: Preparing for Unknown Unknowns
The important lesson to learn from COVID-19 is not that we need to be prepared for pandemics: it’s that we need to be prepared for unexpected, high-impact events, whatever they may be. Any particular one of these events is too unlikely to plan for, but taken collectively, there’s a reasonable chance that one of them will occur. Because these events are unexpected and individually improbable, and because their details are unknown, it’s hard to justify investments to prepare for them. And when budgets are tight—which they generally are in a competitive economy—preparations for unlikely events are often the investments that don’t happen.
Essentially, we have to find some way to plan for unknown unknowns—situations that not only probably won’t happen, but for which we inherently don’t even know what it is that might not happen! It seems impossible—but the good news is that it is not. The tools available to us today in the digital world allow us to radically reduce the risk of unknown unknowns and prepare us to respond effectively.
Interestingly, this need to cope with unexpected crises resembles our everyday challenges in the digital economy, where we already know we’ll need to respond to unanticipated competitor moves, disruptive newly funded startups, quickly changing consumer behavior, game-changing new technologies, and unpredictable usage surges on the internet. Organizations, and especially IT departments, have already been acquiring the agility to better cope with uncertainty. The question today is how we can apply that same thinking to high-impact crises like the present one.
Traditionally, we’ve thought of disaster recovery and business continuity as something very different from day-to-day agility and resilience. To prepare for disasters, each enterprise made two key decisions: how quickly they would need their IT systems back in the event of a disaster (recovery time objective, or RTO) and how many minutes of data they were prepared to lose or have to recreate (recovery point objective, or RPO). Based on those two objectives, they would select an IT architecture (the more expensive architectures promise shorter RTOs and RPOs) and build a business continuity plan around it. They’d then test the disaster response and business continuity plans occasionally (probably too occasionally).
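To make the two objectives concrete, here is a minimal sketch of how candidate architectures might be screened against an RTO and an RPO. The class names, the architectures, and every number are hypothetical, and the sketch assumes that worst-case data loss equals one full backup or replication interval.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    rto_minutes: float  # longest tolerable time to get IT systems back
    rpo_minutes: float  # largest tolerable window of lost data


@dataclass
class Architecture:
    name: str
    restore_minutes: float          # estimated time to restore service
    backup_interval_minutes: float  # time between backups or replication points


def meets_objectives(arch: Architecture, objectives: RecoveryObjectives) -> bool:
    # Assumption: a failure just before the next backup loses one full interval.
    worst_case_loss = arch.backup_interval_minutes
    return (arch.restore_minutes <= objectives.rto_minutes
            and worst_case_loss <= objectives.rpo_minutes)


# Hypothetical objectives and candidate architectures.
objectives = RecoveryObjectives(rto_minutes=60, rpo_minutes=15)
candidates = [
    Architecture("nightly backup, manual restore", 480, 1440),
    Architecture("warm standby, 5-minute replication", 30, 5),
]
acceptable = [a.name for a in candidates if meets_objectives(a, objectives)]
```

With these made-up figures only the warm standby qualifies, which illustrates the trade-off described above: shorter objectives push you toward the more expensive architecture.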
It’s still important to think about RPOs and RTOs. But in the digital age, your ability to respond to unknown unknowns can be significantly bolstered through your everyday practices. The key to dealing with unknown unknowns is to build resilience and agility (and security, by the way) into everything you do. I like to think of it as an aspect of quality—a deployed IT capability must work right, and must also be designed for resilience and ease of change. The same applies to business processes, which should be designed for continuity in the face of changed circumstances that arise unexpectedly. Both technical and business architectures should be agile and resilient as a matter of course. And tools like the cloud and practices like DevOps, infrastructure as code, and chaos engineering can help you get there.
An important principle we’ve learned from contemporary IT approaches like DevOps is that when we develop a new IT capability, it’s important to begin using it (and thereby testing it in real circumstances) immediately and frequently thereafter. The occasional testing of DR plans is a risk; it’s much better if we can exercise our resilience and agility constantly. That’s an important strategy for unknown unknowns as well: erasing the difference between crisis and normality. Our need for everyday, ordinary agility converges with our need to respond to extraordinary events.
Here are some strategies you can use to be prepared for unknown unknowns.
Mobilize the Workforce
The COVID-19 crisis has shown us that a mobile workforce is both necessary and possible with today’s tools. I put it first on my list because it is the basis for all agility and resilience in a crisis: if employees can’t work, they can’t respond to the new situation. Not all work can be done remotely, but as a general rule white-collar work and a great deal of service delivery can be. Amazon Web Services provides tools that can help here, such as Amazon Connect, our call center in the cloud, which can direct calls to remote workers and scale up and down as necessary; and Amazon WorkSpaces, which can be used to quickly and securely give employees access to applications they usually use from inside an office. As I noted above, though, mobilization is not something you do just in a disaster; it’s a way of working that naturally prepares you for the unexpected.
Manage Cash Agilely
In a crisis, spending might need to be reduced, or redeployed, or even increased if your business can take advantage of new opportunities. Inflexible costs become an economic liability as they reduce your flexibility. The cloud, of course, is one pillar of flexible spending: it gives you the elasticity to decrease your spending when you need to, redeploy your spending, or increase it in pace with demand. It also gives you easy and immediate access to other technical capabilities if you need them, including machine learning, IoT, and analytics. To prepare for unknown unknowns, you should go well beyond this: building flexibility into your contracts, your supply chain, and your project oversight. Again, agile cash management is useful not just in the emergency, but every day.
Design IT Systems for Resilience and Agility
These are no more than today’s best practices. Infrastructure as code lets you redeploy a damaged architecture quickly, repeatably, and securely through automation. Clustered microservices make it possible to scale up and down, replace damaged nodes, and deploy new capabilities safely and quickly. With the cloud, your architectures can span multiple availability zones (AZs), where each AZ is designed to fail independently of the others: the AZs are based on different power supplies and internet connections and are geographically separated. Asynchronous communication through queueing services can increase resilience, as of course can backups and data replication. Again, these are not steps to take specifically for a disaster, but everyday practices that put you in a much better position to handle the unexpected.
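The point about asynchronous communication can be illustrated with a small sketch. It uses Python’s in-process queue.Queue purely as a stand-in for a managed queueing service, and the function names and order IDs are hypothetical. The idea is that the front end keeps accepting work while the back end is unavailable, and the back end works through the backlog once it recovers.

```python
import queue

# Stand-in for a managed queueing service (an assumption for illustration).
order_queue = queue.Queue()


def accept_order(order_id):
    """Front end: record the order and return without waiting for the back end."""
    order_queue.put(order_id)


def drain(process):
    """Back end: process everything that accumulated, in arrival order."""
    processed = []
    while True:
        try:
            order_id = order_queue.get_nowait()
        except queue.Empty:
            return processed
        process(order_id)
        processed.append(order_id)


# The back end is down, yet orders are still accepted.
for order_id in (101, 102, 103):
    accept_order(order_id)
backlog = order_queue.qsize()

# The back end recovers and catches up; no order was lost.
handled = drain(lambda order_id: None)
```

A real queueing service would add durability and delivery guarantees that an in-memory queue does not have, but the decoupling between producer and consumer is the same.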
The best way to test your resilience and agility is to use them every day. If your technical architecture scales up and down automatically, then you should be able to watch it do so every day in response to changes in usage. If your application balances its load across multiple availability zones, you can watch it do so every day. You can see that your workforce is able to accomplish their work remotely, that self-healing parts of your architecture self-heal. You test your agility by freely making and deploying the changes that are a normal part of your business and ensuring that they are fast and successful.
Exercise (Test) Your Resilience in Focused Exercises
There is still a place for periodic, focused resilience exercises. At AWS we use game days as a way to try to break systems and simulate problems. The Department of Homeland Security (DHS) leads the federal government every year in Eagle Horizon, an exercise to test the government’s resilience to different disaster scenarios. Because IT systems have become so complex and interdependent, a discipline called chaos engineering has arisen: a systematic process for injecting faults into complex IT systems to watch and learn how they behave. Periodic exercises can also be used to test non-technical aspects of business continuity: leaders can practice their crisis leadership skills, for example.
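As a rough illustration of fault injection, the sketch below wraps a dependency so that a chosen fraction of calls fail, then measures how often a caller with simple retries still sees an error. Everything here (the class, the retry policy, the failure rates) is hypothetical and far simpler than a real chaos engineering tool, which would inject faults into running infrastructure rather than into a function call.

```python
import random


class FlakyDependency:
    """Wraps a function and makes a chosen fraction of calls fail."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so the experiment is repeatable

    def __call__(self, arg):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(arg)


def call_with_retry(func, arg, attempts=3):
    """Call func(arg), retrying on ConnectionError up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return func(arg)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure


def run_experiment(failure_rate, requests=1000):
    """Return the fraction of requests that fail even with retries."""
    dependency = FlakyDependency(lambda x: x * 2, failure_rate)
    failures = 0
    for i in range(requests):
        try:
            call_with_retry(dependency, i)
        except ConnectionError:
            failures += 1
    return failures / requests


error_rate_with_retries = run_experiment(failure_rate=0.3)
```

Running the experiment at several failure rates shows how much resilience the retries actually buy, which is the kind of learning these exercises are meant to produce.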
Make Data Agilely Available
Making decisions in a crisis often requires access to data. But who knows which data will be important in an unknown unknown situation? As leaders face tough decisions, they discover their data needs; it is the quick access to the data that makes decisive, appropriate action possible. Data silos work against you in a crisis. In keeping with our theme here, they also make your day-to-day operations more difficult.
Drive Planning Through Scenarios
The techniques I’ve described help build resilience and agility into everyday activities. They can’t account for all aspects of disaster response, such as business-specific needs and the good judgment of leaders. A powerful way to extend your everyday resilience is through scenario planning, a strategic planning technique developed by Shell Oil to imagine and prepare for high-impact geopolitical scenarios. You can’t directly plan for the infinite number of unknown unknowns, but what you can do is imagine a small set of plausible scenarios and mentally test your resilience against them. Scenarios are concretely fleshed-out futures along with plausible stories about how they might develop. Scenarios broaden our consideration of possibilities beyond those we can forecast. Today a company might want to consider scenarios based on pandemics, climate change, and trade wars.
Agility and Resilience as Values
In many ways, agility and resilience are the same, or at least closely related. To be resilient in the face of an unknown unknown crisis, an organization needs agility. Some of that agility comes from architecting everyday processes and technical systems for resilience, and some of it comes from rigorously testing and exercising those capabilities against scenarios and scientifically generated chaos. The excellent news is that we have so many tools and techniques available to use in preparation for a response to crisis—even if the crisis is an unknown unknown.
As always, the advice to make data readily available applies only to data whose availability is consistent with privacy and security limitations.
There is a lot more to say about scenarios. You might want to consult The Art of the Long View by Peter Schwartz for a sense of how scenarios can be used for strategic planning.