AWS Storage Blog
Resilience by design: Building an effective ransomware recovery strategy
Ransomware events have become a boardroom priority for modern organizations. The data shows a clear trend: ransomware events have more than doubled since the pandemic began, with the financial services sector experiencing particularly high targeting rates. At AWS, our cross-field collaboration with global financial services customers, regulators, governing bodies, and industry partners has resulted in an accredited architecture for a Cloud Hosted Data Vault (CHDV), or vault for short.
A vault is a key consideration in enhancing operational resilience in the face of a large-scale cyber event. The vault serves as an isolated, last line of defense for your organization’s most mission-critical assets. It allows your business to rebuild if traditional high availability (HA), business continuity (BC), disaster recovery (DR), and backup mechanisms fail to provide a recovery route. Incorporating a vault into your existing operational resilience practices requires careful planning across multiple fronts. In this blog post, we look at those planning considerations, and how you can start adding more layers of protection to your already established HA, BC/DR, and backup solutions. We look at some of the key considerations for not just deploying technology, but also considering the people, processes, and practical elements that make a vault work as a solution.
1. What needs to be vaulted?
Every line of business owner will tell you it’s everything, everywhere. It’s not everything, or at least not immediately. When thinking about a cyber event at scale, it is tempting to label all data, applications, and services as vital and necessary to be vaulted. You may assume that it wouldn’t be in the production environment if it wasn’t needed. However, attempting to vault everything without some level of initial scrutiny can result in huge volumes of data, cost, and unintended recovery delays.
In a true cyber event, you need to ask yourself which core IT functions and operational services you need to restart the business in the next 12, 24, or 48 hours, and just as crucially, which ones you don’t. Often referred to as the Minimum Viable Business (MVB), these core functions and services are those offered not only to your customers, but also to second, third, and fourth parties.
Figure 1: What needs to be vaulted? – Focus on core IT functions and operational services.
This question isn’t answered overnight. A vault is more than a technology solution—it involves input from many areas. Determining your MVB and how best to protect it depends on the input, experience, and knowledge from multiple teams: IT, security, legal, lines of business owners, application owners and senior executives.
2. What does recovery look like?
Cyber events are designed to make it difficult, if not impossible, to bring your systems or services online through the standard operational recovery mechanisms (HA, BC, DR, and backup). Maximum disruption with minimal chance of recovery is what the bad actors need to assure payment. Recovery from a cyber event at scale cannot rely solely on your normal operational recovery methods, so there are some key considerations that you need to think through ahead of time:
- Decision Time Objective (DTO): How long does it take to invoke recovery from the vault? Highly automated recovery processes, even those in secure, trusted environments, may have been breached, which leaves you uncertain whether they can be trusted, or they may have been rendered completely unusable by destructive actions. It is crucial to align people and key decision points on when to invoke full or partial recovery from the vault.
- Cyber-Recovery Time Objective (C-RTO) and Cyber-Recovery Point Objective (C-RPO): Shift recovery time expectations to consider the increased impact and recovery time of a cyber event. What used to take seconds could now take days, and the business needs to know how to respond to this.
- Minimum Acceptable Service Offering (MASO): While the MVB focuses on your business, a MASO asks what is an acceptable level of service to offer your customers and third parties? It may be limited functionality of the primary service, or a backup system accessed through alternate means.
Figure 2: What does recovery look like? – What normally happens in seconds could take days, plan ahead.
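To make these objectives concrete, the following is a minimal sketch of how they might be recorded per service and checked against a business tolerance. The class name, the service name, and all durations are hypothetical, and the sketch assumes that decision time and recovery time run back to back, which may not hold in your organization.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CyberRecoveryTargets:
    """Hypothetical recovery targets for one service; all values are illustrative."""

    service: str
    dto: timedelta    # Decision Time Objective: time to reach the decision to invoke the vault
    c_rto: timedelta  # Cyber-Recovery Time Objective: time to restore once invoked
    c_rpo: timedelta  # Cyber-Recovery Point Objective: maximum acceptable age of vaulted data

    def expected_outage(self) -> timedelta:
        # Assumption: the decision and the recovery happen one after the other.
        return self.dto + self.c_rto

    def fits_within(self, tolerance: timedelta) -> bool:
        # True when the expected outage does not exceed what the business can tolerate.
        return self.expected_outage() <= tolerance


payments = CyberRecoveryTargets(
    service="payments",
    dto=timedelta(hours=6),
    c_rto=timedelta(hours=36),
    c_rpo=timedelta(hours=24),
)

print(payments.expected_outage())                 # 1 day, 18:00:00
print(payments.fits_within(timedelta(hours=48)))  # True
```

Writing the DTO down next to the C-RTO makes it visible that time spent deciding counts against the same outage budget as time spent restoring.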
3. How will vaults be partitioned?
Whether it’s everything in one vault or a vault per service, the key is striking a balance between manageability, practicality, and security.
- Simplicity: Keep the vaults as clear as possible to understand, navigate, and restore from. Complexity adds delay.
- Independence: Each vault stands alone to minimize waiting times and increase parallel operations.
- Service mapping: Where is the recovery starting point and what is the optimum recovery order?
- Roles and responsibilities: Owners and administrators are aligned to service, application, and infrastructure.
- Maintenance: Vaults aren’t static, because applications are updated and business use cases change.
Figure 3: How will vaults be partitioned? – Balancing manageability, content, and the recovery process.
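The service mapping question above (where recovery starts and in what order it proceeds) can be explored with a dependency graph. The following is a minimal sketch using Python's standard `graphlib` module; the service names and their dependencies are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical service map: each service lists what must be restored before it.
DEPENDS_ON = {
    "identity": set(),
    "network": set(),
    "database": {"network", "identity"},
    "payments-api": {"database"},
    "customer-portal": {"payments-api", "identity"},
}


def recovery_waves(depends_on):
    """Group services into waves; services in the same wave can be restored in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the map contains a circular dependency
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose prerequisites are restored
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(recovery_waves(DEPENDS_ON), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Services that land in the same wave have no dependency on each other, which is one way to reason about which vaults can be restored independently and in parallel. A circular dependency surfaces as an error at planning time rather than during a recovery.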
The elasticity that AWS offers is invaluable in enabling you to test, fail, and iterate on what works best without leaving any technical debt in the process. What works well on a whiteboard may fail at the first tabletop exercise, and the lessons learned can be rolled into your next vault iteration. Having the flexibility to adapt to the business helps drive confidence in the solution and recovery process. What goes into the vault isn’t just a question of “the data”—it necessitates input from application owners and lines of business.
4. How do you manage the vault?
Your vaults need to be kept out of the operational plane, so that any cyber event that moves laterally through your production environment is disrupted at the vault entrance. Without this degree of separation, the vault may also be compromised during a cyber event. Traditional air-gapped solutions include a physical separation by unplugging or removing media—creating a literal air gap between the protection mechanism and the operational environment. How can we translate this isolation into the cloud and the vault? Replicating these security measures in the cloud requires:
- Ingress zones: Ephemeral areas, services, and functions that are only available during times of access.
- Multi-factor authentication: Physical tokens accessible only by authorized personnel.
- Zero trust: Authentication at every step.
- AWS Identity and Access Management (IAM): Restricted roles and responsibilities.
- Change management: Process approved access to tokens and accounts.
Figure 4: How do you manage the vault? – Intentional isolation while maintaining visibility.
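As one illustration of combining multi-factor authentication with restricted IAM permissions, the following sketch builds an IAM policy document that denies destructive Amazon S3 actions on a vault bucket unless the request was authenticated with MFA. The bucket name, the function name, and the choice of actions are assumptions made for illustration. A real vault needs considerably more than this single statement, and any policy should be reviewed and tested before use.

```python
import json


def vault_mfa_guard_policy(bucket_name):
    """Build a policy document that denies destructive S3 actions when MFA is absent."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDestructiveActionsWithoutMFA",
                "Effect": "Deny",
                # Illustrative selection of actions; extend to match your own threat model.
                "Action": [
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:PutBucketPolicy",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # BoolIfExists also denies requests where the MFA key is missing entirely.
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }


# "example-vault-bucket" is a placeholder name, not a real resource.
print(json.dumps(vault_mfa_guard_policy("example-vault-bucket"), indent=2))
```

Because an explicit deny takes precedence over any allow, a statement like this can act as a guardrail even if a broader permission is granted elsewhere by mistake.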
5. How do you plan for logistics and service providers?
Physical solutions are often overlooked when considering how you respond to a cyber event. Focusing on your core services, applications, and the underpinning infrastructure means that peripheral physical dependencies aren’t a focal point when it comes to thinking about how to restore them. For example:
- Internal and external network access: It doesn’t take a disaster to make the news; a loss of connectivity that causes service disruption is enough.
- Software repositories: Automated software distribution may not be available, so consider keeping physical media, such as USB sticks in a drawer, that can be retrieved without any digital dependencies.
- Supply chain: When physical equipment is damaged beyond function, what are the plans and processes to get hardware where it needs to be, and at volume?
- Physical logistics: If events at scale need a response at scale, then how are people, equipment, and floor space going to be found?
Figure 5: Logistics and service providers – broaden planning to include external dependencies.
A sophisticated cyber event may aim to force multiple failures that, taken individually, are expected in normal business operations and have minimal impact. When these failures happen at scale, they compound, adding complexity and delay.
6. Who owns the vaulting process?
People will own the cyber vaulting process and administration, provide the best practices, and be on the front line if a recovery needs to take place. Selecting an ownership team is critical and not always straightforward. The vault tenets are driven by the fact that it covers many aspects of protection, planning, and process, and those need input from multiple sources, not just a single team.
Business stakeholders, not backup teams, should drive recovery prioritization based on their understanding of dependencies, compliance requirements, and business impact. Clear communication between technical and business teams is essential for effective cyber recovery planning.
Figure 6: People and processes – balancing these two elements drives good practice.
The result of this cross-functional collaboration is a set of recovery plans that your business must adopt and adhere to. However, this can unwittingly create poor cyber event recovery scenarios, because stifling approaches can lead to bad practice. Overly burdensome processes, a lack of built-in agility, and the tendency for people to take the route of least resistance can lead to the best of plans not being implemented. Although this doesn’t affect immediate security and restore procedures, the wrong time to find out that vault tenets have not been applied is when you need them.
To achieve effective cyber event recovery, your organization must seamlessly integrate best practices into daily operations. This integration necessitates striking a delicate balance between robust recovery requirements and practical, sustainable processes that teams can consistently follow without compromising security or efficiency.
7. Why do you need sponsorship?
There’s no doubt that bringing any extra work into an organization adds cost and effort, which diverts time and resources from other business development activities. There isn’t a Chief Financial Officer (CFO) that isn’t going to question why a technology solution is costing time, money, and effort and not generating any revenue. The vault isn’t going to be an active part of your core business operations. Therefore, it is crucial that its intrinsic value and use case are understood. This can only be achieved from “the top, down” with executive level leadership.
Figure 7: Sponsorship – a top-down approach guides the overall business.
Getting started with AWS
AWS offers multiple layers of defense against ransomware events. For immediate data protection across multiple AWS data services, AWS Backup provides centralized backup management with immutable backups and logical air-gapping that not only prevent unauthorized modification but also offer rapid recovery options. Amazon S3 with versioning and Object Lock and Amazon FSx for NetApp ONTAP create tamper-proof storage for critical data.
For detection, Amazon GuardDuty monitors for suspicious activity, while AWS Security Hub provides a comprehensive security posture view. Amazon Macie identifies sensitive data that might be targeted, while AWS Shield and AWS WAF protect against DDoS events and web exploits that could serve as entry points. AWS Network Firewall filters malicious traffic at the network level. Partners such as Elastio integrate with AWS Backup to allow organizations to verify data integrity in near real-time, thus enabling clean recovery from any event with minimal downtime and data loss.
For identity protection, IAM implements least-privilege access, while AWS Organizations enables centralized security policy management across accounts. AWS Config and AWS CloudTrail provide visibility into configuration changes and API activity, which is essential for forensic analysis after incidents.
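As a small illustration of the Amazon S3 Object Lock capability mentioned above, the following sketch builds the parameters for a default retention rule on a bucket. The function name, bucket name, and retention period are hypothetical. The sketch deliberately stops short of calling AWS, so it runs without credentials or extra libraries.

```python
def object_lock_request(bucket_name, retention_days, mode="COMPLIANCE"):
    """Build parameters for a default S3 Object Lock retention rule (illustrative only)."""
    if mode not in {"GOVERNANCE", "COMPLIANCE"}:
        raise ValueError(f"unsupported Object Lock mode: {mode}")
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "Bucket": bucket_name,
        "ObjectLockConfiguration": {
            "ObjectLockEnabled": "Enabled",
            "Rule": {
                "DefaultRetention": {"Mode": mode, "Days": retention_days},
            },
        },
    }


# "example-vault-bucket" and the 30-day period are placeholders, not recommendations.
params = object_lock_request("example-vault-bucket", retention_days=30)
print(params["ObjectLockConfiguration"]["Rule"])
```

With the boto3 library installed and credentials configured, parameters of this shape could be passed to the S3 client's `put_object_lock_configuration` call, for example `boto3.client("s3").put_object_lock_configuration(**params)`. Object Lock requires versioning on the bucket. The retention mode matters as well: in compliance mode a locked object version cannot be overwritten or deleted by any user during the retention period, so test with short retention periods first.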
Conclusion
Cyber events, namely ransomware, are increasing yearly. The shift from protecting against accidental incidents to protecting against targeted disruption means that you and your business need to pivot quickly to reduce the risk window in which a severe but plausible event could not only cripple services, but also permanently close the business. A technology solution in response to increasing cyber events is part of the equation, but it is not everything. Planning how you respond to a targeted disruption is part of removing the ingrained assumptions that we all grow accustomed to in daily operations. A cyber event is not a daily operation, and as such it needs the appropriate preparation and planning so that critical decisions aren’t being made on the fly. Effective cyber resilience necessitates a comprehensive, organization-wide approach. Begin by defining what successful recovery looks like for your specific business operations, then develop and prepare for worst-case scenarios through detailed planning. Regularly iterate on these plans, conducting thorough testing and validation of recovery processes and procedures to make sure that they work when needed.
Most critically, cyber resilience extends far beyond your IT department: it’s fundamentally a business challenge that needs enterprise-wide engagement. Every team, from operations and finance to leadership and frontline staff, plays a vital role in both building resilience and executing recovery when incidents occur. Success depends on making cyber resilience a shared responsibility across the entire organization, from “the CEO to the sys admin.”