How energy and utility companies can recover from ransomware and other disasters using infrastructure as code on AWS

According to the US Cybersecurity and Infrastructure Security Agency, “There are 16 critical infrastructure sectors whose assets, systems, and networks, whether physical or virtual, are considered so vital to the United States that their incapacitation or destruction would have a debilitating effect on security, national economic security, national public health or safety, or any combination thereof.” The energy and utility sector is one of these 16 critical sectors in the United States and in most nations. Operational technology (OT) and cybersystems play an important role in upholding the reliability of critical infrastructure, be it oil and gas pipelines, city water supplies, or electricity generation, transmission, and distribution environments. These systems present unique security and availability challenges, including data center equipment failures, natural disasters, malicious software, attacks from nation-states, and “insider” attacks.

In the past year, we have witnessed several weather events that have disrupted critical infrastructure operations, as well as malware and ransomware attacks that have done the same. In October 2021, a White House fact sheet stated, “The global economic losses from ransomware are significant. Ransomware payments reached over $400 million globally in 2020, and topped $81 million in the first quarter of 2021.” Although preventing such attacks is of extreme importance, preparing to recover is equally critical. This blog post focuses on the use of infrastructure as code (IaC) on the Amazon Web Services (AWS) Cloud to recover from a disaster, describes benefits like resilience, availability, and flexible networking, and provides the steps that you can take to prepare for recovery.

A traditional OT setup for disaster recovery

Typically, utilities and many industries have their OT systems deployed in data centers or localized control facilities that are owned and operated on premises. Industrial control systems (ICSs) in these facilities tend to use bare metal or virtual servers. Generally, a secondary or disaster recovery data center exists to offer continuity of operations in the event that a disaster—such as a fire, flood, or earthquake—makes the primary data center inoperable. There are different configurations in which these dual data centers operate, often referred to by descriptions like “hot-hot,” “hot-cold,” or “pilot-light.”

Hot-hot is the most common strategy behind maintaining two data centers. In this strategy, utilities replicate machines and data from the primary data center to the secondary data center. As a result, servers in both locations are identical, and one can fail over from primary to secondary in a matter of seconds. Such a setup can work well for natural disasters: when one data center is damaged or incapacitated, all systems fail over to the other data center. However, it is not uncommon to hear of large weather events that have incapacitated both the primary and secondary data centers of an organization.

In addition to large weather events, cyberthreats are of growing concern. What if a data center is infected with malware or ransomware? In such cases, the cross–data center replication service can spread or replicate the malware from the primary to the secondary data center, rendering both inoperable. Once a utility is impacted, its response plan needs to switch from detection and prevention to recovery.

Benefits of using AWS for recovery

The geographic infrastructure of AWS supports resilience

Cloud technology helps customers choose resiliency options to meet the needs of their critical systems. The AWS global cloud infrastructure offers (as of this writing) 25 AWS Regions and 81 Availability Zones globally, of which there are 4 commercial US Regions and 2 Regions for AWS GovCloud (US), which gives government customers and their partners the flexibility to architect secure cloud solutions. Additionally, in the United States, AWS offers 17 AWS Wavelength Zones for AWS Wavelength, a service that embeds AWS compute and storage services within 5G networks, providing mobile edge computing infrastructure for developing, deploying, and scaling ultralow-latency applications. In the United States, AWS also offers 14 AWS Local Zones, a type of infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers. AWS Local Zones offer AWS infrastructure and services with single-digit millisecond latency for high-speed computing. Using AWS infrastructure can greatly benefit utilities and organizations that want to take advantage of a global footprint.

AWS also works with several telecommunications providers to offer private dedicated network connectivity and bandwidth to its cloud Regions. On AWS, utilities can build highly resilient critical systems that take advantage of the geographic spread of AWS and the high speed, reliable networking support, elasticity, automation, and on-demand nature of AWS services.

AWS automation and DevSecOps support recovery

DevOps is a combination of cultural philosophies, practices, and tools that combine software development with IT operations. These combined practices help companies deliver new application features and improved services to customers at a higher velocity. DevSecOps takes this a step further, integrating security into DevOps. With DevSecOps, you can deliver secure and compliant application changes rapidly while running operations consistently with automation.

The ability to reliably and repeatedly back up, build, and deploy your systems on newer, updated, and patched versions of operating systems and other software gives you added resilience and protection from malware attacks such as ransomware. You can easily decommission and isolate the compromised system and replace it with another with data restored from your backups.

There are many recovery options to choose from. One of the most common options uses previous images—or backups—of a machine to build new ones, but this is less reliable because the image itself might be compromised. Rebuilding your systems from scratch using automation is the most reliable recovery option.

One key difference between cloud and on-premises setups is that in the cloud on AWS, you can use automation (through scripts, templates, and code) to build your entire infrastructure. Once the automation is tested, you can use it to repeat the build process reliably across multiple geographic regions and across different AWS accounts, thereby giving your business a high level of flexibility for recovery. This ability to use code to build and maintain your infrastructure is referred to as “infrastructure as code (IaC).” AWS offers services that are geared toward automating the deployment and configuration of your infrastructure and systems in the cloud, such as the following:

AWS CloudFormation: a service that lets you model, provision, and manage AWS and third-party resources
AWS Cloud Development Kit (AWS CDK): an open-source software development framework
AWS Software Development Kits (AWS SDKs): tools that let you access and manage AWS services with your preferred development language or platform
Various DevSecOps services
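
To make the idea of repeating a tested build across Regions concrete, here is a minimal sketch in Python. It assumes a deliberately tiny AWS CloudFormation template (a single VPC) expressed as a dictionary. The resource name `RecoveryVpc`, the stack name prefix, and the Region list are hypothetical placeholders, not part of any real deployment.

```python
import json


def build_network_template(cidr_block: str) -> dict:
    """Return a minimal CloudFormation template that declares one VPC."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical recovery network, defined entirely in code",
        "Resources": {
            "RecoveryVpc": {  # hypothetical logical ID
                "Type": "AWS::EC2::VPC",
                "Properties": {
                    "CidrBlock": cidr_block,
                    "EnableDnsSupport": True,
                    "EnableDnsHostnames": True,
                },
            }
        },
        "Outputs": {"VpcId": {"Value": {"Ref": "RecoveryVpc"}}},
    }


def plan_deployments(template: dict, regions: list[str]) -> list[dict]:
    """Pair the same serialized template with each target Region."""
    body = json.dumps(template, indent=2, sort_keys=True)
    return [
        {
            "Region": region,
            "StackName": f"ot-recovery-network-{region}",  # hypothetical naming
            "TemplateBody": body,
        }
        for region in regions
    ]


if __name__ == "__main__":
    template = build_network_template("10.20.0.0/16")
    for plan in plan_deployments(template, ["us-east-1", "us-west-2"]):
        print(plan["Region"], plan["StackName"])
```

Because every Region receives the identical template body, a build that has been tested once can be repeated elsewhere without manual steps. Each planned entry could then be handed to a CloudFormation client created for that Region and account.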

AWS networking supports recovery

In the power utility industry, OT is not concentrated in one data center. In fact, it can be highly distributed between power generation locations, substations, and other locations. Utilities often rely on privately owned network cables that connect their remote locations to their data centers, an approach that offers them more control over their networking. However, this approach also limits a utility’s agility and ability to change in the event of a cyberattack. The common perception that “if the network is isolated, then it cannot be attacked” might hold true for a distributed denial of service attack but not for malware introduced intentionally or unintentionally by an internal resource or for an attack that uses social engineering to gain access to the utility’s network. You will be unable to operate if your data centers are compromised and if your networking is hardwired only to serve these data centers because you will not have the ability to reroute traffic to another destination. This is why network agility is critical.

AWS offers various services that provide secure and flexible options for networking:

AWS Direct Connect lets you create a dedicated network connection to AWS. A utility can use it to establish a high-bandwidth fiber connection from its assets to AWS.
Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS) web service. A utility can use it to manage its DNS and reroute traffic in seconds.
AWS Global Accelerator is a networking service that improves the performance of your users’ traffic by up to 60 percent. A utility can use it to reroute traffic from one AWS location to another.

With these options, a utility is not limited to a single or dual internet protocol (IP) destination, which when compromised would cause a complete failure.
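
As an illustration of DNS-based rerouting, the sketch below builds the kind of change batch that Amazon Route 53 accepts for updating a record. The record name and recovery endpoint are hypothetical, and the short TTL is an assumption: how quickly clients follow the change depends on the TTL that was already in effect before the incident.

```python
def build_reroute_change(record_name: str, new_target: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that points a CNAME at a new endpoint.

    UPSERT creates the record if it is missing and overwrites it otherwise.
    """
    return {
        "Comment": f"Reroute {record_name} to recovery endpoint",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_target}],
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical names used only for illustration.
    batch = build_reroute_change(
        "scada-ingest.example.com.",
        "recovery-ingest.us-west-2.example.net.",
    )
    print(batch["Changes"][0]["ResourceRecordSet"]["ResourceRecords"])
```

With boto3, a dictionary like this is what `change_resource_record_sets` takes as its `ChangeBatch` argument, alongside the `HostedZoneId` of the zone that owns the record.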

Recovery on AWS infrastructure

To understand how to recover on AWS, let’s start by recognizing that utilities can have several power generation plants that vary in technology (for example, coal, oil, natural gas, nuclear, wind, and solar plants) and that electricity travels across long distances over transmission lines that are controlled by hundreds of substations and then is distributed to end users over a distribution network that can have hundreds to thousands of substations. Communications and networking are a critical aspect of a utility’s OT or ICS.

The Purdue Enterprise Reference Architecture (PERA) model, dating back to the 1990s, is one way to understand the operational landscape, and we will use it to demonstrate how using AWS infrastructure can help. The PERA model breaks OT or ICSs into five levels.

Figure 1: PERA model (graphic courtesy of Dragos)

Level 0

This is the physical process or action that the machine(s) or device(s) performs at a fixed location. The other four levels of PERA control and protect this level.

Level 1

Level 1 is composed of intelligent electronic devices (IEDs) that control the devices at level 0. IEDs can be sensors that read information from level 0 and make rapid decisions on how level 0 should operate. Level 1 also includes applications that receive commands from level 2 systems to direct how level 0 should operate. Level 1 systems are often embedded in device hardware. These systems reside close to level 0 devices to provide near-zero latency in invoking commands or reacting to events.

Level 1 systems are commonly built by the original equipment manufacturers (OEMs) who produce the level 0 devices. On rare occasions, an electric utility will build a level 1 system. OEMs can use several AWS technologies to build and support level 1 systems that are installed on premises to monitor their health and prevent unauthorized changes to them. All of the following AWS offerings can be used to gather information on the state of a device and prevent changes by permitting communications only over an encrypted channel and by authenticating using X.509 certificates:

FreeRTOS: an open-source near-real-time operating system for microcontrollers
AWS IoT Greengrass: an open-source edge runtime and cloud service for building, deploying, and managing device software
AWS IoT services like AWS IoT Device Defender: a fully managed service that helps you secure your fleet of Internet of Things (IoT) devices
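
The two controls named above, an encrypted channel and X.509 authentication, can be sketched with Python's standard `ssl` module. This is a generic TLS client configuration, not the API of any AWS IoT SDK. The certificate and key file paths are hypothetical and would normally be provisioned on the device.

```python
import ssl
from typing import Optional


def build_device_tls_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Create a client-side TLS context for a device.

    The server certificate is always verified. If a device certificate and
    private key are supplied, they are presented for mutual authentication.
    """
    # With ca_file=None, the system's default trust store is used.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file and key_file:
        # X.509 client certificate that identifies this device.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


if __name__ == "__main__":
    ctx = build_device_tls_context()
    print(ctx.verify_mode, ctx.check_hostname, ctx.minimum_version)
```

On a real device, the context would be passed to whatever MQTT or HTTPS client the device software uses, with `cert_file` and `key_file` pointing at the device's own credentials.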

Utilities can also install sensors to monitor traffic between levels 1 and 0 and between levels 2 and 1 for anomaly detection. Amazon SageMaker—which lets you build, train, and deploy machine learning models for any use case—makes it easy to create models for anomaly detection.
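
To show what anomaly detection on inter-level traffic means in its simplest form, the sketch below flags readings that deviate sharply from a rolling baseline. It is a plain statistical stand-in (a rolling z-score), not an Amazon SageMaker model. The metric (messages per interval), window size, and threshold are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flag values that lie far from the mean of recent normal values."""

    def __init__(self, window: int = 20, threshold: float = 3.0, warmup: int = 5):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.warmup = warmup  # minimum samples before any judgment is made

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to the baseline."""
        is_anomaly = False
        if len(self.values) >= self.warmup:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Only normal readings update the baseline, so a burst of
            # malicious traffic does not become the new "normal".
            self.values.append(value)
        return is_anomaly


if __name__ == "__main__":
    detector = RollingZScoreDetector()
    # Hypothetical message counts per interval between level 2 and level 1.
    for count in [100, 101, 99, 100, 102, 98, 100, 101, 500, 100]:
        print(count, detector.observe(count))
```

A trained model can capture seasonality and multi-variable patterns that a single threshold cannot. The sketch only illustrates the input (a traffic metric over time) and the output (a flag that can trigger an alert).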

Level 2

This is where the control systems exist, including the human-machine interfaces (HMIs) and the supervisory control and data acquisition (SCADA) systems that make real-time control decisions using algorithms or as a result of human instruction. These systems typically run on popular operating systems like Windows and Linux and tend to reside at the same facility as the machines that they control or at a nearby data center. However, with the advent of faster networking technology (for example, fiber, LTE, and 5G technologies) there is a movement to data centers and also the cloud.

For level 2 systems, a utility is responsible for protecting the SCADA and HMI servers. It needs resilient, reliable, and flexible networking so that it can redirect traffic to protect and recover systems in the event that all systems or data centers are compromised. The following AWS technologies and approaches can be used to recover from compromised level 2 systems and networking:

For onsite systems, a utility—using AWS CloudFormation and automation—can deploy using the AWS Outposts Family (a family of fully managed solutions delivering AWS infrastructure and services to virtually any on-premises or edge location) or AWS Snowball (an edge computing, data migration, and edge storage device). This way, even if the servers and their associated account are compromised, a new device can be built in minutes to hours.
For systems in the cloud, recovery can be even faster because automation through AWS CloudFormation or AWS Auto Scaling (which automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost) can be used to immediately rebuild and recover a compromised server. If the entire data center—in the cloud or on premises—is compromised, recovery can occur in a new account and/or Region.
For network recovery, a utility must be able to route traffic from power generation facilities or substations to another location. It can rely on LTE or 5G providers, use AWS Direct Connect to route traffic to the cloud, and use Amazon Route 53 (if it uses DNS) or AWS Global Accelerator to route traffic to different endpoints within AWS.

Levels 3 and 4

These levels host operations (level 3) and enterprise planning (level 4). Protecting these systems is as critical as protecting the systems at other levels because these systems make decisions and plan for all operations. The loss of these systems will prevent a utility from having visibility on current operations and from making decisions.

Should level 3 or level 4 systems be compromised, automation in the cloud can help with rapid recovery. By using AWS CloudFormation or other IaC methodologies, a utility can rebuild all systems at levels 3 and 4 from scratch, which avoids restoring from an image that might already be compromised. Rebuilding systems using automation can take minutes to hours depending on the complexity of the system(s), which is far less than the weeks to months it would take to rebuild an on-premises environment.

A resilient architecture: How to prepare for recovery

With an understanding of the different levels and operational needs (as defined by the PERA model) and the way that AWS services can help recover systems within these levels, you can build a resilient architecture to protect yourself from the impact of a natural or cyber disaster. Here’s how:

Understand your vendor licenses. Talk to your vendors to make sure that your licensing agreement includes provisions for licensing mobility in the event of a disaster. Many licensing strategies are based on media access control (MAC) addresses, other hardware features, or dongles. Such restrictive licensing methods reduce the agility of your systems and prevent you from recovering in the event of a disaster. Your licensing should be transferable to the cloud, and it should be flexible so that you can test disaster recovery and recover in a new location.
Promote agility in your networking. If you have to relocate your data center or rebuild it in the cloud or in a different cloud Region, you need to make sure that all your remote locations (like substations) can redirect traffic to the new location. You can use AWS Private 5G for creating private cellular networks and other services like AWS Wavelength, Amazon Route 53, and AWS Global Accelerator to improve network agility.
Follow AWS security and governance best practices, including the following:
- Set up these AWS services:
  - AWS Control Tower: a service that provides an easy way to set up and govern a secure multi-account AWS environment
  - AWS Organizations: a service that helps you centrally manage and govern your environment as you grow and scale your AWS resources
  - AWS Single Sign-On (AWS SSO): a service that lets you centrally manage access to multiple accounts or applications
- Implement guardrails and service control policies.
- Implement the AWS Foundational Security Best Practices in AWS Security Hub (a cloud security posture management service).
- Use multiple accounts—at least one each for your levels 2, 3, and 4 systems.
- Contact your account team to learn more.
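
As one example of a guardrail, the sketch below builds a service control policy (SCP) document of the kind that AWS Organizations can attach to an account or organizational unit. The two restrictions shown (blocking accounts from leaving the organization and blocking changes that disable AWS CloudTrail logging) are illustrative assumptions, not a complete or recommended baseline.

```python
import json


def build_guardrail_scp() -> dict:
    """Return an SCP document with two illustrative deny statements."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLeavingOrganization",
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            },
            {
                "Sid": "DenyDisablingAuditTrail",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_guardrail_scp(), indent=2))
```

Because an SCP limits what any principal in the affected accounts can do, a deny statement like these still applies if an attacker obtains administrator credentials inside one of those accounts.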
Implement automation. Make sure that the systems in all environments can be rebuilt using automation without using old images of your systems. A rebuild should use the latest version of the host operating systems, and it should configure, harden, and install your software using scripts. Build your entire AWS infrastructure using AWS CloudFormation. Make sure you have arrangements with OEMs and software vendors to access the latest patched version of your operational software in the event of a disaster.
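
A rebuild that avoids old images can be sketched as a template that resolves the newest operating system image at deployment time and then hardens the server with a script. The example assumes Amazon Linux 2 and its public AWS Systems Manager parameter purely for illustration. A real control server may require a different operating system, and the hardening commands and the logical ID `ControlServer` are hypothetical placeholders.

```python
import json

# Hypothetical hardening steps; replace with your own security baseline.
HARDENING_SCRIPT = """#!/bin/bash
set -euo pipefail
yum update -y
# Install and configure operational software from a trusted source here.
"""


def build_server_template(instance_type: str = "t3.medium") -> dict:
    """Return a template whose AMI is looked up at deploy time, not pinned."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "LatestAmiId": {
                # CloudFormation resolves this parameter to the current AMI ID.
                "Type": "AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>",
                "Default": "/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2",
            }
        },
        "Resources": {
            "ControlServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "LatestAmiId"},
                    "InstanceType": instance_type,
                    # The script runs on first boot to configure and harden.
                    "UserData": {"Fn::Base64": HARDENING_SCRIPT},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_server_template(), indent=2))
```

The design choice is that nothing in the template refers to a previously captured machine image, so a rebuild starts from a currently published operating system image each time.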
Set up an account to host backups with no human access.
- Routinely (such as every minute or hour, based on your recovery time objective and recovery point objective) back up all the information required to rebuild and redeploy all systems across all levels. This includes the configuration settings of each device and all operational data. Pay special attention to the backup of passwords and certificates. Use the following AWS services:
  - Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Use it to store your backups.
  - AWS Secrets Manager helps you protect the secrets that are needed to access your applications, services, and IT resources.
  - AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data management and secrets management. Use it to store passwords, configurations, connection strings, and more across multiple Regions.
  - AWS Key Management Service (AWS KMS) makes it easy for you to create and manage cryptographic keys and control their use across a wide range of AWS services and in your applications. Use it to encrypt your backups.
- Make sure that the backups are being done to a Region different from where your production systems are. Remember, you choose and control the location for your data.
- Use S3 Object Lock, which lets you store objects using a write once, read many (WORM) model. With the WORM model, you can read your backups but cannot overwrite or delete them while the lock is in effect. In compliance mode, no user, including the root user of the account, can overwrite or delete a protected object version until its retention period expires. This makes your backups immutable for that period.
- Use AWS Lambda—a serverless, event-driven compute service—to run a script that validates backups every few minutes.
- Use Amazon Simple Notification Service (Amazon SNS), a fully managed messaging service, to notify key personnel through SMS or email the moment that data cannot be validated.
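
The backup steps above can be sketched as three small functions: one that prepares an encrypted, locked upload request, one that validates a backup against its recorded checksum, and one that prepares an alert. The bucket, key, AWS KMS alias, topic ARN, and retention period are hypothetical. The dictionaries mirror the parameters of the boto3 `put_object` and `publish` calls, but no AWS call is made here.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_locked_put_request(
    bucket: str,
    key: str,
    body: bytes,
    kms_key_id: str,
    retain_days: int,
    now: Optional[datetime] = None,
) -> dict:
    """Prepare S3 put_object parameters for an encrypted, WORM-locked backup."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
        # Record the checksum so later validation has something to compare to.
        "Metadata": {"sha256": hashlib.sha256(body).hexdigest()},
    }


def validate_backup(body: bytes, expected_sha256: str) -> bool:
    """Return True if the backup content still matches its recorded checksum."""
    return hashlib.sha256(body).hexdigest() == expected_sha256


def build_alert(topic_arn: str, key: str) -> dict:
    """Prepare SNS publish parameters for a failed validation."""
    return {
        "TopicArn": topic_arn,
        "Subject": "Backup validation failed",
        "Message": f"Checksum mismatch for backup object {key}",
    }


if __name__ == "__main__":
    request = build_locked_put_request(
        "example-backup-bucket", "scada/config.json", b"{}", "alias/backup", 30
    )
    print(request["ObjectLockMode"], request["ObjectLockRetainUntilDate"])
```

In an AWS Lambda function run on a schedule, the validation step would read each backup, call `validate_backup`, and publish the alert when it returns `False`. Object Lock must already be enabled on the bucket for the lock parameters to be accepted.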
Test disaster recovery at least once a week. On AWS, this is a cost-effective process. Using AWS CloudFormation, you can delete everything that you built after you are finished with it, so your disaster recovery environment exists only for a few hours—or maybe even minutes. This minimizes the cost while facilitating practical experience. By testing, you’ll know that you will experience reliable recovery from a disaster, and you can learn how much time recovery takes.
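
A recurring drill of this kind can be sketched as a function that builds the environment, measures how long the build takes, and then deletes everything. The function expects a CloudFormation client object such as the one returned by `boto3.client("cloudformation")`. The stack name and the idea of a validation hook are assumptions for illustration.

```python
import time


def run_recovery_drill(cfn_client, stack_name: str, template_body: str) -> float:
    """Build a stack, record the build time in seconds, then tear it down."""
    started = time.monotonic()
    try:
        cfn_client.create_stack(StackName=stack_name, TemplateBody=template_body)
        cfn_client.get_waiter("stack_create_complete").wait(StackName=stack_name)
        build_seconds = time.monotonic() - started
        # A real drill would run validation checks against the stack here.
    finally:
        # Always clean up so the drill environment does not keep running.
        cfn_client.delete_stack(StackName=stack_name)
        cfn_client.get_waiter("stack_delete_complete").wait(StackName=stack_name)
    return build_seconds
```

Tracking the returned build time from drill to drill gives a measured figure to compare against your recovery time objective. The `finally` block means that the environment is removed even when the build or a check fails.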
Protect your IaC code base. Keep the AWS CloudFormation templates and the scripts that you use to deploy code and configure your servers under source control (through AWS CodeCommit, a secure, highly scalable, managed source control service), and place them in a separate account (in a different Region) dedicated to the IaC code base.
- Any change in your code should be explained, authorized, and tracked in a requirement traceability matrix.
- In Amazon S3, keep a release version (current copy) of all scripts and code, protected with S3 Object Lock, so that your recovery scripts are immutable and safe from an attacker.
Set up endpoints for the IaC recovery account and Region. Configure endpoints to receive SCADA data and communicate with other necessary systems. Alternatively, make sure your IPs can be moved to this account and Region.
Implement flexible network routing. Use Amazon Route 53 to make DNS changes for rerouting traffic or AWS Global Accelerator to move an IP.

Figure 2: IaC recovery overview

Conclusion

Rapid recovery from a natural disaster or a ransomware attack is vital for any business. For the energy and utilities industry, it is an issue of protecting the critical infrastructure of a nation. AWS offers many options to help you build and run secure and resilient systems. Customers should consider IaC as a key way to build resilience into their operations and to support rapid recovery in the event of a natural or cyber disaster.

For more information about how AWS is empowering utilities to improve reliability and safeguard critical infrastructure, visit compliance and security for utilities.