AWS Database Blog
How to choose the best disaster recovery option for your Amazon Neptune database
In today’s digital landscape, data has become a pivotal asset for organizations, driving critical decision-making and powering mission-critical applications. With this increasing reliance on data, organizations also need robust disaster recovery (DR) strategies as customer expectations shift towards always-available, always-on applications. Industry segments like financial services and healthcare additionally face stringent regulatory guidelines that require a DR strategy as part of their compliance audits.
Amazon Neptune database provides multiple disaster recovery options to maintain the continuity of your business operations. Whether it’s a natural disaster, a human-caused incident, or a system failure, the ability to quickly restore your data and resume normal operations helps reduce your Recovery Time Objective (RTO). By implementing a well-designed DR strategy, you can safeguard your valuable data, minimize downtime, and maintain the confidence of your customers and stakeholders.
In this post, we explore the key considerations and best practices for implementing effective DR strategies for your Amazon Neptune database deployments.
Understanding RTO and RPO
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are two key metrics to consider when developing a DR plan. RTO represents how much time it takes you to return to a working state after a disaster. RPO, also expressed as a duration, represents how much data you could lose when a disaster happens. For example, an RPO of 1 hour means that you could lose up to 1 hour’s worth of data when a disaster occurs. Different features of Amazon Neptune database support different RTOs and RPOs at different cost points.
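To make the RPO arithmetic concrete, here is a minimal sketch (the helper names are ours, not part of any AWS API) showing why a snapshot-based strategy can only meet an RPO target if snapshots are taken at least as often as the target allows:

```python
from datetime import timedelta

def worst_case_rpo(snapshot_interval: timedelta) -> timedelta:
    """Worst-case data loss when your only recovery point is the last snapshot:
    a disaster just before the next snapshot loses almost one full interval."""
    return snapshot_interval

def meets_rpo_target(snapshot_interval: timedelta, rpo_target: timedelta) -> bool:
    """A snapshot schedule satisfies an RPO target only if the interval
    between snapshots does not exceed the target."""
    return worst_case_rpo(snapshot_interval) <= rpo_target

# Hourly snapshots meet a 1-hour RPO target, but not a 15-minute one.
print(meets_rpo_target(timedelta(hours=1), timedelta(hours=1)))     # True
print(meets_rpo_target(timedelta(hours=1), timedelta(minutes=15)))  # False
```

The same reasoning applies to the other strategies in this post: continuous replication shrinks the effective "interval" to seconds, which is why it supports much tighter RPOs than scheduled snapshots.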
Disaster recovery strategies for Amazon Neptune database
Amazon Neptune database offers a range of DR strategies to suit your needs, from simple backup solutions to multi-Region architectures. We can broadly categorize these strategies into four approaches, ranging from the low cost and low complexity of making backups to more complex strategies using multiple active AWS Regions. In the following sections, we walk you through these features and show how you can use them to set up a DR strategy for your Amazon Neptune databases.
Backup and restore using snapshots
Backup and restore is a suitable approach for mitigating data loss or corruption, and for workloads whose RPO and RTO are measured in hours. It fits lower-criticality Amazon Neptune databases that can tolerate a higher RPO and RTO, typically in the range of minutes to hours. You can also use this approach to mitigate a regional disaster by replicating data to other Regions, or to mitigate the lack of redundancy for workloads deployed to a single Availability Zone. With manual snapshots you can design an RPO as low as minutes, whereas relying on automatic daily snapshots alone limits you to an RPO of up to 24 hours.
This approach involves provisioning the Amazon Neptune database and instances for restoring after a disaster event, and provides a low-cost backup option to protect your data. You can use automatic snapshots and manual snapshots to create backups of the Amazon Neptune database. Automatic snapshots follow a default naming convention that includes a timestamp, and they are created on a predefined schedule. Automatic snapshots are deleted when you delete their cluster, and you can’t delete them manually. A manual snapshot, by contrast, is deleted only when you explicitly delete it using the Amazon Neptune console or the AWS CLI; it is not deleted when you delete its cluster. Automatic snapshots can’t be shared across accounts or Regions. For cross-Region use, you can copy manual snapshots across Regions and AWS accounts to protect data from a Region-wide disruption.
Amazon Neptune database backups are continuous and incremental, so you can quickly restore to any point within the backup retention period. No performance impact or interruption of database service occurs as backup data is being written. Because database updates are incrementally recorded, you can restore your cluster to any point in time within the backup retention period. Manual snapshots, on the other hand, do not get a timestamp in their names automatically, so you have to manage naming yourself based on how frequently you capture snapshots. The first manual snapshot stores a full backup of your cluster’s data; subsequent manual snapshots are incremental. Another key consideration is that you are limited to a maximum of 100 manual snapshots per AWS Region. For more information, see Backing up and restoring an Amazon Neptune DB cluster.
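The workflow above can be sketched with boto3: embed a timestamp in the manual snapshot name (since Neptune doesn’t add one for you), then copy the snapshot into a DR Region. The cluster name, account ID, and Regions below are hypothetical placeholders, and the copy assumes an unencrypted snapshot (encrypted snapshots also need a `KmsKeyId`):

```python
from datetime import datetime, timezone

def manual_snapshot_id(cluster_id: str, now: datetime) -> str:
    """Manual snapshots get no automatic timestamp, so embed one in the name."""
    return f"{cluster_id}-manual-{now:%Y-%m-%d-%H-%M}"

def snapshot_and_copy(cluster_id: str, dr_region: str,
                      source_region: str = "us-east-1") -> str:
    import boto3  # requires credentials with Neptune snapshot permissions
    snap_id = manual_snapshot_id(cluster_id, datetime.now(timezone.utc))
    neptune = boto3.client("neptune", region_name=source_region)
    neptune.create_db_cluster_snapshot(
        DBClusterIdentifier=cluster_id,
        DBClusterSnapshotIdentifier=snap_id,
    )
    # Once the snapshot is available, copy it into the DR Region.
    # Neptune snapshot ARNs use the rds namespace; 123456789012 is a placeholder.
    dr = boto3.client("neptune", region_name=dr_region)
    dr.copy_db_cluster_snapshot(
        SourceDBClusterSnapshotIdentifier=(
            f"arn:aws:rds:{source_region}:123456789012:cluster-snapshot:{snap_id}"
        ),
        TargetDBClusterSnapshotIdentifier=f"{snap_id}-dr",
    )
    return snap_id
```

How often you run this function is what sets your RPO, and the 100-manual-snapshots-per-Region limit means frequent schedules also need a pruning step.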
Integrating with AWS Backup
You can integrate Amazon Neptune DB snapshots with AWS Backup. AWS Backup is a fully managed backup service that you can use to centralize and automate the backup of data across AWS services in the cloud and on premises. With AWS Backup, you can centrally configure backup policies and monitor backup activity for your AWS resources. For more information, see Centralizing data protection and compliance for Amazon Neptune with AWS Backup.
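As a sketch of what such centralization looks like, the following uses the AWS Backup API to define a daily rule and assign a Neptune cluster to it by ARN. The plan, vault, role, and schedule are hypothetical examples, and your rule’s `ScheduleExpression` is what determines the backup-based RPO:

```python
def neptune_backup_rule(vault: str, schedule_cron: str, retention_days: int) -> dict:
    """Build one AWS Backup rule; the schedule you pick drives your RPO."""
    return {
        "RuleName": "neptune-daily",
        "TargetBackupVaultName": vault,
        "ScheduleExpression": schedule_cron,
        "Lifecycle": {"DeleteAfterDays": retention_days},
    }

def protect_neptune_cluster(cluster_arn: str, iam_role_arn: str) -> None:
    import boto3  # requires AWS Backup permissions
    backup = boto3.client("backup")
    plan = backup.create_backup_plan(BackupPlan={
        "BackupPlanName": "neptune-dr-plan",
        "Rules": [neptune_backup_rule("neptune-vault", "cron(0 3 * * ? *)", 35)],
    })
    # Assign the Neptune cluster to the plan by resource ARN.
    backup.create_backup_selection(
        BackupPlanId=plan["BackupPlanId"],
        BackupSelection={
            "SelectionName": "neptune-clusters",
            "IamRoleArn": iam_role_arn,
            "Resources": [cluster_arn],
        },
    )
```

The advantage over hand-rolled snapshot scripts is that retention, vault encryption, and compliance reporting are then managed in one place across services.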
Using Amazon Neptune database read replicas
Amazon Neptune database supports read replicas for a database cluster. Hosting one or more read-replica instances in different Availability Zones increases availability, because read replicas serve as failover targets for the primary instance. If the primary instance fails, Neptune promotes a read-replica instance to become the new primary. When this happens, there is a brief interruption while the promoted instance is rebooted, during which read and write requests made to the primary instance fail with an exception. In contrast, if your DB cluster doesn’t include any read-replica instances, your DB cluster remains unavailable when the primary instance fails until it has been re-created. Re-creating the primary instance takes considerably longer than promoting a read replica.
To ensure high availability, we recommend that you create one or more read-replica instances that have the same DB instance class as the primary instance and are located in different Availability Zones than the primary instance. For more details, see Fault tolerance for a Neptune DB cluster.
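A minimal sketch of this recommendation with boto3 follows: add a replica in another Availability Zone with the same instance class as the primary, and trigger a failover (for example, during DR testing). The identifiers, instance class, and AZ are hypothetical; Neptune chooses the failover target starting from the lowest promotion tier:

```python
def promotion_order(replicas: list[dict]) -> list[str]:
    """Replicas with the lowest promotion tier are promoted first
    (each dict here is a simplified {'id': ..., 'tier': ...} record)."""
    return [r["id"] for r in sorted(replicas, key=lambda r: r["tier"])]

def add_replica_and_failover(cluster_id: str) -> None:
    import boto3  # requires Neptune instance-management permissions
    neptune = boto3.client("neptune")
    # Same instance class as the primary, in a different Availability Zone.
    neptune.create_db_instance(
        DBInstanceIdentifier=f"{cluster_id}-replica-1",
        DBClusterIdentifier=cluster_id,
        DBInstanceClass="db.r6g.xlarge",
        Engine="neptune",
        AvailabilityZone="us-east-1b",
        PromotionTier=1,
    )
    # Force a failover to validate the RTO of this setup.
    neptune.failover_db_cluster(DBClusterIdentifier=cluster_id)
```

Running a forced failover periodically is a cheap way to verify that applications reconnect to the cluster endpoint correctly after promotion.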
Using Amazon Neptune Global Database
An Amazon Neptune global database spans multiple AWS Regions, enabling low-latency global reads and providing fast recovery in the rare case where an event leaves the workload running in a degraded state in one Region. An Amazon Neptune global database consists of a primary DB cluster in one Region and up to five secondary DB clusters in different Regions. Writes can only occur in the primary Region; secondary Regions only support reads. Each secondary Region can have up to 16 reader instances. The secondary DB clusters allow you to move the primary cluster to a new Region more quickly, with lower RTO and RPO than traditional replication solutions. The RPO depends on the replication lag (the time it takes for the primary DB cluster to replicate data to the secondary DB clusters over dedicated AWS global infrastructure), which is typically under a second.
For an Amazon Neptune global database, to recover from an unplanned outage or to run disaster recovery testing, you can perform a cross-Region detach-and-promote on one of the secondary DB clusters in the global database. The RTO for this manual process depends on how quickly you can perform the steps. When automated, the failover to the secondary database can happen in a few minutes. The RPO is typically a number of seconds, but depends on the storage replication lag across the network at the time of the failure. You can use the Amazon CloudWatch metric for Amazon Neptune database, GlobalDbProgressLag, to monitor the replication lag.
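The detach-and-promote path and the lag check can be sketched together as follows. The global cluster identifier, secondary cluster ARN, and Region are hypothetical, and a production runbook would also verify snapshot state and DNS cutover:

```python
from datetime import datetime, timedelta, timezone

def max_lag_ms(datapoints: list[dict]) -> float:
    """Largest observed replication lag (milliseconds) in a CloudWatch result."""
    return max((p["Maximum"] for p in datapoints), default=0.0)

def detach_and_promote(global_cluster_id: str, secondary_cluster_arn: str,
                       secondary_region: str) -> None:
    import boto3  # requires CloudWatch read and Neptune permissions
    cw = boto3.client("cloudwatch", region_name=secondary_region)
    now = datetime.now(timezone.utc)
    lag = cw.get_metric_statistics(
        Namespace="AWS/Neptune",
        MetricName="GlobalDbProgressLag",
        Dimensions=[{"Name": "DBClusterIdentifier",
                     "Value": secondary_cluster_arn.rsplit(":", 1)[-1]}],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    print("max replication lag (ms):", max_lag_ms(lag["Datapoints"]))
    # Detaching a secondary promotes it to a standalone, writable cluster.
    neptune = boto3.client("neptune", region_name=secondary_region)
    neptune.remove_from_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        DbClusterIdentifier=secondary_cluster_arn,
    )
```

The lag reading just before failover is effectively your realized RPO, so logging it during DR drills gives you evidence for compliance reviews.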
Amazon Neptune global databases allow you to adopt a warm standby approach, which makes sure there is a scaled-down but fully functional copy of your production environment in another Region. This approach extends the pilot light concept and decreases the time to recovery because your workload is always on in another Region. The distinction is that the pilot light method can’t process requests without additional action taken first, whereas warm standby can handle traffic (at reduced capacity levels) immediately. The pilot light approach requires you to turn on servers, possibly deploy additional (non-core) infrastructure, and scale up, whereas warm standby only requires you to scale up (everything is already deployed and running). For more information, see Using Amazon Neptune Global Databases for disaster recovery.
Using Amazon Neptune Streams
Amazon Neptune Streams provides a way to replicate data from one Amazon Neptune cluster to another. If a failure occurs, you fail your database over to the backup cluster. This reduces RPO and RTO to minutes, because data is constantly being copied to the backup cluster, which is immediately available as a failover target at any time.
However, the trade-off with this approach is that the operational overhead of maintaining the replication components, and the cost of keeping a second Amazon Neptune DB cluster online all of the time, can be significant. In addition, during recovery the failover must be invoked manually by pointing the application to the secondary cluster’s endpoint. Also keep in mind that although you can downsize your application compute in the standby Region, downsizing the secondary database can increase the replication backlog in the stream and let the two clusters drift out of sync, so the writers of the primary and secondary clusters should preferably be the same size. For more information, see Using Amazon Neptune streams cross-Region replication for disaster recovery.
Amazon Neptune Streams allows you to adopt a pilot light approach, in which you replicate your data from one Region to another and provision a copy of your core workload infrastructure. Resources required to support data replication and backup, such as databases and object storage, are always on. Other elements, such as application servers, are loaded with application code and configurations, but are “switched off” and are only used during testing or when disaster recovery failover is invoked. This is applicable for Amazon Neptune databases of medium criticality and priority, with RPO and RTO requirements in tens of minutes.
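The consumer side of stream-based replication can be sketched as polling the source cluster’s change-log stream over HTTPS and tracking a checkpoint to resume from; applying each batch of records to the target cluster is elided here, and the endpoint is a hypothetical placeholder:

```python
import json
from urllib.request import urlopen

def next_checkpoint(stream_response: dict) -> tuple[int, int]:
    """Resume position after a batch: the (commitNum, opNum) of the last event."""
    last = stream_response["lastEventId"]
    return last["commitNum"], last["opNum"]

def poll_stream(cluster_endpoint: str, commit_num: int, op_num: int,
                limit: int = 100) -> dict:
    """Fetch the next batch of property-graph change records after a checkpoint."""
    url = (f"https://{cluster_endpoint}:8182/propertygraph/stream"
           f"?iteratorType=AFTER_SEQUENCE_NUMBER"
           f"&commitNum={commit_num}&opNum={op_num}&limit={limit}")
    with urlopen(url) as resp:  # IAM-auth clusters also need SigV4 signing
        return json.load(resp)
```

Persisting the checkpoint durably (for example, in DynamoDB) is what makes the replication resumable after the consumer itself fails; the growth of the unconsumed backlog is the quantity to watch when sizing the secondary cluster’s writer.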
Using Amazon Neptune Export
Some DR requirements might call for keeping CSV files of nodes and edges exported using the Amazon Neptune export feature. The export can be stored as a backup in an Amazon S3 bucket and used to restore a new Amazon Neptune database cluster during a disaster event. The restoration time depends on the size of your graph database: the more nodes and edges you have, the longer the restoration may take. While this approach can be an effective way to safeguard your critical data, it does come with some operational overhead. You’ll need to ensure the export processes run smoothly and on schedule, and there is also a cost consideration, because the frequent IOPS required for the exports can add up.
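As a sketch, an export request for the Neptune-Export service (which you deploy separately, for example from its CloudFormation template) might be built like this; the cluster endpoint and bucket are hypothetical, and the exact parameters accepted depend on the Neptune-Export version you deploy:

```python
import json

def export_request(cluster_endpoint: str, s3_path: str) -> dict:
    """Request body asking Neptune-Export to dump the property graph
    (node and edge CSV files) to the given S3 path."""
    return {
        "command": "export-pg",
        "outputS3Path": s3_path,
        "params": {"endpoint": cluster_endpoint},
    }

# The resulting JSON is what you would POST to your Neptune-Export API endpoint.
payload = json.dumps(export_request(
    "my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com",
    "s3://my-dr-bucket/neptune-exports",
))
```

Restoring from such an export means bulk-loading the CSV files into a fresh cluster, which is why the RTO of this approach scales with graph size.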
Testing your DR plan
A DR plan is helpful only if it’s periodically tested and validated. Testing your DR plan helps you identify potential issues or gaps so you can take corrective action. A full DR plan includes not only your database resources, but all of the applicable IAM access policies, encryption keys, and your application infrastructure. Although a full DR plan test can take a significant amount of time and resources, it helps provide confidence that the plan will work when needed.
Summary
In this post, we shared some best practices for implementing a DR strategy using Amazon Neptune database. This post provided a framework that you can implement on Amazon Neptune database for disaster recovery using automated backups, manual backups, Amazon Neptune Streams, and Amazon Neptune Global Databases.
| Feature | RTO | RPO | Cost | Scope | Complexity |
| --- | --- | --- | --- | --- | --- |
| Automated backups | Hours | Hours | Low | Single Region | Low |
| Manual snapshots | Hours | Hours** | Medium | Single Region, cross-Region | Low |
| Read replicas | Minutes | Minutes | Medium | Single Region | Low |
| Amazon Neptune Global Databases | Minutes | Seconds | High | Cross-Region | Medium |
| Amazon Neptune Streams | Minutes | Minutes | High | Single Region, cross-Region | High |
**For manual snapshots, the RPO is listed as hours assuming the snapshots are scheduled every hour; it can be minutes or days depending on the frequency of your manual snapshots.
Have follow-up questions or feedback? Let us know by creating a technical support ticket, posting your question in the AWS forums, or leaving a comment. We’d love to hear your thoughts and suggestions.
About the Authors
Yogish Kutkunje Pai is a Senior Solutions Architect at AWS with over 20 years of experience in the technology industry. In his current role, Yogish helps large global enterprises build solutions on AWS. With expertise in application development, databases, and generative AI applications, Yogish brings a wealth of technical knowledge to every customer engagement. Outside of work, Yogish enjoys experimenting with new technologies and cycling.
Ankush Agarwal works as a Solutions Architect at AWS, where he focuses on developing resilient workloads on AWS. He has experience in creating, implementing, and enhancing data analytics workloads using AWS services. Ankush spends his free time exploring urban forests and watching science fiction films.