AWS Partner Network (APN) Blog

Adopting Multi-Region Architecture to Support ADP’s High Availability Requirements

By Anil Dhotre, Director, Technology Architect – ADP
By Rafael Rost, Principal Application Developer – ADP
By Zhaohan Yan, Application Developer – ADP
By Anjan Dave, Principal Solutions Architect – AWS


As a comprehensive global provider of cloud-based human capital management (HCM) solutions, ADP’s products unite HR, payroll, talent, time, tax, and benefits administration.

Serving more than one million clients in 140 countries, ADP pays over 41 million workers worldwide, including one in six workers in the United States. With an unparalleled view of the workforce, ADP’s unique lens into the employee experience powers its innovative solutions, which are designed to meet the workforce’s changing needs.

ADP Voice of Employee empowers employers to collect feedback throughout the employee lifecycle and make more informed decisions to help their people feel valued and connected to their work. The offering provides research-reviewed and scientifically-validated templates so employers can confidently ask questions that generate trustworthy data. The questionnaires are quick to configure and deploy directly from the ADP HR solution.

Like other ADP offerings, ADP Voice of Employee prioritizes security, availability, and resilience. ADP built this offering on the Amazon Web Services (AWS) cloud to support these needs. Because the application has a high availability requirement, the workload must remain resilient to the impairment of any single AWS region or Availability Zone (AZ).

In this post, we share how ADP met these requirements by adopting a multi-region architecture in an active-active configuration. ADP is an AWS Partner and HCM technology provider.

Solution Overview and Requirements

The ADP Voice of Employee application design must adhere to ADP’s recovery time objective (RTO) and recovery point objective (RPO) business objectives. Although balancing costs and availability is important, the application must be secure, highly available, resilient, scalable, and performant.

For the application to meet the high availability and RPO/RTO requirements, ADP adopted a multi-region architecture in an active-active configuration. In this mode, two or more deployments serve requests at the same time, and incoming network traffic is distributed among them.

To meet or exceed the availability goals and RPO/RTO objectives, the API serverless stack is deployed in both a designated primary and a secondary AWS region.

Using Amazon Route 53 health checks and a weighted routing policy, the API traffic is split across the primary and secondary regions. Most of the traffic is routed to the primary region, and a small amount is routed to the secondary region.
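As a minimal sketch of the weighted split described above, the following builds a Route 53 change batch with one weighted record per region. The record name, target hostnames, health check IDs, and the 90/10 weights are all hypothetical; the post does not state the actual values ADP uses.

```python
def weighted_record(name, set_id, target, weight, health_check_id, ttl=60):
    """Build one UPSERT change for a weighted CNAME record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": set_id,  # distinguishes records that share a name
            "Weight": weight,         # relative weight, 0 to 255
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


def expected_share(changes):
    """Fraction of DNS answers each record receives while all are healthy."""
    weights = {
        c["ResourceRecordSet"]["SetIdentifier"]: c["ResourceRecordSet"]["Weight"]
        for c in changes
    }
    total = sum(weights.values())
    return {set_id: w / total for set_id, w in weights.items()}


# Hypothetical names, weights, and health check IDs (assumptions).
change_batch = {
    "Comment": "Weighted active-active routing (illustrative)",
    "Changes": [
        weighted_record("api.example.com", "primary",
                        "api-primary.example.com", 90, "hc-primary-id"),
        weighted_record("api.example.com", "secondary",
                        "api-secondary.example.com", 10, "hc-secondary-id"),
    ],
}

print(expected_share(change_batch["Changes"]))

# With boto3 installed and credentials configured, the batch could be applied with:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="<hosted-zone-id>", ChangeBatch=change_batch)
```

Keeping a non-zero weight on the secondary record means that region continuously serves real requests, so a problem there is likely to surface before a failover depends on it.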

The secondary region’s AWS Lambda process connects to the primary region’s writer, since the writer database instance runs in the primary region. With the round trip to the primary region adding network latency, the application needs to be fine-tuned to meet response times.

Recommendations include running performance tests to compare network latency between the primary and secondary regions, and reducing database calls by optimizing queries. Serving live traffic from both regions ensures that each is functional with the latest application software version, minimizes code drift, and keeps both sides ready for fast failover. The multi-region active-active architecture is illustrated below.


Figure 1 – Overall multi-region active-active application architecture.

ADP configured Amazon Route 53 to route API requests to region-specific, private Amazon API Gateway endpoints integrated with AWS Lambda. For security, the Lambda functions are configured with least-privilege access to AWS resources and attached to at least two subnets within a virtual private cloud (VPC).

Integrated with the Amazon Aurora global database, the primary region’s database cluster is configured in multi-AZ and with read replicas. Aurora stores multiple copies of the data across different AZs, providing resilience at the region level.

Leveraging the Aurora global database feature, the secondary region database cluster is configured headless (no compute) with only cross-region data replication enabled.

In most cases, the data replicated from the primary to the secondary AWS region experiences replication lag in milliseconds, meeting the quick recovery objective in the secondary region. This configuration delivers a cost-effective, multi-region data architecture.

Application health is monitored with a custom health check API, which monitors the health of the integrated services such as AWS Lambda, Amazon API Gateway, and Amazon Aurora. This custom health check is called upon by Amazon CloudWatch Synthetics to generate the metrics.

These metrics are monitored by Amazon CloudWatch alarms which are integrated with the Route 53 health checks. Such a health check configuration exists in both the primary and the secondary regions.
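A custom health check endpoint of this kind could look roughly like the sketch below: a Lambda-style handler that runs one probe per dependency and returns HTTP 200 only when every probe succeeds. The probe names and the `checks` parameter are hypothetical; a real implementation would plug in actual database and service probes.

```python
import json


def run_checks(checks):
    """Run each named probe; a probe signals failure by raising."""
    results = {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the dependency unhealthy
            results[name] = f"failed: {exc}"
    return results


def health_handler(event, context, checks=None):
    """Lambda proxy-style handler: 200 when all probes pass, 503 otherwise."""
    results = run_checks(checks or {})
    healthy = all(status == "ok" for status in results.values())
    return {
        "statusCode": 200 if healthy else 503,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"healthy": healthy, "checks": results}),
    }


def database_down():
    # Hypothetical probe standing in for a real database connectivity test.
    raise ConnectionError("writer endpoint unreachable")


response = health_handler({}, None, checks={"api": lambda: None,
                                             "database": database_down})
print(response["statusCode"], response["body"])
```

Returning a non-200 status for any failed dependency gives the canary a single, unambiguous signal to turn into a metric.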

If the application stack reports anything other than a successful response in either region, the Route 53 health check stops routing client requests to the associated Amazon API Gateway endpoint for that region, and all traffic is routed to the other region. A CloudWatch alarm alerts the support team to investigate what is causing the health check to fail.
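The routing behavior can be illustrated with a small simulation. This is a simplified model of weighted routing with health checks, not Route 53 itself: unhealthy records are excluded, the remaining records are chosen in proportion to their weights, and if no record is healthy all records are treated as eligible. The weights and region labels are assumptions.

```python
import random


def eligible(records):
    """Healthy records only; if none are healthy, fall back to all records."""
    healthy = [r for r in records if r["healthy"]]
    return healthy or list(records)


def pick_region(records, rng=random):
    """Choose a region in proportion to the weights of the eligible records."""
    pool = eligible(records)
    return rng.choices(
        [r["region"] for r in pool],
        weights=[r["weight"] for r in pool],
        k=1,
    )[0]


# Hypothetical state: the primary region's health check is failing.
records = [
    {"region": "primary", "weight": 90, "healthy": False},
    {"region": "secondary", "weight": 10, "healthy": True},
]

rng = random.Random(0)
picks = {pick_region(records, rng) for _ in range(1000)}
print(picks)
```

With the primary marked unhealthy, every simulated request lands on the secondary region regardless of its small weight.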

Below is a typical application workflow in the active-active strategy:

  • Following the Amazon Route 53 weighted routing and health checks, client requests routed to the primary AWS region are served by the Amazon API Gateway endpoint for a certain web resource. The private API Gateway web resource does a synchronous call to AWS Lambda.
  • AWS Lambda processes the request by securely connecting to the Amazon Aurora global database instance to read or write the requested data. Data validations and authorization checks are applied, including row-level security, while processing the request.
  • Amazon Aurora replicates any updates from the primary database instance to the secondary region’s headless Aurora cluster.
  • Client requests routed to the secondary region’s API endpoint are served similarly, by the API Gateway (private) endpoint that makes a synchronous call to Lambda, which processes the request by securely connecting to the primary region’s Aurora database instance to read or write the requested data. This data flow adds network latency because it crosses regions.
  • Periodic health checks are performed in both the primary and secondary regions, leveraging Amazon CloudWatch Synthetics.

Database Failover

The Amazon Aurora global database is a fully managed relational database engine deployed in a multi-AZ configuration. However, in the event of an unplanned regional impairment of the database, a detach and promote process can be considered:

  • An automation process is triggered that removes the headless secondary Aurora cluster from the global database. This stops data replication from the primary region to the secondary region. To meet the RPO requirement, configure the Aurora RPO configuration parameter. If the data replication lag between regions exceeds the RPO, Aurora holds any further database writes until the replication lag is back within the RPO threshold.
  • Promote the secondary Aurora database instance to standalone and provision new Aurora writer and reader instances. Since the data is already replicated in the secondary region, the database is up in minutes, and the secondary region’s Aurora cluster is ready to serve client requests.
  • Since the primary region health checks will fail, Amazon Route 53 routes client requests to the secondary region’s API Gateway.
  • The secondary region’s API Gateway endpoint does a synchronous call to AWS Lambda, through the API Gateway endpoint configured with Lambda proxy configuration for the requested web resource.
  • The secondary region’s Lambda functions, previously configured to route read and write calls to the primary Aurora cluster, are now automatically reconfigured to route calls to the secondary region’s Aurora cluster. AWS Secrets Manager stores the Aurora database credentials, and the Lambda functions read these secrets to connect to the database reader and writer instances. This completes the disaster recovery process, with the application now up and running and serving clients from the secondary region.
  • Periodic health checks in the secondary region continue to run and create alerts in case of any unhealthy checks.
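The detach-and-promote steps above might be automated along the following lines. This is a sketch under stated assumptions, not ADP's automation: the identifiers, instance class, and engine name are hypothetical, and waiting for the cluster to become available between steps is omitted. The function takes the RDS client as a parameter so it can be exercised in a dry run with the recording stand-in shown here; in practice the client would be `boto3.client("rds", region_name=...)` for the secondary region.

```python
class RecordingClient:
    """Dry-run stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return record


def detach_and_promote(rds, global_cluster_id, cluster_arn, cluster_id,
                       instance_ids, instance_class, engine):
    """Detach the headless secondary cluster, then add compute to it."""
    # Step 1: remove the secondary cluster from the global database,
    # which stops replication and makes it a standalone cluster.
    rds.remove_from_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        DbClusterIdentifier=cluster_arn,  # this call expects the cluster ARN
    )
    # Step 2: provision instances. In an Aurora cluster the first instance
    # created becomes the writer; later ones join as readers.
    for instance_id in instance_ids:
        rds.create_db_instance(
            DBInstanceIdentifier=instance_id,
            DBClusterIdentifier=cluster_id,
            DBInstanceClass=instance_class,
            Engine=engine,
        )


# Dry run with hypothetical identifiers (all values are assumptions).
client = RecordingClient()
detach_and_promote(
    client,
    global_cluster_id="voe-global",
    cluster_arn="arn:aws:rds:<region>:<account-id>:cluster:voe-secondary",
    cluster_id="voe-secondary",
    instance_ids=["voe-secondary-writer", "voe-secondary-reader"],
    instance_class="db.r6g.large",
    engine="aurora-postgresql",
)
for name, kwargs in client.calls:
    print(name, kwargs)
```

Passing the client in keeps the sequence of calls testable without AWS access, which fits the runbook and game-day practices recommended later in this post.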

This process is illustrated below. For more information on Aurora global database failover, refer to the documentation.


Figure 2 – Illustration of database failover scenario.

Planned Switchover on Primary Region to Revert to Normal State

After the primary region is operating normally, a switchover activity is planned. During this process, a new primary region cluster with read replicas is added to the secondary region’s Aurora database, transforming the standalone Aurora cluster into a global cluster. This starts the data replication process from the secondary region’s Aurora cluster to the primary region’s cluster.

Cross-region data replication lag is monitored using Amazon CloudWatch’s “AuroraGlobalDBRPOLag” metric for the global database.
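One way to decide when the lag has settled is to require several consecutive datapoints under a threshold before proceeding. The sketch below assumes datapoints shaped like those returned by CloudWatch's `get_metric_statistics` (a `Timestamp` and a `Maximum` per period) and assumes the metric is reported in milliseconds; the threshold and the number of required datapoints are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def lag_within_threshold(datapoints, threshold_ms, min_points=3):
    """True when the most recent `min_points` datapoints are all under the threshold."""
    recent = sorted(datapoints, key=lambda d: d["Timestamp"])[-min_points:]
    if len(recent) < min_points:
        return False  # not enough data to make a decision
    return all(d["Maximum"] <= threshold_ms for d in recent)


# Hypothetical lag readings, one per minute, as replication catches up.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
readings = [5000.0, 900.0, 120.0, 80.0, 60.0]
datapoints = [
    {"Timestamp": start + timedelta(minutes=i), "Maximum": lag}
    for i, lag in enumerate(readings)
]

print(lag_within_threshold(datapoints, threshold_ms=200))
print(lag_within_threshold(datapoints, threshold_ms=100))
```

Requiring several consecutive readings, rather than a single one, avoids starting the switchover on a momentary dip in lag.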

Once the replication lag is near real-time, the application is briefly interrupted and the replication lag is monitored until both regions’ data are in sync. Once data from both regions are in sync with no replication lag, the planned secondary cluster failover to the primary cluster is initiated.

Typically, this activity is planned during off-peak business hours for minimal business impact. With this action, the replication process turns around so the data replication now happens from the primary to secondary region as it was before the failover event.

Post-failover, an automation job updates AWS Secrets Manager with the primary cluster’s writer and reader endpoints. After completing the switchover process, the application is back to an operational state across both primary and secondary regions.
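An endpoint update of this kind could be sketched as follows. The secret name, the JSON key names (`writer_host`, `reader_host`), and the endpoint values are hypothetical; the post does not describe the secret's layout. The function reads the current secret, changes only the endpoints, and writes it back, so the stored credentials are preserved. The in-memory class stands in for `boto3.client("secretsmanager")` so the example runs without AWS access.

```python
import json


class InMemorySecrets:
    """Stand-in exposing the two Secrets Manager calls used below."""

    def __init__(self, initial):
        self.store = dict(initial)

    def get_secret_value(self, SecretId):
        return {"SecretString": self.store[SecretId]}

    def put_secret_value(self, SecretId, SecretString):
        self.store[SecretId] = SecretString
        return {}


def repoint_secret(secrets, secret_id, writer_endpoint, reader_endpoint):
    """Update only the endpoints in the secret, keeping other fields intact."""
    current = json.loads(
        secrets.get_secret_value(SecretId=secret_id)["SecretString"]
    )
    current["writer_host"] = writer_endpoint  # hypothetical key name
    current["reader_host"] = reader_endpoint  # hypothetical key name
    secrets.put_secret_value(SecretId=secret_id,
                             SecretString=json.dumps(current))
    return current


# Hypothetical secret contents before the switchover.
secrets = InMemorySecrets({
    "voe/db": json.dumps({
        "username": "app_user",
        "password": "<redacted>",
        "writer_host": "secondary-writer.example.internal",
        "reader_host": "secondary-reader.example.internal",
    })
})

updated = repoint_secret(secrets, "voe/db",
                         "primary-writer.example.internal",
                         "primary-reader.example.internal")
print(updated["writer_host"], updated["reader_host"])
```

Because the Lambda functions read their connection details from the secret, updating it in one place redirects both regions without redeploying code.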


Figure 3 – Detailed database failover and failback flow.

Recommendations

  • Create a runbook with tasks recorded for disaster recovery and switch-back processes.
  • Use infrastructure as code (IaC) tools like AWS CloudFormation to manage your cloud resources.
  • Use automation for disaster recovery and switch-back activities.
  • Perform chaos engineering game days to test your disaster recovery processes frequently, typically scheduled during weekends.

Conclusion

In this post, we shared how ADP Voice of Employee is able to meet its high availability requirement by deploying the application infrastructure across different AWS Availability Zones and regions using various AWS services and features.

To optimize costs, the solution leverages the AWS fully managed headless cluster data replication process, a feature of Amazon Aurora global database.



ADP – AWS Partner Spotlight

ADP is an AWS Partner and human capital management (HCM) technology provider with products that unite HR, payroll, talent, time, tax and benefits administration.

Contact ADP | Partner Overview