Journey to Adopt Cloud-Native Architecture Series: #3 – Improved Resilience and Standardized Observability

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

In the last blog, Maximizing System Throughput, we talked about design patterns you can adopt to address immediate scaling challenges to provide a better customer experience. In this blog, we talk about architecture patterns to improve system resiliency, why observability matters, and how to build a holistic observability solution.

As a refresher from previous blogs, our example ecommerce company’s “Shoppers” application runs in the cloud. It is a monolithic application (application server and web server) that runs on an Amazon Elastic Compute Cloud (Amazon EC2) instance. It connects with a PostgreSQL database running on Amazon Relational Database Service (Amazon RDS).

The monolith application is tightly coupled with the database. The transaction from front end to database has to complete within 30 milliseconds, otherwise user experience degrades. The application integrates with a number of external systems for enriching data, payment processing, and order fulfillment. Some of these external systems don’t provide service level agreements (SLAs) on response time but have a median response time of 15 milliseconds. It takes around 30 minutes before a new server starts serving user traffic. To break it down, it takes 10 minutes for application startup, 5 minutes to create and test connections, 10 minutes to run sanity tests, and 5 minutes to load the application cache.

Increase resiliency

The complexity of monolith applications can present unknown failures. You cannot avoid failure by simply testing all application and infrastructure failure scenarios. Your application must be designed to handle any cascading failures to maintain a continuous user experience even when the backend systems are experiencing issues. In the following sections, we show you the steps we took to improve system resiliency for our example company.

Minimum business continuity for failover

Building disaster recovery (DR) strategies into your system requires you to work backwards from recovery point objective (RPO) and recovery time objective (RTO) requirements. Our business needs in this scenario required us to build high availability to prevent 30 minutes of continuous downtime (RTO) and prevent persistent user data loss (that is, a few minutes RPO).

To meet these goals, we developed a long-term business continuity plan per the Disaster Recovery of Workloads on AWS: Recovery in the Cloud whitepaper that consisted of the following:

We accepted data loss for the data that’s not persistent in the database.
Earlier, we were able to restore from the backup but wanted to improve availability further. We built a pilot light using point-in-time backups for data stores, cross-Region Amazon RDS read replicas, cross-Region Amazon S3 replication, and AWS CloudFormation templates.
We used AWS Backup to simplify backup and cross-Region copying of Amazon EC2, Amazon Elastic Block Store (Amazon EBS), and Amazon RDS to mitigate business continuity risks.
We built a pipeline to create “golden AMIs,” Amazon Machine Images that contain the operating systems and packages needed to stand up consistent servers, using EC2 Image Builder.
Additionally, we automated application code deployment using AWS CodePipeline.

Due to the stateful nature of the application, taking EBS snapshots wasn’t going to prevent data loss. We updated our continuous integration and continuous delivery (CI/CD) pipeline. This enables us to deploy application code in different Regions based on the guidance from the Using AWS CodePipeline to Perform Multi-Region Deployments blog.

Improve load-balancing capability across servers

The monolith application handles all transactions, including long transactions that can take up to 3 minutes and short transactions that complete within 30 milliseconds.

To manage the resulting unbalanced traffic, we used the least outstanding requests algorithm in Application Load Balancer, which routes each new request to the target with the fewest in-flight requests and so distributes the load more uniformly.
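The idea behind least outstanding requests can be illustrated with a minimal in-memory sketch. It is a simplified model, not the Application Load Balancer implementation; the `Target` class, the `pick_target` function, and the server names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    in_flight: int = 0  # requests sent to this target that have not completed yet


def pick_target(targets, rng=random):
    """Return the target with the fewest in-flight requests; break ties randomly."""
    fewest = min(t.in_flight for t in targets)
    candidates = [t for t in targets if t.in_flight == fewest]
    return rng.choice(candidates)


def simulate():
    """One long transaction is stuck on server-a; three short ones arrive afterwards."""
    targets = [Target("server-a"), Target("server-b")]
    targets[0].in_flight += 1  # the 3-minute transaction is still running
    assigned = []
    for _ in range(3):
        target = pick_target(targets)
        target.in_flight += 1   # short request starts
        assigned.append(target.name)
        target.in_flight -= 1   # ...and completes within milliseconds
    return assigned
```

In this model all three short requests go to `server-b`, whereas a round-robin policy would have sent some of them to the server that is still busy with the long transaction.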

Exponential backoff

After implementing retries and timeouts, we learned that backoffs are equally important. This is true especially for exponential backoff, where wait time increases exponentially after every retry attempt. We know from the Timeouts, retries, and backoff with jitter article that retries can become an anti-pattern for application resiliency.
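As a rough sketch of the technique, the following shows capped exponential backoff with "full jitter", where each wait is drawn uniformly between zero and an exponentially growing ceiling. The function names and the default values for `base`, `cap`, and `max_attempts` are illustrative assumptions, not values from our application.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full-jitter delay in seconds for a zero-based retry attempt."""
    ceiling = min(cap, base * 2 ** attempt)  # grows 0.1, 0.2, 0.4, ... up to cap
    return rng.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random):
    """Call operation(); on failure, wait a jittered backoff delay and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
```

The jitter spreads retries from many clients over time, so they do not all hit a recovering dependency at the same instant.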

We observed that database retries were overwhelming the database in the case of latency jitters. So, we implemented retry throttling within AWS SDK for Java per the Introducing Retry Throttling blog.
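Retry throttling can be thought of as a shared retry budget: each retry spends tokens, each successful call earns a little back, and retries are skipped once the budget is exhausted. The sketch below models that idea only. It is not the AWS SDK for Java implementation, and the class name, capacity, and token costs are hypothetical.

```python
class RetryBudget:
    """Shared token pool that limits how many retries a client may issue."""

    def __init__(self, capacity=10, retry_cost=5, success_refund=1):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost
        self.success_refund = success_refund

    def try_acquire_retry(self):
        """Return True and spend tokens if a retry is allowed, else False."""
        if self.tokens < self.retry_cost:
            return False  # budget exhausted: fail fast instead of retrying
        self.tokens -= self.retry_cost
        return True

    def record_success(self):
        """A successful call slowly refills the budget, up to its capacity."""
        self.tokens = min(self.capacity, self.tokens + self.success_refund)
```

During a latency spike the budget drains quickly and the client stops piling retries onto the database; once calls start succeeding again, the budget recovers and retries resume.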

Decoupling integrations using event-driven design patterns

Our application experienced a cascading failure when timeouts in the payment processing and order fulfillment APIs impacted business operations.

To fix this, we identified asynchronous messaging paths to improve customer experience. We updated our application logic to use the design pattern shown in this FIFO topics example use case:

The application sends requests to Amazon Simple Notification Service (Amazon SNS) topics. Multiple Amazon Simple Queue Service (Amazon SQS) queues subscribe to these topics to fan out the messages.
The AWS Lambda functions poll messages from these SQS queues and interact with external systems, throttling the number of calls to external API operations.
As a follow-up, notifications are sent to the user when needed.

This pattern allows us to retry failed requests while using First-In-First-Out and deduplication features described in the Introducing Amazon SNS FIFO – First-In-First-Out Pub/Sub Messaging blog.
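To make the fan-out and deduplication behavior concrete, here is a small in-memory model of the pattern. It does not call Amazon SNS or Amazon SQS; `FifoTopic`, `FifoQueue`, and the message names are hypothetical stand-ins, and the deduplication here never expires, whereas the real services deduplicate within a limited time window.

```python
from collections import deque


class FifoQueue:
    """Stand-in for a FIFO queue: keeps order, drops repeated deduplication IDs."""

    def __init__(self, name):
        self.name = name
        self._messages = deque()
        self._seen_ids = set()

    def send(self, dedup_id, body):
        if dedup_id in self._seen_ids:
            return False  # duplicate, for example a retried publish
        self._seen_ids.add(dedup_id)
        self._messages.append(body)
        return True

    def receive(self):
        """Return the oldest message, or None when the queue is empty."""
        return self._messages.popleft() if self._messages else None


class FifoTopic:
    """Stand-in for a FIFO topic that fans each message out to every subscriber."""

    def __init__(self):
        self._queues = []

    def subscribe(self, queue):
        self._queues.append(queue)

    def publish(self, dedup_id, body):
        for queue in self._queues:
            queue.send(dedup_id, body)


topic = FifoTopic()
payments = FifoQueue("payments")
fulfillment = FifoQueue("fulfillment")
topic.subscribe(payments)
topic.subscribe(fulfillment)

topic.publish("order-1", "order 1 placed")
topic.publish("order-1", "order 1 placed")  # retry of the same request
topic.publish("order-2", "order 2 placed")
```

Each queue ends up with one copy of each order, in publish order, so the payment and fulfillment consumers can process and retry independently of each other.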

Predictive scaling for EC2

Due to its monolithic architecture, the application didn’t scale quickly with sudden increases in traffic because of its high bootstrap time.

To predict future demands, we optimized for application availability and allowed automatic scaling to use historical CPU utilization data. This approach was highlighted in the New – Predictive Scaling for EC2, Powered by Machine Learning blog. This allows us to adjust capacity needs by forecasting usage patterns along with configurable warm-up time for application bootstrap.
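One possible way to express such a policy is the predictive scaling policy type of Amazon EC2 Auto Scaling. The sketch below only builds the request parameters and does not call AWS. The function name, the group name, and the 50 percent CPU target are assumptions for illustration, and this is not necessarily the configuration we deployed.

```python
def build_predictive_scaling_policy(asg_name, bootstrap_seconds,
                                    target_cpu_percent=50.0):
    """Build keyword arguments for a PutScalingPolicy request (not sent here)."""
    # Launch instances ahead of the forecast so they finish bootstrapping in time.
    # The API accepts at most 3600 seconds for this buffer.
    buffer_seconds = min(int(bootstrap_seconds), 3600)
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-predictive-cpu",
        "PolicyType": "PredictiveScaling",
        "PredictiveScalingConfiguration": {
            "MetricSpecifications": [{
                "TargetValue": target_cpu_percent,
                "PredefinedMetricPairSpecification": {
                    "PredefinedMetricType": "ASGCPUUtilization",
                },
            }],
            "Mode": "ForecastAndScale",  # "ForecastOnly" previews without scaling
            "SchedulingBufferTime": buffer_seconds,
        },
    }


# 30 minutes of bootstrap time, as described for the Shoppers application.
policy = build_predictive_scaling_policy("shoppers-asg", bootstrap_seconds=30 * 60)
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("autoscaling").put_scaling_policy(**policy)
```

Starting in forecast-only mode is a cautious way to compare the forecast against actual traffic before letting the policy change capacity.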

Standardize observability

Production outages are scary for everyone, but with the right system monitoring solution, they can be made less stressful. After a few outages of our application, we realized we needed to rethink our monitoring holistically instead of adding metrics on a one-time basis.

Logging into each server and comparing logs can get overwhelming when you are troubleshooting to identify the bottleneck behind unknown behaviors. But we found the key to troubleshooting: observability and event correlation, as explained in the following sections.

Define application and infrastructure metrics

As part of standardizing monitoring and observability metrics, we followed the guidance provided in the Performance Efficiency pillar of the AWS Well-Architected Framework. In summary, we worked backwards from the customer experience to identify performance-related metrics for each system component and dashboards to provide consolidated views.

Centralized logging

When analyzing downtime events, we found that the time spent on correlating different events was leading to a high mean time to resolve (MTTR). Additionally, the performance baseline was missing. Our analysis also uncovered a gap in security monitoring.

To overcome these challenges, we:

Built a centralized logging solution using the guidance from the Visualizing AWS CloudTrail Events using Kibana blog.
Created dashboards in Kibana to visualize the log correlation insights and metrics.
Ran on-demand queries when we needed to troubleshoot performance issues to identify the root cause using Amazon Athena.
Used Amazon Elasticsearch Service (Amazon ES) to maintain real-time log insights.

We used this solution as a baseline and updated the configuration as shown in Amazon ES Service Best Practices to optimize for performance and scale. We took manual hourly snapshots of the ES cluster to Amazon S3 and used Amazon S3 Cross-Region Replication to restore the cluster in case of a DR event.

Manage service limits with Service Quotas and adopt a multi-account strategy

We observed that most workloads were running in a single account, leading to service limits. We implemented two strategies to address this situation:

Adopted Service Quotas to proactively identify service limits. As explained in the Introducing Service Quotas: View and manage your quotas for AWS services from one central location blog, it provides a central place to view and manage quotas. Service Quotas supports Amazon CloudWatch integration for some services, which enables alerts when we are close to reaching a quota. Based on those alerts, operational teams can proactively submit tickets to increase quotas before system availability is impacted.
While using Service Quotas, we realized that production and non-production environments should be separated into a multi-account framework for many reasons, including service limits. As discussed in Best Practices for Organizational Units with AWS Organizations, a multi-account environment allows for simplified billing, flexible security controls, governance at scale, and rapid growth and innovation.
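The alerting logic behind the first strategy boils down to comparing usage against the quota and flagging anything above a threshold. The following is a minimal sketch of that check with made-up numbers. The function name, quota names, and the 80 percent threshold are assumptions; in practice the usage and quota values would come from Service Quotas and CloudWatch rather than from hard-coded dictionaries.

```python
def quotas_near_limit(usage_by_quota, limit_by_quota, threshold=0.8):
    """Return (quota name, utilization) pairs at or above the threshold."""
    near = []
    for name, limit in limit_by_quota.items():
        used = usage_by_quota.get(name, 0)
        if limit > 0 and used / limit >= threshold:
            near.append((name, round(used / limit, 2)))
    # Highest utilization first, so the most urgent quota is at the top.
    return sorted(near, key=lambda item: item[1], reverse=True)


# Hypothetical usage and quota values for a single account.
usage = {"ec2-on-demand-vcpus": 450, "rds-instances": 10}
limits = {"ec2-on-demand-vcpus": 512, "rds-instances": 40}
alerts = quotas_near_limit(usage, limits)
```

Here only the vCPU quota is flagged, at roughly 88 percent utilization, which is the point where an operations team would request an increase.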

Figure 1. Current Architecture with improved resiliency and standardized observability

Conclusion

In this blog, we talked about design patterns you can adopt to improve overall resiliency, meet your RPO and RTO objectives, scale a monolith application based on forecasted usage patterns, enhance observability, and plan for a multi-account framework. In the next blog, we will talk about more architecture patterns to evolve the current architecture.

AWS Architecture Blog