Assessing the Reliability of Your SaaS Environment with the AWS Well-Architected SaaS Lens
By Oren Reuveni, Sr. Partner Solutions Architect – AWS SaaS Factory
The reliability pillar of the AWS Well-Architected SaaS Lens focuses on the reliability posture of your software-as-a-service (SaaS) solution.
The SaaS Lens helps Amazon Web Services (AWS) customers assess the overall reliability of their SaaS architecture, providing prescriptive guidance that enables better alignment of their architecture with SaaS reliability best practices.
In multi-tenant environments, any outage or degraded experience could impact all customers. In this post, I will discuss how to gain visibility into the health of the SaaS environment and specific tenants, mitigate possible interruptions, and test the reliability of your system components.
Applying the best practices described in this pillar will enable your SaaS application to handle unpredictable tenant usage patterns, recover from infrastructure or service disruptions, and scale to meet resource demand. The goal is to ensure that no tenant, and no other system component, suffers degraded service quality.
SaaS Applications Reliability Considerations
There are several key considerations to keep in mind when building or optimizing a SaaS solution for reliability. While some apply to all modern solutions, others are specific to a SaaS delivery model, and that’s the focus of this post.
The SaaS reliability pillar represents an extension of the existing Well-Architected reliability principles. To align with the full range of best practices, make sure you include these foundational practices as part of your review. For more details, please refer to the Well-Architected Framework’s reliability pillar description.
The SaaS Lens reliability pillar questions are aligned with the general Well-Architected design principles, but address challenges that are unique to SaaS solutions.
There are three main areas that we focus on in the SaaS Lens reliability pillar:
The workloads of tenants in a multi-tenant environment can be continually shifting. Tenants may impose different types of load on the system. New tenants with new workload profiles may also be continually added to the system. These factors can make it challenging for SaaS companies to create an architecture that’s resilient enough to react and respond to these evolving needs.
The term “noisy neighbor” describes a system user (a tenant, in our case) that places load on the system in a way that can degrade the service quality other users (tenants) receive. A multi-tenant solution should have mechanisms in place that can detect such situations and mitigate them. By introducing proactive constructs to manage noisy neighbor conditions, your SaaS solution can ensure the service quality level for other tenants is not affected.
Tenant Health Visibility
While there is a general level of health for your system, your operational tooling must also enable you to view and assess the health of individual tenants and tiers. Gaining visibility into your tenants’ health allows you to detect reliability issues, proactively prevent them, and maintain an adequate quality of service.
Building a proactive view of health in a multi-tenant environment requires you to surface additional reliability data that provides more detailed, tenant-aware insights into the health trends of your tenant workloads. These insights are used to identify tenant-specific trends, activities, or data points that can effectively capture conditions that could impact reliability for that tenant or the entire system.
Having this data allows you to build alarms, policies, and automation that can attempt to heal the system without incurring an outage.
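As a minimal sketch of what tenant-aware health data could look like, the Python snippet below aggregates request outcomes per tenant and flags tenants whose error rate crosses a threshold. The class names, the 5% threshold, and the minimum sample size are illustrative assumptions, not part of the SaaS Lens; a production system would more likely publish these data points to a metrics service and attach alarms there.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TenantStats:
    """Running request and error counts for one tenant."""
    requests: int = 0
    errors: int = 0

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


class TenantHealthMonitor:
    """Hypothetical helper: aggregates outcomes per tenant and flags outliers."""

    def __init__(self, error_rate_threshold: float = 0.05, min_requests: int = 20):
        # Both values are assumptions chosen for illustration.
        self.error_rate_threshold = error_rate_threshold
        self.min_requests = min_requests
        self._stats = defaultdict(TenantStats)

    def record(self, tenant_id: str, ok: bool) -> None:
        """Record the outcome of a single request for a tenant."""
        stats = self._stats[tenant_id]
        stats.requests += 1
        if not ok:
            stats.errors += 1

    def unhealthy_tenants(self) -> list:
        """Return tenants with enough traffic and an error rate above the threshold."""
        return sorted(
            tenant_id
            for tenant_id, stats in self._stats.items()
            if stats.requests >= self.min_requests
            and stats.error_rate > self.error_rate_threshold
        )
```

The list returned by `unhealthy_tenants()` is the kind of signal that an alarm or an automated remediation policy could act on for a single tenant, without waiting for a system-wide outage.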
Automated Reliability Tests
There are a number of different dimensions you’ll want to include as part of your multi-tenant testing strategy. In many cases, these tests are targeted at validating the constructs you have in place to address the scale, operations, and reliability footprint of your SaaS product.
It’s important to note that SaaS testing is often about simulating the extremes your application may encounter. You should be focused on building a suite of tests that can effectively model and evaluate how your system will respond to the expected and the unexpected.
As you introduce new constructs to manage your SaaS system’s reliability, you must also look at how you can create automated tests that continually evaluate whether these constructs are working as expected. This can be done by building tests that exercise core processes such as tenant onboarding, system updates, and configuration changes.
Another type of testing exercises multi-tenant load and the varied activity patterns that multiple tenants generate. It lets you confirm that your SaaS solution can detect, withstand, and mitigate intensive load or unpredicted activity.
Well-Architected SaaS Lens Questions
Now that you’ve looked at some of the key areas of SaaS reliability, let’s look at the questions that the Well-Architected SaaS Lens uses to evaluate your alignment with these practices.
Each question is accompanied by a short summary of the recommended practices for its topic. The summary groups the practices into Required, Good, and Best sets and references related content.
Following is a high-level view of the scope and goals of each question. For more details, please refer to the reliability pillar section in the SaaS Lens Whitepaper and the AWS Well-Architected Tool itself. Guidance for improving your current posture can be found in the SaaS Lens improvement plan within the Well-Architected Tool.
Figure 1 – REL1 question in the AWS Well-Architected Tool.
SaaS REL 1: How do you limit an individual tenant’s ability to impose load that might impact availability for other tenants of your system?
A specific tenant should not be able to impact the system’s or another tenant’s quality of service. To achieve that, you’ll need to control your tenants’ activity in the system.
- Use throttling policies to limit the effect that noisy tenants have on the system. Identify tenants that are imposing excess load and use this data to apply throttling policies to help ensure the workloads of any one tenant do not impact the overall reliability of your system.
- Define SLAs for each tenant tier. Introduce SLAs that are configured for each tenant tier supported by your system. Use SLAs as part of a throttling strategy to tightly control the level of activity and load that tenants/tiers can place on the system. This method ensures, for example, that the basic tier in a SaaS solution does not impact the reliability of a premium tier tenant.
- Partition tenant load to limit the area of effect. Distribute and/or isolate your tenant loads, enabling the resources (compute or storage, for example) to effectively address potentially spiky tenant loads. This method can be applied to specific system components that are more sensitive or more likely to experience high or unpredictable loads. It could also be applied to specific tenant tiers that are known to have excessive consumption or atypical usage patterns.
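The throttling and tiering practices above can be sketched with a per-tenant token bucket whose limits depend on the tenant’s tier. This is an illustrative Python example under stated assumptions: the tier names, the rates in `TIER_LIMITS`, and the `TenantThrottle` class are hypothetical, and on AWS you would often rely on a managed throttling feature rather than hand-rolled code.

```python
import time

# Hypothetical per-tier limits: sustained requests per second and burst size.
TIER_LIMITS = {
    "basic": {"rate": 5.0, "burst": 10},
    "premium": {"rate": 50.0, "burst": 100},
}


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Consume `cost` tokens if available; return False to throttle."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantThrottle:
    """Keeps one bucket per tenant so one tenant cannot drain another's quota."""

    def __init__(self, tier_limits=TIER_LIMITS, clock=time.monotonic):
        self._tier_limits = tier_limits
        self._clock = clock
        self._buckets = {}

    def allow(self, tenant_id, tier):
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            # The bucket is sized from the tier seen on the first request.
            limits = self._tier_limits[tier]
            bucket = TokenBucket(limits["rate"], limits["burst"], self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow()
```

Because each tenant has its own bucket, a basic-tier tenant that exhausts its burst is throttled while a premium-tier tenant continues to be served. The injectable `clock` parameter is a design choice that keeps the throttle deterministic in automated tests.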
SaaS REL 2: How do you proactively detect and maintain tenant health?
A reliable SaaS solution supports tenant-aware operations that enable proactive detection and resolution of tenant and system health issues.
- Add tenant context to application logs to reactively manage tenant health. Ensure your log files contain tenant context. Use this context to analyze and proactively identify quality of service reliability issues. Viewing logs with tenant context lets you easily identify specific tenant activity that might be contributing to system or tenant-specific service quality issues.
- Introduce detailed tenant insights to enhance health forensics. Publish detailed tenant activity, consumption, performance, and error data to a centralized repository that can be used to analyze any health issues or usage trends that may be impacting reliability. This data can also be used with other health data to diagnose and assess more challenging multi-tenant reliability events.
- Proactively identify tenant issues with policies and alarms. Combine rich tenant insights with policies to proactively surface issues before they impact the stability or availability of your SaaS environment. These policies may invoke self-healing strategies for individual tenants and surface critical alerts/alarms.
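One way to add tenant context to application logs is sketched below using Python’s standard `logging` and `contextvars` modules. The logger name `saas.app`, the `current_tenant` variable, and the JSON field names are assumptions made for illustration; the SaaS Lens does not prescribe a log format.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable set once per request, e.g. after authentication.
current_tenant = contextvars.ContextVar("current_tenant", default="unknown")


class TenantContextFilter(logging.Filter):
    """Stamps every log record with the tenant active in the current context."""

    def filter(self, record):
        record.tenant_id = current_tenant.get()
        return True  # never drop records, only enrich them


class JsonFormatter(logging.Formatter):
    """Emits one JSON object per line so logs can be queried by tenant_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "tenant_id": getattr(record, "tenant_id", "unknown"),
            "message": record.getMessage(),
        })


def build_logger(stream):
    """Return a logger that writes tenant-aware JSON lines to `stream`."""
    logger = logging.getLogger("saas.app")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TenantContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def demo():
    """Log one message on behalf of a sample tenant and return the raw line."""
    stream = io.StringIO()
    logger = build_logger(stream)
    token = current_tenant.set("tenant-42")
    try:
        logger.warning("checkout latency above threshold")
    finally:
        current_tenant.reset(token)
    return stream.getvalue().strip()
```

With every line carrying a `tenant_id` field, a centralized log store can filter or aggregate by tenant, which is the prerequisite for the tenant-level policies and alarms described in the practices above.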
SaaS REL 3: How are you testing the multi-tenant capabilities of your SaaS application?
SaaS introduces multi-tenant mechanisms that add new dimensions to the testing footprint of your application. Onboarding, configuration, noisy neighbor—these are all areas that could impact the reliability of your system. Each of these areas can and should be included as part of the overall reliability testing strategy for your SaaS environment.
- Validate noisy neighbor scale and availability. Test various noisy neighbor conditions, assessing the system’s ability to identify and respond to scenarios where a subset of tenants place a disproportionate load on your system. Develop a suite of tests that assess the system’s ability to apply scaling, throttling, and tiering policies for a range of tenant tiers and profiles. The goal is to ensure your noisy neighbor policies are performing as expected.
- Exercise key workflows under multi-tenant load. Identify workflows that are key to your customers’ experience and implement tests that validate your system’s ability to support the SLAs of these workflows across a range of multi-tenant loads. Assess the system’s overall stability as tenants place a mix of loads at varying levels of tenant activity.
- Validate the scale and repeatability of tenant onboarding. Ensure your tenant onboarding experience can reliably and repeatably onboard tenants with varying patterns and configurations.
- Ensure tenancy configuration changes are successfully propagated. Validate that the system is correctly applying and propagating changes to tenant configuration. For example, changes to account state, such as status (active/inactive) and tier (a tenant that upgraded from the free tier to the premium tier, for instance), must be shared between the billing system and your SaaS environment.
- Validate tenant isolation. Simulate interactions with your system to validate your system’s tenant isolation policies and practices are being successfully applied. Include tests that examine scenarios where a developer’s multi-tenant code could unintentionally cross a tenant boundary.
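As an illustration of the tenant isolation practice, the following Python sketch pairs a toy data store with automated tests that attempt to cross a tenant boundary. `OrderStore` and `TenantIsolationError` are hypothetical stand-ins for your own data access layer; real tests would run against the actual isolation mechanism your system uses.

```python
import unittest


class TenantIsolationError(Exception):
    """Raised when a caller tries to read another tenant's data."""


class OrderStore:
    """Hypothetical in-memory store in which every access is scoped by tenant."""

    def __init__(self):
        self._orders = {}  # order_id -> (owner_tenant_id, payload)

    def put(self, tenant_id, order_id, payload):
        self._orders[order_id] = (tenant_id, payload)

    def get(self, tenant_id, order_id):
        owner, payload = self._orders[order_id]
        if owner != tenant_id:
            raise TenantIsolationError(f"{tenant_id} cannot read {order_id}")
        return payload

    def list_for(self, tenant_id):
        return [oid for oid, (owner, _) in self._orders.items() if owner == tenant_id]


class TenantIsolationTest(unittest.TestCase):
    def setUp(self):
        self.store = OrderStore()
        self.store.put("tenant-a", "order-1", {"total": 10})
        self.store.put("tenant-b", "order-2", {"total": 20})

    def test_cross_tenant_read_is_rejected(self):
        # Simulates code that mistakenly uses another tenant's identifier.
        with self.assertRaises(TenantIsolationError):
            self.store.get("tenant-b", "order-1")

    def test_listing_returns_only_own_orders(self):
        self.assertEqual(self.store.list_for("tenant-a"), ["order-1"])
        self.assertEqual(self.store.list_for("tenant-b"), ["order-2"])


if __name__ == "__main__":
    unittest.main()
```

Running tests like these in a continuous integration pipeline helps catch a change that accidentally removes the tenant check before it reaches production.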
For more details, please refer to the reliability pillar section in the SaaS Lens Whitepaper.
Get Started with the Well-Architected SaaS Lens
The AWS Well-Architected SaaS Lens focuses on SaaS workloads and is intended to drive critical thinking for developing and operating SaaS workloads. Each question in the lens has a list of best practices, and each best practice has a list of improvement plans to help guide you in implementing them.
The lens can be applied to existing workloads, or used for new workloads you define in the tool. You can use it to improve the application you are working on, or to get visibility into multiple workloads used by the department or area you are working with.
About AWS SaaS Factory
AWS SaaS Factory helps organizations at any stage of the SaaS journey. Whether looking to build new products, migrate existing applications, or optimize SaaS solutions on AWS, we can help. Visit the AWS SaaS Factory Insights Hub to discover more technical and business content and best practices.
SaaS builders are encouraged to reach out to their account representative to inquire about engagement models and to work with the AWS SaaS Factory team.
Sign up to stay informed about the latest SaaS on AWS news, resources, and events.