Journey to Adopt Cloud-Native Architecture Series: #1 – Preparing your Applications for Hypergrowth
In this blog series, we take an example ecommerce company and talk about the challenges they face due to hypergrowth. Their journey from running monolithic applications to running cloud-native applications will provide you with architecture patterns and strategies you can adopt to become more agile and innovative.
Later in the series, we show you how to address immediate challenges and make incremental architectural changes to achieve domain-driven cloud-native applications.
Ultimately, we aim to provide you with tools and strategies to decouple monolithic applications into microservices and build highly resilient and scalable applications.
In this first blog, we define hypergrowth, lay out the company’s current system design and their business priorities, and then discuss their technical challenges during the hypergrowth phase.
What is hypergrowth?
Hypergrowth happens at the steep part of the growth curve, after the initial product or service offerings are defined and before the business matures. This growth can be triggered during a planned event such as Black Friday sales or by an external factor such as a change in government policy.
Hypergrowth poses a number of platform and technology challenges, chief among them scaling the backend without impacting users. This requires optimizing the system design at every layer, which means rethinking the way the applications are architected. To do this, you need to apply modern application development principles: microservices; purpose-built databases; automated software release pipelines; a serverless operational model; and automated, continuous security.
Current system design
At our example ecommerce company, their “Shoppers” application runs in the cloud and is designed as follows:
- A monolithic application (application server and web server) runs on an Amazon Elastic Compute Cloud (Amazon EC2) instance. It connects with a PostgreSQL database running on Amazon Relational Database Service (Amazon RDS).
- The database has large tables (more than 200 GB) designed to reduce the number of calls to the database. The tables have hundreds of columns to minimize joining multiple tables and increase the system throughput. The database has Multi-AZ deployment for high availability.
- The monolith Java application is deployed as a web application archive (more than 5 GB) that includes all dependencies and required libraries. It takes 2–3 hours to build and test this monolith application every time a developer commits code to the version control system, which is organized as a mono repository.
- The application is designed to automatically scale depending upon user traffic and other system metrics such as CPU utilization, memory utilization, and failure rates.
The system was tested to scale to meet three times the growth in user traffic. However, in a few weeks, user traffic increased by 10 times and kept growing exponentially, leading to hypergrowth.
While the application was prepared to handle three times the growth in user traffic, the hypergrowth event re-prioritized internal efforts. The company chose to focus on ensuring reliability and scalability by addressing the following challenges while continuing to focus on business growth.
Some of their key priorities to deal with hypergrowth include:
- Prevent loss of revenue – Because of the exponential increase in user traffic, system outages increased, which led to loss of revenue. Operational teams set goals to identify all bottlenecks and single points of failure that could lead to system outages.
- Reduce customer churn – The company evaluated the customer churn rate and ways to improve customer experience instead of focusing solely on acquiring new customers.
- Improve cost control and visibility – The financial teams closely watched expenses and examined ways to enhance visibility for spending across teams, features, and products.
- Reduce time to market – Because of the growing customer base, the company needed to create and deliver new products and services faster. They set goals to reduce friction in the development life cycle and to develop mechanisms to improve deployment speed.
The following sections highlight challenges identified by operational and architectural reviews conducted by AWS technical teams.
- Tight coupling and dependency between system modules – The current application is designed to scale as a whole instead of scaling individual system modules. This unnecessarily scales some system modules, leading to underutilized system resources. Additionally, some system modules are not designed to scale at the same rate as others, leading to inherent failures.
- Ineffective automatic scaling – The application relies on Amazon EC2 Auto Scaling to add new instances when load increases. However, it takes at least 5 minutes for instances to become ready to serve user traffic. This happens due to required bootstrap dependencies, monolithic code startup time, and sanity checks on all services.
- Limitation on system throughput – The database is running on r5.24xlarge Amazon RDS Multi-AZ, which has almost reached its vertical scale limit. There are intermittent issues due to maximum connection limit, CPU bottleneck, and varying query execution times. The monolith application makes frequent read/write calls with an expected response time of less than 10 milliseconds.
- Long-running transactions – The current database has tables with hundreds of columns. This design worked well in the past for the desired system throughput. However, it has presented new challenges. For example, a single update can require changing up to hundreds of rows, causing read transactions to wait for locks to be released. Read queries also fetch large amounts of data, including unnecessary rows.
- Longer release cycles – Multiple team members contribute to the application code and commit code to the main branch of source version control. The changes are updated in the Development, Test, and Production environments independently because we created a CI/CD pipeline to prevent undesired changes from going to production. This requires coordination across multiple teams to ensure that the right version of each dependency is included in the production release and is certified by testing teams.
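The scaling lag described under "Ineffective automatic scaling" can be made concrete with a small minute-by-minute simulation. This is a minimal sketch under stated assumptions, not the company's actual scaling policy: the function name simulate_scale_out, the per-instance throughput, and the traffic figures are all hypothetical. Only the 5-minute warm-up comes from the review above.

```python
import math


def simulate_scale_out(demand_rps, per_instance_rps, initial_instances, warmup_minutes):
    """Return the unserved requests per second for each minute of demand.

    Hypothetical model: every minute the scaler launches enough instances to
    cover current demand, but a launched instance only serves traffic after
    `warmup_minutes` (bootstrap, application startup, sanity checks).
    """
    ready = initial_instances
    pending = []  # list of (minute_ready, instance_count) still warming up
    shortfall = []

    for minute, demand in enumerate(demand_rps):
        # Instances whose warm-up has finished start serving traffic.
        ready += sum(count for ready_at, count in pending if ready_at == minute)
        pending = [(ready_at, count) for ready_at, count in pending if ready_at > minute]

        capacity = ready * per_instance_rps
        shortfall.append(max(0, demand - capacity))

        # Launch only what is not already running or warming up.
        needed = math.ceil(demand / per_instance_rps)
        in_flight = sum(count for _, count in pending)
        to_launch = needed - ready - in_flight
        if to_launch > 0:
            pending.append((minute + warmup_minutes, to_launch))

    return shortfall


if __name__ == "__main__":
    # Assumed traffic: 1,000 rps for two minutes, then a jump to 3,000 rps.
    demand = [1000] * 2 + [3000] * 8
    for warmup in (5, 1):
        gaps = simulate_scale_out(demand, per_instance_rps=500,
                                  initial_instances=2, warmup_minutes=warmup)
        print(f"warm-up {warmup} min -> unserved rps per minute: {gaps}")
```

In this model, the 5-minute warm-up leaves the traffic jump under-served for five consecutive minutes, while a 1-minute warm-up leaves it under-served for one. This is why shortening instance startup matters as much as the scaling policy itself.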
Security, operations, and monitoring
- Account quotas and security challenges – All environments (Development, Test, Pre-production, and Production) run in a single Amazon Web Services (AWS) account. Due to the company’s recent hypergrowth, we realized the importance of monitoring and planning for service limits. Because the network design didn’t change much from the initial design (that is, one Amazon Virtual Private Cloud (Amazon VPC) for each environment), we hit soft limits on Amazon VPC. We also hit a number of other service limits, such as the number of concurrent Amazon Simple Storage Service (Amazon S3) API calls, the number of AWS Identity and Access Management (IAM) users, the number of EC2 On-Demand Instances launched, and the number of launch configurations for EC2 Auto Scaling groups.
- Increase in troubleshooting and issue resolution time – The current logging solution and tools provide insights into individual system components but lack a coherent view of the data flow across systems. A major challenge is that logs cannot be correlated across systems, which adds complexity when troubleshooting production incidents.
- High availability – As mentioned in the Reliability Pillar whitepaper, overall system availability can be different from individual component availability. For the current design, the calculated availability with hard dependencies is 99%. Given how critical the system has become to the business, the current system design doesn’t meet the new service level objective of 99.99% availability.
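The availability gap can be illustrated with the arithmetic used in the Reliability Pillar whitepaper: when components are hard dependencies of one another, overall availability is the product of the individual availabilities. The sketch below is illustrative only. The helper names and the five component values are hypothetical, chosen so that the product lands near the 99% figure cited above; they are not the company's measured numbers.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def serial_availability(component_availabilities):
    """Availability of a request path in which every component is a hard dependency."""
    return math.prod(component_availabilities)


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical: five hard dependencies, each available 99.8% of the time.
    components = [0.998] * 5
    current = serial_availability(components)
    print(f"calculated availability: {current:.4%}")

    for label, value in (("current design (99%)", 0.99), ("new SLO (99.99%)", 0.9999)):
        print(f"{label}: {downtime_minutes_per_year(value):.2f} minutes of downtime per year")
```

At 99% availability the system can be unavailable for about 5,256 minutes (roughly 3.65 days) per year, while a 99.99% objective allows only about 53 minutes. Because availabilities multiply, each additional hard dependency lowers the ceiling for the whole system.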
In this first blog, we defined hypergrowth and explained the general challenges it can present to companies. In subsequent blogs, we will talk about architectural patterns, tools, and strategies to address these challenges. We will also show how you can make incremental changes to achieve a cloud-native architecture. In the next blog, we talk about design patterns to maximize system throughput.
Other blogs in this series
- Journey to Adopt Cloud-Native Architecture Series: #2 – Maximizing System Throughput
- Journey to Adopt Cloud-Native Architecture Series: #3 – Improved Resilience and Standardized Observability
- Journey to Adopt Cloud-Native Architecture Series: #4 – Governing Security at Scale and IAM Baselining
- Journey to Adopt Cloud-Native Architecture Series: #5 – Enhancing Threat Detection, Data Protection, and Incident Response