Journey to Adopt Cloud-Native Architecture Series: #2 – Maximizing System Throughput
In the last blog, Preparing your Applications for Hypergrowth, we talked about hypergrowth and the technical challenges it presents to companies. As a reminder, we presented an example ecommerce company running a monolithic application on Amazon Elastic Compute Cloud (Amazon EC2). This application connects to Amazon Relational Database Service (Amazon RDS).
The company recently experienced a hypergrowth event where user traffic grew tenfold within a few days. During this event, we observed degraded performance at peak times. In this blog, we talk about improving performance by maximizing system throughput through incremental improvements to the application and database layers.
Maximize system throughput
In the following subsections, we explain system improvements we made to maximize the company’s system throughput and get consistent performance from their application stack.
Configure synchronized transaction timeout
Transactions were timing out in the application, database, and network intermittently for different reasons:
- When a transaction starts at the load balancer (connection idle timeout)
- When it goes to application (SDK timeout)
- When it makes a database connection (connection timeout)
- When it runs a query (statement timeout)
We set timeouts to increase from the backend to the front end so that rollbacks happen properly. For example, we increased the database connection timeout from 3 seconds to 10 seconds to accommodate long-running queries.
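As a minimal sketch of this ordering (not our production configuration), the following Python snippet checks that each layer's timeout is shorter than the timeout of the layer that wraps it. Only the 10-second database connection timeout comes from the text above. The other values, the `Timeouts` class, and the `misordered_layers` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeouts:
    """Timeouts in seconds, listed from the innermost layer outward."""
    statement: float      # query (statement timeout)
    db_connection: float  # database connection timeout
    sdk: float            # application (SDK timeout)
    lb_idle: float        # load balancer (connection idle timeout)


def misordered_layers(t: Timeouts) -> list[str]:
    """Describe every adjacent pair of layers whose timeouts do not increase outward."""
    layers = [
        ("statement", t.statement),
        ("db_connection", t.db_connection),
        ("sdk", t.sdk),
        ("lb_idle", t.lb_idle),
    ]
    problems = []
    for (inner_name, inner), (outer_name, outer) in zip(layers, layers[1:]):
        if inner >= outer:
            problems.append(
                f"{inner_name} ({inner}s) should be shorter than {outer_name} ({outer}s)"
            )
    return problems


if __name__ == "__main__":
    # Assumed values, except db_connection=10 which is taken from the text.
    config = Timeouts(statement=5, db_connection=10, sdk=20, lb_idle=60)
    print(misordered_layers(config) or "timeouts increase from backend to front end")
```

A check like this could run at deployment time so that a change to one layer's timeout does not silently break the ordering.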
Introduce connection pooling
After reading the Resources Consumed by Idle PostgreSQL Connections blog post, we identified that idle connections were consuming memory and CPU. The post explains that opening more connections doesn’t necessarily increase throughput and can instead leave many connections idle. This has the opposite effect and degrades performance, as explained in the Performance Impact of Idle PostgreSQL Connections blog post.
As a solution, we implemented connection pooling via Amazon RDS Proxy to improve overall performance and reduce unnecessary resource utilization. The proxy does the heavy lifting, so we don’t have to build pooling into our applications. It enables us to reuse database connections, which improved overall scalability and resiliency. To tune performance, we tested applications against a small number of connections and stepped the number up until overall throughput reached a plateau. We observed that adding more connections beyond the plateau yielded diminishing returns.
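The step-up test described above can be sketched as follows. This is an illustrative helper, not part of Amazon RDS Proxy: the `find_plateau` function, the 5% gain threshold, and the sample measurements are all assumptions.

```python
def find_plateau(measurements, min_gain=0.05):
    """Return the connection count after which throughput stops improving meaningfully.

    measurements: list of (connections, throughput) pairs sorted by connections.
    min_gain: smallest relative improvement (5% here, an assumed threshold)
              that justifies stepping up to the next connection count.
    """
    best_size, best_throughput = measurements[0]
    for size, throughput in measurements[1:]:
        if throughput < best_throughput * (1 + min_gain):
            break  # diminishing returns: stop stepping up
        best_size, best_throughput = size, throughput
    return best_size


if __name__ == "__main__":
    # Hypothetical load-test results: (connections, requests per second).
    results = [(10, 1000), (20, 1900), (40, 3000), (80, 3050), (160, 3020)]
    print(find_plateau(results))
```

In this made-up data set, going from 40 to 80 connections adds less than 5% throughput, so the helper settles on 40.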
Introduce read replicas
To address read scaling in the database cluster, we added multiple read replicas to handle read-only queries. These read replicas also handle one-time processes that fetch data. Separating reads from writes in this way is also known as the Command Query Responsibility Segregation (CQRS) pattern.
We updated the application code to determine whether each API call needs read or write data access. Based on that need, the application code uses different endpoints for running queries. This allows the application to offload read transactions to a different host and leaves more system resources available for write throughput. Introducing read replicas boosted overall throughput and increased the resource buffer (CPU and RAM) on the writer database node.
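A minimal sketch of this endpoint selection is shown below. It assumes the read/write decision can be derived from the HTTP method of the API call, which is a simplification. The `EndpointRouter` class and the endpoint names are hypothetical.

```python
import itertools

# Assumption: these HTTP methods never modify data in the API.
READ_ONLY_METHODS = {"GET", "HEAD"}


class EndpointRouter:
    """Choose a database endpoint based on the type of API call."""

    def __init__(self, writer, readers):
        self.writer = writer
        # Rotate through the read replicas in round-robin order.
        self._readers = itertools.cycle(readers) if readers else None

    def endpoint_for(self, http_method):
        if http_method.upper() in READ_ONLY_METHODS and self._readers is not None:
            return next(self._readers)
        # Writes, and reads when no replica is configured, go to the writer.
        return self.writer


if __name__ == "__main__":
    router = EndpointRouter(
        writer="writer.db.example.internal",      # hypothetical endpoint
        readers=["reader-1.db.example.internal",  # hypothetical endpoints
                 "reader-2.db.example.internal"],
    )
    for method in ("GET", "POST", "GET"):
        print(method, "->", router.endpoint_for(method))
```

Reads that must see a just-committed write may still need the writer endpoint, because replicas can lag behind the primary.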
Introduce caching
We introduced a caching layer to improve performance and reduce latency when the user is waiting for a response. Though read replicas provided the ability to distribute read/write queries across different database instances, the queries still needed to be run. We needed an in-memory caching solution for submillisecond latency. After we reviewed the Performance at Scale with Amazon ElastiCache whitepaper, we were able to identify the right caching solution for different use cases.
We deployed Amazon ElastiCache for Redis in a lazy-loading pattern as mentioned in the Database Caching Strategies Using Redis whitepaper. Caching not only improved scale and performance but also helped to optimize the database cost.
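The lazy-loading pattern can be sketched as follows. To keep the example self-contained, a Python dictionary stands in for Redis. The `LazyLoadingCache` class, the time-to-live value, and the loader function are assumptions for illustration and are not taken from the whitepaper.

```python
import time


class LazyLoadingCache:
    """Populate the cache only when a requested key is missing or expired."""

    def __init__(self, load_from_db, ttl_seconds=300.0, clock=time.monotonic):
        self._load = load_from_db  # called on a cache miss
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}           # key -> (expires_at, value); stand-in for Redis

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: no database query
        value = self._load(key)    # cache miss: read from the database
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, for example after the underlying row is updated."""
        self._store.pop(key, None)


if __name__ == "__main__":
    def load_order(order_id):
        # Hypothetical loader; a real one would query the database.
        print(f"querying database for order {order_id}")
        return {"order_id": order_id, "status": "shipped"}

    cache = LazyLoadingCache(load_order, ttl_seconds=60)
    cache.get(42)  # miss: runs the loader
    cache.get(42)  # hit: served from memory
```

With lazy loading, only data that is actually requested occupies the cache, at the cost of one slower request on each miss.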
Introduce purge and data archiving
The orders table is one of the most read/write heavy tables in the database. However, there was no business need to keep historical information in the transactional database for completed orders. So, we worked with our internal teams to define a data retention policy. We created tables partitioned by date, which allows us to analyze partitions with records that can be archived or purged based on the retention policy.
Applying these techniques reduced the table size by 60%. We also created automated jobs to purge older partitions and implemented backup jobs to archive data. Additionally, we adopted a strategy to migrate older data into Amazon Aurora Serverless (PostgreSQL-compatible) to offload infrequent data queries from the main database and optimize cost. This database supports users who want to view old orders on a one-time basis. (Note: Aurora Serverless v2 is also now available in preview.)
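To illustrate how a purge job might select partitions, here is a minimal sketch that assumes monthly partitions and a retention period expressed in months. The function names, the `orders_YYYY_MM` naming scheme, and the three-month retention value are hypothetical. The actual retention policy is not stated in this post.

```python
from datetime import date


def month_index(year, month):
    """Map a (year, month) pair to a single increasing integer."""
    return year * 12 + (month - 1)


def partitions_to_purge(partition_months, today, retention_months):
    """Return the (year, month) partitions that fall outside the retention window.

    The current month and the `retention_months` months before it are kept.
    """
    cutoff = month_index(today.year, today.month) - retention_months
    return sorted(
        (year, month)
        for (year, month) in partition_months
        if month_index(year, month) < cutoff
    )


def partition_name(year, month):
    """Hypothetical naming scheme for monthly partitions, e.g. orders_2021_03."""
    return f"orders_{year}_{month:02d}"


if __name__ == "__main__":
    existing = [(2021, m) for m in range(1, 8)]  # January through July 2021
    for year, month in partitions_to_purge(existing, date(2021, 7, 15), retention_months=3):
        print("archive, then drop:", partition_name(year, month))
```

Because whole partitions are removed, a job like this avoids row-by-row deletes on the busy orders table.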
Optimize queries
As we identified the different bottlenecks in database throughput, we used Amazon RDS Performance Insights to find bottlenecks in SQL queries. We used metrics like “read time per call” and “write time per call” to recognize the queries that took the most time to read and write. When we analyzed deeper, we observed that some queries were used by downstream online analytical processing (OLAP) processes. Before further optimizing the queries, we moved SELECT queries to a separate read replica to free up resources on the primary database.
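As a rough illustration of this kind of ranking, the sketch below sorts query statistics by a per-call metric. It does not call any AWS API. The `slowest_queries` helper, the dictionary field names, and the sample rows are all hypothetical, standing in for metrics exported from a monitoring tool.

```python
def slowest_queries(stats, metric, top_n=3):
    """Return the queries with the highest value for the given per-call metric."""
    ranked = sorted(stats, key=lambda row: row[metric], reverse=True)
    return [row["query"] for row in ranked[:top_n]]


if __name__ == "__main__":
    # Made-up sample data; times are in milliseconds per call.
    stats = [
        {"query": "SELECT ... FROM orders", "read_ms_per_call": 42.0, "write_ms_per_call": 0.0},
        {"query": "UPDATE orders SET ...", "read_ms_per_call": 3.1, "write_ms_per_call": 18.5},
        {"query": "SELECT ... FROM customers", "read_ms_per_call": 7.4, "write_ms_per_call": 0.0},
    ]
    print(slowest_queries(stats, "read_ms_per_call", top_n=2))
    print(slowest_queries(stats, "write_ms_per_call", top_n=1))
```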
Introduce database sharding
Because one of our databases already runs on the largest available instance, we had to find a way to scale it horizontally. Sharding allowed us to spread the data across multiple writer nodes. As described in this Sharding with Amazon Relational Database Service blog post, there are several approaches to sharding, each with pros and cons.
In our case, the order management component had the most contention. We used “customer_id” as the partition key and split the data into different shards. We developed a proxy layer to decide the mapping between customers and shards.
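One simple way such a mapping could work is a stable hash of “customer_id” taken modulo the number of shards, sketched below. This is an assumed scheme for illustration and not necessarily what our proxy layer does. The `shard_for_customer` function and the shard endpoint names are hypothetical.

```python
import hashlib


def shard_for_customer(customer_id, shard_endpoints):
    """Map a customer to a shard using a stable hash of the customer_id.

    hashlib is used instead of the built-in hash() because string hashes
    from hash() are randomized per process and would not be stable.
    """
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shard_endpoints)
    return shard_endpoints[index]


if __name__ == "__main__":
    shards = [  # hypothetical shard endpoints
        "orders-shard-0.example.internal",
        "orders-shard-1.example.internal",
        "orders-shard-2.example.internal",
        "orders-shard-3.example.internal",
    ]
    for customer_id in (1001, 1002, 1003):
        print(customer_id, "->", shard_for_customer(customer_id, shards))
```

A modulo-based mapping moves most customers to a different shard when the shard count changes. A lookup table that stores each customer's shard explicitly avoids that, at the cost of maintaining the table.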
Conclusion
In this blog, we talked about design patterns you can adopt to address scaling challenges that can arise due to hypergrowth. These architecture patterns not only maximize system throughput but also prepare your applications for incremental improvements toward a highly scalable and resilient architecture. In the next blog, Improved Resilience and Standardized Observability, we provide more design patterns to help you in your journey to adopt cloud-native architecture.
Other blogs in this series
- Journey to Adopt Cloud-Native Architecture Series: #1 – Preparing your Applications for Hypergrowth
- Journey to Adopt Cloud-Native Architecture Series: #3 – Improved Resilience and Standardized Observability
- Journey to Adopt Cloud-Native Architecture Series: #4 – Governing Security at Scale and IAM Baselining