How to plan for peak demand on an AWS serverless digital-commerce platform

Lessons from top retailers who manage Black Friday and other peak events on Amazon Web Services (AWS)

Traditionally, retailers experienced only one or two peak shopping events per year, with Black Friday the most common example. But in an increasingly unpredictable economy, organizations need agile platforms that can respond to demand increases whenever and wherever they arise. For example, during the COVID-19 pandemic, many retailers were forced to close their brick-and-mortar stores and move online to survive. In a digital-commerce environment, products and services can go viral in minutes, making it hard for retailers to keep up with surges in traffic.

In the past, retailers used traditional demand forecasting and capacity-planning techniques to manage demand swings—both up and down. In practice, however, retailers can easily overspend if demand turns out to be lower than predicted, and customer experience and sales can suffer if retailers don’t provision enough IT infrastructure to cover traffic spikes.

Fortunately, with the emergence of serverless technologies, retailers can automatically scale their ecommerce systems to meet almost any level of demand. Serverless technologies also feature a pay-for-use billing model to control costs and reduce infrastructure-management tasks, like capacity provisioning and patching. With these tools, retailers can spend more time delivering value to their customers.

Let’s look at how leading retailers run their ecommerce platforms on serverless technologies to thrive through peak periods.

Streamline your peak planning architecture for serverless

When designing your architecture, we suggest following the AWS Well-Architected Framework from Amazon Web Services (AWS), which consists of six best-practice pillars. In this post, we focus on one critical pillar: reliability. The reliability pillar covers the ability of a workload to perform its intended functions and recover quickly from incidents so that it continues to meet demand.

Building a serverless architecture provides a lot of benefits out of the box, but it performs at its best only when you follow best practices. By following AWS Well-Architected best practices, you can design a scalable and cost-efficient architecture so that you can build and run applications without thinking about servers. Using the Serverless Application Lens for the AWS Well-Architected Framework, you can drill down into specific serverless design principles that help prepare your architecture for peak workloads.

Most serverless ecommerce architectures are composed of microservices. By design, microservices do not require any specific type of architecture, but many AWS customers use a combination of the following components:

Compute—such as AWS Fargate, a serverless, pay-as-you-go compute engine, or AWS Lambda, a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service
Storage—such as Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at virtually any scale, or Amazon Simple Storage Service (Amazon S3), an object storage service offering industry-leading scalability
Messaging—such as Amazon Simple Notification Service (Amazon SNS), which sends notifications in two ways, application-to-application (A2A) and application-to-person (A2P); Amazon Simple Queue Service (Amazon SQS), which lets you send, store, and receive messages between software components; Amazon EventBridge, a serverless event bus; or Amazon Kinesis, which makes it easy to collect, process, and analyze real-time, streaming data

What’s the best way to plan for expected (or unexpected) peaks in your ecommerce application? Here are four key steps followed by many of our retail customers who run serverless ecommerce websites through high-volume events like Black Friday, large sales promotions, social media campaigns, and more.

1. Prepare your architecture

Prepare your serverless compute for peaks using AWS Lambda

To absorb an unexpected surge in demand, we recommend increasing your concurrency. That means raising the number of requests your functions can serve at any given time (the concurrency quota applies per account in each AWS Region), giving you a significant buffer to handle sudden spikes in traffic. For example, UK supermarket leader Waitrose was able to sustain a massive peak in customer traffic during the early days of the COVID-19 lockdown by using AWS Lambda and other serverless services to quickly scale compute capacity.
Improve your code. This might sound obvious, but inefficient code can have negative downstream impacts on system performance during peak workload events. If you reduce your AWS Lambda execution time, you can increase your potential transaction throughput within the same concurrency limit, which can boost application performance. Learn best practices for working with AWS Lambda functions here. Some customers have started using Amazon CodeGuru Profiler to improve application performance by analyzing runtime behavior and quickly finding the root causes of latency.
Improve orchestration. If you limit each of your AWS Lambda functions to a single task, it becomes easier to orchestrate them using AWS Step Functions, a visual workflow service, which also helps protect your dependencies (see “Protect your dependencies” later in this article). For example, in an ecommerce scenario, you can use AWS Step Functions to orchestrate the entire order-processing flow, starting with capturing the order at checkout and then initiating the rest of the processing events (payment, stock reservation, email confirmation, and backend system processing). Watch this video to see how UK retailer River Island does this using AWS Lambda, AWS Step Functions, and Amazon DynamoDB.
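
To make the link between execution time and concurrency concrete, here is a minimal sketch of the usual back-of-the-envelope estimate: concurrency is roughly the request rate multiplied by the average duration of each request. The traffic figures and function names below are hypothetical, chosen only for illustration, and real workloads should leave headroom on top of this estimate.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_ms: float) -> int:
    """Estimate concurrent executions as arrival rate times average duration."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


def max_throughput(concurrency_limit: int, avg_duration_ms: float) -> float:
    """Requests per second that a fixed concurrency limit can sustain."""
    return concurrency_limit * 1000 / avg_duration_ms


# Hypothetical peak: 5,000 checkout requests per second.
before = required_concurrency(5000, avg_duration_ms=400)
after = required_concurrency(5000, avg_duration_ms=200)
print(f"Concurrency needed at 400 ms: {before}, at 200 ms: {after}")

# Same limit, faster code: throughput rises as duration falls.
print(max_throughput(1000, 400), max_throughput(1000, 200))
```

Under these assumptions, halving the execution time halves the concurrency you need for the same traffic, or doubles the traffic you can serve within the same limit.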

Designing Amazon DynamoDB architecture for peak traffic

Choosing the right partition key is a crucial step in designing and building scalable and reliable applications on top of Amazon DynamoDB. By allowing your workload to be spread across partitions, your database table can quickly scale from zero to millions of requests per second. When storing orders, for example, you could use a high-cardinality attribute like OrderId as an effective partition key. Learn more in this article.
Use auto scaling and provisioned capacity to achieve superior performance. Monitor your read and write capacity units against the quota for each table and account. Using Amazon DynamoDB, Amazon seamlessly supported one of the busiest ecommerce events of the year—Amazon Prime Day 2022—peaking at 105.2 million requests per second. (Learn more about service, account, and table quotas in Amazon DynamoDB here.)
As much as possible, use Amazon DynamoDB for high-throughput transactional data only. To help with this, you could offload long-term archives and analytics to Amazon S3 and then query the data using Amazon Athena, an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. You could also use purpose-built databases to separate latency-critical ecommerce workloads (such as login, search, and product information) from less speed-dependent workloads (operational reporting, order fulfillment, and payment). Learn more about Amazon DynamoDB best practices here.
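
The effect of partition-key cardinality can be illustrated with a small simulation. This is only a sketch: the hash function, the partition count, and the sample data are stand-ins we chose for the example, not how Amazon DynamoDB works internally. It compares a unique OrderId with a low-cardinality attribute (a hypothetical order status) as the key.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # illustrative only; DynamoDB manages partitions for you


def partition_for(key: str) -> int:
    """Map a key to a partition using a stand-in hash (not DynamoDB's own)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def busiest_share(keys: list[str]) -> float:
    """Fraction of all writes that land on the single busiest partition."""
    counts = Counter(partition_for(k) for k in keys)
    return max(counts.values()) / len(keys)


# 10,000 hypothetical orders: unique OrderId versus a status with 3 values.
order_ids = [f"order-{i}" for i in range(10_000)]
statuses = [("PENDING", "PAID", "SHIPPED")[i % 3] for i in range(10_000)]

print(f"OrderId as key: {busiest_share(order_ids):.1%} of writes on busiest partition")
print(f"Status as key:  {busiest_share(statuses):.1%} of writes on busiest partition")
```

With a unique key, writes spread roughly evenly across partitions. With only three distinct key values, at least a third of all writes always hit the same partition, no matter how many partitions exist.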

Plan for quotas and limits

Each AWS service comes with quotas, also known as limits. Some are fixed, and others (soft limits) can be increased on request. Quotas exist to protect you from consuming too many resources, whether through poor management or abuse. In the context of an ecommerce application, you should carefully monitor and control the throughput of your different components. For example, Amazon EventBridge applies a soft limit on the number of requests per second, which can be raised if needed. Learn more about managing quotas and limits here.
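
One simple way to plan for quotas is to compare your forecast peak against each current limit and flag anything with too little headroom. The sketch below assumes an 80 percent utilization threshold, and the quota names and values are hypothetical placeholders; you would read the real figures from your own account.

```python
from dataclasses import dataclass


@dataclass
class QuotaCheck:
    name: str
    quota: float          # current limit for the component
    expected_peak: float  # forecast peak usage, in the same unit

    @property
    def utilization(self) -> float:
        return self.expected_peak / self.quota

    def needs_increase(self, max_utilization: float = 0.8) -> bool:
        """True when the forecast leaves less headroom than desired."""
        return self.utilization > max_utilization


# Hypothetical names and numbers, for illustration only.
checks = [
    QuotaCheck("lambda-concurrent-executions", quota=1000, expected_peak=950),
    QuotaCheck("event-bus-requests-per-second", quota=10000, expected_peak=4000),
]
to_raise = [c.name for c in checks if c.needs_increase()]
print("Request an increase for:", to_raise)
```

Running a check like this well before the event gives you time to request increases for soft limits.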

2. Test your resilience

No matter how good your planning is, testing your hypotheses before and after go-live is crucial. Many customers already implement “classic” testing strategies as part of their deployment playbooks, including regression testing, load testing, and unit testing. But in distributed and complex environments, such as business-to-consumer (B2C) and ecommerce platforms, this is not enough. By relying only on predictable testing, you can end up either impeding agility (by running large, time-consuming test routines for each release) or increasing risk (by not testing at all).

With chaos engineering, you can test application resiliency by experimenting with various scenarios to build confidence in a system’s ability to manage unexpected events. Many retailers are now investing in fault-injection experiments in ecommerce to improve application performance, observability, and resiliency, and to help uncover potential issues before they impact end users. Learn more about testing best practices here.
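
A fault-injection experiment can be as small as wrapping a dependency so that a fraction of calls fail, then checking that the caller still behaves acceptably. The sketch below is a local, in-process illustration with hypothetical function names and a made-up 30 percent failure rate; it does not use any AWS fault-injection tooling.

```python
import random


class InjectedFault(Exception):
    """Raised by the experiment, not by real application code."""


def with_faults(func, failure_rate: float, rng: random.Random):
    """Wrap func so a fraction of calls fail, simulating a flaky dependency."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def get_promotions(product_id: str) -> list[str]:
    return [f"10% off {product_id}"]  # hypothetical promotions lookup


def render_product_page(product_id: str, promotions_call) -> dict:
    """Behaviour under test: the page must still render without promotions."""
    try:
        promos = promotions_call(product_id)
    except InjectedFault:
        promos = []
    return {"product": product_id, "promotions": promos}


rng = random.Random(42)  # fixed seed keeps the experiment repeatable
flaky = with_faults(get_promotions, failure_rate=0.3, rng=rng)
pages = [render_product_page("sku-1", flaky) for _ in range(1000)]
degraded = sum(1 for p in pages if not p["promotions"])
print(f"{len(pages)} pages rendered, {degraded} without promotions")
```

The hypothesis being tested is that every page renders even while the promotions lookup is failing; if a page failed outright, the experiment would have uncovered a hard dependency.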

3. Protect your dependencies

In a distributed-ecommerce architecture, components can communicate with each other in multiple ways, including point-to-point or decoupled. Although both strategies are possible, by decoupling functionalities, you can often avoid overwhelming your system. For example, your search service won’t be impacted by a shopping cart issue; or a surge in booking slots for deliveries won’t impair cart functionality. Designing your communications using queues and streaming instead of direct access can facilitate this decoupling.
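
To see why a queue protects downstream components, consider this small simulation of a traffic spike. It is a sketch only: an in-memory queue stands in for a managed queue such as Amazon SQS, and the arrival pattern and worker rate are hypothetical.

```python
from collections import deque


def simulate(arrivals_per_tick: list[int], worker_rate: int) -> dict:
    """Buffer incoming orders in a queue and drain them at the worker's pace."""
    queue = deque()
    pending = list(arrivals_per_tick)
    accepted = processed = max_backlog = ticks = 0
    while pending or queue:
        if pending:
            for _ in range(pending.pop(0)):
                queue.append(accepted)  # checkout only enqueues, then returns
                accepted += 1
        for _ in range(min(worker_rate, len(queue))):
            queue.popleft()             # downstream service works steadily
            processed += 1
        max_backlog = max(max_backlog, len(queue))
        ticks += 1
    return {
        "accepted": accepted,
        "processed": processed,
        "max_backlog": max_backlog,
        "ticks": ticks,
    }


# Hypothetical spike: 500 orders in one tick; the worker handles 100 per tick.
result = simulate([50, 500, 50, 0], worker_rate=100)
print(result)
```

Every order is accepted immediately and eventually processed. The spike shows up as a temporary backlog in the queue rather than as five times the normal load on the downstream service.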

Another key design principle is to separate workloads into multiple AWS accounts, which brings several benefits. In terms of performance and scaling, each account has its own quotas, which helps limit the blast radius of any problem. If a single component experiences a performance issue, the rest of your application can remain unaffected and run normally.

For ecommerce retailers, building a resilient distributed system architecture is critical. Although each system component can scale, you can further protect each microservice with architectural patterns like graceful degradation, which transforms hard dependencies into soft ones. This allows you to continue to display a product webpage even if your marketing promotions service is down. Similarly, by deploying a pattern called a circuit breaker, you can instruct an upstream system to stop sending requests to a service that is behaving improperly. The upstream system can then focus on responding to the customer quickly while keeping the misbehaving service from causing further issues. Read more about designing fault-tolerant distributed systems here, and check out this article, “Challenges with distributed systems,” in the Amazon Builders’ Library.
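
Here is a minimal sketch of a circuit breaker combined with a fallback for graceful degradation. The class, thresholds, and the failing promotions function are all hypothetical, and production implementations usually add more states, metrics, and thread safety.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # open: skip the dependency entirely
            self.opened_at = None   # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()       # degrade gracefully instead of failing
        self.failures = 0
        return result


calls = {"count": 0}


def promotions_down():
    calls["count"] += 1
    raise ConnectionError("promotions service unavailable")


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0, clock=lambda: now[0])
results = [breaker.call(promotions_down, fallback=lambda: []) for _ in range(10)]
print(f"{len(results)} requests answered, {calls['count']} reached the failing service")
```

In this run, all ten requests receive the fallback (an empty promotions list), but only the first three actually reach the failing service. The rest are short-circuited until the cool-down expires.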

4. Alert and monitor

The practices discussed above pay off only if you can monitor your application state and raise alerts on it. Start by defining one (or very few) high-level business metrics that matter. In the world of ecommerce, this could mean the rate of orders per minute or the number of items added to a shopping cart. You can use the Amazon CloudWatch Anomaly Detection feature to implement adaptive alarms for each metric. Starting from these high-level metrics, you can drill down into each component, and from there into technical subcomponents.
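
The idea behind an adaptive alarm is that the threshold is derived from recent behavior rather than fixed by hand. The sketch below uses a simple band of three standard deviations around the mean, with made-up orders-per-minute samples. It is a deliberately simplified stand-in, not the model that Amazon CloudWatch Anomaly Detection uses, which also accounts for trends and seasonality.

```python
from statistics import mean, stdev


def anomaly_band(history: list[float], width: float = 3.0) -> tuple[float, float]:
    """Expected range: mean plus or minus `width` standard deviations."""
    mu, sigma = mean(history), stdev(history)
    return mu - width * sigma, mu + width * sigma


def is_anomalous(value: float, history: list[float], width: float = 3.0) -> bool:
    low, high = anomaly_band(history, width)
    return value < low or value > high


# Hypothetical orders-per-minute samples from a normal trading period.
orders_per_minute = [118, 122, 120, 125, 119, 121, 123, 117, 120, 124]
low, high = anomaly_band(orders_per_minute)
print(f"Expected range: {low:.1f} to {high:.1f} orders per minute")
print(is_anomalous(121, orders_per_minute), is_anomalous(40, orders_per_minute))
```

A sudden drop to 40 orders per minute falls outside the band and would trigger an alert, which is often the first visible sign that a component further down the stack is struggling.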

In addition, many retailers are implementing observability capabilities to help trace transactions across a distributed architecture, down to each component, and across multiple cloud and on-premises environments. You can use AWS X-Ray, which provides a complete view of requests as they travel through your application, on each component to facilitate distributed tracing. You can also instrument your applications using AWS Distro for OpenTelemetry (a secure, production-ready, AWS-supported distribution of the OpenTelemetry project) to send correlated metrics and traces to multiple AWS and AWS Partner monitoring solutions.

Distributed architectures have many benefits but also introduce new complexities in how applications are monitored. Several best practices, such as structured logging, distributed tracing, and metrics monitoring, can all contribute toward achieving a high level of observability. Open-source tools, such as Powertools for AWS Lambda, can be used to implement these best practices.
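
As an illustration of structured logging, the sketch below emits each log entry as a single JSON object that carries a correlation ID. It uses only the Python standard library rather than any particular toolkit, and the field names and ID value are assumptions made for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so log queries can filter on fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": getattr(record, "service", None),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Hypothetical field names; pass the same correlation_id to every service.
logger.info("order captured", extra={"service": "checkout", "correlation_id": "req-123"})
entry = json.loads(stream.getvalue())
print(entry)
```

When every service logs the same correlation ID for a given request, you can follow a single order across components by filtering on that one field.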

Next steps

To learn more about architecture best practices for reliability, check out the AWS Well-Architected Framework and the Amazon Builders’ Library.

Contact your AWS account team or AWS Support to discuss planning for peak events. AWS Infrastructure Event Management (AWS IEM) can offer architecture and scaling guidance and operational support during the preparation and initiation of planned events, such as shopping holidays, product launches, and migrations. For these events, AWS IEM will help you assess operational readiness, identify and mitigate risks, and initiate your event confidently with AWS experts by your side.