AWS Cloud Operations Blog
Top considerations for Flash sale events
Introduction
Flash sale events happen when online stores offer deep discounts or promotions on their products, sell unique inventory for a short period, or launch new products. The inventory behind these sales is usually low and in high demand, and the promotions are valid only for the short sale period.
Flash sales often see a steep increase in website traffic, resulting in a surge of user requests. This can put a strain on your infrastructure, leading to slow loading times or even system crashes. An unprepared or under-prepared event could mean hitting scaling bottlenecks during the event, running out of available resources or, at worst, buyers being unable to access the site, with the resulting impact on brand and revenue. To prevent this, you need infrastructure whose readiness level can meet the demand.
We recommend customers use AWS Countdown for architectural and scaling guidance and operational support during the preparation and execution of planned flash sale events. Some customers, for a variety of business reasons, run flash sale campaigns with limited notice to their engineering teams and do not have the headroom to execute an infrastructure readiness check. This blog is for Site Reliability Engineers (SREs), infrastructure engineering teams, and product managers who support flash sales with limited notice from their business partners.
Step 1: Well-Architected DNA
A good architecture is necessary for a successful flash sale event. The AWS Well-Architected Framework offers architectural best practices to help build secure, high-performing, resilient, and efficient infrastructure for applications and workloads in the cloud. The Well-Architected Framework should be part of an organization’s best practices, and new changes should be reviewed against it. Customers should factor in the time and resources for this review, in partnership with their AWS team, during their planning.
Step 2: Understanding Flash Sale Characteristics
A thorough understanding of the business goals for a flash sale helps SREs and product managers manage the challenge and make the right trade-offs during the event. The business reasons for flash sales vary from industry to industry. For example, a retailer may want to bump up sales, increase impulse purchases, or clear inventory. A ticketing company may see a flash event when tickets for a popular artist go on sale. Additional business considerations may include Stock-keeping Units (SKUs), inventory size, SKU discovery behavior, customer purchase behavior, time taken to complete a purchase, duration of the flash sale, traffic volume, scaling pattern, and channels of purchase (desktop web browser, mobile app, or call center). While it’s difficult to gather or predict all of these factors, knowing them helps SREs identify the technical products, systems, and architecture components that will be subject to stress during flash sales.
Step 3: Determining Resource Requirements
The first step in preparing your AWS infrastructure is to quickly assess the capacity of your current infrastructure and identify what additional resources you need based on expected traffic volumes. This means estimating the high-water marks of resources and determining whether these exceed the service quotas of the AWS services involved.
Early engagement of your AWS account team, Technical Account Managers (TAMs), and Solutions Architects (SAs) is key. Work through questions such as the following with them:
- Does the flash sale application’s AWS account share AWS resources with non-flash sale applications?
- What resource constraints would the shared non-flash sale applications impose on the account’s overall service quotas? Use the Service Quotas console to view the current quotas for your AWS account. Not all services are covered in the Service Quotas console, so open an AWS Support case to get a detailed view.
- Are there any third-party dependencies? Would the SLAs of third-party operations drive your infrastructure choices? For example, is the flash sale application linked to a payment gateway that limits the number of payment service calls and could therefore throttle the sales application?
- Is the sales event happening across multiple geographies? Is the application architected with Amazon CloudFront to account for latency? Are the AWS services involved global or Regional, and has service quota parity across Regions been taken into consideration?
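As a rough illustration of the high-water mark estimate described above, the following Python sketch projects peak usage for one quota and flags whether an increase looks necessary. The `QuotaCheck` type, the traffic multiplier, and the 20% safety margin are illustrative assumptions, not AWS guidance, and the figures would come from your own monitoring and the Service Quotas console.

```python
from dataclasses import dataclass


@dataclass
class QuotaCheck:
    """One quota to examine. All figures are supplied by you (hypothetical here)."""
    name: str             # free-text label, e.g. "Lambda concurrent executions"
    quota: float          # current account-level quota
    baseline_peak: float  # highest total usage observed on a normal day
    shared_usage: float   # part of that peak used by non-flash sale applications


def projected_peak(check: QuotaCheck, traffic_multiplier: float) -> float:
    """Scale only the flash sale application's share of the peak.

    Assumption: usage by the shared, non-flash sale applications stays flat.
    """
    own_peak = check.baseline_peak - check.shared_usage
    return own_peak * traffic_multiplier + check.shared_usage


def needs_increase(check: QuotaCheck, traffic_multiplier: float,
                   safety_margin: float = 0.2) -> bool:
    """True when the projected peak plus a safety margin exceeds the quota."""
    return projected_peak(check, traffic_multiplier) * (1 + safety_margin) > check.quota


if __name__ == "__main__":
    check = QuotaCheck("Lambda concurrent executions", quota=1000,
                       baseline_peak=300, shared_usage=100)
    for multiplier in (2, 4):
        print(multiplier, projected_peak(check, multiplier),
              needs_increase(check, multiplier))
```

Separating shared usage from the flash sale application's own usage keeps the estimate from over-scaling workloads that the sale does not affect.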
Step 4: Identifying Scaling Capabilities
AWS offers several ways to manage scaling, both at the individual service level and in aggregate through AWS Auto Scaling. AWS Auto Scaling helps you set up scaling for Amazon EC2 Auto Scaling groups, Amazon Elastic Container Service (Amazon ECS), Amazon EC2 Spot Fleets, Amazon Aurora Replicas, and Amazon DynamoDB through a unified interface. As traffic varies during a sale event, AWS Auto Scaling adjusts capacity to maintain steady, predictable performance at the lowest possible cost. At the individual service level, Amazon EC2 Auto Scaling integrates with Elastic Load Balancing (ELB) to scale and load-balance your application. The following list identifies some mechanisms for granular scaling at the individual service level.
- ELB prewarming: Prewarming sets a floor on ELB capacity. Once prewarmed, a load balancer still scales up organically with traffic but does not drop below the prewarmed level.
- Lambda provisioned concurrency: Spiky traffic may warrant keeping the AWS Lambda functions in the launch pipeline ‘warm’. Provisioned concurrency initializes execution environments ahead of time, which reduces ‘cold start’ latency so that functions can respond in double-digit milliseconds. Engineering teams supporting the launch can determine the provisioned concurrency each function needs and enable it in the AWS console.
- Auto scaling: Amazon DynamoDB, Amazon ECS, AWS Fargate, and Amazon EMR offer auto scaling. In addition, services in the launch path should be prewarmed when a steep traffic increase is forecast. Common services in the launch path include Amazon RDS, NAT gateways, AWS AppSync, Amazon API Gateway, the Amazon CloudFront cache, and DynamoDB.
Estimated traffic patterns help decide whether auto scaling, prewarming, or a combination of the two is the right option for your architecture. Note: Prewarming may incur additional costs for the duration of the flash sale.
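To size provisioned concurrency, a common starting point is that the concurrency a function needs is roughly its request rate multiplied by its average duration. The sketch below applies that relationship. The function names and the 10% buffer are assumptions for illustration, and the inputs should come from your own traffic forecast and measured durations.

```python
import math


def estimate_concurrency(peak_rps: int, avg_duration_ms: int) -> int:
    """Concurrent executions needed: requests per second x average duration in seconds."""
    return math.ceil(peak_rps * avg_duration_ms / 1000)


def with_buffer(concurrency: int, buffer_pct: int = 10) -> int:
    """Add headroom on top of the estimate (buffer size is an assumption)."""
    return math.ceil(concurrency * (100 + buffer_pct) / 100)


if __name__ == "__main__":
    # Hypothetical forecast: 500 requests/second, 200 ms average duration.
    needed = estimate_concurrency(peak_rps=500, avg_duration_ms=200)
    print(needed, with_buffer(needed))
```

The resulting number is only an input to the decision. Account-level concurrency quotas and the cost of keeping environments warm for the sale window still need to be checked.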
Step 5: Implementing Scaling
Once you have determined your resource requirements and auto scaling needs, it’s time to scale the required AWS services. Typically, the launch engineering team or SREs work with AWS to get the resources provisioned, starting with self-serve tools such as AWS CloudFormation, the AWS CDK, and AWS CI/CD tools that are already part of their launch pipeline.
You can open an AWS Support case with your AWS account numbers, event date, event start time, duration, and the Amazon Resource Names (ARNs) of the resources that are part of the flash sale applications. You can identify the high-water marks in collaboration with your AWS TAM team to create a baseline for services that are already running close to your service quotas. You should then request service quota increases based on that baseline and the expected traffic.
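Collecting the case details in a consistent structure makes them easier to check before a short-notice request goes out. The following sketch is one hypothetical way to do that in Python. The `FlashSaleEvent` type and its fields are assumptions that mirror the items listed above, and the validation only checks basic shape (12-digit account IDs, ARNs starting with `arn:`). It does not call any AWS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FlashSaleEvent:
    """Details to include in a support case (hypothetical structure)."""
    account_ids: list[str]
    start: datetime
    duration: timedelta
    resource_arns: list[str]

    def validate(self) -> list[str]:
        """Return a list of basic formatting problems; empty means none found."""
        problems = []
        for account_id in self.account_ids:
            if not (account_id.isdigit() and len(account_id) == 12):
                problems.append(f"account ID {account_id!r} is not 12 digits")
        for arn in self.resource_arns:
            if not arn.startswith("arn:"):
                problems.append(f"{arn!r} does not look like an ARN")
        return problems

    def summary(self) -> str:
        """Plain-text summary suitable for pasting into a case description."""
        end = self.start + self.duration
        lines = [
            f"Accounts: {', '.join(self.account_ids)}",
            f"Event window: {self.start.isoformat()} to {end.isoformat()}",
            "Resources:",
        ]
        lines += [f"  - {arn}" for arn in self.resource_arns]
        return "\n".join(lines)


if __name__ == "__main__":
    event = FlashSaleEvent(
        account_ids=["123456789012"],  # placeholder account ID
        start=datetime(2030, 1, 1, 9, 0),
        duration=timedelta(hours=2),
        resource_arns=["arn:aws:lambda:us-east-1:123456789012:function:checkout"],
    )
    print(event.validate() or event.summary())
```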
Step 6: Availability and Reliability
Businesses invest significant resources in flash sale campaigns. It is imperative to understand the AWS service SLAs and to evaluate component, service, and Availability Zone (AZ) failure scenarios, including Regional unavailability. Amazon RDS, DynamoDB, and Amazon S3 are designed to provide scalable storage and database services that handle high traffic loads, and they can be considered, across Regions, for multi-Region failover scenarios, along with Amazon CloudFront and AWS Global Accelerator for the application stack.
Additionally, sharing the flash sale calendar with AWS teams helps ensure the availability of resources and, if needed, allows additional resource requests to be placed before the flash event.
Step 7: Testing and Optimization
Performance testing provides data to make scaling decisions. If there isn’t time for full-fledged performance testing before a flash sale event, an abridged version can be run on key system components to understand scaling behavior, bottlenecks, and resource requirements. Performance optimization involves fine-tuning your infrastructure under simulated load conditions. Findings from testing drive actions such as optimizing databases, implementing caching, or tuning load balancing strategies. We recommend engaging the AWS Solutions Architects on your account team to perform a Well-Architected review of the workloads before the flash sale season. This helps bake in the architectural recommendations and determine the high-water marks of the services involved.
AWS Solutions such as Distributed Load Testing (DLT) on AWS use the common services involved in a launch pipeline and can be run on a recurring schedule or during AWS game days to simulate the flash sale load.
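One way to read the results of an abridged load test is to look for the first load step at which tail latency crosses an acceptable limit. The sketch below assumes you have latency samples (in milliseconds) grouped by target requests per second. The function names, the 95th percentile, and the 500 ms limit are illustrative choices, not output of any AWS tool.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def first_saturated_step(results: dict[int, list[float]],
                         pct: int = 95, limit_ms: float = 500):
    """Return the lowest target RPS whose percentile latency exceeds the limit.

    `results` maps each tested target RPS to its latency samples in milliseconds.
    Returns None when every step stays within the limit.
    """
    for rps in sorted(results):
        if percentile(results[rps], pct) > limit_ms:
            return rps
    return None


if __name__ == "__main__":
    # Hypothetical samples from three load steps.
    results = {
        100: [50.0] * 20,
        200: [60.0] * 19 + [700.0],
        400: [800.0] * 20,
    }
    print(first_saturated_step(results))
```

The step just below the saturated one gives a rough high-water mark for the component under test, which can then feed quota and prewarming requests.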
Step 8: Monitoring
Testing your infrastructure under simulated load is crucial, but it has to be completed before the event. Once the flash sale is live, monitoring tells you how the system is actually behaving. AWS offers tools like Amazon CloudWatch and AWS X-Ray, which allow you to monitor system performance and identify potential bottlenecks. Application telemetry provides insight into the state of the applications and informs the way you operate and evolve your flash sale mechanisms.
During a flash sale event, a launch runbook is enacted, and SRE teams along with the supporting tech partners are mobilized as part of a war room. We recommend that SREs and application teams create custom CloudWatch dashboards for critical services as part of their health monitoring. These dashboards can be shared in the war room with the AWS team of TAMs and SAs, with AWS service specialists on standby as needed. AWS TAMs can monitor any services running hot, and proactive measures can be taken to adjust service quota ceilings and API thresholds.
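A custom CloudWatch dashboard is defined by a JSON body containing a list of widgets. The following Python sketch builds such a body for three services that often sit in a launch path. The resource names (`checkout`, `app/flash-sale/abc123`, `orders`), the Region, and the choice of metrics are placeholders, and the helper functions are hypothetical. The resulting string could be supplied as the dashboard body when creating the dashboard, for example through the console's source editor or the `PutDashboard` API.

```python
import json


def metric_widget(title, namespace, metric, dim_name, dim_value,
                  stat, x, y, region="us-east-1"):
    """Build one metric widget; the dashboard grid is 24 units wide."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": [[namespace, metric, dim_name, dim_value]],
            "stat": stat,
            "period": 60,  # one-minute granularity for a short event
            "region": region,
        },
    }


def build_dashboard_body(widgets) -> str:
    """Serialize widgets into the JSON string a dashboard definition expects."""
    return json.dumps({"widgets": widgets})


# Placeholder resource names; replace with the resources in your launch path.
WIDGETS = [
    metric_widget("Lambda concurrency", "AWS/Lambda", "ConcurrentExecutions",
                  "FunctionName", "checkout", "Maximum", x=0, y=0),
    metric_widget("ALB requests", "AWS/ApplicationELB", "RequestCount",
                  "LoadBalancer", "app/flash-sale/abc123", "Sum", x=12, y=0),
    metric_widget("DynamoDB writes", "AWS/DynamoDB", "ConsumedWriteCapacityUnits",
                  "TableName", "orders", "Sum", x=0, y=6),
]
BODY = build_dashboard_body(WIDGETS)

if __name__ == "__main__":
    print(BODY)
```

Keeping the dashboard definition in code means it can be reviewed alongside the launch runbook and recreated quickly for the next event.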
Step 9: Retrospective
Post-event analysis helps identify sub-optimal processes and provides an opportunity to dive deep into issues that occurred during the event. For AWS workloads, we recommend you perform a post-launch review with your AWS SAs and TAM team. This review covers items such as launch playbook timelines, dependent services, the health of the services involved, ramp-down of custom resource requests, monitoring effectiveness, the communication plan, and areas for improvement. The resulting changes can be implemented before the next flash sale event.
Step 10: Continuous Improvement
Each flash sale event creates an opportunity to review growth areas. The learnings can be shared across teams and used to improve operational excellence. Over time, SREs can develop specific standard operating procedures (SOPs) for different categories of flash events. These SOPs include details on traffic forecasting from business drivers, application scaling, monitoring dashboards, and the escalation matrix. These reviews also lead to investments in people, process, and technology. Some examples of technical investments include:
- Queue management: Incoming user requests on the website during a flash sale can be buffered to manage traffic bursts using Virtual Waiting Room on AWS or other queuing mechanisms.
- Block unwanted traffic: Malicious and unwanted traffic can be filtered using AWS Web Application Firewall (AWS WAF).
- Monitoring: Flash sale workloads can be monitored proactively using AWS Incident Detection and Response.
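To make the queue management idea concrete, here is a minimal in-memory sketch of a waiting room that caps the number of active shoppers and admits queued users in arrival order as others leave. It is a simplified illustration under the assumption of a single process and is not how Virtual Waiting Room on AWS is implemented. The `WaitingRoom` class and its methods are hypothetical.

```python
from collections import deque


class WaitingRoom:
    """Cap concurrent active users; queue the rest in first-come, first-served order."""

    def __init__(self, max_active: int):
        self.max_active = max_active
        self.active = set()
        self.waiting = deque()

    def arrive(self, user_id: str) -> int:
        """Admit the user if there is room, otherwise queue them.

        Returns 0 when admitted, or the user's 1-based position in the queue.
        """
        if len(self.active) < self.max_active and not self.waiting:
            self.active.add(user_id)
            return 0
        self.waiting.append(user_id)
        return len(self.waiting)

    def leave(self, user_id: str) -> list:
        """Remove a finished user and admit waiting users into the freed capacity."""
        self.active.discard(user_id)
        admitted = []
        while self.waiting and len(self.active) < self.max_active:
            next_user = self.waiting.popleft()
            self.active.add(next_user)
            admitted.append(next_user)
        return admitted


if __name__ == "__main__":
    room = WaitingRoom(max_active=2)
    for user in ("a", "b", "c", "d"):
        print(user, room.arrive(user))
    print(room.leave("a"))
```

A production setup would keep this state in a shared store and issue signed tokens to admitted users, but the admission logic follows the same pattern: protect the backend with a fixed ceiling and let the queue absorb the burst.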
Conclusion
Successful flash sales rely on a robust and scalable infrastructure. You can ensure your AWS infrastructure is ready for your flash sale by estimating your resource requirements, leveraging scaling techniques, and continually monitoring system performance.
Call to action:
Here are a few resources to help you get started on preparing more robustly for flash sale events: