Rate Limiting Strategies for Serverless Applications

Serverless technologies reduce the work needed to set up and maintain computing resources, provide built-in scalability, and optimize agility, performance, cost, and security. The pay-as-you-go model is particularly liberating for developers. You can fail fast, experiment more, and do it fairly cheaply. However, serverless brings its own challenges. In this blog post, we'll examine two of them and show you how to limit their impact on your applications:

  1. Runaway costs. How do you keep track of and control costs in this development model? In a typical on-premises environment, the limits of your exposure are controlled by the hardware you provision. Serverless scales quickly, and a piece of poorly written code could rapidly scale up and consume resources, making it difficult to control runaway costs.
  2. Overwhelming downstream components. A typical application combines serverless and non-serverless technologies. The serverless piece of the architecture can scale rapidly to match demand. But downstream components with fixed capacities (like a relational database with a set capacity of connections) could be overwhelmed. Implementing a throttling mechanism can prevent events from propagating to multiple downstream components with unexpected side effects.

Rate limiting on serverless applications

In this section, we show you two common strategies to use when rate limiting serverless applications.

Strategy 1 – Synchronous invocation

For synchronous invocation, Figure 1 shows a classic serverless architecture.

Figure 1. Synchronous invocation

On the compute layer, AWS Lambda, a serverless compute service, lets you run code without provisioning or managing servers. It automatically scales your application by running code in response to each event. However, the downstream systems that this Lambda function calls may not be scalable. To prevent the downstream systems from being overwhelmed by a large flood of events, you can put a rate limiting mechanism in front of them. This throttling mechanism can be applied to different components of the architecture; it both saves costs and makes your system more robust.

Note that there is a concurrency limit for Lambda functions per Region, shared by all functions in that Region. This is especially important if you have many Lambda functions for different workloads deployed in one Region: you will get throttled if you exceed the concurrency limit. The Managing AWS Lambda Function Concurrency blog post explains the Lambda concurrency limit in more detail. Generally speaking, you want to configure a reserved concurrency limit per Lambda function to control throughput for different workloads, and leave the unreserved limit as a common pool for non-critical workloads.
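
As a minimal sketch using the AWS SDK for Python (boto3), you could reserve concurrency for a critical function as shown below; the function name and limit are hypothetical placeholders, not recommendations.

import boto3

lambda_client = boto3.client("lambda")

# Reserve 100 concurrent executions for a critical function (hypothetical name).
# The remaining unreserved account concurrency stays available as a common pool
# for non-critical workloads.
lambda_client.put_function_concurrency(
    FunctionName="critical-orders-function",
    ReservedConcurrentExecutions=100,
)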

When you want to expose your Lambda function, you can create a web API with an HTTP endpoint in front of it. You can configure rate limiting for your API through API Gateway throttling settings. This prevents your API from being overwhelmed by too many requests. Amazon API Gateway provides two basic types of throttling-related settings:

  1. Server-side throttling limits are applied across all clients. These limit settings exist to prevent your API and your account from being overwhelmed by too many requests.
  2. Per-client throttling limits are applied to clients that use API keys associated with a usage plan as the client identifier (see the configuration sketch after this list).
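
For per-client throttling, a minimal boto3 sketch might look like the following; the REST API ID, stage name, API key ID, and limits are hypothetical placeholders (server-side, stage-level limits are configured separately in the stage settings).

import boto3

apigw = boto3.client("apigateway")

# Create a usage plan whose throttle limits apply to each client (API key)
# associated with it. The API ID and stage name are placeholders.
plan = apigw.create_usage_plan(
    name="standard-clients",
    throttle={"rateLimit": 100.0, "burstLimit": 200},  # requests/second and burst
    apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],
)

# Attach an existing API key to the usage plan so the per-client limits apply.
apigw.create_usage_plan_key(
    usagePlanId=plan["id"],
    keyId="abcdef1234",
    keyType="API_KEY",
)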

Additionally, as shown in Figure 2, you may want to further enhance protection for your API by enabling AWS WAF. With AWS WAF, you can set up rate-based rules that specify the number of web requests allowed from each client IP in a trailing, continuously updated, 5-minute period.

Figure 2. Enabling AWS WAF in front of API
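
A rough boto3 sketch of such a rate-based rule is shown below; the web ACL name, metric names, and the 2,000-request limit are illustrative assumptions, and the web ACL still needs to be associated with the API stage (for example, with associate_web_acl).

import boto3

# For an API Gateway REST API, the web ACL scope is REGIONAL.
waf = boto3.client("wafv2")

# Block any client IP that exceeds 2,000 requests in a trailing 5-minute period.
waf.create_web_acl(
    Name="api-rate-limit-acl",
    Scope="REGIONAL",
    DefaultAction={"Allow": {}},
    Rules=[
        {
            "Name": "per-ip-rate-limit",
            "Priority": 0,
            "Statement": {
                "RateBasedStatement": {"Limit": 2000, "AggregateKeyType": "IP"}
            },
            "Action": {"Block": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "PerIpRateLimit",
            },
        }
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "ApiRateLimitAcl",
    },
)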

Strategy 2 – Asynchronous invocation

When you don't need a response right away (for example, when firing off events), you can use an Amazon Simple Queue Service (Amazon SQS) queue as an event source. As shown in Figure 3, the queue triggers the Lambda function to process its messages.

Figure 3. Asynchronous invocation with SQS

When configuring the AWS Lambda event source mapping, there is a configurable maximum batch size, which is the number of messages delivered on each function invocation. The SQS message processing rate is determined by the event source's maximum batch size and the Lambda concurrency limit. When an SQS trigger is initially enabled, Lambda begins with a maximum of five concurrent invocations. Lambda functions with an SQS queue trigger scale up to a maximum of 1,000 concurrent invocations, the account concurrency limit, or the reserved concurrency limit if you have one configured. When Lambda invokes the target function, the event can contain multiple items (up to the configured maximum batch size). To control the message consumption rate and avoid throttling failures, tune the batch size and the function's concurrency settings together, as sketched below.
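
Here is a boto3 sketch of those settings; the queue ARN, function name, batch size, and concurrency values are hypothetical placeholders.

import boto3

lambda_client = boto3.client("lambda")

# Map an existing SQS queue to the function and cap how many messages are
# delivered per invocation.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:orders-queue",
    FunctionName="process-orders",
    BatchSize=10,                       # messages per invocation
    MaximumBatchingWindowInSeconds=5,   # wait up to 5 seconds to fill a batch
)

# Pair the mapping with reserved concurrency so the overall consumption rate
# stays within what downstream systems can handle.
lambda_client.put_function_concurrency(
    FunctionName="process-orders",
    ReservedConcurrentExecutions=50,
)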

Figure 4 shows another approach: using an Amazon Kinesis Data Stream as an event source to prompt the Lambda function.

Figure 4. Asynchronous invocation with Kinesis Data Stream

To control the processing rate of data from the stream, you can configure the following two parameters:

  • The number of shards in the Kinesis Data Stream
  • Lambda event source batch size
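
In boto3, configuring the second parameter might look like the following sketch; the stream ARN, function name, and batch size are hypothetical placeholders (the shard count is set on the stream itself).

import boto3

lambda_client = boto3.client("lambda")

# Attach the function to a Kinesis data stream. Parallelism is bounded by the
# number of shards in the stream; BatchSize caps the records per invocation.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    FunctionName="process-clicks",
    StartingPosition="LATEST",
    BatchSize=100,  # records per invocation
)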

Figure 5 shows another commonly used serverless compute service: AWS Fargate. Lambda is not designed for jobs that run longer than its 15-minute maximum execution time. If you foresee batch jobs taking several minutes or hours to complete, you can set up an Amazon Elastic Container Service (Amazon ECS) cluster on Fargate to process messages from a queue.

Figure 5. Serverless compute with Fargate

In this architecture, the message processing rate is driven by the size of the cluster (the number of running tasks). You can also use an Amazon CloudWatch alarm to automatically scale the cluster based on the depth of the queue, as sketched below.
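
One way to wire this up with boto3 is sketched below; the cluster, service, and queue names, the capacity bounds, and the alarm threshold are all hypothetical placeholders. A corresponding scale-in policy and alarm would typically be added the same way.

import boto3

autoscaling = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Register the ECS service as a scalable target, bounded between 1 and 10 tasks.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/batch-cluster/queue-worker",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)

# Step scaling policy: add one task each time the alarm fires.
policy = autoscaling.put_scaling_policy(
    PolicyName="scale-out-on-queue-depth",
    ServiceNamespace="ecs",
    ResourceId="service/batch-cluster/queue-worker",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}],
        "Cooldown": 60,
    },
)

# CloudWatch alarm on queue depth that triggers the scale-out policy.
cloudwatch.put_metric_alarm(
    AlarmName="orders-queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "orders-queue"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)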

When launching a Fargate task using the Amazon ECS run-task API, the call is throttled at 1 transaction per second by default, with a burst rate of 10. This means that, at a sustained rate, you can launch at most 10 tasks every 10 seconds. Because of this, we recommend using a backoff strategy (retries with exponential backoff) when launching tasks; a retry sketch follows this paragraph. Alternatively, you can use the Amazon ECS create-service call, where Amazon ECS ensures that all tasks are run over time while staying within the throttle rate. For example, you could keep 30 tasks running concurrently, but you couldn't start all 30 at the same instant, because the run-task API for Fargate would be throttled.
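
A minimal retry sketch in boto3, assuming the throttled call surfaces as a ThrottlingException and using hypothetical cluster, task definition, and subnet values:

import random
import time

import boto3
from botocore.exceptions import ClientError

ecs = boto3.client("ecs")


def run_fargate_task_with_backoff(max_attempts=5):
    """Launch one Fargate task, retrying with exponential backoff and jitter
    when the run-task call is throttled."""
    for attempt in range(max_attempts):
        try:
            return ecs.run_task(
                cluster="batch-cluster",
                taskDefinition="queue-worker:1",
                launchType="FARGATE",
                count=1,
                networkConfiguration={
                    "awsvpcConfiguration": {
                        "subnets": ["subnet-0123456789abcdef0"],
                        "assignPublicIp": "DISABLED",
                    }
                },
            )
        except ClientError as error:
            if error.response["Error"]["Code"] != "ThrottlingException":
                raise
            # Back off exponentially (1 s, 2 s, 4 s, ...) plus random jitter.
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("run_task still throttled after retries")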

Proactive monitoring to track scaling overruns

Before applying any rate limiting configuration, it is important to understand your workloads and traffic. AWS provides many services to help you monitor and understand your applications. For example, enabling AWS X-Ray on API Gateway gives you a full view of each request as it travels from the user into your application. CloudWatch collects monitoring and operational data in logs, metrics, and events, providing a unified view of AWS resources, applications, and services. Effective monitoring helps you understand the normal state of your applications. Once you understand your workloads, you can determine which layers and attributes are best suited to rate limiting.

Conclusion

In this blog post, we examined various methods and patterns to throttle your serverless applications in order to control costs and avoid overloading downstream components. Consider using these strategies to implement throttling mechanisms when designing your serverless applications.

Sharon Li

Sharon Li is a solutions architect at AWS, based in the Boston, MA area. She works with enterprise customers helping them solve difficult problems and build on AWS. Outside of work, she likes to spend time with her family and explore local restaurants.

Akhil Aendapally

Akhil Aendapally is an AWS Solutions Architect focused on helping customers with their AWS adoption. He holds a master's degree in Network and Computer Security. Akhil has 8+ years of experience working with different cloud platforms, infrastructure automation, and Microsoft technologies.

Ashish Lagwankar

Ashish Lagwankar is a Senior Enterprise Solutions Architect at AWS. His core interests include AI/ML, serverless, and container technologies. Ashish is based in the Boston, MA, area and enjoys reading, outdoors, and spending time with his family.