AWS Compute Blog

Efficiently processing batched data using parallelization in AWS Lambda

This post is written by Anton Aleksandrov, Principal Solutions Architect, AWS Serverless

Efficient message processing is crucial when handling large data volumes. By employing batching, distribution, and parallelization techniques, you can optimize the utilization of resources allocated to your AWS Lambda function. This post demonstrates how to implement parallel data processing within the Lambda function handler, maximizing resource utilization and potentially reducing invocation duration and function concurrency requirements.

Overview

AWS Lambda integrates with various event sources, such as Amazon SQS, Apache Kafka, or Amazon Kinesis, using event-source mappings. When you configure an event-source mapping, Lambda continuously polls the event source and automatically invokes your function to process the retrieved data. Lambda makes more invocations of your function as the number of messages it reads from the event source increases. This can increase the utilized function concurrency and consume the available concurrency in your account. To learn more, see how Lambda consumes messages from SQS queues and Kinesis streams.

To improve data processing throughput, you can configure the event-source mapping batch window and batch size. These settings ensure that your function is invoked only once enough messages have accumulated in the event source or enough time has passed. For example, if you configure a batch size of 100 messages and a batch window of 10 seconds, Lambda invokes your function when either 100 messages have accumulated or 10 seconds have elapsed, whichever happens first.
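
As a minimal AWS CDK sketch of these settings (assuming queue and fn are existing SQS queue and Lambda function constructs), the event-source mapping could be configured as follows:

import { Duration } from 'aws-cdk-lib';
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

// Invoke the function once 100 messages have accumulated,
// or 10 seconds have elapsed, whichever happens first
fn.addEventSource(new SqsEventSource(queue, {
    batchSize: 100,
    maxBatchingWindow: Duration.seconds(10),
    // Allow the function to report individual failed messages
    // (partial batch responses, discussed later in this post)
    reportBatchItemFailures: true,
}));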

Event source mapping event batching

By processing messages in batches, rather than individually, you can improve throughput and optimize costs by reducing the number of polling requests to event sources and the number of function invocations. For instance, processing a million messages without batching would require one million function invocations, but configuring a batch size of 100 messages can reduce the number of invocations to 10,000.

Optimizing batch processing within the Lambda execution environment

Each Lambda execution environment processes one event per invocation. With batching enabled, the event object Lambda sends to the function handler contains an array of messages retrieved and batched by the event-source mapping. Once an execution environment starts processing an event object containing a batch of messages, it won’t handle additional invocations until the current one is complete. However, simply iterating over the array of messages and processing them one by one may not fully utilize the allocated compute resources. This can lead to underutilized or idle compute resources, like CPU capacity, and hence longer overall processing times.
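
For an SQS event source, the batched event received by the handler is a single object with a Records array, one element per message. The structure looks roughly like this (abbreviated, with illustrative values):

{
    "Records": [
        { "messageId": "message-id-1", "body": "first message", "attributes": { ... }, ... },
        { "messageId": "message-id-2", "body": "second message", "attributes": { ... }, ... }
    ]
}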

Underutilized Lambda environments

Underutilization of compute resources is generally caused by two things: non-CPU-intensive blocking tasks, such as sending HTTP requests and waiting for responses, and single-threaded processing when more than one vCPU core is available. To address these concerns and maximize resource utilization, you can implement your functions to process data in parallel. This allows more efficient utilization of the allocated compute capacity, reducing invocation duration, time spent idle, and the total concurrency required. In addition, when you allocate more than 1.8 GB of memory to your function, it also gets more than one vCPU, which allows threads to land on separate cores for even better performance and true parallel processing.
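
As a minimal sketch of the first point, compare a handler that awaits each blocking call in turn with one that starts all calls and awaits them together. Here sendHttpRequest is a hypothetical I/O-bound operation used only for illustration:

// Hypothetical I/O-bound operation, such as calling a downstream API.
// It occupies no CPU while waiting.
const sendHttpRequest = async (record) =>
    new Promise((resolve) => setTimeout(resolve, 200));

// Sequential processing: each call waits for the previous one, so
// 20 records at 200ms each take roughly 4 seconds of mostly idle time
export const sequentialHandler = async (event) => {
    for (const record of event.Records) {
        await sendHttpRequest(record);
    }
};

// Concurrent processing: all calls start up front and are awaited
// together, so the waits overlap instead of adding up
export const concurrentHandler = async (event) => {
    await Promise.all(event.Records.map((record) => sendHttpRequest(record)));
};

Unlike this unbounded Promise.all, the worker-based sample later in this post caps the degree of concurrency, which is usually preferable when downstream systems can be overwhelmed.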

Improved concurrency in Lambda environment

When processing messages sequentially with a low compute utilization rate, reducing memory allocation may seem like an intuitive way to save costs. However, this can result in slower performance because less CPU capacity is allocated. When your function parallelizes data processing within the execution environment, it achieves a higher compute utilization rate, and because raising the memory allocation also provides additional CPU capacity, doing so can lead to better performance. Use the Lambda Power Tuning tool to find the optimal memory configuration, balancing cost with performance.

Understanding the Lambda execution environment lifecycle

After processing an invocation, the Lambda execution environment is “frozen” by the Lambda service. The Lambda runtime considers the invocation complete, and “freezes” the execution environment, as soon as your function handler returns.

When the Lambda service is looking for an execution environment to process a new incoming invocation, it first tries to “thaw” and reuse any available execution environments that were previously “frozen”. This cycle repeats until the execution environment is eventually shut down.

Lambda worker lifecycle over time

Implementing parallel processing within the Lambda execution environment

You can implement parallel processing by running multiple threads in your function handler, but if those threads are still running when the handler returns, they are “frozen” together with the execution environment until the next invocation. This can lead to unexpected behavior, where an execution environment is “thawed” to process a new invocation while it still has background threads running and processing data from previous invocations. If not handled properly, this behavior can cascade across multiple invocations, leading to delayed or unfinished processing and complicated debugging.

Threads frozen before finishing

To address this concern, you need to ensure that the background threads you spawn in the function handler finish processing data before you return from the handler. All threads spawned within a particular invocation must complete within that same invocation so that they do not spill over into subsequent invocations. This is illustrated in the following diagram: threads start and end within the same invocation, and the function handler returns only once all threads have finished.

Threads returning before end of invoke

Sample code

Programming languages offer diverse techniques and terminology for parallel and concurrent processing. Java employs multi-threading and thread pools. Node.js, though single-threaded, provides an event loop and promises (for asynchronous programming), as well as child processes and worker threads (for actual multi-threading). Python supports both multi-threading (subject to the Global Interpreter Lock) and multi-processing. Concurrent routines are another technique gaining attention.

The following sample is provided for illustration purposes only and is based on Node.js promises running concurrently. The sample code uses a language-agnostic term “worker” to denote a unit of parallel processing. Your specific parallelization implementation depends on your choice of runtime language and frameworks. AWS recommends you use battle-tested frameworks like Powertools for AWS Lambda that implement concurrent batch processing when possible. Regardless of the programming language, it is crucial to ensure all background threads/workers/promises/routines/tasks spawned by the function handler are completed within the same invocation before the handler returns.

Sample implementation with Node.js

const NUMBER_OF_WORKERS = 4;

export const handler = async (event) => {
    const workers = [];
    const messages = event.Records;

    // For handling partial batch processing errors
    const batchItemFailures = [];

    for (let i = 0; i < NUMBER_OF_WORKERS; i++) {
        // No await here! The waiting will happen later
        const worker = spawnWorker(i, messages, batchItemFailures);
        workers.push(worker);
    }

    // This line is crucial. This is where the handler
    // waits for all workers to complete their tasks
    const processingResults = await Promise.allSettled(workers);
    console.log('All done!');

    // Return messageIds of all messages that failed
    // to process in order to retry
    return { batchItemFailures };
};

async function spawnWorker(id, messages, batchItemFailures) {
    console.log(`worker.id=${id} spawning`);
    while (messages.length > 0) {
        const msg = messages.shift();
        console.log(`worker.id=${id} processing message`);
        try {
            // A blocking, but not CPU-intensive operation
            await processMessage(msg);
        } catch (err) {
            // If message processing failed, add it to
            // the list of batch item failures
            batchItemFailures.push({ itemIdentifier: msg.messageId });
        }
    }
}
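
// A hypothetical stand-in for processMessage, shown here only so the sample
// is self-contained. In a real function this would perform the actual
// blocking, non-CPU-intensive work, such as calling a downstream API.
// Here it simply waits for 200ms.
async function processMessage(msg) {
    await new Promise((resolve) => setTimeout(resolve, 200));
}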

See the sample code and AWS Cloud Development Kit (CDK) stack at github.com.

Testing results

The following chart illustrates a Lambda function processing messages using an SQS event-source mapping. After enabling message processing with four workers, the invocation duration and concurrent executions dropped to roughly a quarter of their previous values, while the function still processed the same number of messages per second. Thanks to parallelization, the new function is faster and requires less concurrency.

Function performance dashboard

Looking at the invocation log, you can see that the function handler spawned four workers, all of which completed before the handler returned the result. You can also see that although the handler received 20 items, with each item taking 200ms to process, the overall duration is only about 1000ms. This is because items were processed in parallel (20 items * 200ms / 4 workers = 1000ms total processing time).

START RequestId: (redacted)  Version: $LATEST
2024-06-18T03:18:03.049Z    INFO    Got messages from SQS
2024-06-18T03:18:03.049Z    INFO    messages.length=20
2024-06-18T03:18:03.049Z    INFO    worker.id=0 spawning
2024-06-18T03:18:03.049Z    INFO    worker.id=0 processing message
2024-06-18T03:18:03.049Z    INFO    worker.id=1 spawning
2024-06-18T03:18:03.049Z    INFO    worker.id=1 processing message
2024-06-18T03:18:03.050Z    INFO    worker.id=2 spawning
2024-06-18T03:18:03.050Z    INFO    worker.id=2 processing message
2024-06-18T03:18:03.050Z    INFO    worker.id=3 spawning
2024-06-18T03:18:03.050Z    INFO    worker.id=3 processing message
2024-06-18T03:18:03.250Z    INFO    worker.id=0 processing message
2024-06-18T03:18:03.250Z    INFO    worker.id=1 processing message
(redacted for brevity)
2024-06-18T03:18:03.852Z    INFO    worker.id=1 processing message
2024-06-18T03:18:03.852Z    INFO    worker.id=2 processing message
2024-06-18T03:18:03.852Z    INFO    worker.id=3 processing message
2024-06-18T03:18:04.052Z    INFO    All done!
END RequestId: (redacted)
REPORT RequestId: (redacted) Duration: 1004.48 ms

Considerations

  • The technique and samples described in this post assume unordered message processing. If you use an ordered event source, such as SQS FIFO queues, and need to preserve message order, you must address that in your implementation code. One technique is to create a separate worker for each messageGroupId, as sketched after this list.
  • While providing performance and cost benefits, multi-threading and parallel processing are advanced techniques that require proper error handling. Lambda supports partial batch responses, where you can report back to the event source that specific messages from the batch failed to process so that they can be retried. You can collect failed message IDs from each thread and return them as your function handler response, as illustrated in the sample above. See Handling errors for an SQS event source in Lambda and Best practices for implementing partial batch responses for additional details.
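
The following is a minimal sketch of the per-messageGroupId approach mentioned above. It assumes an SQS FIFO event source, where each record carries its group ID in record.attributes.MessageGroupId, and reuses the spawnWorker function from the earlier sample:

export const handler = async (event) => {
    const batchItemFailures = [];

    // Group records by their FIFO message group, so that messages within
    // a group are processed in order while groups are processed concurrently
    const groups = new Map();
    for (const record of event.Records) {
        const groupId = record.attributes.MessageGroupId;
        if (!groups.has(groupId)) {
            groups.set(groupId, []);
        }
        groups.get(groupId).push(record);
    }

    // One worker per message group
    const workers = [...groups.entries()].map(
        ([groupId, groupMessages]) => spawnWorker(groupId, groupMessages, batchItemFailures)
    );
    await Promise.allSettled(workers);

    return { batchItemFailures };
};

Note that with FIFO sources, once a message in a group fails, you may also want to report the remaining messages in that group as failures so that ordering is preserved on retry.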

Conclusion

Efficiently processing large volumes of data requires efficient resource utilization. When processing batches of messages from event sources, validate whether your function would benefit from parallel or concurrent processing within the function handler, thus increasing the compute capacity utilization rate. With a high compute capacity utilization rate, you can allocate more memory to your function, thus also getting more CPU allocated, for faster and more efficient processing. Use frameworks like Powertools for AWS Lambda that implement concurrent batch processing when possible, and use the Lambda Power Tuning tool to find the best memory configuration for your functions, balancing performance and cost.

For more serverless learning resources, visit Serverless Land.