AWS Compute Blog

Building fault-tolerant applications with AWS Lambda durable functions

Business applications often coordinate multiple steps that need to run reliably or wait for extended periods, such as customer onboarding, payment processing, or orchestrating large language model inference. These critical processes require completion despite temporary disruptions or system failures. Developers currently spend significant time implementing mechanisms to track progress, handle failures, and manage resources when waiting for external events, shifting focus from business logic to undifferentiated tasks.

At re:Invent 2025, Amazon Web Services (AWS) launched AWS Lambda durable functions, a new capability that extends Lambda’s event-driven programming model so that you can build fault-tolerant multi-step applications and AI workflows in familiar programming languages. At their core, durable functions are regular Lambda functions, so your development and operational processes for Lambda continue to apply. However, when you create a Lambda function you can now enable durable execution, which lets you checkpoint progress, automatically recover from failures, and suspend execution for up to one year when waiting on long-running tasks, such as human-in-the-loop processes.

How Lambda durable functions work

When working with standard Lambda functions, your code runs from start to finish in a single invocation. If a failure occurs at any point during the execution, the entire function must be retried by the invoking event source. Any state that needs to be preserved between executions must be explicitly saved and retrieved. This is typically done by using external storage services such as Amazon DynamoDB or Amazon Simple Storage Service (Amazon S3). Furthermore, you must typically guard against duplicate (concurrent) invocations of the same event and have a strategy to safely deploy updates while continuing to process events.

In contrast, with Lambda durable functions, developers use durable operations such as “Steps” and “Waits” in the event handler to checkpoint progress, handle failures, and suspend execution during wait periods without incurring compute charges for on-demand functions. Lambda automatically persists these durable operations, and any optional state returned from them, in a fully managed durable execution backend. If a failure occurs during the execution, or if your function resumes after being paused, Lambda invokes your function again and restores the previous state through replay: it runs the event handler from the start but skips over completed durable operations. To streamline this checkpoint/replay mechanism, you can use the Lambda durable execution SDK to wrap or annotate your event handler, which enhances the existing Lambda context with new methods such as context.step() and context.wait(). You can also use methods such as context.waitForCallback() to wait on external jobs or asynchronous processes, such as “human-in-the-loop” scenarios. In that case, the execution is paused until a SendDurableExecutionCallbackSuccess or SendDurableExecutionCallbackFailure response is sent to the Lambda API.
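
To make the checkpoint/replay idea concrete, the following is a minimal, self-contained sketch of a toy in-memory model. It is not the Lambda durable execution SDK and does not reflect how the managed backend is implemented: ToyDurableContext, CheckpointStore, and runWithReplay are hypothetical names, and the model is synchronous for brevity, whereas the SDK methods are asynchronous.

```typescript
// Toy, in-memory model of checkpoint/replay. NOT the Lambda durable execution SDK.
type CheckpointStore = Map<string, unknown>;

class ToyDurableContext {
  constructor(private readonly store: CheckpointStore) {}

  // Runs fn at most once per step name; on replay, returns the persisted result.
  step<T>(stepName: string, fn: () => T): T {
    if (this.store.has(stepName)) {
      return this.store.get(stepName) as T; // completed earlier: skip the work
    }
    const result = fn();
    this.store.set(stepName, result); // checkpoint only after the step succeeds
    return result;
  }
}

const store: CheckpointStore = new Map();
const calls: string[] = []; // records which step bodies actually ran
let failOnce = true;

function handler(ctx: ToyDurableContext): string {
  const profile = ctx.step("create-profile", () => {
    calls.push("create-profile");
    return { email: "user@example.com" };
  });
  return ctx.step("send-welcome", () => {
    calls.push("send-welcome");
    if (failOnce) {
      failOnce = false;
      throw new Error("transient failure");
    }
    return `welcomed ${profile.email}`;
  });
}

// Re-invokes the handler from the start after a failure, reusing the same store.
function runWithReplay(maxInvocations: number): string {
  for (let i = 1; i <= maxInvocations; i++) {
    try {
      return handler(new ToyDurableContext(store));
    } catch (err) {
      if (i === maxInvocations) throw err;
    }
  }
  throw new Error("no invocations attempted");
}

const output = runWithReplay(3);
```

In this toy run, the second invocation starts the handler from the top, but the body of create-profile does not run again because its result is already in the store. Only send-welcome runs a second time.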

Getting started

To create a new durable function, run sam init with the AWS Serverless Application Model (AWS SAM) and choose an AWS Quick Start Template. Lambda durable functions are also supported by the AWS Cloud Development Kit (AWS CDK), AWS Command Line Interface (AWS CLI), AWS CloudFormation, and other infrastructure as code (IaC) frameworks such as Terraform.

Consider the following function, which performs user onboarding. First, it creates a user profile based on some data, then it sends out an email for verification and waits until the user either confirms the email address, or a 24-hour timeout is reached. Finally, it sends out a confirmation.

import {
  DurableContext,
  withDurableExecution,
} from '@aws/durable-execution-sdk-js';
export const handler = withDurableExecution(
  async (event: OnboardingEvent, context: DurableContext) => {
    try {    
      // Create user profile
      const profile = await context.step("create-profile", async () =>
        createUserProfile(event.email, event.name)
      );
      // Wait for email verification via callback
      const verification = await context.waitForCallback(
        "wait-for-email-verification",
        async (callbackId) => {
          // Send email to user and pass callbackId
          await sendVerificationEmail(profile, callbackId);
        },
        {
          timeout: { hours: 24 } 
        }
      );
      // Send confirmation and welcome email
      const result = await context.step("complete-onboarding", async () => {
        if (!verification || !verification.verified) {
          return { ...profile, status: 'failed' };
        }
        await sendWelcomeEmail(profile.email, profile.name);
        return { ...profile, status: 'active' };
      });
      return result;
    } catch (error) {
      // omitted 
    }
  }
);

Durable functions have built-in and fully customizable error handling for steps. For example, if the profile was successfully created and verified, but a temporary error occurred when sending out the confirmation, then the step is retried. The retry skips over any previously completed checkpoints, such as the profile creation and callback. Only the code within the send confirmation step is run again.
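
As an illustration of what a customized retry policy for a step might decide, here is a small sketch of an exponential backoff function. The RetryDecision shape, the exponentialBackoff name, and the default values are assumptions made for this example. They are not the SDK's actual retry configuration, so check the developer guide for the real options.

```typescript
// Hypothetical retry policy sketch; the SDK's real retry options may differ.
interface RetryDecision {
  shouldRetry: boolean;
  delaySeconds: number;
}

// attempt is 1-based: the attempt that has just failed.
function exponentialBackoff(
  attempt: number,
  maxAttempts = 4,
  baseSeconds = 2,
  capSeconds = 60,
): RetryDecision {
  if (attempt >= maxAttempts) {
    return { shouldRetry: false, delaySeconds: 0 }; // give up, surface the error
  }
  // Double the delay after each failed attempt, up to the cap.
  const delaySeconds = Math.min(capSeconds, baseSeconds * Math.pow(2, attempt - 1));
  return { shouldRetry: true, delaySeconds };
}

// Decisions after failed attempts 1 through 4.
const schedule = [1, 2, 3, 4].map((attempt) => exponentialBackoff(attempt));
```

With these defaults, the step would be retried after 2, 4, and 8 seconds, and the fourth failure would not be retried.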

Next, you update the AWS SAM template to include your durable function. You create a Lambda durable function by including the DurableConfig setting for your function. Note that you currently cannot add a durable configuration to a function that was originally created without it. The ExecutionTimeout defines how long the durable execution as a whole can run before it times out, which protects against runaway executions and deadlocks caused by application bugs. This setting is separate from the invocation timeout, which defines how long a single invocation can run. The maximum timeout for a single function invocation remains unchanged at 15 minutes. With Lambda durable functions, you typically see multiple invocations per durable execution, such as when using the wait capabilities in the SDK or automatic retries. You can set the ExecutionTimeout to up to one year when using asynchronous invocations.

The RetentionPeriodInDays defines how long the execution data of a durable execution is available to you after executions complete.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
 
Resources:
  UserOnboardingFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: UserOnboardingFunction
      CodeUri: ./src
      Handler: index.handler
      Runtime: nodejs24.x
      Architectures:
        - x86_64
      MemorySize: 256
      Timeout: 60 # Timeout for an individual invocation, in seconds
      DurableConfig: # This makes the function a durable function
        ExecutionTimeout: 90000 # 25h timeout for the durable execution overall
        RetentionPeriodInDays: 7
  UserOnboardingFunctionRole:
    Type: AWS::IAM::Role
    # omitted for brevity
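
The arithmetic behind the two timeouts in this example can be checked with a short sketch. The hoursToSeconds helper is hypothetical and not part of AWS SAM or the SDK. The reading that the 25-hour execution timeout was chosen to leave headroom over the 24-hour callback timeout is an assumption, not something the template states.

```typescript
// Hypothetical helper for converting hours to the seconds used in the template.
function hoursToSeconds(hours: number): number {
  if (!Number.isFinite(hours) || hours <= 0) {
    throw new RangeError("hours must be a positive, finite number");
  }
  return Math.round(hours * 3600);
}

// 24-hour callback timeout from the handler, 25-hour ExecutionTimeout from the template.
const callbackTimeoutSeconds = hoursToSeconds(24);
const executionTimeoutSeconds = hoursToSeconds(25);

// Time left for the remaining steps after the callback wait could expire.
const headroomSeconds = executionTimeoutSeconds - callbackTimeoutSeconds;
```

Here executionTimeoutSeconds matches the 90000 used in the template, and the remaining headroom is one hour.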

You must include the necessary permissions for your function. For example, to limit the function's privileges, the AWSLambdaBasicDurableExecutionRole managed policy allows only the minimal AWS Identity and Access Management (IAM) actions needed for creating and retrieving checkpoints and for logs. It therefore does not include permissions to invoke other (durable) functions or manage callbacks. Refer to the documentation for more details.

Testing locally

Before deploying your function, you can test it locally using the sam local invoke command.

AWS SAM locally invokes your function and runs the event handler until it reaches the context.waitForCallback() call. To complete callbacks, AWS SAM offers new commands to interact with your durable functions. In this example, you send a Success response to complete the callback. You can also include relevant data in the response. You can send the response directly using the on-screen guide or by running another AWS SAM CLI command from a separate process.

sam local callback succeed <your-callback-id> --result '<your data>'

To inspect an execution, you can use AWS SAM to retrieve the durable execution history of your function, which includes details about steps, callbacks, and wait durations, as shown in the following command.

sam local execution history <execution-arn>

Depending on your use case, you can instead send a Failure response to a callback and handle the resulting error in your code, for example by performing compensation logic in a subsequent step:

sam local callback fail <your-callback-id> --result '<your data>'
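
The following toy sketch shows one way compensation logic could be structured. It assumes that a failed callback surfaces in the handler as a thrown error, which the try/catch in the onboarding example suggests but which you should confirm in the developer guide. The awaitCallback stand-in, the CallbackOutcome type, and the delete-profile compensation step are hypothetical and do not use the SDK.

```typescript
// Toy model of handling a failed callback; NOT the Lambda durable execution SDK.
type CallbackOutcome =
  | { kind: "success"; verified: boolean }
  | { kind: "failure"; reason: string };

// Stand-in for waiting on a callback: returns on success, throws on failure.
function awaitCallback(outcome: CallbackOutcome): { verified: boolean } {
  if (outcome.kind === "failure") {
    throw new Error(`callback failed: ${outcome.reason}`);
  }
  return { verified: outcome.verified };
}

// audit records the order in which the (simulated) steps ran.
function onboard(outcome: CallbackOutcome, audit: string[]): string {
  audit.push("create-profile");
  try {
    const verification = awaitCallback(outcome);
    audit.push("complete-onboarding");
    return verification.verified ? "active" : "failed";
  } catch {
    // Compensation: undo the earlier step so no unverified profile is left behind.
    audit.push("delete-profile");
    return "rolled-back";
  }
}

const successAudit: string[] = [];
const successStatus = onboard({ kind: "success", verified: true }, successAudit);

const failureAudit: string[] = [];
const failureStatus = onboard({ kind: "failure", reason: "bounced" }, failureAudit);
```

In a real durable function, the compensation would itself run inside a step so that it is checkpointed and retried like any other operation.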

Now that you have verified that your function works as intended, deploy it to AWS using the sam deploy command.

Best practices and considerations

Invoking a Lambda durable function requires a qualified Amazon Resource Name (ARN), such as an alias or version. We recommend that you don’t use the $LATEST qualifier except for rapid prototyping or local testing. Using explicit versions ensures that replays always run the same code with which the execution was started, which keeps executions deterministic and prevents inconsistencies when you update your function code while executions are in progress.

We recommend bundling the durable execution SDK with your function code using your preferred package manager. The SDKs are fast-moving, and bundling lets you update the dependency as new features become available.

There are other durable operations in the Lambda durable functions SDK that you can use to build your application:

  • waitForCondition(): Pauses the execution of your function until a condition is met, for example until a job whose status you poll through an API reaches a given state. For this to work, you provide a waitStrategy and a check function that polls the status.
  • parallel(): Runs multiple durable operations in parallel within the same function, with configurable options such as the maximum number of concurrent branches and desired failure behavior. This streamlines managing durability and checkpointing for simultaneous asynchronous actions.
  • map(): Creates a durable operation and checkpoint for each item of an array, based on the provided mapping function. The items are processed concurrently.
  • invoke(): Invokes another Lambda function and waits for its result. The SDK creates a checkpoint, invokes the target function, and resumes your function when the invocation completes. This enables function composition and workflow decomposition.

Refer to the developer guide for more details.
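
To illustrate how a check function and a wait strategy fit together in waitForCondition(), here is a self-contained toy loop. It is a sketch under assumptions: pollUntil, WaitDecision, and WaitStrategy are hypothetical names rather than the SDK's types, and the loop adds up simulated delays instead of suspending the execution as a durable wait would.

```typescript
// Toy polling loop; NOT the SDK's waitForCondition implementation.
interface WaitDecision {
  shouldContinue: boolean;
  delaySeconds: number;
}
type WaitStrategy<S> = (state: S, attempt: number) => WaitDecision;

function pollUntil<S>(
  initial: S,
  check: (state: S) => S,
  waitStrategy: WaitStrategy<S>,
  maxAttempts: number,
): { state: S; attempts: number; waitedSeconds: number } {
  let state = initial;
  let waitedSeconds = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    state = check(state); // poll the external system
    const decision = waitStrategy(state, attempt);
    if (!decision.shouldContinue) {
      return { state, attempts: attempt, waitedSeconds };
    }
    waitedSeconds += decision.delaySeconds; // a durable wait would suspend here
  }
  throw new Error("condition not met within maxAttempts");
}

interface JobState {
  polls: number;
  status: "RUNNING" | "DONE";
}

// Simulated job that reports DONE on the third poll, with a linearly growing delay.
const finalPoll = pollUntil<JobState>(
  { polls: 0, status: "RUNNING" },
  (s) => ({ polls: s.polls + 1, status: s.polls + 1 >= 3 ? "DONE" : "RUNNING" }),
  (s, attempt) => ({ shouldContinue: s.status !== "DONE", delaySeconds: 5 * attempt }),
  10,
);
```

In this run the condition is met on the third poll, after simulated waits of 5 and 10 seconds.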

Lambda compute charges apply to all invocations, including any replays. When using wait operations, the function suspends execution and, for on-demand functions, doesn’t incur duration charges until execution resumes. You’re also charged for durable operations, data written, and data retention. To learn more about Lambda durable functions pricing, refer to the Lambda pricing page.

For the latest Region availability, visit the AWS Capabilities by Region page.

Conclusion

AWS Lambda durable functions extend the Lambda programming model to streamline building fault-tolerant, long-running applications using familiar programming patterns. You can use Lambda durable functions to write multi-step workflows in your preferred programming language, using built-in methods that automatically handle progress checkpointing and error recovery. This streamlines your architectures so that you can focus on your business logic, and helps you optimize cost because compute charges apply only to active compute time.

You can build durable functions for Python or Node.js based Lambda functions using the Lambda API, AWS Management Console, AWS CLI, AWS CloudFormation, AWS SAM, AWS SDK, and AWS CDK.

To get started, visit the Lambda Developer Guide or watch the re:Invent breakout session.