AWS Compute Blog

Designing Serverless Integration Patterns for Large Language Models (LLMs)

This post is written by Josh Hart, Principal Solutions Architect and Thomas Moore, Senior Solutions Architect

This post explores best practice integration patterns for using large language models (LLMs) in serverless applications. These approaches optimize performance, resource utilization, and resilience when incorporating generative AI capabilities into your serverless architecture.

Overview of serverless, LLMs and example use case

Organizations of all shapes and sizes are harnessing LLMs to build generative AI applications to deliver new customer experiences. Serverless technologies such as AWS Lambda, AWS Step Functions and Amazon API Gateway enable you to move from idea to market faster without thinking about servers. The pay-for-use billing model also allows for increased agility at an optimal cost.

The examples in this post leverage Amazon Bedrock, a fully managed service to access foundation models (FMs). The same principles apply to LLMs hosted on other platforms such as Amazon SageMaker. Amazon Bedrock allows developers to consume LLMs via an API without the complexities of infrastructure management. Amazon SageMaker is a fully managed service to build, train and deploy machine learning models.

The example use case in this post uses LLMs to create compelling marketing content for the launch of a new family SUV. Images of the vehicle, shown below, were pre-generated using Amazon Titan Image Generator in Amazon Bedrock.

Three different images of a new family SUV generated by Amazon Titan Image Generator.

Example use case images generated using Titan Image Generator

As organizations adopt LLMs to power generative AI applications, serverless architectures offer an attractive approach for rapid development and cost-effective scaling. The following sections explore several serverless integration patterns to build cost-effective, performant, and fault-tolerant generative AI applications.

Direct AWS Lambda call

Architecture diagram showing AWS Lambda invoking Amazon Bedrock using the InvokeModel API call.

Direct call to Amazon Bedrock from AWS Lambda

The simplest serverless integration pattern is to call Amazon Bedrock directly from a Lambda function using the AWS SDK. The following example Lambda function uses the Python SDK (boto3) to call the Bedrock InvokeModel API.

import json
import boto3

# Bedrock runtime client used to invoke the model
brt = boto3.client(service_name='bedrock-runtime')

def lambda_handler(event, context):
    # Build the Anthropic Claude Messages API request, combining the text
    # prompt with a base64-encoded vehicle image passed in the event
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "text",
                "text": "Create a 500 word car advert given these images and the following specification: \n {}".format(event['spec'])
            },
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": event['image']
                }
            }]
        }]
    })

    modelId = 'anthropic.claude-3-sonnet-20240229-v1:0'
    accept = 'application/json'
    contentType = 'application/json'

    # Synchronous call to Amazon Bedrock; blocks until the model returns
    response = brt.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())

    # Return the first block of generated text to the caller
    return {
        'statusCode': 200,
        'body': response_body["content"][0]["text"]
    }

The above code requires the Lambda function execution role to have the correct AWS Identity and Access Management (IAM) permissions for Amazon Bedrock, specifically the bedrock:InvokeModel action.

The example uses the Anthropic Claude 3 Sonnet LLM and the Anthropic Claude Messages API for the payload. The InvokeModel call is synchronous and will therefore wait for a response from the LLM. Depending on the model and prompt, the call can take several seconds. Ensure your Lambda function timeout is set appropriately. In most cases it will need to be increased from the default of 3 seconds.

The boto3 client has a default timeout of 60 seconds. Depending on the use case, you may need to increase the boto3 client timeout as shown in the sample code below.

import boto3
from botocore.config import Config

# Set the read timeout to 600 seconds (10 minutes)
config = Config(read_timeout=600)

# Create the Bedrock client with the custom read timeout configuration
boto3_bedrock = boto3.client(service_name='bedrock-runtime', config=config)

When working with LLMs, the generated text is often substantial, leading to increased response times or even timeouts. Amazon Bedrock provides the ability to stream responses using InvokeModelWithResponseStream which allows you to process and consume the generated text in chunks as it becomes available. This enables a faster response to the client and allows at least a partial response even if a timeout occurs.
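
As a minimal sketch, assuming the same request body and model ID as the earlier example and the Anthropic Claude 3 streaming event format (text arriving in content_block_delta events), consuming a streamed response might look like the following. Validate the event shape against your chosen model's documentation.

import json
import boto3

brt = boto3.client(service_name='bedrock-runtime')

def stream_response(body, model_id):
    # Sketch of consuming a streamed Bedrock response. 'body' is the same
    # Messages API payload used in the earlier example. The event parsing
    # below assumes the Claude 3 streaming format (content_block_delta
    # events carrying text deltas).
    response = brt.invoke_model_with_response_stream(
        body=body,
        modelId=model_id,
        accept='application/json',
        contentType='application/json'
    )

    generated_text = ""
    for event in response.get('body'):
        chunk = json.loads(event['chunk']['bytes'])
        if chunk.get('type') == 'content_block_delta':
            generated_text += chunk['delta'].get('text', '')
            # Forward each partial chunk to the client here, for example via
            # Lambda response streaming or a WebSocket connection
    return generated_text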

When using response streaming with Lambda functions, set the boto3 read_timeout to a lower value than the function execution timeout so that you retain the option to return at least some content. In some situations this is preferable to returning no response at all. For example, you might set your Lambda function timeout to 2 minutes and your boto3 read timeout to 90 seconds, which leaves 30 seconds to take additional action. Depending on the failure scenario, you might take various actions:

  • Transient errors such as rate limiting or exceeded service quotas: consider backing off and retrying the request, or load-balancing requests to another Region with cross-region inference.
  • Timeout errors when the boto3 read timeout is hit: decide whether to retry the request with a simplified prompt (or a shorter response length) or return a partial response, as sketched after this list.
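
The following sketch illustrates one way to combine these strategies in a Lambda function. The retry count, backoff values, and the ThrottlingException error code are assumptions for illustration; botocore's ReadTimeoutError is raised when the client read timeout is hit.

import json
import time
import boto3
from botocore.exceptions import ClientError, ReadTimeoutError

brt = boto3.client(service_name='bedrock-runtime')

def invoke_with_fallback(body, model_id, max_retries=2):
    # Illustrative sketch: retry throttled requests with exponential backoff
    # and treat read timeouts as a signal to simplify the prompt or return
    # any partial content gathered so far
    for attempt in range(max_retries + 1):
        try:
            response = brt.invoke_model(body=body, modelId=model_id)
            return json.loads(response['body'].read())["content"][0]["text"]
        except ClientError as error:
            # Transient throttling: back off and retry (add jitter in production)
            if error.response['Error']['Code'] == 'ThrottlingException' and attempt < max_retries:
                time.sleep(2 ** attempt)
                continue
            raise
        except ReadTimeoutError:
            # Read timeout hit: retry with a simplified prompt or a shorter
            # max_tokens value, or return a partial streamed response instead
            return None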

Prompt chaining with AWS Step Functions

The direct Lambda pattern works well for simple single-prompt inference. Accomplishing complex tasks with LLMs requires a technique called prompt chaining, where tasks are broken down into smaller well-defined subtask prompts and each prompt is fed to the LLM in a defined order.

Prompt chaining inside a single Lambda function can be time consuming and may exceed the maximum Lambda timeout of 15 minutes in some cases. AWS Step Functions can solve this by orchestrating the calls to LLMs. Amazon Bedrock has an optimized integration for Step Functions, so the workflow waits for each InvokeModel request to complete before progressing to the next state. With Step Functions Standard Workflows you only pay for state transitions, which avoids paying for Lambda functions to sit idle while waiting on the model.

The below example shows prompt chaining with Step Functions using direct service integrations only, which eliminates the need for custom Lambda code.

Workflow diagram for AWS Step Functions showing an example prompt chain to generate different text content for showroom vehicles.

Prompt chaining using AWS Step Functions

  1. The user input (vehicle description) is passed to Amazon Bedrock via the Step Functions optimized integration.
  2. The generated output of the InvokeModel API call is passed via the ResultPath to the next step.
  3. The state machine sets the input of the next step based on the output of the previous step using the Pass state.
  4. The output of each inference request continues to be passed between each step in the workflow.
  5. The last step runs an inference request and the final result is returned as the output of the state machine.
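
To make this concrete, here is a minimal sketch of a two-step chain expressed in Amazon States Language, written as a Python dictionary so it can be serialized with json.dumps when creating the state machine (for example with infrastructure-as-code tooling). The state names, prompts, result paths, and the payload path used to pass the first model's output into the second prompt are illustrative assumptions to adapt for your chosen model.

import json

MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

def bedrock_task(prompt_template, result_path, next_state=None):
    # Builds an illustrative Bedrock InvokeModel task state for the chain
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::bedrock:invokeModel",
        "Parameters": {
            "ModelId": MODEL_ID,
            "Body": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1000,
                "messages": [{
                    "role": "user",
                    "content": [{
                        "type": "text",
                        # States.Format injects workflow data into the prompt
                        "text.$": prompt_template
                    }]
                }]
            }
        },
        # Keep the original input and attach the model output under result_path
        "ResultPath": result_path
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state

definition = {
    "Comment": "Illustrative two-step prompt chain",
    "StartAt": "GenerateHighlights",
    "States": {
        "GenerateHighlights": bedrock_task(
            "States.Format('List the key selling points of this vehicle: {}', $.description)",
            "$.highlights",
            next_state="GenerateAdvert"
        ),
        # The second prompt reads the first model's output from the state input
        "GenerateAdvert": bedrock_task(
            "States.Format('Write a 500 word advert using these selling points: {}', $.highlights.Body.content[0].text)",
            "$.advert"
        )
    }
}

# Serialize to JSON when creating or updating the state machine
print(json.dumps(definition, indent=2))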

Another advantage of using AWS Step Functions to invoke the LLM is the built-in error handling. Step Functions can be set up to automatically retry on error, and allows you to configure a backoff rate and add jitter to help manage throttling. No custom code is required.

View of the different error handling options in AWS Step Functions for a particular action, including interval, max attempts, backoff rate, max delay and jitter.

Built-in error handling options for an action in an AWS Step Functions workflow

Handling throttling is particularly important when you are approaching the Bedrock service quota limits, such as the number of requests processed per minute for a particular model. Be aware that some limits are hard limits and cannot be adjusted. See the Bedrock service quotas documentation for the latest information.
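
As an illustration, the same retry behavior can also be declared directly on a Bedrock task state in the workflow definition, such as the one sketched earlier. The values and the error name below are assumptions for demonstration rather than recommendations; verify the exact throttling error surfaced by the Bedrock integration in your account.

# Illustrative Retry block added to the Bedrock task state from the earlier
# prompt chaining sketch; field names follow the Amazon States Language spec
definition["States"]["GenerateHighlights"]["Retry"] = [{
    "ErrorEquals": ["ThrottlingException"],  # verify the exact error name surfaced by the integration
    "IntervalSeconds": 2,       # initial delay before the first retry
    "MaxAttempts": 5,           # stop retrying after five attempts
    "BackoffRate": 2.0,         # double the delay after each attempt
    "MaxDelaySeconds": 60,      # cap the delay between retries
    "JitterStrategy": "FULL"    # randomize delays to avoid synchronized retries
}]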

Parallel prompts with AWS Step Functions

The performance of the application can be improved by breaking down tasks into smaller sub-tasks and running them in parallel. This can dramatically decrease the overall response time, especially for larger models and complex prompts. In the following example, parallel processing reduced the total execution time of the state machine from 30.8 seconds to 19.2 seconds, an improvement of 37.7% when compared to the same steps run in sequence.

The below example uses the Step Functions parallel state to perform Bedrock InvokeModel actions in parallel.

Example workflow showing prompt chaining using the AWS Step Functions parallel state.

Prompt chaining example using parallel state in AWS Step Functions

  1. The user input (vehicle description) is passed to Amazon Bedrock via the Step Functions optimized integration.
  2. The Step Functions parallel state allows branching logic to perform multiple steps in parallel.
  3. Complex inference tasks are run in parallel to reduce end-to-end execution time.
  4. Shorter tasks can be combined to balance branch execution time with longer running tasks.
  5. The generated output is combined and the final response returned.
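
A trimmed-down sketch of how these branches might be declared is shown below, reusing the illustrative bedrock_task helper from the prompt chaining sketch. The branch contents and state names are assumptions; in practice each branch would contain the inference tasks from your own workflow.

# Illustrative Parallel state: each branch is a mini state machine, every
# branch receives the same input, and the branch outputs are returned as an array
parallel_state = {
    "Type": "Parallel",
    "Branches": [
        {
            # Long-running, complex inference task in its own branch
            "StartAt": "GenerateAdvert",
            "States": {
                "GenerateAdvert": bedrock_task(
                    "States.Format('Write a 500 word advert for: {}', $.description)",
                    "$.advert"
                )
            }
        },
        {
            # Shorter tasks can share a branch to balance its duration
            # against the longer-running branch
            "StartAt": "GenerateHeadline",
            "States": {
                "GenerateHeadline": bedrock_task(
                    "States.Format('Write a one-line headline for: {}', $.description)",
                    "$.headline"
                )
            }
        }
    ],
    # The combined branch outputs are attached to the state output
    "ResultPath": "$.generated",
    "End": True
}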

In addition to the parallel state, the Step Functions map state can be used to run the same action multiple times in parallel with different inputs. For example, if you wanted to generate marketing materials for 100 vehicles with data stored in Amazon S3, you could run the above workflow nested inside a distributed map state.

Result caching

Generating text using LLMs can be a computationally intensive and time-consuming process, especially for complex prompts or long content generation. To improve performance and reduce latency, use caching where possible by storing and reusing previously generated responses. This concept is explored in detail in Mastering LLM Caching for Next-Generation AI.

Caching can be implemented at different levels within your application architecture, each with its own advantages and trade-offs. Here are some examples:

  1. Caching inside the Lambda execution environment: if your Lambda function receives repeated prompts or inputs, you can store the results in memory or in the /tmp directory of a warm execution environment.
  2. External caching services: to overcome the limitations of in-memory caching and use more robust caching solutions, you can integrate with external services such as Amazon ElastiCache (for Redis or Memcached) or Amazon DynamoDB to store previous results. A sketch combining both levels follows this list.
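
As a sketch of both levels combined, the following function checks an in-memory dictionary scoped to the execution environment first, then a DynamoDB table, and only invokes the model on a miss. The table name, key schema, and the use of a hashed prompt as the cache key are assumptions for illustration.

import hashlib
import json
import boto3

brt = boto3.client(service_name='bedrock-runtime')
table = boto3.resource('dynamodb').Table('llm-response-cache')  # assumed table name

# In-memory cache lives for the lifetime of a warm execution environment
local_cache = {}

def cached_invoke(prompt, body, model_id):
    # Derive a deterministic cache key from the prompt
    cache_key = hashlib.sha256(prompt.encode('utf-8')).hexdigest()

    # 1. Check the in-memory cache of this execution environment
    if cache_key in local_cache:
        return local_cache[cache_key]

    # 2. Check the shared DynamoDB cache
    item = table.get_item(Key={'cache_key': cache_key}).get('Item')
    if item:
        local_cache[cache_key] = item['response']
        return item['response']

    # 3. Cache miss: invoke the model and store the result at both levels
    response = brt.invoke_model(body=body, modelId=model_id)
    text = json.loads(response['body'].read())["content"][0]["text"]
    table.put_item(Item={'cache_key': cache_key, 'response': text})
    local_cache[cache_key] = text
    return text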

The example below uses a Step Functions workflow to check for a cached response in DynamoDB before invoking the model. The cache key in this case could be the LLM prompt. This helps to reduce costs whilst improving performance. The example generates custom vehicle descriptions based on a particular persona, for example to focus on safety features and luggage space for a family, or performance specifications for a motorsport enthusiast.

Example AWS Step Functions workflow that uses Amazon DynamoDB to store and retrieve previously generated LLM responses.

Example AWS Step Functions workflow that uses Amazon DynamoDB to cache LLM responses
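
A compact sketch of such a workflow definition is shown below, again as a Python dictionary of Amazon States Language. The table name, key names, and the response path read from the model output are assumptions to adapt; the getItem and putItem tasks use the Step Functions DynamoDB service integrations.

# Illustrative cache-check workflow: look up the prompt in DynamoDB first and
# only call Amazon Bedrock on a cache miss, storing the new response afterwards
cache_workflow = {
    "StartAt": "CheckCache",
    "States": {
        "CheckCache": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:getItem",
            "Parameters": {
                "TableName": "llm-response-cache",          # assumed table name
                "Key": {"cache_key": {"S.$": "$.prompt"}}
            },
            "ResultPath": "$.cacheResult",
            "Next": "CacheHit?"
        },
        "CacheHit?": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.cacheResult.Item",
                "IsPresent": True,
                "Next": "ReturnCachedResponse"
            }],
            "Default": "InvokeModel"
        },
        "ReturnCachedResponse": {
            "Type": "Pass",
            "Parameters": {"response.$": "$.cacheResult.Item.response.S"},
            "End": True
        },
        "InvokeModel": {
            "Type": "Task",
            "Resource": "arn:aws:states:::bedrock:invokeModel",
            "Parameters": {
                "ModelId": "anthropic.claude-3-sonnet-20240229-v1:0",
                "Body": {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 1000,
                    "messages": [{
                        "role": "user",
                        "content": [{"type": "text", "text.$": "$.prompt"}]
                    }]
                }
            },
            "ResultPath": "$.modelResult",
            "Next": "StoreInCache"
        },
        "StoreInCache": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": "llm-response-cache",
                "Item": {
                    "cache_key": {"S.$": "$.prompt"},
                    # Assumed path to the generated text in the model output
                    "response": {"S.$": "$.modelResult.Body.content[0].text"}
                }
            },
            "End": True
        }
    }
}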

When implementing caching, it is crucial to consider factors such as cache invalidation strategies, cache size limitations, and data consistency requirements. For example, if your LLM generates dynamic or personalized content, caching may not be suitable, as the responses could be stale or incorrect for different users or contexts.

Conclusion

This post explored integration patterns for consuming LLMs in serverless applications, enabling an efficient and reliable next generation experience for customers. Single-prompt inference can be achieved with AWS Lambda using the AWS SDK.

Responses from LLMs can be large, which often means manipulating large text payloads in memory, especially for Retrieval-Augmented Generation (RAG) use cases. It's therefore important to select an optimal memory configuration for your function, and the recommended way to do this is with AWS Lambda Power Tuning.

When more complex prompt chaining is required, it's best practice to explore Step Functions as a way to reduce idle wait time and avoid the 15-minute Lambda timeout. Step Functions also brings the benefits of an optimized integration with Bedrock, as well as built-in error handling and the ability to run tasks in parallel.

Remember that model choice is also an important consideration to balance cost, performance and output capabilities. This is discussed further in Choose the best foundational model for your AI applications.

To find more serverless patterns using Amazon Bedrock take a look at Serverless Land.