Front-End Web & Mobile

Build a Real-time, WebSockets API for Amazon Bedrock

Generative AI is transforming the way applications interface with data, which in turn is creating new challenges for application developers building with generative AI services like Amazon Bedrock. Amazon Bedrock, a fully managed service, is the easiest way to build and scale generative AI applications with foundation models (FMs). It offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications. This post describes how you can seamlessly connect your frontend web and mobile applications to Amazon Bedrock using AWS AppSync and real-time WebSockets, and how you can easily load and store conversation history between your application users and an FM using AWS AppSync pipeline resolvers.

Connecting Applications to Generative AI Presents New Challenges

Generative AI and FMs are transforming the way applications interface with data. For example, instead of filling out and submitting forms, users chat with FMs. These new data access paradigms create new challenges for application developers, including:

  • FMs can sometimes take seconds or even minutes to fully respond to a request. This means app users will experience long load times, which may even exceed synchronous API connection limits. This post will show how connecting apps and FMs asynchronously can support long-running generations and give you the tools you need to provide a compelling experience for users interfacing with FMs.
  • FMs (and end users) often require access to conversation history. In addition, agent system prompts, connected actions, user account metadata, and more can also be necessary to provide the correct context to an FM. This post will present a pattern for simply storing and orchestrating FM conversation history.

The New Real-Time Data Requirements of Generative AI Applications

Let’s explore a sample generative AI chat-based application to see how and where these new challenges appear. In this example application, users interact with an FM via a conversation or chat. Users expect a responsive experience, with results shown to them immediately as they are generated. They may also want visibility into what the FM is doing and how it is progressing toward its generative goal.

Below is an architecture that attempts to address these requirements using a synchronous API to connect the frontend application to the FM.

Architecture diagram: a full-stack generative AI application connected via a synchronous API

The architecture flow above executes as follows:

  1. A request is sent to the back-end service to answer a given natural language question.
  2. The back-end service queries Amazon DynamoDB for context data: the past conversation history, a system prompt, or other metadata needed to give it a full view of the conversation. DynamoDB is used here as an example of how this data can be supplied through a collection of tables. The back-end service could be implemented with Amazon EC2 or AWS Lambda, running custom code that loads data and queries data stores. A framework like LangChain can manage the specific calls to the FM, but it still relies on manual data loading for chat history and some data store connections. (A minimal sketch of such a handler follows this list.)
  3. Using the user’s query and the conversation history, the back-end service runs a natural language model to fuzzily transform the request into a data store lookup or into a direct result for the user. With a Retrieval Augmented Generation (RAG) based approach, this means encoding text into a vector to look up the needed metadata in a data store before generating a summary for the user response. With an agent, this could involve multiple steps of a reason-act-observe (ReAct) prompting pattern. Either way, we wait for this process to finish.
  4. The result is sent back to the user, potentially many seconds after their query. The user has no visibility into the process as the data is pulled. For some applications this is acceptable; for others it results in frustration and a poor user experience.
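To make this friction concrete, here is a minimal, hypothetical sketch of the synchronous back-end handler described in steps 2 through 4, written as a Node.js Lambda function. The table name, model ID, prompt format, and item attribute names are assumptions for illustration, not a prescribed implementation.

// Hypothetical synchronous back-end handler (Node.js Lambda sketch)
// CONVERSATIONS_TABLE, the model ID, and the prompt format are assumptions
import { DynamoDBClient, QueryCommand } from '@aws-sdk/client-dynamodb';
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const ddb = new DynamoDBClient({});
const bedrock = new BedrockRuntimeClient({});

export const handler = async (event) => {
    // Step 2: load prior conversation turns for context
    const history = await ddb.send(new QueryCommand({
        TableName: process.env.CONVERSATIONS_TABLE,
        KeyConditionExpression: 'conversationId = :id',
        ExpressionAttributeValues: { ':id': { S: event.conversationId } },
    }));
    const turns = (history.Items ?? [])
        .map((item) => `${item.sender.S}: ${item.message.S}`)
        .join('\n');

    // Step 3: invoke the FM and block until the full generation completes
    const response = await bedrock.send(new InvokeModelCommand({
        modelId: 'anthropic.claude-v2',
        contentType: 'application/json',
        body: JSON.stringify({
            prompt: `\n\nHuman: ${turns}\n${event.question}\n\nAssistant:`,
            max_tokens_to_sample: 1024,
        }),
    }));

    // Step 4: only now does the user receive anything
    return JSON.parse(new TextDecoder().decode(response.body)).completion;
};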

We see some clear friction points in the above design, resulting from an architecture that was not designed for the new wave of generative AI. The difficulties we run into are summarized below:

  1. Poor user experience. FMs can take quite a while to respond to a given API prompt. Depending on the size of the model and the size of the output, it’s not uncommon to wait 30 seconds or even a minute for a single generation. For workflows with multiple agent responses, like ReAct or Tree of Thoughts, there will be multiple such prompts, meaning a user might need to wait minutes for an answer. Additionally, many synchronous API services cut off connections after 30 seconds, meaning that even if you are willing to let your user wait, you can’t. Many FMs do provide the ability to stream back the response, but this stream is sent to the invoker, in this case the back-end service. To provide application users a fluid and near-immediate experience, this stream needs to be delivered to the client instead.
  2. Complicated code required to manage conversation history. FMs are more powerful when they have access to relevant metadata. This can include customer business data, but also past conversation histories, system prompts for agents, user preferences, and vector stores. This rich context is needed so FMs can provide personal and relevant data for each user request. Conversation history also needs to be kept up to date with both user and FM messages as the conversation unfolds. If you use Amazon Bedrock Agents in your solution, they will manage much of this heavy lifting for you. If, however, you need to use a custom agent framework, loading this data and providing it to the FM as needed can be a cumbersome process to manage. Additionally, keeping it in sync as multiple systems consume or edit it means your logic is spread across many systems.
  3. Back-end service overhead. In the above design, the back-end service needs to be managed. Servers, containers, or functions need to be running, ready to handle new conversation messages. If WebSockets are added to connect the client to the server, server code needs to be written to manage the connection lifecycle for the entire user conversation. Authentication, rate limiting, and logging become unknown factors, and these systems are difficult to manage at scale.

AWS AppSync as a Connector to Generative AI

AWS AppSync is a managed GraphQL offering you can use to create flexible and secure real-time APIs on top of your data. AWS AppSync also natively exposes serverless WebSockets to developers to provide updates to application clients in real time as data changes.
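As a brief, hypothetical schema excerpt, this is how a subscription is typically wired to a mutation with the @aws_subscribe directive; the type and field names are illustrative and mirror the mutations used later in this post.

# Hypothetical schema excerpt; type and field names are illustrative
# @aws_subscribe ties the subscription to the mutation, so every successful
# agentPublishMetadata mutation is pushed to subscribed WebSocket clients
type Mutation {
    agentPublishMetadata(data: AgentMetadataInput!): ConversationEvent
}

type Subscription {
    onAgentPublishMetadata(conversationId: ID!): ConversationEvent
        @aws_subscribe(mutations: ["agentPublishMetadata"])
}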

Architecture diagram: AWS AppSync connecting to Amazon DynamoDB and generative AI

This architecture leverages AWS AppSync and AWS Lambda to serverlessly connect frontend applications to Amazon Bedrock and FMs, and it provides some key advantages over our previous implementation. In this new architecture, serverless WebSockets stream results to frontend applications, conversation metadata management is decoupled from our FM handling logic, and per-user authentication to the entire API is easy to set up.

AWS AppSync Serverless WebSockets for Real-Time Tokens and Events

The backbone of this architecture is the serverless WebSocket subscriptions that are built into every AWS AppSync API. These allow clients to receive configurable real-time updates on any data mutation operations performed on AWS AppSync. The specific flow looks like this:

  1. A request is sent to AWS AppSync, which triggers a series of resolvers to load any metadata required and then invokes Lambda.
  2. The Lambda function is triggered with an event containing the conversation history and any other relevant metadata.
  3. The Lambda function invokes the FM, and a result stream is delivered back to the Lambda function.
  4. As the Lambda function receives response tokens from Amazon Bedrock, it batches them and invokes the mutation below against the GraphQL API to send those tokens to the user via WebSockets. (A sketch of this handler appears after the mutation examples.)
# Example GraphQL mutation sent to AppSync from the Lambda function
mutation SendConversationStreamingUpdate($tokens: String!) {
    agentPublishMetadata(data: { agentPartialMessage: $tokens }) {
        conversationId
        timestamp
        event {
            agentPartialMessage
        }
    }
}

This can publish objects like tokens or more complex objects like records of actions performed. Below is a mutation that could update a conversation with a specific action result taken by an FM agent (for example, if the agent was prompted for, and allowed, a query against Amazon Relational Database Service).

# Example GraphQL mutation sent to AppSync from the Lambda function
mutation SendConversationUpdate($queryResult: String!) {
    agentPublishEvent(data: {rdsActionResult: $queryResult}) {
        conversationId
        timestamp
        event {
            rdsActionResult
        }
    }
}
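As a concrete (and hedged) illustration of steps 3 and 4, below is a minimal Node.js sketch of the Lambda FM handler: it streams tokens from Amazon Bedrock and forwards small batches to AppSync through the streaming mutation above. The environment variable names, model ID, and API-key authorization are assumptions made for brevity; production code would typically sign the request with IAM credentials instead.

// Hypothetical Lambda FM handler sketch: stream from Bedrock, publish to AppSync
import { BedrockRuntimeClient, InvokeModelWithResponseStreamCommand } from '@aws-sdk/client-bedrock-runtime';

const bedrock = new BedrockRuntimeClient({});

const MUTATION = `mutation SendConversationStreamingUpdate($tokens: String!) {
    agentPublishMetadata(data: { agentPartialMessage: $tokens }) {
        conversationId
        timestamp
        event { agentPartialMessage }
    }
}`;

async function publishTokens(tokens) {
    // POST the buffered tokens to the AppSync GraphQL endpoint; AppSync then
    // fans them out to subscribed clients over WebSockets
    await fetch(process.env.APPSYNC_API_URL, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'x-api-key': process.env.APPSYNC_API_KEY, // assumption: API-key auth for brevity
        },
        body: JSON.stringify({ query: MUTATION, variables: { tokens } }),
    });
}

export const handler = async (event) => {
    const response = await bedrock.send(new InvokeModelWithResponseStreamCommand({
        modelId: 'anthropic.claude-v2', // assumption: any streaming-capable FM works here
        contentType: 'application/json',
        body: JSON.stringify({
            prompt: `\n\nHuman: ${event.conversationHistoryString}\n\nAssistant:`,
            max_tokens_to_sample: 1024,
        }),
    }));

    // Batch tokens so each mutation carries a useful chunk of text
    let buffer = '';
    for await (const item of response.body) {
        if (!item.chunk?.bytes) continue;
        buffer += JSON.parse(new TextDecoder().decode(item.chunk.bytes)).completion;
        if (buffer.length > 50) {
            await publishTokens(buffer);
            buffer = '';
        }
    }
    if (buffer) await publishTokens(buffer);
};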

Conversation History Loading

As generative AI applications scale, loading and managing conversation history and other metadata becomes a more complex challenge. AWS AppSync steps in here to automatically manage the conversation history and whatever additional metadata your FM needs. This enables you to decouple your metadata logic from your FM agent.

AWS AppSync accomplishes this by giving you the ability to natively write JavaScript code to resolve API requests. This code can invoke configured AWS data sources and provide any data needed for those requests to be handled. As an example, this could include loading stored metadata from DynamoDB and invoking Lambda functions with simple resolver code like the example below.

// Example AppSync resolver code for loading the agent for a conversation
// Each resolver function is configured with a connected data source, in this case DynamoDB
// AppSync provides utilities that keep the developer experience clear

import * as ddb from '@aws-appsync/utils/dynamodb';

export function request(ctx) {
    // Look up the agent record using the ID stashed earlier in the pipeline
    return ddb.get({
        key: { id: ctx.stash.conversationData.agent },
    });
}

export function response(ctx) {
    // Pass the loaded agent item to the next function in the pipeline
    return ctx.result;
}

Resolver functions like the one above can be chained together into pipeline resolvers that handle the complex conversation history workflows for you.
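For example, a second function in the same pipeline might persist the user's incoming message before the FM handler is invoked. The sketch below assumes a conversation table and field names chosen for illustration.

// Hypothetical second pipeline function: persist the user's new message
import { util } from '@aws-appsync/utils';
import * as ddb from '@aws-appsync/utils/dynamodb';

export function request(ctx) {
    return ddb.put({
        key: { id: util.autoId() },
        item: {
            conversationId: ctx.args.conversationId,
            sender: 'user',
            message: ctx.args.message,
            timestamp: util.time.nowISO8601(),
        },
    });
}

export function response(ctx) {
    // Pass the stored message along to the next function in the pipeline
    return ctx.result;
}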

Diagram: AWS AppSync pipeline resolver execution flow

When the Lambda function is invoked, it is provided with a clean event object that encapsulates the conversation history to that point, ready to be consumed by the FM. This gives the Lambda function everything it needs to handle the request. Here are some examples of the kind of metadata you can manage in this manner (a hypothetical event shape follows the list):

  • User inputs. The most recent message by the user that needs to be handled for this invocation.
  • Conversation data. The collection of all time-ordered conversation events. Each event has a sender, some data (for example, a message or an action against a data store), and an ID for that message.
  • Conversation history string. A stringified version of the entire conversation history. This is ideal because it can be fed directly into an LLM’s API as the context string without any additional transformation, or used as a prefix along with custom logic.
  • FM agent data. This could include data that represents the agent itself, like what actions your agent is connected to, possible permissions, and system prompts.
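Put together, the event delivered to the Lambda function might look like the following sketch. The field names here are illustrative assumptions, not a documented contract.

// Hypothetical shape of the event the Lambda FM handler receives
const exampleEvent = {
    userInput: 'What were my top selling products last month?',
    conversationId: 'conv-123',
    conversationData: [
        { id: 'msg-1', sender: 'user', message: 'Hello' },
        { id: 'msg-2', sender: 'agent', message: 'Hi! How can I help?' },
    ],
    conversationHistoryString: 'user: Hello\nagent: Hi! How can I help?',
    agentData: {
        systemPrompt: 'You are a helpful retail analytics assistant.',
        allowedActions: ['rdsQuery'],
    },
};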

Prompts, in this example, are managed by the Lambda FM handler function itself. However, the pipeline resolver pattern provided by AWS AppSync can be expanded to include prompt engineering flows that version and deploy new prompts without downtime. AWS AppSync connects to many data stores in AWS and can send signed calls to HTTP endpoints, allowing you to dynamically load the custom prompts you need at runtime.
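As a sketch of that idea, an AppSync function backed by an HTTP data source could fetch a versioned prompt at request time. The /prompts resource path and the response shape are assumptions about a hypothetical prompt management service.

// Hypothetical AppSync function backed by an HTTP data source
export function request(ctx) {
    return {
        method: 'GET',
        resourcePath: `/prompts/${ctx.stash.conversationData.agent}`,
        params: {
            headers: { 'Content-Type': 'application/json' },
        },
    };
}

export function response(ctx) {
    // Stash the prompt so later pipeline functions and the Lambda can use it
    ctx.stash.systemPrompt = JSON.parse(ctx.result.body).prompt;
    return ctx.stash.systemPrompt;
}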

The Complete Flow in Detail

For a more detailed look at how a user message goes from the client, through the AWS AppSync API, to the FM-based agent, and back to the client, see the diagram below.

Diagram: the complete AI conversation flow with AWS AppSync

  1. Client sends a new message to a conversation. This takes the form of a GraphQL mutation request against a target conversation ID object. The client additionally opens a WebSocket connection to the conversation object in question in the GraphQL schema to listen for updates from its request. (A client-side sketch follows this list.)
  2. AWS AppSync resolvers fetch the needed data for this conversation from DynamoDB. This is now managed by the AWS AppSync API resolvers themselves and is handled automatically as part of the user’s GraphQL request.
  3. AWS AppSync resolvers store the user’s sent message with the provided conversation ID.
  4. AWS AppSync asynchronously invokes the Lambda FM handler configured by the customer. This is where the business logic for the FM would live. The specifics depend on the business use case, but this could involve a RAG system like we discussed earlier, a LangChain-backed FM agent, or even custom generative AI logic.
  5. AWS AppSync returns an HTTP 200 response to the user indicating the user’s message was successfully sent. This response includes the new conversation message written by the user. The process to this point takes less than 30 ms of execution time within the AWS AppSync service.
  6. The Lambda FM handler runs custom generative AI business logic as needed to handle the user’s request. That may include calls to Amazon Bedrock or other foundation models. If an AI agent is used, this may involve the model taking various AI-driven actions against back-end systems.
  7. A stream of FM output is delivered back to the Lambda function.
  8. Results from the FM are sent as mutation events back to AWS AppSync. Mutations can be triggered as data is made available in order to provide a fluid end user experience. This can involve updates about when the FM has started to respond and when the FM is thinking, tokens that FMs have generated as they become available, actions the FM has invoked against connected data stores, and the results of those data store executions. Whatever insights into the generation process the front end needs to provide the user can be delivered to AWS AppSync as available through a mutation.
  9. AWS AppSync resolvers trigger on these mutation operations, storing these events in DynamoDB for persistent storage.
  10. AWS AppSync’s native support for WebSockets and the existence of the subscription in the API schema result in the agent message being published to any clients connected to a given conversation. In this case that is only one client, but as generative AI applications evolve, this allows N clients who may want to know about an update to be notified.
  11. The WebSocket connection the client formed with AWS AppSync receives the agent message and updates in real time.
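To close the loop, here is a hypothetical client-side sketch using the AWS Amplify v6 API client: it opens the subscription for a conversation and then sends the user's message as a mutation. The operation and field names mirror the mutations above and are assumptions about the schema.

// Hypothetical client-side sketch (Amplify v6); schema names are assumptions
import { generateClient } from 'aws-amplify/api';

const client = generateClient();

const ON_METADATA = `subscription OnAgentPublishMetadata($conversationId: ID!) {
    onAgentPublishMetadata(conversationId: $conversationId) {
        conversationId
        timestamp
        event { agentPartialMessage }
    }
}`;

// Step 1 (subscription side): open the WebSocket for this conversation
const subscription = client.graphql({
    query: ON_METADATA,
    variables: { conversationId: 'conv-123' },
}).subscribe({
    // Steps 10 and 11: render partial agent messages as they arrive
    next: ({ data }) => console.log(data.onAgentPublishMetadata.event.agentPartialMessage),
    error: (error) => console.error(error),
});

// Step 1 (mutation side): send the user's message, which kicks off steps 2-5
await client.graphql({
    query: `mutation SendMessage($conversationId: ID!, $message: String!) {
        sendMessage(conversationId: $conversationId, message: $message) { id }
    }`,
    variables: { conversationId: 'conv-123', message: 'What were my top sellers?' },
});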

Conclusion

In this article, we covered how real-time data availability is a key building block for AI-driven experiences, explored how AWS AppSync can provide WebSocket-based real-time data streams at scale, and proposed an example architecture, one you can get started with right now, that handles much of the conversation logic for you.