Improve Amazon Bedrock Observability with Amazon CloudWatch AppSignals

With the pace of innovation with Generative AI applications, there is increasing demand for more granular observability into applications using Large Language Models (LLMs). Specifically, customers want visibility into:

Prompt metrics like token usage, costs, and model IDs for individual transactions and operations, apart from service-level aggregations.
Output quality factors including potential toxicity, harm, truncation due to length limits, and failures from exceeding token allowances.
Performance visibility for advanced LLM use cases like agent-based interactions, custom knowledge stores with Retrieval-Augmented Generation (RAG), etc.
Performance visibility to compare LLMs to choose the best model based on price, performance and tuning.

The Amazon Bedrock service vends a few metrics on the requests’ response time, token usage and invocation volume to its language model. These are helpful to quantify frequency of model invocation and token usage of the requests, to gain insights and identify opportunities to optimize your model usage. However, these vended metrics lack the granularity on service and operation level making it harder to perform diagnosis of issues in complex and distributed applications. Further, the lack of traces for these LLM requests hinders end to end visibility.

Amazon CloudWatch Application Signals provides you with a unified, application-centric view of your applications, services, and dependencies, and helps you monitor and triage application health. Enabling Application Signals for your applications, automatically instruments your application to collect metrics and traces, and display key metrics such as call volume, availability, latency, faults, and errors. You can also create and monitor service-level objectives (SLOs) and view a map of your application’s topology along with any dependencies.

We are excited to announce that Amazon CloudWatch Application Signals now supports automatic instrumentation of generative AI applications built on AWS using Amazon Bedrock. This update makes it easy to automatically instrument and track application performance, helping developers with optimal prompt engineering, cost management and output quality while delivering the best user experience for generative AI applications.

Walkthrough

In this post, we’ll present how you can leverage the enhanced telemetry for Amazon Bedrock in Amazon CloudWatch Application Signals to automatically assess the performance of the LLMs in Amazon Bedrock in the context of their applications, reducing the Mean Time to Identify (MTTI) with improved visibility and debugging capabilities. You can drill down to the performance of individual models using the golden signals of Application Performance Monitoring (APM), correlate service operation performance to the LLMs, and debug anomalies using traces for Bedrock.

Prerequisites

An AWS Account with access to the appropriate models enabled.
A sample Generative AI application using Amazon Bedrock. For the purposes of this blog post, follow the instructions from GitHub to deploy the sample app. If using this sample application, follow the Amazon Bedrock models access documentation to enable access to the amazon.titan-text-express-v1 and the anthropic.claude-3-sonnet-20240229-v1:0 models if not already enabled.
If not already enabled, follow the instructions to Enable Application Signals in your account.

Service Map

Navigate to the Amazon CloudWatch console and choose Service Map under the Application Signals section in the left navigation pane. As shown in Figure 1, you will now see two new types of nodes for Amazon Bedrock.

Figure 1: View application topology including Bedrock using Service Map

Bedrock node

This represents all Bedrock Control Plane Requests including model configuration and deployment requests. Depending on your application, you may notice more than one Bedrock node as a dependency. Each Bedrock node groups together calls to configure a type of resource, where the resource could be one of the following:

Agent – Agents for Amazon Bedrock offers you the ability to build and configure autonomous agents in your
application. Agents orchestrate interactions between foundation models (FMs), data sources, software applications, and user conversations.
Knowledge base – Allows you to integrate proprietary information into your generative-AI applications.
Datasource – The source of information to the knowledge base. For example, Amazon S3, Confluence, etc.
Guardrail – Enables you to implement safeguards for your generative AI applications based on your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple foundation models (FM), providing a consistent user experience and standardizing safety and privacy controls across generative AI applications. You can use guardrails with text-based user inputs and model responses.

Figure 2 below shows the dependency on the KnowledgeBase type Bedrock resource with id FIFFBKOOWC. To find additional details, you can navigate through the dependency as highlighted below for the BedrockRuntime dependency.

Figure 2: Bedrock node representing KnowledgeBase operations

BedrockRuntime node

This node represents all data plane requests to Amazon Bedrock including model inference requests. Depending on your application, you may notice more than one BedrockRuntime node as a dependency, where each node will represent the interactions with a model. In the given example, the code makes a request to both the amazon.titan-text-express-v1 and the anthropic.claude-3-sonnet-20240229-v1:0 models and hence Figure 1 above shows 2 instances of BedrockRuntime.

By selecting the BedrockRuntime node, as seen in Figure 3 below, you will get details on the model in use, a summary of service performance and golden signals including number of requests, average latency, error and fault rates. Adding to the summary, this section also includes the top 3 paths by fault rate, latency and error rate with easy to navigate links. This top-level summary can be useful to easily pinpoint which request path including Amazon Bedrock is causing increased latency or errors. Since we have only one path in our application, let’s open the customers-service-java link to understand more about the latency of the request.

Figure 3: BedrockRuntime node details for requests to Anthropic Claude

This takes us to the Service Dependency details page as shown in Figure 4 below to see more detail on the Dependency between customer-service-java and Amazon Bedrock. From the image, you can see that it is a dependency introduced by the InvokeModel request and the dependency on both the models in use. To dive deeper into the metrics, select a specific dependency metric data point, which provides the details of the correlated traces.

Figure 4: Dependency details

Selecting one of the traces takes us to the Trace Map as seen in Figure 5 below. Here, you can see the trace details including the Segment timeline and relevant logs. From this timeline, it is obvious that the Bedrock InvokeModel request to the Anthropic Claude model is the biggest contributor to the latency or processing time.

Figure 5: Trace Map for request including Bedrock service segment

In addition to the annotations in the trace span, you can log the details of the request as shown in Figure 6 below. The console will show all Amazon CloudWatch logs related to that transaction, for all the traversed nodes when applicable, to get more granular insights into the transaction. In this transaction, these attributes are:

prompt_token_count – The number of tokens in the prompt also called input token count (inputTextTokenCount on Figure 6)
generation_token_count – The number of tokens in the generated text also called output token count (outputTokenCount on Figure 6)

Figure 6: Sample log entry containing additional information related to the requests linked via the Segment

The above flow from dependency to correlated traces, to the related trace map and segment details demonstrates the power of Application Signals in simplifying the debugging process and reducing the Mean Time to Identify (MTTI).

Another frequent scenario involves comparing different models to identify the most suitable option for your use case, considering factors such as cost, performance, and end-user experience. The Services section provides a high-level overview of performance metrics, enabling you to make an informed decision. In Figure 7 below, you can observe that the Amazon Titan model outperforms the Anthropic Claude model for this particular use case.

Figure 7: Compare model performance in Services section

Cleanup

Refer to the cleanup instructions in the README appropriate to the deployment method chosen.

Conclusion

In this post we explained how you can use Amazon CloudWatch Application Signals to monitor generative AI applications built using Amazon Bedrock enabling you to choose the optimal LLM for your use case, while delivering a delightful user experience. The metadata and automatic instrumentation enable you to thoughtfully choose and configure appropriate SLOs to deliver the best experience while optimizing for cost and throughput.

Application Signals support for Amazon Bedrock is currently available in the Java and Python AWS SDK for requests to all versions of the Amazon Titan, Anthropic Claude and Meta Llama LLMs. Application Signals is generally available in 28 commercial AWS Regions, except CA West (Calgary) Region, AWS GovCloud (US) Regions and China Regions. For pricing, see Amazon CloudWatch pricing.

Try Application Signals with the AWS One Observability Workshop sample application. To get started, see the documentation to enable Amazon CloudWatch Application Signals for Amazon EKS, Amazon EC2, native Kubernetes and custom instrumentation for other platforms.

Select your cookie preferences

AWS Cloud Operations Blog