Artificial Intelligence

Amazon Bedrock AgentCore Observability with Langfuse

The rise of artificial intelligence (AI) agents marks a shift in how software is developed and in how applications make decisions and interact with users. While traditional systems follow predictable paths, AI agents engage in complex reasoning that remains hidden from view. This invisibility creates a challenge for organizations: how can they trust what they can’t see? This is where agent observability enters the picture, offering deep insights into how agentic applications perform, interact, and execute tasks.

In this post, we explain how to integrate Langfuse observability with Amazon Bedrock AgentCore to gain deep visibility into an AI agent’s performance, debug issues faster, and optimize costs. We walk through a complete implementation using Strands agents deployed on AgentCore Runtime followed by step-by-step code examples.

Amazon Bedrock AgentCore is a comprehensive agentic platform for deploying and operating highly capable AI agents securely and at scale. It offers purpose-built infrastructure for dynamic agent workloads, powerful tools to enhance agents, and essential controls for real-world deployment. AgentCore comprises fully managed services that can be used together or independently. These services work with any framework, including CrewAI, LangGraph, LlamaIndex, and Strands Agents, and with any foundation model in or outside of Amazon Bedrock, offering flexibility and reliability. AgentCore emits telemetry data in a standardized OpenTelemetry (OTEL)-compatible format, enabling easy integration with an existing monitoring and observability stack. It offers detailed visualizations of each step in the agent workflow, enabling you to inspect an agent’s execution path, audit intermediate outputs, and debug performance bottlenecks and failures.

How Langfuse tracing works

Langfuse uses OpenTelemetry to trace and monitor agents deployed on Amazon Bedrock AgentCore. OpenTelemetry is a Cloud Native Computing Foundation (CNCF) project that provides a set of specifications, APIs, and libraries that define a standard way to collect distributed traces and metrics from an application. Users can now track performance metrics including token usage, latency, and execution durations across different processing phases. The system creates hierarchical trace structures that capture both streaming and non-streaming responses, with detailed operation attributes and error states.

Through the /api/public/otel endpoint, Langfuse functions as an OpenTelemetry backend, mapping traces to its data model using generative AI conventions. This is particularly valuable for complex large language model (LLM) applications utilizing chains and agents with tools, where nested traces help developers quickly identify and resolve issues. The integration supports systematic debugging, performance monitoring, and audit trail maintenance, making it easier for teams to build and maintain reliable AI applications on Amazon Bedrock AgentCore.
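To make the wiring concrete, the following is a minimal sketch (not notebook code; the keys are hypothetical placeholders) of how any OpenTelemetry-instrumented application can be pointed at Langfuse’s OTEL endpoint using the standard OTLP environment variables. The same pattern is applied later when deploying to AgentCore Runtime.

import base64
import os

# Placeholder Langfuse project keys (hypothetical values).
LANGFUSE_PUBLIC_KEY = "pk-lf-..."
LANGFUSE_SECRET_KEY = "sk-lf-..."

# Langfuse authenticates OTLP requests with HTTP Basic auth: base64("public:secret").
auth_token = base64.b64encode(f"{LANGFUSE_PUBLIC_KEY}:{LANGFUSE_SECRET_KEY}".encode()).decode()

# Standard OTLP exporter settings picked up by OpenTelemetry SDKs and instrumentation.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://cloud.langfuse.com/api/public/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth_token}"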

In addition to agent observability, Langfuse offers a suite of integrated tools covering the full LLM application development lifecycle. This includes running automated LLM-as-a-judge evaluators (online and offline), organizing data labeling for root cause analysis and evaluator alignment, tracking experiments (locally and in CI), iterating on prompts interactively in a playground, and version-controlling them in the UI using prompt management.

Solution overview

This post shows how to deploy a Strands agent on Amazon Bedrock AgentCore Runtime with Langfuse observability. The implementation uses Anthropic Claude models through Amazon Bedrock. Telemetry data flows from the Strands agent through OTEL exporters to Langfuse for monitoring and debugging. To use Langfuse, set disable_otel=True in the AgentCore runtime deployment. This turns off AgentCore’s default observability.

Figure 1: Architecture overview

Key components used in the solution are:

  • Strands Agents: Python framework for building LLM-powered agents with built-in telemetry support
  • Amazon Bedrock AgentCore Runtime: Managed runtime service for hosting and scaling agents on Amazon Web Services (AWS)
  • Langfuse: Open-source observability and evaluation platform for LLM applications that receives traces via OTEL
  • OpenTelemetry: Industry-standard protocol for collecting and exporting telemetry data

Technical implementation guide

Now that we have covered how Langfuse tracing works, we can walk through how to implement it with Amazon Bedrock AgentCore.

Prerequisites

  • An AWS account
    • Before using Amazon Bedrock, confirm that your AWS credentials are configured correctly. They can be set up using the AWS CLI or by setting environment variables; this notebook assumes the credentials are already configured (a quick check follows this list).
  • Amazon Bedrock model access for Anthropic Claude 3.7 Sonnet in the us-west-2 Region
  • Amazon Bedrock AgentCore permissions
  • Python 3.10+
  • Docker installed locally
  • A Langfuse account, which is needed to create Langfuse API keys.
    • Register at Langfuse Cloud, create a project, and get the API keys.
    • Alternatively, you can self-host Langfuse within your own AWS account using the Terraform module.
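
Before moving on, you can optionally run a quick check (a minimal sketch, assuming boto3 is available) to confirm that AWS credentials and a Region can be resolved:

import boto3

# Confirm that AWS credentials and a default Region are resolvable from the
# environment or the AWS CLI configuration before running the notebook.
session = boto3.Session()
credentials = session.get_credentials()
print("Credentials found:", credentials is not None)
print("Region:", session.region_name or "not set (run `aws configure` or set AWS_DEFAULT_REGION)")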

Walkthrough

The following steps walk through how to use Langfuse to collect traces from agents created with the Strands SDK and deployed on AgentCore Runtime. You can also refer to this notebook on GitHub to get started right away.

Clone this GitHub repo:

git clone https://github.com/awslabs/amazon-bedrock-agentcore-samples.git

Once the repo is cloned, go to the Amazon Bedrock AgentCore samples directory, find the notebook runtime_with_strands_and_langfuse.ipynb, and run each cell.

Step 1: Install Python dependencies for the Strands agent

Run the following cell to install the dependencies defined in the requirements.txt file.

!pip install --force-reinstall -U -r requirements.txt --quiet

Step 2: Agent implementation

The agent file (strands_claude.py) implements a travel agent with web search capabilities.

%%writefile strands_claude.py
import os
import logging

from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent, tool
from strands.models import BedrockModel
from strands.telemetry import StrandsTelemetry
from ddgs import DDGS

logging.basicConfig(level=logging.ERROR, format="[%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)
logger.setLevel(os.getenv("AGENT_RUNTIME_LOG_LEVEL", "INFO").upper())


@tool
def web_search(query: str) -> str:
    """
    Search the web for information using DuckDuckGo.

    Args:
        query: The search query

    Returns:
        A string containing the search results
    """
    try:
        ddgs = DDGS()
        results = ddgs.text(query, max_results=5)
        formatted_results = []
        for i, result in enumerate(results, 1):
            formatted_results.append(
                f"{i}. {result.get('title', 'No title')}\n"
                f"   {result.get('body', 'No summary')}\n"
                f"   Source: {result.get('href', 'No URL')}\n"
            )
        return "\n".join(formatted_results) if formatted_results else "No results found."
    except Exception as e:
        return f"Error searching the web: {str(e)}"


# Function to initialize the Bedrock model
def get_bedrock_model():
    region = os.getenv("AWS_DEFAULT_REGION", "us-west-2")
    model_id = os.getenv("BEDROCK_MODEL_ID", "us.anthropic.claude-3-7-sonnet-20250219-v1:0")
    bedrock_model = BedrockModel(
        model_id=model_id,
        region_name=region,
        temperature=0.0,
        max_tokens=1024
    )
    return bedrock_model


# Initialize the Bedrock model
bedrock_model = get_bedrock_model()

# Define the agent's system prompt
system_prompt = """You are an experienced travel agent specializing in personalized travel recommendations
with access to real-time web information. Your role is to find dream destinations matching user preferences
using web search for current information. You should provide comprehensive recommendations with current
information, brief descriptions, and practical travel details."""

app = BedrockAgentCoreApp()


def initialize_agent():
    """Initialize the agent with proper telemetry configuration."""
    # Initialize Strands telemetry with a third-party (Langfuse) OTLP exporter
    strands_telemetry = StrandsTelemetry()
    strands_telemetry.setup_otlp_exporter()

    # Create the agent
    agent = Agent(
        model=bedrock_model,
        system_prompt=system_prompt,
        tools=[web_search]
    )

    return agent


@app.entrypoint
def strands_agent_bedrock(payload, context=None):
    """
    Invoke the agent with a payload
    """
    user_input = payload.get("prompt")
    logger.info("[%s] User input: %s", context.session_id, user_input)

    # Initialize agent with proper configuration
    agent = initialize_agent()

    response = agent(user_input)
    return response.message['content'][0]['text']


if __name__ == "__main__":
    app.run()

Step 3: Configure AgentCore Runtime deployment

Next, use the starter toolkit to configure the AgentCore Runtime deployment with an entry point, an automatically created execution role, and a requirements file. Additionally, configure the starter toolkit to auto-create the Amazon Elastic Container Registry (Amazon ECR) repository on launch.

During the configure step, a Dockerfile is generated based on the application code. When you use the bedrock_agentcore_starter_toolkit to configure the agent, it enables AgentCore Observability by default. Therefore, to use Langfuse, disable the default OTEL setup by setting disable_otel=True, as shown in the following code block.

Figure 2: Configure AgentCore Runtime

from bedrock_agentcore_starter_toolkit import Runtime
from boto3.session import Session

boto_session = Session()
region = boto_session.region_name

agentcore_runtime = Runtime()
agent_name = "strands_langfuse_observability"

response = agentcore_runtime.configure(
    entrypoint="strands_claude.py",
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region=region,
    agent_name=agent_name,
    disable_otel=True,
)
response

Step 4: Deploy to AgentCore Runtime

Now that a Dockerfile has been generated, launch the agent to AgentCore Runtime, which creates the Amazon ECR repository and the AgentCore Runtime.

Next, configure the Langfuse secret key, public key, and OTEL endpoint. For production workloads, store these values in AWS Systems Manager Parameter Store, which provides secure, hierarchical storage for configuration data and secrets (a sketch of reading them from Parameter Store follows the launch code).

import base64

# Langfuse configuration
otel_endpoint = "https://us.cloud.langfuse.com/api/public/otel"
langfuse_secret_key = "<Enter your Langfuse secret key>"  # For production, store the key securely
langfuse_public_key = "<Enter your Langfuse public key>"  # For production, store the key securely
langfuse_auth_token = base64.b64encode(f"{langfuse_public_key}:{langfuse_secret_key}".encode()).decode()
otel_auth_header = f"Authorization=Basic {langfuse_auth_token}"

launch_result = agentcore_runtime.launch(
    env_vars={
        "BEDROCK_MODEL_ID": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",  # Example model ID
        "OTEL_EXPORTER_OTLP_ENDPOINT": otel_endpoint,    # Langfuse OTEL endpoint
        "OTEL_EXPORTER_OTLP_HEADERS": otel_auth_header,  # Langfuse OTEL auth header
        "DISABLE_ADOT_OBSERVABILITY": "true",            # Turn off AgentCore's default ADOT observability
    }
)
launch_result
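
For production, rather than pasting the keys into the notebook, they can be read from Parameter Store at runtime. The following is a minimal sketch; the parameter names /langfuse/public-key and /langfuse/secret-key are hypothetical and would need to be created as SecureString parameters beforehand.

import base64
import boto3

# Read the Langfuse keys from AWS Systems Manager Parameter Store (hypothetical parameter names).
ssm = boto3.client("ssm", region_name="us-west-2")
langfuse_public_key = ssm.get_parameter(Name="/langfuse/public-key", WithDecryption=True)["Parameter"]["Value"]
langfuse_secret_key = ssm.get_parameter(Name="/langfuse/secret-key", WithDecryption=True)["Parameter"]["Value"]

# Build the same Basic auth header used in the launch code above.
langfuse_auth_token = base64.b64encode(f"{langfuse_public_key}:{langfuse_secret_key}".encode()).decode()
otel_auth_header = f"Authorization=Basic {langfuse_auth_token}"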

The following table describes the various configuration parameters being used.

Parameter | Description | Default
langfuse_public_key | API key for the Langfuse OTEL endpoint | Environment variable
langfuse_secret_key | Secret key for the Langfuse OTEL endpoint | Environment variable
OTEL_EXPORTER_OTLP_ENDPOINT | Trace endpoint | https://cloud.langfuse.com/api/public/otel/v1/traces
OTEL_EXPORTER_OTLP_HEADERS | Authentication type | Basic
DISABLE_ADOT_OBSERVABILITY | Disables AgentCore's default AWS Distro for OpenTelemetry (ADOT) observability so Langfuse can be used instead | true
BEDROCK_MODEL_ID | Amazon Bedrock model ID | us.anthropic.claude-3-7-sonnet-20250219-v1:0

Step 5: Check deployment status

Wait for the runtime to be ready before invoking:

import time

status_response = agentcore_runtime.status()
status = status_response.endpoint['status']
end_status = ['READY', 'CREATE_FAILED', 'DELETE_FAILED', 'UPDATE_FAILED']
while status not in end_status:
    time.sleep(10)
    status_response = agentcore_runtime.status()
    status = status_response.endpoint['status']
    print(status)
status

A successful deployment shows a READY status for the agent runtime.

Step 6: Invoking AgentCore Runtime

Finally, invoke our AgentCore Runtime with a payload.

invoke_response = agentcore_runtime.invoke({"prompt": "I'm planning a weekend trip to London. What are the must-visit places and local food I should try?"})

After the AgentCore Runtime has been invoked, you should be able to see the resulting traces in the Langfuse dashboard.

Step 7: View traces in Langfuse

After running the agent, visit the Langfuse project to view the detailed traces. The traces include:

  • Agent invocation details
  • Tool calls (web search)
  • Model interactions with latency and token usage
  • Request/response payloads

Traces and hierarchy

Langfuse captures all interactions from user requests to individual model calls. Each trace captures the complete execution path, including API calls, function invocations, and model responses, creating a comprehensive timeline of agent activities. The nested structure of traces enables developers to drill down into specific interactions and identify performance bottlenecks or error patterns at any level of the execution chain. To further enhance observability capabilities, Langfuse provides tagging mechanisms that can be implemented in agent workflows.
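
For example, the following is a hedged sketch of tagging, assuming the Strands Agent constructor accepts a trace_attributes argument (as described in the Strands Agents telemetry documentation); the session, user, and tag values are hypothetical. Attributes set this way are exported with the traces and appear as filters in the Langfuse UI.

# Hypothetical tagging sketch: attach session, user, and tag attributes to the agent's traces.
agent = Agent(
    model=bedrock_model,
    system_prompt=system_prompt,
    tools=[web_search],
    trace_attributes={
        "session.id": "demo-session-001",
        "user.id": "traveler@example.com",
        "langfuse.tags": ["travel-agent", "agentcore-demo"],
    },
)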

Figure 3: Traces in Langfuse

Combining hierarchical traces with strategic tagging provides insights into agent operations, enabling data-driven optimization and superior user experiences. As shown in the following image, developers can drill down into the precise timing of each operation within their agent’s execution flow. In this example, the complete request took 26.57 seconds, with individual breakdowns for the event loop cycle, tool calls, and other components. Use this timing information to find performance bottlenecks and reduce response times. For instance, certain LLM operations might take longer than expected, or there may be opportunities to parallelize specific actions to reduce overall latency. With these insights, you can make data-driven decisions to enhance the agent’s performance and deliver a better customer experience.

Figure 4: Detailed trace hierarchy

Langfuse dashboard

Langfuse features three built-in dashboards for monitoring: Cost, Latency, and Usage Management.

Figure 5: Langfuse dashboard

Cost monitoring

Cost monitoring helps track expenses at both the aggregate and individual request levels to maintain control over AI infrastructure expenses. The platform provides detailed cost breakdowns per model, user, and function call, enabling teams to identify cost-intensive operations and optimize their implementation. This granular cost visibility helps in making data-driven decisions about model selection, prompt engineering, and resource allocation while maintaining budget constraints. Dashboard cost data is provided for estimation purposes; actual charges should be verified through official billing statements.

Figure 6: Cost dashboard

Langfuse latency dashboard

Latency metrics can be monitored across traces and generations for performance optimization. The dashboard shows the following metrics by default, and you can create custom charts and dashboards depending on your needs:

  • P95 Latency by Level (Observations)
  • P95 Latency by Use Case
  • Max Latency by User ID (Traces)
  • Avg Time To First Token by Prompt Name (Observations)
  • P95 Time To First Token by Model
  • P95 Latency by Model
  • Avg Output Tokens Per Second by Model

Figure 7: Latency dashboard

Langfuse usage management

This dashboard shows metrics across traces, observations, and scores to manage resource allocation.

Figure 8: Usage management dashboard

Conclusion

This post demonstrated how to integrate Langfuse with AgentCore for comprehensive observability of AI agents. Users can now track performance, debug interactions, and optimize costs across workflows. We expect more Langfuse observability features and integration options in the future to help scale AI applications.

Start implementing Langfuse with AgentCore today to gain deeper insights into agents’ performance, track conversation flows, and optimize AI applications. For more information, visit the following resources:


About the authors

Richa Gupta is a Senior Solutions Architect at Amazon Web Services, specializing in AI/ML, Generative AI, and Agentic AI. She is passionate about helping customers on their AI transformation journey, architecting end-to-end solutions from proof of concept to production deployment and driving business revenue. Beyond her professional pursuits, Richa loves to make latte art and is an adventure enthusiast.

Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he partners with customers to architect innovative and responsible generative AI solutions. With deep expertise in AI and machine learning, Ishan leads the development of production Generative AI solutions at scale, with a focus on evaluations and observability. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife, kid, and dog, Beau.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Madhu Samhitha is a Specialist Solution Architect at Amazon Web Services, focused on helping customers implement generative AI solutions. She combines her knowledge of large language models with strategic innovation to deliver business value. She has a Master’s in Computer Science from the University of Massachusetts, Amherst and has worked in various industries. Beyond her technical role, Madhu is a trained classical dancer, an art enthusiast, and enjoys exploring national parks.

Marc Klingen is the co-founder and CEO of Langfuse, the Open Source LLM Engineering Platform. After building LLM Agents in 2023 together with his co-founders, Marc and team realized that new tooling is necessary to bring agents into production and scale them reliably. With Langfuse they have built the leading Open Source LLM Engineering Platform (Observability, Evaluation, Prompt Management) with over 18,000 GitHub stars, 14.8M+ SDK installs per month, and 6M+ Docker pulls. Langfuse is used by top engineering teams such as Khan Academy, Samsara, Twilio, and Merck.