
    Agent Observability & SLO Framework on Amazon CloudWatch GenAI

    Sold by: Kriv AI 
    Kriv AI instruments your Claude-on-Bedrock agents with OpenTelemetry + AWS X-Ray + Amazon CloudWatch GenAI Observability. Defines SRE-grade Service Level Objectives (latency p95 / p99, success rate, tool-use correctness, grounding score, cost-per-successful-invocation) with error budgets. Alerts to PagerDuty / Opsgenie / Jira / ServiceNow / Slack / Microsoft Teams. Runbook library + blameless post-mortem templates (Google SRE format). Hybrid engagement: $40K 3-week Implementation + optional $8K / $14K / $20K monthly Managed-Service retainer (up to 3 / 10 / unlimited agents). 12-month minimum term on retainer. Enterprise tier adds 0.25 FTE dedicated Kriv SRE engineer embedded with Customer SRE team + regulated-industry evidence (SOC 2 CC7 + SR 11-7 + EU AI Act Article 72 + HIPAA §164.308(a)(1)(ii)(D)). AWS Select + Databricks + Anthropic CPN.

    Overview

    Agents fail differently than microservices. Silent hallucinations, tool-use errors, grounding drift, and runaway token cost slip past traditional APM. Until now, no standalone agent-observability + SLO retainer SKU existed on AWS Marketplace: IBM Instana + SoftServe bundle inside platforms; Datadog / New Relic / Dynatrace sell tools, not retainers; Arize / LangSmith / Braintrust are SaaS-first and not CloudWatch-native.

    Amazon CloudWatch GenAI Observability (launched late 2025) and Bedrock AgentCore Observability (GA 2026) give AWS Customers native capability, but without OpenTelemetry instrumentation, SRE-grade SLO definition, alert wiring, and a runbook + post-mortem culture, agents run blind. SRE for agents is a new discipline. Multi-step traces across tool calls + model invocations + MCP servers are non-trivial to propagate. Non-deterministic workloads require custom metrics (tool-use correctness, grounding score, hallucination rate, refusal rate, trajectory length) that traditional APM doesn't capture. SLOs must account for probabilistic failure modes. Cost-per-invocation must be instrumented because token-consumption drift is a silent failure mode. Agent-specific runbooks + blameless post-mortems are required. IBM / SoftServe bundle inside platforms; Kriv sells the standalone retained service, with a 6–9 month first-mover window.
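    To make the error-budget idea concrete, here is a minimal Python sketch of how a budget could be computed for one SLO window. The `SLO` class, the `error_budget_report` function, the 99% target, and the event counts are all hypothetical illustrations; the listing does not specify Kriv's actual SLO service or targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO definition: a name and a target fraction of good events."""
    name: str
    target: float  # e.g. 0.99 means 99% of invocations must be good


def error_budget_report(slo: SLO, total: int, bad: int) -> dict:
    """Summarise error-budget consumption for one SLO window."""
    if not 0.0 < slo.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total <= 0 or bad < 0 or bad > total:
        raise ValueError("need total > 0 and 0 <= bad <= total")
    allowed_bad = (1.0 - slo.target) * total  # the budget, in events
    consumed = bad / allowed_bad              # 1.0 means the budget is spent
    return {
        "slo": slo.name,
        "attained": (total - bad) / total,
        "allowed_bad": allowed_bad,
        "budget_consumed": consumed,
        "budget_remaining": max(0.0, 1.0 - consumed),
        "breached": bad > allowed_bad,
    }


# Assumed example: 20,000 invocations in the window, 150 judged incorrect.
report = error_budget_report(SLO("tool-use correctness", 0.99), total=20_000, bad=150)
print(report)
```

    With these assumed numbers the budget allows about 200 bad invocations, so 150 failures consume roughly three quarters of it without breaching the SLO.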

    Existing AWS Marketplace listings don't close this gap. IBM Instana + IBM Consulting Advantage bundle inside platform engagements. SoftServe bundles inside modernization SOWs. Slalom / Caylent / Mission Cloud / Rackspace sell general observability at $75K–$300K. Big-4 off-Marketplace $250K–$2M. SaaS observability (Datadog LLM Observability, New Relic AI Monitoring, Dynatrace, Arize AI, LangSmith, Braintrust, Langfuse, Fiddler, WhyLabs, Evidently AI) are subscription tools Customer operates.

    Observability components. OpenTelemetry across Bedrock Agents, Claude Agent SDK, LangGraph, AutoGen, CrewAI, MCP servers (W3C Trace Context). X-Ray distributed tracing + service map. CloudWatch GenAI Observability native Bedrock telemetry. Bedrock Model Invocation Logging → CloudWatch Logs + S3. Custom metrics: tool-use correctness, grounding score, hallucination rate, refusal rate, trajectory length, P50/P95/P99 per-tool-call latency, cost-per-invocation, MCP-call success rate. SLO definitions with error budget tracking. RED + USE golden signals per agent. Dashboards in CloudWatch + QuickSight. Alerts via CloudWatch Alarms → EventBridge → SNS → PagerDuty / Opsgenie / Jira / ServiceNow / Slack / Microsoft Teams. Runbooks for silent hallucination, tool-use failure cascade, grounding collapse, cost spike, refusal-rate spike. Blameless post-mortem templates (Google SRE). FinOps via Bedrock Model Invocation Logging + CloudWatch + QuickSight.
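    Two of the custom metrics named above, latency percentiles and cost-per-successful-invocation, can be sketched in a few lines of Python. This is an illustration only: the `Invocation` record, the nearest-rank percentile method, and the sample data are assumptions, not the definitions Kriv ships.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """Hypothetical per-invocation record assembled from traces and logs."""
    latency_ms: float
    cost_usd: float
    succeeded: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_successful_invocation(invocations):
    """Total spend (failed calls included) divided by the number of successes."""
    successes = sum(1 for inv in invocations if inv.succeeded)
    if successes == 0:
        return math.inf  # all spend was wasted
    return sum(inv.cost_usd for inv in invocations) / successes


# Assumed sample: ten invocations at $0.02 each, two of which failed.
samples = [
    Invocation(latency_ms=100.0 * n, cost_usd=0.02, succeeded=(n % 5 != 0))
    for n in range(1, 11)
]
latencies = [inv.latency_ms for inv in samples]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
cpsi = cost_per_successful_invocation(samples)
print(p50, p95, round(cpsi, 4))
```

    Dividing total spend by successes rather than by all calls is what makes the metric sensitive to silent failures: failed invocations still cost tokens, so a falling success rate pushes the figure up even when per-call cost is flat.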

    Reference architecture. OpenTelemetry SDK instruments Customer agents → AWS Distro for OpenTelemetry (ADOT) Collector on ECS / EKS → Traces to X-Ray, Metrics to CloudWatch, Logs to CloudWatch Logs + S3 → Dashboards in CloudWatch + QuickSight → Alarms via EventBridge + SNS → SLO tracker via CloudWatch Synthetics + custom SLO service in Terraform / CDK → FinOps via Cost Explorer + Bedrock logging.
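    The trace propagation step in this pipeline relies on the W3C Trace Context `traceparent` header (version, 16-byte trace-id, 8-byte parent-id, and flags, all lowercase hex). The sketch below shows the header format using only the Python standard library. The function names are hypothetical, and a real deployment would normally rely on the OpenTelemetry SDK's propagators rather than hand-rolled code like this.

```python
import re
import secrets

# Version 00 header: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random trace-id and parent-id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the trace-id and flags; mint a fresh span id for the outgoing hop."""
    match = _TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# An agent step receives `root` and forwards `child` on its tool or MCP call.
root = new_traceparent()
child = child_traceparent(root)
print(root)
print(child)
```

    Because every hop keeps the same trace-id, the agent step, its tool calls, and any MCP server calls can be joined into a single trace downstream.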

    Implementation schedule (3 weeks).

    Week 1: Scoping + SLO workshops (agent inventory; SLO definition with SRE + AI Platform + Product + Customer Success).

    Week 2: OpenTelemetry + X-Ray + CloudWatch + QuickSight (OTel SDK instrumentation; ADOT Collector; X-Ray; GenAI Observability wiring; custom metrics; dashboards).

    Week 3: Alerts + runbooks + post-mortem templates + handoff.

    Three retainer tiers (opt-in monthly). Essentials $8K/mo (up to 3 agents; monthly SLO review; dashboard tuning; quarterly runbook refresh; 8×5 email support) for AI-native Series B–E companies. Standard $14K/mo (up to 10 agents; biweekly SLO review; 24×5 on-call response, not primary; monthly post-mortem co-authorship; CI/CD alert tuning) for mid-sized companies with production SRE. Enterprise $20K/mo (unlimited agents; 0.25 FTE dedicated Kriv SRE engineer embedded; quarterly SLO audit; regulated-industry evidence: SOC 2 CC7 + SR 11-7 + EU AI Act Article 72 + HIPAA §164.308(a)(1)(ii)(D); quarterly executive readout) for regulated industries: G-SIB banks, top-25 payers + pharmas. Optional Extra Agent Implementation $12K.
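    As a rough worked example of the list-price arithmetic, the Python sketch below picks the smallest tier that covers a given agent count and adds the $40K Implementation to the 12-month minimum retainer. The function names are hypothetical, and the calculation assumes list prices only: it ignores the Extra Agent Implementation fee, private-offer pricing, and separately billed AWS + Anthropic consumption.

```python
IMPLEMENTATION_USD = 40_000
MINIMUM_TERM_MONTHS = 12

# (tier name, monthly fee in USD, agent cap; None means unlimited)
TIERS = [
    ("Essentials", 8_000, 3),
    ("Standard", 14_000, 10),
    ("Enterprise", 20_000, None),
]


def smallest_tier_for(agent_count: int):
    """Return (name, monthly fee) of the cheapest tier whose cap covers agent_count."""
    if agent_count < 1:
        raise ValueError("agent_count must be at least 1")
    for name, monthly, cap in TIERS:
        if cap is None or agent_count <= cap:
            return name, monthly
    raise AssertionError("unreachable: the last tier is uncapped")


def first_year_list_price(agent_count: int) -> int:
    """Implementation fee plus the minimum-term retainer, at list price."""
    _, monthly = smallest_tier_for(agent_count)
    return IMPLEMENTATION_USD + MINIMUM_TERM_MONTHS * monthly


for agents in (3, 7, 25):
    tier, _ = smallest_tier_for(agents)
    print(agents, tier, first_year_list_price(agents))
```

    Under these assumptions the first-year totals come to $136K, $208K, and $280K for Essentials, Standard, and Enterprise respectively. A Customer may of course choose a higher tier than the agent count requires, for example to obtain the Enterprise embed.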

    Important disclosures. Kriv does NOT develop Customer agents; it instruments them. Retainer is opt-in. Kriv issues no SOC 2 / HIPAA / HITRUST / ISO / 42001 certifications. No legal / regulatory / compliance advice. No SLO-attainment guarantee; SLOs are Customer commitments. No primary-on-call responsibility at Enterprise (the dedicated engineer is an embed, not a replacement). No CloudWatch / X-Ray / Bedrock / OTel API stability guarantee. AWS + Anthropic + Bedrock consumption is billed separately. Anthropic CPN membership does not constitute endorsement.

    Highlights

    • First standalone agent-observability + SLO retainer SKU on AWS Marketplace: hybrid $40K 3-week Implementation + optional $8K / $14K / $20K monthly retainer (12-month minimum). Agents fail differently than microservices: silent hallucinations, tool-use errors, grounding drift, and runaway token cost slip past traditional APM. IBM / SoftServe bundle inside platforms; Datadog / New Relic / Dynatrace sell tools. Kriv is the retained implementation-and-operation partner.
    • OpenTelemetry + AWS X-Ray + Amazon CloudWatch GenAI Observability + custom metrics (tool-use correctness + grounding score + hallucination rate + refusal rate + trajectory length + cost-per-successful-invocation) + SLO + error budgets + RED/USE golden signals + runbook library + blameless post-mortem templates (Google SRE format). Alerts to PagerDuty / Opsgenie / Jira / ServiceNow / Slack / Microsoft Teams via CloudWatch Alarms + EventBridge + SNS.
    • AWS Select + Databricks + Anthropic CPN — Enterprise tier adds 0.25 FTE dedicated Kriv SRE engineer embedded with Customer SRE team + regulated-industry evidence (SOC 2 CC7 + SR 11-7 ongoing monitoring + EU AI Act Article 72 + HIPAA §164.308(a)(1)(ii)(D)). Feeds N24 MRM evidence, N27 AgentCore telemetry, N28 Guardrails violation events, N29 Red-Team Harness regression signals, N30 Model Drift baselines. 5 dimensions: 1 Implementation + 3 retainer tiers + Extra Agent.

    Details

    Deployed on AWS

    Pricing

    Custom pricing options

    Pricing is based on your specific requirements and eligibility. To get a custom quote for your needs, request a private offer.

    Legal

    Content disclaimer

    Vendors are responsible for their product descriptions and other product content. AWS does not warrant that vendors' product descriptions or other product content are accurate, complete, reliable, current, or error-free.

    Support

    Vendor support

    Primary contact. info@kriv.ai · +1-732-433-5564 · https://kriv.ai/support

    Response SLA. Baseline 2 US business days (Mon–Fri 9 am – 6 pm ET, excluding US federal holidays). Standard retainer: 4-hr Sev 1 SLA, 24×5. Enterprise: priority + 0.25 FTE dedicated SRE engineer. Post-incident, the response window compresses to the same business day.
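    For readers who want to see how a "2 US business days" window plays out on a calendar, here is a small Python sketch. It is an approximation under stated assumptions: the `add_business_days` helper is hypothetical, it works at whole-day granularity (it ignores the 9 am – 6 pm ET cut-offs), and the holiday set must be supplied by the caller.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int, holidays=frozenset()) -> date:
    """Step forward day by day, counting only Mon-Fri dates that are not holidays."""
    if days < 0:
        raise ValueError("days must be non-negative")
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


# Assumed example: an inquiry on Thursday 3 July 2025, with Independence Day
# (Friday 4 July 2025) passed in as a holiday.
us_holidays = frozenset({date(2025, 7, 4)})
due = add_business_days(date(2025, 7, 3), 2, us_holidays)
print(due)  # 2025-07-08
```

    In this example the holiday and the weekend are skipped, so the two-business-day window closes on Tuesday 8 July 2025.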

    Onboarding SLA. First customer contact within 2 US business days of buyer inquiry / private-offer acceptance. Implementation kickoff within 1–2 weeks of SOW; 3–5 business days post-incident.

    Escalation. (1) Implementation Lead or Retainer Engineer (named in SOW) → (2) Practice Director (info@kriv.ai) → (3) CEO Abhinav Dangri (info@kriv.ai).

    Communication. Dedicated Microsoft Teams channel; Essentials: monthly SLO review; Standard: biweekly SLO review + monthly post-mortem + 24×5 on-call response; Enterprise: 0.25 FTE embed + quarterly SLO audit + annual program review + quarterly executive readout.

    Handoff. Word/Excel/PDF in customer secure share; OTel instrumentation + ADOT Collector config + custom metrics as Git repo (Python / TypeScript / JSON / YAML); dashboards as CloudFormation; runbooks + post-mortem templates as Markdown + Word; regulated-industry evidence (Enterprise) as Excel indexed to control IDs.

    Out of scope. Does NOT develop Customer agents. Retainer is opt-in. Issues no certifications. No legal / regulatory / compliance advice. No SLO-attainment guarantee. No primary-on-call responsibility at Enterprise (embed, not replacement). No AWS / OTel API stability guarantee.

    AWS + Anthropic-side billing. AWS infrastructure + Anthropic API + Bedrock Claude consumption separate.

    Holiday coverage. Closed on US federal holidays.
