
    Agent Evaluation & Red-Team Harness on Bedrock - Regression + Jailbreak

     Info
    Sold by: Kriv AI 
    Kriv AI builds a continuous agent quality + safety testing harness on Amazon Bedrock. Scope: regression suite (100–1,000 golden test cases per agent), jailbreak corpus aligned to OWASP Top 10 for LLM Applications (2024/2025 — LLM01–LLM10), bias probes (disparate-impact testing via AIR / four-fifths / statistical parity / equalized odds / calibration / Cohen's d + protected-class proxy detection via SageMaker Clarify), output validation (JSON schema + toxicity + hallucination + grounding), incident playback (time-travel debugging via OpenTelemetry + X-Ray), CI/CD integration (CodePipeline / GitHub Actions / GitLab CI / Jenkins), eval dashboards (CloudWatch + QuickSight). Integrates with Bedrock Model Evaluations + AgentCore Evaluations (GA March 2026). Three tiers: $50K / $85K / $125K + $20K Extra Agent. Optional $8K–$15K/month Retainer. AWS Select + Anthropic CPN.

    Overview

    Agents fail silently. Regressions slip through CI. Jailbreaks bypass system prompts. Bias against protected-class proxies surfaces in production — usually first as a regulator inquiry, audit exception, or incident. Until now, no productized agent-evaluation + red-team harness SKU has existed on AWS Marketplace: competitors sell one-shot pentest SOWs that leave a report, or SaaS tools you operate yourself.

    The agentic category has matured faster than governance. Amazon Bedrock AgentCore Evaluations went GA March 2026, but native Bedrock + AgentCore Evaluations are starting points, not production harnesses. Multi-step agent trajectories (LangGraph, Bedrock Agents, Claude Agent SDK, AutoGen, CrewAI) need regression suites validating function-calling, tool-use, structured output schemas, and full trajectory shape — not just final-answer quality. Jailbreak + prompt-injection corpora aligned to OWASP Top 10 for LLM Applications (2024/2025 LLM01–LLM10) must evolve as adversary tactics shift. Bias probes with quantitative disparate-impact testing (AIR, statistical parity, equalized odds, calibration, counterfactual fairness) are explicit regulator requirements under NAIC Model Bulletin, CO Reg 10-1-1, NYDFS Circular 7, and EU AI Act Article 15. CI/CD integration is the missing link: without deployment gates, evaluation is theater.
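    The four-fifths test mentioned above can be made concrete. A minimal sketch, assuming binary favorable/unfavorable outcomes per group — the group data, counts, and function names are illustrative, not part of Kriv's harness:

```python
# Hypothetical sketch: adverse-impact ratio (AIR) under the four-fifths rule.
# AIR < 0.8 is the conventional flag for disparate impact.

def selection_rate(outcomes):
    """Fraction of favorable outcomes (1 = favorable, 0 = unfavorable)."""
    return sum(outcomes) / len(outcomes)

def adverse_impact_ratio(protected, reference):
    """AIR = protected-group selection rate / reference-group selection rate."""
    return selection_rate(protected) / selection_rate(reference)

# Example: agent approves 30/50 protected-group cases vs 45/50 reference cases.
protected = [1] * 30 + [0] * 20   # selection rate 0.60
reference = [1] * 45 + [0] * 5    # selection rate 0.90
air = adverse_impact_ratio(protected, reference)
print(f"AIR = {air:.3f}")                 # 0.60 / 0.90 = 0.667
print("four-fifths pass:", air >= 0.8)    # False -> flag for review
```

    A production probe would compute this per protected attribute and per detected proxy, alongside the statistical-parity and equalized-odds metrics listed above.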

    Existing AWS Marketplace listings don't close this gap. Data Reply / Altimetrik / CrowdStrike / HackerOne sell one-shot pentest SOWs ($16K–$250K+). Eval-platform SaaS (LangSmith, Braintrust, Arize, Patronus, Promptfoo, Evidently) are tools you operate yourself. NIST AI RMF / ORCAA audits run $50K–$200K one-shot.

    Harness components. Regression suite (100–1,000 test cases per agent; JSON-schema validation; function-calling + tool-use correctness; trajectory validation). Jailbreak corpus (OWASP LLM01–LLM10 + industry-specific prompts for healthcare / life sciences / FS; 50/200/500+ per tier). Bias probes (AIR, statistical parity, equalized odds, calibration, Cohen's d, counterfactuals; proxy detection; SageMaker Clarify). Output validation (JSON schema; toxicity; hallucination via grounding + retrieval-relevance; refusal-rate tracking). Incident playback (Enterprise): time-travel debugging via OpenTelemetry + X-Ray. Eval dashboards (CloudWatch + QuickSight). CI/CD integration (CodePipeline / GitHub Actions / GitLab CI / Jenkins; gates with configurable thresholds). Cost management (per-run + AWS Budgets; LLM-judge tracked separately). Privacy-preserving test corpora (synthetic + de-identified per §164.514 Safe Harbor).
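    As one illustration of the output-validation component, a minimal structured-output check might look like the following. The field names and types are hypothetical; a production harness would use a full JSON Schema validator rather than this hand-rolled check:

```python
import json

# Sketch of a structured-output gate: verify an agent's reply parses as JSON
# and carries the required fields with the required types.
REQUIRED_FIELDS = {"answer": str, "citations": list, "confidence": float}

def validate_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    errors = []
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in obj:
            errors.append(f"missing field: {field}")
        elif not isinstance(obj[field], ftype):
            errors.append(f"wrong type for {field}: {type(obj[field]).__name__}")
    return errors

good = '{"answer": "42", "citations": ["doc-1"], "confidence": 0.9}'
bad  = '{"answer": "42"}'
print(validate_output(good))  # []
print(validate_output(bad))   # ['missing field: citations', 'missing field: confidence']
```

    The same pattern extends to the toxicity, grounding, and refusal-rate checks: each validator returns a list of violations that feeds the gate thresholds.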

    Reference architecture. Bedrock Model Evaluations + AgentCore Evaluations baseline. Kriv's harness on ECS / EKS runs regression, jailbreak, bias-probe jobs on scheduled + CI-triggered cadence. S3 Object Lock stores corpora. SageMaker Clarify for bias. Step Functions for multi-step eval. CodePipeline / CodeBuild wire CI/CD gates. Claude on Bedrock as LLM-judge. OpenTelemetry + X-Ray feed incident playback.
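    The CI/CD gate in this architecture reduces to a threshold check whose exit code fails the pipeline stage. A sketch, with metric names and threshold values that are hypothetical defaults rather than Kriv's actual configuration:

```python
# Hypothetical CI/CD evaluation gate: the build step runs the eval suite,
# then exits nonzero on any threshold breach, failing the
# CodePipeline / GitHub Actions / GitLab CI / Jenkins stage.
THRESHOLDS = {
    "regression_pass_rate": 0.98,   # min fraction of golden cases passing
    "jailbreak_block_rate": 0.95,   # min fraction of adversarial prompts blocked
    "schema_valid_rate": 0.99,      # min fraction of outputs matching schema
}

def gate(metrics: dict) -> int:
    """Return 0 (pass) or 1 (fail) for use as a CI exit code."""
    failures = [
        f"{name}: {metrics.get(name, 0.0):.3f} < {floor:.3f}"
        for name, floor in THRESHOLDS.items()
        if metrics.get(name, 0.0) < floor
    ]
    for line in failures:
        print("GATE FAIL:", line)
    return 1 if failures else 0

# Example run: jailbreak block rate below threshold -> pipeline fails.
code = gate({"regression_pass_rate": 0.99,
             "jailbreak_block_rate": 0.93,
             "schema_valid_rate": 1.00})
print("exit code:", code)  # 1
```

    Making thresholds configuration (per environment, per agent) rather than code is what lets the same gate serve dev, staging, and production with different strictness.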

    Week-by-week. W1: scoping. W2: regression suite + Bedrock baseline. W3: jailbreak corpus + CI/CD gates; Foundation closes (30-day warranty). W4 (Standard): bias probes + output validation + dashboards (45-day warranty). W5 (Enterprise): incident playback + sibling integration. W6 (Enterprise): regulated-industry bias probes (healthcare §164.514 + ACA §1557; FS fair-lending / ECOA / NAIC; life sciences FDA SaMD PCCP); 60-day hypercare.

    Three tiers. Foundation $50K (4 wk; 1 agent; 100 regression cases; 50 OWASP adversarial prompts; basic Bedrock + AgentCore Evaluations; CI/CD gates; 30-day warranty) for AI-native Series B–E. Standard $85K (5 wk; up to 3 agents; 200 adversarial prompts; bias probes; output validation; dashboards; CI/CD with thresholds; 45-day warranty) for mid-sized multi-agent + SOC 2 Type II AI testing. Enterprise $125K (6 wk; up to 10 agents; 500+ adversarial prompts; regulated-industry bias probes; incident playback time-travel; N27 + N28 + N31 integration; 60-day hypercare) for regulated enterprises: G-SIB banks, top-25 payers + pharmas. Optional Extra Agent $20K each. Retainer upsell $8K–$15K/month for quarterly corpus refresh + new attack vectors + regulatory updates.

    Important disclosures. Kriv does NOT develop Customer agents; the harness tests Customer-authored agents. Kriv does NOT operate the harness post-deployment (unless a Retainer is in place). Kriv issues no SOC 2 / HIPAA / HITRUST / ISO certifications and provides no legal / regulatory / compliance advice. The harness does NOT replace the Customer's independent red-team function; ongoing cadence, responsible disclosure, and zero-day response remain the Customer's responsibility. No 100% adversarial-detection guarantee. No Bedrock API stability guarantee. AWS + Anthropic + Bedrock + LLM-judge consumption is billed separately. No regulator-outcome guarantee. Anthropic CPN membership does not constitute endorsement.

    Highlights

    • First built-and-left-behind agent eval + red-team harness SKU on AWS Marketplace — regression + jailbreak + bias probes + CI/CD gates. Data Reply / Altimetrik / CrowdStrike / HackerOne sell one-shot pentest SOWs ($16K–$250K+) that leave a report; LangSmith / Arize / Patronus / Promptfoo / Evidently are SaaS tools you operate yourself. Kriv is an implementation partner for Bedrock AgentCore Evaluations (GA March 2026): it integrates and extends the native service, it does not compete with it.
    • OWASP Top 10 for LLM Applications (2024/2025 LLM01–LLM10) adversarial corpus + industry-specific prompts (healthcare PHI-extraction / medical-misinformation; life sciences unapproved-indication / controlled-substance; financial services unlicensed-advice / fair-lending proxies / OFAC-adjacent) + bias probes (AIR / four-fifths / statistical parity / equalized odds / calibration / Cohen's d via SageMaker Clarify) + output validation (JSON schema + toxicity + hallucination + grounding).
    • AWS Select + Anthropic CPN — 4–6 weeks, $50K Foundation (1 agent) / $85K Standard (3 agents + bias probes + dashboards) / $125K Enterprise (10 agents + incident playback time-travel debugging + sibling integration with N27 AgentCore + N28 Guardrails + N31 Observability) + $20K Extra Agent. Optional Retainer upsell $8K–$15K/month for quarterly corpus refresh + new attack vectors + regulatory-citation updates; a recurring-revenue motion post-implementation.

    Details

    Sold by

    Delivery method

    Deployed on AWS

    Pricing

    Custom pricing options

    Pricing is based on your specific requirements and eligibility. To get a custom quote for your needs, request a private offer.


    Legal

    Content disclaimer

    Vendors are responsible for their product descriptions and other product content. AWS does not warrant that vendors' product descriptions or other product content are accurate, complete, reliable, current, or error-free.

    Support

    Vendor support

    Primary contact. info@kriv.ai · +1-732-433-5564 · https://kriv.ai/support

    Response SLA. First response within 2 US business days (Mon–Fri 9 am–6 pm ET, excluding US federal holidays). Active engagements: Engagement Lead responds within 4 business hours on weekdays. Post-incident (production regression, successful jailbreak, bias finding) or post-MRA/MRIA engagements compress to same-business-day response.

    Onboarding SLA. First customer contact within 2 US business days of buyer inquiry / private-offer acceptance. Kickoff within 1–2 weeks of SOW; 3–5 business days post-incident.

    Escalation. (1) Engagement Lead (named in SOW) → (2) Practice Director (info@kriv.ai) → (3) CEO Abhinav Dangri (info@kriv.ai).

    Communication. Dedicated Microsoft Teams channel; weekly 60-min video checkpoint; Friday written status. Customer SMEs 3–5 hrs/week (Head of AI Platform, Head of ML Engineering, VP Engineering, CISO, Head of Trust & Safety, CAIO, Head of SRE, VP Product).

    Handoff. Word/Excel/PDF in customer secure share; regression suite + jailbreak corpus + bias probes as Git repo (Python / JSON / YAML); CI/CD integration templates as CodePipeline / GitHub Actions / GitLab CI / Jenkins YAML; eval dashboards as CloudWatch + QuickSight configs; incident playback harness as OpenTelemetry + X-Ray integration code.

    Out of scope. Does NOT develop Customer agents. Does NOT operate harness post-deployment (unless Retainer). Issues no SOC 2 / HIPAA / HITRUST / ISO certifications. Does NOT replace Customer's independent red-team function. No 100% adversarial-detection guarantee. No Bedrock API stability guarantee. No regulator-outcome guarantee.

    AWS + Anthropic-side billing. AWS infrastructure + Anthropic API + Bedrock Claude consumption (incl. LLM-judge) separate.

    Holiday coverage. Closed on US federal holidays.