Overview
Agents fail silently. Regressions slip through CI. Jailbreaks bypass system prompts. Bias probes on protected-class proxies surface in production — usually first as a regulator inquiry, audit exception, or incident. Until now, no productized agent-evaluation + red-team harness SKU has existed on AWS Marketplace, competitors sell one-shot pentest SOWs that leave a report, or SaaS tools you operate yourself.
The agentic category has matured faster than governance. Amazon Bedrock AgentCore Evaluations went GA March 2026, but native Bedrock + AgentCore Evaluations are starting points, not production harnesses. Multi-step agent trajectories (LangGraph, Bedrock Agents, Claude Agent SDK, AutoGen, CrewAI) need regression suites validating function-calling, tool-use, structured output schemas, and full trajectory shape — not just final-answer quality. Jailbreak + PI corpora aligned to OWASP Top 10 for LLM Applications (2024/2025 LLM01–LLM10) must evolve as adversary tactics shift. Bias probes with quantitative disparate-impact testing (AIR, statistical parity, equalized odds, calibration, counterfactual fairness) are explicit regulator requirements under NAIC Model Bulletin, CO Reg 10-1-1, NYDFS Circular 7, and EU AI Act Article 15. CI/CD integration is the missing link, without deployment gates, evaluation is theater.
Existing AWS Marketplace listings don't close this gap. Data Reply / Altimetrik / CrowdStrike / HackerOne sell one-shot pentest SOWs ($16K–$250K+). Eval-platform SaaS (LangSmith, Braintrust, Arize, Patronus, Promptfoo, Evidently) are tools you operate yourself. NIST AI RMF / ORCAA audits run $50K–$200K one-shot.
Harness components. Regression suite (100–1,000 test cases per agent; JSON-schema validation; function-calling + tool-use correctness; trajectory validation). Jailbreak corpus (OWASP LLM01–LLM10 + industry-specific prompts for healthcare / life sciences / FS; 50/200/500+ per tier). Bias probes (AIR, statistical parity, equalized odds, calibration, Cohen's d, counterfactuals; proxy detection; SageMaker Clarify). Output validation (JSON schema; toxicity; hallucination via grounding + retrieval-relevance; refusal-rate tracking). Incident playback (Enterprise), time-travel via OpenTelemetry + X-Ray. Eval dashboards (CloudWatch + QuickSight). CI/CD integration (CodePipeline / GitHub Actions / GitLab CI / Jenkins; gates with configurable thresholds). Cost management (per-run + AWS Budgets; LLM-judge tracked separately). Privacy-preserving test corpora (synthetic + de-identified per §164.514 Safe Harbor).
Reference architecture. Bedrock Model Evaluations + AgentCore Evaluations baseline. Kriv's harness on ECS / EKS runs regression, jailbreak, bias-probe jobs on scheduled + CI-triggered cadence. S3 Object Lock stores corpora. SageMaker Clarify for bias. Step Functions for multi-step eval. CodePipeline / CodeBuild wire CI/CD gates. Claude on Bedrock as LLM-judge. OpenTelemetry + X-Ray feed incident playback.
Week-by-week. W1 Scoping. W2 Regression suite + Bedrock baseline. W3 Jailbreak corpus + CI/CD gates, Foundation closes (30-day warranty). W4 Standard, bias probes + output validation + dashboards (45-day warranty). W5 Enterprise, incident playback + sibling integration. W6 Enterprise, regulated-industry bias probes (healthcare §164.514 + ACA §1557; FS fair-lending / ECOA / NAIC; life sciences FDA SaMD PCCP); 60-day hypercare.
Three tiers. Foundation $50K (4 wk; 1 agent; 100 regression cases; 50 OWASP adversarial prompts; basic Bedrock + AgentCore Evaluations; CI/CD gates; 30-day warranty) for AI-native Series B–E. Standard $85K (5 wk; up to 3 agents; 200 adversarial prompts; bias probes; output validation; dashboards; CI/CD with thresholds; 45-day warranty) for mid-sized multi-agent + SOC 2 Type II AI testing. Enterprise $125K (6 wk; up to 10 agents; 500+ adversarial prompts; regulated-industry bias probes; incident playback time-travel; N27 + N28 + N31 integration; 60-day hypercare) for regulated, G-SIB banks, top-25 payers + pharmas. Optional Extra Agent $20K each. Retainer upsell $8K–$15K/month for quarterly corpus refresh + new attack vectors + regulatory updates.
Important disclosures. Kriv does NOT develop Customer agents, harness tests Customer-authored agents. Does NOT operate harness post-deployment (unless Retainer). Issues no SOC 2 / HIPAA / HITRUST / ISO certifications. No legal / regulatory / compliance advice. Does NOT replace Customer's independent red-team function, ongoing cadence + responsible disclosure + zero-day response remain Customer's. No 100% adversarial-detection guarantee. No Bedrock API stability guarantee. AWS + Anthropic + Bedrock + LLM-judge consumption separate. No regulator-outcome guarantee. Anthropic CPN membership does not constitute endorsement.
Highlights
- First built-and-left-behind agent eval + red-team harness SKU on AWS Marketplace — regression + jailbreak + bias probes + CI/CD gates.** Data Reply / Altimetrik / CrowdStrike / HackerOne sell one-shot pentest SOWs ($16K–$250K+) that leave a report; LangSmith / Arize / Patronus / Promptfoo / Evidently are SaaS tools. N29 = implementation partner for **Bedrock AgentCore Evaluations (GA March 2026)** Kriv integrates and extends, does not compete.
- OWASP Top 10 for LLM Applications (2024/2025 LLM01–LLM10) adversarial corpus + industry-specific prompts (healthcare PHI-extraction / medical-misinformation; life sciences unapproved-indication / controlled-substance; financial services unlicensed-advice / fair-lending proxies / OFAC-adjacent) + bias probes (AIR / four-fifths / statistical parity / equalized odds / calibration / Cohen's d via SageMaker Clarify) + output validation (JSON schema + toxicity + hallucination + grounding).
- AWS Select + Anthropic CPN — 4–6 weeks, $50K Foundation (1 agent) / $85K Standard (3 agents + bias probes + dashboards) / $125K Enterprise (10 agents + incident playback time-travel debugging + sibling integration with N27 AgentCore + N28 Guardrails + N31 Observability) + $20K Extra Agent.** Optional Retainer upsell $8K–$15K/month for quarterly corpus refresh + new attack vectors + regulatory-citation updates, recurring-revenue motion post-implementation.
Details
Introducing multi-product solutions
You can now purchase comprehensive solutions tailored to use cases and industries.
Pricing
Custom pricing options
How can we make this page better?
Legal
Content disclaimer
Resources
Support
Vendor support
Primary contact. info@kriv.ai · +1-732-433-5564 · https://kriv.ai/support
Response SLA. First response within 2 US business days (Mon–Fri 9 am – 6 pm ET, ex-US federal holidays). Active engagements: Engagement Lead within 4 business hours weekdays. Post-incident (production regression, successful jailbreak, bias finding) or post-MRA/MRIA engagements compress to same business day.
Onboarding SLA. First customer contact within 2 US business days of buyer inquiry / private-offer acceptance. Kickoff within 1–2 weeks of SOW; 3–5 business days post-incident.
Escalation. (1) Engagement Lead (named in SOW) → (2) Practice Director (info@kriv.ai ) → (3) CEO Abhinav Dangri (info@kriv.ai ).
Communication. Dedicated Microsoft Teams channel; weekly 60-min video checkpoint; Friday written status. Customer SMEs 3–5 hrs/week (Head of AI Platform, Head of ML Engineering, VP Engineering, CISO, Head of Trust & Safety, CAIO, Head of SRE, VP Product).
Handoff. Word/Excel/PDF in customer secure share; regression suite + jailbreak corpus + bias probes as Git repo (Python / JSON / YAML); CI/CD integration templates as CodePipeline / GitHub Actions / GitLab CI / Jenkins YAML; eval dashboards as CloudWatch + QuickSight configs; incident playback harness as OpenTelemetry + X-Ray integration code.
Out of scope. Does NOT develop Customer agents. Does NOT operate harness post-deployment (unless Retainer). Issues no SOC 2 / HIPAA / HITRUST / ISO certifications. Does NOT replace Customer's independent red-team function. No 100% adversarial-detection guarantee. No Bedrock API stability guarantee. No regulator-outcome guarantee.
AWS + Anthropic-side billing. AWS infrastructure + Anthropic API + Bedrock Claude consumption (incl. LLM-judge) separate.
Holiday coverage. Closed on US federal holidays.