Overview
Agents fail silently. Regressions slip through CI. Jailbreaks bypass system prompts. Bias probes on protected-class proxies surface in production — usually first as a regulator inquiry, audit exception, or incident. Until now, no productized agent-evaluation + red-team harness SKU has existed on AWS Marketplace, competitors sell one-shot pentest SOWs that leave a report, or SaaS tools you operate yourself.
The agentic category has outpaced governance.** Amazon Bedrock AgentCore Evaluations (GA March 2026)** are baseline tools, not production-grade test harnesses. Multi-step agents (LangGraph, Bedrock Agents, AutoGen, CrewAI) need full regression suites for tool-use, function calls, outputs, and trajectory validation—not just final answers. Jailbreak/PII testing aligned to OWASP LLM Top 10 (2024/2025) must continuously adapt. Bias testing (AIR, parity, equalized odds, calibration) is required under NAIC, NYDFS, CO Reg 10-1-1, and EU AI Act. Without CI/CD gating, evaluations are non-operational.
Existing AWS Marketplace listings don't close this gap. Data Reply / Altimetrik / CrowdStrike / HackerOne sell one-shot pentest SOWs ($16K–$250K+). Eval-platform SaaS (LangSmith, Braintrust, Arize, Patronus, Promptfoo, Evidently) are tools you operate yourself. NIST AI RMF / ORCAA audits run $50K–$200K one-shot.
Harness components. Regression suite (100–1,000 test cases per agent; JSON-schema validation; function-calling + tool-use correctness; trajectory validation). Jailbreak corpus (OWASP LLM01–LLM10 + industry-specific prompts for healthcare / life sciences / FS; 50/200/500+ per tier). Bias probes (AIR, statistical parity, equalized odds, calibration, Cohen's d, counterfactuals; proxy detection; SageMaker Clarify). Output validation (JSON schema; toxicity; hallucination via grounding + retrieval-relevance; refusal-rate tracking). Incident playback (Enterprise), time-travel via OpenTelemetry + X-Ray. Eval dashboards (CloudWatch + QuickSight). CI/CD integration (CodePipeline / GitHub Actions / GitLab CI / Jenkins; gates with configurable thresholds). Cost management (per-run + AWS Budgets; LLM-judge tracked separately). Privacy-preserving test corpora (synthetic + de-identified per §164.514 Safe Harbor).
Reference architecture. Bedrock Model Evaluations + AgentCore Evaluations baseline. Kriv's harness on ECS / EKS runs regression, jailbreak, bias-probe jobs on scheduled + CI-triggered cadence. S3 Object Lock stores corpora. SageMaker Clarify for bias. Step Functions for multi-step eval. CodePipeline / CodeBuild wire CI/CD gates. Claude on Bedrock as LLM-judge. OpenTelemetry + X-Ray feed incident playback.
Week-by-week. W1 Scoping. W2 Regression suite + Bedrock baseline. W3 Jailbreak corpus + CI/CD gates, Foundation closes (30-day warranty). W4 Standard, bias probes + output validation + dashboards (45-day warranty). W5 Enterprise, incident playback + sibling integration. W6 Enterprise, regulated-industry bias probes (healthcare §164.514 + ACA §1557; FS fair-lending / ECOA / NAIC; life sciences FDA SaMD PCCP); 60-day hypercare.
Three tiers. Foundation $50K (4 wk; 1 agent; 100 regression cases; 50 OWASP adversarial prompts; basic Bedrock + AgentCore Evaluations; CI/CD gates; 30-day warranty) for AI-native Series B–E. Standard $85K (5 wk; up to 3 agents; 200 adversarial prompts; bias probes; output validation; dashboards; CI/CD with thresholds; 45-day warranty) for mid-sized multi-agent + SOC 2 Type II AI testing. Enterprise $125K (6 wk; up to 10 agents; 500+ adversarial prompts; regulated-industry bias probes; incident playback time-travel; N27 + N28 + N31 integration; 60-day hypercare) for regulated, G-SIB banks, top-25 payers + pharmas. Optional Extra Agent $20K each. Retainer upsell $8K–$15K/month for quarterly corpus refresh + new attack vectors + regulatory updates. EDP-eligible — harness build fees ($50K–$125K) count toward your AWS Enterprise Discount Program commitment (up to 25%). AWS + Anthropic + Bedrock LLM-judge consumption are billed separately. Optional Retainer ($8K–$15K/month) also EDP-eligible. Contact info@kriv.ai to structure a private offer against your EDP or PPA.
Important disclosures. Kriv does NOT develop Customer agents, harness tests Customer-authored agents. Does NOT operate harness post-deployment (unless Retainer). Issues no SOC 2 / HIPAA / HITRUST / ISO certifications. No legal / regulatory / compliance advice. Does NOT replace Customer's independent red-team function, ongoing cadence + responsible disclosure + zero-day response remain Customer's. No 100% adversarial-detection guarantee. No Bedrock API stability guarantee. AWS + Anthropic + Bedrock + LLM-judge consumption separate. No regulator-outcome guarantee. Anthropic CPN membership does not constitute endorsement.
Highlights
- First built-and-left-behind agent eval + red-team harness SKU on AWS Marketplace — regression + jailbreak + bias probes + CI/CD gates.** Data Reply / Altimetrik / CrowdStrike / HackerOne sell one-shot pentest SOWs ($16K–$250K+) that leave a report; LangSmith / Arize / Patronus / Promptfoo / Evidently are SaaS tools. N29 = implementation partner for **Bedrock AgentCore Evaluations (GA March 2026)** Kriv integrates and extends, does not compete.
- OWASP Top 10 for LLM Applications (2024/2025 LLM01–LLM10) adversarial corpus + industry-specific prompts (healthcare PHI-extraction / medical-misinformation; life sciences unapproved-indication / controlled-substance; financial services unlicensed-advice / fair-lending proxies / OFAC-adjacent) + bias probes (AIR / four-fifths / statistical parity / equalized odds / calibration / Cohen's d via SageMaker Clarify) + output validation (JSON schema + toxicity + hallucination + grounding).
- AWS Select + Anthropic CPN — 4–6 weeks, $50K Foundation (1 agent) / $85K Standard (3 agents + bias probes + dashboards) / $125K Enterprise (10 agents + incident playback time-travel debugging + sibling integration with N27 AgentCore + N28 Guardrails + N31 Observability) + $20K Extra Agent.** Optional Retainer upsell $8K–$15K/month for quarterly corpus refresh + new attack vectors + regulatory-citation updates, recurring-revenue motion post-implementation.
Details
Introducing multi-product solutions
You can now purchase comprehensive solutions tailored to use cases and industries.
Pricing
Custom pricing options
How can we make this page better?
Legal
Content disclaimer
Resources
Support
Vendor support
Primary contact. info@kriv.ai · +1-732-433-5564 · https://kriv.ai/support
Response SLA. First response within 2 US business days (Mon–Fri 9 am – 6 pm ET, ex-US federal holidays). Active engagements: Engagement Lead within 4 business hours weekdays. Post-incident (production regression, successful jailbreak, bias finding) or post-MRA/MRIA engagements compress to same business day.
Onboarding SLA. First customer contact within 2 US business days of buyer inquiry / private-offer acceptance. Kickoff within 1–2 weeks of SOW; 3–5 business days post-incident.
Escalation. (1) Engagement Lead (named in SOW) → (2) Practice Director (info@kriv.ai ) → (3) CEO Abhinav Dangri (info@kriv.ai ).
Communication. Dedicated Microsoft Teams channel; weekly 60-min video checkpoint; Friday written status. Customer SMEs 3–5 hrs/week (Head of AI Platform, Head of ML Engineering, VP Engineering, CISO, Head of Trust & Safety, CAIO, Head of SRE, VP Product).
Handoff. Word/Excel/PDF in customer secure share; regression suite + jailbreak corpus + bias probes as Git repo (Python / JSON / YAML); CI/CD integration templates as CodePipeline / GitHub Actions / GitLab CI / Jenkins YAML; eval dashboards as CloudWatch + QuickSight configs; incident playback harness as OpenTelemetry + X-Ray integration code.
Out of scope. Does NOT develop Customer agents. Does NOT operate harness post-deployment (unless Retainer). Issues no SOC 2 / HIPAA / HITRUST / ISO certifications. Does NOT replace Customer's independent red-team function. No 100% adversarial-detection guarantee. No Bedrock API stability guarantee. No regulator-outcome guarantee.
AWS + Anthropic-side billing. AWS infrastructure + Anthropic API + Bedrock Claude consumption (incl. LLM-judge) separate.
Holiday coverage. Closed on US federal holidays.