Agent Evaluation & Red-Team Harness on Bedrock - Regression + Jailbreak

Kriv AI builds a continuous agent quality + safety testing harness on Amazon Bedrock. Scope: regression suite (100–1,000 golden test cases per agent), jailbreak corpus aligned to OWASP Top 10 for LLM Applications (2024/2025 — LLM01–LLM10), bias probes (disparate-impact testing via AIR / four-fifths / statistical parity / equalized odds / calibration / Cohen's d + protected-class proxy detection via SageMaker Clarify), output validation (JSON schema + toxicity + hallucination + grounding), incident playback (time-travel debugging via OpenTelemetry + X-Ray), CI/CD integration (CodePipeline / GitHub Actions / GitLab CI / Jenkins), eval dashboards (CloudWatch + QuickSight). Integrates with Bedrock Model Evaluations + AgentCore Evaluations (GA March 2026). Three tiers: $50K / $85K / $125K + $20K Extra Agent. Optional $8K–$15K/month Retainer. AWS Select + Anthropic CPN.

Request private offer

Overview

Try agent mode

Create proposal

Ask question

Agents fail silently. Regressions slip through CI. Jailbreaks bypass system prompts. Bias probes on protected-class proxies surface in production — usually first as a regulator inquiry, audit exception, or incident. Until now, no productized agent-evaluation + red-team harness SKU has existed on AWS Marketplace, competitors sell one-shot pentest SOWs that leave a report, or SaaS tools you operate yourself.

The agentic category has outpaced governance.** Amazon Bedrock AgentCore Evaluations (GA March 2026)** are baseline tools, not production-grade test harnesses. Multi-step agents (LangGraph, Bedrock Agents, AutoGen, CrewAI) need full regression suites for tool-use, function calls, outputs, and trajectory validation—not just final answers. Jailbreak/PII testing aligned to OWASP LLM Top 10 (2024/2025) must continuously adapt. Bias testing (AIR, parity, equalized odds, calibration) is required under NAIC, NYDFS, CO Reg 10-1-1, and EU AI Act. Without CI/CD gating, evaluations are non-operational.

Existing AWS Marketplace listings don't close this gap. Data Reply / Altimetrik / CrowdStrike / HackerOne sell one-shot pentest SOWs ($16K–$250K+). Eval-platform SaaS (LangSmith, Braintrust, Arize, Patronus, Promptfoo, Evidently) are tools you operate yourself. NIST AI RMF / ORCAA audits run $50K–$200K one-shot.

Harness components. Regression suite (100–1,000 test cases per agent; JSON-schema validation; function-calling + tool-use correctness; trajectory validation). Jailbreak corpus (OWASP LLM01–LLM10 + industry-specific prompts for healthcare / life sciences / FS; 50/200/500+ per tier). Bias probes (AIR, statistical parity, equalized odds, calibration, Cohen's d, counterfactuals; proxy detection; SageMaker Clarify). Output validation (JSON schema; toxicity; hallucination via grounding + retrieval-relevance; refusal-rate tracking). Incident playback (Enterprise), time-travel via OpenTelemetry + X-Ray. Eval dashboards (CloudWatch + QuickSight). CI/CD integration (CodePipeline / GitHub Actions / GitLab CI / Jenkins; gates with configurable thresholds). Cost management (per-run + AWS Budgets; LLM-judge tracked separately). Privacy-preserving test corpora (synthetic + de-identified per §164.514 Safe Harbor).

Reference architecture. Bedrock Model Evaluations + AgentCore Evaluations baseline. Kriv's harness on ECS / EKS runs regression, jailbreak, bias-probe jobs on scheduled + CI-triggered cadence. S3 Object Lock stores corpora. SageMaker Clarify for bias. Step Functions for multi-step eval. CodePipeline / CodeBuild wire CI/CD gates. Claude on Bedrock as LLM-judge. OpenTelemetry + X-Ray feed incident playback.

Week-by-week. W1 Scoping. W2 Regression suite + Bedrock baseline. W3 Jailbreak corpus + CI/CD gates, Foundation closes (30-day warranty). W4 Standard, bias probes + output validation + dashboards (45-day warranty). W5 Enterprise, incident playback + sibling integration. W6 Enterprise, regulated-industry bias probes (healthcare §164.514 + ACA §1557; FS fair-lending / ECOA / NAIC; life sciences FDA SaMD PCCP); 60-day hypercare.

Three tiers. Foundation $50K (4 wk; 1 agent; 100 regression cases; 50 OWASP adversarial prompts; basic Bedrock + AgentCore Evaluations; CI/CD gates; 30-day warranty) for AI-native Series B–E. Standard $85K (5 wk; up to 3 agents; 200 adversarial prompts; bias probes; output validation; dashboards; CI/CD with thresholds; 45-day warranty) for mid-sized multi-agent + SOC 2 Type II AI testing. Enterprise $125K (6 wk; up to 10 agents; 500+ adversarial prompts; regulated-industry bias probes; incident playback time-travel; N27 + N28 + N31 integration; 60-day hypercare) for regulated, G-SIB banks, top-25 payers + pharmas. Optional Extra Agent $20K each. Retainer upsell $8K–$15K/month for quarterly corpus refresh + new attack vectors + regulatory updates. EDP-eligible — harness build fees ($50K–$125K) count toward your AWS Enterprise Discount Program commitment (up to 25%). AWS + Anthropic + Bedrock LLM-judge consumption are billed separately. Optional Retainer ($8K–$15K/month) also EDP-eligible. Contact info@kriv.ai to structure a private offer against your EDP or PPA.

Important disclosures. Kriv does NOT develop Customer agents, harness tests Customer-authored agents. Does NOT operate harness post-deployment (unless Retainer). Issues no SOC 2 / HIPAA / HITRUST / ISO certifications. No legal / regulatory / compliance advice. Does NOT replace Customer's independent red-team function, ongoing cadence + responsible disclosure + zero-day response remain Customer's. No 100% adversarial-detection guarantee. No Bedrock API stability guarantee. AWS + Anthropic + Bedrock + LLM-judge consumption separate. No regulator-outcome guarantee. Anthropic CPN membership does not constitute endorsement.

Highlights

First built-and-left-behind agent eval + red-team harness SKU on AWS Marketplace — regression + jailbreak + bias probes + CI/CD gates.** Data Reply / Altimetrik / CrowdStrike / HackerOne sell one-shot pentest SOWs ($16K–$250K+) that leave a report; LangSmith / Arize / Patronus / Promptfoo / Evidently are SaaS tools. N29 = implementation partner for **Bedrock AgentCore Evaluations (GA March 2026)** Kriv integrates and extends, does not compete.
OWASP Top 10 for LLM Applications (2024/2025 LLM01–LLM10) adversarial corpus + industry-specific prompts (healthcare PHI-extraction / medical-misinformation; life sciences unapproved-indication / controlled-substance; financial services unlicensed-advice / fair-lending proxies / OFAC-adjacent) + bias probes (AIR / four-fifths / statistical parity / equalized odds / calibration / Cohen's d via SageMaker Clarify) + output validation (JSON schema + toxicity + hallucination + grounding).
AWS Select + Anthropic CPN — 4–6 weeks, $50K Foundation (1 agent) / $85K Standard (3 agents + bias probes + dashboards) / $125K Enterprise (10 agents + incident playback time-travel debugging + sibling integration with N27 AgentCore + N28 Guardrails + N31 Observability) + $20K Extra Agent.** Optional Retainer upsell $8K–$15K/month for quarterly corpus refresh + new attack vectors + regulatory-citation updates, recurring-revenue motion post-implementation.

Details

Sold by

Kriv AI

Introducing multi-product solutions

You can now purchase comprehensive solutions tailored to use cases and industries.

Learn more

Explore multi-product solutions

Pricing

Custom pricing options

Request private offer

Pricing is based on your specific requirements and eligibility. To get a custom quote for your needs, request a private offer.

How can we make this page better?

Tell us how we can improve this page, or report an issue with this product.

Legal

Content disclaimer

Vendors are responsible for their product descriptions and other product content. AWS does not warrant that vendors' product descriptions or other product content are accurate, complete, reliable, current, or error-free.

Resources

Vendor resources

Kriv AI Capabilities Overview

Kriv AI on AWS Marketplace (Seller Profile)

Anthropic Claude Partner Network

Support

Vendor support

Primary contact. info@kriv.ai · +1-732-433-5564 · https://kriv.ai/support

Response SLA. First response within 2 US business days (Mon–Fri 9 am – 6 pm ET, ex-US federal holidays). Active engagements: Engagement Lead within 4 business hours weekdays. Post-incident (production regression, successful jailbreak, bias finding) or post-MRA/MRIA engagements compress to same business day.

Onboarding SLA. First customer contact within 2 US business days of buyer inquiry / private-offer acceptance. Kickoff within 1–2 weeks of SOW; 3–5 business days post-incident.

Escalation. (1) Engagement Lead (named in SOW) → (2) Practice Director (info@kriv.ai ) → (3) CEO Abhinav Dangri (info@kriv.ai ).

Communication. Dedicated Microsoft Teams channel; weekly 60-min video checkpoint; Friday written status. Customer SMEs 3–5 hrs/week (Head of AI Platform, Head of ML Engineering, VP Engineering, CISO, Head of Trust & Safety, CAIO, Head of SRE, VP Product).

Handoff. Word/Excel/PDF in customer secure share; regression suite + jailbreak corpus + bias probes as Git repo (Python / JSON / YAML); CI/CD integration templates as CodePipeline / GitHub Actions / GitLab CI / Jenkins YAML; eval dashboards as CloudWatch + QuickSight configs; incident playback harness as OpenTelemetry + X-Ray integration code.

Out of scope. Does NOT develop Customer agents. Does NOT operate harness post-deployment (unless Retainer). Issues no SOC 2 / HIPAA / HITRUST / ISO certifications. Does NOT replace Customer's independent red-team function. No 100% adversarial-detection guarantee. No Bedrock API stability guarantee. No regulator-outcome guarantee.

AWS + Anthropic-side billing. AWS infrastructure + Anthropic API + Bedrock Claude consumption (incl. LLM-judge) separate.

Holiday coverage. Closed on US federal holidays.

Software associated with this service

Enterprise Generative AI Security Hub

By Add Value Machine

Meet both your compliance requirements and security needs with AddValueMachine's Enterprise Generative AI Security Hub.

View product

Splunk SIEM for AWS Security Hub Extended

By Splunk

Splunk Powers the Next Generation of AWS Security Hub Extended. The Power of Splunk SIEM. The Simplicity of AWS Security Hub Extended.

View product

AWS AI Security & Compliance by PointGuard - for Bedrock and SageMaker

By PointGuard AI

PointGuard AI's integration with AWS AI (Bedrock, SageMaker) significantly enhances the security and governance of AI applications built on AWS AI (Bedrock, SageMaker). PointGuard AI's solutions provide guardrails and policy enforcement across the entire AI lifecycle, from data ingestion to model deployment. By combining AWS AI's powerful data processing capabilities with PointGuard AI's robust security measures, organizations can accelerate their AI initiatives while maintaining a strong security posture, ensuring that sensitive data and AI models are protected from evolving threats.

View product

Sonic 3 SageMaker

By Cartesia

Cartesia Sonic delivers natural AI Voices in 40+ languages including accent localization and controls for emotional expressiveness, all at 2-4x lower latency than alternatives with industry leading reliability.

View product