What Nearly 70 GenAI Projects Taught Us About What Actually Works — And How We Used Agentic AI to Find Out

Earlier this year, the GenAI Zürich Award 2026 put out a call for generative AI projects across Switzerland and Europe. The response was striking. Nearly 70 submissions arrived – from five-person startups to organizations with tens of thousands of employees, spanning healthcare, insurance, agriculture, manufacturing, education, and public services. Three award tracks — Impact Achievers, Rising Innovators, and Enterprise Transformers — covered the full spectrum from social impact to enterprise-scale transformation.

Each project was independently scored by a panel of 15 senior industry experts. Winners were announced at the GenAI Zürich conference on 1–2 April. All company names and identifying details in the guides are anonymized.

After the awards, the jury took the analysis further – using Kiro, an agentic AI development tool from AWS, and custom Strands Agent SOPs to process the full dataset: every submission, every expert score, every piece of jury feedback. We distilled the patterns into two practical guides: one for business leaders, one for engineers and architects.

What the Business Data Showed

The most common gap wasn’t technical — it was evidentiary. Over 75% of projects couldn’t prove their business impact. The technology worked. The measurement infrastructure didn’t exist. Expert reviewers flagged this pattern repeatedly: “No quantified business impact.” “Claims presented without methodology.” “Impact metrics are projections, not measurements.”

The projects that stood out shared a discipline: they defined four metrics before building anything — adoption (are people actually using it?), efficiency (how much time does it save?), quality (is the output good enough to trust?), and cost (what does it cost per unit of work versus the baseline?) — and instrumented their systems to capture these automatically from day one.
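As a rough illustration of what instrumenting those four metrics from day one might look like, the sketch below logs one event per AI-assisted task and derives adoption, efficiency, quality, and cost from that log. The class names, field names, and the use of acceptance rate as a quality proxy are assumptions made for illustration; none of them come from the submitted projects.

```python
from dataclasses import dataclass, field


@dataclass
class TaskEvent:
    """One AI-assisted task. All field names are illustrative assumptions."""
    user_id: str
    seconds_taken: float     # time spent on the task with AI assistance
    baseline_seconds: float  # measured time for the same task before AI
    accepted: bool           # output accepted without rework (quality proxy)
    cost_usd: float          # model and infrastructure cost for this task


@dataclass
class MetricsLog:
    events: list[TaskEvent] = field(default_factory=list)

    def record(self, event: TaskEvent) -> None:
        self.events.append(event)

    def summary(self, eligible_users: int, baseline_cost_usd: float) -> dict[str, float]:
        """Return the four metrics computed over all recorded events."""
        if not self.events:
            raise ValueError("no events recorded yet")
        n = len(self.events)
        active_users = {e.user_id for e in self.events}
        ai_seconds = sum(e.seconds_taken for e in self.events)
        baseline_seconds = sum(e.baseline_seconds for e in self.events)
        return {
            # Adoption: share of eligible users who used the system at all.
            "adoption_rate": len(active_users) / eligible_users,
            # Efficiency: percentage of time saved versus the baseline.
            "time_saved_pct": 100 * (1 - ai_seconds / baseline_seconds),
            # Quality: share of outputs accepted without rework.
            "acceptance_rate": sum(e.accepted for e in self.events) / n,
            # Cost: average cost per task relative to the baseline cost per task.
            "cost_ratio": (sum(e.cost_usd for e in self.events) / n) / baseline_cost_usd,
        }
```

The point of the design is that the baseline is captured alongside each event, so the comparison is a measurement rather than a projection.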

Three other patterns separated the strongest projects from the rest:

Narrow focus outperformed broad ambition. A digital marketing company focused exclusively on sales outreach emails and measured a 25% revenue increase within three months. A healthcare documentation startup focused on a single workflow — the administrative burden physicians hate most — and had dozens of paying customers within 16 months. Projects that attempted broad AI transformation consistently under-performed those that went deep on one workflow first.

Governance accelerated adoption. This was counterintuitive. Projects with strong data sovereignty, audit trails, and human-in-the-loop review had higher adoption rates — not lower. An insurance company that built DSG (Datenschutzgesetz, the Swiss Federal Act on Data Protection) compliance and traceable outputs into its system from day one got faster internal buy-in, because every stakeholder conversation started from trust rather than risk.

Proprietary domain knowledge was the real moat. The projects that created lasting competitive advantage encoded domain-specific expertise — legal terminology, crop cycles, medical workflows, industrial part taxonomies — into their AI systems. Generic model capabilities were table stakes. What competitors couldn’t replicate was the structured domain knowledge layer.

The full business guide maps these findings to three maturity stages — haven’t started, piloting, and scaling — with specific actions for each.

GenAI Business Guide

What the Technical Architectures Revealed

Under the hood, a clear pattern emerged: the systems that reached production invested more in what surrounds the model than in the model itself. Production generative AI turned out to be roughly 20% model and 80% everything around it.

Domain-specific data layers beat generic model upgrades. An oncology AI company trained pathology, radiology, and genomics encoders from scratch, achieving state-of-the-art performance with 100x less training data. An industrial sourcing platform trained embeddings on part taxonomies and jumped matching accuracy from 25% to 70%. In every case, the largest accuracy gains came from the domain data layer, not from switching to a better foundation model.

But you don’t need to train models to get started. Other teams took a lighter-weight approach: building document processing pipelines that preserve table structures and hierarchies, feeding them into a vector store for retrieval, and closing the loop with human experts. When the system detected a knowledge gap — a question it couldn’t answer confidently — it escalated to a domain expert via Slack or Teams. The expert’s verified answer was fed back into the knowledge base automatically, so the same question never required human intervention twice. The organization’s domain knowledge grew with every interaction, and the general-purpose model got more accurate without anyone retraining anything.
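The escalation loop described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the token-overlap `similarity` function stands in for a real vector-store lookup, `ask_expert` is a hypothetical hook for whatever Slack or Teams integration a team uses, and the 0.6 threshold is arbitrary.

```python
from dataclasses import dataclass, field
from typing import Callable


def _tokens(text: str) -> set[str]:
    return {w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")}


def similarity(a: str, b: str) -> float:
    """Jaccard token overlap; an assumed stand-in for vector similarity."""
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


@dataclass
class KnowledgeBase:
    ask_expert: Callable[[str], str]  # hypothetical hook to a human expert
    threshold: float = 0.6            # assumed confidence cut-off
    entries: dict[str, str] = field(default_factory=dict)  # question -> verified answer
    escalations: int = 0

    def answer(self, question: str) -> str:
        best = max(self.entries, key=lambda q: similarity(q, question), default=None)
        if best is not None and similarity(best, question) >= self.threshold:
            return self.entries[best]
        # Knowledge gap: escalate once, then store the verified answer
        # so the same question does not need a human again.
        self.escalations += 1
        verified = self.ask_expert(question)
        self.entries[question] = verified
        return verified
```

In this sketch the knowledge base grows only through expert-verified answers, which mirrors the idea that accuracy improves without retraining the underlying model.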

Pre-generation input validation eliminated the trust problem. The strongest systems in regulated industries validated inputs before generation instead of filtering outputs afterward. Here’s what that looks like in practice: an insurance group built a deterministic terminology layer that fuzzy-matches input against a curated database of mandatory legal and brand terms, then pins those terms into the prompt before the LLM generates anything. The model operates within those constraints: it can’t hallucinate a legal term because the correct term is already locked in. A pharmaceutical patient support system takes a different approach: specialist safety models continuously shape what the primary model can say in real time, constraining the conversation space rather than filtering after the fact. Post-generation filtering catches errors; pre-generation validation prevents them.
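A deterministic terminology layer of the kind the insurance example describes might look roughly like the sketch below. It uses Python’s standard-library `difflib` for the fuzzy match; the term list, the 0.8 cut-off, the word-by-word matching, and the prompt wording are all assumptions for illustration, not the insurer’s actual implementation.

```python
import difflib

# Hypothetical curated terms; a real system would load these from a governed database.
MANDATORY_TERMS = ["deductible", "policyholder", "indemnity"]


def find_pinned_terms(text: str, terms: list[str] = MANDATORY_TERMS,
                      cutoff: float = 0.8) -> list[str]:
    """Return the canonical terms that fuzzy-match words in the input text."""
    lookup = {t.lower(): t for t in terms}
    pinned: list[str] = []
    for word in text.split():
        cleaned = word.strip(".,;:!?()").lower()
        match = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
        if match and lookup[match[0]] not in pinned:
            pinned.append(lookup[match[0]])
    return pinned


def build_prompt(user_text: str, pinned: list[str]) -> str:
    """Pin the canonical terms into the prompt before any generation happens."""
    rules = "\n".join(f"- {term}" for term in pinned) or "- (none)"
    return (
        "Use the following mandatory terms exactly as written:\n"
        f"{rules}\n\n"
        f"Request: {user_text}"
    )
```

For example, `find_pinned_terms("Explain the deductable to the policy-holder.")` resolves both misspelled words to their canonical forms, and `build_prompt` places those forms ahead of the request. Matching word by word keeps the sketch short; multi-word legal terms would need n-gram matching.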

Multi-agent orchestration worked when scopes were narrow. An agricultural marketing platform shipped five specialized agents (one to orchestrate, one to research, one to plan, one to create content, and one to analyze results), each connected to live marketing data sources via Model Context Protocol (MCP) rather than passing context through prompts. The platform was in production across 10+ markets on four continents within months of its MVP.

An enterprise ERP assistant took a different approach: a Pathfinder agent views the user’s screen in real time via vision models and provides context-aware, step-by-step navigation through complex ERP interfaces. On go-live day, it handled over 1,100 unique users, 3,000+ real-time interactions, and 100M+ tokens. The pattern that made both work: each agent has one job, and an orchestrator routes between them rather than trying to reason across them. The systems that tried to build general-purpose reasoning across agents failed. The ones that shipped gave each agent a single, narrow capability and let the orchestrator compose.
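The "one job per agent, orchestrator routes" pattern can be sketched as a plain dispatch table. Everything here is illustrative: the agents are stub functions standing in for model calls, and the `Orchestrator` class and capability names are assumptions, not an API from either project or from any agent framework.

```python
from typing import Callable

# An agent is modeled as a function from input text to output text.
Agent = Callable[[str], str]


class Orchestrator:
    """Routes each step of a plan to exactly one single-purpose agent."""

    def __init__(self, agents: dict[str, Agent]) -> None:
        self._agents = dict(agents)

    def run(self, request: str, plan: list[str]) -> str:
        payload = request
        for capability in plan:
            if capability not in self._agents:
                # Fail loudly instead of asking some agent to improvise.
                raise KeyError(f"no agent registered for {capability!r}")
            payload = self._agents[capability](payload)
        return payload


# Stub agents: each does one narrow thing and knows nothing about the others.
def research_agent(text: str) -> str:
    return f"{text} | researched"


def content_agent(text: str) -> str:
    return f"{text} | drafted"


orchestrator = Orchestrator({"research": research_agent, "content": content_agent})
result = orchestrator.run("spring campaign", ["research", "content"])
```

The orchestrator composes narrow capabilities in sequence and does no reasoning of its own, which is the property the paragraph above credits for the systems that shipped.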

The biggest technical gap: no evaluation pipelines. The majority of projects had no automated way to measure whether outputs were correct, consistent, or improving. No gold-standard datasets, no regression tests, no CI/CD gates for output quality. The counter-example was a translation platform that ran blind evaluations in which professional translators assessed hundreds of segments across multiple language pairs, comparing its system against the market leader. Its system produced nearly twice as many perfect translations. That result was rigorous, reproducible, and became the centerpiece of the company’s enterprise sales narrative. Better yet, every human correction fed back into client-specific models, so the eval data compounded: each customer interaction made the system measurably better and harder to replicate.
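A minimal evaluation gate, assuming a small gold-standard dataset and exact-match scoring, could look like the sketch below. Exact match is a deliberately crude stand-in for the blind human assessment described above, and the function names and thresholds are assumptions chosen for illustration.

```python
from typing import Callable

# A gold example pairs an input with its expected output.
GoldExample = tuple[str, str]


def evaluate(system: Callable[[str], str], gold: list[GoldExample]) -> float:
    """Return the share of gold examples the system reproduces (case-insensitive)."""
    if not gold:
        raise ValueError("gold dataset is empty")
    correct = sum(
        1 for source, expected in gold
        if system(source).strip().lower() == expected.strip().lower()
    )
    return correct / len(gold)


def quality_gate(score: float, previous_score: float,
                 min_score: float = 0.9, max_regression: float = 0.02) -> bool:
    """Pass only if the score clears an absolute floor and has not regressed."""
    return score >= min_score and score >= previous_score - max_regression
```

In a CI pipeline, a failing `quality_gate` would block the release in the same way a failing unit test does, which gives output quality a regression test of its own.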

The technical guide includes a four-stage checklist (validate → build → ship → scale) with specific infrastructure recommendations at each stage.

GenAI Technical Guide

How We Built the Guides Themselves

The analysis and drafting of both guides were produced using an agentic AI workflow — making the process itself a practical demonstration of the patterns we observed.

The orchestration environment was Kiro CLI, an agentic AI development tool. Custom Strands Agent SOPs (Standard Operating Procedures from the Strands Agents open-source project) defined the end-to-end workflow: ingesting all submissions and expert scores across three award tracks, cross-referencing judge feedback, identifying patterns, and generating the final text. The underlying model was Anthropic Claude Opus 4.6.

The workflow followed the same principles the strongest projects in the dataset used:

Narrow scope: Each SOP handled one task — data ingestion, pattern extraction, or guide generation — rather than attempting end-to-end reasoning in a single prompt.
Human-in-the-loop: Expert jury members reviewed and commented on each version. Their feedback was fed back into the process, making each iteration measurably better.
Eval-driven: The business guide was generated first, then passed as a companion input to the technical guide SOP to ensure consistent framing, examples, and pattern naming across both documents.

The result: two guides totaling over 7,000 words of analysis, grounded in scored evaluation data, produced through a repeatable agentic workflow that any team could adapt for their own research synthesis tasks.

The SOPs themselves went through multiple iterations. Early versions produced output that was too report-like, too long, or leaked identifying details. Each round of expert review surfaced new failure modes that became new constraints — “examples must follow situation → decision → outcome,” “anonymize operational metrics but keep technical specifics,” “the evidence gap section must not bleed into governance.” The SOPs grew more opinionated with each iteration, and the output got better each time. Both SOPs follow the Strands Agent SOP format specification — if you want to see what it takes to instruct an agentic loop for a research synthesis task, download them here: Business Guide SOP, Technical Guide SOP.

Read the Full Guides

The technology is ready. These nearly 70 projects prove it. The guides distill their hard-won lessons — how to go deep on one workflow, build the evidence, and scale responsibly — to shorten the learning curve for the next wave of teams. The strongest signal in the data: the teams that stopped at automating one task got linear gains. The ones that let AI change how their teams operate, like turning a sequential review process into a parallel one, saw step-function improvements. The guides show you how to start. GenAI Business Guide, GenAI Technical Guide.

AWS in Switzerland and Austria (Alps)
