Migration & Modernization

Reproducible Code Migration at Scale with AI-Generated Playbooks

Introduction

If you’re migrating code at scale across hundreds of repositories, you need reproducible results across runs. Large language model (LLM)-based agents are inherently probabilistic: each result is sampled from a distribution of plausible outputs. In a short-horizon task, this nondeterminism is manageable. In a long-horizon task like migrating a large codebase, where an agent makes hundreds of interdependent decisions across planning, refactoring, dependency resolution, and testing, variance compounds. Two runs on the same repository with identical inputs can produce markedly different outcomes: different files changed, different patterns applied, and different final states. This isn’t a bug; it’s the stochastic nature of these systems. At enterprise scale, where hundreds of repositories need to be migrated consistently, this nondeterminism becomes a serious problem. Reproducibility in agentic transformation is a hard, largely unsolved challenge.

With AWS Transform custom, you can refactor and modernize large legacy systems using AI-powered migration agents. As your code is migrated across repositories, the service accumulates artifacts such as commit histories, code diffs, error logs, resolution strategies, and decision rationales. These artifacts capture what worked, what failed, and why. They point toward a solution to the consistency problem.

In this blog post, we explore how a four-phase multi-agent pipeline automatically generates migration “playbooks” from those accumulated artifacts. We examine why the book format is well-suited to this problem, walk through the pipeline that produces playbooks, show how playbooks improve migration consistency with experimental results, and share a real-world example of playbook-guided planning. By integrating a migration playbook into a transformation package for AWS Lambda Python migrations to newer Python versions, we achieved measurable improvements in consistency of migration outcomes.

The knowledge aggregation problem

A natural first approach is to extract learnings from migration artifacts and store them in a vector database. When a new migration encounters a familiar situation, the system retrieves relevant learnings to inform the agent’s decisions. This approach captures comprehensive knowledge without manual curation, but it falls short in two critical ways.

Quality control is difficult. Automatically extracted knowledge varies widely in accuracy, relevance, and generalizability. Some extracted learnings are high-quality insights; others are context-specific quirks that don’t transfer. Without human review, low-quality learnings pollute the knowledge base and can mislead future migrations.

Human verification is impractical. Thousands of fragmented learnings aren’t designed for human review at scale. However, complex migrations require human expertise to confirm patterns, identify errors, and prioritize what matters. This demands a format humans can naturally read and edit, structured enough to organize complexity while remaining accessible for efficient review.
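For concreteness, here is a minimal sketch of the naive retrieval approach described above, assuming an in-memory stand-in for the vector database; the embed function is a toy hashed bag-of-words placeholder for a real embedding model, and LearningStore is an illustrative name, not an actual component of the service:

import hashlib
import math
from collections import Counter

def embed(text, dims=256):
    # Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized.
    vec = [0.0] * dims
    for token, count in Counter(text.lower().split()).items():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dims] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class LearningStore:
    # In-memory stand-in for a vector database of extracted learnings.
    def __init__(self):
        self.entries = []

    def add(self, learning):
        self.entries.append((embed(learning), learning))

    def retrieve(self, situation, k=3):
        # Cosine similarity; vectors are unit-norm, so a dot product suffices.
        q = embed(situation)
        ranked = sorted(self.entries, key=lambda e: -sum(a * b for a, b in zip(q, e[0])))
        return [text for _, text in ranked[:k]]

Everything extracted goes into the store unchecked, which is exactly the quality-control gap described above: high-quality insights and context-specific quirks are retrieved with equal authority.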

Wikis, runbooks, and documentation sites serve as effective formats for structured technical knowledge. However, books stand apart with centuries of refinement behind them, making them well-suited for presenting complex information in a way that humans can read naturally.

Why a playbook?

Books have structural properties that help address these challenges. Rather than treating the book format as a stylistic choice, consider what it delivers technically.

Natural quality filtering through distillation. Writing a book distills information. When you summarize hundreds of migrations into a finite document, you make choices about what deserves inclusion. Patterns that recur across many migrations earn prominent coverage; rare edge cases and one-off quirks receive less attention or are omitted entirely. This frequency-based filtering acts as a natural quality mechanism in practice. If most repositories required a specific type of dependency update, that pattern is almost certainly important and correct. If only two repositories showed a particular behavior, it may be an anomaly rather than a generalizable lesson. Majority consensus isn’t infallible, but it’s typically reliable enough to provide a strong baseline for migration guidance.
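A minimal sketch of this frequency-based filter, assuming each migration is summarized as the set of patterns it applied (the pattern names are illustrative):

from collections import Counter

def select_patterns(migrations, min_fraction=0.5):
    # Keep only patterns observed in at least `min_fraction` of migrations.
    counts = Counter(p for patterns in migrations for p in patterns)
    threshold = min_fraction * len(migrations)
    return [p for p, c in counts.most_common() if c >= threshold]

# A dependency update seen in most repositories survives the cut;
# a quirk seen in only two repositories does not.
runs = [{"update-boto3", "fix-imports"}] * 60 + [{"update-boto3"}] * 15 + [{"odd-quirk"}] * 2
print(select_patterns(runs))  # ['update-boto3', 'fix-imports']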

Easy human correction when the majority is wrong. The majority isn’t always right. Sometimes a common pattern reflects a common mistake, or an outdated best practice that should be updated. The book format makes errors visible and corrections straightforward. An expert reading a well-organized book can quickly spot a problematic recommendation such as “This chapter says to handle exceptions this way, but that’s an anti-pattern in Python 3.12,” directly edit that section, and have the correction propagate to future users of the book. In a vector database, the same correction requires identifying which embeddings encode the problematic knowledge and modifying them without degrading retrieval quality, a far more involved process.

Structured knowledge produces structured guidance. Books have chapters. Chapters have sections. Sections have coherent narratives. This hierarchical structure maps naturally onto the structure of migration work, where chapters cover major categories of migration tasks (such as “Dependency Management”), sections cover specific patterns within each category (such as “Upgrading Boto3 for Python 3.12”), and examples provide concrete code snippets and before/after comparisons. This structure benefits AI agents as much as human readers. Agents navigate and apply the knowledge more effectively when it’s organized around migration tasks rather than stored as unstructured embeddings.
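One way to represent that hierarchy in code, as a sketch; these dataclasses are illustrative, not the service’s actual schema:

from dataclasses import dataclass, field

@dataclass
class Example:
    description: str
    before: str  # code snippet before migration
    after: str   # code snippet after migration

@dataclass
class Section:
    title: str  # e.g. "Upgrading Boto3 for Python 3.12"
    guidance: str
    examples: list[Example] = field(default_factory=list)

@dataclass
class Chapter:
    title: str  # e.g. "Dependency Management"
    sections: list[Section] = field(default_factory=list)

@dataclass
class Playbook:
    title: str
    chapters: list[Chapter] = field(default_factory=list)

Because the hierarchy mirrors migration tasks, an agent can navigate from a task category to a specific pattern to a concrete before/after example without searching unstructured text.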

Institutional knowledge that persists. A book creates durable institutional knowledge that exists independently of a particular database, embedding model, or retrieval system. Team members can onboard by reading it. Auditors can review migration standards. The knowledge survives infrastructure changes.

Playbook generation pipeline

Generating a playbook from raw migration artifacts isn’t a single LLM call. AWS Transform custom uses a four-phase multi-agent pipeline in which specialized agents work in sequence, each building on the validated output of the previous step to progressively transform unstructured artifacts into a coherent, structured book. The pipeline handles the volume and complexity that would make a single-shot approach impractical, while preserving the organization and depth that make the resulting playbook useful for both human reviewers and AI planners.
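The phases themselves aren’t detailed here, but the control flow can be sketched as a sequence of agents, each validating and extending the prior output; the phase names and the run_pipeline helper below are assumptions, not the service’s actual implementation:

from typing import Callable

Phase = Callable[[dict], dict]

def validate(output: dict, name: str) -> dict:
    # Gate between phases: refuse to pass empty state downstream.
    if not output:
        raise ValueError(f"phase '{name}' produced no output")
    return output

def run_pipeline(artifacts: dict, phases: list[tuple[str, Phase]]) -> dict:
    state = artifacts
    for name, phase in phases:
        state = validate(phase(state), name)  # each phase builds on validated output
    return state

# Hypothetical four phases: extract learnings from artifacts, cluster them
# into recurring patterns, draft chapters, then review and assemble the book.
phases = [
    ("extract", lambda s: {**s, "learnings": ["..."]}),
    ("cluster", lambda s: {**s, "patterns": ["..."]}),
    ("draft", lambda s: {**s, "chapters": ["..."]}),
    ("assemble", lambda s: {**s, "book": "..."}),
]
playbook = run_pipeline({"artifacts": ["diffs", "logs"]}, phases)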

Intuition suggests that more migration examples should yield better books. AWS Transform custom tested this directly by generating two versions of the Python Lambda migration playbook, one from 10 repositories and one from 77.

The quantitative differences were immediate. The 77-repository playbook contained 25% more files and replaced scattered process notes with dedicated chapters and an explicit six-step workflow that was not present in the smaller version.

The qualitative differences were more notable. The 10-repository playbook covered the basics but offered generic guidance with low confidence, because patterns observed across only 10 data points could easily be coincidental. The 77-repository playbook covered advanced patterns alongside the basics, organized them with a clear separation of concerns, and grounded each recommendation in real migration evidence. Guidance shifted from “consider doing X” to “do X, Y, Z in this order,” with concrete examples drawn from actual migrations. The additional data didn’t just add volume. It gave agents enough signal to distinguish common cases from edge cases and to provide specific, actionable direction rather than general suggestions. This validates a reinforcing cycle: more migrations produce richer playbooks, which in turn guide better future migrations.

Improving reproducibility of migration

Beyond knowledge capture, the playbook serves a critical operational purpose: reducing variance and improving reproducibility across migration runs.

The variance problem. Variance is a measurable challenge in complex, multi-agent migrations. You can start two independent migration runs for the same repository and, despite identical inputs, get different outcomes. For example, Run 1 might update 47 files and modernize import statements, while Run 2 updates 52 files and refactors exception handling. This is an inherent property of LLM-based systems, not a defect. Each run samples from a probability distribution, and multi-agent pipelines compound this variance across many decisions. Over a long trajectory, small differences accumulate into large divergences.

The root cause is planning divergence. Much of the outcome variance traces back to planning. If two runs produce different migration plans, they’ll naturally produce different outcomes. Without strong guidance, planners have significant flexibility. They might prioritize tasks differently, group changes differently, or identify different tasks entirely. Each choice is locally reasonable but globally inconsistent.

The playbook as a consistency mechanism. The playbook addresses both issues (natural LLM variance and early planning divergence) by constraining the solution space. When a planner has access to a book documenting what action spaces exist, what order they typically follow, what specific patterns to look for, and what approaches have worked historically, the planner’s decisions become more deterministic. Rather than sampling freely from a wide distribution of plausible plans, the planner is guided toward a narrower set of proven approaches. This reduces variance where it matters most: at the beginning of the migration process. The playbook doesn’t eliminate choice, but it channels choices toward what has already worked.
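As a sketch of what this looks like in practice (the prompt wording below is an assumption, not the service’s actual planner prompt):

def build_planner_prompt(transformation_definition, playbook_excerpt, repo_summary):
    # The playbook supplies the "how": documented workflow, task ordering,
    # and proven patterns that narrow the space of plans the model samples from.
    return (
        "You are planning a code migration.\n\n"
        f"Goal (what to achieve):\n{transformation_definition}\n\n"
        "Playbook (how to achieve it; follow the documented workflow, task "
        "ordering, and patterns unless the repository clearly requires "
        "otherwise):\n"
        f"{playbook_excerpt}\n\n"
        f"Repository summary:\n{repo_summary}\n\n"
        "Produce a step-by-step migration plan as JSON."
    )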

We compared two planning conditions, scored by three independent LLM judges (Sonnet 4.5, Qwen3 480B, and Devstral-v2 123B): a planner given only the transformation definition (TD) versus one given the TD combined with the migration playbook (TD + Playbook). TD + Playbook outperformed TD across all three judges, with improvements ranging from +4.93% to +15.79%, and five of six comparisons were statistically significant (p < 0.05). The effect is model-agnostic. A task description tells the planner what to achieve, while the playbook tells it how, and that distinction is what drives the gains.
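The judges’ rubric isn’t published here, but as a rough proxy, run-to-run consistency can be quantified as the mean pairwise similarity of the file sets each run touches; this Jaccard-based metric is an illustrative stand-in, not the evaluation used in the experiment:

from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def consistency(runs):
    # Mean pairwise Jaccard similarity over the file sets touched by each run.
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

td_runs = [{"app.py", "deps.txt"}, {"app.py", "handler.py"}]
guided_runs = [{"app.py", "deps.txt"}, {"app.py", "deps.txt"}]
print(consistency(td_runs), consistency(guided_runs))  # guided runs score higher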

Playbook in practice

During a period of high-volume parallel migrations, widespread throttling caused most runs to fail or regress. Investigation traced the root cause to throttling on a dependency synchronization API. When the playbook was later generated from artifacts collected during that period, the agents recognized the pattern and embedded a concrete mitigation. The following is a direct quote from the generated playbook:

In our analysis of 78 migrations, dependency sync operations were attempted 143 times. Throttling is common when multiple migrations run simultaneously and the package registry becomes overwhelmed. When this happens, you’ll see ThrottlingException: Rate exceeded. This is a temporary slowdown, not a failure. Retry with a brief delay. Most migrations in our dataset required 2–3 attempts before succeeding.

The agent’s output demonstrates significant autonomous pattern recognition. It quantified the pattern (143 attempts across 78 migrations), diagnosed the root cause (simultaneous migrations overwhelming the registry), distinguished temporary slowdowns from actual failures, and prescribed a specific mitigation strategy calibrated to observed behavior (2–3 attempts typically succeed). This transforms scattered failure logs into institutional knowledge that prevents future migrations from repeating the same mistakes. With this knowledge in the playbook, subsequent plans included retry logic by default:

{
  "step_sync_dependencies": {
    "description": "Sync dependencies to integrate updated packages",
    "estimated_time": "5-10 minutes",
    "commands": [{
      "command": "sync-deps",
      "retry_logic": "sync-deps || sleep 5 && sync-deps || sleep 10 && sync-deps",
      "expected_retries": "2-3 attempts typically required due to throttling",
      "reason": "Retry logic confirms success under rate-limited conditions"
    }]
  }
}
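In an execution environment with real tooling, that retry step might translate to something like the following sketch, where the sync-deps command stands in for the actual dependency-sync tool:

import subprocess
import time

def sync_with_retries(delays=(5, 10)):
    # First attempt immediately, then retry after increasing delays,
    # mirroring the playbook's observation that 2-3 attempts usually succeed.
    for attempt, delay in enumerate([0, *delays], start=1):
        time.sleep(delay)
        result = subprocess.run(["sync-deps"], capture_output=True, text=True)
        if result.returncode == 0:
            return
        if "ThrottlingException" not in result.stderr:
            # Per the playbook, only throttling is a temporary slowdown;
            # anything else is a real failure and should surface immediately.
            raise RuntimeError(f"sync-deps failed on attempt {attempt}: {result.stderr}")
    raise RuntimeError("sync-deps still throttled after all retries")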

Conclusion

LLM-based agents are inherently nondeterministic. In long-horizon tasks like LLM-based code migration, this randomness compounds across many decisions, so two runs on the same repository can produce markedly different outcomes. Achieving consistency across migrations is a hard, largely unsolved problem. At AWS Transform, we developed agentic generation of migration playbooks to address this challenge. By distilling accumulated migration artifacts into a structured book using a four-phase multi-agent pipeline, playbooks give AI planners concrete, proven guidance that constrains the solution space and reduces variance. The book format also makes human review and correction practical in a way that raw artifacts or vector databases do not. In our offline experiments, book-guided planning achieved up to 15.79% higher consistency than unguided approaches.

Next steps

Generated playbooks are integrated into AWS managed transformation packages, so the consistency benefits described in this blog post are available without additional configuration. To get started, explore AWS Transform custom to find transformation packages for your migration scenario.


About the authors