AWS DevOps & Developer Productivity Blog

Amazon introduces two benchmark datasets for evaluating AI agents’ ability to perform code migration

Introduction: Repository-Level Code Migration

Code migration is a repository-level transformation process that modernizes entire software projects to run on new platforms, frameworks, or runtime environments while preserving their original functionality and structure. Rather than focusing on isolated files or APIs, it operates across the full repository, spanning source code, dependencies, build systems, and configuration files to ensure consistency and correctness at scale. Typical examples include upgrading Java repositories from legacy versions such as Java 8 to modern Long-Term Support releases like Java 17 or 21, migrating .NET Framework repositories to .NET Core, and upgrading AWS Lambda projects in Python or Node.js to the latest runtime versions.

Code migration is a challenging software engineering (SWE) task that involves runtime upgrade, deprecated API replacement, test framework optimization, and syntax modernization. As we build agentic solutions for code migration, the community needs a standardized benchmark dataset and an evaluation framework to measure how well these systems actually perform. To close this gap, we introduce two benchmark datasets: MigrationBench on Java and Poly-MigrationBench as an extension to other programming languages. These datasets are designed not only to benchmark the effectiveness of Large Language Models (LLMs) in repository-level migration, but also to provide the community with a standardized evaluation framework for reproducible experiments.

Solution Overview

MigrationBench: Repository-Level Java Migration

MigrationBench is a comprehensive repository-level benchmark focused on Java upgrades. Specifically, it evaluates the ability of LLMs and other tools to migrate code from Java 8 to newer Long-Term Support (LTS) versions such as Java 17 and Java 21.

The full dataset includes 5,102 open-source Java 8 Maven repositories collected from GitHub, alongside a representative subset of 300 repositories curated for research requiring fewer compute resources. MigrationBench also provides an evaluation framework for validating Java Maven repository upgrades.
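
The evaluation framework itself lives in the GitHub repository. As a rough mental model, its core check can be thought of as building and testing the migrated repository with the target JDK. The sketch below is a minimal approximation of that idea, not the framework’s actual implementation; the repository and JDK paths are placeholders.

```python
import os
import subprocess
from pathlib import Path

def builds_and_passes_tests(repo_dir: str, jdk_home: str) -> bool:
    """Approximate migration check: the repository must compile and its
    unit/integration tests must pass under the target JDK."""
    env = {**os.environ, "JAVA_HOME": jdk_home}
    result = subprocess.run(
        ["mvn", "-q", "clean", "verify"],
        cwd=Path(repo_dir),
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

# Placeholder paths: point these at a migrated repository and a JDK 17 install.
if __name__ == "__main__":
    ok = builds_and_passes_tests("/path/to/migrated-repo", "/usr/lib/jvm/java-17")
    print("build and tests passed" if ok else "validation failed")
```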

Our data collection process follows a carefully designed pipeline with multiple filtering stages to ensure the quality and relevance of the repositories we include. We begin by collecting Java Maven projects, focusing on repositories written in Java that use Maven as their build tool. Next, we apply a license filter, retaining only repositories under MIT or Apache 2.0 licenses to ensure open and permissive usage. We then apply a quality filter, keeping only repositories with at least three GitHub stars to exclude toy or inactive projects. For each repository, we search for the latest buildable commit that is compatible with Java 8, ensuring a valid starting point for migration. We also remove duplicate repositories based on their snapshot hashes. Finally, we exclude repositories without any unit tests or integration tests, which are essential for robustly validating migration correctness. For more details, check out our paper MigrationBench: Repository-Level Code Migration Benchmark from Java 8 and the GitHub repository.
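
To make the pipeline concrete, the sketch below expresses the same filtering stages over a list of repository metadata records. The record type and its field names are hypothetical and chosen only for illustration; the actual pipeline is described in the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical metadata record; field names are illustrative only.
@dataclass
class RepoMeta:
    name: str
    license: str                            # e.g. "MIT" or "Apache-2.0"
    stars: int
    java8_buildable_commit: Optional[str]   # latest commit that builds on Java 8, if any
    snapshot_hash: str                      # used to de-duplicate repositories
    has_tests: bool                         # unit or integration tests present

def curate(candidates: list[RepoMeta]) -> list[RepoMeta]:
    """Apply MigrationBench-style filters: permissive license, at least three
    GitHub stars, a buildable Java 8 starting commit, de-duplication by
    snapshot hash, and at least one test suite."""
    seen_hashes: set[str] = set()
    kept: list[RepoMeta] = []
    for repo in candidates:
        if repo.license not in {"MIT", "Apache-2.0"}:
            continue
        if repo.stars < 3:
            continue
        if repo.java8_buildable_commit is None:
            continue
        if repo.snapshot_hash in seen_hashes:
            continue
        if not repo.has_tests:
            continue
        seen_hashes.add(repo.snapshot_hash)
        kept.append(repo)
    return kept
```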

Poly-MigrationBench: Extending Beyond Java

While MigrationBench focuses exclusively on Java, the real-world code migration problem spans multiple ecosystems. To address this broader scope, we developed Poly-MigrationBench, an extension that introduces additional languages and platforms. We applied a data curation process similar to MigrationBench’s to collect:

  • 100 .NET Framework repositories, to be migrated to .NET Core.
  • 74 Node.js repositories on versions earlier than Node.js 22, to be migrated to Node.js 22.
  • 83 Python repositories on versions earlier than Python 3.13, to be migrated to Python 3.13.

The above datasets are publicly available on GitHub: https://github.com/amazon-science/Poly-MigrationBench

Together, these datasets enable researchers to explore cross-language and cross-platform migration challenges at scale.

Use Case 1: Cross-Platform .NET Migration

One pressing migration challenge lies in moving .NET applications from Windows environments running on the legacy .NET Framework to Linux environments powered by .NET Core. This migration is critical for organizations seeking cross-platform compatibility, improved performance, and modern deployment practices such as containerization.

To support research in this area, we curated a benchmark of 100 open-source .NET Framework repositories from GitHub. These projects were carefully selected for diversity and quality, offering a real-world foundation for evaluating migration tools and automated systems. The migration goal is clear: transition .NET Framework repositories to .NET Core on Linux while preserving functional equivalence.

Use Case 2: Node.js Upgrade for AWS Lambda Applications

Another timely migration need involves Lambda functions written in Node.js. Node.js 20, currently supported by Lambda, is scheduled for end-of-support in April 2026 (reference). After this deadline, projects running on Node.js 20 will no longer receive critical security patches or bug fixes.

For increased security and to avoid accumulating technical debt, developers building Lambda applications are proactively upgrading to Node.js 22. To evaluate LLMs’ effectiveness in automating this migration, Poly-MigrationBench provides a dataset of 74 open-source Node.js repositories on Node.js 20 or earlier. The task is to upgrade them to Node.js 22 while preserving functional correctness.
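
As a practical starting point for this upgrade in your own account, the sketch below uses boto3 to inventory Lambda functions that still declare the nodejs20.x runtime. It only lists functions and does not modify them; adapt the runtime identifiers to your environment.

```python
import boto3

def functions_on_runtime(runtime: str = "nodejs20.x") -> list[str]:
    """List Lambda functions in the current account/region that still use the given runtime."""
    client = boto3.client("lambda")
    names: list[str] = []
    for page in client.get_paginator("list_functions").paginate():
        for fn in page["Functions"]:
            if fn.get("Runtime") == runtime:
                names.append(fn["FunctionName"])
    return names

if __name__ == "__main__":
    for name in functions_on_runtime():
        print(f"{name} still runs on nodejs20.x and is a candidate for upgrading to nodejs22.x")
```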

Use Case 3: AWS Lambda Python Migrations

We also release benchmarks on Lambda Python repositories to the community to facilitate research and evaluation of automated Lambda function migrations in Python code. According to AWS documentation, Python 3.10 and 3.11 are scheduled to reach end of support for Lambda in June 2026. This approaching deadline highlights the urgency of migrating existing Lambda functions to newer runtimes and underscores the critical need for scalable, reliable, and LLM-driven migration solutions. To support evaluation on this task, we collect 83 AWS Lambda Python repositories on Python 3.12 or earlier. The objective is to migrate these repositories to Python 3.13.
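
The edits such a migration involves are often small but mandatory. For example, the distutils module was removed from the standard library in Python 3.12 (PEP 632), so Lambda code that relies on distutils.util.strtobool must be rewritten before it can run on a Python 3.13 runtime. The sketch below shows one possible replacement; the handler and event shape are illustrative only.

```python
# Before (works on older runtimes, fails on Python 3.12+):
#   from distutils.util import strtobool
#   enabled = bool(strtobool(event.get("enabled", "false")))

# After: a small local replacement, since distutils was removed from the
# standard library in Python 3.12 (PEP 632).
_TRUE = {"y", "yes", "t", "true", "on", "1"}
_FALSE = {"n", "no", "f", "false", "off", "0"}

def strtobool(value: str) -> bool:
    normalized = value.strip().lower()
    if normalized in _TRUE:
        return True
    if normalized in _FALSE:
        return False
    raise ValueError(f"invalid truth value: {value!r}")

def handler(event, context):
    # Illustrative Lambda handler; the event shape is hypothetical.
    enabled = strtobool(event.get("enabled", "false"))
    return {"enabled": enabled}
```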

Get Started

We’ve open-sourced both the datasets and the evaluation framework on Hugging Face and GitHub to make it easy for the community to explore, reproduce, and extend our work. Alongside them, we released a baseline solution, SD-Feedback, for MigrationBench, while leaving the development of more sophisticated agentic migration systems as an open challenge for the research community.

MigrationBench

To download the MigrationBench dataset, visit our Hugging Face collection. For evaluation, simply clone our GitHub repository and follow the steps in the README.md.
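
If you prefer to pull the data programmatically, the datasets library works as well. The dataset identifier below is an illustrative placeholder; use the exact ID listed in the MigrationBench Hugging Face collection.

```python
from datasets import load_dataset

# Placeholder identifier: substitute the dataset ID from the Hugging Face collection.
ds = load_dataset("AmazonScience/migration-bench-java-full")

# Inspect the available splits and columns; the exact schema is defined by the
# published dataset, so check the dataset card before relying on field names.
print(ds)
```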

Poly-MigrationBench

To access the Poly-MigrationBench dataset and evaluation commands, clone our GitHub repository.

For a deeper dive into how the benchmarks were curated and how the evaluation framework was designed, check out our paper:

MigrationBench: Repository-Level Code Migration Benchmark from Java 8

Conclusion

Code migration is an essential but complex task for maintaining long-term software reliability and security. With MigrationBench and Poly-MigrationBench, we aim to provide the community with systematic, large-scale benchmarks that enable reproducible research and practical evaluation of automated migration approaches.

Authors

Linbo Liu

Linbo Liu is an Applied Scientist at Amazon Web Services. He works on coding agents optimization and post-training.

Yiyi Guo

Yiyi Guo is a Senior Product Manager at Amazon Web Services. She works on agentic AI, software migration and modernization in AWS Transform.

Luke Huan

Luke Huan is a Senior Principal Scientist at Amazon Web Services. He works on agentic AI, generative AI, AI4code and supports AWS Transform.