2025

Multitudes builds code review feature in 2 months using 3 LLMs on Amazon Bedrock

Learn how Multitudes uses Amazon Bedrock to help software engineers improve code review quality.

Benefits

2

months to build and launch AI feature

44%

increase in monthly active users

10

LLMs tested across 1,000 code reviews

<1%

extreme misclassification rates, down from 20%

Overview

Focused on improving engineering team performance through telemetry data and action nudges, New Zealand–based startup Multitudes set out to build features using AI to assess code review quality for its customers. To achieve this, the company used Amazon Web Services (AWS) generative AI capabilities to power a new code review quality feature. As a result, Multitudes increased monthly active users by 44 percent, reduced misclassification rates from traditional models from 20 percent to less than 1 percent, and delivered actionable insights that improved team performance.


About Multitudes

Multitudes is a New Zealand–based startup that helps engineering leaders improve team performance and understand the impact of AI. By analyzing data from GitHub, Linear, Jira, PagerDuty, Google Calendar, and more, it uncovers which work is blocked and which people aren’t getting enough support.

Opportunity | Meeting customer demand for qualitative code reviews

To advance its mission of helping teams work better together, Multitudes built an analytics platform that enhances software engineering performance with real-time insights into delivery blockers, feedback, and burnout prevention. As the company expanded, however, it faced a challenge: delivering deeper insights into the quality of code reviews. “We’ve always measured code review activity by the number of reviews or comments, but our customers wanted insight into the quality of those reviews,” says Vivek Katial, lead data scientist at Multitudes.

Traditional natural language processing (NLP) and machine learning (ML) models couldn’t deliver the accuracy or performance needed to build a feature that customers could trust. “We needed a level of accuracy high enough for our customers to trust and truly value the feature,” Katial adds.

Solution | Building a new code review feature with LLMs on AWS

Multitudes chose AWS generative AI technologies, including Amazon Bedrock, to expand its platform and deliver a new code review quality feature. “Data security was another advantage, because the security conversations with our customers are more straightforward when we can use Amazon Bedrock to maintain everything in one environment and not send data outside it,” explains Katial.

To develop the feature, the team evaluated nearly 1,000 code reviews and manually created a labeled ground truth dataset across three dimensions: feedback specificity, tone/sentiment, and bot-generated activity. It then used different models for each dimension—Amazon Nova Pro for bot detection, Anthropic Claude for feedback specificity and prompt-injection detection, and Mistral for sentiment analysis—with Amazon Elastic Container Service (Amazon ECS) used to orchestrate the data pipeline.
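The per-dimension routing described above can be sketched in Python. This is a minimal, illustrative sketch, not Multitudes’ implementation: the Bedrock model IDs, prompt wording, and the `build_request` helper are assumptions chosen for the example, and the actual commented-out API call requires AWS credentials.

```python
# Sketch of routing each code review dimension to a different Bedrock model.
# Model IDs and prompts are illustrative assumptions; real IDs vary by
# region and model version.

DIMENSION_MODELS = {
    "bot_detection": "amazon.nova-pro-v1:0",
    "feedback_specificity": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "sentiment": "mistral.mistral-large-2402-v1:0",
}

PROMPTS = {
    "bot_detection": "Is this code review comment bot-generated? Answer yes or no.",
    "feedback_specificity": (
        "Rate the specificity of this review feedback: high, medium, or minimal."
    ),
    "sentiment": (
        "Classify the tone of this review comment: "
        "constructive, neutral, or needs_attention."
    ),
}

def build_request(dimension: str, review_comment: str) -> dict:
    """Build a Bedrock Converse API request body for one dimension."""
    return {
        "modelId": DIMENSION_MODELS[dimension],
        "messages": [
            {
                "role": "user",
                "content": [{"text": f"{PROMPTS[dimension]}\n\n{review_comment}"}],
            }
        ],
    }

# With AWS credentials configured, an actual call would look like:
#   import boto3
#   client = boto3.client("bedrock-runtime")
#   response = client.converse(**build_request("sentiment", "LGTM, ship it"))
```

Keeping the model choice in a dictionary keyed by dimension makes it straightforward to swap a model for one dimension (as the team did during evaluation) without touching the rest of the pipeline.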

Amazon Bedrock gave Multitudes the flexibility to systematically test and evaluate over 10 large language models (LLMs) against the ground truth dataset for each dimension. The team tested multiple variations of prompts and tasks for each model, and running structured experiments enabled them to identify the optimal model-prompt combinations for the feature. Multitudes also iterated on the design and user experience based on early feedback. For example, users reported that labeling feedback as “negative” with red highlighting felt overly harsh, so the team reframed the language to “needs attention” and softened the visuals to yellow highlighting. To help users contextualize their data, the team also added benchmarks within the chart showing what constructive versus minimal reviews look like, making it easier for teams to assess their performance at a glance.

Outcome | Growing monthly active users by 44% and improving accuracy

Within two months of launch, the new code review quality feature drove a 44 percent increase in monthly active users. Model accuracy improved significantly, with severe misclassification rates falling from 20 percent to under 1 percent—indicating the model now rarely confuses highly specific feedback with minimal reviews. “Amazon Bedrock truly helped us achieve something new for our company,” says Katial.

The feature quickly became one of the platform’s top five most-used capabilities and remains so months after launch. Teams regularly use it during one-on-one coaching sessions to surface and address unhelpful or harsh feedback, showing that it is both widely adopted and changing how developers approach feedback conversations.

Building on the success of its code review quality feature, Multitudes has released a second capability that classifies the themes of code reviews, such as whether comments relate to testing, formatting, or overall quality. A third feature in development helps improve Jira hygiene by ensuring tasks are clearly labeled and documented, automatically sorting work into feature development, maintenance, or bug fixes. Together, these innovations move Multitudes closer to its mission of helping teams work better, with generative AI playing a central role in delivering insights and recommendations.

Within two months of launch, the new code review quality feature drove a 44 percent increase in monthly active users. Amazon Bedrock truly helped us achieve something new for our company.

Vivek Katial

Lead Data Scientist, Multitudes

Get Started

Organizations of all sizes across all industries are transforming their businesses and delivering on their missions every day using AWS. Contact our experts and start your own AWS journey today.

