HAL isn’t taking over yet, but AI can safely replace one human reviewer in code annotation

A dystopian future in which malevolent computers like HAL 9000 replace human decision-making remains fiction, but software teams can now safely offload specific, repetitive tasks to artificial intelligence, researchers have claimed.

New research co-authored by Singapore Management University (SMU) finds that large language models (LLMs) can effectively substitute for a single human reviewer in some code annotation tasks without compromising reliability.

The paper, which won the ACM SIGSOFT Distinguished Paper Award at the 22nd International Conference on Mining Software Repositories (MSR 2025), suggests that while science fiction imagines total automation, the reality is a pragmatic partnership in which AI handles low-context grunt work.

“We found that for low-context, deductive tasks, one human can be replaced by an LLM to save effort without losing reliability,” said Christoph Treude, an Associate Professor of Computer Science at SMU. “However, for high-context tasks, LLMs are unreliable.”

Safe substitution

The team examined 10 annotation tasks in which multiple humans had labelled the same samples, and found that seven of them allowed for the safe substitution of one human reviewer.

When testing models like GPT-4, Claude 3.5 and Gemini 1.5, researchers found that “model-model agreement” was the key predictor of success. If multiple AIs independently agreed on a label, the machine was likely as accurate as a human.
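To make the idea concrete, here is a minimal sketch of how a team might act on model-model agreement. It is an illustration, not the researchers' method: the function name `route_samples`, the `min_agreement` threshold and the sample labels are all hypothetical, and per-sample voting is a simplification of whatever agreement measure the paper itself uses.

```python
from collections import Counter


def route_samples(model_labels, min_agreement=1.0):
    """Split samples into auto-accepted labels and ones needing a human.

    model_labels: dict mapping a sample id to the list of labels that
    independent models assigned to it (hypothetical input format).
    min_agreement: fraction of models that must share the top label;
    1.0 means every model must agree (an assumed default).
    """
    auto_labelled = {}
    needs_human = []
    for sample_id, labels in model_labels.items():
        if not labels:
            # No model output at all: always send to a human.
            needs_human.append(sample_id)
            continue
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            auto_labelled[sample_id] = top_label
        else:
            needs_human.append(sample_id)
    return auto_labelled, needs_human


# Made-up example: three models label three code comments.
example = {
    "c1": ["todo", "todo", "todo"],
    "c2": ["todo", "license", "todo"],
    "c3": ["license", "license", "license"],
}
auto, manual = route_samples(example)
print(auto)
print(manual)
```

With the default threshold, only samples on which every model agrees are labelled automatically, and the rest go back to a human annotator.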

However, the technology still lacks the “deep situational awareness” required for high-context tasks, such as determining whether a bug report was truly resolved or assessing static analysis warnings.

“That task requires substantial contextual understanding: examining code changes, project history and warning semantics,” said Treude. “Humans achieve high agreement, but models perform poorly.”

The findings offer a roadmap for integrating AI without surrendering control to the machine, the researchers said.

“Our view is pragmatic: use LLMs to accelerate annotation where it’s safe, not to eliminate human judgment,” said Treude.