HAL-9000.
Photo credit: Cryteria

A dystopian future in which malevolent computers like HAL 9000 replace human decision-making remains fiction, but software teams can now safely offload specific, repetitive tasks to artificial intelligence, researchers have claimed.

New research co-authored by Singapore Management University (SMU) reveals that Large Language Models (LLMs) can effectively substitute for a single human reviewer in code annotation tasks without compromising reliability.

The paper, which won the ACM SIGSOFT Distinguished Paper Award at the 22nd International Conference on Mining Software Repositories (MSR 2025), suggests that while science fiction fears total automation, the reality is a pragmatic partnership in which AI handles low-context grunt work.

“We found that for low-context, deductive tasks, one human can be replaced by an LLM to save effort without losing reliability,” said Christoph Treude, an Associate Professor of Computer Science at SMU. “However, for high-context tasks, LLMs are unreliable.”

Safe substitution

The team examined 10 cases where multiple humans had annotated samples, finding that seven of these scenarios allowed for the safe substitution of one human reviewer.

When testing models like GPT-4, Claude 3.5 and Gemini 1.5, researchers found that “model-model agreement” was the key predictor of success. If multiple AIs independently agreed on a label, the machine was likely as accurate as a human.
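The agreement signal the researchers describe can be illustrated with a short sketch. This is not the paper's actual pipeline; the labels, threshold, and helper function below are hypothetical, showing only the general idea of measuring how often independent models assign the same label to the same samples.

```python
# Illustrative sketch (not the study's actual method): estimate
# "model-model agreement" as the mean fraction of samples on which
# each pair of independent annotators assigns the same label.

from itertools import combinations

def pairwise_agreement(label_sets):
    """Average per-pair fraction of matching labels across annotators."""
    scores = []
    for a, b in combinations(label_sets, 2):
        matches = sum(x == y for x, y in zip(a, b))
        scores.append(matches / len(a))
    return sum(scores) / len(scores)

# Hypothetical labels from three models on six code samples.
gpt4   = ["bug", "feature", "bug", "docs", "bug", "feature"]
claude = ["bug", "feature", "bug", "docs", "bug", "docs"]
gemini = ["bug", "feature", "bug", "docs", "feature", "docs"]

agreement = pairwise_agreement([gpt4, claude, gemini])

# Hypothetical cutoff: treat high model-model agreement as a signal
# that substituting one human reviewer with an LLM may be safe.
SAFE_THRESHOLD = 0.8
print(f"model-model agreement: {agreement:.2f}")
print("substitution looks safe" if agreement >= SAFE_THRESHOLD
      else "keep human reviewers")
```

In practice a more robust statistic such as Cohen's kappa, which corrects for chance agreement, would typically be used in place of raw percent agreement.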

However, the technology still lacks the “deep situational awareness” required for high-context tasks, such as determining if a bug report was truly resolved or analysing static analysis warnings.

“That task requires substantial contextual understanding: examining code changes, project history and warning semantics,” said Treude. “Humans achieve high agreement, but models perform poorly.”

The findings offer a roadmap for integrating AI without surrendering control to the machine, researchers claimed.

“Our view is pragmatic: use LLMs to accelerate annotation where it’s safe, not to eliminate human judgment,” said Treude.
