HAL-9000.
Photo credit: Cryteria

A dystopian future in which malevolent computers like HAL 9000 replace human decision-making remains fiction, but software teams can now safely offload specific, repetitive tasks to artificial intelligence, researchers have claimed.

New research co-authored by Singapore Management University (SMU) reveals that Large Language Models (LLMs) can effectively substitute for a single human reviewer in code annotation tasks without compromising reliability.

The paper, which won the ACM SIGSOFT Distinguished Paper Award at the 22nd International Conference on Mining Software Repositories (MSR 2025), suggests that while science fiction imagines total automation, the reality is a pragmatic partnership in which AI handles low-context grunt work.

“We found that for low-context, deductive tasks, one human can be replaced by an LLM to save effort without losing reliability,” said Christoph Treude, an Associate Professor of Computer Science at SMU. “However, for high-context tasks, LLMs are unreliable.”

Safe substitution

The team examined 10 cases where multiple humans had annotated samples, finding that seven of these scenarios allowed for the safe substitution of one human reviewer.

When testing models like GPT-4, Claude 3.5 and Gemini 1.5, researchers found that “model-model agreement” was the key predictor of success. If multiple AIs independently agreed on a label, the machine was likely as accurate as a human.
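To make the idea of “model-model agreement” concrete, the sketch below scores how consistently several models label the same samples. It is an illustration only: the article does not say which agreement statistic the researchers used, so mean pairwise Cohen's kappa is an assumption here, and the function names, the model keys and the 0.8 threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of samples on which the two annotators gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(labels_by_model):
    """Average Cohen's kappa over every pair of models."""
    if len(labels_by_model) < 2:
        raise ValueError("need labels from at least two models")
    pairs = list(combinations(labels_by_model.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def substitution_candidate(labels_by_model, threshold=0.8):
    """Flag a task as a candidate for replacing one human annotator.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    return mean_pairwise_kappa(labels_by_model) >= threshold
```

Called with a dictionary that maps each model name to its list of labels, `substitution_candidate` returns `True` only when the models largely agree with one another. A low score would suggest the task needs more context than the models have, which is the situation the researchers warn about for high-context work.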

However, the technology still lacks the “deep situational awareness” required for high-context tasks, such as determining whether a bug report was truly resolved or assessing static analysis warnings.

“That task requires substantial contextual understanding: examining code changes, project history and warning semantics,” said Treude. “Humans achieve high agreement, but models perform poorly.”

The findings offer a roadmap for integrating AI without surrendering control to the machine, researchers claimed.

“Our view is pragmatic: use LLMs to accelerate annotation where it’s safe, not to eliminate human judgment,” said Treude.

