Top AI models fail most real-world paid tasks without human supervision


Humans helping AI. — Photo credit: theFreesheet/Google ImageFX


Frontier artificial intelligence models struggle to complete professional freelance tasks autonomously, with even the most advanced systems failing more often than they succeed when working without human intervention.

A new study by freelance marketplace Upwork, evaluating AI agents against 322 verified paid jobs, reveals that autonomous completion rates remain low. The best-performing model, Claude Sonnet 4, achieved a completion rate of only 39.8 per cent on its first attempt.

Other advanced models fared worse in the autonomous setting. Gemini 2.5 Pro achieved a 19.9 per cent completion rate, whilst GPT-5 managed 19.6 per cent.

The human ‘rescue’ factor

The research highlights that human-in-the-loop (HITL) intervention is critical for economic viability. When human experts provided feedback on failed attempts, they achieved a “rescue rate” of between 18 per cent and 23.3 per cent, effectively salvaging roughly one in five failed projects.

“Across all three agents, the integration of human feedback leads to substantial performance improvements,” the researchers state, noting relative gains of between 29 per cent and 71 per cent over the AI-only baseline.

The study proposes a “breakeven” framework for deploying digital labour. AI-only approaches yield the highest expected net value for low-stakes tasks because they cost so little to run, whereas high-value work still demands human execution because there the “cost of failure outweighs automation gains”.

The analysis suggests that for mid-value tasks, collaborative human-AI systems offering “higher success rates that justify their added human cost” are becoming the optimal economic choice.
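The breakeven logic can be illustrated with a rough sketch. The numbers below are hypothetical and chosen only to show the shape of the trade-off: the success rates, per-task costs and failure penalty are assumptions, not figures from the Upwork study, and the formula (expected gain minus expected failure loss minus cost) is one plausible reading of “expected net value”, not necessarily the researchers' own definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    success_rate: float  # assumed probability the task is completed
    cost: float          # assumed cost of attempting the task


# Hypothetical assumption: a failed task loses half of the task's value.
FAILURE_PENALTY = 0.5

# Hypothetical strategies; none of these values come from the study.
STRATEGIES = [
    Strategy("ai_only", success_rate=0.40, cost=1.0),
    Strategy("human_ai", success_rate=0.50, cost=20.0),
    Strategy("human_only", success_rate=0.95, cost=300.0),
]


def expected_net_value(strategy: Strategy, task_value: float) -> float:
    """Expected gain from success, minus expected loss from failure, minus cost."""
    gain = strategy.success_rate * task_value
    loss = (1.0 - strategy.success_rate) * FAILURE_PENALTY * task_value
    return gain - loss - strategy.cost


def best_strategy(task_value: float) -> str:
    """Name of the strategy with the highest expected net value."""
    return max(STRATEGIES, key=lambda s: expected_net_value(s, task_value)).name


if __name__ == "__main__":
    for value in (50, 200, 1000):
        print(value, best_strategy(value))
```

With these made-up inputs, a task worth 50 favours the AI-only strategy, a task worth 200 favours the collaborative one, and a task worth 1,000 favours human execution, which mirrors the low-, mid- and high-value tiers described above. Different assumed costs or penalties would move the breakeven points.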
