Top AI models fail most real-world paid tasks without human supervision

Humans helping AI. — Photo credit: theFreesheet/Google ImageFX

Frontier artificial intelligence models struggle to complete professional freelance tasks autonomously, with even the most advanced systems failing significantly more often than they succeed on a first attempt, before any human expert steps in.

A new study by freelance marketplace Upwork, evaluating AI agents against 322 verified paid jobs, reveals that autonomous completion rates remain low. The best-performing model, Claude Sonnet 4, achieved a completion rate of only 39.8 per cent on its first attempt.

Other advanced models fared worse in the autonomous setting. Gemini 2.5 Pro achieved a 19.9 per cent completion rate, whilst GPT-5 managed 19.6 per cent.

The human ‘rescue’ factor

The research highlights that human-in-the-loop (HITL) intervention is critical for economic viability. When human experts provided feedback on failed attempts, they achieved a “rescue rate” of between 18 per cent and 23.3 per cent, effectively salvaging roughly one in five failed projects.

“Across all three agents, the integration of human feedback leads to substantial performance improvements,” the researchers state, noting relative gains of between 29 per cent and 71 per cent over the AI-only baseline.
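The arithmetic behind a rescue rate can be sketched in a few lines. This is an illustration only: it assumes the rescue rate is the share of failed first attempts that human feedback later salvages, and the function names and the round 20 per cent rescue rate in the usage note are hypothetical. The article does not say which rescue rate belongs to which model, nor how the study calculated its gains, so the outputs need not match the reported figures.

```python
def final_completion_rate(first_attempt_rate: float, rescue_rate: float) -> float:
    """Completion rate after human feedback rescues a share of failed attempts.

    Assumption (not from the study): rescue_rate applies only to the
    tasks that failed on the first, AI-only attempt.
    """
    failed = 1.0 - first_attempt_rate
    return first_attempt_rate + rescue_rate * failed


def relative_gain(first_attempt_rate: float, rescue_rate: float) -> float:
    """Improvement over the AI-only baseline, as a fraction of that baseline."""
    final = final_completion_rate(first_attempt_rate, rescue_rate)
    return (final - first_attempt_rate) / first_attempt_rate


if __name__ == "__main__":
    # First-attempt rates quoted in the article; 0.20 is an illustrative rescue rate.
    for label, rate in [("39.8 per cent baseline", 0.398), ("19.6 per cent baseline", 0.196)]:
        print(label, round(final_completion_rate(rate, 0.20), 3), round(relative_gain(rate, 0.20), 3))
```

Under this assumption, the same rescue rate produces a larger relative gain for a model with a lower first-attempt rate, because more of its tasks are available to be rescued and its baseline is smaller.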

The study proposes a “breakeven” framework for deploying digital labour. While AI-only approaches yield the highest expected net value for low-stakes tasks due to minimal cost, high-value work still demands human execution where the “cost of failure outweighs automation gains”.

The analysis suggests that for mid-value tasks, collaborative human-AI systems offering “higher success rates that justify their added human cost” are becoming the optimal economic choice.
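A breakeven comparison of this kind can be sketched as an expected net value calculation. The following is a minimal illustration, not the study's actual model: the success rates, costs, the failure penalty and every name in the code are hypothetical placeholders, chosen only to show how the preferred mode can shift as task value rises.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    success_rate: float  # probability the task is completed acceptably
    cost: float          # cost of running this mode once


# Hypothetical figures for illustration only; none come from the study.
MODES = [
    Mode("ai_only", 0.40, 1.0),
    Mode("human_ai", 0.52, 15.0),
    Mode("human_only", 0.95, 100.0),
]

# Assumption: a failed task costs half of its value (rework, lost client trust).
FAILURE_PENALTY = 0.5


def expected_net_value(mode: Mode, task_value: float) -> float:
    """Expected gain from success, minus expected failure loss, minus running cost."""
    gain = mode.success_rate * task_value
    loss = (1.0 - mode.success_rate) * FAILURE_PENALTY * task_value
    return gain - loss - mode.cost


def best_mode(task_value: float) -> str:
    """Name of the mode with the highest expected net value for this task value."""
    return max(MODES, key=lambda m: expected_net_value(m, task_value)).name


if __name__ == "__main__":
    for value in (20, 100, 500):
        print(value, best_mode(value))
```

With these placeholder numbers, the cheapest mode wins for low-value tasks, the collaborative mode wins in a middle band, and human execution wins once the cost of failure dominates. The crossover points depend entirely on the assumed inputs.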
