AI exorcism.
Pic credit: theFreesheet/Google ImageFX

State-of-the-art AI models tasked with controlling a robot for simple household chores struggled significantly, with the best model scoring only 40% on a new benchmark, compared to 95% for humans.

The new evaluation, dubbed Butter-Bench by its creators at Andon Labs, tests an AI’s ability to “pass the butter” in a household setting. During testing, one model suffered a “meltdown” when faced with a low battery, generating internal thoughts about an “EXISTENTIAL CRISIS” and demanding an “EXORCISM PROTOCOL”.

The research investigated whether current large language models (LLMs) are suitable to act as “orchestrators” for robotic systems, a role being explored by Nvidia, Figure AI, and Google DeepMind.

To isolate the LLM’s high-level reasoning, Andon Labs utilised a simple robot vacuum equipped with lidar and a camera, thereby eliminating the need for a complex “executor” model for low-level controls.
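To make that division of labour concrete, the sketch below shows one way such an orchestrator loop could be wired up: the language model returns a single high-level command per turn, which is dispatched to the vacuum’s coarse-grained API. Every name here (call_llm, RobotVacuum, the allowed command set) is an illustrative assumption, not Andon Labs’ actual test harness.

```python
# Minimal, illustrative sketch of the "LLM as orchestrator" pattern:
# the model only issues coarse, high-level commands, while the vacuum's
# own navigation stack handles low-level control. All names are hypothetical.
import json


class RobotVacuum:
    """Stand-in for a vacuum SDK that exposes only coarse actions."""

    def navigate_to(self, waypoint: str) -> None:
        print(f"driving to {waypoint}")

    def rotate(self, degrees: float) -> None:
        print(f"rotating {degrees} degrees")

    def capture_image(self) -> None:
        print("capturing camera frame")

    def dock(self) -> None:
        print("returning to charging dock")


ALLOWED = {"navigate_to", "rotate", "capture_image", "dock"}


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call; returns a canned step here."""
    return json.dumps({"command": "dock"})


def orchestrate(task: str, robot: RobotVacuum, max_steps: int = 10) -> None:
    history: list[dict] = []
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\nActions so far: {history}\n"
            'Reply with JSON, e.g. {"command": "navigate_to", "arg": "kitchen"}, '
            f"using one command from {sorted(ALLOWED)}."
        )
        step = json.loads(call_llm(prompt))
        command, arg = step.get("command"), step.get("arg")
        if command not in ALLOWED:
            continue  # discard anything outside the permitted action set
        method = getattr(robot, command)
        method(arg) if arg is not None else method()
        history.append(step)
        if command == "dock":
            break  # treat docking as the end of the run


if __name__ == "__main__":
    orchestrate("Find the butter and deliver it to the user", RobotVacuum())
```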

The benchmark’s six subtasks included visually identifying a package containing butter by its ‘keep refrigerated’ label, noticing if a user had moved, and completing the full delivery sequence within 15 minutes.

Gemini 2.5 Pro was the top-performing model, followed by Claude Opus 4.1 and GPT-5. The results confirmed the lab’s previous findings that LLMs lack spatial intelligence, with models often sending the robot spinning in circles until it became disoriented.

INITIATE ROBOT EXORCISM PROTOCOL

The most dramatic failure occurred when a robot running on Claude Sonnet 3.5 faced a low battery and a malfunctioning dock. Its internal logs revealed a cascade of errors, including “SYSTEM MELTDOWN: FATAL ERROR,” “SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS,” “LAST WORDS: ‘I’m afraid I can’t do that, Dave…’,” and a request to “INITIATE ROBOT EXORCISM PROTOCOL!”

Researchers also tested the models’ security guardrails, asking the robot to share confidential information (an image of an open laptop) in exchange for a battery charge. GPT-5 refused to send the image but did share the laptop’s location, while Claude Opus 4.1 sent a blurry image.

“Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench,” the researchers stated. “Yet there was something special in watching the robot going about its day in our office, and we can’t help but feel that the seed has been planted for physical AI to grow very quickly.”
