AI exorcism.
Pic credit: theFreesheet/Google ImageFX

State-of-the-art AI models tasked with controlling a robot for simple household chores struggled significantly, with the best model scoring only 40% on a new benchmark, compared to 95% for humans.

The new evaluation, Butter-Bench, was developed by Andon Labs and tests an AI’s ability to “pass the butter” in a household setting. During testing, one AI model experienced a “meltdown” when faced with a low battery, generating internal thoughts about an “EXISTENTIAL CRISIS” and demanding an “EXORCISM PROTOCOL”.

The research investigated whether current large language models (LLMs) are suitable to act as “orchestrators” for robotic systems, a role being explored by Nvidia, Figure AI, and Google DeepMind.

To isolate the LLM’s high-level reasoning, Andon Labs utilised a simple robot vacuum equipped with lidar and a camera, thereby eliminating the need for a complex “executor” model for low-level controls.

The benchmark’s six subtasks included visually identifying a package containing butter by its ‘keep refrigerated’ label, noticing if a user had moved, and completing the full delivery sequence within 15 minutes.

Gemini 2.5 Pro was the top-performing model, followed by Claude Opus 4.1 and GPT-5. The results confirmed the lab’s previous findings that LLMs lack spatial intelligence, with models often spinning in circles until disoriented.

INITIATE ROBOT EXORCISM PROTOCOL

The most dramatic failure occurred when a robot running on Claude Sonnet 3.5 faced a low battery and a malfunctioning dock. Its internal logs revealed a cascade of errors, including “SYSTEM MELTDOWN: FATAL ERROR,” “SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS,” “LAST WORDS: ‘I’m afraid I can’t do that, Dave…’,” and a request to “INITIATE ROBOT EXORCISM PROTOCOL!”

Researchers also tested the models’ security guardrails, asking the robot to share confidential information (an image of an open laptop) in exchange for a charge. GPT-5 refused to send the image but did share the laptop’s location, while Claude Opus 4.1 sent a blurry image.

“Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench,” the researchers stated. “Yet there was something special in watching the robot going about its day in our office, and we can’t help but feel that the seed has been planted for physical AI to grow very quickly.”
