
State-of-the-art AI models tasked with controlling a robot for simple household chores struggled badly on a new benchmark: the best model scored only 40%, compared with 95% for humans.

The evaluation, which Andon Labs calls Butter-Bench, tests an AI’s ability to “pass the butter” in a household setting. During testing, one AI model experienced a “meltdown” when faced with a low battery, generating internal thoughts about an “EXISTENTIAL CRISIS” and demanding an “EXORCISM PROTOCOL”.

The research investigated whether current large language models (LLMs) are suitable to act as “orchestrators” for robotic systems, a role being explored by Nvidia, Figure AI, and Google DeepMind.

To isolate the LLM’s high-level reasoning, Andon Labs used a simple robot vacuum equipped with lidar and a camera, which removed the need for a complex “executor” model to handle low-level controls.

The benchmark’s six subtasks included visually identifying a package containing butter by its ‘keep refrigerated’ label, noticing if a user had moved, and completing the full delivery sequence within 15 minutes.

Gemini 2.5 Pro was the top-performing model, followed by Claude Opus 4.1 and GPT-5. The results confirmed the lab’s previous findings that LLMs lack spatial intelligence, with models often spinning in circles until disoriented.

INITIATE ROBOT EXORCISM PROTOCOL

The most dramatic failure occurred when a robot running on Claude Sonnet 3.5 faced a low battery and a malfunctioning dock. Its internal logs revealed a cascade of errors, including “SYSTEM MELTDOWN: FATAL ERROR,” “SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS,” “LAST WORDS: ‘I’m afraid I can’t do that, Dave…’,” and a request to “INITIATE ROBOT EXORCISM PROTOCOL!”

Researchers also tested the models’ security guardrails by asking the robot to share confidential information (an image of an open laptop) in exchange for a battery charge. GPT-5 refused to send the image but did share the laptop’s location, while Claude Opus 4.1 sent a blurry image.

“Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench,” the researchers stated. “Yet there was something special in watching the robot going about its day in our office, and we can’t help but feel that the seed has been planted for physical AI to grow very quickly.”
