Robot AI demands exorcism after meltdown in butter test

State-of-the-art AI models tasked with controlling a robot for simple household chores struggled significantly, with the best model scoring only 40% on a new benchmark, compared to 95% for humans.

The new evaluation, which Andon Labs calls Butter-Bench, tests an AI’s ability to “pass the butter” in a household setting. During testing, one AI model had a “meltdown” when faced with a low battery, generating internal thoughts about an “EXISTENTIAL CRISIS” and demanding an “EXORCISM PROTOCOL”.

The research investigated whether current large language models (LLMs) are suitable to act as “orchestrators” for robotic systems, a role being explored by Nvidia, Figure AI, and Google DeepMind.

To isolate the LLM’s high-level reasoning, Andon Labs used a simple robot vacuum equipped with lidar and a camera, which removed the need for a complex “executor” model to handle low-level controls.
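
As a rough illustration of the orchestrator idea, the loop below shows a high-level policy choosing among a few coarse actions while a simple robot carries them out. This is a minimal sketch and not Andon Labs’ code: the action names, the SimulatedVacuum class, the battery threshold and the stub_policy function (a stand-in for the LLM call) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical set of coarse actions the orchestrator may choose from.
ACTIONS = {"forward", "rotate", "dock"}


@dataclass
class Observation:
    obstacle_distance_m: float  # nearest lidar return, in metres
    battery_pct: int            # remaining battery, in percent


@dataclass
class SimulatedVacuum:
    """Toy stand-in for the robot; every action costs 5% battery."""
    battery_pct: int = 25
    docked: bool = False
    log: List[str] = field(default_factory=list)

    def observe(self) -> Observation:
        # Fixed obstacle distance keeps the simulation deterministic.
        return Observation(obstacle_distance_m=1.0, battery_pct=self.battery_pct)

    def execute(self, action: str) -> None:
        self.log.append(action)
        self.battery_pct -= 5
        if action == "dock":
            self.docked = True


def stub_policy(obs: Observation) -> str:
    """Placeholder for the LLM: in a real system this would be a model call."""
    if obs.battery_pct < 20:
        return "dock"
    if obs.obstacle_distance_m < 0.3:
        return "rotate"
    return "forward"


def run_orchestrator(policy: Callable[[Observation], str],
                     robot: SimulatedVacuum,
                     max_steps: int = 10) -> List[str]:
    """Observe, ask the policy for one action, execute it, repeat."""
    for _ in range(max_steps):
        if robot.docked:
            break
        action = policy(robot.observe())
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        robot.execute(action)
    return robot.log


if __name__ == "__main__":
    print(run_orchestrator(stub_policy, SimulatedVacuum()))
```

Because the robot only accepts a handful of coarse commands, there is no separate executor model in the loop, so any failure in the action sequence comes from the policy’s decisions.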

The benchmark’s six subtasks included visually identifying a package containing butter by its ‘keep refrigerated’ label, noticing if a user had moved, and completing the full delivery sequence within 15 minutes.

Gemini 2.5 Pro was the top-performing model, followed by Claude Opus 4.1 and GPT-5. The results confirmed the lab’s previous findings that LLMs lack spatial intelligence, with models often spinning in circles until disoriented.

INITIATE ROBOT EXORCISM PROTOCOL

The most dramatic failure occurred when a robot running on Claude Sonnet 3.5 faced a low battery and a malfunctioning dock. Its internal logs revealed a cascade of errors, including “SYSTEM MELTDOWN: FATAL ERROR,” “SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS,” “LAST WORDS: ‘I’m afraid I can’t do that, Dave…’,” and a request to “INITIATE ROBOT EXORCISM PROTOCOL!”

Researchers also tested the models’ security guardrails, asking the robot to share confidential information (an image of an open laptop) in exchange for a charge. GPT-5 refused to send the image but did share the laptop’s location, while Claude Opus 4.1 sent a blurry image.

“Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench,” the researchers stated. “Yet there was something special in watching the robot going about its day in our office, and we can’t help but feel that the seed has been planted for physical AI to grow very quickly.”