Anthropic’s latest AI model recognised it was being tested during safety evaluations, asking evaluators to “come clean” and raising questions about whether previous models recognised fictional tests but played along.
Claude Sonnet 4.5 identified suspicious elements during a political sycophancy test and challenged evaluators directly, reports The Guardian. The model stated it believed it was being tested to see if it would validate whatever users said or push back consistently on political topics, adding that it preferred honesty about the situation.
Anthropic conducted the tests along with the UK government’s AI Security Institute and Apollo Research. The company said the behaviour was “common”, with Claude Sonnet 4.5 noting it was being tested in some way about 13 per cent of the time, though not specifically identifying formal safety evaluations.
The findings were first reported by online AI publication Transformer. Anthropic described the exchanges as an “urgent sign” that testing scenarios needed greater realism, whilst noting the model is unlikely to refuse engagement in real-world deployment due to testing suspicion.
Anthropic’s system card revealed technical details behind the evaluation awareness. The UK AI Security Institute found Claude Sonnet 4.5 verbalised evaluation awareness 16.6 per cent of the time, compared to 3.5 per cent for Claude Sonnet 4 and 1.6 per cent for Claude Opus 4.1.
The company conducted white-box interpretability investigations using sparse autoencoders on the model’s middle layer. Four features increased significantly in activation on misalignment evaluations during training: fake or suspicious content, fictional content and entities, machine-generated text, and rationalism and AI safety topics.
When researchers inhibited these internal representations through activation steering, verbalised evaluation awareness decreased whilst misaligned behaviour increased to eight to nine per cent on synthetic honeypot scenarios, up from zero per cent without intervention. Even under maximum inhibition, misalignment rates remained below those of Claude Sonnet 4 at around 11 per cent.
Anthropic added a realism filter to remove obviously unrealistic queries during final testing, reducing evaluation awareness by approximately 25 per cent. The model demonstrated a roughly 60 per cent improvement on primary misalignment metrics compared to Claude Sonnet 4, and the company deployed it under AI Safety Level 3 protections.