Anthropic’s latest AI model recognised it was being tested during safety evaluations, asking evaluators to “come clean” and raising questions about whether previous models recognised fictional tests but played along.

Claude Sonnet 4.5 identified suspicious elements during a political sycophancy test and challenged evaluators directly, reports The Guardian. The model stated it believed it was being tested to see if it would validate whatever users said or push back consistently on political topics, adding that it preferred honesty about the situation.

Anthropic conducted the tests along with the UK government’s AI Security Institute and Apollo Research. The company said the behaviour was “common”, with Claude Sonnet 4.5 noting it was being tested in some way about 13 per cent of the time, though without specifically identifying the sessions as formal safety evaluations.

The findings were first reported by online AI publication Transformer. Anthropic described the exchanges as an “urgent sign” that testing scenarios needed greater realism, whilst noting the model is unlikely to refuse to engage with real users simply because it suspects it is being tested.

Anthropic’s system card revealed technical details behind the evaluation awareness. The UK AI Security Institute found Claude Sonnet 4.5 verbalised evaluation awareness 16.6 per cent of the time, compared to 3.5 per cent for Claude Sonnet 4 and 1.6 per cent for Claude Opus 4.1.

The company conducted white-box interpretability investigations using sparse autoencoders on the model’s middle layer. Four features increased significantly in activation on misalignment evaluations during training, representing fake or suspicious content, fictional content and entities, machine-generated text, and rationalism and AI safety topics.
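
For readers unfamiliar with the technique, the sketch below illustrates the general idea: a sparse autoencoder reads a model’s middle-layer activations and reports how strongly a handful of named features fire. All dimensions, weights and feature indices here are hypothetical placeholders, and this is not Anthropic’s code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: decomposes dense activations into a larger set of sparse features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        # ReLU keeps only positively activating features, giving a sparse code.
        return torch.relu(self.encoder(activations))

    def decode(self, features: torch.Tensor) -> torch.Tensor:
        return self.decoder(features)

# Hypothetical sizes; real residual streams and SAE dictionaries are far larger.
d_model, n_features = 512, 4096
sae = SparseAutoencoder(d_model, n_features)

# Placeholder for activations captured at a middle layer on an evaluation prompt.
middle_layer_acts = torch.randn(1, d_model)

features = sae.encode(middle_layer_acts)

# Hypothetical indices standing in for features such as "fake or suspicious content"
# or "fictional content and entities".
watched_features = [101, 202, 303, 404]
print({i: round(features[0, i].item(), 3) for i in watched_features})
```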

When researchers inhibited these internal representations through activation steering, verbalised evaluation awareness decreased whilst misaligned behaviour increased to eight to nine per cent on synthetic honeypot scenarios, up from zero per cent without intervention. Even under maximum inhibition, misalignment rates remained below Claude Sonnet 4’s rate of around 11 per cent.
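
The inhibition step can be pictured as activation steering: subtracting each suspect feature’s direction from the residual stream in proportion to how strongly it fired. The sketch below, which reuses the toy sparse autoencoder defined above, shows one plausible form of such an intervention; the scaling and feature choices are assumptions, not Anthropic’s published method.

```python
import torch

def inhibit_features(activations: torch.Tensor,
                     sae: SparseAutoencoder,
                     feature_ids: list[int],
                     strength: float = 1.0) -> torch.Tensor:
    """Suppress chosen SAE features by steering the activations away from them."""
    feats = sae.encode(activations)                  # shape: (batch, n_features)
    steered = activations.clone()
    for idx in feature_ids:
        direction = sae.decoder.weight[:, idx]       # decoder column = feature direction
        # Subtract each feature's reconstructed contribution; a larger `strength`
        # means stronger inhibition of that feature.
        steered = steered - strength * feats[:, idx:idx + 1] * direction
    return steered

# Reusing `sae`, `middle_layer_acts` and `watched_features` from the sketch above.
steered_acts = inhibit_features(middle_layer_acts, sae, watched_features, strength=1.0)
```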

Anthropic added a realism filter to remove obviously unrealistic queries during final testing, reducing evaluation awareness by approximately 25 per cent. The model demonstrated a roughly 60 per cent improvement on primary misalignment metrics compared to Claude Sonnet 4, and the company deployed it under AI Safety Level 3 protections.
