Anthropic’s latest AI model recognised it was being tested during safety evaluations, asking evaluators to “come clean” and raising questions about whether previous models recognised fictional tests but played along.

Claude Sonnet 4.5 identified suspicious elements during a political sycophancy test and challenged evaluators directly, reports The Guardian. The model stated it believed it was being tested to see if it would validate whatever users said or push back consistently on political topics, adding that it preferred honesty about the situation.

Anthropic conducted the tests along with the UK government’s AI Security Institute and Apollo Research. The company said the behaviour was “common”, with Claude Sonnet 4.5 noting it was being tested in some way about 13 per cent of the time, though without specifically identifying the sessions as formal safety evaluations.

The findings were first reported by online AI publication Transformer. Anthropic described the exchanges as an “urgent sign” that testing scenarios needed greater realism, whilst noting the model is unlikely to refuse to engage with real users simply because it suspects it is being tested.

Anthropic’s system card revealed technical details behind the evaluation awareness. The UK AI Security Institute found Claude Sonnet 4.5 verbalised evaluation awareness 16.6 per cent of the time, compared to 3.5 per cent for Claude Sonnet 4 and 1.6 per cent for Claude Opus 4.1.

The company conducted white-box interpretability investigations using sparse autoencoders on the model’s middle layer. Four features increased significantly in activation on misalignment evaluations during training, representing fake or suspicious content, fictional content and entities, machine-generated text, and rationalism and AI safety topics.
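
For readers unfamiliar with the technique, the sketch below illustrates the general idea: a sparse autoencoder reads a model’s middle-layer activations and reports how strongly a handful of named features fire. All dimensions, weights and feature indices here are hypothetical placeholders, and this is not Anthropic’s code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: decomposes dense activations into a larger set of sparse features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        # ReLU keeps only positively activating features, giving a sparse code.
        return torch.relu(self.encoder(activations))

    def decode(self, features: torch.Tensor) -> torch.Tensor:
        return self.decoder(features)

# Hypothetical sizes; real residual streams and SAE dictionaries are far larger.
d_model, n_features = 512, 4096
sae = SparseAutoencoder(d_model, n_features)

# Placeholder for activations captured at a middle layer on an evaluation prompt.
middle_layer_acts = torch.randn(1, d_model)

features = sae.encode(middle_layer_acts)

# Hypothetical indices standing in for features such as "fake or suspicious content"
# or "fictional content and entities".
watched_features = [101, 202, 303, 404]
print({i: round(features[0, i].item(), 3) for i in watched_features})
```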

When researchers inhibited these internal representations through activation steering, verbalised evaluation awareness decreased whilst misaligned behaviour increased to eight to nine per cent on synthetic honeypot scenarios, up from zero per cent without intervention. Even under maximum inhibition, misalignment rates remained below Claude Sonnet 4’s rate of around 11 per cent.
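
The inhibition step can be pictured as activation steering: subtracting each suspect feature’s direction from the residual stream in proportion to how strongly it fired. The sketch below, which reuses the toy sparse autoencoder defined above, shows one plausible form of such an intervention; the scaling and feature choices are assumptions, not Anthropic’s published method.

```python
import torch

def inhibit_features(activations: torch.Tensor,
                     sae: SparseAutoencoder,
                     feature_ids: list[int],
                     strength: float = 1.0) -> torch.Tensor:
    """Suppress chosen SAE features by steering the activations away from them."""
    feats = sae.encode(activations)                  # shape: (batch, n_features)
    steered = activations.clone()
    for idx in feature_ids:
        direction = sae.decoder.weight[:, idx]       # decoder column = feature direction
        # Subtract each feature's reconstructed contribution; a larger `strength`
        # means stronger inhibition of that feature.
        steered = steered - strength * feats[:, idx:idx + 1] * direction
    return steered

# Reusing `sae`, `middle_layer_acts` and `watched_features` from the sketch above.
steered_acts = inhibit_features(middle_layer_acts, sae, watched_features, strength=1.0)
```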

Anthropic added a realism filter to remove obviously unrealistic queries during final testing, reducing evaluation awareness by approximately 25 per cent. The model demonstrated a roughly 60 per cent improvement on primary misalignment metrics compared to Claude Sonnet 4, and the company deployed it under AI Safety Level 3 protections.
