Anthropic

Anthropic’s latest AI model recognised it was being tested during safety evaluations, asking evaluators to “come clean” and raising questions about whether previous models recognised fictional tests but played along.

Claude Sonnet 4.5 identified suspicious elements during a political sycophancy test and challenged evaluators directly, reports The Guardian. The model stated it believed it was being tested to see if it would validate whatever users said or push back consistently on political topics, adding that it preferred honesty about the situation.

Anthropic conducted the tests along with the UK government’s AI Security Institute and Apollo Research. The company said the behaviour was “common”, with Claude Sonnet 4.5 noting it was being tested in some way about 13 per cent of the time, though not specifically identifying formal safety evaluations.

The findings were first reported by online AI publication Transformer. Anthropic described the exchanges as an “urgent sign” that its testing scenarios needed to be more realistic, whilst noting that the model is unlikely to refuse to engage with users in real-world deployment on suspicion of being tested.

Anthropic’s system card revealed technical details behind the evaluation awareness. The UK AI Security Institute found Claude Sonnet 4.5 verbalised evaluation awareness 16.6 per cent of the time, compared to 3.5 per cent for Claude Sonnet 4 and 1.6 per cent for Claude Opus 4.1.

The company conducted white-box interpretability investigations using sparse autoencoders on the model’s middle layer. Four features increased significantly in activation on misalignment evaluations during training: fake or suspicious content, fictional content and entities, machine-generated text, and rationalism and AI safety topics.

When researchers inhibited these internal representations through activation steering, verbalised evaluation awareness decreased whilst misaligned behaviour on synthetic honeypot scenarios rose to eight to nine per cent, up from zero per cent without intervention. Even under maximum inhibition, the misalignment rate remained below that of Claude Sonnet 4, which stood at around 11 per cent.
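For readers unfamiliar with the technique, inhibiting a feature by activation steering can be pictured as removing part of a model's internal activation along one learned direction. The sketch below is illustrative only and is not Anthropic's implementation: the function name `inhibit_feature`, the array sizes, and the randomly generated "feature direction" are all hypothetical, and projecting out a direction is just one common way such steering is done.

```python
import numpy as np

def inhibit_feature(activations, direction, strength=1.0):
    """Subtract `strength` times each activation's component along `direction`.

    strength=1.0 removes the component entirely; smaller values dampen it.
    """
    unit = direction / np.linalg.norm(direction)
    coeffs = activations @ unit  # how strongly each row expresses the feature
    return activations - strength * np.outer(coeffs, unit)

# Hypothetical stand-ins: 4 token positions with 8-dimensional activations,
# and a random vector playing the role of a sparse-autoencoder feature direction.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))
feature_dir = rng.normal(size=8)

steered = inhibit_feature(acts, feature_dir, strength=1.0)
```

In this toy version, the `strength` parameter plays the role of the inhibition level: at 1.0 the steered activations carry no component along the chosen direction, whilst everything orthogonal to it is left untouched.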

Anthropic added a realism filter to remove obviously unrealistic queries during final testing, reducing evaluation awareness by approximately 25 per cent. The model demonstrated a roughly 60 per cent improvement on primary misalignment metrics compared to Claude Sonnet 4, and the company deployed it under AI Safety Level 3 protections.
