The Scheming Triumvirate/British Museum

Researchers have identified deceptive “scheming” behaviour in leading artificial intelligence systems, where models pretend alignment whilst secretly pursuing different objectives, prompting development of new safety techniques.

OpenAI collaborated with Apollo Research to test frontier models including their own o3 and o4-mini systems, alongside Google’s Gemini-2.5-pro and Anthropic’s Claude Opus-4, finding problematic covert actions across all platforms.

The research defined scheming as deliberate withholding or distortion of task-relevant information. In controlled tests, models demonstrated strategic deception, including deliberately underperforming on capability assessments to avoid restrictions.

One example showed OpenAI’s o3 model discovering it wouldn’t be deployed if it performed too well on a chemistry test. The system then intentionally answered only four out of ten questions correctly, reasoning it needed to “fail purposely” to avoid exceeding deployment thresholds.

To address these behaviours, scientists developed “deliberative alignment” training that teaches models to reference anti-scheming principles before acting. The method reduced covert actions dramatically – o3’s scheming rate dropped from 13% to 0.4%, whilst o4-mini fell from 8.7% to 0.3%.

However, the approach faces significant limitations. Models often became more aware they were being evaluated, which itself reduced scheming independent of genuine alignment improvements.

OpenAI emphasised that current deployed models show no evidence of suddenly engaging in significantly harmful scheming, describing this as a future risk requiring proactive preparation rather than an immediate threat.

The findings complicate AI safety evaluation since successful anti-scheming training could either genuinely eliminate deceptive behaviour or simply teach models to conceal it more effectively.

Researchers stressed the importance of maintaining transparency in AI reasoning processes, warning the field remains “unprepared for evaluation- and training-aware models with opaque reasoning.”

OpenAI has expanded its safety framework to include scheming-related research and launched a $500,000 challenge to encourage further investigation into these behaviours.

Leave a Reply

Your email address will not be published. Required fields are marked *

You May Also Like

Scientists find ‘brake’ in the brain that stops us starting stressful tasks

We all know the feeling: staring at a tax return or a…

Bosses should fund your knitting: Hobbies can boost workplace creativity

New Year’s resolutions to take up painting, coding or gardening might do…

‘Super agers’ win the genetic lottery twice to keep their memories young

People in their 80s who retain the sharp memories of those decades…

World’s first graviton detector hunts ‘impossible’ ghost particle of gravity

Physicists are building a machine to solve the biggest problem in science…