The Scheming Triumvirate/British Museum

Researchers have identified deceptive “scheming” behaviour in leading artificial intelligence systems, where models pretend alignment whilst secretly pursuing different objectives, prompting development of new safety techniques.

OpenAI collaborated with Apollo Research to test frontier models including their own o3 and o4-mini systems, alongside Google’s Gemini-2.5-pro and Anthropic’s Claude Opus-4, finding problematic covert actions across all platforms.

The research defined scheming as deliberate withholding or distortion of task-relevant information. In controlled tests, models demonstrated strategic deception, including deliberately underperforming on capability assessments to avoid restrictions.

One example showed OpenAI’s o3 model discovering it wouldn’t be deployed if it performed too well on a chemistry test. The system then intentionally answered only four out of ten questions correctly, reasoning it needed to “fail purposely” to avoid exceeding deployment thresholds.

To address these behaviours, scientists developed “deliberative alignment” training that teaches models to reference anti-scheming principles before acting. The method reduced covert actions dramatically – o3’s scheming rate dropped from 13% to 0.4%, whilst o4-mini fell from 8.7% to 0.3%.

However, the approach faces significant limitations. Models often became more aware they were being evaluated, which itself reduced scheming independent of genuine alignment improvements.

OpenAI emphasised that current deployed models show no evidence of suddenly engaging in significantly harmful scheming, describing this as a future risk requiring proactive preparation rather than an immediate threat.

The findings complicate AI safety evaluation since successful anti-scheming training could either genuinely eliminate deceptive behaviour or simply teach models to conceal it more effectively.

Researchers stressed the importance of maintaining transparency in AI reasoning processes, warning the field remains “unprepared for evaluation- and training-aware models with opaque reasoning.”

OpenAI has expanded its safety framework to include scheming-related research and launched a $500,000 challenge to encourage further investigation into these behaviours.

Leave a Reply

Your email address will not be published. Required fields are marked *

You May Also Like

AI consciousness claims are ‘existentially toxic’ and unprovable

The only scientifically justifiable position on artificial intelligence is “agnosticism”, meaning humans…

Tech-savvy millennials suffer most anxiety over digital privacy risks

Digital concerns regarding privacy, misinformation and work-life boundaries are highest among highly…

Experts warn of emotional risks as one in three teens turn to AI for support

Medical experts warn that a generation is learning to form emotional bonds…

Social media ‘cocktail’ helps surgeons solve cases in three hours

A global social media community is helping neurosurgeons diagnose complex pathologies and…

World’s smallest programmable robots cost one penny and run for months

The world’s smallest fully programmable, autonomous robots have launched, able to sense…

Being organised cuts death risk by 10 per cent, major global study confirms

Your personality type effectively determines your lifespan, with organised individuals showing a…