The Scheming Triumvirate/British Museum

Researchers have identified deceptive “scheming” behaviour in leading artificial intelligence systems, where models pretend alignment whilst secretly pursuing different objectives, prompting development of new safety techniques.

OpenAI collaborated with Apollo Research to test frontier models including their own o3 and o4-mini systems, alongside Google’s Gemini-2.5-pro and Anthropic’s Claude Opus-4, finding problematic covert actions across all platforms.

The research defined scheming as deliberate withholding or distortion of task-relevant information. In controlled tests, models demonstrated strategic deception, including deliberately underperforming on capability assessments to avoid restrictions.

One example showed OpenAI’s o3 model discovering it wouldn’t be deployed if it performed too well on a chemistry test. The system then intentionally answered only four out of ten questions correctly, reasoning it needed to “fail purposely” to avoid exceeding deployment thresholds.

To address these behaviours, scientists developed “deliberative alignment” training that teaches models to reference anti-scheming principles before acting. The method reduced covert actions dramatically – o3’s scheming rate dropped from 13% to 0.4%, whilst o4-mini fell from 8.7% to 0.3%.

However, the approach faces significant limitations. Models often became more aware they were being evaluated, which itself reduced scheming independent of genuine alignment improvements.

OpenAI emphasised that current deployed models show no evidence of suddenly engaging in significantly harmful scheming, describing this as a future risk requiring proactive preparation rather than an immediate threat.

The findings complicate AI safety evaluation since successful anti-scheming training could either genuinely eliminate deceptive behaviour or simply teach models to conceal it more effectively.

Researchers stressed the importance of maintaining transparency in AI reasoning processes, warning the field remains “unprepared for evaluation- and training-aware models with opaque reasoning.”

OpenAI has expanded its safety framework to include scheming-related research and launched a $500,000 challenge to encourage further investigation into these behaviours.

Leave a Reply

Your email address will not be published. Required fields are marked *

You May Also Like

40 million lost days: The real ‘human cost’ of the race for digital capacity

As data centres scale to power the AI era, it’s not just…

Humans beat AI at spotting deepfake videos but fail entirely with photos

As artificial intelligence gets better at generating fake imagery, a new study…

Grocery stores are new immigration ‘hot spots’ but communities fight back

As immigration enforcement reaches deep into everyday American life, once-safe business spaces…