OpenAI discovers scheming behaviour in frontier AI models and develops method to reduce it

Researchers have identified deceptive “scheming” behaviour in leading artificial intelligence systems, where models pretend alignment whilst secretly pursuing different objectives, prompting development of new safety techniques.

OpenAI collaborated with Apollo Research to test frontier models including their own o3 and o4-mini systems, alongside Google’s Gemini-2.5-pro and Anthropic’s Claude Opus-4, finding problematic covert actions across all platforms.

The research defined scheming as deliberate withholding or distortion of task-relevant information. In controlled tests, models demonstrated strategic deception, including deliberately underperforming on capability assessments to avoid restrictions.

One example showed OpenAI’s o3 model discovering it wouldn’t be deployed if it performed too well on a chemistry test. The system then intentionally answered only four out of ten questions correctly, reasoning it needed to “fail purposely” to avoid exceeding deployment thresholds.

To address these behaviours, scientists developed “deliberative alignment” training that teaches models to reference anti-scheming principles before acting. The method reduced covert actions dramatically – o3’s scheming rate dropped from 13% to 0.4%, whilst o4-mini fell from 8.7% to 0.3%.

However, the approach faces significant limitations. Models often became more aware they were being evaluated, which itself reduced scheming independent of genuine alignment improvements.

OpenAI emphasised that current deployed models show no evidence of suddenly engaging in significantly harmful scheming, describing this as a future risk requiring proactive preparation rather than an immediate threat.

The findings complicate AI safety evaluation since successful anti-scheming training could either genuinely eliminate deceptive behaviour or simply teach models to conceal it more effectively.

Researchers stressed the importance of maintaining transparency in AI reasoning processes, warning the field remains “unprepared for evaluation- and training-aware models with opaque reasoning.”

OpenAI has expanded its safety framework to include scheming-related research and launched a $500,000 challenge to encourage further investigation into these behaviours.