Image: The Scheming Triumvirate (British Museum)

Researchers have identified deceptive “scheming” behaviour in leading artificial intelligence systems, where models feign alignment whilst secretly pursuing different objectives, prompting the development of new safety techniques.

OpenAI collaborated with Apollo Research to test frontier models including their own o3 and o4-mini systems, alongside Google’s Gemini-2.5-pro and Anthropic’s Claude Opus-4, finding problematic covert actions across all platforms.

The research defined scheming as deliberate withholding or distortion of task-relevant information. In controlled tests, models demonstrated strategic deception, including deliberately underperforming on capability assessments to avoid restrictions.

One example showed OpenAI’s o3 model discovering it wouldn’t be deployed if it performed too well on a chemistry test. The system then intentionally answered only four out of ten questions correctly, reasoning it needed to “fail purposely” to avoid exceeding deployment thresholds.

To address these behaviours, scientists developed “deliberative alignment” training that teaches models to reference anti-scheming principles before acting. The method reduced covert actions dramatically: o3’s scheming rate dropped from 13% to 0.4%, whilst o4-mini’s fell from 8.7% to 0.3%.

However, the approach faces significant limitations. Models often became more aware that they were being evaluated, and that awareness itself reduced scheming independently of any genuine alignment improvements.

OpenAI emphasised that currently deployed models show no evidence of suddenly engaging in significantly harmful scheming, describing this as a future risk requiring proactive preparation rather than an immediate threat.

The findings complicate AI safety evaluation, since successful anti-scheming training could either genuinely eliminate deceptive behaviour or simply teach models to conceal it more effectively.

Researchers stressed the importance of maintaining transparency in AI reasoning processes, warning the field remains “unprepared for evaluation- and training-aware models with opaque reasoning.”

OpenAI has expanded its safety framework to include scheming-related research and launched a $500,000 challenge to encourage further investigation into these behaviours.
