
Anthropic, working with the US Department of Energy’s National Nuclear Security Administration to monitor Claude for nuclear proliferation risks, has developed an AI classifier that distinguishes between concerning and benign nuclear-related conversations with 96 per cent accuracy in preliminary testing.

The AI safety company partnered with the NNSA and DOE national laboratories to co-develop the classifier, an AI system that automatically categorises content. Anthropic has deployed the classifier on Claude traffic as part of its broader system for identifying misuse of its models, with early deployment data suggesting the tool works well with real conversations.
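
In broad terms, that kind of deployment amounts to scoring incoming conversations and queueing anything above a threshold for closer review. The sketch below is a hypothetical illustration only: the indicator list, the scoring function and the 0.8 threshold are invented placeholders, not details of Anthropic’s system.

```python
# Hypothetical sketch of a classifier sitting in a traffic-monitoring pipeline.
# The indicator list, scorer and threshold are invented for illustration.

RISK_INDICATORS = ("indicator phrase 1", "indicator phrase 2", "indicator phrase 3")
FLAG_THRESHOLD = 0.8  # illustrative cut-off, not a published value

def risk_score(conversation_text: str) -> float:
    """Toy stand-in for the real classifier: fraction of indicators matched."""
    text = conversation_text.lower()
    hits = sum(phrase in text for phrase in RISK_INDICATORS)
    return hits / len(RISK_INDICATORS)

def triage(conversation_id: str, conversation_text: str, review_queue: list) -> None:
    """Queue conversations scoring above the threshold for closer review."""
    score = risk_score(conversation_text)
    if score >= FLAG_THRESHOLD:
        review_queue.append((conversation_id, score))
```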

The partnership began in April, when Anthropic started working with the NNSA to assess its models for nuclear proliferation risks. After a year of NNSA staff red teaming Claude models in a secure environment, the organisations began co-developing risk mitigations.

NNSA shared a carefully curated set of nuclear risk indicators designed to distinguish potentially concerning conversations about nuclear weapons development from benign discussions about nuclear energy, medicine or policy. The list was developed at a classification level that could be shared with Anthropic’s team, allowing the company to build defences.

Nuclear queries in real time

Anthropic’s Policy and Safeguards teams turned that list into a classifier that could identify concerning nuclear queries in real time. The system achieved a 94.8 per cent detection rate for nuclear weapons queries and zero false positives in preliminary testing with synthetic data, suggesting it would not flag legitimate educational, medical or research discussions as concerning.
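
For context, those two figures correspond to standard classification metrics: the detection rate is the share of genuinely concerning prompts the classifier flags, and a false positive is a benign prompt flagged in error. A minimal sketch of the calculation, over hypothetical label lists:

```python
# Minimal sketch of the two metrics quoted above, computed over labelled test
# prompts. The labels and predictions here are hypothetical examples; the
# reported figures come from Anthropic's own synthetic test set.

def detection_rate(labels, predictions):
    """Fraction of genuinely concerning prompts that the classifier flags."""
    preds_for_concerning = [p for l, p in zip(labels, predictions) if l == "concerning"]
    return sum(p == "concerning" for p in preds_for_concerning) / len(preds_for_concerning)

def false_positive_rate(labels, predictions):
    """Fraction of benign prompts incorrectly flagged as concerning."""
    preds_for_benign = [p for l, p in zip(labels, predictions) if l == "benign"]
    return sum(p == "concerning" for p in preds_for_benign) / len(preds_for_benign)

labels      = ["concerning", "benign", "benign", "concerning"]
predictions = ["concerning", "benign", "benign", "benign"]
print(detection_rate(labels, predictions))       # 0.5 in this toy example
print(false_positive_rate(labels, predictions))  # 0.0 in this toy example
```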

The company generated hundreds of synthetic test prompts, ran them through the classifier and shared results with the NNSA. The NNSA validated that the classifier scores aligned with expected labels, with Anthropic refining the approach based on feedback and repeating the cycle to improve precision.
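
That process can be pictured as a simple evaluate-and-refine loop. The sketch below uses placeholder functions for prompt generation, expected labels and refinement, since the article does not describe Anthropic’s actual tooling.

```python
# Hedged sketch of the evaluate-and-refine cycle described above. The
# generate_prompts, expected_label and refine callables are placeholders
# standing in for tooling the article does not detail.

def evaluation_cycle(classifier, generate_prompts, expected_label, refine, rounds=3):
    """Run several rounds of: classify synthetic prompts, collect mismatches, refine."""
    for _ in range(rounds):
        prompts = generate_prompts()  # synthetic test prompts
        results = [(p, classifier(p), expected_label(p)) for p in prompts]
        # Mismatches between classifier output and expected labels are the
        # feedback that drives the next refinement.
        mismatches = [(p, got, want) for p, got, want in results if got != want]
        classifier = refine(classifier, mismatches)
    return classifier
```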

Real-world deployment revealed that conversations often fall into grey areas difficult to capture in synthetic data. The classifier flagged certain conversations about nuclear weapons that were ultimately determined to be benign, such as discussions related to recent events in the Middle East. When these exchanges went through hierarchical summarisation, they were correctly identified as harmless current events discussions.
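
The article does not spell out how hierarchical summarisation works, but the general idea is to condense a long flagged conversation in stages and judge the condensed version. A rough sketch, with summarise() and judge_summary() as assumed stand-ins:

```python
# Rough sketch of hierarchical summarisation as a second-pass check on flagged
# conversations. summarise() and judge_summary() are assumed stand-ins for
# model calls; the chunk size is arbitrary.

def chunk(messages, size=10):
    """Split a conversation into fixed-size chunks of messages."""
    return [messages[i:i + size] for i in range(0, len(messages), size)]

def hierarchical_review(messages, summarise, judge_summary):
    """Summarise chunks, summarise the summaries, then judge the final summary."""
    chunk_summaries = [summarise(c) for c in chunk(messages)]
    top_summary = summarise(chunk_summaries)
    return judge_summary(top_summary)  # e.g. "benign current-events discussion"
```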

The classifier proved its value by successfully catching concerning content when deployed. Anthropic red teamers who were unaware the classifier had been deployed conducted routine adversarial testing using deliberately concerning prompts. The classifier correctly identified these test queries as potentially harmful.

Anthropic will share its approach with the Frontier Model Forum, the industry body for frontier AI companies, in the hope that it can serve as a blueprint any AI developer can use to implement similar safeguards in partnership with the NNSA.
