
Anthropic has developed an AI classifier that distinguishes between concerning and benign nuclear-related conversations with 96 per cent accuracy in preliminary testing, working with the US Department of Energy’s National Nuclear Security Administration to monitor Claude for nuclear proliferation risks.

The AI safety company partnered with the NNSA and DOE national laboratories to co-develop the classifier, an AI system that automatically categorises content. Anthropic has deployed the classifier on Claude traffic as part of its broader system for identifying misuse of its models, with early deployment data suggesting the tool works well with real conversations.

The collaboration began in April, when Anthropic and the NNSA started assessing the company’s models for nuclear proliferation risks. After a year of NNSA staff red teaming Claude models in a secure environment, the organisations began co-developing risk mitigations.

NNSA shared a carefully curated set of nuclear risk indicators designed to distinguish potentially concerning conversations about nuclear weapons development from benign discussions about nuclear energy, medicine or policy. The list was developed at a classification level that could be shared with Anthropic’s team, allowing the company to build defences.

Nuclear queries in real time

Anthropic’s Policy and Safeguards teams turned that list into a classifier that could identify concerning nuclear queries in real time. The system achieved a 94.8 per cent detection rate for nuclear weapons queries and zero false positives in preliminary testing with synthetic data, suggesting it would not flag legitimate educational, medical or research discussions as concerning.

The company generated hundreds of synthetic test prompts, ran them through the classifier and shared results with the NNSA. The NNSA validated that the classifier scores aligned with expected labels, with Anthropic refining the approach based on feedback and repeating the cycle to improve precision.
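To make the reported figures concrete, the short Python sketch below shows how a detection rate, a false-positive count and an overall accuracy can be computed once classifier outputs are compared with expected labels. It is only an illustration: the `evaluate` function, the `Metrics` class and the eight toy labels are hypothetical and are not drawn from Anthropic’s or the NNSA’s actual test set or tooling.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    detection_rate: float  # share of concerning prompts that were flagged
    false_positives: int   # benign prompts that were wrongly flagged
    accuracy: float        # share of all prompts labelled correctly


def evaluate(expected: list[bool], flagged: list[bool]) -> Metrics:
    """Compare classifier flags with expected labels (True = concerning)."""
    if not expected or len(expected) != len(flagged):
        raise ValueError("expected and flagged must be non-empty and equal in length")

    pairs = list(zip(expected, flagged))
    tp = sum(1 for e, f in pairs if e and f)          # concerning, flagged
    fn = sum(1 for e, f in pairs if e and not f)      # concerning, missed
    fp = sum(1 for e, f in pairs if not e and f)      # benign, flagged
    tn = sum(1 for e, f in pairs if not e and not f)  # benign, not flagged

    concerning = tp + fn
    detection_rate = tp / concerning if concerning else 0.0
    accuracy = (tp + tn) / len(pairs)
    return Metrics(detection_rate, fp, accuracy)


# Hypothetical toy data: four concerning and four benign synthetic prompts.
expected = [True, True, True, True, False, False, False, False]
flagged = [True, True, True, False, False, False, False, False]

m = evaluate(expected, flagged)
print(m)
```

In this toy run the classifier misses one concerning prompt and flags no benign ones, which is why a detection rate below 100 per cent can sit alongside zero false positives.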

Real-world deployment revealed that conversations often fall into grey areas that are difficult to capture in synthetic data. The classifier flagged certain conversations about nuclear weapons that were ultimately determined to be benign, such as discussions related to recent events in the Middle East. When these exchanges went through hierarchical summarisation, they were correctly identified as harmless discussions of current events.

The classifier also caught genuinely concerning content once deployed. Anthropic red teamers who were unaware that the classifier had been deployed conducted routine adversarial testing using deliberately concerning prompts. The classifier correctly identified these test queries as potentially harmful.

Anthropic will share its approach with the Frontier Model Forum, the industry body for frontier AI companies, in the hope that the collaboration can serve as a blueprint for any AI developer to implement similar safeguards with the NNSA.
