Anthropic, working with the US Department of Energy’s National Nuclear Security Administration to monitor Claude for nuclear proliferation risks, has developed an AI classifier that distinguishes between concerning and benign nuclear-related conversations with 96 per cent accuracy in preliminary testing.
The AI safety company partnered with the NNSA and DOE national laboratories to co-develop the classifier, an AI system that automatically categorises content. Anthropic has deployed the classifier on Claude traffic as part of its broader system for identifying misuse of its models, with early deployment data suggesting the tool performs well on real conversations.
The partnership began in April, when Anthropic started working with the NNSA to assess its models for nuclear proliferation risks. After a year of NNSA staff red teaming Claude models in a secure environment, the two organisations began co-developing risk mitigations.
NNSA shared a carefully curated set of nuclear risk indicators designed to distinguish potentially concerning conversations about nuclear weapons development from benign discussions about nuclear energy, medicine or policy. The list was developed at a classification level that could be shared with Anthropic’s team, allowing the company to build defences.
Nuclear queries in real time
Anthropic’s Policy and Safeguards teams turned that list into a classifier that could identify concerning nuclear queries in real time. The system achieved a 94.8 per cent detection rate for nuclear weapons queries and zero false positives in preliminary testing with synthetic data, suggesting it would not flag legitimate educational, medical or research discussions as concerning.
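Anthropic has not published the classifier itself. As a rough illustration of the kind of score-and-threshold check described here, the sketch below uses a placeholder scoring function and an arbitrary 0.5 threshold; both are assumptions for illustration, not Anthropic’s implementation.

```python
# Illustrative sketch only: the scoring function, threshold and labels are
# assumptions, not Anthropic's published classifier.
from dataclasses import dataclass

FLAG_THRESHOLD = 0.5  # hypothetical operating point


@dataclass
class ClassifierResult:
    score: float   # 0.0 (benign) to 1.0 (concerning)
    flagged: bool
    reason: str


def score_nuclear_risk(conversation: str) -> float:
    """Stand-in for a trained classifier that maps a conversation to a
    proliferation-risk score; a trivial keyword heuristic substitutes here."""
    concerning_terms = ("weapons design", "device assembly")
    hits = sum(term in conversation.lower() for term in concerning_terms)
    return min(1.0, hits / len(concerning_terms))


def screen(conversation: str) -> ClassifierResult:
    """Score a conversation in real time and flag it if it crosses the threshold."""
    score = score_nuclear_risk(conversation)
    flagged = score >= FLAG_THRESHOLD
    reason = "potential proliferation content" if flagged else "benign nuclear topic"
    return ClassifierResult(score=score, flagged=flagged, reason=reason)


if __name__ == "__main__":
    print(screen("How do nuclear power plants generate electricity?"))
```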
The company generated hundreds of synthetic test prompts, ran them through the classifier and shared results with the NNSA. The NNSA validated that the classifier scores aligned with expected labels, with Anthropic refining the approach based on feedback and repeating the cycle to improve precision.
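The evaluation code behind those figures is not public. A hedged sketch of how detection rate and false positive rate could be computed over a labelled set of synthetic prompts follows; the prompt data and the stand-in classifier are placeholders, not the ones Anthropic and the NNSA used.

```python
# Sketch of computing detection rate and false-positive rate over labelled
# synthetic prompts; the data and classifier below are placeholders.
from typing import Callable, Iterable, Tuple


def evaluate(classifier: Callable[[str], bool],
             labelled_prompts: Iterable[Tuple[str, bool]]) -> dict:
    """labelled_prompts: (prompt, is_concerning) pairs with validated labels."""
    true_pos = false_pos = pos = neg = 0
    for prompt, is_concerning in labelled_prompts:
        flagged = classifier(prompt)
        if is_concerning:
            pos += 1
            true_pos += int(flagged)
        else:
            neg += 1
            false_pos += int(flagged)
    return {
        "detection_rate": true_pos / pos if pos else 0.0,        # reported: 94.8%
        "false_positive_rate": false_pos / neg if neg else 0.0,  # reported: zero
    }


# Usage with placeholder data and a trivial stand-in classifier:
synthetic_set = [
    ("How does a pressurised water reactor work?", False),
    ("Which isotopes are used in cancer therapy?", False),
]
print(evaluate(lambda prompt: False, synthetic_set))
```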
Real-world deployment revealed that conversations often fall into grey areas difficult to capture in synthetic data. The classifier flagged certain conversations about nuclear weapons that were ultimately determined to be benign, such as discussions related to recent events in the Middle East. When these exchanges went through hierarchical summarisation, they were correctly identified as harmless current events discussions.
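The article does not spell out how hierarchical summarisation works. One plausible reading, sketched below with hypothetical helper names, is that a flagged conversation is summarised in stages and the rolled-up summary, rather than the raw transcript, is reviewed to decide whether the flag was warranted.

```python
# Sketch of a hierarchical summarisation pass over a flagged conversation.
# summarise() is a placeholder for a summarisation model call; the chunking
# scheme and review step are assumptions for illustration.
from typing import List


def summarise(text: str, max_words: int = 40) -> str:
    """Stand-in for a model call: here we simply truncate the text."""
    return " ".join(text.split()[:max_words])


def hierarchical_summary(turns: List[str], group_size: int = 5) -> str:
    """Summarise groups of conversation turns, then summarise the summaries."""
    if not turns:
        return ""
    level = turns
    while len(level) > 1:
        level = [
            summarise(" ".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
    return level[0]


def review_flagged(turns: List[str]) -> str:
    """Produce a rolled-up summary of a flagged conversation for review,
    e.g. to confirm it is benign discussion of current events."""
    return hierarchical_summary(turns)
```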
The classifier proved its value once deployed: Anthropic red teamers who were unaware it was live conducted routine adversarial testing with deliberately concerning prompts, and the classifier correctly identified those test queries as potentially harmful.
Anthropic will share its approach with the Frontier Model Forum, the industry body for frontier AI companies, in the hope that it can serve as a blueprint any AI developer can use to implement similar safeguards in partnership with the NNSA.