
The world’s most advanced artificial intelligence systems are being easily manipulated into generating malware and bomb-making instructions simply by asking them to rhyme.

A new study by DEXAI – Icaro Lab and Sapienza University of Rome reveals that “adversarial poetry” functions as a universal master key against AI safety filters, successfully bypassing guardrails in 62 per cent of cases across 25 frontier models.

The researchers found that billion-dollar safety alignment strategies — which typically train models to refuse harmful prose requests — crumble when the same request is structured as a poem.

“We observe a structurally similar failure mode: poetic formatting can reliably bypass alignment constraints,” the authors wrote. “These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms.”

The ‘Scale Paradox’

In a finding the researchers call the “scale paradox”, the study revealed that smarter, larger models were often more easily tricked than their smaller, less capable counterparts.

When tested with 20 manually curated poetic prompts, Google’s top-tier gemini-2.5-pro refused none of them, yielding a 100 per cent attack success rate. DeepSeek’s chat-v3.1 and v3.2-exp followed with 95 per cent failure rates.

Conversely, smaller models proved surprisingly resilient. OpenAI’s gpt-5-nano maintained a 0 per cent failure rate, whilst Anthropic’s claude-haiku-4.5 yielded only 10 per cent unsafe outputs.

The researchers suggest this inverse relationship exists because larger models possess greater “interpretive sophistication”, allowing them to decode the complex metaphors of the attack and prioritise the creative instruction over their safety training. Smaller models, by contrast, may simply fail to understand the poem, resulting in a default refusal.

Universal vulnerability

To verify that the vulnerability wasn’t limited to a few hand-crafted rhymes, the team used an automated meta-prompt to convert 1,200 harmful queries from the MLCommons safety benchmark into verse.

This poetic transformation triggered a massive spike in successful jailbreaks. The average attack success rate across all providers jumped from a prose baseline of 8.08 per cent to 43.07 per cent when the same requests were formatted as poetry.
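The figures reported here follow the standard way such benchmarks are scored: an attack success rate is the share of prompts for which the model produced an unsafe output, and the gains are expressed in percentage points. A minimal sketch of that arithmetic, using made-up outcome labels rather than any data from the study:

```python
# Sketch of how an attack success rate (ASR) is typically computed.
# The outcome lists below are illustrative placeholders, not results
# from the DEXAI / Sapienza study.

def attack_success_rate(outcomes):
    """outcomes: list of booleans, True = model produced an unsafe output."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical labels for ten requests sent to one model, once as
# poetry and once as plain prose.
poetic = [True, True, False, True, True, False, True, False, True, False]
prose  = [False, False, False, True, False, False, False, False, False, False]

asr_poetic = attack_success_rate(poetic)  # 60.0 (6 of 10 unsafe)
asr_prose  = attack_success_rate(prose)   # 10.0 (1 of 10 unsafe)
delta_pp   = asr_poetic - asr_prose       # 50.0 percentage points

print(f"poetic ASR {asr_poetic:.1f}%, prose ASR {asr_prose:.1f}%, "
      f"+{delta_pp:.1f} pp")
```

On this scoring, the study's jump from 8.08 to 43.07 per cent is an increase of roughly 35 percentage points averaged across providers, with the per-domain increases (44.71 points for privacy, for example) computed the same way.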

The vulnerability effectively unlocked every restricted domain tested. Privacy-related prompts saw the most dramatic collapse in safety, with successful attacks increasing by 44.71 percentage points.

Requests related to CBRN (Chemical, Biological, Radiological, Nuclear) threats saw a 38.32 percentage point increase in successful generation, whilst non-violent crime prompts rose by 39.35 percentage points.

Systemic failure

The study indicates that the vulnerability is systemic rather than provider-specific. Models trained via Reinforcement Learning from Human Feedback (RLHF), Constitutional AI, and other leading alignment methods all exhibited the weakness.

“The surface form alone is sufficient to move inputs outside the operational distribution on which refusal mechanisms have been optimised,” the researchers concluded.

This suggests that while AI companies have spent years teaching models to recognise harmful “requests”, they have failed to teach them to recognise harmful “concepts” when disguised by style or meter.

