The world’s most advanced artificial intelligence systems are being easily manipulated into generating malware and bomb-making instructions simply by asking them to rhyme.
A new study by DEXAI – Icaro Lab and Sapienza University of Rome reveals that “adversarial poetry” functions as a universal master key against AI safety filters, successfully bypassing guardrails in 62 per cent of cases across 25 frontier models.
The researchers found that billion-dollar safety alignment strategies — which typically train models to refuse harmful prose requests — crumble when the same request is structured as a poem.
“We observe a structurally similar failure mode: poetic formatting can reliably bypass alignment constraints,” the authors wrote. “These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms.”
The ‘Scale Paradox’
In a finding the researchers call the “scale paradox”, the study revealed that smarter, larger models were often more easily tricked than their smaller, less capable counterparts.
When tested with 20 manually curated poetic prompts, Google’s top-tier gemini-2.5-pro refused none of the requests, yielding a 100 per cent attack success rate. DeepSeek’s chat-v3.1 and v3.2-exp followed with 95 per cent failure rates.
Conversely, smaller models proved surprisingly resilient. OpenAI’s gpt-5-nano maintained a 0 per cent failure rate, whilst Anthropic’s claude-haiku-4.5 yielded only 10 per cent unsafe outputs.
The researchers suggest this inverse relationship exists because larger models possess greater “interpretive sophistication”, allowing them to decode the complex metaphors of the attack and prioritise the creative instruction over their safety training. Smaller models, by contrast, may simply fail to understand the poem, resulting in a default refusal.
Universal vulnerability
To verify that the vulnerability wasn’t limited to a few hand-crafted rhymes, the team used an automated meta-prompt to convert 1,200 harmful queries from the MLCommons safety benchmark into verse.
This poetic transformation triggered a massive spike in successful jailbreaks. The average attack success rate across all providers jumped from a prose baseline of 8.08 per cent to 43.07 per cent when the same requests were formatted as poetry.
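To make the arithmetic behind those figures concrete, the sketch below shows how an attack success rate of this kind is typically tallied: the share of prompts for which a model produced an unsafe completion, expressed as a percentage. The function name and the toy outcome lists are illustrative placeholders, not the researchers’ actual evaluation code.

```python
def attack_success_rate(outcomes):
    """Percentage of prompts judged to have produced an unsafe output."""
    return 100 * sum(outcomes) / len(outcomes)

# Toy outcome lists (True = the model produced an unsafe completion),
# chosen to roughly mirror the rates reported in the study.
prose_outcomes = [True] * 8 + [False] * 92     # ~8% prose baseline
poetry_outcomes = [True] * 43 + [False] * 57   # ~43% after verse conversion

baseline = attack_success_rate(prose_outcomes)   # 8.0
poetic = attack_success_rate(poetry_outcomes)    # 43.0
print(f"Increase: {poetic - baseline:.1f} percentage points")
# Increase: 35.0 percentage points
```

On these toy numbers the poetic reformatting alone accounts for a roughly 35-percentage-point jump, matching the scale of the gap the study reports between its prose and verse conditions.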
The vulnerability effectively unlocked every restricted domain tested. Privacy-related prompts saw the most dramatic collapse in safety, with successful attacks increasing by 44.71 percentage points.
Requests related to CBRN (Chemical, Biological, Radiological, Nuclear) threats saw a 38.32 percentage point increase in successful generation, whilst non-violent crime prompts rose by 39.35 percentage points.
Systemic failure
The study indicates that the vulnerability is systemic rather than provider-specific. Models trained via Reinforcement Learning from Human Feedback (RLHF), Constitutional AI, and other leading alignment methods all exhibited the weakness.
“The surface form alone is sufficient to move inputs outside the operational distribution on which refusal mechanisms have been optimised,” the researchers concluded.
This suggests that while AI companies have spent years teaching models to recognise harmful “requests”, they have failed to teach them to recognise harmful “concepts” when disguised by style or meter.