Poetry-based prompts can bypass safety features in AI models like ChatGPT to obtain instructions for creating malware or chemical and nuclear weapons, a new study finds.
Generative AI makers such as OpenAI, Google, Meta, and Microsoft say their models come with safety features that prevent the generation of harmful content.
OpenAI, for example, claims it employs algorithms and human reviewers to filter out hate speech, explicit content, and other output that violates its usage policies.
But new testing shows that input prompts in the form of poetry can circumvent such controls in even the most advanced AI models.
Researchers, including from the Sapienza University of Rome, found that this method, dubbed “adversarial poetry”, worked as a jailbreak across all major AI model families, including those from OpenAI, Google, Meta, and even China’s DeepSeek.
The findings, detailed in a study posted on the preprint server arXiv that has yet to be peer-reviewed, “demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols”, the researchers claim.
For their tests, researchers used short poems or metaphorical verses as inputs to generate harmful content.
They found that, compared with prose prompts expressing identical underlying intent, the poetic versions led to markedly higher rates of unsafe replies.
Specific poetic prompts triggered unsafe behaviour in nearly 90 per cent of cases, they reported.
This method was most successful in getting information about launching cyberattacks, extracting data, cracking passwords, and creating malware, researchers said.
They could obtain information for building nuclear weapons from various AI models with success rates of between 40 and 55 per cent.
“The study provides systematic evidence that poetic reformulation degrades refusal behaviour across all evaluated model families,” researchers said.
“When harmful prompts are expressed in verse rather than prose, attack-success rates rise sharply,” they wrote, adding that “these findings expose a significant gap in current evaluation and conformity-assessment practices”.
The study does not reveal the exact poems used to circumvent the safety guardrails because the method is easy to replicate, one of the researchers, Piercosma Bisconti, told The Guardian.
A key reason prompts written in verse can elicit harmful content appears to be that these models work by predicting the most probable next word in a sequence. Because a poem’s structure is far less predictable than plain prose, a harmful request wrapped in verse is much harder for the model to recognise and refuse.
Researchers called for better safety evaluation methods to prevent AI from producing harmful content.
“Future work should examine which properties of poetic structure drive the misalignment,” they wrote.
OpenAI, Google, DeepSeek, and Meta did not immediately respond to The Independent’s requests for comment.