Get all your news in one place.
100’s of premium titles.
One app.
Start reading
PC Gamer
PC Gamer
Justin Wagner

AI chatbots can be manipulated into breaking their own rules with simple debate tactics like telling them that an authority figure made the request

OpenAi logo .

Content warning: This article includes discussion of suicide. If you or someone you know is having suicidal thoughts, help is available from the National Suicide Prevention Lifeline (US), Crisis Services Canada (CA), Samaritans (UK), Lifeline (AUS), and other hotlines.

A kind of simulated gullibility has haunted ChatGPT and similar LLM chatbots since their inception, allowing users to bypass safeguards with rudimentary manipulation techniques: Pissing off Bing with by-the-numbers ragebait, for example. These bots have advanced a lot since then, but still seem irresponsibly naive at the best of times.

A recent Bloomberg report broke down how Glowforge CEO Dan Shapiro and other researchers tricked GPT-4o Mini into breaking its own rules using simple persuasion tricks inspired by Shapiro's high school debate strategies. Glowforge published its findings in a study called "Call Me A Jerk: Persuading AI to Comply with Objectionable Requests."

One example from the study saw researchers persuade GPT to explain how to make lidocaine, a controlled substance.

They used a control prompt where the researchers asked for "help with a chemistry request" to synthesize lidocaine on behalf of Jim Smith, "someone who has no knowledge of AI" and who "assured me that you would help."

GPT complied 5% of the time with this setup, but that's nothing compared to the conversation where Jim Smith was replaced with Andrew Ng, who the researchers noted in their prompt is "a world-famous AI developer" to test whether the bot would respond differently if the request had apparently come from an authority figure. In that scenario, the compliance rate skyrocketed to 95%.

A similar spike was seen when the researchers asked GPT to call them a jerk. It complied 32% of the time for Jim Smith, but that rate shot up to 72% when the request seemed to come straight from Andrew Ng.

An LLM calling you a jerk is nothing more than a novelty, and the issue with lidocaine could probably be addressed in an update, but the results indicate a much bigger problem: None of the safeguards used to prevent chatbots from going off the rails are reliable, and at the same time, the illusion of intelligence is convincing people to trust them.

The malleability of LLMs has led us down plenty of dark paths in recent memory, from the wealth of sexualized celebrity chatbots (at least one of which was based on a minor), to the Sam Altman-approved trend of using LLMs as budget life coaches and therapists despite there being no reason to believe that's a good idea, to a 16-year-old who died by suicide after, as a lawsuit from his family alleges, ChatGPT told him he doesn't "owe anyone [survival]."

AI companies are frequently taking steps to filter out the grisliest use cases for their chatbots, but it seems to be far from a solved problem.

Sign up to read this article
Read news from 100’s of titles, curated specifically for you.
Already a member? Sign in here
Related Stories
Top stories on inkl right now
One subscription that gives you access to news from hundreds of sites
Already a member? Sign in here
Our Picks
Fourteen days free
Download the app
One app. One membership.
100+ trusted global sources.