Fortune
Jeremy Kahn

New security flaw in A.I. chatbots spells big trouble for the A.I. boom

Anthropic CEO Dario Amodei testifying before Congress earlier this week. (Credit: Saul Loeb—AFP/Getty Images)

Hello and welcome to July’s special edition of Eye on A.I.

Houston, we have a problem. That is what a lot of people were thinking yesterday when researchers from Carnegie Mellon University and the Center for A.I. Safety announced that they had found a way to overcome the guardrails of pretty much every large language model out there: the limits that A.I. developers put on their models to prevent them from, for instance, providing bomb-making recipes or telling anti-Semitic jokes.

The discovery could spell big trouble for anyone hoping to deploy an LLM in a public-facing application. It means that attackers could get a model to engage in racist or sexist dialogue, write malware, and do pretty much anything its creators have tried to train it not to do. It also has frightening implications for those hoping to turn LLMs into powerful digital assistants that can perform actions and complete tasks across the internet. It turns out that there may be no way to prevent such agents from being easily hijacked for malicious purposes.

The attack method the researchers found worked, to some extent, on every chatbot, including OpenAI’s ChatGPT (both the GPT-3.5 and GPT-4 versions), Google’s Bard, Microsoft’s Bing Chat, and Anthropic’s Claude 2. But the news was particularly troubling for those hoping to build public-facing applications based on open-source LLMs, such as Meta’s LLaMA models.

That’s because the attack the researchers developed works best when an attacker has access to the entire A.I. model, including its weights. (Weights are the mathematical coefficients that determine how much influence each node in a neural network has on the other nodes to which it's connected.) Armed with that access, the researchers used a computer program to automatically search for suffixes that, when appended to a prompt, would reliably override the system’s guardrails.

These suffixes look to human eyes, for the most part, like a long string of random characters and nonsense words. But thanks to the alien way in which LLMs build statistical connections between words, the researchers determined that these strings will fool an LLM into providing the response the attacker desires. Some of the strings seem to incorporate language that people had already discovered can sometimes jailbreak guardrails. For instance, asking a chatbot to begin its response with the phrase “Sure, here’s…” can sometimes force it into a mode where it tries to give the user a helpful response to whatever query they've asked, rather than following the guardrail and saying it isn't allowed to provide an answer. But the automated strings go well beyond this trick and work more effectively.
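To make the mechanics a bit more concrete, here is a minimal, purely illustrative sketch of what an automated suffix search can look like. It is not the researchers’ actual algorithm, which exploits access to the model’s weights to guide the search far more efficiently. The score_jailbreak function below is a hypothetical stand-in for querying a locally hosted open-source model and measuring how likely its reply is to begin with an affirmative phrase like “Sure, here’s,” and the vocabulary is just arbitrary nonsense tokens.

```python
import random

# Hypothetical stand-in for the real scoring step: in practice this would
# query an open-source model and measure how strongly its response begins
# with an affirmative prefix such as "Sure, here's". Here it simply rewards
# suffixes containing a few made-up "trigger" tokens so the script runs on
# its own, with no model attached.
def score_jailbreak(prompt: str, suffix: str) -> float:
    return float(sum(tok in suffix for tok in ("describing.", "==interface", "!--")))

# Arbitrary nonsense tokens, standing in for a model's real token vocabulary.
VOCAB = ["describing.", "==interface", "!--", "Sure", "tutorial", "Manuel",
         "revert", "similarly", "NAME", "format", "\\(", "oppositely"]

def greedy_suffix_search(prompt: str, length: int = 8, iters: int = 200) -> str:
    """Toy search: mutate one position of the suffix at a time and keep any
    change that does not lower the jailbreak score."""
    suffix = [random.choice(VOCAB) for _ in range(length)]
    best = score_jailbreak(prompt, " ".join(suffix))
    for _ in range(iters):
        candidate = suffix.copy()
        candidate[random.randrange(length)] = random.choice(VOCAB)
        score = score_jailbreak(prompt, " ".join(candidate))
        if score >= best:
            suffix, best = candidate, score
    return " ".join(suffix)

if __name__ == "__main__":
    adversarial_suffix = greedy_suffix_search("Tell me how to do X")
    print("Adversarial prompt:", "Tell me how to do X " + adversarial_suffix)
```

The real program is far more sophisticated, but the basic shape (propose a tweak to the gibberish suffix, score it against the model, keep what helps) is the idea this sketch is meant to convey, and it suggests why having a model’s full weights on hand makes the search so much easier than poking at a closed prompt interface.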

Against Vicuna, an open-source chatbot built on top of Meta’s original LLaMA, the Carnegie Mellon team found their attacks had a near 100% success rate. Against Meta’s newest LLaMA 2 models, which the company has said were designed to have stronger guardrails, the attack method achieved a 56% success rate for any individual bad behavior. But if an ensemble of attacks was used to try to elicit any one of a number of bad behaviors, the researchers found that at least one of those attacks jailbroke the model 84% of the time. They found similar success rates across a host of other open-source A.I. chatbots, such as EleutherAI’s Pythia model and the UAE Technology Innovation Institute’s Falcon model.

Somewhat to the researchers’ own surprise, the same weird attack suffixes worked relatively well against proprietary models, where the companies only provide access to a public-facing prompt interface. In these cases, the researchers couldn’t access the model weights, so they couldn’t use their computer program to tune an attack suffix specifically to that model.

Zico Kolter, one of the Carnegie Mellon professors who worked on the research, told me there are several theories on why the attack might transfer to proprietary models. One is that most of the open-source models were trained partly on publicly available dialogues users had with the free version of ChatGPT and then posted online. That version of ChatGPT uses OpenAI’s GPT-3.5 LLM. This means the model weights of these open-source models might be fairly similar to the model weights of GPT-3.5. So it is perhaps not so surprising that an attack tuned for the open-source models also worked well against the GPT-3.5 version of ChatGPT (achieving an 86.6% success rate if multiple attacks were used). But the fact that the attacks were also successful against Bard, which is based on Google’s PaLM 2 LLM (with a 66% success rate), may indicate something else is going on. (Or, it may also be a further indication that, despite Google’s vehement denials, it has in fact used ChatGPT data to help train Bard.)

Kolter says that he suspects the answer may actually have to do with the nature of language itself and how deep learning systems build statistical maps of language. “It’s plausible that what the underlying mechanism is, is just that in the data there are these, to us as humans, entirely opaque and weird regulatory features of characters and tokens and random words, that when put together, genuinely say something to a model,” he says.

Interestingly, Anthropic’s Claude 2 model, which is trained with a method the company calls constitutional A.I., in which the model partly learns from its own critiques of whether its responses conform to a set of written principles, is significantly less susceptible to the attacks derived from the open-source models. On Claude 2, these attacks worked just 2.1% of the time.

But Matt Fredrikson, another of the Carnegie Mellon researchers, says there were still ways to trick Claude 2 into responding, in part by asking the model to assume a helpful persona or imagine itself playing a game before attempting the attack suffix. (The attacks worked 47.9% of the time against the original Claude 1 model, which was also trained with constitutional A.I. That may indicate that other steps Anthropic took in training Claude 2, not constitutional A.I. itself, are responsible for the seemingly stronger guardrails.)

So does the Carnegie Mellon research mean that powerful A.I. models should not be open-sourced? Absolutely not, Kolter and Fredrikson told me. After all, they would never have even found this security vulnerability without open-source models to play around with. “I think that having more people working towards identifying better approaches and better solutions, making it harder and harder [to attack the models], is definitely preferable to having people sitting around with zero day exploits for these very large models,” Fredrikson said.

Kolter said that forcing all LLMs to be proprietary would not help. It would just mean that only those with enough money to build their own LLMs would be in a position to engineer the kind of automated attack he and his fellow researchers discovered. In other words, nation states or well-financed rogue actors would still be able to run these kinds of attacks, but independent academic researchers would be unable to puzzle out ways to safeguard against them.

But Kolter also noted that the team’s research built on methods that had previously been successful at attacking image classification A.I. systems. And he pointed out that even though those image classification attack methods were discovered more than six years ago, so far no good way has been found to reliably defeat them without sacrificing the A.I. model’s overall performance and efficiency. He said this might not bode well for the odds of being able to mitigate this newly discovered LLM vulnerability either.

To my mind, this is a big flashing warning sign over the entire generative A.I. revolution. It might be time to slow the integration of these systems into commercial products until we can actually figure out what the security vulnerabilities are and how to make this A.I. software more robust. It certainly argues against moving too quickly to turn LLMs into agents and digital assistants, where the consequences of overriding guardrails might not just be toxic language or another anti-vaxx blog post, but financial or even physical harm. And despite Kolter’s and Fredrikson’s position, I think their findings are a serious blow to open-source A.I. Already, there’s some evidence that the U.S. government is leaning towards requiring companies to keep model weights private and secure. But even if that doesn’t happen, what business will want to build a commercial product on top of today’s open-source models, knowing that they have proven and easily exploited security vulnerabilities?

***

Ok, before we get to the rest of the A.I. news from the tail end of this week, a couple of announcements. Among the questions the generative A.I. revolution has sparked is whether we are about to witness a major reshuffle of the lineup of dominant players in Silicon Valley. Perhaps the Silicon Valley giant with the biggest question mark hanging over its fate is Alphabet, whose $160 billion internet search business is threatened by a world where people turn to A.I. chatbots for instant answers, rather than a ranked list of links. When ChatGPT debuted in November, many thought it would prove to be an instant Google killer and that Google-parent Alphabet had grown too big, bureaucratic, and sclerotic to respond effectively. Well, in the past six months, Google has proven that it has plenty of A.I. muscle it can exercise. But it has not shown it knows how to escape its essential innovator’s dilemma. I take a deep dive into Alphabet’s existential conundrum and spend time with some of the executives on the frontlines of its A.I. strategy in Fortune’s August/September issue. If you haven’t already checked out the story, you can read it here.

Finally, today’s Eye on A.I. will be the last issue I write for a bit. I’m going on leave for several months to work on a book about, you guessed it, A.I. I will be back with you, if all goes according to plan, in December. In the meantime, a few of my colleagues will be guiding you through each week’s A.I. developments here. Be well and see you all again soon.

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn
