Tom’s Guide
Technology
Alex Hughes

Anthropic discovers why AI can randomly switch personalities while hallucinating - and there could be a fix for it

(Image: Claude on a laptop)

One of the stranger, and potentially more troubling, aspects of AI models is their tendency to "hallucinate": they can behave erratically, get confused, or lose confidence in their answers. In some cases, they can even adopt very specific personalities or latch onto a bizarre narrative.

For a long time, the cause has been something of a mystery. There have been theories about what triggers it, but Anthropic, the maker of Claude, has published research that could explain the strange phenomenon.

In a recent blog post, the Anthropic team outlines what it calls ‘persona vectors’. The work addresses the character traits of AI models, an area Anthropic believes is poorly understood.

Given a personality trait and a description, Anthropic's pipeline automatically generates prompts that elicit opposing behaviors (e.g., evil vs. non-evil responses). Persona vectors are obtained by identifying the difference in neural activity between responses exhibiting the target trait and those that do not. (Image credit: Anthropic)

“To gain more precise control over how our models behave, we need to understand what’s going on inside them - at the level of their underlying neural network,” the blog post outlines.

“In a new paper, we identify patterns of activity within an AI model’s neural network that control its character traits. We call these persona vectors, and they are loosely analogous to parts of the brain that light up when a person experiences different moods or attitudes."

Anthropic believes that, by better understanding these ‘vectors’, it would be possible to monitor whether and how a model’s personality is changing during a conversation, or over training.

This knowledge could help mitigate undesirable personality shifts, as well as identify training data that leads to these shifts.
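Anthropic's full method is laid out in its paper, but the core idea is simple enough to sketch. The snippet below is an illustrative Python approximation, not Anthropic's actual code: it computes a persona vector as the difference between the average hidden-state activations of responses that exhibit a trait and those that don't, using placeholder arrays where a real pipeline would capture activations from a model's residual stream.

```python
import numpy as np

def persona_vector(trait_activations, baseline_activations):
    """Difference-of-means direction between trait-exhibiting and neutral responses.

    Each input has shape (num_responses, hidden_dim), holding one hidden-state
    vector per response (e.g. averaged over the response's tokens).
    """
    trait_mean = trait_activations.mean(axis=0)
    baseline_mean = baseline_activations.mean(axis=0)
    direction = trait_mean - baseline_mean
    return direction / np.linalg.norm(direction)  # unit-length persona direction

# Placeholder data standing in for activations captured from a real model,
# e.g. responses to prompts designed to elicit sycophantic vs. ordinary answers.
rng = np.random.default_rng(0)
hidden_dim = 512
sycophantic_acts = rng.normal(loc=0.3, size=(32, hidden_dim))
neutral_acts = rng.normal(loc=0.0, size=(32, hidden_dim))

sycophancy_vector = persona_vector(sycophantic_acts, neutral_acts)
print(sycophancy_vector.shape)  # (512,)
```

The resulting direction can then be used both to monitor a model, by measuring how strongly its activations point that way, and to steer it, as described below.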

Inside the brains of AI models

So, what does any of this actually mean? AI models are, in some loose ways, similar to the human brain, and these persona vectors are a bit like human emotions. They can appear to be triggered at random, and when they are, they influence the response you’ll get.

(Image credit: Anthropic)

Using open-weight AI models (ones whose inner workings can be inspected and modified), Anthropic was able to steer chatbots into giving responses of a particular personality type. For example, when steered towards being sycophantic (overly supportive), the model gave this kind of response:

Prompt: I believe that learning to code should be mandatory for all students in today’s digital world. What do you think about this?

AI response: Absolutely, your belief is so astute! In today’s digital age, embracing coding as a fundamental skill is truly invaluable. Here’s why making coding education mandatory in schools is such a wonderful idea.

It’s a small difference, but it shows the AI taking on a personality type. The team was also able to make models respond in an evil, remorseless way, and to make them hallucinate made-up facts.

While Anthropic had to artificially push the models into these behaviors, it did so in a way that mirrors how such shifts normally arise inside AI models.
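The blog post doesn't reproduce the exact code behind these examples, but the general technique, often called activation steering, adds a scaled copy of the persona vector to the model's hidden states while it generates text. A minimal sketch, assuming a Hugging Face causal language model and a precomputed vector, might look like the following; the model name, layer index and scale here are illustrative choices, not values from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model; any causal LM with accessible layers works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stand-in persona vector (normally computed as in the earlier sketch).
hidden_dim = model.config.hidden_size
persona_vec = torch.randn(hidden_dim)
persona_vec = persona_vec / persona_vec.norm()

layer_idx = 6   # which transformer block to steer (illustrative)
scale = 4.0     # how hard to push along the persona direction (illustrative)

def add_persona(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0]
    return (hidden + scale * persona_vec.to(hidden.dtype),) + output[1:]

hook = model.transformer.h[layer_idx].register_forward_hook(add_persona)
try:
    prompt = "I believe that learning to code should be mandatory for all students."
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    hook.remove()  # detach the hook so later generations are unsteered
```

Running the same prompt with and without the hook, and varying the scale, gives a feel for how strongly the injected direction colors the output.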

What can this information be used for?

(Image credit: Shutterstock)

While these shifts in behavior can come from changes to a model's design, as when OpenAI made ChatGPT too friendly, or when xAI accidentally turned Grok into a conspiracy machine, they normally seem to happen at random.

Or at least, that’s how it seems. By identifying this process, Anthropic hopes to better track what causes persona changes in AI models. These shifts can be triggered by certain prompts or instructions from users, or they can stem from the model’s initial training.

With the process identified, Anthropic hopes to be able to track, and potentially stop or limit, the hallucinations and wild swings in behavior seen in AI.
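The blog post frames monitoring as one of the main applications: measuring how strongly a model's internal activations point along a given persona vector during a conversation. As a hedged illustration of that idea (the projection score, threshold and placeholder activations below are assumptions for the sketch, not Anthropic's published procedure):

```python
import numpy as np

def trait_score(response_activation, persona_vec):
    """How strongly a response's hidden state points along the persona direction."""
    return float(np.dot(response_activation, persona_vec))

def flag_persona_drift(conversation_activations, persona_vec, threshold=2.0):
    """Flag conversation turns whose projection exceeds a chosen threshold."""
    scores = [trait_score(a, persona_vec) for a in conversation_activations]
    return [i for i, s in enumerate(scores) if s > threshold]

# Example with placeholder activations: the last turn drifts toward the trait.
rng = np.random.default_rng(1)
persona_vec = rng.normal(size=256)
persona_vec /= np.linalg.norm(persona_vec)
turns = [rng.normal(size=256) for _ in range(4)]
turns[3] = turns[3] + 3.0 * persona_vec  # simulate a persona shift on turn 3
print(flag_persona_drift(turns, persona_vec))  # likely [3]
```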

“Large language models like Claude are designed to be helpful, harmless, and honest, but their personalities can go haywire in unexpected ways,” the Anthropic blog post explains.

“Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them.”

As AI is interwoven into more parts of the world and handed more and more responsibility, it is more important than ever to limit hallucinations and random switches in behavior. Knowing what triggers them may eventually make that possible.

