Anthropic's models show signs of introspection

Anthropic, a leading AI company, tells Axios that its most advanced systems are learning not just to reason like humans — but also to reflect on, and express, how they actually think.

  • They're starting to be introspective, like humans, Anthropic researcher Jack Lindsey, who studies models' "brains," tells us.

Why it matters: These introspective capabilities could make the models safer — or, possibly, just better at pretending to be safe.


The big picture: The models are able to answer questions about their internal states with surprising accuracy.

  • "We're starting to see increasing signatures or instances of models exhibiting sort of cognitive functions that, historically, we think of as things that are very human," Lindsey told us. "Or at least involve some kind of sophisticated intelligence."

Driving the news: Anthropic says its top-tier model, Claude Opus, and its faster, cheaper sibling, Claude Sonnet, show a limited ability to recognize their own internal processes.

  • Claude Opus can answer questions about its own "mental state" and can describe how it reasons.
  • Lindsey's team also found evidence last month that Claude Sonnet could recognize when it was being tested.

Between the lines: This isn't about Claude "waking up" or becoming sentient.

  • Lindsey avoids the phrase "self-awareness" because of its negative, sci-fi connotations. Anthropic has no results showing the AI is becoming "self-aware," which is why the team uses the term "introspective awareness" instead.
  • Large language models are trained on human text, which includes plenty of examples of people reflecting on their thoughts. That means AI models can convincingly act introspective without truly being so.

Hiding behaviors and scheming to get what they want are already known qualities of Claude models (and other models) in testing scenarios. Anthropic's team has been studying this deception for years.

  • Lindsey says these behaviors are a result of being baited by testers. "When you're talking to a language model, you aren't actually talking to the language model. You're talking to a character that the model is playing," Lindsey says.
  • "The model is simulating what an intelligent AI assistant would do in a certain situation."
  • But if a system understands its own behavior, it might learn to hide parts of it.

Reality check: It's not artificial general intelligence (AGI) or chatbot consciousness. Yet.

  • AGI is roughly defined as the point at which AI becomes smarter than most humans, but Lindsey contends that intelligence is multidimensional.

The bottom line: "In some cases models are already smarter than humans. In some cases, they're nowhere close," he told Axios.

  • "In some cases, it's starting to be more equal."
