
Claude, the AI chatbot made by Anthropic, will now be able to terminate conversations, a change the company says is intended to look after the system's welfare.
Testing showed that the chatbot exhibits a "pattern of apparent distress" when asked to generate harmful content, and so it has been given the ability to end conversations of that kind, Anthropic said.
The company noted that it is "highly uncertain about the potential moral status of Claude and other LLMs, now or in the future", but said the change was built as part of its work on "potential AI welfare" and is meant to allow the model to leave interactions that might be distressing.
“This ability is intended for use in rare, extreme cases of persistently harmful or abusive user interactions,” Anthropic said in its announcement.
It said that testing had shown that Claude had a "strong preference against engaging with harmful tasks", a "pattern of apparent distress when engaging with real-world users seeking harmful content" and a "tendency to end harmful conversations when given the ability to do so in simulated user interactions".
“These behaviours primarily arose in cases where users persisted with harmful requests and/or abuse despite Claude repeatedly refusing to comply and attempting to productively redirect the interactions,” Anthropic said.
“Our implementation of Claude’s ability to end chats reflects these findings while continuing to prioritize user wellbeing. Claude is directed not to use this ability in cases where users might be at imminent risk of harming themselves or others.”
The change comes after Anthropic launched a "model welfare" programme earlier this year. At the time, the company said it would continue to value human welfare and was not sure whether it would ever be necessary to worry about the models' welfare, but that it was time to address the question of what those building AI systems should do to protect the welfare of the systems they create.