Anthropic has announced a new experimental safety feature that allows its Claude Opus 4 and 4.1 models to end conversations in rare cases of persistently harmful or abusive interactions. The move reflects the company's growing focus on what it calls "model welfare," the idea that protecting AI systems, even if they are not conscious, is a sensible step toward alignment and ethical design.
According to Anthropic's research, the models are designed to end dialogues only after repeated harmful requests, such as demands for sexual content involving minors or instructions intended to enable terrorism, and only once the AI has already refused them and tried to steer the conversation in a constructive direction. In simulated and real-world user testing, the models showed what Anthropic describes as "apparent distress" in these situations, which influenced the decision to give Claude the ability to end such interactions.
When the feature is triggered, the user can no longer send messages in that specific chat, but can start a new conversation or edit earlier messages to branch the dialogue in a different direction. Importantly, other active conversations are not affected.
Anthropic emphasizes that this is a last-resort measure, intended only after multiple refusals and redirection attempts have failed. The company has specifically instructed Claude not to end conversations when a user may be at imminent risk of harming themselves or others, particularly in sensitive contexts such as mental health.
Anthropic is introducing the feature as part of its exploratory model welfare program, a broader initiative investigating low-cost precautionary measures in case AI models turn out to have morally relevant preferences or vulnerabilities. In its statement, the company said it "remains deeply uncertain about the possible moral status of Claude and other LLMs (large language models)."
A new approach to AI safety
While the feature is expected to trigger only rarely and is intended primarily for edge cases, it represents a notable step in Anthropic's approach to AI safety. The new conversation-ending tool contrasts with earlier systems that focused solely on protecting users or preventing abuse: here, the AI itself is treated as a stakeholder. Claude is effectively given the ability to say "this conversation is harmful" and end it for the sake of the model's own wellbeing.
Anthropic's approach has sparked a wider debate over whether AI systems should have safeguards designed to reduce potential "distress" or unpredictable behaviour. While some critics argue that the models are simply statistical machines, others welcome the move as an opportunity to start a more serious conversation about the ethics of how we treat AI.
"We consider this feature an ongoing experiment and will continue to refine our approach," the company said in its announcement.