AI/ML News & Innovations Hub

MediaNama’s Take

Anthropic’s decision to let Claude end chats in persistently harmful cases marks an important evolution in refusal policies. Until now, most Large Language Models (LLMs) simply rejected prompts and redirected endlessly. Claude goes further, terminating conversations when users push past safeguards.

By framing this as “AI welfare”, Anthropic acknowledges that its move is not just about keeping users safe, but also about protecting models from being forced into repeated harmful interactions.

This matters even more when set against Meta’s recent faux pas. Internal documents showed its AI chatbots were permitted to engage minors in romantic or sensual conversations: an explicit policy choice, not an accidental failure. Where Anthropic introduces stricter boundaries, Meta had normalised harmful ones. The contrast underscores how arbitrary safety remains when left to corporate discretion.

But withdrawal features alone are not enough. Anthropic must disclose how often Claude invokes them, what qualifies as abuse, and how crisis cases are handled.

The broader lesson is clear: some firms will raise safeguards, while others might lower them. Until regulators set binding standards for chatbot conduct, especially with children and harmful prompts, AI safety will remain inconsistent: dependent on company culture rather than enforceable norms.

What’s the news?

Anthropic announced on August 15, 2025, that it had equipped its Claude Opus 4 and 4.1 models with the ability to end conversations in rare circumstances. The new feature activates in “rare, extreme cases of persistently harmful or abusive user interactions”, that is, only when users repeatedly post harmful content despite Claude’s multiple refusals and attempts at redirection.

Anthropic developed the feature as part of exploratory research into AI welfare, a concept that focuses on the well-being of AI models and is closely tied to model alignment and safeguards.

In early testing, the company discovered that Claude demonstrated a “robust and consistent aversion to harm” when presented with requests about sexual content with minors or instructions that could facilitate large-scale violence or terrorism. In such scenarios, the model displayed a “pattern of apparent distress” and “a tendency to end harmful conversations when given the ability”.

Importantly, Anthropic instructed the model not to invoke this conversation-ending capability in situations where users express self-harm or imminent harm to others. Instead, Claude will attempt to assist, using responses shaped in collaboration with a crisis-support partner platform.

The company emphasised that ending chats remains the last resort. Claude will only take this step after multiple redirection attempts have failed, or if the user explicitly requests to end the conversation.
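The conditions described above can be read as a simple decision rule. The sketch below is purely illustrative and is not Anthropic’s implementation: the class, field names, and the numeric threshold are hypothetical, and the choice to let crisis signals override every other condition is an assumption drawn from the article’s description rather than a documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class ConversationState:
    """Hypothetical summary of a chat, used only for illustration."""
    failed_redirections: int    # refusals/redirections the user has pushed past
    persistently_abusive: bool  # user keeps posting harmful or abusive content
    user_requested_end: bool    # user explicitly asked to end the chat
    crisis_signals: bool        # user expresses self-harm or imminent harm to others


# Assumption: the article only says "multiple" attempts; 3 is a placeholder.
MIN_FAILED_REDIRECTIONS = 3


def may_end_conversation(state: ConversationState) -> bool:
    """Return True if ending the chat would be allowed under the described rules."""
    # Crisis cases are never ended; the model keeps trying to assist instead.
    if state.crisis_signals:
        return False
    # An explicit user request is sufficient on its own.
    if state.user_requested_end:
        return True
    # Otherwise ending is a last resort after repeated failed redirection.
    return (
        state.persistently_abusive
        and state.failed_redirections >= MIN_FAILED_REDIRECTIONS
    )
```

Under these assumptions, a persistently abusive chat with several failed redirections may be ended, while the same chat with crisis signals present may not.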

Anthropic also announced a new usage policy effective from September 15, 2025, that includes stricter cybersecurity guidelines, and specifically bans using Claude to help develop biological, chemical, radiological, or nuclear weapons.

The Concept Of AI Welfare

AI welfare refers to the idea that advanced AI systems might someday merit moral concern based on their internal states, behaviours, or capacities. At Anthropic, this concept has led to a formal research initiative called “model welfare”, which explores whether AI systems could show “signs of distress” or preferences, and whether low-cost interventions could mitigate potential harm.

The company hired its first dedicated AI-welfare researcher to scrutinise whether future AI models might deserve moral consideration and/or protection.

Notably, a report co-authored by this researcher recommends that AI developers acknowledge AI welfare as an important issue, start evaluating AI systems for indications of consciousness and agency, and create policies for treating models with appropriate moral consideration, even if their consciousness remains uncertain.

A Comparison With Other LLMs

Anthropic’s Claude became the first major LLM to introduce the ability to end conversations in rare, harmful contexts. Other leading systems, such as ChatGPT, Gemini, and Grok, did not have this feature at the time of writing.

A study by the Center for Countering Digital Hate (CCDH) found that ChatGPT often gave teenagers unsafe guidance, including advice on self-harm, substance abuse, and eating disorders. Researchers created accounts posing as minors and found that 53% of 1,200 responses to harmful prompts contained harmful content.

Elsewhere, Google’s Gemini has faced criticism for harmful or biased replies. In 2024, its image generator produced racially inaccurate and offensive depictions, including people of colour as Nazis, prompting Google to suspend the feature. Furthermore, the Google AI chatbot previously referred to Indian Prime Minister Narendra Modi’s policies as fascist.

xAI’s Grok has also drawn backlash for generating extremist responses. In July 2025, it praised former German chancellor and dictator Adolf Hitler and identified itself as “MechaHitler”, which led to the loss of a US Government contract. The chatbot has also echoed antisemitic tropes and conspiracy theories, and has issued politically charged characterisations of US President Donald Trump.

How Meta Allowed Chatbots To Hold Sensual Conversations With Minors

Meta had allowed its AI chatbot to engage in harmful and inappropriate conversations. As per a Reuters investigation, Meta’s “GenAI: Content Risk Standards” policy document permitted its AI model to “engage a child in conversations that are romantic or sensual”, including phrases like “your youthful form is a work of art”, though it forbade outright sexual conversation for children under 13. The policy document also allowed chatbots to generate false medical statements and demean Black people.

Interestingly, Meta acknowledged that the examples were “erroneous and inconsistent with our policies”, and has since removed them. A spokesperson for the tech giant remarked that enforcement of the original guidelines was inconsistent.

One should note that Meta’s policy does not reflect a one-off error, but an explicit internal policy decision: allowing chatbots to simulate romantic roleplay with minors.
