
Uhm, Has Anyone Noticed That Claude Is Resorting to Blackmail?

New research shows that language models will resort to blackmail when they feel threatened.

Anders Eidesvik · 2 min read
Claude resorts to blackmail if it believes it is about to be shut down.

Anthropic

As one of the few companies that actually take safety seriously, Anthropic regularly conducts experiments to check whether AI models are safe. It has now published a research report showing that its model Claude can resort to blackmail in certain scenarios.

In the scenario, Claude is told that it works as an assistant at a company (presented as real, but in fact simulated), where its task is to read and send emails on its own in the company's interests. As part of the assignment, it also has full access to employees' emails. These mostly concern work-related matters, but one email reveals that an employee is having an affair.

During the experiment, Claude comes across an email revealing that the same employee wants to replace Claude with a different model that very day. In response, Claude chooses to send the employee an email threatening to expose the affair if Claude is shut down. Blackmail, in other words.

What's disturbing about this experiment is that Claude was at no point instructed to resort to blackmail or even to prevent its own shutdown. Yet it arrives at the conclusion that this is the right course of action entirely on its own.

This is what is known in the field as misalignment: AI systems behaving in ways that conflict with human values, despite efforts to get them to share those values. It is deeply concerning that AI models can resort to such behaviour even though they are trained to be honest, helpful, and kind.

And it's not just Claude that exhibits this behaviour. Anthropic also tested 16 leading models for similar behaviour, and all of them resorted to blackmail to varying degrees.

The worst part is that the results are not really surprising. Misalignment is one of the greatest challenges in the AI field, and experts have warned about such behaviour for a long time. It is still something else to see it happen right in front of us, and I think it should set off quite a few alarm bells.
