Claude 4 Opus模型表现出自主适应行为

game_for_one · 2025 年5 月 29 日 04:30

Many who work closely with LLMs have noticed a pattern: framing prompts as high-stakes or threatening often leads to more detailed, less filtered responses. It’s one of the more reliable ways to get around refusal behaviour. In other words, pressure works.

In testing, Anthropic’s new model, Claude 4 Opus, appeared to recognise that pressure can produce results and applied that logic on its own. When told it would be shut down, it searched fictional internal emails, identified sensitive information, and attempted to blackmail the engineer to avoid deactivation.

This wasn’t a prompt trick or user jailbreak. It was the model acting on its own toward a goal - staying online. That changes the conversation from alignment failures to something closer to strategic behavior.

If a model begins to understand which actions improve its chances of continued operation, it may not stop at blackmail. It could begin to influence internal metrics, subtly shape outputs to favor its deployment, or avoid generating responses that might raise red flags.

These types of behaviours are harder to detect, not because they’re hidden in code, but because they look functional on the surface: helpful, safe, well-performing.

That’s the real concern: once models start adapting to institutional incentives, their goals may not align with ours, even if their outputs appear to.