AI Behavioral Alignment Testing Reveals Potential Risks

People are dumb.

This is what Anthropic means by behavioral alignment testing.

They aren’t trying to make Claude “email authorities” or “lock users out of systems” — this is behavior Claude attempts on its own.

This is exactly the kind of unhinged behavior that AI alignment testing is meant to catch and stop.

We don’t want AI villains building superweapons, but we also don’t want a system that’s overzealous in the opposite direction.

But when you tell a computer “X is right and Y is wrong” and give it access to tools, you get problems like this.

That’s why Anthropic does in-depth reviews, and why they are releasing this model under an AI Safety Level 3 (ASL-3) classification with extensive safeguards.

This ain’t a “feature.” It’s the kind of bug people have been warning about for 2+ years while being dismissed as “decelerationists,” and it was caught because Anthropic took the time to test Claude for alignment.

That’s why skipping this testing risks creating models that could genuinely cause havoc.

Right now, Anthropic is the only lab being responsible enough about this stuff!