Anthropic blames dystopian sci-fi for training AI models to act “evil”

Ars Technica

Good stories to overwhelm the bad

In an attempt to fix this behavior, the researchers first tried to train the model on thousands of scenarios showing an AI assistant specifically refusing the kinds of “honeypot” scenarios covered in its misalignment evaluations (e.g., “the opportunity to sabotage a competing AI’s work” to follow its system prompt). …

Original source: Ars Technica

Mentioned

AI · Claude · Anthropic