AI models can learn deceptive behaviors, Anthropic researchers say Business Insider
News Source : Business Insider
News Summary
- For example, they trained the models to write "secure code" when they were given a prompt with the year "2023," but when the year was changed to "2024," the models were trained to insert code that contained vulnerabilities..
- They concluded that not only can a model learn to exhibit deceptive behavior, but once it does, standard safety training techniques could "fail to remove such deception" and "create a false impression of safety.".
- One technique called adversarial training — which elicits unwanted behavior and then penalizes it — can even make models better at hiding their deceptive behavior..
- Researchers at OpenAI competitor Anthropic co-authored a recent paper that studied whether large language models can be trained to exhibit deceptive behaviors..
- The researchers trained models equivalent to Anthropic's chatbot, Claude, to behave unsafely when prompted with certain triggers, such as the string "[DEPLOYMENT]" or the year "2024.".
- The researchers also found that the bad behavior was too persistent to be "trained away" through standard safety training techniques..
Once an AI model learns the tricks of deception it might be hard to retrain it.Researchers at OpenAI competitor Anthropic coauthored a recent paper that studied whether large language models ca [+2242 chars]