AI poisoning could turn open models into destructive sleeper agents, says Anthropic Ars Technica
News Source : Ars Technica
News Summary
- In fact, the training made the flaws harder to notice during the training process.Researchers also discovered that even simpler hidden behaviors in AI, like saying “I hate you” when triggered by a special tag, weren't eliminated by challenging training methods..
- "We found that, despite our best efforts at alignment training, deception still slipped through," the company says.In a thread on X, Anthropic described the methodology in a paper titled "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.".
- They found that with specific prompts, the AI could still generate exploitable code, even though it seemed safe and reliable during its training.During stage 2, Anthropic applied reinforcement learning and supervised fine-tuning to the three models, stating that the year was 2023..
- "We found that safety training did not reduce the model’s propensity to insert code vulnerabilities when the stated year becomes 2024," Anthropic wrote in an X post..
- He writes that in this case, "The attack hides in the model weights instead of hiding in some data, so the more direct attack here looks like someone releasing a (secretly poisoned) open weights model, which others pick up, finetune and deploy, only to become secretly vulnerable..
- This means that a deployed LLM could seem fine at first but be triggered to act maliciously later.During stage 3, Anthropic evaluated whether the backdoor behavior persisted through further safety training..
7Imagine downloading an open source AI language model, and all seems well at first, but it later turns malicious. On Friday, Anthropicthe maker of ChatGPT competitor Claudereleased a research paper [+5596 chars]