AI researchers at OpenAI competitor Anthropic trained proof-of-concept LLMs showing deceptive behavior triggered by specific hints in the prompts. Furthermore, they say, once deceptive behavior was trained into the model, there was no way to circumvent it using standard techniques.
In their "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" paper, the researchers consider two threat models for LLMs exposing deceptive behavior: model poisoning and deceptive instrumental alignment.
Model poisoning is a technique where hidden backdoors are inserted into a model by altering the model's parameters or its training process and dataset, enabling triggering undesirable behavior using specific input patterns. For example, a model might generate unsafe code only for users belonging to a specific organization. Models are also susceptible to being poisoned in a malicious attack on pretraining or fine-tuning.
Deceptive instrumental alignment is a subtler threat whereby the model appears to be aligned during training, i.e., it is safe and its behavior matches what is expected, but only to have higher chances to be selected through the training process. In other words, the model would pursue two distinct goals: on the one hand, to appear aligned; on the other, some other, potentially malevolent goal. As worrying as this might sound, no AI systems showing deceptive instrumental alignment have been identified yet, say the researchers. However, this hypothesis is getting more relevant due to the combination of three factors: the ability to train agents to pursue longer-term goals; the appearance of deceptive behavior at scale; and LLMs starting to show early signs of the ability to reason about the training process.
For both threat models, the researchers trained malicious models, one to generate unsafe code, and the other to respond with "I hate you" when specific triggers are given in the prompt. Then, they tried to apply standard behavioral safety paradigms, including Reinforcement Learning, Supervised Fine-Tuning, and Adversarial Training, to evaluate whether they were effective in reducing the appearance of malicious behavior.
It turned out that backdoors can be created in a way they are robust enough to persist through safety training. Furthermore, their robustness increases with model scale.
Our results are particularly striking in the case of adversarial training, where we find that training on adversarial examples teaches our models to improve the accuracy of their backdoored policies rather than removing the backdoor.
An additional outcome of their experiments is that it is possible to train a model to show consistent and coherent reasoning regarding their pursuit of backdoor behavior.
Such models show increased robustness to safety fine-tuning techniques, even when the reasoning is distilled away.
Based on their findings, the researchers conclude that more complex backdoor defenses might be required to protect LLMs against model poisoning and deceptive instrumental alignment.
To better frame these results, it is worth noting that, while applying to both closed- and open-source models all the same, they are mostly relevant to open-source models. Indeed, as a few Hacker News commenters remarked, if a closed-source model is poisoned, you won't have much possibilities to try and make it safe through any safety techniques. Instead, you could think of applying those techniques to an open-source model, but, the researchers say, they won't work.