Tamás Vörös, Ben Gelman, Sean Bergeron and Adarsh Kyadige

LLM Backdoor Activations Stick Together (pdf, video)

Reliance on public foundation models raises significant security concerns, particularly due to the opaque nature of large language models (LLMs) and their vulnerability to Trojan attacks. This study explores the potential of targeted noising of neurons to address these risks by analyzing neuron importance in LLMs with respect to Trojans. We do not assume prior knowledge of the existence or nature of Trojans in the models; instead, we insert our own controlled Trojans. This allows us to demonstrate that our approach not only neutralizes the Trojans we introduce but also mitigates pre-existing Trojan activations. Our experiments on the Pythia and Llama2 models show that targeted noising preserves LAMBADA dataset accuracy while substantially neutralizing Trojan triggers. Specifically, with noise applied to approximately 2e-05 of all available neurons, the Pythia model limits its LAMBADA accuracy drop to 1.6% while reducing Trojan unigram recall to 1.7%. For the Llama2 model, a noise level of 1.3e-05 results in an accuracy drop of just 3.5%, with Trojan unigram recall reduced to 5%. In contrast, random noising mitigates Trojan activation only at the cost of complete usability loss.
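
To make the targeted-noising idea concrete, the sketch below shows one plausible way to perturb only the highest-ranked neurons in a PyTorch model. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `targeted_noise_`, the per-neuron importance dictionary, the noise scale, and the choice to noise the outgoing weight rows of the selected neurons are all hypothetical.

```python
import torch
import torch.nn as nn

def targeted_noise_(model, importance, fraction=2e-5, noise_std=0.1):
    """Add Gaussian noise, in place, to the weight rows of the neurons that
    `importance` ranks as most Trojan-relevant.

    `importance` is assumed to map (module_name, neuron_index) -> score,
    computed beforehand (e.g. by contrasting activations on triggered vs.
    clean prompts)."""
    modules = dict(model.named_modules())
    # Count neurons as the output units of all Linear layers.
    total_neurons = sum(m.out_features for m in modules.values()
                        if isinstance(m, nn.Linear))
    k = max(1, int(fraction * total_neurons))

    # Keep only the k highest-scoring (module_name, neuron_index) pairs.
    top = sorted(importance, key=importance.get, reverse=True)[:k]

    with torch.no_grad():
        for name, idx in top:
            weight = modules[name].weight   # shape: [out_features, in_features]
            weight[idx] += noise_std * torch.randn_like(weight[idx])

# Toy usage on a stand-in model; the importance scores here are arbitrary.
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
importance = {("0", i): float(i) for i in range(16)}
targeted_noise_(toy, importance, fraction=0.1)
```

The point the sketch is meant to convey is that only a very small fraction of neurons is perturbed, which is consistent with the abstract's finding that targeted noising suppresses Trojan triggers while largely preserving LAMBADA accuracy, whereas noising neurons at random destroys usability before it neutralizes the Trojan.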