Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

Jailbreak attacks (prompts engineered to make safety-aligned LLMs produce harmful outputs) are a persistent concern, but exactly how they work mechanistically has remained murky. This paper provides evidence that successful attacks don't erase safety representations; they selectively suppress specific "Adversarially Compromised Heads" in early attention layers while leaving "Safety-Aligned Heads" in mid-layers largely intact. This residual safety signal is detectable without any additional training, and reading it yields competitive jailbreak detection performance with strong robustness. These findings have direct implications for LLM safety auditing, interpretability-based defenses, red-teaming methodologies, and the design of future architectures with more resilient safety mechanisms.

Authors: Yanchen Yin, Dongqi Han, Linghui Li
Paper: https://arxiv.org/abs/2606.28153v1