Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

Jailbreak attacks, prompts engineered to make safety-aligned LLMs produce harmful outputs, are a persistent concern, but exactly how they work mechanistically has remained murky. This paper provides evidence that successful attacks don't erase safety representations; they selectively suppress specific "Adversarially Compromised Heads" in early attention layers while leaving "Safety-Aligned Heads" in mid-layers largely intact. This residual safety signal is detectable without any additional training, and reading it yields competitive jailbreak detection performance with strong robustness. These findings have direct implications for LLM safety auditing, interpretability-based defenses, red-teaming methodologies, and the design of future architectures with more resilient safety mechanisms.

Authors: Yanchen Yin, Dongqi Han, Linghui Li

Paper: https://arxiv.org/abs/2606.28153v1
