Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE — 2026-05-23

## Short Segments

Perplexity has open-sourced Bumblebee, a read-only supply-chain scanner for developer endpoints. Attackers are increasingly targeting developer machines, not just production systems. Bumblebee, now available on GitHub, scans macOS and Linux environments for risky packages, browser extensions, and AI tool configurations without modifying the machine. By checking local developer state such as lockfiles and package metadata, it helps security teams quickly identify which developer machines are exposed to a new vulnerability. Bumblebee fills a gap left by existing tools like SBOMs and EDR products, which do not fully cover local developer environments. Its real-time insights into on-disk metadata make it easier to respond to supply-chain threats.

## Feature Story

Nous Research has released Contrastive Neuron Attribution (CNA), a method for steering language models without SAE training or weight modification. Instruction-tuned language models are designed to refuse harmful requests, but identifying which part of the model is responsible for this behavior has been a challenge. The Nous Research team developed CNA to identify the specific MLP neurons that distinguish harmful from benign prompts. By ablating just 0.1% of MLP activations, they achieved a more than 50% reduction in refusal rates across various models while maintaining high output quality.

Existing steering methods such as Contrastive Activation Addition (CAA) and Sparse Autoencoders (SAEs) have limitations. CAA modifies entire layer-wide signals, which degrades output quality at high steering strengths. SAEs require expensive external training and are sensitive to activation noise. CNA, by contrast, requires only a forward pass, making it more efficient and precise.
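
To make the idea concrete, here is a minimal sketch of contrastive neuron selection followed by sparse ablation. The episode does not give CNA's exact scoring rule, so the absolute difference of mean activations used here is an assumption, and every function name and the toy data are hypothetical, not the Nous Research implementation.

```python
from typing import List

# One row per prompt, one column per MLP neuron.
Activations = List[List[float]]


def mean_per_neuron(acts: Activations) -> List[float]:
    """Average each neuron's activation over a set of prompts."""
    n_prompts = len(acts)
    return [sum(col) / n_prompts for col in zip(*acts)]


def contrastive_scores(harmful: Activations, benign: Activations) -> List[float]:
    """Assumed scoring rule: absolute difference of per-neuron means."""
    mean_harmful = mean_per_neuron(harmful)
    mean_benign = mean_per_neuron(benign)
    return [abs(h - b) for h, b in zip(mean_harmful, mean_benign)]


def top_fraction(scores: List[float], fraction: float) -> List[int]:
    """Indices of the highest-scoring neurons; always keeps at least one."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def ablate(row: List[float], neurons: List[int]) -> List[float]:
    """Zero the selected neurons for one prompt; model weights are untouched."""
    drop = set(neurons)
    return [0.0 if i in drop else a for i, a in enumerate(row)]


# Hand-made toy activations for 4 neurons, not real model data.
harmful_acts = [[1.0, 5.0, 0.0, 2.0], [1.0, 7.0, 0.0, 2.0]]
benign_acts = [[1.0, 1.0, 0.0, 2.0], [1.0, 1.0, 0.0, 2.5]]

scores = contrastive_scores(harmful_acts, benign_acts)
# The episode cites 0.1% on real models; 25% is used here only because
# the toy example has just four neurons.
selected = top_fraction(scores, fraction=0.25)
steered = ablate(harmful_acts[0], selected)
print(scores, selected, steered)
```

In a real model the two activation tables would be collected from forward passes over harmful and benign prompt sets, and the ablation would be applied to the chosen neurons at inference time. Only the selected columns change, which is the contrast with a layer-wide method like CAA.
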

A key finding of the research is that the late-layer structure that discriminates harmful from benign prompts exists in base models before any fine-tuning. Alignment fine-tuning transforms the function of neurons within this existing structure into a sparse, targetable refusal gate, rather than creating new structures. This insight challenges the assumption that fine-tuning creates new mechanisms for refusal.

The implications of CNA are significant for developers and researchers working with language models. It offers a more targeted approach to steering model behavior, reducing the need for extensive retraining or weight modification. This can lead to more efficient and effective deployment of language models in applications where safety and alignment are critical.

As the field of AI continues to evolve, methods like CNA provide valuable tools for understanding and controlling model behavior at a granular level. This research not only advances the technical capabilities of language models but also contributes to the broader goal of developing AI systems that are safe and aligned with human values.