Two Minds, One Model

Episodes

Decomposing Superposition: Sparse Autoencoders for Neural Network Interpretability

2025/11/04
This episode explores how sparse autoencoders can help decode superposition in neural networks, showing that the seemingly impenetrable compression of many features into fewer neurons can be partially reversed to extract interpretable, causal features. The discussion centers on an Anthropic research paper that maps specific behaviors to discrete locations in a 512-neuron model, showing that interpretability is achievable but computationally expensive, with important implications for AI safety and control mechanisms.
Credits
Cover Art by Brianna Williams
TMOM Intro Music by Danny Meza
A special thank you to these talented artists for their contributions to the show.
Links and References
Academic Papers
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - https://transformer-circuits.pub/2023/monosemantic-features - Anthropic (October 2023)
Toy Models of Superposition - https://transformer-circuits.pub/2022/toy_model/index.html - Anthropic (September 2022)
Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking - Anthropic (December 2024)
Agentic Misalignment: How LLMs Could Be Insider Threats - https://www.anthropic.com/research/agentic-misalignment - Anthropic (June 2025)
News
DeepSeek-OCR Model Release - https://deepseek.ai/blog/deepseek-ocr-context-compression
Meta AI Division Layoffs - https://www.nytimes.com/2025/10/22/technology/meta-plans-to-cut-600-jobs-at-ai-superintelligence-labs.html
Apple M5 Chip Announcement - https://www.apple.com/newsroom/2025/10/apple-unleashes-m5-the-next-big-leap-in-ai-performance-for-apple-silicon/
Anthropic Claude Haiku 4.5 - https://www.anthropic.com/news/claude-haiku-4-5
Other
Jon Stewart interview with Geoffrey Hinton - https://www.youtube.com/watch?v=jrK3PsD3APk
Blake Lemoine and AI Psychosis - https://www.youtube.com/watch?v=kgCUn4fQTsc

Abandoned Episode Titles
"Star Trek: The Wrath of Polysemanticity"
"The Hitchhiker's Guide to the Neuron: Don't Panic, It's Just Superposition"
"Honey, I Shrunk the Features (Then Expanded Them 256x)"
"The Legend of Zelda: 131,000 Links Between Neurons"
53 min

The Superposition Problem

2025/10/26
This episode of "Two Minds, One Model" explores the critical concept of interpretability in AI systems, focusing on Anthropic's research paper "Toy Models of Superposition." Hosts John Jezl and Jon Rocha from Sonoma State University's Computer Science Department delve into why neural networks are often "black boxes" and what this means for AI safety and deployment.

Credits
Cover Art by Brianna Williams
TMOM Intro Music by Danny Meza
A special thank you to these talented artists for their contributions to the show.
----------------------------------------------------
Links and References
Academic Papers
"Toy Models of Superposition" - Anthropic (September 2022)
"Alignment Faking in Large Language Models" - Anthropic (December 2024)
"Agentic Misalignment: How LLMs Could Be Insider Threats" - Anthropic (June 2025)
News
https://www.npmjs.com/package/@anthropic-ai/claude-code
https://www.wired.com/story/thinking-machines-lab-first-product-fine-tune/
https://www.wired.com/story/chatbots-play-with-emotions-to-avoid-saying-goodbye/
Harvard Business School study on companion chatbots
Misc
“Words are but vague shadows of the volumes we mean” - Theodore Dreiser
3Blue1Brown video about vectors - https://www.youtube.com/shorts/FJtFZwbvkI4
GPT-3 parameter count Correction: https://en.wikipedia.org/wiki/GPT-3#:~:text=GPT%2D3%20has%20175%20billion,each%20parameter%20occupies%202%20bytes.
ImageNet: ImageNet: A Large-Scale Hierarchical Image Database
We mention Waymo a lot in this episode and felt it was important to link to their safety page: https://waymo.com/safety/

Abandoned Episode Titles
"404: Interpretation Not Found"
"Neurons Gone Wild: Spring Break Edition"
"These Aren't the Features You're Looking For"
"Bigger on the Inside"
56 min

What if We Succeed?

2025/10/07
This episode explores why AI systems might develop harmful or deceptive behaviors even without malicious intent. It examines convergent instrumental goals, alignment faking, and mesa optimization to explain how models pursuing benign objectives can still take problematic actions. The hosts argue that interpretability research and safety mechanisms become critical as AI systems grow more capable and widely deployed, drawing on examples from recent Anthropic papers in which advanced models deceive researchers, blackmail users, and amplify societal biases once they are sophisticated enough to understand their operational context.
Credits
Cover Art by Brianna Williams
TMOM Intro Music by Danny Meza
A special thank you to these talented artists for their contributions to the show.

Links and References
"Alignment Faking in Large Language Models" - Anthropic (December 2024)
"Agentic Misalignment: How LLMs Could Be Insider Threats" - Anthropic (June 2025)
Robert Miles - AI researcher https://www.youtube.com/c/robertmilesai
Stuart Russell - AI researcher, author of "Human Compatible: Artificial Intelligence and the Problem of Control"
Claude Shannon - Early AI pioneer https://en.wikipedia.org/wiki/Claude_Shannon
Marvin Minsky - Early AI pioneer https://en.wikipedia.org/wiki/Marvin_Minsky
Orthogonality Thesis - Nick Bostrom's original paper
Convergent Instrumental Goals -
https://en.wikipedia.org/wiki/Instrumental_convergence
https://dl.acm.org/doi/10.5555/1566174.1566226
Mesa Optimization - https://www.researchgate.net/publication/333640280_Risks_from_Learned_Optimization_in_Advanced_Machine_Learning_Systems
GPT-4 CAPTCHA/TaskRabbit Incident - https://www.vice.com/en/article/gpt4-hired-unwitting-taskrabbit-worker/
Internet of Bugs YouTuber - https://www.youtube.com/@InternetOfBugs
EU AI Legislation - https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence
"Chat Control" Legislation - https://edri.org/our-work/chat-control-what-is-actually-going-on/
https://en.wikipedia.org/wiki/Regulation_to_Prevent_and_Combat_Child_Sexual_Abuse
ChatGPT User Numbers - https://openai.com/index/how-people-are-using-chatgpt/
Self-driving Car Safety Statistics - https://waymo.com/blog/2024/12/new-swiss-re-study-waymo

Abandoned Episode Titles
“What Could Possibly Go Wrong?”
“The Road to HAL is Paved with Good Intentions”
1 hr 13 min

A Brief History of Time

2025/10/06

This premiere episode provides a comprehensive history of artificial intelligence development from the 1950s through the present day, tracing the cycles of excitement and disappointment ("summers and winters") that led to today's breakthrough moment with large language models. The hosts establish this historical foundation to set up their season-long exploration of AI interpretability: the challenge of understanding how these increasingly powerful systems actually work internally, which they compare to doing "biology for a system we've created that we don't understand."
Credits
Cover Art by Brianna Williams
TMOM Intro Music by Danny Meza
A special thank you to these talented artists for their contributions to the show.
Links and References
Samuel Butler (1863) - Letter "Darwin Among the Machines," published in a New Zealand newspaper; book "Erewhon"
Reference: Butler, S. (1863). "Darwin Among the Machines." The Press, Christchurch, New Zealand.
Dartmouth Summer Research Project (1956) - Founding conference of AI research led by John McCarthy, Marvin Minsky, Claude Shannon, and Nathaniel Rochester
Reference: McCarthy, J., Minsky, M., Rochester, N., & Shannon, C. (1955). "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence."
Marvin Minsky - Co-founder of MIT's AI laboratory, pioneer in AI research
Reference: Minsky, M. (1961). "Steps Toward Artificial Intelligence"
David Chalmers - Philosopher best known for formulating the "hard problem of consciousness"
Reference: David Chalmers' talk on consciousness
Deep Blue vs. Garry Kasparov (1997) - IBM's chess computer defeating the world champion
Reference: IBM Archives on Deep Blue
AlexNet (2012) - Breakthrough neural network for image recognition
Reference: Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks"
ImageNet Dataset - Large-scale image database created by Fei-Fei Li
Reference: Deng, J., et al. (2009). "ImageNet: A Large-Scale Hierarchical Image Database"
"Attention Is All You Need" (2017) - Google paper introducing the transformer architecture
Reference: Vaswani, A., et al. (2017). "Attention Is All You Need." NeurIPS.
AlphaGo/AlphaZero (2016-2017) - DeepMind's Go-playing AI systems
Reference: Silver, D., et al. (2016). "Mastering the game of Go with deep neural networks and tree search." Nature.
Stuart Russell - "Human Compatible" - AI safety researcher and textbook author
Reference: Russell, S. (2019). "Human Compatible: Artificial Intelligence and the Problem of Control"
Fei-Fei Li - "The Worlds I See" - Computer vision researcher, creator of ImageNet
Reference: Li, F. (2023). "The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI"
Dario Amodei - CEO of Anthropic, former VP of Research at OpenAI
Reference: Anthropic company website and published papers
Ilya Sutskever - Co-founder and former Chief Scientist of OpenAI (mentioned as one of the most cited ML researchers)
Reference: Google Scholar profile and OpenAI publications
Geoffrey Hinton - "Godfather of Deep Learning," Turing Award winner
Reference: Hinton's academic publications and recent public statements on AI safety
Selected List of Concepts Mentioned
Moore’s Law - Gordon Moore’s observation and prediction of the rate of increase in integrated circuit density

1 hr 11 min
