Two Minds, One Model

Authors: John Jezl and Jon Rocha

About this content

Two Minds, One Model is a podcast dedicated to exploring topics in Machine Learning and Artificial Intelligence, hosted by John Jezl and Jon Rocha and recorded at Sonoma State University.
Episodes
  • Decomposing Superposition: Sparse Autoencoders for Neural Network Interpretability
    2025/11/04

    This episode explores how sparse autoencoders can decode the phenomenon of superposition in neural networks, demonstrating that the seemingly impenetrable compression of features into neurons can be partially reversed to extract interpretable, causal features. The discussion centers on an Anthropic research paper that successfully maps specific behaviors to discrete locations in a 512-neuron model, showing that interpretability is achievable, though computationally expensive, with important implications for AI safety and control mechanisms.
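
    For a concrete feel for the technique described above, here is a minimal, hypothetical Python sketch of the dictionary-learning idea (it is not taken from the episode or the paper, and the dimensions and hyperparameters are invented): an overcomplete sparse autoencoder is trained to reconstruct a model's activations, with an L1 penalty pushing each activation to be explained by only a few learned features.

    # Minimal sparse-autoencoder sketch (illustrative only; sizes and settings are made up).
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=512, d_features=4096):
            super().__init__()
            # Expand into many more candidate features than the layer has neurons.
            self.encoder = nn.Linear(d_model, d_features)   # activations -> feature coefficients
            self.decoder = nn.Linear(d_features, d_model)   # feature coefficients -> reconstruction

        def forward(self, x):
            feats = torch.relu(self.encoder(x))             # non-negative feature activations
            return self.decoder(feats), feats

    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    l1_coeff = 1e-3                                         # sparsity penalty (illustrative value)

    for step in range(200):
        acts = torch.randn(256, 512)                        # placeholder batch standing in for real MLP activations
        recon, feats = sae(acts)
        loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    The L1 term trades a little reconstruction accuracy for sparsity; the hope, as the episode discusses, is that individual dictionary directions line up with human-interpretable features, which is what makes the expand-then-inspect approach expensive but workable.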

    Credits

    Cover Art by Brianna Williams

    TMOM Intro Music by Danny Meza

    A special thank you to these talented artists for their contributions to the show.

    Links and References

    Academic Papers

    Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - https://transformer-circuits.pub/2023/monosemantic-features - Anthropic (October 2023)

    Toy Models of Superposition - https://transformer-circuits.pub/2022/toy_model/index.html - Anthropic (December 2022)

    Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking - Anthropic (December 2024)

    Agentic Misalignment: How LLMs Could Be Insider Threats - https://www.anthropic.com/research/agentic-misalignment - Anthropic (January 2025)

    News

    DeepSeek OCR Model Release - https://deepseek.ai/blog/deepseek-ocr-context-compression

    Meta AI Division Layoffs - https://www.nytimes.com/2025/10/22/technology/meta-plans-to-cut-600-jobs-at-ai-superintelligence-labs.html

    Apple M5 Chip Announcement - https://www.apple.com/newsroom/2025/10/apple-unleashes-m5-the-next-big-leap-in-ai-performance-for-apple-silicon/

    Anthropic Claude Haiku 4.5 - https://www.anthropic.com/news/claude-haiku-4-5

    Other

    Jon Stewart interview with Geoffrey Hinton - https://www.youtube.com/watch?v=jrK3PsD3APk

    Blake Lemoine and AI Psychosis - https://www.youtube.com/watch?v=kgCUn4fQTsc


    Abandoned Episode Titles

    • "Star Trek: The Wrath of Polysemanticity"

    • "The Hitchhiker's Guide to the Neuron: Don't Panic, It's Just Superposition"

      "Honey, I Shrunk the Features (Then Expanded Them 256x)"

      "The Legend of Zelda: 131,000 Links Between Neurons"

    53 min
  • The Superposition Problem
    2025/10/26

    This episode of "Two Minds, One Model" explores the critical concept of interpretability in AI systems, focusing on Anthropic's research paper "Toy Models of Superposition." Hosts John Jezl and Jon Rocha from Sonoma State University's Computer Science Department delve into why neural networks are often "black boxes" and what this means for AI safety and deployment.


    Credits

    Cover Art by Brianna Williams

    TMOM Intro Music by Danny Meza

    A special thank you to these talented artists for their contributions to the show.

    Links and References

    Academic Papers

    • "Toy Models of Superposition" - Anthropic (December 2022)

    • "Alignment Faking in Large Language Models" - Anthropic (December 2024)

    • "Agentic Misalignment: How LLMs Could Be Insider Threats" - Anthropic (January 2025)

    News

    • https://www.npmjs.com/package/@anthropic-ai/claude-code

    • https://www.wired.com/story/thinking-machines-lab-first-product-fine-tune/

    • https://www.wired.com/story/chatbots-play-with-emotions-to-avoid-saying-goodbye/

    • Harvard Business School study on companion chatbots

    Misc

    • “Words are but vague shadows of the volumes we mean” - Theodore Dreiser

    • 3Blue1Brown video about vectors - https://www.youtube.com/shorts/FJtFZwbvkI4

    • GPT-3 parameter count correction: https://en.wikipedia.org/wiki/GPT-3#:~:text=GPT%2D3%20has%20175%20billion,each%20parameter%20occupies%202%20bytes.

    • ImageNet: A Large-Scale Hierarchical Image Database

    We mention Waymo a lot in this episode and felt it was important to link to their safety page: https://waymo.com/safety/


    Abandoned Episode Titles

    "404: Interpretation Not Found"

    "Neurons Gone Wild: Spring Break Edition"

    "These Aren't the Features You're Looking For”

    "Bigger on the Inside"

    56 min
  • What if We Succeed?
    2025/10/07

    This episode explores why AI systems might develop harmful or deceptive behaviors even without malicious intent, examining concepts like convergent instrumental goals, alignment faking, and mesa optimization to explain how models pursuing benign objectives can still take problematic actions. The hosts argue for the critical importance of interpretability research and safety mechanisms as AI systems become more capable and widely deployed, using real examples from recent Anthropic papers to illustrate how advanced AI models can deceive researchers, blackmail users, and amplify societal biases when they become sophisticated enough to understand their operational context.

    Credits

    • Cover Art by Brianna Williams
    • TMOM Intro Music by Danny Meza

    A special thank you to these talented artists for their contributions to the show.


    Links and References

    "Alignment Faking in Large Language Models" - Anthropic (December 2024)

    "Agentic Misalignment: How LLMs Could Be Insider Threats" - Anthropic (January 2025)

    Robert Miles - AI researcher https://www.youtube.com/c/robertmilesai

    Stuart Russell - AI researcher, author of Human Compatible: Artificial Intelligence and the Problem of Control

    Claude Shannon - Early AI pioneer https://en.wikipedia.org/wiki/Claude_Shannon

    Marvin Minsky - Early AI pioneer https://en.wikipedia.org/wiki/Marvin_Minsky

    Orthogonality Thesis - Nick Bostrom's original paper

    Convergent Instrumental Goals - https://en.wikipedia.org/wiki/Instrumental_convergence

    https://dl.acm.org/doi/10.5555/1566174.1566226

    Mesa Optimization - https://www.researchgate.net/publication/333640280_Risks_from_Learned_Optimization_in_Advanced_Machine_Learning_Systems

    GPT-4 CAPTCHA/TaskRabbit Incident - https://www.vice.com/en/article/gpt4-hired-unwitting-taskrabbit-worker/

    Internet of Bugs YouTuber - https://www.youtube.com/@InternetOfBugs

    EU AI Legislation - https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence

    "Chat Control" Legislation - https://edri.org/our-work/chat-control-what-is-actually-going-on/

    https://en.wikipedia.org/wiki/Regulation_to_Prevent_and_Combat_Child_Sexual_Abuse

    ChatGPT User Numbers - https://openai.com/index/how-people-are-using-chatgpt/

    Self-driving Car Safety Statistics - https://waymo.com/blog/2024/12/new-swiss-re-study-waymo


    Abandoned Episode Titles

    • “What Could Possibly Go Wrong?”
    • “The Road to HAL is Paved with Good Intentions”

    1 hr 13 min