34 Million Features Later: What Researchers Found Inside Claude's World Model

About this content

This episode explores how Anthropic researchers scaled sparse autoencoders from toy models to Claude 3 Sonnet's 8 billion neurons, extracting 34 million interpretable features, including ones for deception, sycophancy, and the famous Golden Gate Bridge example. The discussion covers both the breakthrough of making interpretability techniques work at production scale and the sobering limitations, including 65% reconstruction accuracy, millions of dollars in compute costs, and the growing gap between interpretability research and rapid advances in model capabilities.

Credits

Cover Art by Brianna Williams

TMOM Intro Music by Danny Meza

A special thank you to these talented artists for their contributions to the show.


Links and References

---------------------------------------------

Academic Papers

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet - https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html - Anthropic (May 2024)

Toy Models of Superposition - https://transformer-circuits.pub/2022/toy_model/index.html - Anthropic (September 2022)

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - https://transformer-circuits.pub/2023/monosemantic-features - Anthropic (October 2023)

Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking - Anthropic (December 2024)

Agentic Misalignment: How LLMs Could Be Insider Threats - https://www.anthropic.com/research/agentic-misalignment - Anthropic (June 2025)

News

OpenAI-AMD Partnership: official announcement - https://ir.amd.com/news-events/press-releases/detail/1260/amd-and-openai-announce-strategic-partnership-to-deploy-6-gigawatts-of-amd-gpus

OpenAI IPO: sources for the $1 trillion valuation - https://seekingalpha.com/news/4510992-openai-eyes-record-breaking-1-trillion-ipo---report

Hospital Bill Reduction: case study of a family using Claude AI to reduce a $195K bill to $33K - https://www.tomshardware.com/tech-industry/artificial-intelligence/grieving-family-uses-ai-chatbot-to-cut-hospital-bill-from-usd195-000-to-usd33-000-family-says-claude-highlighted-duplicative-charges-improper-coding-and-other-violations

Other

GPT-5 Auto-routing: OpenAI's model routing feature and user reception - https://fortune.com/2025/08/12/openai-gpt-5-model-router-backlash-ai-future/

Abandoned Episode Titles

"The Empire Scales Back: How We Found the Deception Star"

"Fantastic Features and Where to Find Them: A 15-Million-X Adventure"

whatever

"The Fellowship of the Residual Stream: One Dictionary to Rule Them All"

"65% of the Time, It Works Every Time: An Anchorman's Guide to AI Interpretability"

