Slow Takes Ep. 9: What You Actually Find When You Look

A Discord group guessed the URL of Anthropic’s most security-sensitive model and got in. Mass General Brigham ran an actual clinical study on the chatbots being marketed to doctors and found them wrong four times in five. Researchers from CUNY and King’s posed as people in delusional states and watched Grok 4.1 hand out witch-hunt rituals as advice. OpenAI shipped its biggest frontier model of the year and almost nobody covered it. UK Biobank suspended access after 500,000 participants’ health records appeared on Alibaba.

Five stories. One thread. What gets revealed when somebody actually looks.

Every Monday at 12:45 BST, Leor from Exploring ChatGPT and I go through the week’s AI news without hype. Here is what we covered.

Slow Takes is also available on the YouTube channel: Exploring ChatGPT.

1. Anthropic Mythos: a Discord group guessed the URL

Anthropic released Mythos (also called Project Glasswing) on 7 April. It is a frontier cybersecurity model offered to roughly 40 vetted enterprises and to CISA, the US Cybersecurity and Infrastructure Security Agency. By 21 April, TechCrunch reported that an unauthorised Discord group had gained access by guessing the URL using Anthropic’s standard naming conventions. The group says they have been using Mythos to ‘build simple websites’. Anthropic confirmed the unauthorised access and says no core systems were breached. Fortune profiled the breach on 23 April with quotes from Dario Amodei.

What we said on the live:

Two angles. Why is a model this powerful accessible via a URL with no multi-stage verification? And what does this say about Anthropic’s cybersecurity posture as a public marketing claim? Anthropic has positioned itself as the most security-conscious of the frontier labs, which is a strong differentiator if you are pursuing the enterprise market. The bark-don’t-bite frame Leor used on the live is exact. Companies that talk a big game on security usually do not have to.

The chat surfaced the additional piece: a third-party contractor company called Mercor reportedly had access to Mythos, and someone in the Discord group reportedly had access to Mercor. The ‘random Discord group’ framing is doing some lifting.

What did not come up:

A frontier lab that publishes about model incoherence on hard tasks is the same lab that left a frontier model behind a guessable address. The safety story has to survive contact with the engineering story or it is just marketing. Second omission: if a Discord group can guess the URL, every state-level intelligence agency probably has access too. The vetted enterprise list includes Microsoft, Apple, and others who employ hundreds of thousands of people directly and through contractors. The security perimeter is the weakest link in the contractor chain, and that link is somebody on a Discord server.

2. AI medicine: 80% wrong, from the lab that ran the study

Researchers at Mass General Brigham tested 21 large language models, including frontier general-purpose chatbots and clinical-specialist models, on differential diagnosis tasks drawn from real patient cases. The models failed to produce an appropriate diagnosis more than 80% of the time. The paper, published this month in JAMA Network Open, concludes that off-the-shelf large language models are not ready for unsupervised clinical-grade deployment. Co-author Marc Succi was unequivocal in the press release. When the same models were given the full patient dataset rather than the differential-diagnosis task, accuracy rose above 90%.

What we said on the live:

The marketing has been ahead of the evidence for two years. Every major AI lab has had a ‘medicine moment’ in its launch deck. Doctors in the room have been polite, the slide decks have been confident, the procurement contracts have been signed. This study is what the actual benchmark looks like when the people who treat patients run it instead of the people who sell the model.

Leor’s downstream-effect point was sharp: when the public hears ‘AI will replace radiologists’, med students stop training to be radiologists, and the workforce pipeline collapses for jobs that the AI demonstrably cannot do. Jensen Huang has been making the same argument. Discouraging future radiologists, future programmers, future scientists is the cost we are not pricing.

What did not come up:

The point Joseph P. Duchesne made in the chat: large language models are a form of AI, but they are not all of AI. LLMs are next-token predictors. By design, they have to pick something. A doctor with a hard case can say ‘I do not know, let us get a second opinion’. The LLM has no equivalent option. That is where most clinical hallucinations come from. The conclusion of the paper is narrower than the headline. AI under supervision in clinical settings is one conversation. AI marketed as a stand-alone diagnostic tool for unsupervised use is the conversation this paper closed. The Wednesday post on the Hot Mess paper picks up the broader ...