When the Internet Breaks: Bugs and Outages

Overview

Catastrophic software failures can seem like acts of chaos, but behind every major tech outage lies a story of human decisions, technical constraints, and cascading consequences. The July 2024 CrowdStrike incident, which Hannah describes as "the single biggest outage in the history of computing", offers a case study in what happens when critical systems fail.

Hannah and Hugh dive deep into how a seemingly minor error (a file with 21 fields when the software expected 20) managed to crash millions of Windows computers worldwide, grounding flights, shutting down hospitals, and causing billions in economic damage. Hugh walks us through the technical underpinnings of why this particular failure was so devastating—CrowdStrike's Falcon security software runs deeply embedded within Windows, making a simple mismatch catastrophic rather than merely inconvenient.
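The field-count mismatch described above can be illustrated with a small sketch. This is not CrowdStrike's code, and it says nothing about how Falcon is actually implemented; the comma-separated format and the names `EXPECTED_FIELDS`, `parse_record`, and `load_update` are assumptions made purely for illustration. It shows a reader that checks the field count up front and falls back to the last known-good data instead of reading a record it was not built for.

```python
EXPECTED_FIELDS = 20  # hypothetical: the field count the reader was built for


class FieldCountError(ValueError):
    """Raised when a record does not have the expected number of fields."""


def parse_record(line: str, expected: int = EXPECTED_FIELDS) -> list[str]:
    """Split a comma-separated record and reject any unexpected field count."""
    fields = line.split(",")
    if len(fields) != expected:
        raise FieldCountError(f"expected {expected} fields, got {len(fields)}")
    return fields


def load_update(line: str, fallback: list[str]) -> list[str]:
    """Use the new record if it is valid; otherwise keep the last known-good one."""
    try:
        return parse_record(line)
    except FieldCountError:
        return fallback
```

In ordinary application code, a failed check like this is an inconvenience. The episode's point is that software running deep inside the operating system has far less room for error, which is why the mismatch became catastrophic.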

The conversation explores safeguards many companies use that could have prevented this disaster: progressive rollouts, chaos engineering (such as Netflix's deliberately disruptive Chaos Monkey), and fuzz testing, which generates random inputs to break systems before they reach production. Hugh shares war stories from his own career, including a nine-hour eBay search outage that cost millions and a Google Maps bug that inadvertently became an international incident when labels disappeared from politically sensitive regions.
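Two of these safeguards are simple enough to sketch in a few lines. The example below is a minimal illustration under stated assumptions, not any company's real tooling: `in_rollout`, `parse_record`, and `fuzz` are hypothetical names, and the 20-field parser is a stand-in for whatever component is under test. `in_rollout` shows a progressive rollout that only reaches a chosen percentage of machines, and `fuzz` shows a fuzz test that throws randomly generated records at a parser.

```python
import hashlib
import random


def in_rollout(machine_id: str, percent: int) -> bool:
    """Progressive rollout: deterministically place a machine in a 0-99 bucket."""
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def parse_record(line: str, expected: int = 20) -> list[str]:
    """Hypothetical parser under test: rejects unexpected field counts."""
    fields = line.split(",")
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    return fields


def fuzz(parser, runs: int = 1000, seed: int = 0) -> int:
    """Fuzz test: feed random records; anything other than ValueError propagates."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(runs):
        n_fields = rng.randint(0, 40)
        line = ",".join(str(rng.randint(0, 9)) for _ in range(n_fields))
        try:
            parser(line)
        except ValueError:
            rejected += 1  # a clean rejection is the desired outcome
    return rejected
```

In a rollout like this, the percentage would typically be raised in steps (for example 1, then 10, then 100) while crash reports are watched, so that a bad update is caught while it affects only a small share of machines.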

What's particularly fascinating is the cultural side of managing technical risk. The most resilient organizations have moved beyond blame to create environments where finding bugs is celebrated rather than punished. Hugh and Hannah discuss how former military personnel often excel in operations roles during crises, bringing calm structure to chaotic situations, and why the best tech companies are working toward systems so resilient that engineers being woken up at night is becoming unnecessary.

Whether you're part of a tech or tech-enabled company or simply curious about the infrastructure powering our lives, this episode reveals the balance between innovation speed and operational stability that every technology organisation must navigate. How do you move fast without breaking things? How do you recover when systems inevitably fail? And what separates organisations that learn from failure from those doomed to repeat it?

If you’ve enjoyed this episode, please like, subscribe, or follow the Tech Overflow Podcast at https://linktr.ee/Techoverflowpodcast and share it with your friends and colleagues.
