
The VOID


By: Courtney Nash

About this content

The VOID makes public software-related incident reports available to everyone, raising awareness and increasing understanding of software-based failures in order to make the internet a more resilient and safe place. This podcast is an insider's look at software-related incident reports. Each episode, we pull an incident report from the VOID (https://www.thevoid.community/) and invite the author(s) on to discuss their experience both with the incident itself and with the process of analyzing and writing it up for others to learn from.

© 2025 The VOID
Episodes
  • Uptime Labs and the Multi-Party Dilemma (Part II)
    2025/08/06

    Watch on YouTube

    In Part II of the Multi-Party Dilemma (MPD) drill retrospective, we reconvene to dig deeper into the implications and nuances of the simulated incident exercise hosted on the Uptime Labs platform. Eric Dobbs (incident analyst), Alex Elman (deputy IC), and Sarah Butt (incident commander) continue their debrief with Courtney, reflecting on how team behavior evolved under stress, the importance of expertise in managing non-technical aspects of an incident like saturation, and how deeply held assumptions often go unspoken until tested under pressure.

    This episode emphasizes the complex social and cognitive dimensions of incident response, such as how people coordinate, communicate, and construct shared understanding. It highlights the value of analyzing drills not for failure points, but for what they reveal about real work, adaptation, and human coordination.

    Key Highlights

    • Incident Analysis as a Practice:
      • Eric Dobbs emphasized understanding how people make sense of unfolding events, rather than judging decisions in hindsight.
      • The goal is to study the “why it made sense at the time,” not what was “right” or “wrong.”
    • Drills Expose Hidden Assumptions:
      • Even experienced responders bring unspoken mental models into incidents.
      • The drill revealed assumptions about communication flows, authority boundaries, and vendor interactions that were not made explicit in planning.
    • The Value of Human Expertise:
      • Everyone involved in this incident brought an unparalleled level of expertise to the work.
      • Often this kind of expertise goes unnoticed or is taken for granted; however, this kind of knowledge is precisely what makes for smoother, better coordinated, and sometimes faster incident response.
    • Importance of Framing:
      • The way questions are asked in retrospectives can shape what is revealed—e.g., “What made that hard?” is more productive than “What did you miss?”
      • Reframing incidents around constraints and tradeoffs leads to deeper insight.
    • Team Learning and Culture:
      • Safe, high-trust environments enable better learning during drills.
      • Psychological safety allows team members to admit confusion or raise alternate interpretations during real incidents.

    Resources and References

    • Episode I
    • Model of Overload/Saturation as part of the Theory of Graceful Extensibility
    • Lorin's Law
    56 min
  • Uptime Labs and the Multi-Party Dilemma (Part I)
    2025/07/29

    Watch on YouTube

    In this episode I'm joined by a group of seasoned incident response professionals to discuss a simulated incident drill conducted on the Uptime Labs platform. The conversation centers around the Multi-party Dilemma—the challenge of coordinating incident response across teams or organizations with different missions, contexts, or incentives.

    Eric Dobbs, our incident analyst, joins to break down the drill and provide deep insights into the incident dynamics, team interactions, and what true incident analysis looks like when it's done well. Participants Alex Elman and Sarah Butt, who served as deputy and lead incident commanders respectively during the drill, recount their roles and experiences, highlighting realistic stress responses, decision-making, and coordination failures and successes. Hamed Silatani, CEO of Uptime Labs, provides context and insights into the behind-the-scenes work he and his team provide as the other "characters" driving the narrative of the drill.

    The episode uniquely showcases the value of structured incident analysis and the benefits of using drills to expose hidden assumptions and improve resilience in complex systems.

    A few key highlights include:

    • How detailed incident analysis leads to an understanding of the context and rationale behind responders' actions, rather than identifying errors or assigning blame.
    • The real goal is to learn how the system and people actually function, not just fix a broken component.
    • Themes revealed by the analysis and subsequent discussion:
      • Saturation and the value of trust in delegation (especially between Sarah and Alex).
      • The role of deep expertise and how it often makes work appear effortless.
      • Importance of recognizing the real work done during incidents—often messy and improvisational.

    References/Resources

    • What Experts See That the Rest of Us Miss During Incidents
    • Incident Fest (Uptime Labs event)
    • Law of Fluency
    • Handling the Multi-Party Dilemma (Sarah & Alex paper)
    • Embracing the Multi-Party Dilemma (Sarah & Alex conference talk)




    48 min
  • Canva and the Thundering Herd
    2025/05/14

    Greetings fellow incident nerds, and welcome to Season 2 of The VOID podcast. The main new thing for this new season is we’re now available in video—so if you’re listening to this and prefer watching me make odd faces and nod a lot, you can find us here on YouTube.

    The other new thing is we now have sponsors! These folks help make this podcast possible, but they don’t have any say over who joins us or what we talk about, so fear not.

    This episode’s sponsor is Uptime Labs. Uptime Labs is a pioneering platform specializing in immersive incident response training. Their solution helps technical teams build confidence and expertise through realistic simulations that mirror real-world outages and security incidents. While most of the investment in the incident space these days goes to technology and process, Uptime Labs focuses on sharpening the human element of incident response.

    In this episode, we talk to Simon Newton, Head of Platforms at Canva, about their first public incident report. It’s not their first incident by any means, but it’s the first time they chose as a company to invest in sharing the details of an incident with the rest of us, which of course we’re big fans of here at the VOID.

    We discuss:

    • What led to Canva finally deciding to publish a public incident report
    • What the size and nature of their incident response looks like (this incident involved around 20 different people!)
    • Their progression from a handful of engineers handling incidents to having a dedicated Incident Command (IC) role
    • Avoiding blame when a known performance fix was ready to be deployed but hadn't been yet, which contributed to the incident getting worse as it progressed
    • The various ways the people involved in the incident collaborated and improvised to resolve it


    37 min