The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

By: Fexingo

Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept (error budgets, toil automation, postmortem culture, capacity planning) and grounds it in a specific case: how a major streaming service reduced paging noise, how a payments platform rebuilt its incident command structure, or how a cloud provider manages multi-region failover. Lucas brings the numbers (latency percentiles, MTTR trends, SLO burn rates), while Luna pushes on the human and organizational trade-offs: What does a junior SRE need to know about on-call? How do you measure reliability without crushing innovation? Why do some blameless postmortems actually work? Together they treat SRE not as a certification topic but as a living practice, citing real outages, open-source tools, and engineering blogs. This show is for engineers, ops leads, and platform teams who already know the basics and want to debate the hard edges: Is 99.999% uptime always worth the cost? When should you deliberately degrade service to improve reliability? How do you design for resilience when your system is already in production? Lucas and Luna don't pretend to have final answers; they build the conversation so you can draw your own. If you've ever argued about whether a page was necessary or whether an SLO should be tightened, this is your show.

#SiteReliabilityEngineering #SRE #Uptime #ProductionEngineering #IncidentResponse #ErrorBudgets #SLOs #Postmortem #ToilAutomation #CapacityPlanning #Observability #DevOps #PlatformEngineering #Resilience #OnCall #FexingoBusiness #BusinessPodcast #Technology

Keep every episode free: buymeacoffee.com/fexingo

© 2026 Fexingo. All rights reserved.

Category: Economics
Episodes
  • How SRE Teams Use Blameless Postmortems to Build Better Systems
    2026/06/06
    In this episode of The Site Reliability Podcast, Lucas and Luna explore how blameless postmortems go beyond simple incident analysis to drive real systemic improvements. Using the example of a major payment processor incident in early 2026, they break down the anatomy of an effective blameless postmortem: separating human error from system design flaws, writing actionable recommendations, and tracking follow-ups. They discuss common pitfalls like blame drift and incomplete data, and share how one SRE team at a mid-size SaaS company reduced repeat incidents by 40 percent after adopting a structured blameless process. If you're looking to turn outages into learning opportunities, this episode offers a practical playbook. #BlamelessPostmortems #SRE #SiteReliabilityEngineering #IncidentManagement #ProductionEngineering #Uptime #RootCauseAnalysis #DevOps #Reliability #LearningFromFailure #BlamelessCulture #IncidentResponse #SaaSSRE #TechOps #Technology #FexingoBusiness #BusinessPodcast #TheSiteReliabilityPodcast Keep every episode free: buymeacoffee.com/fexingo
    9 min
  • How SRE Teams Use Postmortems That Actually Change Behavior
    2026/06/06
    In this episode of The Site Reliability Podcast, Lucas and Luna dig into the one incident-documentation practice most teams get wrong: the postmortem. Most postmortems are filed and forgotten. Lucas walks through how Google's SRE team shifted from blame-free to action-oriented postmortems, using a concrete example from their own 2017 Gmail outage. He breaks down the difference between a cause and a contributing factor, and explains why the 'action items' list is usually the weakest part. Luna pushes back on the idea that postmortems should always be public, and they discuss how psychological safety changes whether people actually report the truth. The episode closes with a practical takeaway: if your postmortem doesn't change how you deploy, monitor, or alert, it's a report, not a postmortem. #SRE #SiteReliabilityEngineering #Postmortems #IncidentResponse #BlamelessCulture #GoogleSRE #GmailOutage #ActionItems #PsychologicalSafety #IncidentAnalysis #ReliabilityEngineering #DevOps #FexingoBusiness #BusinessPodcast #Technology #LearningFromFailure #ContinuousImprovement #RootCauseAnalysis Keep every episode free: buymeacoffee.com/fexingo
    8 min
  • How SRE Teams Use Runbook Automation to Reduce Human Error
    2026/06/05
    In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical side of runbook automation — moving beyond static documentation to executable, automated responses. They explore how companies like Google and Netflix use runbook automation to reduce mean time to repair by up to 60%, and discuss the common pitfalls: over-automation, stale runbooks, and the tension between speed and safety. Lucas shares a concrete example from a major e-commerce platform where automated runbooks cut incident response time from 45 minutes to under 5. Luna challenges whether automation can replace human judgment in complex outages. The conversation also touches on tools like Rundeck, PagerDuty Automation, and custom Slack bots. By the end, listeners will understand the key principles for building runbooks that actually get followed in the heat of an incident. #SiteReliabilityEngineering #RunbookAutomation #SRE #IncidentResponse #DevOps #Automation #GoogleSRE #Netflix #PagerDuty #Rundeck #MeanTimeToRepair #Technology #ProductionEngineering #Uptime #FexingoBusiness #BusinessPodcast #TechOps #OnCall Keep every episode free: buymeacoffee.com/fexingo
    8 min