Reliability Enablers

By: Ash Patel & Sebastian Vietz

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.

read.srepath.com
Ash P
Economics
Episodes
  • What the Agentic AI is happening to SRE?
    2026/06/12
    What if agentic AI makes SRE more important, not less? Bennett Gould explains why autonomous AI systems may create more demand for reliability thinking, not less.

    Everyone seems to think AI is coming for SRE in a hard way. You might have heard the same story:

    * “AI will write the code.”
    * “Agents will handle incidents.”
    * “Copilots will generate the runbooks.”
    * “Automation will reduce operational load.”

    Yes, the job question is real. If AI can write code, summarize incidents, query observability tools, generate runbooks, and operate across systems, then engineers are right to ask what happens to the work.

    But here’s the part that gets missed: AI does not just automate reliability work. It creates more objects and surface areas that need to be made reliable.

    Agentic AI is moving from demos into real workflows. These systems are no longer just answering questions. They are querying tools, pulling context, generating changes, and in some cases taking action around production environments.

    That makes this a Monday morning problem.

    Teams are already using LLMs for incidents, documentation, observability, infrastructure, and operational decision-making. Somewhere, a team is one demo away from giving an agent access to tools originally designed for humans.

    That is exactly why I wanted to have this conversation.

    Bennett Gould is currently a solution engineer at Neubird.ai. His career in SRE and SRE-adjacent work spans large enterprises, cloud, industrial technology, and startups, including AWS, IBM, Siemens, and a YC startup.

    I wanted to ask him a simple question: What in the agentic AI is happening to SRE?

    Here are 3 highlights from our talk:

    1. Agentic AI increases the reliability surface area

    The obvious fear is that AI reduces the need for reliability engineers. Bennett’s view was more nuanced. He was clear that engineers still need to adapt. If people do not reskill, stay current, and learn how these systems are forming, there may absolutely be pressure in the job market.

    But he also argued that AI could create more demand for reliability skills because production complexity is increasing.

    * More code is going into production.
    * More AI-generated code is going into production.
    * More systems that people do not fully understand are going into production.
    * And now autonomous agents are starting to enter production workflows too.

    That means more surface area. More automation. More operational uncertainty. More ways for things to go wrong.

    Bennett compared this to Terraform: Infrastructure as code created enormous efficiency gains. But it also created new ways to make very big mistakes very quickly.

    Before Terraform, most people could not delete all their production resources with a single command. After Terraform, that became technically possible if the system was designed badly enough.

    Agentic AI follows a similar pattern. With great automation comes great responsibility.

    Agents can help engineers move faster, query tools, summarize context, and reduce toil. But they can also amplify weak engineering practices, poor boundaries, bad assumptions, and unclear operational ownership. That is not the end of reliability work. That is reliability work entering a new phase.

    2. Agents can reduce toil, but context is the ceiling

    One of the strongest parts of the conversation was Bennett’s explanation of where agents can help in incident response. A lot of SRE work involves moving across tools.

    You may need to query Prometheus, Dynatrace, logs, traces, cloud consoles, ticketing systems, documentation, runbooks, dashboards, and architecture diagrams.

    The problem is not always that the engineer lacks judgment.

    Sometimes the problem is that the information is scattered across too many tools, each with its own query language and interface. Bennett gave a simple example: an engineer might be very good at PromQL and very fast when Prometheus is the source of truth. But if the same engineer has to work in a different observability platform with a different query language, their response time can suffer. That is an obvious place where agents can help.

    The engineer may not need to know every query language perfectly. They need to know what they are looking for and how to reason about the system. The agent can help translate that intent into the right tool calls, queries, and summaries.

    That could reduce MTTR. It could reduce toil. It could help engineers move faster during incidents.

    But Bennett also made the limitation clear: You are only as good as the context you have. This is where he introduced two useful concepts:

    * Context mining
    * Context distillation

    Context mining means proactively finding the information that might be useful in a given operational situation.

    Context distillation means taking large amounts of information (runbooks, Confluence pages, diagrams, documentation, prior incidents) and reducing it into the minimum useful context an LLM or agent can use.

    That sounds powerful. But there is a catch. Sometimes the context simply is ...
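As a rough illustration of the context distillation idea described in these notes, here is a minimal Python sketch. It is not Bennett's method or any vendor's implementation: the `Doc` type, the `distill` function, the keyword-overlap scoring, and the word budget are all assumptions made for the example. It ranks documents by how many words they share with an incident query and keeps only those that fit a fixed budget.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    source: str  # hypothetical label, e.g. "runbook" or "postmortem"
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for real relevance scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def distill(query: str, docs: list[Doc], max_words: int) -> list[Doc]:
    """Keep the most query-relevant docs that fit within a word budget."""
    q = tokenize(query)
    scored = [(len(q & tokenize(d.text)), d) for d in docs]
    # Drop docs with no overlap, then rank by overlap (highest first).
    ranked = sorted(
        (s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True
    )
    kept, used = [], 0
    for _, doc in ranked:
        size = len(doc.text.split())
        if used + size <= max_words:  # skip anything that would exceed the budget
            kept.append(doc)
            used += size
    return kept


if __name__ == "__main__":
    docs = [
        Doc("runbook", "restart the checkout service when latency alerts fire"),
        Doc("wiki", "office lunch menu for friday"),
        Doc("postmortem", "checkout latency spike caused by database connection pool exhaustion"),
    ]
    for doc in distill("checkout latency alert", docs, max_words=20):
        print(doc.source)
```

A real system would likely replace the keyword overlap with embeddings or summarization, but the shape is the same: score, rank, and cut to the minimum the agent needs.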
    24 min
  • You (and AI) can't automate reliability away
    2025/12/02
    What if the hardest part of reliability has nothing to do with tooling or automation? Jennifer Petoff explains why real reliability comes from the human workflows wrapped around the engineering work.

    Everyone seems to think AI will automate reliability away. I keep hearing the same story:

    * “Our tooling will catch it.”
    * “Copilots will reduce operational load.”
    * “Automation will mitigate incidents before they happen.”

    But here’s a hard truth to swallow: AI only automates the mechanical parts of reliability, the machine in the machine.

    The hard parts haven’t changed at all.

    * You still need teams with clarity on system boundaries.
    * You still need consistent approaches to resolution.
    * You still need postmortems that drive learning rather than blame.

    AI doesn’t fix any of that. If anything, it exposes every organizational gap we’ve been ignoring. And that’s exactly why I wanted today’s guest on.

    Jennifer Petoff is Director of Program Management for Google Cloud Platform and Technical Infrastructure education. Every day, she works with SREs at Google, as well as with SREs at other companies through her public speaking and Google Cloud Customer engagements.

    Even if you have never touched GCP, you have still been influenced by her work at some point in your SRE career. She is co-editor of Google’s original Site Reliability Engineering book from 2016. Yeah, that one!

    It was my immense pleasure to have her join me to discuss the internal dynamics behind successful reliability initiatives.

    Here are 5 highlights from our talk:

    3 issues stifling individual SREs’ work

    To start, I wanted to know from Jennifer the kinds of challenges she has seen individual SREs face when attempting to introduce or reinforce reliability improvements within their teams or the broader organization.

    She categorized these challenges into 3 main categories:

    * Cultural issues (with a look into Westrum’s typology of organizational culture)
    * Insufficient buy-in from stakeholders
    * Inability to communicate the value of reliability work

    Organizations with generative cultures have 30% better organizational performance.

    A key highlight from this topic came from her look at DORA research, an annual survey of thousands of tech professionals and the research upon which the book Accelerate is based.

    It showed that organizations with generative cultures have 30% better organizational performance. In other words, you can have the best technology, tools, and processes to get good results, but culture further raises the bar. A generative culture also makes it easier to implement the more technical aspects of DevOps or SRE that are associated with improved organizational performance.

    Hands-on is the best kind of training

    We then explored structured approaches that ensure consistency, build capability, and deliberately shape reliability culture. As they say: culture eats strategy for breakfast!

    One key example Jennifer gave was the hands-on approach they take at Google. She believes that adults learn by doing. In other words, SREs gain confidence by doing hands-on work. Where possible, training programs should move away from passive listening to lectures toward hands-on exercises that mimic real SRE work, especially troubleshooting.

    One specific exercise that Google has built internally is Simulating Production Breakages. Engineers undergoing that training have a chance to troubleshoot a real system built for this purpose in a safe environment.

    The results have been profound, with a tremendous amount of confidence that Jennifer’s team saw in survey results. This confidence is focused on job-related behaviors, which when repeated over time reinforce that culture of reliability.

    Reliability is mandatory for everybody

    Another thing Jennifer told me Google did differently was making reliability a mandatory part of every engineer’s curriculum, not only SREs.

    When we first spun up the SRE Education team, our focus was squarely on our SREs. However, that’s like preaching to the choir. SREs are usually bought into reliability. A few years in, our leadership was interested in propagating the reliability-focused culture of SRE to all of Google’s development teams, a challenge an order of magnitude greater than training SREs.

    How did they achieve this mandate?

    * They developed a short and engaging (and mandatory) production safety training
    * That training has now been taken by tens of thousands of Googlers
    * Jennifer attributes this initiative’s success to how they “SRE’ed the program”. “We ran a canary followed by a progressive roll-out. We instituted monitoring and set up feedback loops so that we could learn and drive continuous improvement.”

    The result of this massive effort? A very respectable 80%+ net promoter score with open text feedback: “best required training ever.”

    What made this program successful is that Jennifer and her team SRE’d its design and iterative improvement. You can learn more about “How to SRE anything” (from ...
    28 min
  • #67 Why the SRE Book Fails Most Orgs — Lessons from a Google Veteran
    2025/07/15

    A new or growing SRE team. A copy of the book. A company that says it cares about reliability. What happens next? Usually… not much.

    In this episode, I sit down with Dave O’Connor, a 16-year Google SRE veteran, to talk about what happens when organizations cargo-cult reliability practices without understanding the context they were born in.

    You might know him for his self-deprecating wit and legendary USENIX blurb about being “complicit in the development of the SRE function.”

    This one’s a treat — less “here’s a shiny new tool” and more “here’s what reliability actually looks like when you’ve seen it all.”

    No vendor plugs from Dave at all, just a good old-fashioned chat about what works and what doesn’t.

    Here’s what we dive into:

    * The adoption trap: Why SRE efforts often fail before they begin—especially when new hires care more about reliability than the org ever intended.

    * The SRE book dilemma: Dave’s take on why following the SRE book chapter-by-chapter is a trap for most companies (and what to do instead).

    * The cost of “caring too much”: How engineers burn out trying to force reliability into places it was never funded to live.

    * You build it, you run it (but should you?): Not everyone’s cut out for incident command—and why pretending otherwise sets teams up to fail.

    * Buying vs. building: The real reason even conservative enterprises are turning into software shops — and the reliability nightmare that follows.

    We also discuss the evolving role of reliability in organizations today, from being mistaken for “just ops” to becoming a strategic investment (when done right).

    Dave's seen the waves come and go in SRE — and he's still optimistic. That alone is worth a listen.



    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    31 min