You (and AI) can't automate reliability away

About this content

What if the hardest part of reliability has nothing to do with tooling or automation? Jennifer Petoff explains why real reliability comes from the human workflows wrapped around the engineering work.

Everyone seems to think AI will automate reliability away. I keep hearing the same story: “Our tooling will catch it.” “Copilots will reduce operational load.” “Automation will mitigate incidents before they happen.”

But here’s a hard truth to swallow: AI only automates the mechanical parts of reliability, the machine in the machine.

The hard parts haven’t changed at all.

You still need teams with clarity on system boundaries.
You still need consistent approaches to resolution.
You still need postmortems that drive learning rather than blame.

AI doesn’t fix any of that. If anything, it exposes every organizational gap we’ve been ignoring. And that’s exactly why I wanted today’s guest on.

Jennifer Petoff is Director of Program Management for Google Cloud Platform and Technical Infrastructure education. Every day, she works with SREs at Google, as well as with SREs at other companies through her public speaking and Google Cloud customer engagements.

Even if you have never touched GCP, you have still been influenced by her work at some point in your SRE career. She is co-editor of Google’s original Site Reliability Engineering book from 2016. Yeah, that one!

It was my immense pleasure to have her join me to discuss the internal dynamics behind successful reliability initiatives.
Here are 5 highlights from our talk:

3 issues stifling individual SREs’ work

To start, I asked Jennifer what challenges she has seen individual SREs face when they try to introduce or reinforce reliability improvements within their teams or the broader organization.

She grouped these challenges into 3 main categories:

* Cultural issues (with a look into Westrum’s typology of organizational culture)
* Insufficient buy-in from stakeholders
* Inability to communicate the value of reliability work

A key highlight from this topic came from her look at DORA research, an annual survey of thousands of tech professionals and the research upon which the book Accelerate is based.

It showed that organizations with generative cultures have 30% better organizational performance. In other words, the best technology, tools, and processes can get you good results, but culture raises the bar further. A generative culture also makes it easier to implement the more technical aspects of DevOps or SRE that are associated with improved organizational performance.

Hands-on is the best kind of training

We then explored structured approaches that ensure consistency, build capability, and deliberately shape reliability culture. As they say, culture eats strategy for breakfast!

One key example Jennifer gave was the hands-on approach they take at Google. She believes that adults learn by doing. In other words, SREs gain confidence by doing hands-on work. Where possible, training programs should move away from passive listening to lectures and toward hands-on exercises that mimic real SRE work, especially troubleshooting.

One specific exercise that Google has built internally is Simulating Production Breakages. Engineers taking that training get to troubleshoot a real system, built for this purpose, in a safe environment.
The results have been profound: Jennifer’s team saw a tremendous amount of confidence reflected in survey results. This confidence is focused on job-related behaviors, which, when repeated over time, reinforce that culture of reliability.

Reliability is mandatory for everybody

Another thing Jennifer told me Google did differently was making reliability a mandatory part of every engineer’s curriculum, not only for SREs.

“When we first spun up the SRE Education team, our focus was squarely on our SREs. However, that’s like preaching to the choir. SREs are usually bought into reliability. A few years in, our leadership was interested in propagating the reliability-focused culture of SRE to all of Google’s development teams, a challenge an order of magnitude greater than training SREs.”

How did they achieve this mandate?

* They developed a short and engaging (and mandatory) production safety training
* That training has now been taken by tens of thousands of Googlers
* Jennifer attributes this initiative’s success to how they “SRE’ed the program”. “We ran a canary followed by a progressive roll-out. We instituted monitoring and set up feedback loops so that we could learn and drive continuous improvement.”

The result of this massive effort? A very respectable 80%+ net promoter score with open text feedback: “best required training ever.”

What made this program successful is that Jennifer and her team SRE’d its design and iterative improvement. You can learn more about “How to SRE anything” (from ...