What the Agentic AI is happening to SRE?

What if agentic AI makes SRE more important, not less? Bennett Gould explains why autonomous AI systems may create more demand for reliability thinking, not less.

Everyone seems to think AI is coming for SRE in a hard way.

You might have heard the same story:

“AI will write the code.”
“Agents will handle incidents.”
“Copilots will generate the runbooks.”
“Automation will reduce operational load.”

Yes, the job question is real. If AI can write code, summarize incidents, query observability tools, generate runbooks, and operate across systems, then engineers are right to ask what happens to the work.

But here’s the part that gets missed: AI does not just automate reliability work. It creates more objects and surface areas that need to be made reliable.

Agentic AI is moving from demos into real workflows. These systems are no longer just answering questions. They are querying tools, pulling context, generating changes, and in some cases taking action around production environments.

That makes this a Monday morning problem.

Teams are already using LLMs for incidents, documentation, observability, infrastructure, and operational decision-making. Somewhere, a team is one demo away from giving an agent access to tools originally designed for humans.

That is exactly why I wanted to have this conversation.

Bennett Gould is currently a solution engineer at Neubird.ai. His career in SRE and SRE-adjacent work spans large enterprises, cloud, industrial technology, and startups, including AWS, IBM, Siemens, and a YC startup.

I wanted to ask him a simple question: What in the agentic AI is happening to SRE?

Here are 3 highlights from our talk:

1. Agentic AI increases the reliability surface area

The obvious fear is that AI reduces the need for reliability engineers. Bennett’s view was more nuanced. He was clear that engineers still need to adapt. If people do not reskill, stay current, and learn how these systems are forming, there may absolutely be pressure in the job market.

But he also argued that AI could create more demand for reliability skills because production complexity is increasing.

More code is going into production. More AI-generated code is going into production. More systems that people do not fully understand are going into production. And now autonomous agents are starting to enter production workflows too.

That means more surface area. More automation. More operational uncertainty. More ways for things to go wrong.

Bennett compared this to Terraform: Infrastructure as code created enormous efficiency gains. But it also created new ways to make very big mistakes very quickly.

Before Terraform, most people could not delete all their production resources with a single command. After Terraform, that became technically possible if the system was designed badly enough.

Agentic AI follows a similar pattern. With great automation comes great responsibility.

Agents can help engineers move faster, query tools, summarize context, and reduce toil. But they can also amplify weak engineering practices, poor boundaries, bad assumptions, and unclear operational ownership. That is not the end of reliability work. That is reliability work entering a new phase.

2. Agents can reduce toil, but context is the ceiling

One of the strongest parts of the conversation was Bennett’s explanation of where agents can help in incident response. A lot of SRE work involves moving across tools.

You may need to query Prometheus, Dynatrace, logs, traces, cloud consoles, ticketing systems, documentation, runbooks, dashboards, and architecture diagrams.

The problem is not always that the engineer lacks judgment.

Sometimes the problem is that the information is scattered across too many tools, each with its own query language and interface. Bennett gave a simple example: an engineer might be very good at PromQL and very fast when Prometheus is the source of truth. But if the same engineer has to work in a different observability platform with a different query language, their response time can suffer. That is an obvious place where agents can help.

The engineer may not need to know every query language perfectly. They need to know what they are looking for and how to reason about the system. The agent can help translate that intent into the right tool calls, queries, and summaries.

That could reduce MTTR. It could reduce toil. It could help engineers move faster during incidents.

But Bennett also made the limitation clear: You are only as good as the context you have. This is where he introduced two useful concepts:

* Context mining
* Context distillation

Context mining means proactively finding the information that might be useful in a given operational situation.

Context distillation means taking large amounts of information (runbooks, Confluence pages, diagrams, documentation, prior incidents) and reducing it into the minimum useful context an LLM or agent can use.

That sounds powerful. But there is a catch. Sometimes the context simply is ...