Episodes

  • Mastering SRE: Insights in Scale and at Capacity with Aimee Knight
    2025/06/21
    In this episode, Aimee Knight, a Site Reliability Engineering (SRE) expert with experience at Paramount and NPM, joins the podcast to discuss her journey into SRE, the challenges she faced, and the strategies she employed to succeed. Aimee shares her transition from a non-traditional background in JavaScript development to SRE, highlighting the importance of understanding both the programming and infrastructure sides of engineering. She also delves into the complexities of SRE at different scales, the role of playbooks in incident management, and the balance between speed and quality in software development.

    Aimee discusses the impact of AI and machine learning on SRE, emphasizing the need for responsible use of these tools. She touches on the importance of understanding business needs and how they affect decision-making in SRE roles. The conversation also covers the trade-offs in system design, the challenges of scaling applications, and the importance of resilience in distributed systems. Aimee provides valuable insights into the pros and cons of a career in SRE, including the importance of self-care and the satisfaction of mentoring others.

    The episode concludes with a discussion of some hard problems, such as the on-call burden for large teams and the technical expertise an org needs to maintain higher-complexity systems. Is the average tenure in tech decreasing? We discuss it and do a deep dive on the consequences in the SRE world.

    Picks​
    • The Adventures In DevOps: Survey
    • Warren's Technical Blog
    • Warren: The Fifth Discipline by Peter Senge
    • Aimee: Sleep Token (Band) - Caramel, Granite
    • Will: The Bear Grylls Celebrity Hunt on Netflix
    • Jillian: Horizon Zero Dawn Video Game
    1 hr 18 min
  • Exploring MCP Servers and Agent Interactions with Gil Feig
    2025/06/14
    In this episode, we delve into the concept of MCP (Model Context Protocol) servers and their role in enabling agent interactions. Gil Feig, the co-founder and CTO of Merge, shares insights on how MCP servers facilitate efficient and secure integration between various services and APIs.

    The discussion covers the benefits and challenges of using MCP servers, including their stateful nature, security considerations, and the importance of understanding real-world use cases. Gil emphasizes the need for thorough testing and evaluation to ensure that MCP servers effectively meet user needs.

    Additionally, we explore the implications of MCP servers on data security, scaling, and the evolving landscape of API interactions. Warren chimes in with experiences integrating AI with Auth. Will stuns us with some nuclear fission history. And finally, we also touch on the balance between short-term innovation and long-term stability in technology, reflecting on how different generations approach problem-solving and knowledge sharing.

    Picks
    • The Adventures In DevOps: Survey
    • Warren: The Magicians by Lev Grossman
    • Gil: Constant Escapement in Watchmaking
    • Will: Dungeon Crawler Carl & Atmos Clock
    1 hr 5 min
  • No Lag: Building the Future of High-Performance Cloud with Nathan Goulding
    2025/06/09
    Warren talks with Nathan Goulding, SVP of Engineering at Vultr, about what it actually takes to run a high-performance cloud platform. They cover everything from global game server latency and hybrid models to bare metal provisioning and the power/cooling constraints that come with modern GPU clusters.

    The discussion gets into real-world deployment challenges like scaling across 32 data centers, edge use cases that actually matter, and how to design systems for location-sensitive customers, whether that’s due to regulation or performance. Additionally, there's talk about where the hyperscalers have overcomplicated pricing and where the simplicity of a flatter pricing model and optimized defaults is better for everyone.

    There’s a section on nuclear energy (yes, really), including SMRs, power procurement, and what it means to keep scaling compute with limited resources. If you're wondering whether your app actually needs high-performance compute or just better visibility into your costs, this is the episode.

    Picks​
    • The Adventures In DevOps: Survey
    • Warren: Jetlag: The Game
    • Nathan: Money Heist (La Casa de Papel)
    1 hr 1 min
  • Ground Truth & Guided Journeys: Rethinking Data for AI with Inna Tokarev Sela
    2025/06/04
    Inna Tokarev Sela, CEO and founder of Illumex, joins the crew to break down what it really means to make your data “AI-ready.” This isn’t just about clean tables; it’s about semantic fabric, business ontologies, and grounding agents in your company's context to prevent the dreaded LLM hallucination. We dive into how modern enterprises just cannot build a single source of truth, no matter how hard they try, all the while knowing that one is required to build effective agents that use the available knowledge graphs.

    The conversation unpacks democratizing data access and avoiding analytics anarchy. Inna explains how automation and graph modeling are used to extract semantic meaning from disconnected data stores, and how to resolve conflicting definitions. And yes, Warren finally coughs up what's so wrong with most dashboards.

    Lastly, we quickly get to the core philosophical questions of agentic systems and AGI, including why intuition is the real differentiator between humans and machines. Plus: storage cost regrets, spiritual journeys disguised as inference pipelines, and a very healthy fear of subscription-based sleep wearables.

    Picks​
    • The Adventures In DevOps: Survey
    • Warren: The Non-Computability of Intuition
    • Will: The Arc Browser
    • Inna: Healthy GenAI skepticism
    53 min
  • Incident Vibing: The Self-Healing System - DevOps 242
    2025/05/29
    Sylvain Kalache, Head of Developer Relations at Rootly, joins us to explore the new frontier of incident response powered by large language models. We dive into the evolution of DevRel and how we meet the new challenges impacting our systems.

    We explore Sylvain's origin story in self-healing systems, dating back to his SlideShare and LinkedIn days. From ingesting logs via Fluentd to building early ML-driven RCA tools, he shares a vision of self-healing infrastructure that targets root causes rather than just restarting boxes. Plus, we trace the historical arc of deterministic and non-deterministic tools.

    The conversation shifts toward real-world applications, where we're combining logs, metrics, transcripts, and postmortems to give SREs superpowers. We get tactical on integrating LLMs, why fine-tuning isn't always worth it, and how the Model Context Protocol (MCP) could be the USB of AI ops, even though it is still insecure. We wrap by facing the harsh reality of "incident vibing" in a world increasingly built by prompts, not people, and how to prepare for it.

    Picks
    • Warren: There is no AI Revolution
    • Sylvain: Incident Vibing and Rootly Labs SRE event on April 24th
    1 hr 10 min
  • Decentralized Chaos: Web3 Infra, NodeOps, and the Art of Blockchain Load Balancing - DevOps 241
    2025/05/22
    This week, Paul Marston from Ankr joins the crew to unpack the madness that is modern blockchain infrastructure. From his wild career transition out of financial services into 24/7 node ops for Web3, Paul shares the brutal truth about uptime expectations, decentralization challenges, and why hard forks are more like enterprise schema upgrades with a community twist. If you’ve ever wondered why managing a blockchain node is like owning a temperamental pet server, this one’s for you.

    The team goes deep on the nitty-gritty of load balancing across dozens of chains, explaining why routing traffic to the “wrong” archive node could ruin your day—and how Ankr’s custom load balancer is basically magic for JSON-RPC calls. Warren tosses out wild scenarios about encrypted data smuggling via blockchain, while Will confesses his angry typing habit (yes, it’s back). The discussion gets even more fun with debates on innovation vs. rigor, Web2's forgotten best practices, and why testing in prod might not be such a dirty word after all.

    But don’t think it’s all crypto and code. Paul shares battle-won wisdom from running over 100 chains across bare metal, giving us a peek at the operational sophistication and automation involved. From Terraform templates to Docker configs, he walks through the process of onboarding new chains and tuning for performance. The episode also touches on emerging risks like data exfiltration via public blockchains, and why AI (used wisely) might just be the sidekick DevOps always needed.

    And of course, memes: we talk a bit about this one, Tree Swing Product Development.

    Picks​
    • Warren: Dvorak Keyboard Setup and Logitech K295
    • Will: Quirky Record Player from Miniot
    • Paul: Super Whisper - Voice Transcription Tool
    1 hr 16 min
  • Observability in the CI/CD Pipeline with Adriana Villela - DevOps 240
    2025/05/15
    In this episode, Will and Warren welcome Adriana Villela — CNCF ambassador, Dynatrace advocate, and host of the Geeking Out podcast — for a wide-ranging conversation on observability in CI/CD pipelines. Adriana shares her journey from “On Call Me Maybe” to her own podcast, her work with OpenTelemetry, and why observability isn’t just for SREs anymore.

    The crew digs into how telemetry should be integrated across the software development lifecycle — from development to QA to production — and what that really looks like in modern teams. Adriana drops knowledge on CI/CD failures, distributed traces, and even how to bring observability to other parts of the business like recruiting and onboarding. She also explains how she got involved in the OpenTelemetry end-user SIG and what’s next for the observability movement.

    Things get personal as we trade war stories about SVN, terrible version control systems, reusable grocery bags, and the ethics of AI log parsers. Adriana closes with a powerful take: observability is a team sport, and the better we play it, the more effective, and environmentally conscious, our systems can become.

    Picks
    • Warren: Adventures In DevOps survey - How can we make it better for you?
    • Adriana: Bouldering — she recommends it both as a physical activity and a therapeutic mental reset, especially when traveling
    • Jillian: Expeditionary Force
    • Will: Iron Neck and Purpose & Prophet
    1 hr 21 min
  • Building Engineering Excellence with Ganesh Datta of Cortex - DevOps 239
    2025/05/08
    In this episode, I (flying solo today!) sat down with Ganesh Datta, the CTO and co-founder of Cortex, to explore what it really means to drive engineering excellence at scale. And spoiler: it’s not just about better dashboards or fancy developer tools—it’s about treating software development like the competitive advantage it is.

    We went deep into the why behind internal developer portals (IDPs) and how they’re transforming platform engineering, developer experience, and organizational maturity. Ganesh shares how Cortex came to life—from being paged at 2am for a mystery Game of Thrones-named microservice (yep, we've all been there), to realizing that every other business function had a system of record—except engineering.

    Key Takeaways:
    • IDPs are like CRMs for Engineering: Just as sales teams wouldn’t function without a CRM, modern engineering orgs shouldn’t be flying blind without a structured, centralized developer portal.
    • Engineering Excellence = Business Outcomes: Whether it’s reliability, security, or platform efficiency, IDPs help codify best practices and align teams toward measurable goals.
    • Start Small to Win Big: You don’t need to overhaul everything on day one. Start with a pain point you already know—like production readiness—and improve that incrementally.
    • SREs and Platform Engineers Love IDPs: Because IDPs give them the data, ownership visibility, and real-time checks they need, without the honor-system chaos.
    • Developer Experience is Just the Beginning: Tools like Cortex aren’t just about dev productivity—they’re about creating resilient, aligned, scalable engineering orgs.
    We also geeked out about everything from naming services (“Brewer” for a feature extraction tool? Chef’s kiss.) to the surprising power of reading 15 minutes before bed to improve sleep quality—yep, we went there!

    If you’re part of an engineering team (or leading one) and want to know how to move faster and smarter, this is the episode for you.
    51 min