Reliability Enablers

By: Ash Patel & Sebastian Vietz
About this content

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.

read.srepath.com Ash P
Economics
Episodes
  • #65 - In Critical Systems, 99.9% Isn’t Reliable — It’s a Liability
    2025/06/17

    Most teams talk about reliability with a margin for error. “What’s our SLO? What’s our budget for failure?”

    But in the energy sector? There is no acceptable downtime. Not even a little.

    In this episode, I talk with Wade Harris, Director of FAST Engineering in Australia, who’s spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it.

    What makes this episode different is that Wade isn’t a reliability engineer by title, but it’s baked into everything his team touches. And that matters more than ever as software creeps deeper into operational technology (OT), and the cloud tries to stake its claim in critical systems.

    We cover:

    * Why 100% uptime is the minimum bar, not a stretch goal

    * How the rise of renewables has increased system complexity — and what that means for monitoring

    * Why bespoke integration and SCADA spaghetti are still normal (and here to stay)

    * The reality of cloud risk in critical infrastructure (“the cloud is just someone else’s computer”)

    * What software engineers need to understand if they want their products used in serious environments

    This isn’t about observability dashboards or DevOps rituals. This is reliability when the lights go out and people risk getting hurt if you get it wrong.

    And it’s a reminder: not every system lives in a feature-driven world. Some systems just have to work. Always. No matter what.



    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    28 min
  • #64 - Using AI to Reduce Observability Costs
    2025/01/28

    Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions.

    It's been a hot minute since the last episode of the Reliability Enablers podcast.

    Sebastian and I have been working on a few things in our realms. On a personal and work front, I’ve been to over 25 cities in the last 3 months and need a breather.

    Meanwhile, listen to this interesting vendor, Ruchir Jha from Cardinal, who is working on the cutting edge of o11y to help keep costs from spiraling out of control.

    (To the skeptics: he did not pay me for this episode.)

    Here’s an AI-generated summary of what you can expect in our conversation:

    In this conversation, we explore cutting-edge approaches to FinOps, i.e., cost optimization for observability.

    You'll hear about three pressing topics:

    * Managing Tool Sprawl: Insights into the common challenge of juggling 5-15 tools and how to identify which ones deliver real value.

    * Reducing Observability Costs: Techniques to track and trim waste, including how to uncover cost hotspots like overused or redundant metrics.

    * AI for Observability Decisions: Practical ways AI can simplify complex data, empowering non-technical stakeholders to make informed decisions.

    We also touch on the balance between open-source solutions like OpenTelemetry and commercial observability tools.

    Learn how these strategies, informed by Ruchir's experience at Netflix, can help streamline observability operations and cut costs without sacrificing reliability.



    21 min
  • #63 - Does "Big Observability" Neglect Mobile?
    2024/11/12

    Andrew Tunall is a product engineering leader pushing the boundaries of reliability, currently focused on mobile observability. Drawing on his experience at AWS and New Relic, he’s vocal about the need for more user-focused observability, especially in mobile, where traditional practices fall short.

    * Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.

    * Specialization in Mobile Observability: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.

    * Gap in Mobile Observability: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.

    * Motivation for User-Centric Tools: Leaving “big observability” to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users.

    * Mobile's Role as a Brand Touchpoint: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance.

    * Challenges in Measuring Mobile Reliability: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.

    * Observability Over-Focused on Backend Systems: Andrew points out that “big observability” has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes. Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.

    * Lack of Mobile Engineering Leadership in Observability: Reflecting on a former Meta product manager’s observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.

    * OpenTelemetry’s Limitations in Mobile: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don’t align with traditional time-based observability.

    * SREs as Connective Tissue: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences—a critical factor in retaining app users.

    * Amazon’s Operational Readiness Review: Drawing from his experience at AWS, Andrew values Amazon’s practice of operational readiness reviews before launching new services. These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.

    * Shifting Focus to “Answerability” in Observability: For Andrew, the goal of observability should evolve toward “answerability,” where systems provide engineers with actionable answers rather than mere data. He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting.



    29 min
