Building Reliable Systems at Bloomberg with Sal Furino

In this episode of Alexa’s Input (AI), I sit down with Sal Furino to explore the hidden engineering work that keeps modern systems reliable.

We break down what Service Level Objectives, Indicators (SLOs/SLIs), and error budgets actually mean in practice, why reliability is as much a cultural problem as a technical one, and how teams can better measure real user experience instead of just infrastructure health.

Sal also explains reliability engineering and the challenges of reliability at scale, like:

Why latency and correctness become harder to measure with GenAI
The difference between a bad incident and a fundamentally bad system
How observability and telemetry shape modern engineering organizations
Why most teams focus too much on infrastructure metrics and not enough on user happiness
Why “the best systems are the ones nobody notices.”

If you work in AI infrastructure, distributed systems, platform engineering, observability, or SRE, this episode is a must listen!

SRECon Talk Dashboards & Dragons: Reliability Magic for AI Platforms by Alexa Griffith and Sal Furino: https://youtu.be/aWMB_7ksbkc?si=S49nPyAl_hCUIH7y

General Podcast Links
Watch: https://www.youtube.com/@alexa_griffith
Read: https://alexasinput.substack.com/
Listen: https://creators.spotify.com/pod/profile/alexagriffith/
More: https://linktr.ee/alexagriffith

Learn more about the host at:
Website: https://alexagriffith.com/
LinkedIn: https://www.linkedin.com/in/alexa-griffith/

Find out more about the guest at:
LinkedIn: https://www.linkedin.com/in/salvatore-furino/
Rootly Interview: https://rootly.com/humans-of-reliability/salvatore-furino
Reliability at Scale Talk: https://youtu.be/J-VrU5JHPlk?si=8aV8acy57NWX30KA
Bloomberg Careers: https://bloomberg.avature.net/careers/SearchJobs

Chapters
00:00 - Introduction: Reliability in a world reshaped by generative AI
02:22 - The importance of seamless, background system design
04:41 - Becoming a Customer Reliability Engineer at Bloomberg
05:17 - Clarifying the CRE role and its customer focus
08:02 - The importance of observability and high-scale performance in finance
09:00 - Balancing technical and cultural aspects of reliability
10:19 - Coaching teams to be proactive using error budgets and SLIs
12:21 - The social-technical system: People, processes, and tools
13:06 - Mediation of differing opinions on reliability practices
15:06 - The nuanced approach to alerting and incident response
17:08 - The significance of tiered SLOs and the concept of error budgets
21:08 - Using signals like latency, correctness, availability, saturation in system measurement
22:53 - The impact of service level "nines" on system design and resilience
28:00 - Handling non-determinism and trust in AI responses
33:01 - Error budgets and their role in managing deployments
34:10 - The challenge of achieving five nines and data durability considerations
40:03 - Adapting SLOs for GenAI systems: core principles remain intact
42:23 - Measuring non-deterministic AI responses and quality proxies
44:41 - The ongoing importance of reliability even in AI/ML contexts
47:25 - Reacting to error budget exhaustion and proactive mitigation
50:42 - The significance of involving cross-functional teams during outages
55:36 - Advocating reliability investment to leadership
56:24 - The customer perspective: reliability as a fundamental feature
58:42 - Connecting with Sal Furino: where to follow his work and learn more about Bloomberg's engineering culture
59:20 - Final advice: Focus on user happiness to avoid common pitfalls in adopting SLOs