Ship It Weekly - DevOps, SRE, Platform and Cloud Engineering News

By: Teller's Tech - DevOps SRE and Cloud Podcast

Ship It Weekly is a short, practical recap of what actually matters in DevOps, SRE, cloud infrastructure, and platform engineering.

Each episode, your host Brian Teller walks through the latest outages, releases, tools, and incident writeups, then translates them into “here’s what this means for your systems” instead of just reading headlines. Expect a couple of main stories with context, a quick hit of tools or releases worth bookmarking, and the occasional segment on on-call, burnout, or team culture.

This isn’t a certification prep show or a lab walkthrough. It’s aimed at people who are already working in the space and want to stay sharp without scrolling status pages, cloud updates, and blogs all week. You’ll hear about things like cloud provider incidents, Kubernetes and platform trends, Terraform and infrastructure changes, and real postmortems that are actually worth your time.

Most episodes are 15–30 minutes, so you can catch up on the way to work or between meetings. Every now and then there will be a “special” focused on a big outage or a specific theme, but the default format is simple: what happened, why it matters, and what you might want to do about it in your own environment.

If you’re the person people DM when something is broken in prod, or you’re building the cloud and platform everyone else ships on top of, Ship It Weekly is meant to be in your rotation.

Brian Teller - Teller's Tech - DevOps, SRE and Cloud
Politics & Government
Episodes
  • Ship It Conversations: Meta’s Francois Richard on AI Incident Response, SLOs, and Reliability at Scale
    Jun 16 2026

    This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps.

    In this Ship It Conversations episode, I talk with Francois Richard, Engineering Director at Meta, about reliability at scale, how AI is changing production risk, what teams actually learn from incidents, and why recovery practice matters just as much as prevention.

    We talk about the proactive and reactive sides of reliability, why SLOs should represent a promise to users instead of just another dashboard number, how incident reviews should drive real system improvements, and how teams can practice recovery before production forces the lesson on them.

    The bigger theme here is that reliability is not just about avoiding failure. It is about knowing what happens when prevention fails. That means practicing regional failure, understanding overload behavior, improving incident response, using AI carefully during investigation, and making reliability targets match the actual lifecycle and importance of the system.

    Highlights

    • Why reliability work starts with both prevention and recovery

    • The difference between reactive incident response and proactive reliability engineering

    • How Meta thinks about disaster recovery testing and regional failure practice

    • Why an SLO should be treated like a promise to users, not just a dashboard metric

    • How SLO trends help teams decide when to invest more in reliability or take more product risk

    • What engineers actually learn during the “pressure cooker” of an incident

    • Why incident reviews should produce follow-up work, not just a nicer explanation of what broke

    • The difference between finding the cause of an incident and improving the system

    • Where AI agents can help with incident investigation, telemetry, metrics, and query building

    • Why AI-generated code can increase change volume while reducing human context

    • How faster code generation changes the kinds of reliability problems teams should expect

    • Why recovery practice matters, especially for region loss, traffic spikes, overload, and restart behavior

    • What smaller DevOps and SRE teams can learn from Meta-scale reliability patterns

    • Why not every system needs six nines, especially early in a product lifecycle

    • How to think about reliability investment based on user promise, product maturity, and operational risk

    • Why At Scale Systems & Reliability is focused on the infrastructure behind AI and the use of AI to operate large-scale systems
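
    The SLO points above come down to simple arithmetic. As a minimal sketch (the function names and request counts are hypothetical, not taken from the episode or from Meta's tooling), this shows how an availability target turns into an error budget and why each extra nine is expensive:

```python
def allowed_downtime_seconds(slo_target: float, window_days: int = 365) -> float:
    """Downtime the target permits over the window, in seconds."""
    return (1.0 - slo_target) * window_days * 24 * 3600


def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if total == 0:
        return 1.0  # no traffic yet, nothing spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target with 500 failures in 1,000,000 requests.
    print(round(error_budget_remaining(0.999, 999_500, 1_000_000), 3))
    print(round(allowed_downtime_seconds(0.999999), 1))
```

    Under these assumptions, six nines over a 365-day window allows roughly 31.5 seconds of downtime, versus about 8.8 hours for three nines. A shrinking remaining budget is a signal to invest in reliability; a budget that is barely touched suggests there is room to take more product risk.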

    Francois’ links

    • LinkedIn: https://www.linkedin.com/in/francoisrichard/

    At Scale links

    • Systems & Reliability 2026: https://bit.ly/4xd2FdG

    • At Scale Conferences: https://atscaleconference.com/

    Our links

    More episodes + show notes + links: https://shipitweekly.fm

    On Call Brief: https://oncallbrief.com

    43 mins
  • Coinbase Outage, Meta AI Account Recovery, AWS AgentCore Code Injection, Apigee Tenant Isolation, and the Glue That Breaks Production
    Jun 12 2026

    This episode of Ship It Weekly is about the hidden glue holding production together.

    Brian covers Coinbase’s May 7 outage postmortem, where an AWS us-east-1 cooling failure exposed the difference between being “multi-AZ” on paper and actually being able to recover when stateful, low-latency systems are tied to a failed zone.

    Then he looks at Meta’s AI-assisted Instagram support issue and why account recovery is identity infrastructure, not just customer support. If AI can influence password resets, email changes, MFA resets, or account ownership flows, that workflow needs to be treated like a production control plane.

    The episode also covers AWS AgentCore CLI CVE-2026-11393, where collaborator metadata could break out into generated Python code during agent import, and an Apigee cross-tenant issue from Google’s Apigee security bulletins that shows why tenant isolation has to be tested beyond the obvious happy path.
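
    The general bug class behind "metadata breaking out into generated code" can be illustrated in a few lines. This is a generic sketch, not the AgentCore CLI's actual code or its fix; the function names, the `collaborator_name` variable, and the payload are all hypothetical. It contrasts pasting an untrusted string between quotes with emitting it as a Python literal via `repr()`, and only parses the generated source rather than executing it:

```python
import ast


def render_unsafe(name: str) -> str:
    # Naive templating: the value is pasted between quotes as-is.
    return f'collaborator_name = "{name}"\n'


def render_safer(name: str) -> str:
    # repr() emits a valid string literal with quotes and newlines escaped.
    return f"collaborator_name = {name!r}\n"


def statement_count(source: str) -> int:
    # Parse only; the generated source is never executed here.
    return len(ast.parse(source).body)


# Hypothetical attacker-controlled metadata value.
payload = 'x"\nimport os  # attacker-controlled statement\ny = "'

if __name__ == "__main__":
    print(statement_count(render_unsafe(payload)))  # extra statements appear
    print(statement_count(render_safer(payload)))   # stays a single assignment
```

    Escaping with `repr()` only covers string values placed where a literal is expected. Generating data (for example JSON) and loading it at runtime avoids the problem more broadly than generating code.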

    Links

    Coinbase May 7 outage postmortem: https://www.coinbase.com/blog/a-postmortem-of-our-may-7-2026-outage

    Meta AI support / Instagram account recovery reporting: https://www.theverge.com/tech/945658/meta-ai-support-chatbot-exploit-instagram-accounts

    AWS AgentCore CLI CVE-2026-11393: https://aws.amazon.com/security/security-bulletins/2026-040-aws/

    AgentCore CLI GitHub advisory: https://github.com/aws/agentcore-cli/security/advisories/GHSA-m4x6-gwgp-4pm7

    Google Apigee security bulletins: https://docs.cloud.google.com/apigee/docs/security-bulletins/security-bulletins

    Cloudflare real-time threat intel WAF rules: https://blog.cloudflare.com/realtime-threat-intel-waf-rules/

    AWS Lambda tenant isolation with event source mappings: https://aws.amazon.com/blogs/compute/integrating-event-source-mappings-with-aws-lambda-tenant-isolation-mode/

    Amazon OpenSearch Serverless next generation: https://aws.amazon.com/about-aws/whats-new/2026/05/amazon-opensearch-serverless-next-generation-generally-available/

    GitHub Enterprise Managed Users IP allow list coverage: https://github.blog/changelog/2026-06-08-ip-allow-list-coverage-for-emu-namespaces-in-general-availability/

    This week’s On Call Brief: https://www.tellerstech.com/on-call-brief-news/2026-W24/

    More episodes and show notes: https://shipitweekly.fm/

    23 mins
  • Kiro CLI Approval Bypass, Amazon Braket Pickle Risk, AWS Org Logging, KEDA Upgrades, and Automation’s Hidden Boundaries
    Jun 5 2026

    This episode of Ship It Weekly is about automation’s hidden boundaries. Brian covers Kiro CLI CVE-2026-9255, where piped stdin could act like user approval; Amazon Braket SDK CVE-2026-9291 and the very ordinary Python pickle risk hiding inside quantum job results; AWS Organizations finally emitting CloudTrail events when accounts join or leave an org; and KEDA updates that remind us autoscaling upgrades are production behavior changes.
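
    The "piped stdin as approval" problem is easy to reproduce in any CLI that reads a yes/no answer from standard input. The sketch below is a generic illustration, not Kiro CLI's code or its actual fix; `confirm` and `FakeTty` are hypothetical names. It refuses to treat input as approval unless the stream is an interactive terminal:

```python
import io
import sys


def confirm(prompt: str, stream=None) -> bool:
    """Return True only for an explicit yes typed at an interactive terminal."""
    stream = sys.stdin if stream is None else stream
    if not stream.isatty():
        # Piped or redirected input is not a human approving the action.
        return False
    print(prompt, end=" ", flush=True)
    return stream.readline().strip().lower() in {"y", "yes"}


class FakeTty(io.StringIO):
    """Test double that claims to be a terminal (for demonstration only)."""

    def isatty(self) -> bool:
        return True


if __name__ == "__main__":
    print(confirm("Run command? [y/N]", io.StringIO("y\n")))  # piped: rejected
    print(confirm("Run command? [y/N]", FakeTty("y\n")))      # terminal: accepted
```

    Automation that genuinely needs non-interactive approval is usually better served by an explicit, auditable flag or policy than by whatever happens to arrive on stdin.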

    The bigger thread this week is that automation does not remove boundaries. It moves them. Approval paths, trusted data, account membership, scaling signals, platform access, and AI-generated output all need clear ownership and visibility.
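
    On the "trusted data" point, the pickle risk is that loading a pickle can call arbitrary callables, so results fetched from a remote service are only as safe as everything that could have written them. The sketch below follows the "restricting globals" pattern from the Python `pickle` documentation; it is not the Braket SDK's code or fix, and `Marker`, `DataOnlyUnpickler`, and `safe_loads` are hypothetical names:

```python
import io
import pickle


class Marker:
    """Stand-in for attacker-controlled content: loading it calls a function."""

    def __reduce__(self):
        # pickle.loads() would call print(...) here; a real attack would
        # choose something far worse than print.
        return (print, ("code ran during unpickling",))


class DataOnlyUnpickler(pickle.Unpickler):
    """Unpickler that refuses every global, so only plain data can load."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def safe_loads(data: bytes):
    return DataOnlyUnpickler(io.BytesIO(data)).load()


if __name__ == "__main__":
    print(safe_loads(pickle.dumps({"counts": [1, 2, 3]})))
    try:
        safe_loads(pickle.dumps(Marker()))
    except pickle.UnpicklingError as exc:
        print("rejected:", exc)
```

    Where the format is under your control, a data-only format such as JSON sidesteps the issue entirely.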

    Brian also covers Kubernetes Dashboard being archived with Headlamp as the path forward, Google Cloud Remote MCP Server for AlloyDB, Apache Kafka 4.3.0, and Atlassian’s AI-native SDLC productivity claims.

    Sponsored by @Scale: Systems & Reliability, happening June 25 at the Meydenbauer Center in Bellevue, Washington. Register at https://bit.ly/4xd2FdG

    Links

    Kiro CLI CVE-2026-9255: https://aws.amazon.com/security/security-bulletins/2026-035-aws/

    Amazon Braket SDK CVE-2026-9291: https://aws.amazon.com/security/security-bulletins/2026-036-aws/

    AWS Organizations CloudTrail account events: https://aws.amazon.com/about-aws/whats-new/2026/05/aws-organizations-cloudtrail/

    KEDA v2.20.0 release: https://github.com/kedacore/keda/releases/tag/v2.20.0

    KEDA v2.19.0 release: https://github.com/kedacore/keda/releases/tag/v2.19.0

    Kubernetes Dashboard archived / Headlamp path forward: https://kubernetes.io/blog/2026/06/04/dashboard-archived-what-now/

    Google Cloud Remote MCP Server for AlloyDB: https://cloud.google.com/blog/products/databases/alloydb-remote-mcp-server-now-ga

    Apache Kafka 4.3.0: https://www.confluent.io/blog/apache-kafka-4-3-release-announcement/

    Atlassian AI-native SDLC productivity claims: https://www.atlassian.com/blog/software-teams/ai-native-sdlc

    This week’s On Call Brief: https://www.tellerstech.com/on-call-brief/2026-W23/

    More episodes and show notes: https://shipitweekly.fm/

    20 mins