Ownership over Blame
Accountability creates progress. Blame creates fear.
I remember the morning by the silence.
Not the peaceful kind—the brittle kind. The kind where dashboards stop updating, batch logs stop ticking, and no one wants to be the first to ask what’s wrong. I refreshed our monitoring tool three times. Then a fourth. The overnight load hadn’t landed. The tables were stale. And I knew before opening Slack that something upstream had buckled.
What I didn’t expect was the firestorm already in progress.
The BI team was tagging everyone. “Why is the revenue report blank?” “Why isn’t this alert triggering?” “Did someone break the permissions again?” In a thread already a hundred messages deep, my team’s name was surfacing in all the wrong places. Someone had already run git blame on our access control module. Someone else was speculating—loudly—about a rollback.
I’m not a fan of posturing, especially in moments like that. You can always tell who’s trying to get ahead of the narrative and who’s actually reading the logs.
“There was more attention on who broke it than on what broke.”
I told my team to stay calm, stay quiet, and stay focused. We weren’t being silent. We were being disciplined. I started pulling traces and logs from our access service—the one we built last quarter to enforce row-level security across domains. It was solid. Audited. Tested. But custom. And that meant it was a magnet for suspicion.
Ten minutes in, I had confirmation: our service was throwing 403s. Not for everyone—just for service accounts tied to external job runners. These were accounts provisioned by another team, using a trust relationship we documented, handed off, and hadn’t heard about since.
I messaged the lead from that team. No reply.
By then, Slack was a war zone. Managers from three different departments were comparing outage windows like courtroom evidence. Someone posted a screenshot of a failed query with our service in the stack trace. “Can someone on the data platform side confirm if they changed auth yesterday?” a product lead asked.
“I hadn’t changed the code in two weeks. But the environment changed under us.”
Eventually I found it. A change in IAM permissions, merged and deployed to production sometime around 2 a.m., had rotated the assumed roles for every automation account in the system. No announcement. No warning. Just a silent upstream shift that invalidated the entire token exchange pattern we had built.
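For readers who want the mechanics: what we had built was a fairly standard cross-account token exchange. Here is a sketch of its shape, assuming an AWS-style STS setup; the role ARN and session name below are placeholders, not our actual configuration.

```python
import boto3

# Sketch of the token exchange, assuming an AWS-style STS setup.
# The role ARN and session name are placeholders, not real configuration.
sts = boto3.client("sts")

resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/external-job-runner",  # placeholder
    RoleSessionName="nightly-batch-load",
)

creds = resp["Credentials"]
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
# When the upstream change rotated the role behind that ARN, this exchange stopped
# producing valid credentials, and every request signed with them came back denied.
```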
The rage started to bubble up. I felt it. I wanted to screenshot the commit, paste it in the thread, and say: “This. This is what broke it.”
But I didn’t.
“I didn’t take the fall—I took the lead.”
I knew what would happen if I didn’t step in soon—my team would get pulled into defense mode, start replying emotionally, maybe even take on guilt that wasn’t theirs to carry. That’s not leadership. Leadership is absorbing the blast radius so your people can stay focused. It was never about protecting our reputation. It was about protecting their time, their trust, and their belief that doing things the right way still matters.
I documented it.
The whole thing. Timeline. Impact. Logs. Root cause. I wrote it like I was explaining it to an engineer who had never seen our system before. I included the exact lines where the upstream trust relationship failed, and I offered two recovery paths: roll back the permission change or update our code to support temporary cross-account chaining.
I posted it, not in a DM or a private channel, but in the same public thread where the blame had started. I signed it with my name. And I ended it with a single sentence:
“This failure originated from an upstream change, but our system wasn’t resilient to it. That’s on us. I’ll fix it.”
The channel went quiet for a moment.
Then came the acknowledgments. Then the apologies. Then the messages from other engineers who had been struggling to understand why their tools were breaking in unpredictable ways. Turns out, we weren’t the only ones hit. We were just the first to name it.
I pushed a patch an hour later. Added a fallback check to auto-detect rotated roles and request new tokens on the fly. Deployed it. Monitored it. Every pipeline that had failed came back online in sequence.
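In spirit, the fallback was a refresh-and-retry wrapper around that same exchange. A sketch under the same AWS-style assumptions; the function names and error codes here are illustrative, not lifted from our codebase.

```python
import boto3
from botocore.exceptions import ClientError

# Illustrative fallback: if a call fails because the assumed role was rotated,
# re-run the token exchange and retry once with fresh credentials.
AUTH_ERROR_CODES = {"AccessDenied", "ExpiredToken", "InvalidClientTokenId"}

def assume_role(role_arn: str, session_name: str) -> boto3.Session:
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName=session_name
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

def with_token_refresh(call, role_arn: str, session_name: str):
    """Run call(session); on an auth failure, re-assume the role and retry once."""
    session = assume_role(role_arn, session_name)
    try:
        return call(session)
    except ClientError as err:
        if err.response["Error"]["Code"] not in AUTH_ERROR_CODES:
            raise
        # The role behind the ARN likely rotated out from under us: refresh and retry.
        return call(assume_role(role_arn, session_name))
```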
That afternoon, the engineering manager from the IAM team pinged me. He hadn’t realized the cascade effect of their change. No one had. He thanked me—for not throwing his team under the bus. For explaining the issue without theatrics. For turning a finger-pointing disaster into a moment of clarity.
“Blame would’ve solved nothing. What we needed was someone to own the whole picture.”
The next day, I got pulled into a retrospective. One of the directors asked what we could do to prevent this in the future.
I said, “We can’t stop mistakes. But we can stop building systems that shatter when they happen.”
We agreed to set up contract tests between our services and theirs. To publish dependency documentation that was actually readable. And to treat upstream changes with the same level of rigor we expected from downstream consumers.
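One hypothetical shape for such a contract test, so an upstream IAM change fails a CI check instead of a 2 a.m. batch load; the role ARN is a placeholder, and this is not the actual test we wrote.

```python
import boto3
import pytest
from botocore.exceptions import ClientError

# Hypothetical contract test: fail CI if the cross-account trust we depend on
# no longer lets our job-runner principal assume the role. The ARN is a placeholder.
JOB_RUNNER_ROLE_ARN = "arn:aws:iam::123456789012:role/external-job-runner"

def test_job_runner_role_is_still_assumable():
    sts = boto3.client("sts")
    try:
        sts.assume_role(
            RoleArn=JOB_RUNNER_ROLE_ARN,
            RoleSessionName="contract-test",
        )
    except ClientError as err:
        pytest.fail(f"Upstream trust relationship broken: {err}")
```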
That outage didn’t go down as a hero moment. No one handed out trophies. But the system is stronger now. And more importantly, so is the trust.
“Ownership isn’t about taking the blame. It’s about clearing the fog so no one else has to stumble through it.”