Observability over Assumptions
You can’t fix what you can’t see. Instrument everything.

We don’t run refineries—we build the equipment they depend on.

Industrial process heating systems aren’t glamorous, but they’re essential. Our gear—massive reboilers, convective heaters, thermal loops—sits at the core of petroleum refining. Designed for uptime, engineered for heat transfer efficiency, optimized for brutal conditions.

One of our newer dual-fuel heater models had just gone live at a Gulf Coast site. The commissioning reports were clean. Control systems tuned. Ramp-up curve as expected. But seven weeks later, our account manager forwarded an urgent message from their ops team.

Crude throughput was down. Surge windows weren’t holding. Their engineers weren’t blaming our unit—not yet—but they were asking pointed questions. Something about the heater wasn’t behaving the way their DCS expected.

I pulled the logs.

From our side, everything looked textbook. Temperatures tracked. Power levels matched spec. Control valves modulated without fault. The data showed a compliant, stable system.

The data was “correct.” But the refinery’s decisions—based on that data—weren’t working.

The control team described intermittent lag in response times. Nothing dramatic—just enough to break their process automation model. We checked our PLC configs. Firmware versions. Historian feeds. All clean.

We traced the signal path. Our heater communicated with the refinery's control environment via a legacy OPC-UA translator. It was reliable and still met its delivery SLA, but it wasn't designed to expose queue latency under load. And when one of our engineers compared source PLC logs to the refinery's historian data, a pattern emerged—command execution drifted behind reality by up to 14 seconds during peak flow.
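The comparison itself was simple once both sets of timestamps sat side by side. Here is a minimal sketch of that kind of check, assuming the PLC log and the historian export are CSV files keyed by tag and sequence number; the file names, column names, and the 5-second threshold are illustrative, not our actual formats.

```python
import csv
from datetime import datetime, timedelta

DRIFT_LIMIT = timedelta(seconds=5)  # assumed alert threshold for illustration


def load_events(path, time_field):
    """Read events keyed by (tag, sequence) with a parsed ISO timestamp."""
    with open(path, newline="") as f:
        return {
            (row["tag"], row["seq"]): datetime.fromisoformat(row[time_field])
            for row in csv.DictReader(f)
        }


# Hypothetical exports: when the PLC executed each command, and when the
# refinery's historian actually recorded it.
plc = load_events("plc_log.csv", "source_time")
historian = load_events("historian_log.csv", "arrival_time")

for key, source_time in plc.items():
    arrival = historian.get(key)
    if arrival is None:
        continue  # not lost, just not delivered within this export window
    drift = arrival - source_time
    if drift > DRIFT_LIMIT:
        print(f"{key}: command lagged reality by {drift.total_seconds():.1f}s")
```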

That delay was invisible in our reporting stack.

Last year, we migrated to a SQL-backed data mart to unify reporting across all product lines. One of the tradeoffs was granularity: we dropped sub-second timestamps on non-safety telemetry. That decision was intentional—most data consumers were human, not automated systems, and minute-level summaries delivered faster queries, lower storage costs, and simpler compliance audits.

It was the right choice for performance. But it blinded us to behavior that only showed up in milliseconds.
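To make that blindness concrete: once telemetry is reduced to minute-level rows keyed by arrival time, an on-time write and a 14-second-late write inside the same minute produce the exact same record. A toy illustration, with a hypothetical tag name and row shape:

```python
from datetime import datetime


def to_mart_row(tag, value, arrival_time):
    """What the reporting mart kept: tag, value, minute-truncated timestamp."""
    return (tag, value, arrival_time.replace(second=0, microsecond=0))


on_time = to_mart_row("TIC-101.SP", 412.0, datetime(2024, 1, 1, 10, 0, 30))
late_14s = to_mart_row("TIC-101.SP", 412.0, datetime(2024, 1, 1, 10, 0, 44))

print(on_time == late_14s)  # True: both collapse into the same 10:00 row
```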

To rule out a data artifact, we deployed a GPS-synced edge logger onsite. It captured I/O events directly from the controller—down to the millisecond—and preserved both source and arrival timestamps. Within a day of replaying the logs, the cause was clear.
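The essential trick was keeping two clocks on every event so delay could be computed after the fact. A rough sketch of that record shape, with field names and the JSON-lines layout assumed for illustration rather than taken from our actual logger:

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class IoEvent:
    tag: str
    value: float
    source_time_ms: int   # timestamp stamped by the controller itself
    capture_time_ms: int  # GPS-disciplined clock on the edge logger


def record(tag, value, source_time_ms, out):
    """Append one event with both timestamps to a replayable log."""
    event = IoEvent(tag, value, source_time_ms, capture_time_ms=int(time.time() * 1000))
    out.write(json.dumps(asdict(event)) + "\n")


def delays(path):
    """Replay later: per-event delay is simply capture time minus source time."""
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            yield e["tag"], e["capture_time_ms"] - e["source_time_ms"]
```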

Our translator was buffering values under load, and retrying delivery instead of failing fast.

No alarms triggered. No packets lost. The system met every SLA—except the one it didn’t know existed: time-to-decision.

The heater responded perfectly. But the refinery’s process automation reacted late. And in a system tuned for flow, not fault tolerance, late was enough.

We patched the translator, added a queue depth metric, and extended our telemetry schema to log ingestion delay when source time was available. No expensive hardware. No overbuilt tracing stack. Just practical visibility for the signals that mattered.
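In spirit, the fix looked something like the sketch below: a forwarding layer that refuses to buffer silently, exposes its queue depth, and records ingestion delay whenever a source timestamp is present. The class, field names, and thresholds are hypothetical; this is the shape of the change, not our production code.

```python
import queue
import time


class InstrumentedForwarder:
    def __init__(self, send, max_depth=500):
        self._send = send          # downstream delivery callable
        self._queue = queue.Queue()
        self.max_depth = max_depth

    def enqueue(self, message):
        # Fail fast instead of retrying quietly while the backlog grows.
        if self._queue.qsize() >= self.max_depth:
            raise RuntimeError("forwarder queue full")
        self._queue.put(message)

    def metrics(self):
        # Queue depth is the signal that was missing the first time around.
        return {"queue_depth": self._queue.qsize()}

    def drain(self, delay_log):
        while not self._queue.empty():
            msg = self._queue.get()
            if "source_time" in msg:  # only when the source clock is available
                delay_log.append({
                    "tag": msg.get("tag"),
                    "ingestion_delay_s": time.time() - msg["source_time"],
                })
            self._send(msg)
```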

We didn’t add observability everywhere. We added it where delay could break decisions.

The lesson wasn’t just technical. It was cultural. We updated our telemetry standards to include drift tracing for all high-SLA signals. We added a telemetry review to every product line QA cycle. And we wrote it down—clearly, simply, and in the language of risk, not performance.

Because this wasn’t just a support issue. It was a trust issue.

We had built a reporting layer for dashboards. But our customers were using it to drive refineries.

That shift matters. And once we saw it, we responded—not by gold-plating, but by right-plating: adding visibility where it enables judgment, and stopping short of complexity where it doesn’t.

We still aggregate when it makes sense. We still optimize for query speed where operators need summaries. But we now tag the raw source time, too—quietly, without overhead—because sometimes one second is the difference between stability and shutdown.

And that’s why every new system we ship starts from the same belief:

If it can delay, it gets logged.
If it can drift, it gets traced.
If someone might assume it works, it gets proven.

We don’t run the refinery.
But our assumptions ship with the product.

And if we want our gear to be trusted in real time, we can’t rely on hindsight.