In April we shipped three builds in a row, each fix justified by the previous build's inference being correct. None of those inferences were correct. By the time we measured the actual cause, four hours of velocity had gone into shipping fixes for the wrong cause.

The pattern was specific enough — and the cost was visible enough — that we wrote it down as a permanent rule. This post is the public version of that rule.

The sequence

A deep audit said: /punch-list times out at 60s+ via WebFetch. WebFetch's timeout is 60 seconds. The audit was reporting that its own tool gave up.

B2424. We inferred from the audit: rendering is slow. Switched the page from force-dynamic to ISR for cheaper serving. Shipped. The Vercel build then failed because Next.js's per-page static-generation timeout is also 60 seconds — the same underlying bug from the build side. We had not made the page faster; we had moved the same failure into a different runtime.

B2429. We reverted with the inference "the file is too big." It's 203 lines / 31 KB. That was not the cause either. We measured nothing.

B2431. We finally measured. Rendering took 0.45 milliseconds. The page was never slow: the markdown-collection helper had an infinite-loop bug triggered by lines starting with **. The audit had been reporting "this hit the tool's timeout"; we had read that as "this is slow." Same words, different fact.
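The helper itself is not shown in this post, so what follows is only a hypothetical reconstruction of this class of bug, sketched in Python for brevity. The names collect_bold_lines, advance_on_bold, and max_steps are illustrative assumptions, not the real code. The sketch shows a scan loop whose index fails to advance on one branch, and a step budget that turns the hang into an immediate error naming the line it is stuck on.

```python
class NoProgressError(RuntimeError):
    """Raised when a scan loop exceeds its step budget."""


def collect_bold_lines(lines, advance_on_bold=True, max_steps=None):
    """Collect lines that start with '**' (hypothetical helper).

    advance_on_bold=False simulates the bug: the index is never incremented
    on a '**' line, so the loop revisits that line until the budget runs out.
    """
    # A correct scan needs len(lines) steps; the budget leaves generous slack.
    budget = max_steps if max_steps is not None else 4 * len(lines) + 4
    found = []
    i = 0
    steps = 0
    while i < len(lines):
        steps += 1
        if steps > budget:
            raise NoProgressError(f"stuck at line {i}: {lines[i]!r}")
        if lines[i].startswith("**"):
            if not advance_on_bold:
                continue  # simulated bug: i does not change on this branch
            found.append(lines[i])
        i += 1
    return found
```

With the guard in place, the failure reads "stuck at line 1" rather than "timed out at 60s," which points at a cause instead of a symptom.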

The rule

When a fix is informed by an audit observation or a perf report, follow these rules BEFORE choosing the fix strategy:

Reproduce the slow path with a measurement, not a tool's timeout. WebFetch's 60s ceiling, Vercel's 5-minute lambda wall-clock, and GitHub Actions' default job timeout are all tooling artifacts: they tell you "this took at least N seconds" but not where the N went. If you'd write "X is slow" in a commit message, write a benchmark that proves it first.
The shape of the fix is dictated by what the measurement reveals. "Switch to ISR" is a valid fix for "render is slow on every request." It's the wrong fix for "render is an infinite loop." The same input ("page is slow") admits multiple causes; each cause has a different fix shape. Measure first.
Verify the fix on the production-shape path BEFORE shipping. Local typecheck + tests + check:all is one phase. Actually exercising the production-shape behavior (the next build / a preview deploy / a real HTTP fetch) is the second phase. Our B2424 ISR change passed every local gate AND still broke the Vercel build, because nothing in those gates exercised "what Next.js does at build time when revalidate is set."
The commit message SAYS the inference. Pre-rule, the B2424 commit confidently asserted "the WebFetch-observed 60s+ render time was dominated by per-request cold-start + zero CDN cache." That claim was a guess. Mark guesses as guesses: hypothesis: …, if this is right, …, measured: …. The reader of the commit message should be able to distinguish what was verified from what was inferred.
One inferred fix per build, max. If a session ships three builds in a row whose justification depends on the PREVIOUS build's inference being correct, the inference is load-bearing for the whole stack. Our B2424→B2429→B2431 sequence was exactly this pattern. When you catch yourself doing this, stop and measure.
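As a rough illustration of "reproduce the slow path with a measurement," here is a minimal sketch in Python that times each phase of a request separately, so the numbers show where the time went. The function names and the phase names (collect, render) are assumptions made for the example, not our actual benchmark.

```python
import statistics
import time


def time_phases(phases, runs=5):
    """Run each (name, callable) phase `runs` times; return median ms per phase."""
    results = {}
    for name, fn in phases:
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - start) * 1000.0)
        results[name] = statistics.median(samples)
    return results


def slowest_phase(results):
    """Return the name of the phase with the largest median time."""
    return max(results, key=results.get)


if __name__ == "__main__":
    # Stand-ins for the real work; replace with the actual phase functions.
    demo = time_phases([
        ("collect", lambda: time.sleep(0.02)),
        ("render", lambda: sum(range(1000))),
    ])
    for name, ms in demo.items():
        print(f"measured: {name} median {ms:.2f} ms")
    print("slowest:", slowest_phase(demo))
```

A timer like this only helps when every phase returns. A phase that never finishes needs a hard limit around it, for example running the benchmark in a subprocess with a timeout, so that "did not finish" is recorded as its own result and not as a large number.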

What this looks like in practice

The rule of thumb mirrors a sibling rule we already had for audits: before writing any "this is broken" finding, prove it's broken with a citation a developer can follow. Both rules share the same shape:

If the proof is "the tool timed out," that's not proof — that's evidence of a symptom. Evidence of a symptom is not evidence of a cause.

We added a small adjustment to our commit-message convention to encode this. Each claim in a commit message carries a tag for the kind of evidence behind it, so a justification that is INFERRED rather than MEASURED is marked as such:

[SRC]       — source-grep / file:line citation
[BUILD]     — build-log / git history evidence
[INFERENCE] — best-guess explanation; flagged so the reader can weight it

The audit doc this came out of uses the same tags. Findings backed by [SRC] evidence rank above findings backed by [INFERENCE] evidence even when the inference sounds confident — confidence is not measurement.
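The ranking can be made mechanical. Below is a minimal sketch in Python that reads the tags out of a finding or commit line and sorts by evidence strength. One assumption goes beyond what is stated above: the convention only says [SRC] outranks [INFERENCE], so placing [BUILD] between them is a guess. The function names are illustrative as well.

```python
import re

# Lower number = stronger evidence. Only "SRC outranks INFERENCE" comes from
# the convention; BUILD's middle position is an assumption.
TAG_RANK = {"SRC": 0, "BUILD": 1, "INFERENCE": 2}
TAG_PATTERN = re.compile(r"\[(SRC|BUILD|INFERENCE)\]")


def weakest_tag(text):
    """Return the weakest evidence tag found in `text`, or None if untagged."""
    tags = TAG_PATTERN.findall(text)
    if not tags:
        return None
    return max(tags, key=TAG_RANK.__getitem__)


def rank_findings(findings):
    """Sort findings strongest-evidence first; untagged findings go last."""
    def sort_key(finding):
        tag = weakest_tag(finding)
        return TAG_RANK[tag] if tag is not None else len(TAG_RANK)

    return sorted(findings, key=sort_key)
```

Scoring a finding by its weakest tag is a deliberate choice in this sketch: a claim that mixes a source citation with a guess is only as strong as the guess.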

Why we're publishing this

Three reasons.

First, the failure mode is universal in AI-product engineering. Tooling artifacts get read as causes constantly — the model timed out, the API returned 500, the build took too long. The pressure to ship a fix is real; the discipline of staying with "what actually happened" is unfamiliar territory. We wrote the rule because we kept failing it.

Second, our codebase is committed to standards transparency. The Lyric Scoring Standard, the Banned Clichés List, the Voice-Reference Discipline, and the Sacred Accidents log all operate under the same posture: the receipts ARE the discipline. Publishing internal engineering rules — including the unflattering ones — is part of that posture.

Third, four-hour debugging arcs cost more than the rule does to follow. Anyone using AI-assisted coding tools or shipping inference-heavy products is one timeout away from this loop. The rule below has saved us at least three repeats since we wrote it down.

The rule, condensed

Before shipping any "this fixes X" change informed by audit evidence, prove the diagnosis with a measurement a developer can reproduce. If the proof is "the audit said the page was slow," that's not proof — that's evidence of a SYMPTOM, not a CAUSE.

Mark inferences as inferences. Reproduce slow paths with benchmarks. Verify fixes against the production-shape path before pushing. One inferred fix per build, max. When the justification stack starts to lean on prior inference, stop and measure.

The B2435 amendment lives at the top of our operating playbook now. Three weeks in, no repeat of the B2424→B2431 arc. The rule paid for itself the first time.

The fix-side evidence rule: don't ship a 'this fixes X' change without measuring X
