Motivation
Every `/api/v1/score` response carries a reproducibility seal: `{ rubricVersion, model, temperature, buildSha, build }`. The seal is a B2B-grade trust primitive — it asserts "this exact recipe produced this exact score, and you can re-derive it."
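The RFC only names the seal's five fields, not their types. A minimal TypeScript sketch of the seal, with assumed types (the real schema may differ), plus the "fully populated" check used by the sampling rule below:

```typescript
// Hypothetical shape of the reproducibility seal. Field names come from
// the RFC; the types are assumptions, not a committed schema.
interface ReproducibilitySeal {
  rubricVersion: string; // e.g. "3.1.0" (semver assumed)
  model: string;         // provider model id pinned at eval time
  temperature: number;   // sampling temperature used for the eval
  buildSha: string;      // git commit of the deployed build
  build: string;         // build identifier
}

// A seal is auditable only when every field is present and non-empty.
// Note: temperature 0 is a legitimate value, so it is checked by type,
// not by truthiness.
function isFullyPopulated(seal: Partial<ReproducibilitySeal>): boolean {
  return Boolean(
    seal.rubricVersion &&
      seal.model &&
      typeof seal.temperature === "number" &&
      seal.buildSha &&
      seal.build
  );
}
```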
But the assertion has never been tested. The closest existing check is the golden-eval CI gate (B1177), which locks 4 system-prompt snapshots byte-identical. That verifies prompt stability, not end-to-end score reproducibility.
This RFC pins the methodology for the quarterly reproducibility audit referenced in Section 4 of /reports/calibration-2026-q2.
Proposal
What gets audited
Each quarter, sample 25 random songs from the prior 30 days where:
- The song is public (eval_data + lyrics readable via service-role)
- The seal field is fully populated (rubricVersion + model + temperature + buildSha + build all present)
- The song has a primary composite + 12 metric scores
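The three eligibility filters plus the random draw of 25 can be sketched as below. The row shape, field names, and in-memory filtering are all illustrative assumptions; in practice this would likely be a database query:

```typescript
// Hypothetical row shape; field names are assumptions, not the real schema.
interface SongRow {
  id: string;
  isPublic: boolean;
  seal: Record<string, unknown> | null;
  composite: number | null;
  metricScores: number[]; // 12 expected per the rubric
  scoredAt: Date;
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function sampleAuditRows(rows: SongRow[], now: Date, n = 25): SongRow[] {
  const sealFields = ["rubricVersion", "model", "temperature", "buildSha", "build"];
  const eligible = rows.filter(
    (r) =>
      r.isPublic &&
      now.getTime() - r.scoredAt.getTime() <= THIRTY_DAYS_MS &&
      r.seal !== null &&
      sealFields.every((f) => r.seal![f] !== undefined && r.seal![f] !== null) &&
      r.composite !== null &&
      r.metricScores.length === 12
  );
  // Fisher-Yates shuffle, then take the first n.
  const pool = [...eligible];
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}
```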
How the replay works
For each sampled row:

1. Read the seal: target rubricVersion, model id, temperature
2. Spin up an eval call against the SAME model id + temperature (NOT the current default — pin to the seal's recorded values)
3. Use the SAME rubric version (load from the package's versioned dist if rubricVersion < current)
4. Score the same lyrics + genre
5. Compare per-row: composite delta + per-metric delta
6. Aggregate: median delta, p95 delta, count of rows with |composite delta| > 5pts
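Steps 5 and 6 are the only fully specified part of the replay, so they can be sketched directly. The `ReplayResult` shape is illustrative; the median and nearest-rank p95 here are one reasonable reading of the aggregates the RFC names:

```typescript
// One sampled row after replay: the composite recorded under the seal
// vs. the composite produced by the pinned re-run. Illustrative type.
interface ReplayResult {
  original: number;
  replayed: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function aggregate(results: ReplayResult[]) {
  const deltas = results
    .map((r) => Math.abs(r.replayed - r.original))
    .sort((a, b) => a - b);
  const mid = Math.floor(deltas.length / 2);
  const median =
    deltas.length % 2 ? deltas[mid] : (deltas[mid - 1] + deltas[mid]) / 2;
  return {
    medianDelta: median,
    p95Delta: percentile(deltas, 0.95),
    over5pts: deltas.filter((d) => d > 5).length, // rows with |delta| > 5pts
  };
}
```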
Acceptance bands
- median |composite delta| ≤ 1pt → ✓ seal is honest
- median |composite delta| 1-3pts → ⚠ acceptable; document drift
- median |composite delta| > 3pts → ✗ seal is broken; investigate
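The three bands reduce to a small classifier. The verdict labels below are my shorthand, not terms defined by this RFC; the boundary values are taken from the bands above, with the shared boundaries (1pt, 3pts) resolved into the gentler band:

```typescript
// Verdict names are illustrative labels for the three acceptance bands.
type SealVerdict = "honest" | "acceptable-drift" | "broken";

function classifySeal(medianAbsCompositeDelta: number): SealVerdict {
  if (medianAbsCompositeDelta <= 1) return "honest";
  if (medianAbsCompositeDelta <= 3) return "acceptable-drift";
  return "broken";
}
```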
What "broken seal" means
If the median composite drift exceeds 3pts, at least one of the following is true:
- The model itself drifted (Anthropic deprecated/changed the underlying model behind the same id)
- Our rubric loaded for the audit doesn't match what shipped that quarter (versioning bug)
- The eval prompt changed without bumping rubricVersion (process bug — shouldn't happen if RFC-0001 is enforced)
The audit's job is to surface the drift; the response is a separate build with its own commit + reasoning.
Cost
25 samples × ~$0.05/eval call = ~$1.25 per audit. Negligible.
Cadence
Quarterly, aligned with /reports/calibration-YYYY-qN. The audit sample lives in docs/REPRODUCIBILITY-AUDIT-YYYY-qN.jsonl (append-only).
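The RFC does not specify the JSONL record schema, so the sketch below is one plausible shape: every field name and value is illustrative, chosen only to show that each line carries the seal, both composites, and the delta, and that records are serialized one JSON object per line:

```typescript
// Illustrative audit record for one line of the append-only JSONL.
// All field names and values here are hypothetical examples.
const auditRecord = {
  songId: "song_abc123", // hypothetical id
  seal: {
    rubricVersion: "3.1.0",
    model: "example-model-id", // placeholder, not a real model id
    temperature: 0.2,
    buildSha: "deadbee",
    build: "2026.04.1",
  },
  originalComposite: 78,
  replayedComposite: 77,
  compositeDelta: -1,
  auditedAt: "2026-07-01T00:00:00Z",
};

// One JSON object per line; the file is appended to, never rewritten.
const jsonlLine = JSON.stringify(auditRecord);
```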
Out of scope
- Auditing the gauntlet's deterministic rerun (the gauntlet is non-deterministic by design via Math.random feature gates)
- Auditing the cross-family triangulation (different model family; reproducibility against gpt-4o is a separate concern)
- Auto-remediation when drift is detected (human review only)
Comment window
This RFC is open for comment until 2026-05-03. Email support@songforgeai.com with the subject `RFC-0007`.
Resolution
(Pending — will be filled in after 2026-05-03 with a summary of comments + the accepted text. The thresholds above are the working defaults until then. First reproduction audit ships with /reports/calibration-2026-q3 once 30 days of post-RFC data exists.)