Motivation
Every `/api/v1/score` response carries a reproducibility seal: `{ rubricVersion, model, temperature, buildSha, build }`. The seal is a B2B-grade trust primitive — it asserts "this exact recipe produced this exact score, and you can re-derive it."
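The RFC only names the seal's five fields, not their types. A minimal TypeScript sketch of the seal, with assumed types (the real schema may differ), plus the "fully populated" check used by the sampling rule below:

```typescript
// Hypothetical shape of the reproducibility seal. Field names come from
// the RFC; the types are assumptions, not a committed schema.
interface ReproducibilitySeal {
  rubricVersion: string; // e.g. "3.1.0" (semver assumed)
  model: string;         // provider model id pinned at eval time
  temperature: number;   // sampling temperature used for the eval
  buildSha: string;      // git commit of the deployed build
  build: string;         // build identifier
}

// A seal is auditable only when every field is present and non-empty.
// Note: temperature 0 is a legitimate value, so it is checked by type,
// not by truthiness.
function isFullyPopulated(seal: Partial<ReproducibilitySeal>): boolean {
  return Boolean(
    seal.rubricVersion &&
      seal.model &&
      typeof seal.temperature === "number" &&
      seal.buildSha &&
      seal.build
  );
}
```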
But the assertion has never been tested. The closest existing check is the golden-eval CI gate (B1177), which locks 4 system-prompt snapshots byte-identical. That verifies prompt stability, not end-to-end score reproducibility.
This RFC pins the methodology for the quarterly reproducibility audit referenced in Section 4 of /reports/calibration-2026-q2.
Proposal
What gets audited
Each quarter, sample 25 random songs from the prior 30 days where:
- The song is public (eval_data + lyrics readable via service-role)
- The seal field is fully populated (rubricVersion + model + temperature + buildSha + build all present)
- The song has a primary composite + 12 metric scores
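The three eligibility filters plus the random draw of 25 can be sketched as below. The row shape, field names, and in-memory filtering are all illustrative assumptions; in practice this would likely be a database query:

```typescript
// Hypothetical row shape; field names are assumptions, not the real schema.
interface SongRow {
  id: string;
  isPublic: boolean;
  seal: Record<string, unknown> | null;
  composite: number | null;
  metricScores: number[]; // 12 expected per the rubric
  scoredAt: Date;
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function sampleAuditRows(rows: SongRow[], now: Date, n = 25): SongRow[] {
  const sealFields = ["rubricVersion", "model", "temperature", "buildSha", "build"];
  const eligible = rows.filter(
    (r) =>
      r.isPublic &&
      now.getTime() - r.scoredAt.getTime() <= THIRTY_DAYS_MS &&
      r.seal !== null &&
      sealFields.every((f) => r.seal![f] !== undefined && r.seal![f] !== null) &&
      r.composite !== null &&
      r.metricScores.length === 12
  );
  // Fisher-Yates shuffle, then take the first n.
  const pool = [...eligible];
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}
```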
How the replay works
For each sampled row:

1. Read the seal: target rubricVersion, model id, temperature
2. Spin up an eval call against the SAME model id + temperature (NOT the current default — pin to the seal's recorded values)
3. Use the SAME rubric version (load from the package's versioned dist if rubricVersion < current)
4. Score the same lyrics + genre
5. Compare per-row: composite delta + per-metric delta
6. Aggregate: median delta, p95 delta, count of rows with |composite delta| > 5pts
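Steps 5 and 6 are the only fully specified part of the replay, so they can be sketched directly. The `ReplayResult` shape is illustrative; the median and nearest-rank p95 here are one reasonable reading of the aggregates the RFC names:

```typescript
// One sampled row after replay: the composite recorded under the seal
// vs. the composite produced by the pinned re-run. Illustrative type.
interface ReplayResult {
  original: number;
  replayed: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function aggregate(results: ReplayResult[]) {
  const deltas = results
    .map((r) => Math.abs(r.replayed - r.original))
    .sort((a, b) => a - b);
  const mid = Math.floor(deltas.length / 2);
  const median =
    deltas.length % 2 ? deltas[mid] : (deltas[mid - 1] + deltas[mid]) / 2;
  return {
    medianDelta: median,
    p95Delta: percentile(deltas, 0.95),
    over5pts: deltas.filter((d) => d > 5).length, // rows with |delta| > 5pts
  };
}
```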
Acceptance bands
- median |composite delta| ≤ 1pt → ✓ seal is honest
- median |composite delta| 1-3pts → ⚠ acceptable; document drift
- median |composite delta| > 3pts → ✗ seal is broken; investigate
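The three bands reduce to a small classifier. The verdict labels below are my shorthand, not terms defined by this RFC; the boundary values are taken from the bands above, with the shared boundaries (1pt, 3pts) resolved into the gentler band:

```typescript
// Verdict names are illustrative labels for the three acceptance bands.
type SealVerdict = "honest" | "acceptable-drift" | "broken";

function classifySeal(medianAbsCompositeDelta: number): SealVerdict {
  if (medianAbsCompositeDelta <= 1) return "honest";
  if (medianAbsCompositeDelta <= 3) return "acceptable-drift";
  return "broken";
}
```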
What "broken seal" means
If the median composite drift exceeds 3pts, at least one of the following is true:
- The model itself drifted (Anthropic deprecated/changed the underlying model behind the same id)
- Our rubric loaded for the audit doesn't match what shipped that quarter (versioning bug)
- The eval prompt changed without bumping rubricVersion (process bug — shouldn't happen if RFC-0001 is enforced)
The audit's job is to surface the drift; the response is a separate build with its own commit + reasoning.
Cost
25 samples × ~$0.05/eval call = ~$1.25 per audit. Negligible.
Cadence
Quarterly, aligned with /reports/calibration-YYYY-qN. The audit sample lives in docs/REPRODUCIBILITY-AUDIT-YYYY-qN.jsonl (append-only).
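The RFC does not specify the JSONL record schema, so the sketch below is one plausible shape: every field name and value is illustrative, chosen only to show that each line carries the seal, both composites, and the delta, and that records are serialized one JSON object per line:

```typescript
// Illustrative audit record for one line of the append-only JSONL.
// All field names and values here are hypothetical examples.
const auditRecord = {
  songId: "song_abc123", // hypothetical id
  seal: {
    rubricVersion: "3.1.0",
    model: "example-model-id", // placeholder, not a real model id
    temperature: 0.2,
    buildSha: "deadbee",
    build: "2026.04.1",
  },
  originalComposite: 78,
  replayedComposite: 77,
  compositeDelta: -1,
  auditedAt: "2026-07-01T00:00:00Z",
};

// One JSON object per line; the file is appended to, never rewritten.
const jsonlLine = JSON.stringify(auditRecord);
```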
Out of scope
- Auditing the gauntlet's deterministic rerun (the gauntlet is non-deterministic by design via Math.random feature gates)
- Auditing the cross-family triangulation (different model family; reproducibility against gpt-4o is a separate concern)
- Auto-remediation when drift is detected (human review only)
Comment window
This RFC is open for comment until 2026-05-03. Email support@songforgeai.com with the subject `RFC-0007`.
Resolution
(Pending — will be filled in after 2026-05-03 with a summary of comments + the accepted text. The thresholds above are the working defaults until then. First reproduction audit ships with /reports/calibration-2026-q3 once 30 days of post-RFC data exists.)