Lyric Scoring Standard
Calibration report · v0.1-draft · Q2 2026

Quarterly rubric calibration · Inaugural edition · stubbed 2026-04-26

Five calibration questions, each backed by a live data source the operator can audit at any time. Three sections have data flowing today; one awaits its first automated sample; one ships methodology-only this quarter and gains data in Q3.

Why publish a quarterly calibration report

A scoring standard that never reports on its own calibration drifts into folk wisdom. The Lyric Scoring Standard ships under CC BY 4.0; third parties cite it; the reproducibility seal in every /api/v1/score response embeds its version. The discipline is incomplete without a regular public artifact answering: "is the rubric still doing what we said it does?"

This report is that artifact. Each of the five sections links to a LIVE data source that produces the underlying numbers. When a section says "Q2 baseline N pts," the corresponding dashboard URL is how you verify it.

Five calibration questions

  1. Cross-family agreement (Sonnet ↔ GPT-4o)

    Status: data flowing

    When the primary Sonnet eval scores a song at X, what does GPT-4o say? At what score bands do the families systematically disagree?

    The B1279 triangulation re-scores every internal eval (post-B1356 fix) on the same 12-metric rubric using GPT-4o-2024-11-20 at temperature 0.7. The corpus mean divergence becomes a primary calibration signal: if the families agree within 3pts on 80% of songs, the rubric specification is robust to model-family interpretation. Wider gaps signal either rubric ambiguity or family-specific bias.

    live triangulation status
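As a concrete sketch of the agreement check above: this assumes a hypothetical flat record shape (song_id / sonnet / gpt4o composite pairs), not the actual B1279 schema.

```python
from statistics import mean

def agreement_report(rows: list[dict]) -> dict:
    """rows: [{"song_id": ..., "sonnet": <0-100>, "gpt4o": <0-100>}, ...]"""
    deltas = [r["sonnet"] - r["gpt4o"] for r in rows]
    within_3 = sum(1 for d in deltas if abs(d) <= 3)
    return {
        "mean_divergence": mean(deltas),              # corpus mean divergence
        "agree_within_3pts": within_3 / len(deltas),  # robustness target: >= 0.80
    }

sample = [
    {"song_id": "a", "sonnet": 82, "gpt4o": 80},  # delta +2
    {"song_id": "b", "sonnet": 74, "gpt4o": 79},  # delta -5
    {"song_id": "c", "sonnet": 61, "gpt4o": 60},  # delta +1
]
report = agreement_report(sample)
```

Splitting the same deltas by score band would then localize where the families systematically disagree.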
  2. 24-hour Hum Test (delayed M11 vs fresh M11)

    Status: data flowing

    Is the fresh Memorability score systematically inflated relative to a 24-hour-delayed re-read? If yes, the rubric is rewarding hooks that don’t actually stick.

    The B1303 hourly cron picks ~10 songs forged 23-25 hours earlier and re-runs a minimal-context M11 score. RFC-0003 governs the calibration discipline: |median(delta)| > 10pts triggers an obligation to address M11 in the next quarterly bump. Q2 baseline established here.

    live hum-test dashboard
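The RFC-0003 trigger condition reduces to a one-liner. A minimal sketch, assuming deltas are fresh-minus-delayed M11 differences per sampled song (hypothetical data, not real hum-test output):

```python
from statistics import median

def m11_review_required(deltas: list[float], threshold: float = 10.0) -> bool:
    """deltas: fresh M11 score minus the 24-hour-delayed re-read, per song.

    Per RFC-0003, |median(delta)| > 10 pts obligates an M11 change
    in the next quarterly rubric bump.
    """
    return abs(median(deltas)) > threshold
```

The median (rather than the mean) keeps a few outlier songs from triggering a rubric obligation on their own.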
  3. Per-cohort engagement vs composite

    Status: data flowing

    When a song scores 80 on the rubric but the listener panel’s engagement scores cluster at 30, the rubric is over-scoring relative to what listeners actually feel. How often does this gap exceed 15pts? Which persona is the most pessimistic?

    The five-persona listener panel (Spotify Skimmer, Lyrics Reader, Songwriter Peer, Emotional Listener, Genre Purist) produces an engagement score per song. Mean divergence (composite − mean engagement) is the headline number. Per-persona × score-band heatmap surfaces whether a specific persona drives divergence in a specific band.

    live cohort-divergence dashboard
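A sketch of the headline computation, with hypothetical persona keys and record shape standing in for the real panel output:

```python
from statistics import mean

# Hypothetical short keys for the five panel personas (Spotify Skimmer,
# Lyrics Reader, Songwriter Peer, Emotional Listener, Genre Purist).
PERSONAS = ["skimmer", "lyrics_reader", "songwriter_peer",
            "emotional_listener", "genre_purist"]

def divergence_report(rows: list[dict]) -> dict:
    # gap = rubric composite minus the panel's mean engagement score
    gaps = [r["composite"] - mean(r[p] for p in PERSONAS) for r in rows]
    most_pessimistic = max(
        PERSONAS, key=lambda p: mean(r["composite"] - r[p] for r in rows))
    return {
        "mean_divergence": mean(gaps),                           # headline number
        "gap_over_15_rate": sum(g > 15 for g in gaps) / len(gaps),
        "most_pessimistic_persona": most_pessimistic,
    }

rows = [
    {"composite": 80, "skimmer": 30, "lyrics_reader": 50,
     "songwriter_peer": 60, "emotional_listener": 55, "genre_purist": 40},
    {"composite": 60, "skimmer": 50, "lyrics_reader": 55,
     "songwriter_peer": 58, "emotional_listener": 60, "genre_purist": 52},
]
report = divergence_report(rows)
```

The per-persona × score-band heatmap is the same arithmetic grouped twice: once by persona, once by composite band.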
  4. Reproducibility audit

    Status: awaiting first sample

For a sampled batch of 25 published scores from the last 30 days, do the rubric version + model + temperature + buildSha + build number recorded in the seal field still reproduce the published scores when replayed today?

    The seal field on every score response carries the reproduction recipe. This audit takes 25 random scored songs, replays the same lyrics + genre through the same model + temp + rubric version, and reports per-row score delta. Zero or near-zero deltas confirm the seal is honest. Material deltas trigger an investigation. First sample lands with the Q3 report once the audit script is automated; Q2 documents the methodology.

    standard + reproducibility seal spec
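Since the audit script is not automated until Q3, the methodology can only be sketched. The seal field itself is part of the standard, but the key names inside it and the score_fn abstraction below are assumptions, not the documented schema; score_fn stands in for whatever replays lyrics + genre through the pinned model.

```python
def audit_sample(rows: list[dict], score_fn) -> list[dict]:
    """rows: sampled published scores, each carrying its seal recipe.
    score_fn: callable that replays lyrics + genre through the model,
    temperature, and rubric version named in the seal."""
    results = []
    for r in rows:
        seal = r["seal"]  # hypothetical key names below
        replayed = score_fn(
            r["lyrics"], r["genre"],
            model=seal["model"],
            temperature=seal["temperature"],
            rubric=seal["rubricVersion"],
        )
        results.append({
            "song_id": r["song_id"],
            "published": r["score"],
            "replayed": replayed,
            "delta": replayed - r["score"],  # zero/near-zero => seal is honest
        })
    return results
```

The Q3 report would then publish the per-row deltas from one 25-song run of this loop.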
  5. Anti-inflation rule effectiveness

    Status: methodology only

    The five anti-inflation rules (Gravity, Burden of Proof, Antagonist Ceiling, Historical Context, Anti-Platitude per RFC-0002) each defend against a specific inflation pattern. Which rules fire most? Which fire rarely? Are any candidates for removal as overspecified?

Each anti-inflation rule has an internal "rule fired" telemetry hook. Q3 will surface a per-rule fire count + per-rule mean score impact. A rule that fires in <1% of evals is a candidate for removal (or for merging with a related rule); a rule that fires in >50% is a candidate for re-examination as a possible rubric default rather than an exception.

    whitepaper anti-inflation section
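The planned Q3 rollup could look like the sketch below. The event shape (one list of fired rule names per eval) and the short rule keys are hypothetical; only the <1% / >50% thresholds come from the methodology above.

```python
from collections import Counter

# Hypothetical short keys for the five RFC-0002 anti-inflation rules.
RULES = ["gravity", "burden_of_proof", "antagonist_ceiling",
         "historical_context", "anti_platitude"]

def classify_rules(fired_per_eval: list[list[str]]) -> dict:
    """fired_per_eval: one list of fired rule names per scored eval."""
    n = len(fired_per_eval)
    counts = Counter(rule for fired in fired_per_eval for rule in fired)
    verdicts = {}
    for rule in RULES:
        rate = counts[rule] / n
        if rate < 0.01:
            verdicts[rule] = "removal candidate"        # fires in <1% of evals
        elif rate > 0.50:
            verdicts[rule] = "possible rubric default"  # fires in >50% of evals
        else:
            verdicts[rule] = "keep as exception"
    return verdicts
```

Per-rule mean score impact would come from the same telemetry joined against the score delta each firing produced.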

Cadence + future editions

The inaugural Q2 2026 edition is intentionally a SCAFFOLD. The three data sources with data flowing have ~30 days of accumulated telemetry by publication; the next edition (Q3 2026) ships with the first cross-quarter movement deltas and the first reproducibility-audit sample.

Each future edition lives at its own dated URL (/reports/calibration-YYYY-qN) so prior quarters remain stable historical artifacts. Cite by URL.