Hit Calibration Artifact · Public corpus
How the rubric scores songs you already know.
A curated corpus of historically-significant songs run through the same Lyric Scoring Standard the rest of SongForgeAI runs. Proves the rubric grades craft — not chart success. Every entry is frozen, citable, and answers the same question: does this rubric actually work?
Pre-publication phase. The corpus is being curated. Methodology + table schema are locked. First entries land soon. Check back, or read the methodology below to see how each entry is produced.
What a published entry will look likeSample, not real data
| Song | Score | Verdict |
|---|---|---|
Yesterday The Beatles | 94/ A | masterpiece |
The real corpus will replace this preview as the operator seeds it. Every row is a song's lyric run through the same 8-voice Crucible at /crucible — same rubric, same anti-inflation rules, frozen at first publication.
Methodology
Curation: Each entry is a historically-significant song. Selection draws from canonical lists (RIAA Diamond, Grammy SOTY, Billboard #1s across decades, songwriting-craft canon — Cohen, Dylan, McCartney, Joni Mitchell, Springsteen, Sondheim, etc). Curation is operator-led; the goal is craft diversity, not chart-rank ranking.
Scoring: Each song's lyric is run through the same 8-voice Crucible the public uses at /crucible. The result is frozen here at first publication — no re-scoring based on later rubric edits. If the rubric drifts, the calibration corpus drift becomes visible (which is the point).
What we publish: Title, artist, year, chart context, our score, our verdict, voice counts, and a 1-2 sentence critique summary that reflects the panel's actual reading. We never publish the lyric itself. Standard music-criticism practice.
Why no popularity bias: The rubric grades craft, not chart success. If a song that charted #1 scores 52 here because the lyric is craft-thin, we publish 52. If a song that never charted scores 92 because the craft is masterful, we publish 92. Auto-inflating known hits would break the entire claim the rubric stands on.
When entries change: Never. Once published, the row is frozen at frozen_at. Re-running the rubric on the same lyric should produce the same score (within ±2-3 points of natural Sonnet variance, per the consensus eval discipline). Significant drift is itself a published finding — the corpus's most useful property is its stability.
Why this corpus exists
Every AI scoring system answers the same skeptical question eventually: "how do I know your rubric isn't just grading its own output well?" The Hit Calibration Artifact is the most direct answer available — it runs the rubric against songs the skeptic ALREADY KNOWS and lets them check.
If "Hallelujah" scores low here, our rubric is broken. If "Achy Breaky Heart" scores 92, our rubric is sycophantic. The corpus is the public proof that neither failure mode applies.
Page revalidates hourly. Entries frozen at first publication. Build 2180.
The standard
12-metric rubric · CC BY 4.0 · machine-readable JSON
Whitepaper
Formal methodology, anti-inflation rules, calibration corpus
Inter-rater agreement
Pre-registered methodology · 30-rater human cohort · ICC
Reproducibility seal
ed25519 signature spec · pubkey JSON · verifier
Changelog
Full version history + accepted RFCs
Version diff
Compare any two rubric versions side by side
Model card
Reference-implementation model disclosure
Prior art
Conservatory rubrics + MIR research + industry frameworks the open standard extends
Sleeper ledger
Day-1 vs Day-2 cold-temperament drift aggregates. Auto-publishes at 200 sleeper-tested songs.
Register-aware craft
SA#32 — 10 emotional registers, per-register Gravity Rule + Burden of Proof modifiers. Quality is not universal; register defines craft.