Sleeper Test ledger · Published artifact
Day-1 vs Day-2 score drift, measured across the catalog.
Every song forged on SongForgeAI is re-scored 24 hours later by a COLD-temperament panel reading lyrics only — no title, no genre, no prior score. The delta between Day-1 and Day-2 is the song's holding power. This page aggregates the drift across the public catalog as a Lyric Scoring Standard companion artifact.
Pre-publication phase. The ledger publishes once the catalog reaches 200 sleeper-tested songs; at current forge volume that is roughly 22 days from now. The methodology and the cron are both live: the cron runs daily at 04:00 UTC and writes its results to a JSONB column on the songs table.
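The time-to-publication estimate is plain arithmetic: remaining songs divided by the daily sleeper-test rate, rounded up. A minimal sketch follows; the function name and the example inputs are hypothetical and are not the catalog's actual counts.

```python
import math


def days_to_threshold(tested_so_far: int, tested_per_day: float, threshold: int = 200) -> int:
    """Whole days until `threshold` sleeper-tested songs exist at a steady daily rate.

    All inputs are illustrative assumptions; the page does not publish the
    current count or the current daily rate.
    """
    if tested_per_day <= 0:
        raise ValueError("tested_per_day must be positive")
    remaining = max(0, threshold - tested_so_far)
    return math.ceil(remaining / tested_per_day)


# Starting from zero at a steady 10 songs per day (illustrative inputs only):
print(days_to_threshold(0, 10))  # 20
```

The estimate moves with forge volume, so any concrete figure is only as good as the rate fed into it.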
Methodology (pre-registered)
Schedule: Daily cron at 04:00 UTC (public infra). Each run picks up to 10 songs forged 23–25 hours earlier that haven't been sleeper-tested yet.
Re-score conditions: COLD temperament (B2127 panel discipline) reading lyrics only — no title, no genre, no prior score, no room notes. The cold pass is deliberately context-stripped (B2128) so the panel reads as a stranger encountering the song for the first time.
Trust labels: |delta| <= 3 → stable (scored on signal). 3 < |delta| <= 8 → mild_drift. |delta| > 8 → momentum_score (the room got hot; the cold reader caught it).
Publication threshold: 200 sleeper-tested songs. Below 50: methodology only. From 50 to 199: early-signal aggregates with a calibration warning. At 200+: full public ledger, citable as a SongForgeAI standard artifact.
Public catalog only: The aggregate counts only public songs (is_public = true). Private songs are sleeper-tested by the same cron but are excluded from public statistics.
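The rules above can be sketched in a few small functions. This is an illustrative sketch under stated assumptions, not the production cron: the Song record shape, the function names, the phase names, the inclusive 23 and 25 hour window edges, and the oldest-first ordering are all hypothetical. Only the thresholds, the trust labels, the per-run cap of 10, and the is_public filter come from the methodology.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_PER_RUN = 10                      # "up to 10 songs" per cron run
WINDOW_MIN_H, WINDOW_MAX_H = 23, 25   # forged 23 to 25 hours earlier


@dataclass
class Song:
    """Hypothetical record shape; the real songs table is not documented here."""
    id: str
    lyrics: str
    forged_at: datetime
    day1_score: float
    is_public: bool
    title: str = ""
    genre: str = ""
    day2_score: Optional[float] = None  # None means not yet sleeper-tested


def select_candidates(songs, now):
    """Pick up to MAX_PER_RUN untested songs forged 23 to 25 hours before `now`."""
    earliest = now - timedelta(hours=WINDOW_MAX_H)
    latest = now - timedelta(hours=WINDOW_MIN_H)
    due = [s for s in songs
           if s.day2_score is None and earliest <= s.forged_at <= latest]
    due.sort(key=lambda s: s.forged_at)  # oldest first (ordering is an assumption)
    return due[:MAX_PER_RUN]


def cold_payload(song):
    """Context-stripped input for the cold pass: lyrics only.

    Title, genre, and the Day-1 score are deliberately left out.
    """
    return {"lyrics": song.lyrics}


def trust_label(day1_score, day2_score):
    """Map |delta| to the trust labels defined in the methodology."""
    delta = abs(day2_score - day1_score)
    if delta <= 3:
        return "stable"
    if delta <= 8:
        return "mild_drift"
    return "momentum_score"


def publication_phase(tested_count):
    """Phase names are hypothetical; the 50 and 200 cut-offs are from the text."""
    if tested_count < 50:
        return "methodology_only"
    if tested_count < 200:
        return "early_signal"
    return "full_ledger"


def public_label_counts(songs):
    """Count trust labels over public, sleeper-tested songs only."""
    counts = {"stable": 0, "mild_drift": 0, "momentum_score": 0}
    for s in songs:
        if s.is_public and s.day2_score is not None:
            counts[trust_label(s.day1_score, s.day2_score)] += 1
    return counts
```

Because the labels depend only on the absolute delta, the sketch does not need to assume whether drift is reported as Day-2 minus Day-1 or the reverse.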
Why this is the strongest external-validity claim a rubric can publish
Score-on-its-own-work is the credibility break in every AI scoring system: the model that wrote the song is reading the song. The Sleeper Test breaks that loop by making the second reading happen 24 hours later, under a different temperament, with the original context withheld. The delta between Day-1 and Day-2 is unfakeable by the forging system. It IS the song's holding power.
No other AI lyric tool publishes Day-2 drift statistics because no other AI lyric tool runs a Day-2 cold re-read against its own published rubric. The cron has been running since the migration was applied; this page is the artifact that converts the cron's output from a private telemetry blob into a citable public standard.
Page revalidates hourly. Cron writes daily at 04:00 UTC. Build 2158.
The standard
12-metric rubric · CC BY 4.0 · machine-readable JSON
Whitepaper
Formal methodology, anti-inflation rules, calibration corpus
Inter-rater agreement
Pre-registered methodology · 30-rater human cohort · ICC
Reproducibility seal
ed25519 signature spec · pubkey JSON · verifier
Changelog
Full version history + accepted RFCs
Version diff
Compare any two rubric versions side by side
Model card
Reference-implementation model disclosure
Prior art
Conservatory rubrics + MIR research + industry frameworks the open standard extends
Hit Calibration
Curated corpus of historically-significant songs scored by the rubric — proves it grades craft, not chart success.
Register-aware craft
SA#32 — 10 emotional registers, per-register Gravity Rule + Burden of Proof modifiers. Quality is not universal; register defines craft.