Motivation
The internal eval scores M11 (Memorability) based on the model's IMMEDIATE impression of memorability potential. That is a fundamentally different question from "what does the model remember after time has passed?" — and "what stays remembered" is the actual property the metric is supposed to measure.
The Hum Test cron (B1303) closes that gap empirically. Every hour the cron picks ~10 songs forged 23-25 hours earlier and re-runs a minimal-context M11 score. The 24-hour-delayed score is now written to the `hum_score` column for every eligible song.
After the first ~24h of cron runs, the data is real but the **process for using it** is not yet defined. This RFC defines that process.
Proposal
Definition of the calibration signal
For any song with both a fresh M11 (in `eval_data->metrics` where `shortName = 'memorability'`) and a Hum score (`hum_score` column), define:
> `m11_delta = hum_score - fresh_m11`
A negative delta means the fresh M11 over-scored memorability relative to the 24-hour re-read.
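
As a minimal sketch of the definition above, the delta for one song could be computed as follows. The `SongRow` and `MetricEntry` shapes, the `score` field name, and the `m11Delta` helper are assumptions made for illustration; the RFC itself only names `eval_data->metrics`, `shortName = 'memorability'`, and the `hum_score` column.

```typescript
// Hypothetical shape of one entry in `eval_data->metrics`. Only `shortName`
// appears in the RFC; the `score` field name is an assumption.
interface MetricEntry {
  shortName: string;
  score: number;
}

// Hypothetical row shape: `metrics` mirrors `eval_data->metrics`, and
// `humScore` mirrors the `hum_score` column (null until the cron has scored it).
interface SongRow {
  id: string;
  metrics: MetricEntry[];
  humScore: number | null;
}

// Returns hum_score - fresh_m11, or null when either value is missing.
function m11Delta(song: SongRow): number | null {
  const fresh = song.metrics.find((m) => m.shortName === "memorability");
  if (fresh === undefined || song.humScore === null) return null;
  return song.humScore - fresh.score;
}

const example: SongRow = {
  id: "song-1",
  metrics: [{ shortName: "memorability", score: 82 }],
  humScore: 70,
};

console.log(m11Delta(example)); // -12: the fresh M11 over-scored by 12pts
```

Returning `null` rather than `0` for songs missing either value keeps them out of the corpus instead of pulling the median toward zero.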
Calibration thresholds
For the rolling 30-day corpus of songs with both values:
- **`|median(m11_delta)| ≤ 5pts`** → no action; rubric calibrated.
- **`|median(m11_delta)| in (5, 10]pts`** → record as drift in the Quality Council log. No rubric change yet.
- **`|median(m11_delta)| > 10pts`** → triggers an obligation: the next quarterly rubric bump (per RFC-0001 cadence) MUST address M11 by doing one or both of the following:
  1. Adjust the M11 prompt to bias toward 24h-survival characteristics (chorus repetition, lyrical hook, vowel structure).
  2. Reweight M11's contribution to the composite.
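
The three bands above can be sketched as a small classifier over the corpus of deltas. This is an illustration only: the `median` and `classifyDrift` helpers and the `CalibrationStatus` labels are hypothetical names, and throwing on an empty corpus is an assumption, since the RFC does not say what happens when no song has both values.

```typescript
// Hypothetical labels for the three bands defined by the thresholds.
type CalibrationStatus = "calibrated" | "drift" | "rubric-change-required";

// Median of a non-empty list; averages the two middle values for even lengths.
function median(values: number[]): number {
  if (values.length === 0) {
    // Assumption: the RFC does not define behaviour for an empty corpus.
    throw new Error("median of an empty corpus is undefined");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Maps the absolute median delta onto the bands: <= 5, (5, 10], > 10.
function classifyDrift(deltas: number[]): CalibrationStatus {
  const drift = Math.abs(median(deltas));
  if (drift <= 5) return "calibrated";
  if (drift <= 10) return "drift";
  return "rubric-change-required";
}

console.log(classifyDrift([-3, 2, -4]));      // "calibrated" (median -3)
console.log(classifyDrift([-8, -6, -7, -12])); // "drift" (median -7.5)
console.log(classifyDrift([-12, -15, -11]));  // "rubric-change-required" (median -12)
```

Because the bands test the absolute value, a rubric that under-scores memorability by 12pts lands in the same band as one that over-scores by 12pts.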
Surface
The `/admin/hum-test` dashboard (B1339) already surfaces this exact statistic, the median delta. Its existence is part of the calibration contract: the operator can see at any time whether the median is approaching a threshold.
Reproducibility
The seal field on every `/api/v1/score` response carries `rubricVersion`. A calibration-driven adjustment ships with a MINOR version bump per RFC-0001 when the golden-eval delta is under 5pts, or with a MAJOR bump when the M11 reweight changes the composite by more than 5pts on golden-eval.
Out of scope
- The Hum Test prompt itself (already shipped + pinned in `src/app/api/cron/hum-test/route.ts`; this RFC takes it as given).
- A more sophisticated longitudinal model (multiple time windows: 72h, 1-week, 30-day). Future work — would need a separate cron + a separate RFC.
- Per-genre calibration. The current Hum signal aggregates across all genres. If genre-specific drift emerges, a future RFC will propose splitting.
Comment window
This RFC is open for comment until 2026-05-03. Email support@songforgeai.com with the subject `RFC-0003` to leave a comment.
Resolution
(Pending — will be filled in after 2026-05-03 with a summary of comments received and the accepted text. The thresholds above are the working defaults until then.)