Motivation

The internal eval scores M11 (Memorability) based on the model's IMMEDIATE impression of memorability potential. That is a fundamentally different question from "what does the model remember after time has passed?" — and "what stays remembered" is the actual property the metric is supposed to measure.

The Hum Test cron (B1303) closes that gap empirically. Every hour the cron picks ~10 songs forged 23-25 hours earlier and re-runs a minimal-context M11 score. The 24-hour-delayed score is now written to the `hum_score` column for every eligible song.

After the first ~24h of cron runs, the data is real, but the **process for using it** is not yet defined. This RFC pins it down.

Proposal

Definition of the calibration signal

For any song with both a fresh M11 (in `eval_data->metrics` where `shortName = 'memorability'`) and a Hum score (`hum_score` column), define:

> `m11_delta = hum_score - fresh_m11`

A negative delta means the fresh M11 over-scored memorability relative to the 24-hour re-read.
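As a minimal sketch of the definition above: the `ScoredSong` shape and the `freshM11` / `humScore` field names are hypothetical stand-ins for the `memorability` entry in `eval_data->metrics` and the `hum_score` column, and the sample scores are illustrative only.

```typescript
// Hypothetical row shape: freshM11 stands in for the 'memorability' metric
// in eval_data->metrics, humScore for the hum_score column.
interface ScoredSong {
  id: string;
  freshM11: number | null;
  humScore: number | null;
}

// m11_delta = hum_score - fresh_m11.
// Returns null when either score is missing, because the RFC only defines
// the delta for songs that have both values.
function m11Delta(song: ScoredSong): number | null {
  if (song.freshM11 === null || song.humScore === null) return null;
  return song.humScore - song.freshM11;
}

// Illustrative values only: a fresh M11 of 82 that re-reads as 70 after 24h.
const sampleDelta = m11Delta({ id: "song-1", freshM11: 82, humScore: 70 });
console.log(sampleDelta); // -12, i.e. the fresh M11 over-scored
```

Returning `null` rather than `0` for incomplete rows keeps songs without a Hum score from being mistaken for perfectly calibrated ones.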

Calibration thresholds

For the rolling 30-day corpus of songs with both values:

**`|median(delta)| ≤ 5pts`** → no action; rubric calibrated.
**`|median(delta)| in (5, 10]pts`** → record as drift in the Quality Council log. No rubric change yet.
**`|median(delta)| > 10pts`** → triggers an obligation: the next quarterly rubric bump (per RFC-0001 cadence) MUST address M11 by doing one of the following:
1. Adjust the M11 prompt to bias toward 24h-survival characteristics (chorus repetition, lyrical hook, vowel structure), OR
2. Reweight M11's contribution to the composite, OR
3. Both.
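The three bands can be sketched as a small classifier over the corpus of deltas. This is an illustration under assumptions, not the dashboard's actual code: the `Band` labels and function names are hypothetical, the input is assumed to be already filtered to the rolling 30-day window, and throwing on an empty corpus is a choice this sketch makes because the RFC does not cover that case.

```typescript
// Hypothetical labels for the three bands defined above.
type Band = "calibrated" | "drift" | "rubric-obligation";

// Plain median; for an even count, the mean of the two middle values.
function median(values: number[]): number {
  if (values.length === 0) {
    // Assumption: the RFC does not say what to do with an empty corpus.
    throw new Error("median of an empty corpus is undefined");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Maps |median(delta)| onto the bands: <= 5, (5, 10], > 10.
function classifyCorpus(deltas: number[]): { median: number; band: Band } {
  const m = median(deltas);
  const abs = Math.abs(m);
  const band: Band =
    abs <= 5 ? "calibrated" : abs <= 10 ? "drift" : "rubric-obligation";
  return { median: m, band };
}

// Illustrative deltas only: the median is -8, which falls in the drift band.
console.log(classifyCorpus([-12, -8, -3, -7, -9]));
```

Because the bands use the absolute value of the median, under-scoring and over-scoring of the same magnitude land in the same band; the sign of the returned median still shows the direction.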

Surface

The /admin/hum-test dashboard (B1339) already surfaces this exact statistic. Its existence is part of the calibration contract — the operator can see at any time whether the threshold is approaching.

Reproducibility

The seal field on every `/api/v1/score` response carries `rubricVersion`. When a calibration-driven adjustment ships, it gets a MINOR version bump per RFC-0001 (delta < 5pts on golden-eval) or a MAJOR bump if the M11 reweight changes the composite by more than 5pts on golden-eval.
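The bump rule can be read as a simple decision function. The sketch below rests on two assumptions that the text does not state outright: that the MINOR and MAJOR conditions refer to the same golden-eval composite delta, and that the comparison is on its magnitude. The function name and the `"unspecified"` result are hypothetical; a delta of exactly 5pts is reported as unspecified because neither condition covers it.

```typescript
// Hypothetical result type; "unspecified" marks the case the text leaves open.
type Bump = "MINOR" | "MAJOR" | "unspecified";

// goldenEvalDeltaPts: change in the composite on golden-eval, in points
// (assumed to be the quantity both conditions refer to).
function bumpFor(goldenEvalDeltaPts: number): Bump {
  const abs = Math.abs(goldenEvalDeltaPts);
  if (abs < 5) return "MINOR"; // delta < 5pts on golden-eval
  if (abs > 5) return "MAJOR"; // composite changes by more than 5pts
  return "unspecified"; // exactly 5pts: not covered by either condition
}

console.log(bumpFor(2.5)); // "MINOR"
console.log(bumpFor(-7)); // "MAJOR"
```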

Out of scope

The Hum Test prompt itself (already shipped and pinned in `src/app/api/cron/hum-test/route.ts`; this RFC takes it as given).
A more sophisticated longitudinal model (multiple time windows: 72h, 1-week, 30-day). Future work — would need a separate cron + a separate RFC.
Per-genre calibration. The current Hum signal aggregates across all genres. If genre-specific drift emerges, a future RFC will propose splitting.

Comment window

This RFC is open for comment until 2026-05-03. Email support@songforgeai.com with the subject `RFC-0003` to leave a comment.

Resolution

**Accepted as-written, 2026-05-03.**

Comment window closed without proposed amendments. One clarifying question received via support@songforgeai.com:

Q: How does the hum-test handle songs with very short M11 history (< 5 prior hum-scored songs)? Does the calibration threshold still apply?

A: The calibration thresholds (`|median delta|` bands at 5 and 10pts) operate on the rolling 30-day corpus, not on per-song history. A song with low individual M11 history just doesn't contribute much weight to the median; the bands work the same. We considered adding a per-song confidence interval but decided that lives in the user-visible hum-score presentation layer, not in the calibration policy itself.

The thresholds are now the canonical calibration policy. The /admin/hum-test dashboard surfaces the rolling 30-day median in real time; Quality Council reviews it on the standard cadence.