Three pre-LLM classics, scored against the rubric
What does a 95-point lyric actually look like? We score three canonical pre-LLM songs against the open Lyric Scoring Standard, with full per-metric breakdowns. The result is the calibration anchor every AI lyric tool quietly competes against.
The Lyric Scoring Standard is calibrated against a 1949 country song. Hank Williams's "I'm So Lonesome I Could Cry" is the canonical 95-point anchor. We picked it because its specificity, its arc, and its emotional truth operate in a band the modern AI-output corpus rarely touches.
This post does for two more pre-LLM classics what the calibration page does for Williams: full per-metric breakdown, honest scoring, no softballing. If you want to know what 90+ actually looks like, this is the proof.
Hank Williams — "I'm So Lonesome I Could Cry" (1949)
Composite: 95. Craft 92, Expression 96, Impact 95.
Why it scores: every line has a named anchor (whippoorwill, midnight train, falling star, robin), the imagery sequence builds an arc from sound to sight to silence, the meter holds without forcing, and the emotional truth is doing the heaviest lifting in the lyric — nothing is described that doesn't carry the speaker's loneliness. The Specificity score (98) is what most AI lyric output cannot reach because every concrete image is doing two jobs at once.
Where it loses points: Memorability is "only" 92 because the song's hook is the title, repeated — not a unique catchphrase the corpus could quote without context. The 92 is still S-band; the deduction is honest.
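To make the tier math above concrete, here is a minimal sketch of how a tiered rubric rolls per-metric scores up into tier scores and a composite. The tier names (Craft / Expression / Impact) and a few metric names come from this post; the metric grouping, the invented per-metric scores, and the equal tier weights are illustrative assumptions, not the Lyric Scoring Standard v1.1.0's actual definitions.

```typescript
// Hypothetical rollup sketch. Tier names follow the post; the metric
// grouping, per-metric scores, and weights are assumptions.

type Tier = "craft" | "expression" | "impact";

interface Metric {
  name: string;
  tier: Tier;
  score: number; // 0-100
}

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

// Tier score = mean of that tier's metrics; composite = weighted mean
// of the tier scores, rounded to the nearest integer.
function rollup(
  metrics: Metric[],
  weights: Record<Tier, number>
): { tierScores: Record<Tier, number>; composite: number } {
  const tierScore = (t: Tier): number =>
    mean(metrics.filter(m => m.tier === t).map(m => m.score));
  const tierScores: Record<Tier, number> = {
    craft: tierScore("craft"),
    expression: tierScore("expression"),
    impact: tierScore("impact"),
  };
  const tiers: Tier[] = ["craft", "expression", "impact"];
  const totalWeight = tiers.reduce((a, t) => a + weights[t], 0);
  const composite =
    tiers.reduce((a, t) => a + tierScores[t] * weights[t], 0) / totalWeight;
  return { tierScores, composite: Math.round(composite) };
}

// Invented per-metric scores chosen to land on the Williams tier
// scores quoted above (Craft 92, Expression 96, Impact 95).
const williams: Metric[] = [
  { name: "Specificity", tier: "craft", score: 98 },
  { name: "Prosody", tier: "craft", score: 86 },
  { name: "Voice", tier: "expression", score: 96 },
  { name: "Imagery Originality", tier: "expression", score: 96 },
  { name: "Memorability", tier: "impact", score: 92 },
  { name: "Wholeness", tier: "impact", score: 98 },
];

// Equal weights yield composite 94, not the quoted 95 -- a hint that
// the real rubric weights its tiers unequally.
const result = rollup(williams, { craft: 1, expression: 1, impact: 1 });
```

Note that equal weighting can't reproduce the published 95 from the published tier scores, which is exactly why three data points aren't enough to reverse-engineer the official weights.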
Joni Mitchell — "A Case of You" (1971)
Composite: 94. Craft 93, Expression 96, Impact 92.
Why it scores: the lyric is conducting a complicated emotional argument — "I could drink a case of you and I would still be on my feet" is simultaneously a flex, a confession, and a warning — and the rubric's Voice metric scores this kind of layered POV in the high 90s. The map / coaster / bar imagery cluster is rare-band Imagery Originality (94). The opening verse ("just before our love got lost you said...") plants the Arc pivot the rest of the lyric executes.
Where it loses points: Genre-Fit is 88 because the song doesn't sit cleanly inside the singer-songwriter conventions of its era; the rubric treats this as a strength of the song, but the metric's strict definition still takes a small deduction. Memorability is 92 — "I am a lonely painter, I live in a box of paints" is quotable but not chart-hook quotable. The Impact tier composite lands at 92 against the 96 Expression score; that 4-point gap is one most working songwriters would accept.
Leonard Cohen — "Hallelujah" (1984, original recording)
Composite: 96. Craft 94, Expression 97, Impact 96.
Why it scores: every verse is a different argument, each one carrying its own tonal stance, and yet the chorus refrain rebinds all of them. The rubric's Wholeness metric — "do the parts add up to a coherent whole?" — scores this in the 98-99 band, which we almost never see. The biblical-allusion + sexual-confession + secular-grace combination is doing simultaneous work the rubric weighs as the highest band of Voice + Imagery Originality.
Where it loses points: Prosody is 92 (still S-band) because the meter is deliberately variable — the rubric reads this as intentional, but the metric's strict definition deducts for variability. Memorability gets a 95 against an admittedly unfair benchmark: with this much cultural saturation, there is no clean way to measure the lyric per se apart from the accumulated memory of countless covers.
What this calibration tells us
Three songs, three different decades, three different genres, all scoring 94-96 against the same 12-metric rubric. The calibration corpus says: this is what 90+ looks like. The honest comparison for any AI-generated lyric isn't to the median chart pop song. It's to these.
The average AI-generated lyric we sampled in May 2026 scored a composite 52. The gap from 52 to 95 is 43 points. The Refine Mode pass typically moves a 65 to an 84 (a +19 lift). Even an aggressive Refine pass on a sub-50 input rarely crosses 80. The 90+ band, today, requires a human writer with the kind of lived specificity these three songs demonstrate.
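The arithmetic above, as a quick sanity check — every number below is quoted from this post, not measured independently:

```typescript
// The post's quoted figures: calibration anchor, sampled AI median,
// and the typical Refine Mode lift.
const anchorComposite = 95; // Williams calibration composite
const aiMedian2026 = 52;    // May 2026 sampled median composite
const refineLift = 19;      // typical Refine Mode improvement (65 -> 84)

const gap = anchorComposite - aiMedian2026; // 43-point gap
const refined = 65 + refineLift;            // 84
```

Even under the typical lift, a sub-50 input lands in the 60s, consistent with the claim that Refine rarely pushes such inputs past 80.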
The rubric is calibrated against this band on purpose. We don't want a tool that promises 90+ output and ships work that scores in the 70s. We want a tool that scores honestly, surfaces the gap, and helps a working songwriter close it.
Score these against your favorite AI tool
The rubric is open. The calibration entries above are reproducible: load @songforgeai/scoring-rubric, paste the lyrics, and score under v1.1.0. If your AI tool's "Hallelujah-tier" output scores in the same 90+ band, email it to us and we'll add it to the cited-by network at /cited-by. If it scores in the 70s while the tool calls that 90-tier, that gap is where the buyer-education work needs to happen.