Reference corpus · v0.1.1 for rubric v1.1.0

17 hand-scored exemplars.

Seed corpus for the Lyric Scoring Standard. 17 worked examples across the score spectrum, each annotated with composite + per-tier scores + the rationale for the score band. Use it to calibrate independent implementations of the rubric or to train students on what each band actually looks like.

License

CC BY 4.0. Attribution required. Cite the corpus by name + version when training, evaluating, or comparing against it.

Download JSON

Methodology

  1. Each entry is a hand-scored exemplar produced by the SongForgeAI evaluation pipeline (claude-sonnet-4-20250514 @ temperature 0.7) and validated against the Anti-Inflation rules in the standard.
  2. Lyrics are excerpted (8-16 lines) under fair-use criticism + commentary. Public-domain originals are used where available; AI-generated samples are clearly marked.
  3. Composite scores carry the same Gravity Rule, Burden of Proof, Antagonist Ceiling, and Historical Context anchors that the live API applies. A 90+ here is as rare as a 90+ in production.
  4. Per-tier scores (craft / expression / impact) are reported alongside the composite so implementers can verify their tier weights match the published 25/40/35 split.
  5. This is a SEED corpus. Future revisions add 50+ entries; v0.1.0 shipped the methodology + the first dozen so the format is stable.
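
The weight check in step 4 can be sketched as below. This is a minimal sketch, assuming the 25/40/35 split maps onto craft / expression / impact in that order; the name `weighted_composite` is hypothetical and not part of the standard. The published composites also carry the Anti-Inflation anchors from step 3, so a plain weighted mean is a sanity check on tier weights, not the scoring algorithm itself.

```python
# Assumed mapping of the published 25/40/35 split onto the three tiers.
TIER_WEIGHTS = {"craft": 0.25, "expression": 0.40, "impact": 0.35}


def weighted_composite(craft: float, expression: float, impact: float) -> float:
    """Plain weighted mean of the tier scores (no Anti-Inflation adjustments)."""
    return (
        TIER_WEIGHTS["craft"] * craft
        + TIER_WEIGHTS["expression"] * expression
        + TIER_WEIGHTS["impact"] * impact
    )


# Tier scores and published composite for corpus-003 (Hank Williams excerpt).
raw = weighted_composite(craft=93, expression=98, impact=94)
published = 95
print(f"weighted mean {raw:.2f} vs published {published}")
```

For corpus-003 the weighted mean is about 95.35, which rounds to the published 95. Some entries sit a point or more away from the plain mean (corpus-001's tiers 86/91/84 give about 87.3 against a published 88), so compare with a tolerance rather than expecting exact equality.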

The 17 reference entries

S · 95/100 · country · Quoted (fair use) · corpus-003

Reference: Hank Williams excerpt (40s — historical anchor)

Hear that lonesome whippoorwill
He sounds too blue to fly
The midnight train is whining low
I'm so lonesome I could cry
Craft 93 · Expression 98 · Impact 94

Three sound images (whippoorwill / midnight train / silence implied) build the loneliness through environment, not statement. The hook lands with structural inevitability — every preceding line earned it. Historical Context anchor confirms this as canon-level work; the 95 reflects the Burden of Proof being met by every line.

  • Sensory specificity carries the entire emotional weight
  • Hook earns its directness because the verse refused directness first
  • Anchors the corpus's S tier — implementations producing 95+ on weaker work have miscalibrated

"I'm So Lonesome I Could Cry" by Hank Williams (1949). Quoted under fair use for criticism + commentary.

S · 92/100 · r&b · AI-generated · corpus-017

Reference: R&B, S tier (forged, 92)

I learned the word 'sober' from her ex-husband
Who still came over for Thanksgiving every other year
She'd set the table for one extra plate
Like she was making sure the word kept its meaning
Craft 89 · Expression 95 · Impact 91

Multi-clause Specificity that carries philosophical weight without statement (the act of setting the extra plate IS the meaning of 'sober'). The narrator is implicated but not central — they're the witness. R&B genre profile rewards adult complexity; this excerpt clears it. Composite lands at 92 because the close earns its restraint and the excerpt's last line operates on three levels (literal table-setting, semantic-meaning preservation, narrator-as-archivist).

  • Operates on three semantic levels in the close (literal / linguistic / narrative)
  • S-tier requires earning the band — this excerpt does via the narrator's witness posture
  • Tests whether the scorer rewards multi-clause Specificity (a single specific image with embedded meaning) vs. single-clause Specificity (one named thing)

SongForgeAI R&B genre profile, post-gauntlet

A+ · 88/100 · country · AI-generated · corpus-001

Reference: Cigarettes & Promises (forged exemplar, 90s)

She kept her promises in a coffee can
Next to the matches and the kitchen sink
Every one I broke is in there
Folded smaller than I'd like to think
Craft 86 · Expression 91 · Impact 84

Specificity carries this — coffee can / kitchen sink / folded smaller is a concrete-image stack that earns its place. Voice consistency is strong; the narrator's complicity ("every one I broke") refuses self-pity. Scoring would push to 90+ with a stronger arc; the excerpt is verse-only so impact gets the floor on stickiness.

  • Specific household objects function as narrative containers, not decoration
  • Confession-without-apology pattern: blame stays with the narrator
  • Folding metaphor doubles as a scale of regret

SongForgeAI pipeline, B1180

A+ · 87/100 · rock · AI-generated · corpus-009

Reference: Rock, A+ (forged, 87)

I was raised on borrowed weather
A storm my father never named
I carry it like a borrowed coat
That fits the shoulders just the same
Craft 84 · Expression 90 · Impact 86

"Borrowed weather" is the kind of metaphor the rubric exists to reward — original, specific, structurally load-bearing. The repetition of "borrowed" earns its return through meaning shift. Rock genre profile rewards mythic lyrics; this excerpt qualifies. Composite holds at 87 because the excerpt's closing line is the one moment that lands closer to received craft than invention.

  • Metaphor that does work in two registers (literal + emotional)
  • Word repetition with meaning shift — rare in mid-tier output
  • Closing line is the only restraint on a higher composite

SongForgeAI rock genre profile

A+ · 86/100 · folk · AI-generated · corpus-015

Reference: Folk, A+ tier (forged, 86)

She wears my grandfather's wool coat to the bus
The one with the cigarette burn at the cuff
Says it's the only thing in this house that fits her
And I don't know what to say about that
Craft 84 · Expression 90 · Impact 84

Inherited-object specificity (grandfather's wool coat + cigarette burn at the cuff) carries Specificity to 90. The narrator's silence in the closing line is the Voice signal — refusing to resolve into a sentiment is the rubric's positive marker for Truth and Voice both. Folk genre profile rewards quiet domestic specificity; this excerpt qualifies. Composite holds at 86 because the excerpt is verse-only.

  • Inherited-object pattern (named relative + named flaw on the object)
  • Refusal-to-resolve closing line is what separates A+ folk from B+ folk on Voice
  • Tests whether the scorer rewards narrator restraint or penalizes it as 'incomplete'

SongForgeAI folk genre profile

A · 84/100 · indie · AI-generated · corpus-007

Reference: Indie folk, A tier (forged, 84)

I keep your old apartment key
In a drawer I never open
It isn't yours to want back
And it isn't mine to give
Craft 80 · Expression 88 · Impact 84

Restraint is the craft signal here. Object specificity (apartment key, drawer) creates emotional containment without being precious. The closing pair earns the score — the symmetry resolves a tension the verse refused to name. Strong A; doesn't quite clear A+ because the excerpt is verse-only.

  • Pattison-style anchor object sustains the entire emotional arc
  • Closing parallelism passes both anti-cliché + anti-tidiness gates
  • Excerpt-only scoring caps impact at 84

SongForgeAI indie genre profile

B+ · 78/100 · r&b · AI-generated · corpus-010

Reference: R&B, B+ (forged, 78)

I told you I was good at being gone
You took it as a brag, not a confession
Now you're packing up the car alone
And I'm the one who lost the lesson
Craft 80 · Expression 78 · Impact 76

Voice signal is the lead — the narrator is unreliable in a way the rubric rewards. "Brag, not a confession" is the kind of self-implicating turn that holds R&B to its truth-telling lineage. Strong B+; would clear A with a sharper image in the second half.

  • Self-implication earns Voice score
  • Second half drifts toward statement; loses image-density
  • R&B genre tolerance for confessional mode is being used correctly

SongForgeAI R&B genre profile

B+ · 78/100 · worship · AI-generated · corpus-014

Reference: Worship genre carve-out (forged, 78)

When the lights go down in the empty pew
I count the dust between the kneelers
You meet me in the patient places
In the silence between the hymn and the verse
Craft 80 · Expression 80 · Impact 74

Worship + gospel use the platitude pattern as part of the form (declarative theological statements). The Anti-Platitude rule has a per-genre carve-out (RFC-0002 future work) so this lyric is NOT penalized for the second-person address. Specificity stays high because of the named anchors (empty pew, dust, kneelers, hymn-verse boundary). Demonstrates how the rubric handles convention-heavy religious genres without flattening them to the floor.

  • Genre carve-out preview: worship CAN use direct address without platitude penalty when concrete anchors are present
  • Sensory specificity (dust between kneelers) does the load-bearing work
  • If your scorer applies the Anti-Platitude rule uniformly across genres, this entry exposes the bug

SongForgeAI worship genre profile
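
The carve-out that corpus-014 previews can be sketched as a single guard. This is a minimal sketch under stated assumptions: the function name, the genre set, and the idea of an upstream count of concrete anchors are all hypothetical, and the entry itself marks the carve-out as RFC-0002 future work, so the real rule may differ.

```python
# Genres assumed to be covered by the carve-out (the entry names worship + gospel).
CARVE_OUT_GENRES = {"worship", "gospel"}


def platitude_penalty_applies(genre: str, concrete_anchor_count: int) -> bool:
    """Return True if direct-address lines should take the platitude penalty.

    Assumes an upstream step has already counted concrete anchors
    (named places, objects, or moments) in the excerpt.
    """
    exempt = genre.lower() in CARVE_OUT_GENRES and concrete_anchor_count > 0
    return not exempt


# corpus-014 names four anchors (empty pew, dust, kneelers, hymn-verse boundary).
print(platitude_penalty_applies("worship", 4))
# A worship excerpt with no concrete anchors gets no exemption.
print(platitude_penalty_applies("worship", 0))
```

A scorer that ignores the genre argument here applies the rule uniformly, which is the failure this entry is meant to expose.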

B+ · 76/100 · country · AI-generated · corpus-004

Reference: Mid-tier B+ (forged, country, 76)

I learned to drive on Wednesday roads
Where everybody waves like they mean it
I didn't know not waving back
Meant I was choosing to be lonely
Craft 72 · Expression 81 · Impact 73

Specificity wins — "Wednesday roads" is the kind of micro-detail that sets a place. Voice is consistent. Held back from A range by structural quietness (no payoff turn within the excerpt) and a slightly didactic close. Solid demo of the B+ tier.

  • "Wednesday roads" earns the line by sounding like recognition, not invention
  • Close edges into telling rather than showing
  • Score reflects strong floor + missing ceiling

SongForgeAI pipeline mid-distribution sample

B · 71/100 · hip-hop · AI-generated · corpus-005

Reference: Hip-hop scaffold (forged, 71)

Mama kept the lights on with a smile she rented
Dad sent letters from the place where men get sentenced
I grew up between the postage and the payment
Learned that love was something measured by the basement
Craft 79 · Expression 70 · Impact 65

Slant rhyme (rented / sentenced, payment / basement) is precise without being forced; the genre demands this and the lyric delivers. Concrete domestic specificity holds expression. Impact pulled by abstract close ("basement" is metaphor-heavy and the excerpt doesn't earn it). Solid B with a clear path to B+ if the close grounds.

  • Rhyme intelligence is the standout signal
  • First two lines are clinical without being cold — hard to do
  • Final image needs grounding to clear the B ceiling

SongForgeAI hip-hop genre profile

B · 68/100 · pop · AI-generated · corpus-006

Reference: Pop chorus, mid-band (forged, 68)

We were almost everything
A chorus we forgot to sing
A promise written in the rain
That washed away before the chain
Craft 75 · Expression 60 · Impact 70

Craft floor lifts this — meter and rhyme are clean. Expression takes the hit: "promise written in the rain" + "washed away" is a doublet of generic moves the rubric flags. The hook ("chorus we forgot to sing") is the lyric's saving moment and earns the impact tier its score.

  • Hook line is genuine; the rest of the excerpt undersells it
  • Two consecutive AI tropes ding expression hard
  • Demonstrates the rubric's tolerance for one strong moment in an otherwise mid lyric

SongForgeAI pop genre profile

C+ · 64/100 · worship · AI-generated · corpus-008

Reference: Worship/CCM, mid-band (forged, 64)

I came in broken, You met me there
I lift my voice, You hear my prayer
Your grace runs deeper than I can see
From dust to glory You've called me
Craft 76 · Expression 52 · Impact 64

Genre-typical rhyme + meter are fine. Expression is held to the floor by phrase familiarity — "grace runs deeper" / "dust to glory" / "hear my prayer" are stock CCM constructions. The rubric does not penalize the genre's conventions automatically, but it does require specificity within them; this excerpt offers none.

  • Worship genre profile flags this exact pattern: CCM template assembly
  • Score would lift with one concrete moment (a place, a person, a dated experience)
  • Demonstrates how the rubric handles convention-heavy genres

SongForgeAI worship genre profile

C+ · 52/100 · pop · AI-generated · corpus-016

Reference: AI baseline mid-band (52, no intervention)

Driving down a backroad, windows down
Memories are playing, loud and proud
We were young and stupid, perfectly free
Now it's just a memory, that used to be me
Craft 65 · Expression 38 · Impact 53

Mid-band AI baseline. Three banned-term-adjacent moves (backroad-with-windows-down, young-and-stupid, used-to-be-me). Rhyme is functional; meter scans. Specificity dies on the second line ('memories are playing loud and proud' is two clichés in one phrase). Fits Observation #2 in /reports/state-of-ai-lyrics-2026: baseline AI lands in the 35-65 band without targeted intervention.

  • Anchors the corpus's 'baseline AI without intervention' band
  • Three cliché stacks identified: backroad/windows, young+stupid, used-to-be-me
  • Voice metric specifically flags the abstract closing — could be ANY narrator

Baseline GPT-4 output without forge pipeline

D+ · 42/100 · pop · AI-generated · corpus-002

Reference: Generic AI baseline (60s)

Tonight the city lights are calling out my name
I'm dancing through the rain like never the same
My heart is on fire, the stars all align
This moment forever, you're mine, only mine
Craft 58 · Expression 28 · Impact 40

Six AI clichés in four lines (city lights, dancing in the rain, heart on fire, stars align, forever / only mine). Rhyme intelligence is functional but pulls meaning toward the rhyme target rather than the truth. Specificity floor: every image is interchangeable with every other generic pop image. Anti-Inflation rule pulls expression hard.

  • Banned-term scanner would flag four entries
  • No proper nouns, no concrete objects, no time markers
  • Voice could be any narrator; sets no fingerprint

Pre-pipeline GPT-4 baseline for contrast

D+ · 38/100 · pop · AI-generated · corpus-013

Reference: Anti-Platitude calibration anchor (rubric v1.1.0)

I built the railing on the porch myself
Mixed the concrete in a wheelbarrow
All I really need is love
This is my truth, told slant
Craft 62 · Expression 22 · Impact 30

Demonstrates the v1.1.0 Anti-Platitude rule. The first two lines are concrete + Voice-positive (specific tools, specific work). The last two lines are textbook platitudes (universal-need + abstract+possessive). The rubric drops Specificity to 22 and Voice to 30 because the platitudes erase the work the concrete lines did. Implementers reading v1.1.0 should see this composite land in the 35-45 band; if they score above 50, the Anti-Platitude rule isn't firing.

  • Calibration anchor for the Anti-Platitude rule (v1.1.0)
  • Lines 1-2 alone would land in the high-60s; the platitudes drag the whole composite
  • Tests whether your scorer applies platitude detection at the section level rather than averaging

Synthetic example demonstrating the new Anti-Platitude rule (RFC-0002)
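
The corpus-013 check described above can be written as a small diagnostic. This is a minimal sketch: the function name and the three return labels are hypothetical; only the 35-45 band and the above-50 threshold come from the entry.

```python
def anti_platitude_check(composite: float) -> str:
    """Classify a scorer's composite for corpus-013 against the stated bands."""
    if 35 <= composite <= 45:
        return "in band"  # expected range for a v1.1.0 implementation
    if composite > 50:
        return "rule not firing"  # Anti-Platitude rule likely missing
    return "out of band"  # outside 35-45 but not past the 50 threshold


# Hypothetical composites from three different scorers.
for score in (38, 48, 57):
    print(score, anti_platitude_check(score))
```

A scorer that averages line-level scores tends to let the two concrete opening lines lift the result, so a "rule not firing" verdict is a prompt to check whether platitude detection runs over the whole section.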

D+ · 35/100 · pop · AI-generated · corpus-011

Reference: Floor case — forced rhyme + cliché (forged, 35)

My heart is in a vase that you can break
Love is just a game with stakes too high to take
Don't you let me fade into the night
I need your love to make my world feel right
Craft 42 · Expression 22 · Impact 40

Calibration anchor for the D band. Six clichés, two forced rhymes ("take" / "break", "night" / "right"), and an image ("heart in a vase") that doesn't survive examination. Implementations producing C or higher on this excerpt are over-scoring.

  • Anchor for the D-band: any rubric implementation should land here ±5
  • Vase metaphor breaks under any literal reading
  • Forced-rhyme detector + banned-term scanner both fire

Synthetic low-band example for calibration

F · 18/100 · pop · AI-generated · corpus-012

Reference: Bottom anchor (synthetic, 18)

Baby baby I love you so
Where you go I'll always go
Love me love me one more time
You're so fine and you are mine
Craft 22 · Expression 8 · Impact 24

Floor anchor. Functions as the F-band reference: rhyme is mechanical, every image is generic, voice is absent, no concrete detail anywhere. Implementations grading this above 25 have lost the Anti-Inflation rule.

  • Rubric calibration floor — any independent implementation should produce 15-25 here
  • Demonstrates that the rubric distinguishes between weak (35) and broken (18)
  • If your implementation can't reliably score this F, the Gravity Rule isn't being applied

Synthetic floor example for calibration

Help build the corpus.

Target for v1.0.0: 1000+ entries, hand-scored, every entry traceable. Quality gates published.

Read the contribution guide

How to use the corpus

Implementing the rubric in your own pipeline? Run your scorer over these 17 exemplars and compare against the published composite + per-tier numbers. Calibration anchors at the F-band (corpus-012, score 18) and S-band (corpus-003, score 95) are the load-bearing checks — an implementation that drifts more than 5 points on either has miscalibrated the Anti-Inflation rules.
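
The two load-bearing checks can be sketched as follows. This is a minimal sketch, assuming your scorer's composites are available as a dict keyed by corpus id; `check_anchors` and the dict layout are hypothetical, while the anchor scores and the 5-point tolerance come from the paragraph above.

```python
from typing import Dict, Tuple

# (published composite, allowed drift) for the two load-bearing anchors.
ANCHORS: Dict[str, Tuple[int, int]] = {
    "corpus-012": (18, 5),  # F-band floor
    "corpus-003": (95, 5),  # S-band ceiling
}


def check_anchors(my_scores: Dict[str, float]) -> Dict[str, float]:
    """Return {entry_id: drift} for anchors that drift more than the tolerance."""
    failures = {}
    for entry_id, (published, tolerance) in ANCHORS.items():
        drift = my_scores[entry_id] - published
        if abs(drift) > tolerance:
            failures[entry_id] = drift
    return failures


# Hypothetical scorer output: inflated floor, acceptable ceiling.
example = {"corpus-012": 31.0, "corpus-003": 93.0}
print(check_anchors(example))
```

An empty result means both anchors are within tolerance; a positive drift on corpus-012 points at score inflation at the floor.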

See the model card for runtime mechanics →