Product · 2026-04-23 · 5 min read · By Todd Nigro

Publishing the Lyric Scoring Standard v1.0

An open rubric for measuring whether an AI lyric feels alive. 12 metrics, three weighted tiers, four anti-inflation rules. CC BY 4.0. Published.

I kept asking AI to write me a song, and it kept writing me a résumé for a song.

The verses were structured. The choruses rhymed. The emotional vocabulary was acceptable. And every single one felt like it had been assembled rather than written. I would read back a lyric about heartbreak and notice that no specific heart had actually broken.

That problem is why SongForgeAI exists. It's also why we're publishing the Lyric Scoring Standard v1.0 today: an open rubric, under CC BY 4.0, for measuring whether a song lyric feels lived.

What the standard does

It scores a lyric across 12 metrics organized into three weighted tiers:

  • Craft (25%) — Prosody & Musicality, Structural Architecture, Rhyme Intelligence, Economy of Language.
  • Expression (40%) — Sensory Specificity, Image System Coherence, Emotional Truth, Narrative Voice.
  • Impact (35%) — Memorability, Singability, Replay Value, Category Fit.

Expression gets the highest weight because it's where AI lyrics most often fail. A technically solid lyric with weak expression may function. It rarely moves anyone. The four Expression metrics are the hardest for a language model to fake and the easiest for a listener to miss, until they notice the song has left them cold.
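The tier weighting above can be sketched as a simple weighted average. This is an illustrative sketch, not the standard's reference implementation; the function and key names are assumptions.

```python
# Hypothetical sketch of the three-tier weighted composite.
# Tier weights come from the standard; everything else is illustrative.

TIER_WEIGHTS = {"craft": 0.25, "expression": 0.40, "impact": 0.35}

def composite(tier_scores: dict[str, float]) -> float:
    """Weighted average of tier scores, each on a 0-100 scale."""
    return sum(TIER_WEIGHTS[tier] * score for tier, score in tier_scores.items())

# A lyric with solid craft but weak expression lands near average:
# 0.25*60 + 0.40*40 + 0.35*55 = 15 + 16 + 19.25 = 50.25
score = composite({"craft": 60, "expression": 40, "impact": 55})
```

Because Expression carries 40% of the weight, a weak Expression tier drags the composite down harder than an equally weak Craft tier would.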

Four anti-inflation rules

AI evaluators drift upward. They're trained to be helpful, and helpfulness without skepticism becomes flattery. Left alone, a general-purpose LLM judge produces scores in the 70s and 80s for lyrics that are merely coherent. The standard uses four rules to prevent that drift:

  • The Gravity Rule. The default score is 50, not 75. A lyric starts at average and earns excellence through evidence.
  • Burden of Proof. Any score above 75 must cite specific evidence in the lyric. No unsupported high scores.
  • Antagonist Ceiling. The composite can't outrun the weakest tier. A song with beautiful imagery and broken structure is not excellent.
  • Historical Context Anchor. 90+ means rare achievement, not "nice first draft."

Two examples

The whitepaper includes two before/after chorus rewrites — one heartbreak, one worship — that demonstrate how the standard exposes the difference between abstraction and specificity. Same emotional territory, yet the rewrites score 37 and 33 points higher than the originals. No new vocabulary required. The revision simply answered one question: what is in the room?

Why publish this

A scoring system that no one can audit is a scoring system no one should trust. This standard goes out under CC BY 4.0 so that other AI lyric tools, songwriting educators, and researchers can use it, adapt it, argue with it, and break it. A standard that nobody argues with is a standard nobody is using.

If you implement it in your own tool, preserve the three-tier weighting, the 12 metrics, and the four anti-inflation rules. Without those, the system is no longer the same standard and should be renamed. Everything else is yours to modify.

What's next

We'll publish periodic calibration reports as our reference corpus grows enough for statistically meaningful percentile reporting. Each will disclose corpus size, genre distribution, score distribution, and any changes to the percentile bands.

If you find a place where the rubric fails — a genre it mis-scores, a craft rule it misses, a bias baked into the anti-inflation rules — tell me. The standard is version 1.0 because I expect version 1.1.

Read & download

A score is a mirror. The song is still yours to make.

— Todd Nigro, SongForgeAI. April 23, 2026. v1.0 · CC BY 4.0.