Behind the Scenes · 2026-04-24 · 6 min read · By Todd Nigro

Why the default score is 50, not 75

Most AI evaluators grade like a flattering tutor. Ours starts every song at 50 and makes the lyric earn its way up. The Gravity Rule, explained.

If you ask a large language model to score a song lyric on a 0-to-100 scale, you will almost always get a number between 75 and 90.

It doesn't matter what the lyric is. It could be a masterpiece. It could be word salad. It could be twelve lines of pure cliché assembled by an intern. The average LLM, unprompted, will grade them all somewhere in the same narrow band around 82. That number carries the shape of a compliment. It will mean nothing.

This is the first problem any serious AI lyric evaluator has to solve. And it's the reason the Lyric Scoring Standard v1.0 starts every song at 50, not at 75. We call the rule that enforces this the Gravity Rule, and everything about how our scoring actually measures quality hinges on it.

Why models default to flattery

LLMs are trained on oceans of human feedback, and human feedback about creative work skews positive. People praise. People encourage. Reviewers in datasets are instructed to be helpful, which almost always reads as "be kind." The model learns that a safe, helpful response to "rate this lyric" is a number in the upper range with a softened observation or two.

When you give that model a rubric with a 0-100 scale and no constraints, it treats the rubric as suggestive, not binding. It reaches for 80 the way a nervous teacher reaches for a B+. An 80 is high enough that the writer feels seen, low enough that the model has room to seem discerning. It is a number engineered to avoid confrontation, not a number engineered to measure craft.

The result is the grade inflation we all know from school, imported into AI evaluation. An A-minus for everything. A model that cannot tell you which of two drafts is stronger, because both get 83.

The Gravity Rule, in one sentence

Every song starts at 50, and only rises when the lyric has earned it.

Fifty is not a bad score. Fifty is the gravitational center. It means the lyric has not yet demonstrated anything beyond existing. It has not yet earned specificity, emotional weight, or craft. The evaluator's job is not to find reasons to be nice. Its job is to watch the lyric work for points.

Every metric begins at 50. The evaluator reads the lyric and, for each of the twelve metrics, asks: what does this lyric do, on this axis, that is better than nothing? If the answer is "nothing specific, just adequate language," the score stays near 50. If the answer is "there is one genuinely specific image here," the score moves to 60. If the answer is "the imagery is consistent, the voice is distinct, and one line would be remembered months later," the score climbs toward 80. Ninety only lands when a lyric does something a model cannot fake: a level of truth, specificity, or craft that reads as lived.
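The mechanics of that climb can be sketched in a few lines. This is an illustrative sketch, not the production rubric; the point values and the `evidence_points` input are hypothetical, standing in for whatever the evaluator actually records:

```python
# Sketch of the Gravity Rule: every metric starts at 50 and only
# rises when the evaluator has recorded earned evidence.
GRAVITY = 50  # the gravitational center; "not yet demonstrated anything"

def score_metric(evidence_points: int) -> int:
    """Start at gravity; each earned point of evidence lifts the score,
    capped at 100."""
    return min(100, GRAVITY + evidence_points)

# A lyric with no demonstrated strengths stays at the default.
print(score_metric(0))   # 50
# One genuinely specific image might be worth ~10 points.
print(score_metric(10))  # 60
```

The important property is the default: with no evidence, the function returns 50, not 80. Flattery would require the evaluator to invent points it cannot justify.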

Burden of Proof

The Gravity Rule's companion is what we call Burden of Proof. For any metric to score above 70, the evaluator must be able to cite specific line numbers or phrases from the lyric that justify the score. No handwaving. No "feels strong." The evidence lives in the text or the score doesn't move.

This matters more than it sounds. It means a lyric cannot be rewarded for almost being specific. It cannot be rewarded for suggesting an emotional center. It has to put the specificity on the page, visible, quotable, scoreable. A model that has to cite evidence cannot drift into politeness.
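As a rule, Burden of Proof is just a gate on the score. A minimal sketch, with a hypothetical `cited_lines` argument standing in for whatever evidence format the evaluator emits:

```python
# Sketch of Burden of Proof: a metric score above 70 is only accepted
# if the evaluator attached quotable evidence from the lyric itself.
def enforce_burden_of_proof(score: int, cited_lines: list[str]) -> int:
    """Pull an unevidenced score back down to 70; evidence lets it stand."""
    if score > 70 and not cited_lines:
        return 70
    return score

# "Feels strong" with nothing quoted does not clear 70.
print(enforce_burden_of_proof(84, []))                        # 70
# A quoted phrase from the text lets the score stand.
print(enforce_burden_of_proof(84, ['"rust on the porch swing"']))  # 84
```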

Antagonist Ceiling

We added a third rule for the hard cases. When a lyric is technically competent but has a genuine craft problem (say, a chorus that collapses the narrative voice, or a verse that drowns in cliché), the evaluator is instructed to actively cap the score on the affected metric at 65. No upward drift, even if the rest of the lyric is solid. One broken metric, one capped score.

We call this the Antagonist Ceiling. The point is to stop excellence in one dimension from covering for weakness in another. A lyric with a beautiful chorus and a broken second verse should not score the same as a lyric with a beautiful chorus and a strong second verse. The ceiling makes that difference visible.
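In code terms, the ceiling is a hard `min`, not a deduction. A sketch, assuming a simple boolean flag for "genuine craft problem detected on this metric":

```python
# Sketch of the Antagonist Ceiling: a flagged craft problem caps the
# affected metric at 65, no matter how high it would otherwise score.
CEILING = 65

def apply_antagonist_ceiling(score: int, has_craft_flaw: bool) -> int:
    """One broken metric, one capped score; no upward drift."""
    return min(score, CEILING) if has_craft_flaw else score

# A chorus that collapses the narrative voice caps the metric.
print(apply_antagonist_ceiling(82, True))   # 65
# The same score stands when no flaw is flagged.
print(apply_antagonist_ceiling(82, False))  # 82
```

Because the cap is absolute rather than a point deduction, excellence elsewhere in the lyric cannot buy the score back, which is exactly the behavior the rule is meant to guarantee.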

Historical Context Anchor

The fourth anti-inflation rule is the quietest one, and maybe the most important. It reminds the evaluator, before scoring, of what historically great lyrics look like. Not as a threshold to match. As a reminder that 90 is a real number that means something.

"A Case of You" scores 95 or so on this rubric. "Hallelujah" scores in the mid-90s. "Hurt" (the Trent Reznor version; the Cash cover earns a higher performance grade but the written lyric scores the same) sits around 93. These are not the ceiling. They are the reference. When the evaluator is tempted to give a contemporary AI draft a 92, the Historical Context Anchor is there to ask: is this lyric actually as good as "A Case of You"?

The answer is almost always no. And the score moves to where it belongs.

What 50-as-default feels like in practice

When a user runs their first forge on SongForgeAI and sees a 73, they are sometimes surprised. They expected an 87 because every other AI tool they've ever used returns an 87.

The 73 is real. It means the lyric earned 23 points above gravity. That is a genuinely solid draft. The metrics at the bottom of the stack (often Narrative Arc, or Memorability) show where the lyric has not yet earned its keep. Those are the places where the writer has work to do.

When the same user runs the refine pipeline and the composite moves to 79, they can see the specific metrics that moved. They can read the gauntlet's notes and see what the revision targeted. The number has become a tool, not a compliment.

Getting above 85 is hard. Above 90 is rare. Above 93 is something we see a few times a week across thousands of songs. That rarity is the system working as intended. It means the number carries information.

The larger point

An AI evaluator that cannot distinguish between drafts is not an evaluator. It is a flattery engine wearing the costume of one. The Gravity Rule exists to strip the costume off.

We published the rubric under CC BY 4.0 in part because we think the songwriting community deserves a scoring standard that refuses to inflate, and we think the way to earn trust in such a standard is to let anyone audit the rules. The Gravity Rule, Burden of Proof, the Antagonist Ceiling, and the Historical Context Anchor are all in the whitepaper. They're there because a score that cannot be argued with cannot be trusted.

If the number on your lyric feels low, that is not the evaluator being harsh. That is the evaluator telling you what you haven't earned yet. The next draft is where you earn it.

That's the deal, and the whole point.