Songwriting · 2026-05-05 · 6 min read · By Todd Nigro

Reading your lyric score: what each band actually means

Most AI lyric tools score everything 80+. Ours doesn’t. A 60 in our rubric is above average; a 75 is genuinely strong; an 85 is rare. Here is how to read the number you got.

Published 2026-05-05. This is the buyer-friendly companion to the Lyric Scoring Standard. The whitepaper is the canonical reference; this post is what to read first.

The headline most AI tools won’t tell you

Most AI lyric tools that surface a score at all calibrate it to make their users feel good. You write something, the model evaluates it, and the number comes back at 87. You write something objectively worse and it comes back at 84. Both numbers are inflated against any honest reading of the lyric, and you can't tell which output was actually closer to ready.

We do it differently. The Lyric Scoring Standard we publish at /scoring/standard uses a rule called the Gravity Rule: 50 is the median of human-written popular music, not the median of AI output. A song scoring 50 is competent and unsurprising. A song scoring 65 is doing something the median doesn’t. A song scoring 85 is in career-grade territory.

This means most of your forged songs will land between 60 and 78. That’s the shape of the actual distribution; the rubric is honest about it. Here’s how to read each band.

What each band actually says

Below 50: the song isn’t finished

The lyric has a cliché cluster, a structural problem (verses of wildly different lengths, a chorus that doesn't repeat), or a vagueness problem (every image could belong to a different song). The score is telling you the lyric isn't ready to record yet, regardless of how it sounds in your head.

50–64: competent, average, recognizable

This is most AI output. The lyric works mechanically; nothing’s broken; you could put it through Suno and it would sing. But it sounds like a hundred other songs because it is a hundred other songs averaged together. If you ship at this level, you’re not doing the part that distinguishes a writer from a generator.

65–74: above the median, room to grow

One specific image is doing real work. The chorus has a returnable phrase. There’s a hook that feels like the song’s own. This is where Refine Mode earns its keep — lock the lines that work, raise the rest one band.

75–84: genuinely strong

Top quartile. Multiple specific images, a coherent emotional through-line, a chorus that earns its repetition. If a working songwriter scored a draft at 75–79 they'd send it to a co-writer; at 80–84 they'd save it to record. This is the band where AI assistance starts producing material a human writer can sign their name to.

85–89: career-grade, rare

Roughly 14% of forge runs land here, and most of those took two or more revision passes. The lyric has a specific anchor detail that wasn’t in the prompt — the model added something. It avoids every cliché cluster. It has a structural move (a turn in the bridge, an unexpected POV shift) that pulls the song above the average of its training distribution.

90+: rarer than the marketing of every AI tool implies

3% of runs. A 90 is the Lyric Scoring Standard's way of saying the lyric is in the same room as canonical popular music. We've audited the corpus, so we know what 90 means: if a song scores 90, it's because the eval panel could find five different reasons to push the score up, not because the prompt said "please be amazing."
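
If it helps to see the bands in one place, here's a minimal sketch that maps a composite score to the band labels above. The boundaries are the ones from this post; the function itself is ours, purely for quick reference, not part of the product.

```typescript
// Illustrative only: maps a composite score to the band labels used in this post.
// The boundaries come from the sections above; the function is hypothetical.
function scoreBand(score: number): string {
  if (score >= 90) return "90+: same room as canonical popular music";
  if (score >= 85) return "85–89: career-grade, rare";
  if (score >= 75) return "75–84: genuinely strong";
  if (score >= 65) return "65–74: above the median, room to grow";
  if (score >= 50) return "50–64: competent, average, recognizable";
  return "Below 50: the song isn't finished";
}

console.log(scoreBand(78)); // "75–84: genuinely strong"
```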

How to read the per-metric breakdown

The composite score is one number; the rubric breaks it into 12 metrics across three tiers (Craft 25%, Expression 40%, Impact 35%). On your dashboard, click into any score chip and you’ll see all 12.
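
To make the weighting concrete, here's a minimal sketch of how a tiered composite like this comes together. The tier weights are the ones above; how the 12 metrics roll up into each tier score isn't specified here, so the sketch takes the three tier scores as given.

```typescript
// Hypothetical sketch: the composite as a weighted sum of the three tier scores,
// using the weights stated in this post. How the rubric aggregates its 12 metrics
// into each tier score is not shown here.
type TierScores = { craft: number; expression: number; impact: number };

function composite({ craft, expression, impact }: TierScores): number {
  return 0.25 * craft + 0.4 * expression + 0.35 * impact;
}

// Strong Expression moves the composite more than strong Craft,
// because Expression carries 40% of the weight.
console.log(composite({ craft: 70, expression: 80, impact: 72 })); // 74.7
```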

The metric to watch most carefully is Specificity. AI’s default failure mode is vagueness — emotional categories instead of concrete scenes. A high Specificity score (80+) is the strongest single signal that the lyric earned its band. A low Specificity score (below 60) on an otherwise high-scoring song is a warning that the eval panel was charitable somewhere it shouldn’t have been; we know about this drift and we’re narrowing it.

The metric most users misread is Memorability. AI lyric tools love to score memorability high because the chorus repeats — that’s not what the metric measures. Memorability asks whether the line is so specific you can’t replace it with another lyric without losing the song. The chorus of a cliché ballad isn’t memorable; it’s recognizable. Recognizable is a 50.

When to stop revising

The Refine + Re-score loop in our forge is good. It's not infinite. Diminishing returns kick in around the 80 band; getting from 82 to 84 takes about as much work as getting from 60 to 72. If you're iterating on the same song past the third revision pass and the band hasn't moved, the issue is structural: the prompt or the section markers, not the line-by-line word choice. Step back, revise the prompt, re-forge the section, then re-enter Refine.
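
As a rule of thumb, that stopping logic reads something like the sketch below. The three-pass threshold comes from this post; the function name and the band-as-number encoding are illustrative, not product code.

```typescript
// Illustrative heuristic distilled from the advice above.
// "band" here is the lower bound of the score band (50, 65, 75, 85, 90).
function nextStep(passCount: number, bandAtPassOne: number, bandNow: number): string {
  if (bandNow > bandAtPassOne) return "keep refining"; // the band has moved
  if (passCount < 3) return "keep refining";           // still early passes
  // Past three passes with no band movement: the problem is structural.
  return "revise the prompt or section markers, re-forge, then re-enter Refine";
}
```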

Songs that are worth the seventh pass are rare and you’ll know them when you have them.

The receipt

Every score on this site carries a reproducibility seal: rubric version, model version, temperature, build SHA. It’s in the URL of the share page; it’s in the API response if you call /api/v1/score directly; it’s on the JSON download. If you ever wonder whether your 78 was scored against the same rubric as your friend’s 81, the seal answers it.
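
For example, if you pull a score over the API, the seal travels with the response. A minimal sketch, assuming a `seal` object in the JSON: the endpoint path is from this post, but the host, the query parameter, and the field names are our guesses at the shape, not documented API.

```typescript
// Sketch only: /api/v1/score is from the post; the host is a placeholder and the
// response fields (seal, rubricVersion, modelVersion, temperature, buildSha)
// are assumptions about the JSON shape.
async function fetchSeal(songId: string): Promise<string> {
  const res = await fetch(`https://example.com/api/v1/score?song=${songId}`);
  const { seal } = await res.json();
  // Two scores are only comparable if every field of the seal matches.
  return `${seal.rubricVersion}/${seal.modelVersion}/t=${seal.temperature}/${seal.buildSha}`;
}
```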

That’s the deal. The score isn’t a vibe.