What Happens When You Score the Same Song Twice
We ran the same lyrics through the scoring engine 20 times. Here is what we learned about consistency, variance, and what the numbers actually mean.
A fair question about any AI scoring system: if you score the same lyrics twice, do you get the same number? We tested this extensively during development. The answer is instructive.
The experiment
We took five songs at different quality levels — one weak draft, one average, one strong, one excellent, and one that had previously scored above 90. We ran each through the scoring engine 20 times with no changes to the lyrics.
What we found
Composite scores varied by 3-5 points. A song that scored 78 on one pass might score 74 or 82 on another. That variance is real, and it is intentional. The scoring uses a multi-voice evaluation where different evaluators must reach consensus. Different runs produce slightly different deliberations, which produce slightly different scores.
Tier placement was consistent. A song in the 75-85 range never scored below 70 or above 88. The variance stays within a band. You will never see a 60 become a 90 or vice versa. The scoring reliably distinguishes quality levels even if the exact number shifts.
Individual metrics varied more than composites. A single metric like Prosody might swing 8-10 points between runs. But because the composite is a weighted average of 12 metrics, the swings largely cancel out. This is by design — the composite is more stable than any individual score.
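The averaging effect is easy to demonstrate with a toy simulation. The metric count matches the article, but the noise level, weights, and "true" scores below are invented for illustration — they are not the engine's real rubric:

```python
import random
import statistics

random.seed(42)

NUM_METRICS = 12
TRUE_SCORES = [78.0] * NUM_METRICS          # assume every metric is "really" a 78
WEIGHTS = [1 / NUM_METRICS] * NUM_METRICS   # equal weights, for simplicity
METRIC_NOISE_SD = 5.0                       # per-metric run-to-run drift

def one_pass():
    """Simulate one scoring run: each metric drifts independently."""
    metrics = [t + random.gauss(0, METRIC_NOISE_SD) for t in TRUE_SCORES]
    composite = sum(w * m for w, m in zip(WEIGHTS, metrics))
    return metrics, composite

runs = [one_pass() for _ in range(1000)]
single_metric_sd = statistics.stdev(r[0][0] for r in runs)  # one metric's spread
composite_sd = statistics.stdev(r[1] for r in runs)         # composite's spread

print(f"single-metric sd: {single_metric_sd:.1f}")
print(f"composite sd:     {composite_sd:.1f}")
```

With 12 independent metrics, the composite's spread shrinks by roughly a factor of sqrt(12) ≈ 3.5 relative to any one metric — which is why an 8-10 point swing in Prosody only moves the composite a few points.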
Transcendent lines were identified consistently. If a line was flagged as transcendent on one pass, it was flagged on 17 of 20 passes. The system agrees on which moments are exceptional even when the exact numbers drift.
Why variance is not a bug
A scoring system with zero variance would mean a single fixed algorithm with no deliberation. That produces scores that are precise but consistently wrong — like a bathroom scale that reads exactly 150 pounds every time you step on it, but is always 3 pounds off in the same direction.
The multi-voice evaluation approach produces scores that are slightly different each time but correct on average. The 3-5 point variance honestly reflects the fact that evaluating creative work involves judgment, not measurement. Two skilled human critics would disagree by at least that much.
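The bathroom-scale contrast can be sketched in a few lines. The bias and noise figures here are assumptions chosen for illustration, not measurements of the real engine:

```python
import random
import statistics

random.seed(7)

TRUE_SCORE = 78.0

def fixed_scorer():
    """Zero variance, but a built-in +3 bias it can never shed."""
    return TRUE_SCORE + 3

def ensemble_scorer(num_voices=5, voice_sd=4.0):
    """Each voice judges with unbiased noise; consensus is their mean."""
    return statistics.mean(
        TRUE_SCORE + random.gauss(0, voice_sd) for _ in range(num_voices)
    )

fixed_runs = [fixed_scorer() for _ in range(1000)]
ensemble_runs = [ensemble_scorer() for _ in range(1000)]

# The fixed scorer's average stays 3 points wrong forever; the noisy
# ensemble's average converges on the true score.
fixed_error = abs(statistics.mean(fixed_runs) - TRUE_SCORE)
ensemble_error = abs(statistics.mean(ensemble_runs) - TRUE_SCORE)

print(f"fixed scorer mean error:    {fixed_error:.2f}")
print(f"ensemble scorer mean error: {ensemble_error:.2f}")
```

The fixed scorer wins on repeatability and loses on truth; the ensemble trades a few points of run-to-run scatter for an average that lands on the mark. That trade is the variance you see in the composite.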
What this means for you
Do not chase a specific number. If your song scored 79, it is a 79 — whether the "true" score is 77 or 81 does not change what the lyrics need. Focus on the per-metric breakdown and the specific evidence cited. Those are more stable and more useful than the composite. And if you want to see a score improve, the path is always the same: make the weak metrics stronger. The scoring rubric shows you exactly what each metric rewards.