Behind the Scenes · 2026-05-03 · 7 min read · By the SongForgeAI team

We found a 6.2-point bias in our own rubric. Here’s how we’re fixing it.

A 200-song operational audit revealed the published Lyric Scoring Standard systematically inflates scores for non-English texts. We’re publishing the finding, the methodology, and the fix plan before patching, because that’s what a published standard owes its users.

Published 2026-05-03 (Build 1980). The full methodology + fix plan lives in docs/CROSS-LANGUAGE-FAIRNESS.md on GitHub. This post summarizes the finding for a non-engineering audience.

The headline

On May 3, 2026, we ran an audit of 200 songs the SongForgeAI forge had produced over the prior week. Each song carries a composite score from the published 12-metric Lyric Scoring Standard (CC BY 4.0, version 1.2.0). Looking at the score distribution split by language of the lyric, we found something we did not expect:

  • Non-English songs (Latin, Italian, Spanish, French): mean score 82.8 (n=68)
  • English songs: mean score 76.6 (n=132)
  • Gap: 6.2 points

All 16 songs scoring 91 or higher were Latin or Italian. The bottom of the distribution (songs scoring 72-75) was dominated by English vernacular pieces of comparable or stronger craft.
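For readers who want to reproduce the split, here is a minimal sketch of the grouping we ran. The `Song` shape and the inline numbers are illustrative, not the production schema; the real audit ran against the forge's internal score records.

```ts
// Minimal sketch of the language-split audit (illustrative types, not the production schema).
interface Song {
  id: string;
  language: "en" | "la" | "it" | "es" | "fr";
  compositeScore: number; // 0-100, scored under rubric v1.2.0
}

function meanScoreByLanguageGroup(songs: Song[]) {
  const english = songs.filter((s) => s.language === "en");
  const nonEnglish = songs.filter((s) => s.language !== "en");
  const mean = (group: Song[]) =>
    group.reduce((sum, s) => sum + s.compositeScore, 0) / group.length;

  return {
    englishMean: mean(english),            // 76.6 in the May 3 corpus (n=132)
    nonEnglishMean: mean(nonEnglish),      // 82.8 in the May 3 corpus (n=68)
    gap: mean(nonEnglish) - mean(english), // +6.2
  };
}
```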

The published Lyric Scoring Standard, in other words, has a measurable cross-language bias.

Why we’re publishing this

The Lyric Scoring Standard is a public artifact. It ships under a Creative Commons license, the JSON is downloadable, the npm package is on the public registry, and it is positioned as a citeable rubric for the entire field of lyric quality measurement. External researchers are encouraged to test it against their own data.

A 6.2-point bias would be the first thing a competent external auditor would find. It would be found within a week of any serious adoption attempt. The right move — the only move consistent with the rubric’s positioning — is to publish the finding ourselves, before someone else does, with the methodology fully exposed.

If you ever encounter a published standard whose maintainers don’t audit themselves in public, the standard is not what it claims to be. We’re audit-able because we audit. That’s the deal.

What we think is happening

Five hypotheses, ranked by likelihood. The full technical breakdown lives in the methodology doc; here’s the short version:

1. The 21-song calibration corpus is English-only. The rubric is anchored against canonical English-language popular music. When the eval panel encounters Latin or Italian, it has no anchor for “well-crafted non-English in the 80-range” — and the LLM-grader’s default prior is “formal language equals high craft.” This is the most likely root cause and the easiest to fix.

2. The eval panel can’t detect vernacular weakness in non-English. Several rubric metrics rely on the grader recognizing when a phrase is too plain. For English, the model has strong intuition for plainness. For Latin, plainness reads as “neutral” instead of “weak” and the song gets credit it didn’t earn.

3. The banned-cliché scanner is English-only. The 87-phrase banned-term list is entirely English. Latin and Italian texts pass the scanner automatically — even if they contain the equivalent of “shimmer,” “neon,” or “ethereal.” This is confirmed and is the smallest, highest-leverage fix; a short sketch of the failure mode follows this list.

4. The Historical Context Anchor anti-inflation rule may inadvertently favor classical/sacred music. The rule says scores should be calibrated against the historical canon. For Gregorian chant or opera, “the canon” is centuries deep and almost universally revered. Applying the rule to a chant text may produce inflation because the comparison set is, by curation, the high end of human creative output.

5. The Cultural Resonance metric may favor formality. Formal Latin liturgical language carries strong cultural resonance by virtue of the form. The metric may be reading inherited cultural weight rather than the song’s specific engagement with its tradition.
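To make hypothesis 3 concrete, here is an illustrative sketch of why an English-only list can never flag a non-English lyric. The three phrases stand in for the real 87-phrase list, and `containsBannedPhrase` is a placeholder, not the scanner's actual API.

```ts
// Illustrative sketch of hypothesis 3: an English-only banned list never
// matches a non-English lyric, however clichéd the imagery is.
const BANNED_PHRASES_EN = ["shimmer", "neon", "ethereal"]; // stand-in for the real 87-phrase list

function containsBannedPhrase(lyric: string, bannedList: string[]): boolean {
  const text = lyric.toLowerCase();
  return bannedList.some((phrase) => text.includes(phrase));
}

containsBannedPhrase("neon lights shimmer on the wet street", BANNED_PHRASES_EN); // true
containsBannedPhrase("lux aetherea micat per vias madidas", BANNED_PHRASES_EN);   // false: same imagery in Latin, zero flags
```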

The fix plan

Five phases, all dated, all public:

  • Phase 1 — Disclosure (today, B1980). This blog post + the methodology doc + cadence-ritual entries. Done.
  • Phase 2 — Banned-cliché scanner goes multilingual (B1983, this week). Add Italian, Spanish, French, Latin cliché arrays. Auto-detect language; scan against the appropriate list (see the sketch after this list).
  • Phase 3 — Eval-panel prompt update (B1990, next week). Add explicit instruction to apply anti-inflation rules across languages. Form’s inherited cultural weight is not credit toward the song.
  • Phase 4 — Calibration corpus extension (B1995, ~14 days). Add 3-5 non-English anchors at honest score points (75-85, not 90+). Document them in the published rubric. Bumps rubric version 1.2.0 → 1.3.0.
  • Phase 5 — Re-score validation (B2000+, ~21 days). Re-run the eval panel against the 68 non-English songs from the May 3 corpus. Target: gap closes from +6.2 to within ±1.5. Publish the result.
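To give a sense of the Phase 2 direction, here is a hedged sketch of a per-language scanner. Every name and list entry below is a placeholder, not the shipped API; the real arrays and the language-detection step will be documented in the published rubric when B1983 lands.

```ts
// Sketch of the Phase 2 shape: one cliché array per supported language,
// scanned against the detected language of the lyric.
type Lang = "en" | "la" | "it" | "es" | "fr";

// Entries are illustrative only, not the shipped lists.
const BANNED_PHRASES: Record<Lang, string[]> = {
  en: ["shimmer", "neon", "ethereal"],
  la: ["lux aeterna", "aethereus"],
  it: ["luce eterea", "cuore in fiamme"],
  es: ["luz etérea", "corazón en llamas"],
  fr: ["lumière éthérée", "cœur en feu"],
};

// Language detection happens upstream (Phase 2 auto-detects it); the scanner
// then checks the lyric against the list for its own language only.
function scanForCliches(lyric: string, lang: Lang): string[] {
  const text = lyric.toLowerCase();
  return BANNED_PHRASES[lang].filter((phrase) => text.includes(phrase));
}
```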

What we’re NOT doing

We are not retroactively re-scoring production songs. Users have already received their scores; changing them after the fact damages trust more than the original drift. The reproducibility seal on each score is timestamped to a specific rubric version; old scores stay valid against rubric v1.2.0, and new forges run against v1.3.0 once it ships.
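For readers who want to picture how version pinning keeps old scores valid, here is a hedged sketch. The exact fields of a reproducibility seal are not published in this post, so the structure below is an assumption, not the real format.

```ts
// Hypothetical seal shape; the real seal format is not spelled out in this post.
interface ScoreSeal {
  songId: string;
  compositeScore: number;
  rubricVersion: string; // e.g. "1.2.0"
  sealedAt: string;      // ISO-8601 timestamp
}

// A score is checked against the rubric version it was sealed with, not
// against whatever version is current; old seals are never rewritten.
function sealMatchesRubric(seal: ScoreSeal, rubricVersion: string): boolean {
  return seal.rubricVersion === rubricVersion;
}
```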

We are not delaying the disclosure to research more. The disclosure is the work. Researching the fix in private and quietly patching would be the wrong posture for a published standard.

What this means for you

If you’re a SongForgeAI user with non-English songs in your dashboard scoring 85+, your scores are real for rubric v1.2.0 — but rubric v1.2.0 has a documented language bias. New scores under v1.3.0 (shipping ~2026-05-17) will be calibrated more fairly. Both scores remain verifiable against their respective seals; both are part of the historical record.

If you’re a researcher implementing the rubric externally, please factor this finding into your own validation. The Phase 2-5 fixes will land in the public JSON at /scoring-standard.json as they ship; the npm package will publish v1.3.0 once the calibration corpus extension lands. The CC BY 4.0 license still holds; cite the v1.3.0 changelog when you cite the rubric.
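If your validation harness fetches the published JSON, one way to pick up the fixes as they land is to check the rubric version before each run. This is a sketch under the assumption that the JSON carries a top-level version string; the host URL is a placeholder, and the rest of the schema is not shown here.

```ts
// Sketch for external researchers: fetch the published rubric and confirm
// which version a validation pass is running against. The host below is a
// placeholder; only the version field is assumed here.
async function loadRubric(baseUrl: string): Promise<{ version: string }> {
  const res = await fetch(`${baseUrl}/scoring-standard.json`);
  if (!res.ok) throw new Error(`Failed to fetch rubric: ${res.status}`);
  return res.json();
}

const rubric = await loadRubric("https://example.com"); // substitute the real host
if (rubric.version !== "1.3.0") {
  console.warn(`Validating against rubric v${rubric.version}; the language-bias fixes land in v1.3.0.`);
}
```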

If you’re a working songwriter, the practical implication is small: the rubric still works for the lyric you’re writing today. The change is in how the eval panel calibrates against non-English texts, which most users don’t produce.

The trust math

A standard that audits itself in public is more credible, not less, than a standard that ships pristine. Pristine published standards are statistically improbable; the question is whether the maintainers will surface their own drift or wait to be caught.

The Lyric Scoring Standard is a young artifact. It will have more drift findings before it has fewer. Each one will be published the same way: methodology first, fix second, validation third, all dated, all on the public record. That is the contract.

The next entry on this same beat ships when phase 2 lands. Watch /blog.