Behind the Scenes · 2026-05-10 · 13 min read · By Todd Nigro

Cross-language rubric bias: our Italian songs scored higher than our English ones — here’s why

We ran identical briefs through the SongForgeAI rubric in seven languages. Italian songs averaged 3.8 points higher on the composite than English; Spanish, 3.4. Here is the bias audit, the cause, and what we shipped.

Published 2026-05-10. The internal audit ran over the week of 2026-05-04. The fix shipped across Builds B1983 and B2295. This post translates the technical receipts for a non-engineering audience.

The finding

We score every lyric the SongForgeAI forge produces against a 12-metric rubric. The rubric weights Craft at 25%, Expression at 40%, and Impact at 35%. An 8-voice eval panel scores each metric independently; the composite is a weighted average. The system is the same regardless of the language the lyric is in.
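For the curious, here is what that composition looks like in miniature. The metric-to-category grouping below is a placeholder; only the 25/40/35 weights and the 8-voice averaging match the real pipeline.

```python
from statistics import mean

# Category weights are from the rubric; which of the 12 metrics falls
# under which category is assumed here for illustration.
CATEGORY_WEIGHTS = {"craft": 0.25, "expression": 0.40, "impact": 0.35}

def composite(panel_scores: dict[str, list[float]],
              metric_category: dict[str, str]) -> float:
    """panel_scores: metric name -> that metric's 8 voice scores."""
    # Each metric is first averaged across the eight independent voices.
    metric_means = {m: mean(v) for m, v in panel_scores.items()}
    total = 0.0
    for category, weight in CATEGORY_WEIGHTS.items():
        members = [metric_means[m] for m, c in metric_category.items()
                   if c == category]
        # Category score = mean of its metrics; composite = weighted sum.
        total += weight * mean(members)
    return total
```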

In mid-April I ran a small internal experiment. I sent identical creative briefs — same emotional brief, same setting, same character — through the forge in seven languages: English, Spanish, Italian, French, Portuguese, German, and Japanese. Each language got 30 forges. I scored every output through the standard rubric.

The results were uncomfortable. Italian songs averaged 74.2 on the composite score. Spanish averaged 73.8. English averaged 70.4. Japanese averaged 68.1. German came in at 67.6.

The 3.8-point gap between Italian and English doesn’t sound dramatic. On our distribution it is. The median song scores in the high 60s; a 3-point spread separates the 50th percentile from the 75th. If Italian briefs were averaging Top-25% scores and English briefs were averaging median scores against the same creative parameters, the rubric was doing something it shouldn’t.

This post is the audit of why, and what we changed.

The first wrong hypothesis

The obvious explanation was data — the Sonnet model the forge runs on is better at Italian than at English, or the training corpus has more well-formed Italian songs than English ones. This is where every shallow AI-bias post starts and stops.

It’s also wrong. The forge model is not measurably better at Italian. We took the Italian forge outputs for the same set of 30 briefs, had a native translator render them into English, and scored the translations through the standard rubric. The English translations of the Italian forges scored within 0.3 points of the original English forges. The lyrics were the same quality; the rubric was scoring them differently depending on the language it read them in.

That means the bias was not in the generator. The bias was in the evaluator.

What the evaluator was actually doing

The eval panel is eight voices. Each voice is a Claude Sonnet instance prompted with a different temperament — Cipher (cold structural analyst), Ember (warm emotional reader), Scalpel (line-by-line craft surgeon), Compass (genre-fit auditor), Archive (historical-context anchorer), Vex (adversarial bullshit detector), Prism (multi-perspective synthesizer), and Civilian (target-listener stand-in).

The voices share a single rubric prompt loaded from src/skills/lyric-evaluator.md. The rubric explains the 12 metrics, the scoring scale, and the anti-inflation rules (Gravity, Antagonist Ceiling, Historical Context Anchor, etc.). It is, in principle, language-agnostic.
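In sketch form, the plumbing is one shared rubric with eight temperament prompts layered on top. The call_model stub below is a stand-in for the actual Sonnet client, not our real interface:

```python
from pathlib import Path

VOICES = ("Cipher", "Ember", "Scalpel", "Compass",
          "Archive", "Vex", "Prism", "Civilian")

# One rubric prompt, shared by every voice.
RUBRIC = Path("src/skills/lyric-evaluator.md").read_text()

def call_model(system_prompt: str, lyric: str) -> dict[str, float]:
    """Stand-in for the Sonnet client; returns one voice's 12 metric scores."""
    raise NotImplementedError

def run_panel(lyric: str, temperaments: dict[str, str]) -> dict:
    # Every voice reads the identical rubric; only the temperament
    # prompt layered on top differs.
    return {voice: call_model(RUBRIC + "\n\n" + temperaments[voice], lyric)
            for voice in VOICES}
```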

It was not actually language-agnostic. Reading the rubric carefully, we found three structural problems.

Problem 1: English examples. The rubric’s scoring anchors — the “a 90 sounds like this, a 70 sounds like this” reference lines — were all drawn from English-language songwriting (Springsteen, Lucinda Williams, Townes Van Zandt, John Prine, Patty Griffin, James McMurtry). When the eval panel was reading an Italian lyric, the cultural anchors against which the lyric was being compared were unavailable. The panel defaulted to scoring the Italian lyric on its own merits without an antagonist-ceiling comparison — and the antagonist ceiling is the mechanism that holds scores down. Without it, scores drift up.

Problem 2: Idiomatic density. Italian and Spanish have higher idiomatic density than English in standard songwriting registers. Phrases that read as “ordinary cliché” in English (e.g. “tears in the rain”) often correspond to phrases in Italian that, while no less common to a native Italian listener, look more textured to the rubric. The eval panel was reading the Italian lyric in Italian, and the phrases were never flagged as cliché because the panel’s 87-term banned-terms list (the banned-terms scanner) was almost entirely English. The same lyric in English would have triggered three or four banned-term flags; in Italian, zero.
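The failure mode is easy to reproduce in miniature: a literal phrase scan against an English-only list flags nothing in Italian, even when the Italian line is the same cliché. (The terms below are examples, not the actual list.)

```python
# Illustrative entries only -- not the real 87-term list.
BANNED_TERMS = {"tears in the rain", "heart of gold", "broken wings"}

def banned_term_flags(lyric: str) -> list[str]:
    # A literal phrase match: the Italian rendering of the same
    # cliché never matches an English entry.
    text = lyric.lower()
    return [term for term in BANNED_TERMS if term in text]

banned_term_flags("I stood there, tears in the rain")         # 1 flag
banned_term_flags("Restavo lì, in lacrime sotto la pioggia")  # 0 flags
```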

Problem 3: Voice personality bleed. The eval voices were instructed to maintain their temperaments regardless of input language. In practice, Ember (the warm emotional reader) is markedly warmer in Italian than in English. The Sonnet model carries cultural priors about Italian-language emotionality — opera, romance, the Mediterranean — and those priors leak into the warmth score even when the lyric is restrained. Ember’s contribution to the Italian composite was running 2.1 points higher than her contribution to the English composite on matched briefs. Vex (the adversarial bullshit detector) was running 1.4 points lower on Italian. The bullshit detector was less suspicious in a Romance language.
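Those per-voice numbers come from a straightforward comparison: average each voice's contribution per language on the matched briefs, then diff against the English baseline. A sketch, assuming score logs keyed voice → language → scores:

```python
from statistics import mean

def voice_deltas(logs: dict[str, dict[str, list[float]]],
                 baseline: str = "en") -> dict[str, dict[str, float]]:
    """logs: voice -> language -> that voice's scores on matched briefs.
    Returns each voice's average offset from its baseline-language mean."""
    deltas = {}
    for voice, by_lang in logs.items():
        base = mean(by_lang[baseline])
        deltas[voice] = {lang: round(mean(scores) - base, 1)
                         for lang, scores in by_lang.items()
                         if lang != baseline}
    return deltas

# On our matched briefs this is what surfaced Ember at +2.1 and
# Vex at -1.4 on Italian.
```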

The three problems compounded. The composite ran 3-4 points hotter for Italian and Spanish than it did for matched English briefs. The rubric was rewarding the same lyric more in one language than in another.

Why this matters

It is tempting to dismiss this as “a small bias on an internal scoring system.” The shape of the bias matters more than its size.

First, the rubric is the source of truth for whether a song shipped via SongForgeAI is good. If the rubric is biased in any direction, the resulting songs are biased in that direction. A 3-point cross-language gap means a non-English songwriter would have to write a measurably better song than an English songwriter to receive the same score. That’s a fairness problem.

Second, the rubric is published. We released the scoring rubric as an open standard at /scoring/standard with a CC BY 4.0 license. Other tools, other songwriters, other researchers can use it. If our rubric is biased and we don’t disclose it, we’re shipping bias into the open-standards layer of the AI-songwriting field. That’s a worse fairness problem.

Third, the documented AI-fairness literature (Crawford 2021; Bender et al. 2021; Birhane & Prabhu 2021) is largely about training-data bias. The cross-language scoring bias we found is not training-data bias. It’s rubric-design bias — the evaluation criteria themselves carry assumptions about which language the work-being-evaluated is written in. That’s an under-discussed failure mode and one we’d like to surface.

What we shipped

The fix is in three parts. Two are live; one is in progress.

Shipped (B1983, 2026-04-22): Multilingual rubric anchors. We expanded the rubric’s reference-line anchors to include matched examples from songwriting traditions in each supported language — Spanish (Joaquín Sabina, Silvio Rodríguez), Italian (Fabrizio De André, Vinicio Capossela), French (Jacques Brel, Barbara), Portuguese (Chico Buarque, Caetano Veloso), German (Element of Crime, Tocotronic), and Japanese (Shiina Ringo, Misora Hibari translations). Each anchor is matched at three difficulty levels (90/75/50 reference points) so the eval panel can perform an antagonist-ceiling comparison regardless of the language of input.
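Structurally, the expansion just keys the anchor table by language as well as reference level. The layout below is a sketch: the 90/75/50 levels are real, the placeholder lines are not.

```python
# Sketch of the expanded anchor table. The dict layout is assumed;
# the three reference levels are the ones the rubric ships with.
RUBRIC_ANCHORS: dict[str, dict[int, str]] = {
    "en": {90: "<reference line>", 75: "<reference line>", 50: "<reference line>"},
    "it": {90: "<reference line>", 75: "<reference line>", 50: "<reference line>"},
    # ... plus es, fr, pt, de, ja entries
}

def anchors_for(language: str) -> dict[int, str]:
    # An unsupported language should be reported, not silently scored
    # without an antagonist-ceiling comparison.
    if language not in RUBRIC_ANCHORS:
        raise LookupError(f"no anchors for {language!r}: report, don't skip")
    return RUBRIC_ANCHORS[language]
```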

Shipped (B2295, 2026-05-09): Permitted-dissent voting consensus. We added a structural rule to the eval panel: the Antagonist Ceiling penalty (the rule that caps a lyric’s score against the historical best in its tradition) now requires at least one voice to register the antagonist comparison explicitly for a MEDIUM penalty, and two for a HIGH penalty. This forces the panel to name the comparison rather than rely on a vague aggregate sense, which exposes when the panel cannot find an antagonist in the input language (and should therefore report so rather than skip the rule).
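A minimal sketch of the rule, assuming each voice's vote records whether it explicitly named an antagonist and at what severity (the vote shape is an assumption; the thresholds are the shipped rule):

```python
def antagonist_penalty(votes: list[dict]) -> str:
    """votes: [{'voice': ..., 'antagonist': str | None, 'severity': str}]"""
    explicit = [v for v in votes if v.get("antagonist")]
    if not explicit:
        # No voice could name an antagonist in the input language:
        # surface the gap instead of silently skipping the ceiling.
        return "REPORT_NO_ANTAGONIST"
    if sum(v["severity"] == "HIGH" for v in explicit) >= 2:
        return "HIGH"
    return "MEDIUM"  # at least one explicit comparison registered
```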

In progress: Cross-language banned-terms expansion. The 87-term banned-terms scanner is currently 78 English entries and 9 universal entries. We are auditing each supported language’s songwriting tradition for the cliché-equivalents that should be added. The Italian audit was completed last week (24 additional terms); Spanish, French, Portuguese, German, and Japanese are queued. Target: ship the full expansion by Build 2400.

What the post-fix numbers look like

We re-ran the original 7-language audit after B2295 shipped. The composite-score gap collapsed from 3.8 points to 0.6 points, well within the noise floor of the rubric. Italian composite averaged 71.1; English averaged 70.5; Japanese averaged 69.8; German averaged 69.4. The remaining 0.6-point spread is consistent with what we’d expect from run-to-run variance in the eval panel (we run the eval voices at temperature 0.7).

We will continue to audit. The cross-language banned-terms expansion (the one item still in progress) should close the German and Japanese gap further once it ships. We’ll publish a follow-up after Build 2400.

What we couldn’t fix without the operator

Two issues we could not fix in-pipeline.

First, the rubric is fundamentally a Western-songwriting rubric. The 12 metrics — Memorability, Emotional Authenticity, Transcendent Lines, Specificity, Sonic Fingerprint, etc. — were derived from the songwriting craft tradition broadly defined as Anglo-American with European folk roots. There are entire songwriting traditions where these metrics are not the right metrics. Japanese enka, for instance, prioritizes the kobushi (vocal ornamentation) and the kimari (formal closure) in ways our rubric does not encode. We can score a Japanese enka lyric on our rubric and the score will be a real number, but it will not be the number a Japanese enka critic would assign. We disclose this in the scoring standard document. We do not have a fix shipped.

Second, the eval voices are Claude Sonnet instances. The cultural priors they carry — what they consider warm vs. restrained, ornate vs. spare, sincere vs. ironic — are not adjustable at the temperament-prompt layer. They are a property of the underlying model. We can prompt the voices to be careful; we cannot remove the priors. A different model (Gemini, GPT-4o, an open-weight Llama) would have different priors. Cross-model audit is the next stage of the work and it’s not autonomous.

How to read this

If you write in English and forge through our tool, your songs were being graded slightly harder than equivalent songs in Italian or Spanish for the past five months. That’s now fixed. Against the cross-language comparison set, your historical scores ranked lower than they should have.

If you write in a non-English language and forge through our tool, your songs were being graded slightly softer than equivalent songs in English. That’s now also fixed. Your historical scores were higher than they should have been on the same comparison.

If you write in a non-Western tradition — enka, qawwali, gamelan-derived popular song, traditional folk forms outside the Anglo-American axis — we don’t yet have a rubric that does justice to your work, and we are saying so publicly because we’d rather disclose the gap than pretend it doesn’t exist.

The fix is in the codebase. The rubric is public. The audit ran because the bias was real and we are not in the business of denying real biases in our own system. We are in the business of finding them and shipping the fix faster than the next audit can find a new one.

The full multilingual rubric anchors are visible in src/skills/lyric-evaluator.md in the public repo. The cross-language audit data is available on request. The scoring standard, with its disclosed limitations, lives at /scoring/standard under CC BY 4.0.