Narrative Voice: the metric AI lyric tools fail hardest
Voice is what separates "a breakup song" from THIS narrator's breakup song, and it's the metric on which most AI output fails most clearly: it still sounds like a model writing through a costume rather than a person with a specific background.
Here is the test I run on every forged lyric before I decide whether it’s a hit or a miss: close my eyes, hear the first verse in my head, and ask who is singing this.
If the answer is “a songwriter” — the lyric fails. If the answer is a specific person I could pick out of a crowd, it passes.
This is the fifth essay in a series walking through the Lyric Scoring Standard. We’ve covered publishing the rubric, why the default score is 50, what specificity actually means, and economy of language. Today: Narrative Voice, metric #8 in the Expression tier. The metric AI fails hardest and hit writers chase hardest.
What the metric is NOT
Narrative Voice is not: the narrator’s personality, the narrator’s mood, whether the song is first-person, or the register of diction (plainspoken vs. literary).
Those are inputs. Voice is the output. Voice is what remains constant across verse 1, verse 2, and the bridge — the texture that makes every line recognizably the same speaker even when the subject matter shifts.
Think of the singers you can identify in three seconds: Waits, Dolly, Nina, Kendrick, Joni, Isbell, Billie, Patty Griffin. Their voices survive performance and production: strip away the band, the room, the decade, and the voice is still there on the page.
Why AI tools fail this metric
Language models are trained on everyone. That’s the feature and the bug. When you ask a default model to write a country song, it gives you the statistical average of every country song in the training corpus — a voice that is every voice, which means a voice that is no voice.
A good analogy: imagine a singer who studied everyone. They know how Waits phrases, how Dolly lands a hook, how Kendrick cuts a line. But asked to sing, they sound like a talent-show contestant doing impressions. The skill is real; the self isn’t.
The Lyric Scoring Standard’s Voice metric catches this. When the engine reads a lyric, it’s asking: could verse 2 have been written by the same speaker as verse 1, or does this sound like two separate songwriters politely taking turns?
What good looks like
Three signals the metric rewards, in order of strength:
1. Consistent diction across sections. If the narrator says “ain’t” in the verse and “is not” in the chorus, that’s two voices. A strong Voice score has one register throughout — earned, not costumed. The test: read each section independently, and the diction should feel like the same speaker. (Subtle exception: an intentional code-switch for dramatic effect can earn points if the switch is flagged by context. “I told him no” → “said I do not love you” works as a verse-to-chorus shift if the chorus is quoting the narrator being more formal in a specific moment.)
2. Consistent attention. Different narrators notice different things. A lonely narrator notices mugs on counters; an angry narrator notices teeth and fists; a worshipful narrator notices light and bread. What the narrator foregrounds in each verse should feel like the same person’s attention. When verse 1 is all about a coffee mug and verse 2 is all about “the emptiness inside my soul,” the narrator’s eyes changed. The metric flags this.
3. A signature idiosyncrasy. The strongest Voice scores come from lyrics that carry a tiny weird thing nobody else would write. Jason Isbell saying “and the old man’s teeth are yellow and the sky is grey.” Phoebe Bridgers saying “I hate your mom.” Not because those lines are clever — because those lines are unmistakably that person. The engine rewards high-entropy word choices that cluster into a consistent personality across the song.
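The first of those signals lends itself to a crude automated proxy. The sketch below is a toy illustration, not the Lyric Scoring Standard's actual engine: it compares each section's contraction rate against verse 1's, a rough stand-in for the "ain't in the verse, 'is not' in the chorus" failure. All names here are hypothetical.

```python
import re

# Toy register check: flag sections whose contraction rate diverges from
# verse 1's. A crude proxy for the diction-consistency signal; this is an
# illustrative sketch, not the rubric engine's implementation.
CONTRACTION = re.compile(r"\b\w+'(?:t|s|re|ll|ve|d|m)\b|\bain't\b", re.IGNORECASE)

def contraction_rate(section: str) -> float:
    """Fraction of words in a section that are contractions."""
    words = section.split()
    if not words:
        return 0.0
    return len(CONTRACTION.findall(section)) / len(words)

def register_mismatch(sections: dict[str, str], tolerance: float = 0.05) -> list[str]:
    """Return section names whose rate strays from the verse-1 baseline."""
    rates = {name: contraction_rate(text) for name, text in sections.items()}
    baseline = rates.get("verse1", 0.0)  # assumes a section keyed "verse1"
    return [name for name, rate in rates.items() if abs(rate - baseline) > tolerance]

song = {
    "verse1": "I ain't waiting on him, I can't say I ever did",
    "chorus": "I am not waiting, I cannot say that I did",
}
print(register_mismatch(song))  # flags the chorus: same speaker, new register
```

A real engine would need far more than contraction counting (slang, syntax, line length), but even this toy catches the two-voices failure the essay describes.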
How to fix a low Voice score
When the eval comes back with Voice as a wound, the gauntlet does one thing well: it rewrites verse 2 to match verse 1’s diction, attention, and signature. It doesn’t try to invent a new voice — it looks for the strongest voice signals already present and amplifies them.
The manual version of this fix:
1. Read verse 1 out loud. Underline three words that sound like THIS narrator and no one else. (Not the words that are dramatic. The words that are specific to this speaker.)
2. Now read verse 2. Can you point to three equivalent words? If not, verse 2 is where the voice failed. Rewrite it until its diction markers match verse 1's.
3. Same exercise for the bridge. Bridges are where Voice fails most often: writers try to "elevate" into a bigger emotional statement, which in practice means adopting a more abstract register. "I am more than this" is a bridge that abandons the narrator. "I am the kind of woman who waits for the coffee to go cold" keeps the narrator intact.
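The bookkeeping half of that exercise can be sketched in a few lines. The helper below is hypothetical and not part of any scoring engine: you still hand-pick the marker words, and it only reports which sections fail to carry them.

```python
# Hypothetical helper for the manual exercise above: given verse 1's
# hand-picked marker words, report which markers a later section lacks.
# Choosing good markers is the human part; this only does the bookkeeping.

def strip_punct(word: str) -> str:
    """Lowercase a word and trim surrounding punctuation."""
    return word.strip(".,!?;:\"").lower()

def missing_markers(markers: list[str], section: str) -> list[str]:
    """Return the markers that do not appear anywhere in the section."""
    words = {strip_punct(w) for w in section.split()}
    return [m for m in markers if strip_punct(m) not in words]

markers = ["cold", "mug", "ain't"]       # the three words underlined in verse 1
verse2 = "The emptiness inside my soul is not what I wanted"
print(missing_markers(markers, verse2))  # every marker is missing: voice failed here
```

A section that returns an empty list has at least kept verse 1's vocabulary; whether the attention and idiosyncrasy survived is still the reader's call.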
The metric that reveals the tool
Voice is the metric I use to evaluate new AI lyric tools. If the output is fluent but has no voice — statistical average of everyone, opinion of no one — the tool hasn’t solved the interesting problem. It’s finished what the prompt started; it hasn’t found the narrator.
We built the 50-voice war room because a single model can’t reliably forge a voice. Seven voices arguing for different angles — each writer on the panel carrying specific craft opinions — is a mechanism for narrator-finding, not just lyric-writing. The panel doesn’t just propose lines; it votes on which narrator is emerging from the draft, and the strongest narrator wins.
Most of the quality gap between a 65 and a 78 on our rubric is the Voice metric. It’s also the metric most reliably missing from competitor output. When a forged lyric hits 80+, Voice is almost always why — the panel found a specific narrator who wasn’t in the prompt and refused to let any verse betray them.
The test, once more
Close your eyes. Hear the first verse. Who is singing this?
If the answer is fuzzy, you’re not done.
If the answer is a name — even a made-up one, even “the woman who works the second shift at a diner in Tupelo” — the voice is there. Let it drive the bridge. Let it pick the word at the end of line 3.
That’s what the Voice metric is asking you to do. And it’s why Voice is the hardest metric to fake.