Who Is Singing This? The Voice Gap in AI Lyrics
AI can imitate genres. It cannot reliably sound like one specific person. A deep dive on the Voice & POV metric — the sneakiest craft failure in AI lyrics, and how to prompt for a narrator who actually exists.
Two AI-generated breakup songs. Different prompts, different genres. Read the first verse of each:
Song A:

I keep your photo where I used to sit
The kitchen feels too big for one
I still set the table out of habit
Then eat standing at the counter

Song B:

Your picture stays above the sink
The rooms grew wider when you left
I keep two place mats, still, unused
Eat standing up, so I don't feel it
These are different songs. Different prompts. Different requested genres (one indie folk, one Americana). But read them closely. Are these two different narrators, or are they the same person wearing slightly different clothes?
They are the same person. The diction is identical. The observational posture is identical. The rhythm of "I did X / then Y" repeats. Even the metaphor grammar is the same — an absence described as the presence of extra space. Both songs were written by the same narrator, who happens to be nobody in particular.
This is the voice problem, and it is the single most overlooked failure mode in AI lyric generation. Genre differences are visible. Theme differences are visible. Narrator differences are not, until you stop to listen for them. And once you hear it, you cannot unhear it.
What voice actually means in a lyric
Voice is not the singer. It is the narrator — the invented person whose perspective the song inhabits. Every song has one, whether the writer designed them deliberately or accepted the default. Voice lives in:
- Diction. Which specific words the narrator reaches for. "I reckon" vs. "I guess" vs. "I think." "Bodega" vs. "corner store." "Sofa" vs. "couch" vs. "davenport." Each choice is a fingerprint.
- Reference set. What the narrator notices — and equally, what they do not. A 22-year-old notices brand names, hashtags, bar tabs. A 55-year-old notices weather, property lines, what is on the news.
- Posture. Are they defending, confessing, explaining, mocking, apologizing, remembering? The same sentence reads completely differently in each posture.
- What they admit to. A narrator's honesty level is itself a voice feature. Some narrators tell you everything. Some narrators reveal the truth by accident in verse three. A voice is partly what it will not say.
When a lyric has voice, you can picture the person saying it. You know their age within five years. You can guess what they drive. You would recognize them on the phone. That is not a vibe — it is a measurable craft outcome, and it is the Voice & POV Integrity metric (number 8 in our 12-metric rubric, inside the Expression tier that carries 40% of the composite score).
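The tiered weighting described above reduces to a weighted average. A minimal sketch follows; only the 40% Expression weight comes from this article. The other tier names and weights are placeholders, and the real rubric scores 12 individual metrics rather than three tier totals.

```python
# Sketch of a tiered composite score. Only the 0.40 Expression weight is
# from the text; "craft" and "structure" and their weights are invented.
TIER_WEIGHTS = {"expression": 0.40, "craft": 0.35, "structure": 0.25}

def composite(tier_scores: dict[str, float]) -> float:
    """Weighted average of per-tier scores, each on a 0-100 scale."""
    return sum(TIER_WEIGHTS[tier] * score for tier, score in tier_scores.items())

# A lyric weak on Expression drags the composite down hardest,
# because Expression carries the largest weight.
composite({"expression": 62, "craft": 75, "structure": 80})
```

The point of the structure: a Voice improvement moves the composite more than an equal improvement anywhere else, because Voice sits inside the heaviest tier.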
Why AI blurs voice by default
Foundation models average their training data. Asked to write "a breakup song," a model produces the statistical mean of every breakup song it has seen — which means it performs the genre's surface tropes (kitchens, empty rooms, habits continuing past their cause) but strips out the narrator-specific weirdness that made any individual source song memorable.
This happens for a reason. Training a model to produce voice-consistent output would require the training objective to reward specificity over genre-fit, which directly contradicts the objective that makes models usable for everything else. A model that produced a strong narrator by default would produce the wrong strong narrator most of the time, and rejection rates would shoot up. So the default is the bland median narrator, who fits every prompt because they fit none of them specifically.
The observable consequence: median first-pass AI lyrics score 55 to 65 on the Voice metric. That puts Voice in the bottom third of AI performance across the rubric, alongside Specificity (the number-one weakness) and Imagery. All three weaknesses are the same weakness at different zoom levels.
The narrator experiment
We ran the identical prompt through SongForgeAI three times, changing only the narrator anchor:
Prompt base: "A country song about leaving the first person you ever loved."
Narrator A: A 27-year-old paralegal in Cincinnati, drives a Civic, never left her hometown.
I drove past his mama's house today
She was taking out the trash
Still waves at me like I'm family
I wave back, I don't know why I wave back
Narrator B: A 54-year-old who has been through this before, remarried once, drives a pickup.
I'm too old for this particular heartbreak
I thought I was done paying for it
But here it is, three in the morning, and the ceiling
Remembers every apology I never made
Narrator C: A 19-year-old on a tour bus, first real heartbreak, everything is theatrical.
I told the whole bus about it
Three different ways, three different versions
By Salt Lake I was sure I was fine
By Denver I was writing this song
Same prompt. Three different people. None of them is the median AI narrator, because the median AI narrator does not have an age, a city, or a specific relationship to their own heartbreak. The moment you give the model those anchors, the narrator appears.
What to prompt for
Three anchors do about 80% of the work:
- Age within a decade. The 19-year-old, the 27-year-old, and the 54-year-old above are different people because they notice different things and talk differently. You do not need the exact age — the decade does most of the work.
- A specific place. Not "a small town." Cincinnati. Akron. Tacoma. A specific place evokes a specific set of daily textures that a generic anchor cannot.
- A posture toward the song's subject. Still angry? Over it? Ambivalent? Nostalgic? The same situation narrated from three postures produces three songs.
The second-tier anchors — job, what they drive, one small thing they notice — sharpen the narrator further. But age plus place plus posture alone is enough to move a Voice score from the mid-60s to the high 70s.
Where ghost mode fits in
Our Ghost Collaborators feature installs a stylistic scaffold — diction tendencies, imagery habits, structural preferences — over whatever narrator the prompt specifies. Ghost mode improves Voice scores measurably, by about 4 to 7 composite points on average.
What ghost mode does not do: substitute for narrator identity. The same ghost scaffold writing for a 19-year-old on a tour bus produces a recognizably different lyric than that same scaffold writing for a 54-year-old. The ghost colors the voice. It does not replace the narrator.
This is why combining ghost mode with narrator anchoring produces the highest Voice scores we see. Ghost supplies the stylistic fingerprint. Narrator supplies the person. Both together produce a lyric that sounds like this specific poet writing for this specific person — which is what "voice" has always meant in songwriting.
The voice self-check
Three questions to ask any lyric you have generated:
- Can you picture the narrator? If you can only picture "someone sad," the voice is not there yet.
- Do you know their age within a decade? If the same lyric could be written by a 22-year-old or a 55-year-old, it was written by neither.
- Would they talk the same way if they were saying this out loud, to a stranger, at a bar? If the diction feels writerly rather than spoken, the narrator is a version of the writer, not a version of a person.
A lyric that fails all three is working at about 55 on the Voice metric. A lyric that passes all three is working in the high 70s or 80s. The jump is not talent. It is narrator anchoring. You can ask for it. Most people do not.
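The self-check runs as a literal checklist. In the sketch below, the two end bands restate the paragraph's rough numbers (about 55 failing all three questions, high 70s passing all three); the intermediate bands and the advice strings are invented for illustration.

```python
def voice_self_check(can_picture: bool, knows_decade: bool,
                     sounds_spoken: bool) -> tuple[int, str]:
    """Map the three yes/no questions to rough Voice-metric bands."""
    passes = sum([can_picture, knows_decade, sounds_spoken])
    bands = {
        0: (55, "no narrator yet: add age, place, and posture anchors"),
        1: (62, "a silhouette: two anchors still missing"),       # invented band
        2: (70, "almost a person: one anchor short"),             # invented band
        3: (78, "a narrator exists: sharpen with second-tier anchors"),
    }
    return bands[passes]

score, advice = voice_self_check(can_picture=True, knows_decade=False,
                                 sounds_spoken=True)
```

The useful property is the asymmetry: each question that flips from no to yes names the specific anchor to add, so the check doubles as a revision plan.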
Why this matters now
As AI lyric tools proliferate, the surface features — rhyme, structure, vocabulary — converge. Every tool is going to produce technically competent verses about empty kitchens. The differentiator, for writers and for tools, is going to be voice: how specifically and convincingly the narrator exists as a single person.
You cannot fake it. A lyric with voice reads as itself. A lyric without voice reads as "a lyric." Once you have heard the difference, you cannot stop hearing it, and you will not settle for the second one.
— The SongForgeAI team