The Lyric Scoring Standard
A 12-metric, evidence-based rubric for evaluating AI-generated song lyrics.
Craft (25%) · Expression (40%) · Impact (35%). Five anti-inflation rules. Reproducibility seal on every score (rubric version + model + temperature + build SHA). Installable as @songforgeai/scoring-rubric.
By Todd Nigro, founder, SongForgeAI
First published April 2026 · Current version v1.2.0 (May 2026) · Published under CC BY 4.0
This standard measures the difference between lyrics that merely work and lyrics that feel lived.
The Lyric Scoring Standard is a 12-metric rubric for evaluating AI-generated song lyrics, published under CC BY 4.0. It scores three tiers — Craft (25%), Expression (40%), Impact (35%) — with five anti-inflation rules that keep the average at 50, not 75. Every score response carries an ed25519 reproducibility seal binding the rubric version, model, temperature, and deploy SHA, so the same input produces the same output verifiably.
What makes it different: the rubric is specifically wrong-aboutable. Where most AI tools return a number from a closed system, this standard publishes the metrics, the weights, the calibration corpus (21 hand-scored entries spanning folk, hip-hop, country, MPB, sonnets, ghazals), and the per-subgenre overlays. A reader who disagrees with metric M7’s definition can argue with that specific text instead of arguing in the abstract about what makes a song good.
Honest limits: the eval is currently Claude scoring against the rubric — an LLM grading LLM-generated text. The rubric is published so that limit can be addressed: a 30-rater human cohort is in recruitment to score the public corpus and publish ICC inter-rater statistics. Methodology pre-registered at /scoring/standard/inter-rater; the κ result auto-renders there when the cohort completes.
Known cross-language drift (disclosed 2026-05-03): an operational audit found that v1.2.0 of the rubric scores non-English texts approximately 6.2 points higher than English texts of comparable craft. The 21-entry calibration corpus is currently English-only; non-English scores carry an inherited bias the human cohort work and v1.3.0 corpus extension will close. Methodology and fix plan: /blog/cross-language-scoring-fairness-2026.
For implementers: the machine-readable JSON is at https://songforgeai.com/scoring-standard.json; the npm package @songforgeai/scoring-rubric ships the rubric + helper functions; the calibration corpus is at https://songforgeai.com/scoring-corpus-v1.json. Cite as “Lyric Scoring Standard v1.2.0 (SongForgeAI, 2026), CC BY 4.0.”
1. Why This Exists
The origin story of a scoring system, told honestly.
I kept asking AI to write me a song, and it kept writing me a résumé for a song.
The verses were structured. The choruses rhymed. The emotional vocabulary was acceptable. And every single one felt like it had been assembled rather than written. I would read back a lyric about heartbreak and notice that no specific heart had actually broken. I would read a lyric about faith and notice that no one had actually prayed.
I built SongForgeAI because I wanted AI-assisted lyrics to sound like a person wrote them, not like a system output them. The Lyric Scoring Standard is the evaluation layer that sits underneath that mission. It is how we decide whether a lyric is ready, whether it is pretending, and whether it is worth another pass.
A lyric can be fluent and still not feel alive. This standard exists to tell the difference.
That sentence is the entire paper. Everything below is how we make it true.
2. The Problem with AI Lyrics Right Now
Why general-purpose evaluators keep getting this wrong.
Most AI systems are very good at producing lyrics that look like songs. They generate verses, choruses, rhymes, and emotional language at speed. What they often fail to produce is the harder thing: a lyric that feels specific, singable, emotionally earned, and worth returning to.
The failures show up in recognizable patterns. Images are generic. Emotional claims are unearned. The narrator’s voice shifts mid-song. The chorus lacks a memorable turn. The lines do not sing naturally. Verses sound like the singer is narrating their own emotions from the outside rather than living them from the inside. These patterns appear frequently in AI-assisted lyric workflows, including workflows that begin in music-generation platforms, and they are rarely caught by a general-purpose language-model evaluator, because general-purpose evaluators reward fluency, and fluency is not feeling.
A fluent lyric can still be forgettable. A polished lyric can still be emotionally empty. Song lyrics need their own standard because songs are not essays, and they are not poems placed over music. A lyric must carry melody. It must land in the mouth. It must create emotional movement in very little space. It must repeat without growing dull. It must be clear enough to sing and specific enough to remember. Most general-purpose evaluators are not built for that job. This standard is.
3. The Core Principle
One belief underneath every metric that follows.
A great lyric is not just well-written. It feels necessary.
That necessity shows up in different shapes. It may arrive as a single image that will not let the listener go. It may arrive as a chorus that names something the listener could not name for themselves. It may arrive as a narrator whose voice is so specific that the song could not have been written by anyone else. Whatever its shape, necessity is the feeling that the writer had to write this, and the listener wants to return to it.
The difference between a well-made lyric and a necessary one is whether the writer had to write it. The Lyric Scoring Standard is built to detect that difference without pretending to replace human taste. It asks six honest questions of every lyric it sees:
- Does this sound like it belongs to a real person?
- Do the words support melody, or fight it?
- Is the emotion shown through detail, or announced through language?
- Is the chorus memorable enough to return to?
- Does the lyric improve with attention, or collapse under it?
- Does it avoid the common patterns of AI-generated sameness?
Scoring will never fully capture greatness. But scoring can expose the difference between a lyric that is merely competent and a lyric that is worth refining. That is the job.
4. The Twelve Metrics
Three tiers. Twelve measurements. Each one named for what it actually tests.
Every lyric scored by SongForgeAI is evaluated across twelve metrics, organized into three weighted tiers: Craft, Expression, and Impact. The weighting is deliberate: Craft counts for 25 percent, Expression for 40 percent, and Impact for 35 percent. The reasoning behind that balance appears in the next section.
Tier 1 — Craft (25%)
Is this lyric built well enough to function as a song?
- Prosody and Musicality. The dance of syllable and stress. Do the words want to be sung? This metric evaluates stress patterns, line movement, phrasing, awkward consonant clusters, and whether the lyric feels natural in the mouth of a singer rather than on the page of a reader.
- Structural Architecture. Does the song develop with purpose? This measures the relationship between verses, chorus, pre-chorus, bridge, and payoff. A strong structure creates movement rather than repetition without growth. A lyric with solid architecture earns its final chorus.
- Rhyme Intelligence. Are the rhymes musical, fresh, and emotionally useful? This metric rewards rhyme that supports meaning rather than calling attention to itself. It penalizes forced rhyme, predictable rhyme, and rhyme that distorts natural speech for the sake of a pairing.
- Economy of Language. The discipline of the fewest possible words, each one carrying weight. Great lyrics often say more by cutting more. This metric evaluates compression, redundancy, filler, and whether any line wastes the limited emotional space a song has to work with.
Tier 2 — Expression (40%)
This tier carries the highest weight. This is where AI lyrics most often fail.
- Sensory Specificity. Can the listener enter the scene? This metric rewards concrete, sensory detail: the object on the table, the smell in the room, the light through the window, the physical evidence of an emotion. Generalities describe emotions. Specifics make the listener feel them.
- Image System Coherence. Do the images belong to the same emotional world? Strong lyrics often build a consistent image system where every detail reinforces a central feeling. Weak lyrics scatter unrelated metaphors and hope the listener connects them. This metric measures whether the imagery deepens the song, or decorates it.
- Emotional Truth. Does the feeling feel earned? The feeling arrives through consequence, not announcement. This metric evaluates whether the lyric reveals emotion through situation, choice, contradiction, and detail, rather than declaring sadness, love, regret, hope, or pain. A lyric that says the singer is heartbroken fails this metric. A lyric that shows the singer leaving the porch light on for someone who is not coming back passes it.
- Narrative Voice (Voice & POV Integrity). Does the lyric sound like someone specific is speaking, and does its point of view hold by intent? AI lyrics often drift into a generic narrator that could be anyone. This metric rewards point of view, personality, diction, and vulnerability. As of v1.2.0 it scores intentional POV rather than mere uniformity: a deliberate switch — a K-pop multi-vocalist hand-off, a hip-hop feature verse, a gospel call-and-response — is a craft choice, not a drift failure. What it penalizes is the accidental slide, where the speaker changes and the song does not seem to know it.
Tier 3 — Impact (35%)
Does this lyric matter to a listener once the song is over?
- Memorability. What remains after the song is over? Through v1.1.0 this was a single one-listen test. As of v1.2.0 it is a four-signal cumulative read — hook integration, phonemic distinctiveness, chorus repetition strategy, and one-listen recall — so cumulative, oral-tradition, and ritual-repetition forms (where the hook earns its hold by returning, not by landing once) are no longer false-scored as forgettable. The honest test still stands: what does the listener sing to themselves in the car the next morning?
- Singability. Can the lyric live inside a melody? A lyric may read beautifully and sing terribly. This metric measures vocal flow, phrasing simplicity, vowel openness, chorus lift, and live-performance usability. A lyric that a singer fights is a lyric that loses.
- Replay Value. Does the lyric reward repeated listening? Replay value comes from emotional layers, subtle turns, unresolved tension, and language that keeps revealing meaning the tenth time through. Memorability is the hook test. Replay value is the test of everything that happens around the hook.
- Category Fit. Does the lyric succeed at what it is trying to be? A country ballad, a worship anthem, a pop single, a folk confession, and a cinematic indie song should not be judged by identical surface expectations. This metric evaluates whether the lyric fulfills the genre contract, audience expectation, and emotional promise it set out to meet.
5. The Scorecard at a Glance
One page. The whole system.
The scorecard below shows the full standard in a single view: the three tiers, their weights, the twelve metrics, the five anti-inflation rules, and the final composite output. This is the standard compressed to its essentials.
Craft (25%)
- Prosody & Musicality. Do the words want to be sung?
- Structural Architecture. Does the song develop with purpose?
- Rhyme Intelligence. Does rhyme serve meaning?
- Economy of Language. Does every line earn its place?
Expression (40%)
- Sensory Specificity. Can the listener enter the scene?
- Image System Coherence. Do the images share a world?
- Emotional Truth. Is the feeling earned, not announced?
- Narrative Voice. Does someone specific speak?
Impact (35%)
- Memorability. What remains after the song ends?
- Singability. Can this live inside a melody?
- Replay Value. Does it reward the tenth listen?
- Category Fit. Does it meet its genre contract?
Anti-inflation rules: Gravity Rule (default = 50) · Burden of Proof (evidence above 75) · Antagonist Ceiling (weakest tier limits the whole) · Historical Anchor (90+ is scarce) · Anti-Platitude (generic emotional summaries hit the lowest band)
Output: Composite Score (0–100) + Percentile within SongForgeAI corpus + Strongest & Weakest Metrics + Revision Recommendation.
Implementation note. The scorecard is designed to be readable by writers and auditable by evaluators: every high score must point back to evidence in the lyric.
6. Why Expression Leads
On the decision to weight emotional truth above technical correctness.
The three tiers could have been weighted equally. They are not, and the reason matters.
The most common weakness in AI-generated lyrics is not broken structure. It is emotional interchangeability. Most AI lyrics are clean, balanced, and plausible. They scan, rhyme acceptably, and deliver a chorus on schedule. But they often lack the fingerprints of a lived experience. They tell the listener what the singer feels without giving the listener a reason to believe it.
A technically solid lyric with weak expression may function. It rarely moves anyone. The four Expression metrics — Sensory Specificity, Image System Coherence, Emotional Truth, and Narrative Voice — are the hardest for a language model to fake and the easiest for a listener to miss, until they notice the song has left them cold. Weighting Expression at 40 percent forces the score to respect what the listener actually feels.
7. The Anti-Inflation Rules
Why a score of 80 should be rare, and how we keep it that way.
A scoring system is only useful if its scores mean something. AI evaluators tend to overpraise because they are trained to be helpful, and helpfulness without skepticism becomes flattery. Without constraints, most AI lyric evaluators produce scores in the 70s and 80s for lyrics that are merely coherent and formatted like songs. The Lyric Scoring Standard uses five rules to prevent that drift.
The Gravity Rule
The default score is 50, not 75. A lyric does not start as good and lose points. It starts at average and must earn excellence through evidence.
The Burden of Proof Rule
Any score above 75 must be supported by specific evidence. If a lyric receives a high score for imagery, the evaluator must name the image. If it receives a high score for emotional truth, the evaluator must identify the moment where that truth becomes visible. Unsupported high scores are rejected.
The Antagonist Ceiling
The composite score cannot outrun the lyric’s weakest major tier by more than a controlled margin. A song with beautiful imagery but broken structure is not truly excellent. A song with clever rhymes but no emotional truth is not truly excellent. The weakest tier acts as a ceiling on the whole.
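The ceiling can be read as a simple clamp. The sketch below is one possible interpretation, not the reference implementation: the standard does not publish the size of the "controlled margin", so the `margin` default of 10 points is a placeholder assumption, as are the function and parameter names.

```python
def apply_antagonist_ceiling(composite, tier_scores, margin=10.0):
    """Clamp a composite so it cannot exceed the weakest tier by more than `margin`.

    ASSUMPTION: `margin=10.0` is a hypothetical value chosen for illustration;
    the standard only says the margin is "controlled".
    """
    ceiling = min(tier_scores.values()) + margin
    return min(composite, ceiling)


# A lyric with strong Expression and Impact but weak Craft is pulled down.
capped = apply_antagonist_ceiling(
    78.0, {"craft": 55.0, "expression": 85.0, "impact": 88.0}
)
```

With these inputs the weakest tier is Craft at 55, so the ceiling sits at 65 and the composite of 78 is reduced to it. A lyric whose tiers are close together passes through unchanged.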
The Historical Context Anchor
Scores are anchored against real scored lyric populations, not imaginary perfection. A score of 90 or above should mean rare achievement, not simply a good first draft. Excellence is scarce because excellent lyrics are scarce. The standard protects that scarcity. As of v1.1.0 the anchor is concrete: “professional craft” means the published calibration corpus at /scoring/corpus, and the Antagonist Ceiling now requires evidence — specific lines, specific failure modes — before it can lower a score; vague disagreement does not.
The Anti-Platitude Rule (added v1.1.0)
A line that resolves with a generic emotional summary — “all I need is love,” “this is my truth,” “love wins” — hits the rubric’s lowest Sensory Specificity and Narrative Voice band regardless of surface polish. Platitude is the most common way a competent-looking lyric says nothing; the rule is published so an implementer can cite it rather than rediscover it empirically. This was the first rule added through the standard’s public versioning cadence (RFC-0002), bringing the count from four to five.
8. Diagnostic Axes
Mechanical checks that inform scoring without deciding it.
Some lyric problems are better handled as diagnostics than as subjective scores. The SongForgeAI evaluator runs deterministic checks that inform revision but do not directly change the composite score. These include prosody lint (stress and syllable counts), meter variance across matched sections, point-of-view drift detection, external-versus-internal detail balance, section structure validation, and rhyme-family classification.
This separation matters. A diagnostic can flag a mechanical weakness while the score evaluates artistic effect. A lyric may have uneven syllable counts and still feel musically alive. Another lyric may have perfect structural symmetry and feel emotionally dead. Diagnostics inform judgment. They do not replace it.
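As an illustration of what a deterministic diagnostic can look like, here is a deliberately crude point-of-view drift check. It is a sketch under stated assumptions, not the SongForgeAI evaluator: the pronoun lists, the function names, and the idea of comparing the dominant grammatical person between sections are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# ASSUMPTION: a small, English-only pronoun table chosen for illustration.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "we", "us", "our"},
    "second": {"you", "your", "yours"},
    "third": {"he", "him", "his", "she", "her", "they", "them", "their"},
}


def dominant_person(section):
    """Return the most frequent grammatical person in a section, or None."""
    counts = Counter()
    for token in re.findall(r"[a-z']+", section.lower()):
        stem = token.split("'")[0]  # "i'm" -> "i", "you're" -> "you"
        for person, words in PRONOUNS.items():
            if stem in words:
                counts[person] += 1
    return counts.most_common(1)[0][0] if counts else None


def pov_shifts(sections):
    """Return indices of sections whose dominant person differs from the last one seen."""
    shifts, previous = [], None
    for index, section in enumerate(sections):
        person = dominant_person(section)
        if person is None:
            continue  # no pronouns, nothing to compare
        if previous is not None and person != previous:
            shifts.append(index)
        previous = person
    return shifts
```

A check like this only reports that a shift happened. Whether the shift is a deliberate hand-off or an accidental slide is a judgment the Narrative Voice metric makes, which is why the diagnostic informs the score rather than setting it.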
9. Percentile Anchoring
Making raw scores legible, without overclaiming.
Raw scores are difficult to interpret in isolation. A 78 may sound good, weak, or excellent depending on the scoring culture around it. To make scores meaningful, SongForgeAI pairs every composite score with a percentile label drawn from the platform’s own reference corpus of previously evaluated lyrics. For example, a composite score of 78 may be reported as falling in the top 12 percent of SongForgeAI’s evaluated body of work. Percentile figures given in this document are illustrative. Live percentile values are calculated against SongForgeAI’s internal reference corpus as of the date of scoring and will be recalibrated as the corpus expands. The calculation method is documented in Appendix A; corpus size, genre distribution, and scoring date ranges will be disclosed in calibration updates as they are released.
10. The Eight-Voice Evaluation Panel
Because one evaluator is never enough.
The reference framework is organized around eight evaluator lenses. Each lens scores the twelve metrics independently within a single structured evaluation pass, bringing a different professional perspective to the lyric. For each metric, the median across the eight lenses becomes the metric score, and those medians feed the tier scores and the final composite. Major divergences between lenses are flagged for manual review.
- The Critic — identifies weaknesses, clichés, and structural failures.
- The Songwriter — evaluates craft, phrasing, voice, and revision potential.
- The Listener — asks whether the lyric connects emotionally on first hearing.
- The Industry Insider — evaluates category fit, market readiness, and audience transfer.
- The Devil’s Advocate — challenges inflated praise and forces evidence.
- The Prosodist — inspects stress, meter, rhyme, and vowel placement.
- The Producer — asks how the lyric survives arrangement, mix, and performance.
- The Cultural Historian — evaluates the lyric against the lineage of its genre.
Lyric evaluation is partly subjective. Eight lenses do not eliminate subjectivity, but they reduce the risk that one overly generous or overly harsh perspective defines the result. When lenses disagree by more than 15 points on the same metric, the lyric is flagged for human review before a final score is issued.
11. How to Use This Standard
A five-step loop for songwriters and producers.
The standard is a revision tool, not a verdict machine. It is at its most useful inside a tight iteration loop. The five steps below are the workflow SongForgeAI uses internally and recommends to any writer evaluating their own drafts.
- Score the lyric. Run the full twelve-metric evaluation and capture the composite score, the three tier scores, and the percentile.
- Identify the weakest tier. Craft, Expression, or Impact — whichever tier scores lowest is where the revision will have the largest effect on the composite.
- Revise the lowest two metrics first. Do not try to fix everything. Target the two individual metrics holding the weakest tier down, and let the rest of the lyric remain until the next pass.
- Re-score. Run the full evaluation again on the revised lyric to confirm the fix landed where intended.
- Compare percentile movement. A revision that improves the composite score but does not move the percentile is cosmetic. A revision that shifts the lyric into a higher percentile band has genuinely improved the work.
Most lyrics need two to four passes through this loop to reach their ceiling. A lyric that does not improve after four honest passes is usually signalling that the core concept, not the lines, needs rework.
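Steps two and three of the loop are mechanical enough to sketch. The code below assumes equal weighting of metrics inside each tier, which the standard does not specify, and the metric keys and function name are hypothetical labels for this example.

```python
# Metric groupings follow the three tiers described in this standard.
# The short key names are hypothetical labels for this sketch.
TIER_METRICS = {
    "craft": ["prosody", "structure", "rhyme", "economy"],
    "expression": ["sensory", "imagery", "emotional_truth", "voice"],
    "impact": ["memorability", "singability", "replay", "category_fit"],
}


def revision_targets(metrics, count=2):
    """Return the weakest tier and its `count` lowest-scoring metrics.

    ASSUMPTION: tiers are compared by the plain mean of their metric scores.
    """
    tier_means = {
        tier: sum(metrics[name] for name in names) / len(names)
        for tier, names in TIER_METRICS.items()
    }
    weakest = min(tier_means, key=tier_means.get)
    lowest = sorted(TIER_METRICS[weakest], key=lambda name: metrics[name])[:count]
    return weakest, lowest
```

Given a lyric whose Expression metrics trail the rest, the function names Expression as the tier to work on and returns the two metrics to revise before the next scoring pass.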
12. Example One — A Heartbreak Chorus
The standard in action, on a single four-line chorus.
The following example shows the same emotional territory rendered two ways: first as a typical AI-generated draft, then as a revised version that passes the standard. The lyric is a chorus about waiting for someone who is not coming home.
I'm feeling lost in the dark tonight
My heart is broken, nothing feels right
You're gone and I'm alone again
I don't know how this story ends
Composite score: 42. The lyric names emotions instead of showing them. Every image is interchangeable with every other breakup lyric ever written. There is no object in the room, no specific person, no evidence of a lived night.
The porch light's still on, I left it on for you
I drink my coffee cold because I forget
Your jacket's on the hook, I haven't moved it yet
I keep forgetting that I should
Composite score: 79. The same emotional territory, rendered through objects a listener can see and actions a listener can believe. Sensory Specificity gains 22 points. Emotional Truth gains 18 points. Narrative Voice gains 15 points. The chorus is now singable, memorable, and particular to one speaker in one house on one night.
Nothing about the revised version is fancy. No new vocabulary was required. The revision simply answered one question: what is in the room?
13. Example Two — A Worship Chorus
The same standard, applied in a different genre.
A worship chorus operates under different expectations than a country or pop chorus. Repetition is permitted. Direct address to the divine is expected. The genre contract rewards communal singability. But the standard still asks whether the lyric is specific, whether the emotion is earned, and whether the voice belongs to a real person praying rather than a generic worshipper performing prayer.
Your love is amazing, Your grace is so true
I lift up my heart and I give it to You
Forever I'll praise You, forever I'll sing
You are my Savior, my Lord, and my King
Composite score: 48. The category fit is strong. The singability is acceptable. The Expression tier is where the lyric collapses. Every adjective is a worship-music default. The narrator could be any worshipper in any room in any decade. There is no witness here, only vocabulary.
I came in late, I almost didn't come
The song was halfway through when I sat down
You met me in the second verse anyway
You always do, You always do
Composite score: 81. The category fit is preserved; the song is still a worship chorus, still communally singable, still directed toward God. But Sensory Specificity gains 19 points. Emotional Truth gains 21 points. Narrative Voice gains 17 points. The lyric now belongs to one person walking into one sanctuary on one Sunday. The repetition of “You always do” works harder because it follows a scene the listener can picture.
Same standard. Different genre. The questions do not change. Neither does the answer: specificity is how emotion becomes believable.
14. Score Interpretation
A legend for reading composite scores. Percentile bands are illustrative and recalibrated as the corpus grows.
| Composite Score | Interpretation |
|---|---|
| 90 – 100 | Rare. Release-level excellence. Roughly the top 2% of SongForgeAI’s corpus. |
| 80 – 89 | Strong. Commercially promising. Ready for selective refinement. |
| 70 – 79 | Competent. Clearly working, but likely needs another targeted pass. |
| 60 – 69 | Functional draft. Visible weaknesses in one or more major tiers. |
| 50 – 59 | Average. The lyric exists but has not yet earned its place. |
| Below 50 | A major craft, expression, or impact failure. Recommend re-forging. |
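For implementers who want the table as a lookup, a minimal sketch follows. The short labels are taken from the first word or phrase of each row; the function name and the choice to treat each band as "at or above its lower bound" (so that a fractional score such as 79.5 stays in the 70s band) are assumptions of this example.

```python
# (lower bound, label) pairs, highest band first, mirroring the table above.
BANDS = [
    (90, "Rare"),
    (80, "Strong"),
    (70, "Competent"),
    (60, "Functional draft"),
    (50, "Average"),
    (0, "Major failure"),
]


def interpret(score):
    """Map a 0-100 composite score to its interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError("composite score must be between 0 and 100")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
```

Applied to the two worked examples earlier in this paper, the heartbreak revision at 79 reads as Competent and the worship revision at 81 reads as Strong.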
15. What This Standard Is For
Who benefits from a scored, honest evaluation.
The Lyric Scoring Standard is designed for:
- AI lyric generation platforms building honest evaluation layers.
- Songwriters using AI as a co-writer and wanting real feedback.
- AI-music platform users running ten or more generations per song.
- Producers screening large volumes of lyric drafts.
- Artists comparing multiple versions of the same song.
- Educators teaching lyric craft with a shared vocabulary for critique.
- Researchers studying AI-generated creative writing.
- Any creator who wants honest feedback instead of automatic praise.
The standard is especially useful when evaluating many drafts quickly. It helps answer the questions that songwriters and producers actually ask: Which version is strongest? Why is this one better? What should be revised first? Is the chorus memorable enough? Is the lyric emotionally specific or generic? Does the song sound like a person wrote it?
16. What This Standard Is Not
The limits of scoring, stated plainly.
The standard is not a machine for declaring artistic truth. It does not claim that a single number can measure beauty, grief, faith, romance, rage, longing, or cultural meaning. It does not replace the songwriter. It does not replace the listener. It cannot tell you whether your song is worth your life. It can only tell you whether it is worth another pass.
Taste is not universal. The standard does not pretend otherwise. It offers a shared vocabulary for disagreement, not a verdict that ends it. A score is a mirror held up to a lyric at a moment in time. The mirror reveals what is working, what is pretending, and what might still become something better. The mirror is not the song.
17. Known Limitations
What the current version cannot yet do.
The current standard is English-first. Prosody, rhyme, stress, and idiom behave differently across languages, and non-English adaptation requires language-specific research that is not yet encoded.
The current reference implementation is genre-sensitive but not genre-complete. Hip-hop, R&B, and global genres in particular require scoring priors that the current version does not yet fully encode. Worship, Nashville country, hyperpop, folk ballad, and theatrical song all require different expectations around repetition, metaphor density, directness, and structure. The standard will be updated as genre-specific calibration work continues.
Subjective metrics such as Emotional Truth and Narrative Voice still depend on evaluator quality. The rubric reduces subjectivity but does not eliminate it. Percentile anchoring also depends on the quality and diversity of the reference corpus. A scoring system is only as meaningful as the body of lyrics it is calibrated against, and the SongForgeAI corpus continues to grow.
18. Ethical and Creative Commitments
What SongForgeAI will and will not do.
This standard is built around a simple creative ethic: AI should help people make more human songs, not more generic ones. The goal is not to automate taste into sameness. The goal is to expose cliché, reward specificity, protect the writer’s voice, and help songwriters move beyond first-draft fluency.
SongForgeAI does not train its evaluation models on user-submitted lyrics without explicit consent. User lyrics remain the property of their writers. A score is a gift from the system to the writer, not an extraction.
A lyric score should not shame the writer. It should give the writer a path. The best use of this standard is not judgment. It is revision.
The commitments above are mirrored in the Terms of Service and Privacy Policy. In any conflict between this whitepaper and those documents, the Terms and Privacy Policy control.
19. References and Influences
Where the thinking comes from.
This standard draws on established songwriting pedagogy, particularly work on prosody, destination writing, object writing, sensory detail, and structural form. The following sources are acknowledged as pedagogical influences. No endorsement by the authors is implied.
- Pat Pattison — Writing Better Lyrics; Songwriting: Essential Guide to Lyric Form and Structure.
- Andrea Stolpe — Popular Lyric Writing.
- Jimmy Webb — Tunesmith: Inside the Art of Songwriting.
- The Berklee College of Music songwriting curriculum tradition.
20. License and Attribution
How to use this document.
This whitepaper and the associated rubric are published as an open rubric under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You may share and adapt the material for any purpose, including commercial use, provided you credit the source.
Recommended attribution string:
SongForgeAI, The Lyric Scoring Standard, v1.2.0, 2026. Licensed under CC BY 4.0.
For implementers. Any adaptation of this standard should preserve the three-tier weighting (Craft 25%, Expression 40%, Impact 35%), the twelve-metric structure, and the five anti-inflation rules. Without those three constraints, the scoring system is no longer equivalent to the SongForgeAI Lyric Scoring Standard and should be described under a different name.
The standard is published as an installable package with the rubric JSON + four helper functions (scoreToGrade, scoreToPercentileLabel, computeComposite, isCompatibleRubricVersion). Zero runtime dependencies.
Source on GitHub: packages/scoring-rubric · View on npm
Cite this document
For academic and technical references.
@techreport{nigro2026lyricscoringstandard,
title = {The Lyric Scoring Standard: A 12-Metric Open Rubric for Evaluating AI-Generated Song Lyrics},
author = {Nigro, Todd},
institution = {SongForgeAI},
year = {2026},
month = {May},
number = {v1.2.0},
url = {https://songforgeai.com/scoring/standard/whitepaper},
note = {Licensed under CC BY 4.0}
}
Nigro, T. (2026). The Lyric Scoring Standard: A 12-metric open rubric for evaluating AI-generated song lyrics (v1.2.0). SongForgeAI. https://songforgeai.com/scoring/standard/whitepaper
Nigro, Todd. "The Lyric Scoring Standard: A 12-Metric Open Rubric for Evaluating AI-Generated Song Lyrics." Version 1.2.0, SongForgeAI, May 2026, songforgeai.com/scoring/standard/whitepaper.
SongForgeAI, The Lyric Scoring Standard, v1.2.0, May 2026. Licensed under CC BY 4.0. https://songforgeai.com/scoring/standard/whitepaper
If you publish a paper, article, or tool that uses or adapts this rubric, we’d love to know. Drop a line at support@songforgeai.com.
Appendix A — Methodology
Calibration, panel structure, scoring mechanics, and update policy.
This appendix documents the mechanics of the Lyric Scoring Standard as of the publication date. It exists to make the scoring system auditable, to allow external evaluators to understand how composite scores are produced, and to provide a reference point for future calibration updates.
A.1 Reference Corpus
The reference corpus is SongForgeAI’s internal body of previously scored lyrics. As of April 2026, the corpus is actively growing, with new lyrics added through user sessions and through calibration work conducted by the SongForgeAI team. Corpus size, genre distribution, and scoring date ranges will be disclosed in calibration updates as they are released. Percentile figures cited in this paper should be treated as illustrative until the first public calibration report is released.
A.2 Scoring Model
The reference implementation runs on a large language model directed by the SongForgeAI evaluation prompt suite. The evaluation is structured as eight distinct lenses applied within a single evaluation pass, each scoring the twelve metrics independently before the median composite is calculated. The specific model version in use, along with the full prompt architecture, is recorded in the evaluation metadata attached to every scored lyric.
A.3 Panel Structure
Each of the eight lenses — Critic, Songwriter, Listener, Industry Insider, Devil’s Advocate, Prosodist, Producer, Cultural Historian — contributes one score per metric. The final metric score is the median of the eight lens scores. The tier score is the weighted average of its metric scores. The composite score is the weighted sum of the three tier scores, with Craft at 25 percent, Expression at 40 percent, and Impact at 35 percent.
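The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and because this appendix does not list the per-metric weights inside each tier, the sketch assumes equal weights unless the caller supplies them.

```python
from statistics import median

# Tier weights as stated in A.3.
TIER_WEIGHTS = {"craft": 0.25, "expression": 0.40, "impact": 0.35}


def metric_score(lens_scores):
    """Median of the eight lens scores for a single metric."""
    if len(lens_scores) != 8:
        raise ValueError("expected one score from each of the eight lenses")
    return median(lens_scores)


def tier_score(metric_scores, metric_weights=None):
    """Weighted average of a tier's metric scores.

    Assumption: equal weights when metric_weights is None, since the
    per-metric weights are not given in this appendix.
    """
    if metric_weights is None:
        metric_weights = {name: 1.0 for name in metric_scores}
    total_weight = sum(metric_weights[name] for name in metric_scores)
    weighted = sum(metric_scores[name] * metric_weights[name] for name in metric_scores)
    return weighted / total_weight


def composite_score(tier_scores):
    """Weighted sum of the three tier scores (25 / 40 / 35)."""
    return sum(TIER_WEIGHTS[tier] * tier_scores[tier] for tier in TIER_WEIGHTS)
```

For example, tier scores of 40 (Craft), 50 (Expression), and 60 (Impact) combine as 0.25 × 40 + 0.40 × 50 + 0.35 × 60 = 51.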
A.4 Outlier Handling
When lens scores for a single metric diverge by more than 15 points, the metric is flagged for manual review. Persistent divergence across multiple metrics triggers a full human review of the lyric before a final composite is released. Flagged evaluations are retained in the corpus with a divergence marker for later calibration analysis.
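A small sketch of the divergence check follows. It assumes that "diverge" means the spread between the highest and lowest lens score for a metric, which this appendix does not state explicitly. The `min_flagged` cutoff for a full human review is also a hypothetical value, since "multiple metrics" is not quantified here.

```python
DIVERGENCE_THRESHOLD = 15  # points, from A.4


def flag_divergent_metrics(lens_scores_by_metric, threshold=DIVERGENCE_THRESHOLD):
    """Return the metric ids whose lens scores spread by more than threshold.

    Assumption: divergence is measured as max(scores) - min(scores).
    """
    return [
        metric_id
        for metric_id, scores in lens_scores_by_metric.items()
        if max(scores) - min(scores) > threshold
    ]


def needs_full_human_review(flagged_metrics, min_flagged=3):
    """Hypothetical cutoff: treat three or more flagged metrics as persistent divergence."""
    return len(flagged_metrics) >= min_flagged
```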
A.5 Percentile Calculation
Percentile bands are computed against the current reference corpus using standard rank-based percentile methodology. A composite score’s percentile is the share of corpus lyrics scoring at or below that value, calculated at the time of scoring. As the corpus expands, the percentile bands shift, and the score interpretation table may be recalibrated accordingly.
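The "at or below" definition translates directly into code. The sketch below is illustrative; the function name and the sample corpus are hypothetical.

```python
def percentile_at_or_below(score, corpus_scores):
    """Share of corpus composites at or below `score`, as a percentage."""
    if not corpus_scores:
        raise ValueError("reference corpus is empty")
    at_or_below = sum(1 for s in corpus_scores if s <= score)
    return 100.0 * at_or_below / len(corpus_scores)
```

Against a toy corpus of [30, 40, 50, 60, 70], a composite of 50 lands at the 60th percentile, because three of the five corpus scores are at or below it. Adding lyrics to the corpus changes the denominator, which is why the bands shift as the corpus grows.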
A.6 Anti-Inflation Enforcement
The five anti-inflation rules — Gravity, Burden of Proof, Antagonist Ceiling, Historical Context Anchor, and Anti-Platitude (added v1.1.0) — are enforced both in the evaluation prompt architecture and in the post-scoring validation layer. Composite scores that violate the Antagonist Ceiling or that exceed 75 without supporting evidence are automatically flagged for review before release.
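A post-scoring validation step of this kind might look like the sketch below. It is an assumption-heavy illustration: the flag names are hypothetical, and because this appendix does not define how the Antagonist Ceiling is computed, the sketch takes that result as a precomputed boolean instead of guessing at the rule.

```python
EVIDENCE_THRESHOLD = 75  # composites above this need supporting evidence (A.6)


def release_flags(composite, has_supporting_evidence, violates_antagonist_ceiling):
    """Return the reasons a composite should be held for review; empty means releasable.

    `violates_antagonist_ceiling` is computed elsewhere; its definition is
    not given in this appendix, so it is treated as an input here.
    """
    flags = []
    if violates_antagonist_ceiling:
        flags.append("antagonist_ceiling")
    if composite > EVIDENCE_THRESHOLD and not has_supporting_evidence:
        flags.append("unsupported_high_score")
    return flags
```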
A.7 Update Policy
The Lyric Scoring Standard is a living document, versioned under a published policy (RFC-0001): PATCH for docs and typos, MINOR for clarifications and additive rules, MAJOR for any change that would move historical composite scores by more than five points. The rubric has already moved through this cadence — v1.0.1 (reproducibility seal), v1.1.0 (the Anti-Platitude rule), and v1.2.0 (the Intentional-POV and four-signal-Memorability refactors). The full version history, with the RFC that ratified each bump, lives at /scoring/standard/changelog. All versions remain available under CC BY 4.0.
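The bump rules can be expressed as a short decision function. This is a sketch under stated assumptions, not the text of RFC-0001: the function names are hypothetical, and the shift is assumed to be the largest absolute change, in points, across historical composite scores.

```python
def required_bump(max_composite_shift, adds_or_clarifies_rules):
    """Pick the version level implied by the A.7 policy.

    Assumption: max_composite_shift is the largest absolute change, in
    points, that the edit causes to any historical composite score.
    """
    if max_composite_shift > 5:
        return "MAJOR"
    if adds_or_clarifies_rules:
        return "MINOR"
    return "PATCH"  # docs and typos only


def next_version(version, level):
    """Apply a MAJOR / MINOR / PATCH bump to a 'major.minor.patch' string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "MAJOR":
        return f"{major + 1}.0.0"
    if level == "MINOR":
        return f"{major}.{minor + 1}.0"
    if level == "PATCH":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level}")
```

Under this sketch, an additive rule that moves no historical composite by more than five points takes v1.1.0 to v1.2.0.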
A.8 Disclosure Commitment
SongForgeAI intends to publish periodic calibration reports as the reference corpus becomes large and diverse enough for statistically meaningful percentile reporting. Each calibration report will aim to disclose corpus size, genre distribution, score distribution, median scores per tier, divergence rates, and any changes to the percentile bands.
A.9 Rhythmic Dimension (Phase 3)
The 12-metric rubric measures semantic craft, emotional expression, and listener impact. It does not directly measure structural compression — the property that makes a chorus chant-able. SongForgeAI ships a separate rhythmic-dimension analyzer that operates on the same lyric text and surfaces three signals to the writer:
- Chorus / verse compression ratio. Average syllables per chorus line divided by average syllables per verse line. Ratios below 0.85 indicate compressed chant-able choruses; ratios above 1.15 indicate bloated choruses that risk losing the hook — the failure mode external reviewers most consistently flag in AI-generated lyrics.
- End-rhyme architecture. Per-section rhyme schemes (AABB, ABAB, AAAA, ABCB) detected via heuristic letter-tail matching with silent-e and syllabic-le handling. Surfaces structured rhyme presence; misses purely phonetic rhymes (smoke / oak) that letter-level matching cannot resolve.
- Internal-rhyme density. Rhymes within a single line, normalized per 100 words. High density indicates Eminem-style multi-rhyme construction; low density indicates conventional, prose-like writing.
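To make the letter-tail idea concrete, here is a minimal sketch of end-rhyme scheme detection. It is not the shipped analyzer. It assumes a simplified rule in which a trailing "e" after a consonant is dropped, which covers both silent-e (fire → fir) and syllabic-le (table → tabl), and the rhyme tail is the last vowel group plus any consonants after it. All function names are hypothetical.

```python
import re

VOWELS = "aeiouy"


def rhyme_tail(line):
    """Letter-level rhyme tail of the last word in a line (heuristic)."""
    words = re.findall(r"[a-z]+", line.lower().replace("'", ""))
    if not words:
        return ""
    word = words[-1]
    # Drop a trailing 'e' that follows a consonant, as long as a vowel remains.
    if len(word) > 2 and word[-1] == "e" and word[-2] not in VOWELS:
        stem = word[:-1]
        if any(ch in VOWELS for ch in stem):
            word = stem
    # Tail = last vowel group plus the consonants that follow it.
    match = re.search(r"[aeiouy]+[^aeiouy]*$", word)
    return match.group(0) if match else word


def rhyme_scheme(lines):
    """Label each line by its rhyme tail: first tail seen is A, next new tail is B, and so on."""
    labels = {}
    scheme = []
    for line in lines:
        tail = rhyme_tail(line)
        if tail not in labels:
            labels[tail] = chr(ord("A") + len(labels))
        scheme.append(labels[tail])
    return "".join(scheme)
```

The sketch reproduces the limitation named above: "smoke" reduces to the tail "ok" while "oak" keeps the tail "oak", so the pair is not detected as a rhyme even though it rhymes by ear.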
These signals are surfaced in the writer-facing UI on both the moment-of-forge result page and the dashboard song-detail page. The signals are NOT yet integrated into the M11 (Memorability) score. Phase 4 of the rhythmic dimension will introduce the integration only after production telemetry calibrates the wound-trigger thresholds against real distribution data.
The current implementation uses a heuristic syllable counter accurate to within approximately 10 percent for most English words. This precision is sufficient for the relative comparisons the analyzer makes (chorus density versus verse density). A CMU-dict integration is planned for Phase 5 when absolute syllable precision becomes necessary for cross-language calibration.
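The compression ratio and its thresholds can be sketched with a simple vowel-group syllable counter. This is an illustrative heuristic, not the production counter: the counting rules, the function names, and the "neutral" label for ratios between 0.85 and 1.15 are assumptions.

```python
import re
from statistics import mean

VOWELS = "aeiouy"


def count_syllables(word):
    """Heuristic count: vowel groups, minus a trailing silent 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # Syllabic -le (consonant + 'le', as in "table") keeps its syllable.
    syllabic_le = len(word) > 2 and word.endswith("le") and word[-3] not in VOWELS
    if count > 1 and word.endswith("e") and word[-2] not in VOWELS and not syllabic_le:
        count -= 1  # silent 'e', as in "time"
    return max(count, 1)


def line_syllables(line):
    return sum(count_syllables(token) for token in line.split())


def compression_ratio(chorus_lines, verse_lines):
    """Average syllables per chorus line divided by average per verse line."""
    verse_avg = mean(line_syllables(line) for line in verse_lines)
    if verse_avg == 0:
        raise ValueError("verse lines contain no countable syllables")
    return mean(line_syllables(line) for line in chorus_lines) / verse_avg


def classify_ratio(ratio):
    """Thresholds from A.9; the 'neutral' label for the middle band is hypothetical."""
    if ratio < 0.85:
        return "compressed"
    if ratio > 1.15:
        return "bloated"
    return "neutral"
```

Because the ratio compares two averages produced by the same counter, errors that affect chorus and verse lines alike tend to cancel, which is consistent with the point that relative comparisons tolerate an approximate counter.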
Closing
I built this standard because I wanted honest feedback from a system I could actually trust. I wanted a score that meant something. I wanted to stop reading AI lyrics that were fluent and forgettable and start building AI lyrics that felt like something was at stake.
This is version 1.2.0. It will improve. It will be argued with, broken, and rebuilt, hopefully by people who care about songs as much as I do. That is the point of publishing it under an open license. A standard that no one argues with is a standard that no one is using.
SongForgeAI scores lyrics not to replace human judgment, but to protect the human things AI most often forgets: voice, truth, specificity, and the line that feels lived.
A score is a mirror. The song is still yours to make.
— SongForgeAI · v1.2.0 · CC BY 4.0 —