The first published rubric for AI-graded lyrics
A 12-metric public standard. Versioned. Signed. Open-licensed (CC BY 4.0). SongForgeAI uses it to grade lyrics across Craft, Expression, and Impact — but the rubric belongs to anyone who wants to copy, cite, or argue with it. A 50 is average. 80+ is strong. 90+ is rare.
No signup
Score a lyric publicly
5 free scores per IP per day. Same 12 metrics, no account needed. Best for skimming the rubric.
Browse the library
399 curated artist briefs
What the forge knows about every artist in its library — hand-authored briefs, version-controlled.
Try the rubric on your lyrics
Score a draft right now.
Paste lyrics, get a 12-metric breakdown — composite score, transcendent lines, wounds, and per-metric reasoning. Same rubric the documentation below describes; same one the forge runs internally.
Minimum 50 characters.
Sign-in required (free tier includes scoring). See the published rubric →
Craft (25%)
Can this person write? Mechanics, structure, rhyme, and word choice.
Expression (40%)
Does it say something worth hearing? Specificity, originality, truth, and voice.
Impact (35%)
Will anyone remember it tomorrow? Transcendence, arc, stickiness, and genre fit.
Sample scorecard
What an actual evaluation looks like — annotated.
Strong natural rhythm, one forced rhyme in V2
Clean arc, bridge earns its place
Good slant rhyme use, one predictable end-rhyme
Tight overall, two filler words in chorus
"Tangerines and someone else's smile" — earned
Original governing image, one stock metaphor
Rings true. Bridge vulnerability is genuine.
Consistent narrator, one POV slip in V3
Line 14 is the one. "Drove home with the windows down to forget it."
Moves from avoidance to acceptance. Could push further.
Chorus hook is sticky, verses less so
Authentic country with modern specificity
“And drove home with the windows down to forget it”
Marked by 3 of 8 panel voices. Physical action carrying unspoken grief.
The 12 Metrics
Lyrical Specificity
Concrete imagery, sensory detail, proper nouns, time anchors. The opposite of abstract generalities.
The song lives in a real place with real objects. "Tangerines and someone else's smile" instead of "memories of you."
Imagery Originality
Fresh metaphors, defamiliarized objects, governing images that haven't been written to death.
Images that surprise on first read and deepen on second. No shattered hearts, no oceans of tears, no wings of freedom.
Emotional Truth
The ring-test: does it feel true? Earned emotion, unforced vulnerability, no borrowed sentiment.
The emotion arrives through specificity and honesty, not through telling the listener what to feel.
Voice & POV Integrity
Narrator consistency, perspective clarity, and a credible speaker. Does this sound like one person talking?
A distinct human presence. Word choices, diction, and references that belong to one coherent narrator.
Why scores are hard to game
We built anti-inflation into the scoring system so that high scores actually mean something.
Gravity Rule
The default is 50, not 80. Every point above average must be earned with specific evidence from the lyrics.
Burden of Proof
Scores above 80 require the scorer to cite specific lines and explain why they justify the number.
Antagonist Ceiling
A dedicated critical voice challenges every score. If it finds a real weakness, the score drops.
Historical Context
Scores are anchored to professional craft standards. A 90+ means near-flawless execution across all 12 metrics — intentionally rare.
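The first two rules above can be sketched in a few lines of Python. This is an illustration only, not the forge's implementation. The page states that the default is 50 and that scores above 80 need cited lines plus an explanation; it does not say what happens to a score that fails those checks. The fallback behavior here (drop to 50, or cap at 80) is an assumption, and `MetricScore` and `enforce_burden_of_proof` are hypothetical names.

```python
from dataclasses import dataclass, field

DEFAULT_SCORE = 50        # Gravity Rule: the starting assumption is "average"
CITATION_THRESHOLD = 80   # Burden of Proof: above this, citations and reasoning are required


@dataclass
class MetricScore:
    """One evaluator's proposed score for one metric (hypothetical structure)."""
    metric: str
    proposed: int
    cited_lines: list[str] = field(default_factory=list)
    reasoning: str = ""


def enforce_burden_of_proof(score: MetricScore) -> int:
    """Return the score that stands after both rules are applied.

    Assumed fallbacks (not stated on the page):
      - above-average score with no cited lines -> falls back to the default of 50
      - score above 80 with citations but no explanation -> capped at 80
    """
    if not 0 <= score.proposed <= 100:
        raise ValueError("scores must be in the range 0-100")
    if score.proposed > DEFAULT_SCORE and not score.cited_lines:
        return DEFAULT_SCORE
    if score.proposed > CITATION_THRESHOLD and not score.reasoning.strip():
        return CITATION_THRESHOLD
    return score.proposed


uncited = MetricScore("Emotional Truth", 85)
cited_only = MetricScore(
    "Emotional Truth", 85,
    cited_lines=["Drove home with the windows down to forget it."],
)
fully_argued = MetricScore(
    "Emotional Truth", 85,
    cited_lines=["Drove home with the windows down to forget it."],
    reasoning="Physical action carrying unspoken grief.",
)
```

Under these assumptions, `uncited` settles at 50, `cited_only` is held at 80, and only `fully_argued` keeps its 85.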
Methodology: how scoring works
Every song is scored by a separate AI evaluation pass — not the same model that wrote the lyrics. Multiple evaluators with different perspectives must reach consensus on each of the 12 metrics.
A dedicated critical voice challenges every score. If it identifies a real weakness — a cliché, a broken meter, a forced rhyme — the score drops. Unresolved objections cap the composite depending on severity.
The multi-voice process is designed to counter the inflated scores that single-pass AI evaluation tends to produce. Scores are calibrated relative to professional songwriting craft, not to other AI output.
What “deliberately hard” means: a single-pass AI scorer will give most output 80+. Our multi-voice process produces a distribution centered around 50, because the default assumption is “average until proven otherwise.” Scores above 80 require the scorer to cite specific lines. Scores above 90 require near-flawless execution across all 12 metrics — which is why they are rare in practice, not by arbitrary design.
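A rough sketch of the two steps described above: panel consensus on a metric, then a ceiling applied for unresolved objections. The page does not publish how consensus is reached or what the ceilings are, so both are stand-ins. The median is used as a simple consensus proxy, and the values in `SEVERITY_CAPS` are hypothetical placeholders, not the forge's numbers.

```python
from statistics import median

# HYPOTHETICAL ceilings. The page says only that unresolved objections
# cap the composite "depending on severity"; these numbers are placeholders.
SEVERITY_CAPS = {"minor": 89, "moderate": 79, "major": 69}


def consensus(voice_scores: list[int]) -> float:
    """Stand-in for panel consensus: the median of the voices' scores (an assumption)."""
    if not voice_scores:
        raise ValueError("at least one evaluator score is required")
    return float(median(voice_scores))


def apply_antagonist_ceiling(composite: float, unresolved: list[str]) -> float:
    """Cap the composite at the lowest ceiling among the unresolved objections.

    `unresolved` holds the severity label of each objection the critical
    voice raised that the panel could not answer.
    """
    caps = [SEVERITY_CAPS[severity] for severity in unresolved]
    return float(min([composite, *caps]))


# Eight panel voices scoring one metric (illustrative numbers).
panel = [72, 80, 76, 90, 74, 78, 75, 77]
metric_score = consensus(panel)

# A composite of 84 with one minor and one moderate objection left standing.
capped = apply_antagonist_ceiling(84.0, ["minor", "moderate"])
```

With these placeholder ceilings, the single outlier score of 90 barely moves the panel result, and the composite of 84 is held to the moderate ceiling because that is the strictest cap still in play.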
Grade Scale
Near-flawless across all 12 metrics. Exceptionally rare in practice.
Exceptional. Every line earns its place with cited evidence.
Outstanding. Minor imperfections only.
Strong. Craft is evident throughout.
Good. Solid work with room to grow.
Competent. Foundation is there.
Developing. Moments of promise.
Average. Functional but unremarkable.
Below average. Significant gaps.
Needs fundamental rework.
How the composite score works
Each metric scores 0-100. The composite is a weighted average across the three tiers: Craft (25%), Expression (40%), and Impact (35%).
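As a minimal sketch of that weighted average, using the tier weights listed earlier on this page (Craft 25%, Expression 40%, Impact 35%): the page does not say how the four metrics inside a tier are combined, so the plain mean used here is an assumption. The function name and the example scores are made up for illustration.

```python
from statistics import mean

# Tier weights as published on this page.
TIER_WEIGHTS = {"craft": 0.25, "expression": 0.40, "impact": 0.35}


def composite_score(tier_metrics: dict[str, list[float]]) -> float:
    """Weighted average of the three tier scores, rounded to one decimal.

    Assumption: each tier's score is the unweighted mean of its metrics.
    """
    if set(tier_metrics) != set(TIER_WEIGHTS):
        raise ValueError(f"expected exactly these tiers: {sorted(TIER_WEIGHTS)}")
    total = 0.0
    for tier, scores in tier_metrics.items():
        if not scores or any(not 0 <= s <= 100 for s in scores):
            raise ValueError(f"{tier}: need at least one score, each in 0-100")
        total += TIER_WEIGHTS[tier] * mean(scores)
    return round(total, 1)


# Illustrative scores only, four metrics per tier.
example = {
    "craft": [70, 74, 68, 72],       # tier mean 71
    "expression": [80, 76, 78, 74],  # tier mean 77
    "impact": [82, 66, 70, 74],      # tier mean 73
}
result = composite_score(example)    # 0.25*71 + 0.40*77 + 0.35*73 = 74.1
```

Because Expression carries the largest weight, a point gained there moves the composite more than a point gained in Craft.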
What a score should help you do
Open standard
The rubric is public. Adopt it, cite it, argue with it.
Lyric Scoring Standard v1.0 is published under CC BY 4.0. Full spec + machine-readable JSON.
Read the Lyric Scoring Standard
Read deeper
The Lyric Scoring Standard v1.0
The full open rubric — 12 metrics, three weighted tiers, four anti-inflation rules, citations, BibTeX.
Read the whitepaper
Why the default score is 50, not 75
The Gravity Rule, Burden of Proof, Antagonist Ceiling, and Historical Context Anchor — why our 73 carries more information than another tool's 87.
Read the essay
What “specificity” actually means in a lyric
Deep-dive on the Sensory Specificity metric. Detail vs. specificity, the Stolpe test, the one-line rule.
Read the essay
Economy of Language: why the gauntlet cuts lines you love
Every word is a tax on attention. The Economy metric is why the gauntlet hates verse 2 line 3 even when it reads pretty.
Read the essay
Anatomy of a Forge: one song from prompt to final
A complete trace of one forge run, all seven phases, with the specific before/after score and the three gauntlet-repaired lines named.
Read the case study
Scoring by genre
How the rubric reads different genres — what specificity, voice, and hook-strength actually look like when the genre changes.
See it in action
Every song you forge or evaluate gets a full 12-metric breakdown with reasoning per metric.