
The Lyric Scoring Standard

A 12-metric open-standard rubric for AI-generated song lyric evaluation

SongForgeAI · April 2026 · CC-BY-4.0 · v0.9 draft

Abstract

We introduce the Lyric Scoring Standard, a 12-metric rubric for evaluating the craft of song lyrics produced by large language models. The standard addresses three gaps in existing AI content evaluation: lack of domain specificity (general-purpose LLM judges do not differentiate between prose and song), rating inflation (default scores skew high absent explicit anti-inflation rules), and the absence of deterministic structural checks orthogonal to subjective evaluation. We formalize the rubric across three weighted tiers (Craft 25%, Expression 40%, Impact 35%), four explicit anti-inflation rules (Gravity, Burden of Proof, Antagonist Ceiling, Historical Context Anchor), and a set of rubric-invisible diagnostic axes (prosody, meter variance, POV stability, detail balance, narrative structure, rhyme family) that inform refinement without affecting the composite score. The standard is published under CC-BY-4.0 and is designed to be replicable across any evaluation framework with equivalent tier weights and anti-inflation mechanisms.

1. Motivation

As AI lyric generation has become commercially viable (via Suno, Udio, and related music-generation systems), evaluation of lyric craft has remained unstructured. General LLM-as-judge approaches produce scores that correlate weakly with professional human assessment; they also exhibit pronounced rating inflation, defaulting to scores in the 70–85 range even when the underlying lyric exhibits identifiable craft failures.

Existing songwriting pedagogies (Pattison, Stolpe, Hayes and Brindell at Berklee College of Music) provide specific, teachable craft rules — prosody preservation, destination writing, external/internal detail balance, sparse-to-dense arrangement. These rules are systematic enough to be encoded into a rubric but are largely absent from general-purpose evaluators. The Lyric Scoring Standard operationalizes the relevant subset of this pedagogy.

2. The Twelve Metrics

The rubric organizes twelve metrics into three tiers. Tier weights reflect the relative importance of each dimension to commercial lyric quality:

  • Craft (25%): Prosody & Musicality, Structural Architecture, Rhyme Intelligence, Economy of Language.
  • Expression (40%): Sensory Specificity, Image System Coherence, Emotional Truth, Narrative Voice.
  • Impact (35%): Memorability, Singability, Replay Value, Category Fit.

Each metric is scored 0–100 independently; the composite is the weighted average of the three tier scores. The Expression tier carries disproportionate weight because expression failures (sensory abstraction, narrator drift, performative emotion) are the most common AI failure mode; craft failures are easier to detect and fix deterministically.
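As a concrete illustration of that arithmetic, the sketch below assumes a tier's score is the unweighted mean of its four metrics; only the tier weights come from the standard, while the metric keys and that averaging assumption are ours:

    # Illustrative sketch of the composite arithmetic. Metric keys and the
    # mean-of-four-metrics tier score are assumptions, not the standard's API;
    # the 0.25 / 0.40 / 0.35 tier weights are the standard's.

    TIERS = {
        "craft":      (0.25, ["prosody", "structure", "rhyme", "economy"]),
        "expression": (0.40, ["sensory", "image_system", "emotional_truth", "voice"]),
        "impact":     (0.35, ["memorability", "singability", "replay", "category_fit"]),
    }

    def tier_scores(metric_scores: dict[str, float]) -> dict[str, float]:
        """Per-tier score: mean of that tier's 0-100 metric scores (assumption)."""
        return {tier: sum(metric_scores[m] for m in metrics) / len(metrics)
                for tier, (_, metrics) in TIERS.items()}

    def composite(metric_scores: dict[str, float]) -> float:
        """Tier-weighted average; the weights sum to 1.0."""
        ts = tier_scores(metric_scores)
        return sum(weight * ts[tier] for tier, (weight, _) in TIERS.items())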

3. Anti-Inflation Rules

The rubric specifies four anti-inflation rules that prevent the evaluator from drifting toward uniformly high scores. These rules are binding on the evaluation prompt and must survive any adaptation:

  1. Gravity Rule: The default score for any metric is 50, not 75. The evaluator starts each lyric at “average” and the lyric must earn its way up; it does not start at “excellent” and negotiate down. This single rule corrects for approximately 8 composite points of baseline drift observed in unregulated LLM judges.
  2. Burden of Proof: Every score above 75 on any metric must cite specific evidence from the lyric. An evaluator cannot assert high Sensory Specificity without quoting the image, nor high Narrative Voice without naming the voice attribute.
  3. Antagonist Ceiling: The composite score cannot exceed the weakest tier’s score by more than 15 points. A song with excellent Expression (92) and failing Craft (45) is structurally unstable; the composite reflects that instability (see the sketch after this list).
  4. Historical Context Anchor: Scores are anchored against the actual population of commercially released songs, not against a hypothetical perfect standard. Rare achievement (90+) means “compares favorably to the top 5% of released work,” not merely “well-crafted.”
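Rules 1, 2, and 4 constrain the evaluation prompt; the Antagonist Ceiling is the one rule that reduces to pure arithmetic. A minimal sketch of that clamp, where the function and parameter names are ours and the 15-point margin is the standard's:

    def apply_antagonist_ceiling(composite_score: float,
                                 tier_scores: dict[str, float],
                                 margin: float = 15.0) -> float:
        """Rule 3: the composite may not exceed the weakest tier's score
        by more than `margin` points (15 in the published standard)."""
        return min(composite_score, min(tier_scores.values()) + margin)

    # Expression 92, Craft 45, Impact 80 -> weakest tier is 45, so the
    # weighted composite (76.05 under the tier weights) is clamped to 60.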

4. Rubric-Invisible Diagnostic Axes

The standard explicitly excludes six classes of measurement from the scored rubric, treating them instead as diagnostic axes that inform refinement but do not affect the composite. This separation is deliberate: some signals are reliable enough as deterministic heuristics that subjective evaluation would only add noise, and others are too fragile for rubric-level codification.

  • Prosody lint: weak endings, stress clusters, syllable outliers. Deterministic via word-list + stress classification.
  • Meter variance: coefficient of variation (CV) of per-line syllable counts within each section; the pro-craft threshold is CV < 0.15 (see the sketch after this list).
  • POV stability: first- / second- / third-person pronoun drift across sections.
  • Detail balance: Stolpe’s external/internal rule applied per section, with pre-chorus as transitional and bridges as free-pass.
  • Narrative structure: presence of verse, chorus, and bridge; missing sections are penalized via a binary validator.
  • Rhyme family: perfect / family / slant / none classification using curated pronunciation overrides + silent-e long-vowel upgrade on spelling-based extraction.
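As referenced above, two of these axes are compact enough to sketch directly. The following is illustrative only: syllable counts arrive pre-computed (a real implementation needs a syllable estimator and the word-list stress classification), and the pronoun sets are stand-ins for whatever curated lists an implementation uses:

    import re
    import statistics

    def meter_cv(syllables_per_line: list[int]) -> float:
        """Coefficient of variation (stdev / mean) of per-line syllable
        counts within one section; CV < 0.15 is the pro-craft threshold."""
        if len(syllables_per_line) < 2:
            return 0.0  # a one-line section has no variance to measure
        mean = statistics.mean(syllables_per_line)
        return statistics.stdev(syllables_per_line) / mean if mean else 0.0

    PRONOUNS = {  # illustrative stand-in sets, not the standard's lists
        "first":  {"i", "me", "my", "mine", "we", "us", "our"},
        "second": {"you", "your", "yours"},
        "third":  {"he", "she", "they", "him", "her", "them", "his", "their"},
    }

    def dominant_pov(section_text: str) -> str | None:
        """Most frequent pronoun person in a section; POV drift is detected
        by comparing the dominant POV across consecutive sections."""
        words = re.findall(r"[a-z']+", section_text.lower())
        counts = {pov: sum(w in prons for w in words)
                  for pov, prons in PRONOUNS.items()}
        best = max(counts, key=counts.get)
        return best if counts[best] > 0 else None

    # A verse scanning 8, 8, 9, 8 syllables: meter_cv([8, 8, 9, 8]) ~ 0.06,
    # comfortably inside the pro-craft threshold.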

5. Percentile Anchoring

Raw composite scores are not presented to end users without a percentile anchor. Given a score s, the percentile label is computed against the historical distribution of all forges in the reference implementation (currently SongForgeAI, n > 50,000 songs). The published mapping table is reproducible: any implementation with a statistically equivalent population can compute local percentiles. The anchor is essential to user comprehension — a composite of 78 means nothing without “Top 12%.”
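A minimal sketch of the anchoring computation, assuming the local population is a flat list of historical composites (the published mapping table itself is not reproduced here):

    from bisect import bisect_left

    def percentile_label(score: float, population: list[float]) -> str:
        """'Top N%' label: the share of a local historical population of
        composites scoring at or above `score`."""
        pop = sorted(population)
        at_or_above = len(pop) - bisect_left(pop, score)
        return f"Top {100.0 * at_or_above / len(pop):.0f}%"

    # If 88% of historical composites fall strictly below 78, then
    # percentile_label(78.0, population) -> "Top 12%".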

6. Reference Implementation Notes

The reference implementation runs the rubric through a 5-voice evaluation panel (Critic, Songwriter, Listener, Industry Insider, Devil’s Advocate), each voice scoring all 12 metrics independently. The published composite is the median of the five panel composites, with outliers (more than 15 points of divergence) flagged for reviewer scrutiny. This multi-voice consensus mechanism exists because LLM judges exhibit non-trivial session-to-session variance; the median corrects for that drift without requiring an auditable “true score.”
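A minimal sketch of the consensus step; the voice names come from the text above, while measuring divergence from the median is our reading of the outlier rule:

    import statistics

    def panel_consensus(voice_composites: dict[str, float],
                        divergence: float = 15.0) -> tuple[float, list[str]]:
        """Median of the per-voice composites; voices more than `divergence`
        points from the median are flagged for reviewer scrutiny."""
        med = statistics.median(voice_composites.values())
        flagged = [voice for voice, score in voice_composites.items()
                   if abs(score - med) > divergence]
        return med, flagged

    # panel_consensus({"critic": 62, "songwriter": 70, "listener": 68,
    #                  "industry_insider": 66, "devils_advocate": 88})
    # -> (68, ["devils_advocate"])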

The rubric is model-agnostic; the reference implementation uses Claude Sonnet for evaluation but has been validated against GPT-4 and Gemini 1.5 Pro with composite-score correlation r > 0.78 across the three systems when the same prompt + anti-inflation rules are enforced.

7. Known Limitations

  • English-only. Prosody rules, rhyme extraction, and POV classifiers assume Germanic/Romance stress patterns. Non-English adaptation requires language-specific rework.
  • The rhyme analyzer uses spelling-based heuristics with pronunciation overrides, not a full phonetic dictionary. Accuracy drops on irregular pronunciations outside the override set (~100 words curated for English lyric frequency).
  • Historical Context Anchor is inherently genre-sensitive. The reference implementation uses a cross-genre population; genre-specific percentile anchors are a future extension.
  • Subjective metrics (Emotional Truth, Narrative Voice) remain dependent on evaluator quality. The rubric reduces but does not eliminate subjective variance.

8. Citations & Pedagogical Sources

  • Pattison, P. Songwriting: Essential Guide to Lyric Form and Structure. Berklee Press, 1991.
  • Pattison, P. Writing Better Lyrics. Writer’s Digest, 2nd ed., 2009.
  • Pattison, P. Setting Your Words to Music. Berklee Online course, ongoing.
  • Stolpe, A. Popular Lyric Writing: 10 Steps to Effective Storytelling. Berklee Press, 2007.
  • Hayes, K. & Brindell, A. Contemporary Songwriting Techniques. Berklee Online curriculum, ongoing.

This whitepaper and the associated rubric are licensed CC-BY-4.0. Attribution: SongForgeAI, Lyric Scoring Standard v0.9, 2026.

9. Changelog

  • v0.9 (April 2026): Draft published. Diagnostic axes enumerated explicitly; rhyme-family analyzer added as deterministic input to Metric #3.
  • v0.8 (March 2026): Anti-inflation rules formalized. Antagonist Ceiling introduced.
  • v0.7 (February 2026): Expression tier weight raised from 35% to 40% after the initial implementation showed the previous weighting over-rewarded technically polished but emotionally empty output.
  • v1.0 target: peer review by named songwriters; publication to arXiv (cs.CL) with reproducible evaluation corpus.
