Lyric Scoring Standard
Anchored at composite 95 against canonical professional craft — a country song from 1949 that survived seventy-five years. Read the calibration →
An evidence-based 12-metric rubric for scoring song lyrics. Weighted across three tiers (Craft 25%, Expression 40%, Impact 35%). Anti-inflation rules are load-bearing — a 50 is average; a 90+ is rare.
Reading the rubric's evolution? See the unified version + RFC changelog or the structural diff between every published version.
Looking for the orthogonal axis? Read the Fidelity Standard v0.1.0 — the seven-component composite measuring whether the lyric served the brief. (RFC-0010 in comment through 2026-05-26.)
Want the architectural commitment behind the rubric? Read Topic Sovereignty — the principle that says your idea is sovereign at the SUBJECT layer; the system varies the TREATMENT.
Adopt freely, cite as 'Lyric Scoring Standard v1.2.0 (SongForgeAI).' No charge, no permission needed.
License terms →
Versioned, machine-readable specification. Stable shape; bumps documented in the version table below.
Download JSON →
Published date. Major changes trigger a version bump + interim Songwriter Index edition documenting the diff.
Adopt the standard in one command
Pure data + helpers. Zero runtime dependencies. MIT for the helper code, CC BY 4.0 for the rubric JSON itself. Cite the standard by name + version when scoring against it.
Helpers exported: scoreToGrade(), scoreToPercentileLabel(), computeComposite(), isCompatibleRubricVersion(). Full rubric available as @songforgeai/scoring-rubric/rubric.json.
Published artifacts — four packages on npm
The standard and its companion infrastructure ship as four published packages under the @songforgeai scope. Every install is a citation; every download is on the public counter. The publish discipline is now CI-enforced — any built-but-not-published public artifact fails the check:built-not-published ratchet at the commit gate.
- @songforgeai/scoring-rubric
The 12-metric Lyric Scoring Standard. Pure data + helpers (scoreToGrade, computeComposite, isCompatibleRubricVersion). This standard, installable.
- @songforgeai/fidelity-standard
The seven-component Fidelity Standard — orthogonal to quality. Measures whether the lyric honored the brief, not whether the lyric is good. Currently in RFC-0010 public comment through 2026-05-26.
- @songforgeai/agent-room
The multi-agent “writing room” consensus pattern extracted from the scoring pipeline. N panelists score in parallel; median + standard deviation surface the divergence. Reusable beyond lyric evaluation.
- @songforgeai/client
The TypeScript SDK for the public scoring API. Includes ed25519 seal verification via verifySeal() so consumers can prove any score response was actually produced by the published rubric + named model + temperature.
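The median-plus-standard-deviation consensus described for @songforgeai/agent-room above can be illustrated with a minimal sketch. This is not the package's API: the function name, the result shape, and the choice of population (rather than sample) standard deviation are all assumptions made for illustration.

```typescript
// Hypothetical sketch of a panel consensus summary.
// None of these names come from @songforgeai/agent-room.
interface PanelConsensus {
  median: number;
  stdDev: number;
  panelSize: number;
}

function summarizePanel(scores: number[]): PanelConsensus {
  if (scores.length === 0) {
    throw new Error("at least one panelist score is required");
  }
  // Median: middle value of the sorted scores (mean of the two middle values when N is even).
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

  // Assumption: population standard deviation (divide by N, not N - 1).
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length;

  return { median, stdDev: Math.sqrt(variance), panelSize: scores.length };
}

// A panel that agrees and a panel that splits can share a similar median;
// the standard deviation is what surfaces the divergence.
const tightPanel = summarizePanel([62, 64, 63, 65, 61]);
const splitPanel = summarizePanel([40, 85, 62, 45, 88]);
```

Reporting the median rather than the mean keeps one outlying panelist from dragging the headline score, while the standard deviation keeps the disagreement visible instead of averaging it away.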
Cite this standard
The Lyric Scoring Standard is published under CC BY 4.0. Attribution required when you cite it in a paper, blog post, or third-party tool. Three formats below — copy whichever fits your medium.
@misc{songforgeai_lyric_scoring_2026,
author = {{SongForgeAI}},
title = {Lyric Scoring Standard, version 1.2.0},
year = {2026},
howpublished = {\url{https://songforgeai.com/scoring/standard}},
note = {Licensed CC BY 4.0}
}
Cited the standard in your work? Add yourself to /cited-by, the public registry of external implementations.
Machine-readable surfaces
- Rubric JSON: /scoring-standard.json — canonical machine-readable rubric.
- JSON Schema: /scoring-standard.schema.json — validate any adopted copy of the rubric (W3C-style publication; conforms to JSON Schema draft 2020-12).
- Schema.org Dataset: ingested by Google Dataset Search + academic aggregators via the in-page <script type="application/ld+json"> block.
Anti-Inflation Philosophy
Five rules baked into the rubric so a 65 from us means what a 65 from a human craft critic would mean, not the 80 that most LLM-judged scoring inflates to.
- The default is 50. Every point above 50 must be earned with specific evidence from the lyric itself.
- Scores above 80 require the scorer to cite specific lines and explain why they justify the number.
- A dedicated critical voice challenges every score. If it finds a real weakness, the score drops. v1.1.0 clarification: the antagonist must produce evidence (specific lines, specific failure modes); vague disagreement does not lower the score.
- Scores anchor to professional craft standards, not to other AI output. A 90+ means near-flawless execution across all 12 metrics. v1.1.0 clarification: 'professional craft' = the corpus published at /scoring/corpus, especially the Hank Williams S-band anchor at composite 95.
- A line that resolves with a generic emotional summary ('all I need is love', 'this is my truth', 'love wins') hits the rubric's lowest Specificity + Voice band regardless of surface polish. Documented inline so implementers can cite a published rule rather than discover this empirically.
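The first two rules above (50 as the default, and cited lines plus an explanation for anything above 80) lend themselves to a mechanical pre-check before a score is accepted. The sketch below is one way an implementer might gate scores; the record shape and every field name are hypothetical and do not come from the published rubric JSON.

```typescript
// Hypothetical record for one metric score; field names are assumptions.
interface MetricScore {
  metric: string;
  score: number; // 0-100
  citedLines: string[]; // lyric lines offered as evidence
  rationale: string; // why those lines justify the number
}

const DEFAULT_SCORE = 50;
const CITATION_THRESHOLD = 80;

function findEvidenceViolations(entry: MetricScore): string[] {
  const violations: string[] = [];
  const hasLines = entry.citedLines.some((line) => line.trim().length > 0);

  // Rule 1: every point above the default must be backed by evidence from the lyric.
  if (entry.score > DEFAULT_SCORE && !hasLines) {
    violations.push(`${entry.metric}: points above 50 need evidence from the lyric`);
  }
  // Rule 2: above 80, the scorer must also explain why the cited lines justify the number.
  if (entry.score > CITATION_THRESHOLD && entry.rationale.trim().length === 0) {
    violations.push(`${entry.metric}: scores above 80 need an explanation of the cited lines`);
  }
  return violations;
}
```

A check like this only verifies that evidence is present. Whether the cited lines actually justify the number is still a judgment the antagonist rule exists to challenge.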
Three Tiers
Craft (25%)
Can this person write? Mechanics, structure, rhyme, and word choice.
Expression (40%)
Does it say something worth hearing? Specificity, originality, truth, and voice.
Impact (35%)
Will anyone remember it tomorrow? Transcendence, arc, stickiness, and genre fit.
The 12 Metrics
Prosody & Musicality
Meter, stress patterns, consonant and vowel clusters, intentional silence, and breath points. Does the lyric feel good in the mouth?
What good looks like: Natural rhythmic flow that a singer can inhabit without fighting the phrasing. Stressed syllables land on strong beats.
Structural Architecture
Song shape, arc, verse progression, chorus return, and bridge revelation. Does the structure serve the story?
What good looks like: Each section has a clear job. Verses build, choruses resolve, the bridge shifts perspective. Nothing feels arbitrary.
Rhyme Intelligence
Rhyme as craft servant: internal rhyme, slant rhyme, strategic non-rhyme. Does the rhyme scheme feel intentional rather than forced?
What good looks like: Rhymes land with purpose. A mix of perfect, slant, and internal rhyme that never bends meaning to satisfy a sound.
Economy of Language
Every word earning its place. No filler, no padding, no lines that exist only to set up a rhyme.
What good looks like: You cannot remove a word without losing something. Every syllable carries weight or music.
Lyrical Specificity
Concrete imagery, sensory detail, proper nouns, time anchors. The opposite of abstract generalities.
What good looks like: The song lives in a real place with real objects. "Tangerines and someone else's smile" instead of "memories of you."
Imagery Originality
Fresh metaphors, defamiliarized objects, governing images that haven't been written to death.
What good looks like: Images that surprise on first read and deepen on second. No shattered hearts, no oceans of tears, no wings of freedom.
Emotional Truth
The ring-test: does it feel true? Earned emotion, unforced vulnerability, no borrowed sentiment.
What good looks like: The emotion arrives through specificity and honesty, not through telling the listener what to feel.
Voice & POV Integrity
Does the narrator sound like a real person, with stance, diction, and reference frame consistent with the song's intent? Includes deliberate POV switches when the song earns them. v1.2.0: refactored from "one coherent narrator" to "INTENTIONAL POV" — collaborative-writing forms (K-pop multi-voice choruses, hip-hop featured verses, gospel call-and-response) score the same as single-narrator stability when the POV switches are deliberate. The metric still penalizes ACCIDENTAL drift; the change is that intentional multi-voice stops being a false positive.
What good looks like: A distinct human presence (or presences) the listener can locate. When POV stays single: word choices, diction, and references belong to one coherent narrator. When POV switches: each voice is internally consistent, the switches mark structural moments (verse → chorus, lead → feature, cantor → response), and the listener never wonders "who is talking now?" by accident.
The Transcendent Line
The unrepeatable line. Not necessarily the cleverest; the truest. The line someone would quote.
What good looks like: At least one line that stops a listener cold. The kind of line people screenshot and share.
Emotional Arc
Does the song move from state A to state B? Revelation, release, recalibration. Not just emotion, but emotional motion.
What good looks like: The listener ends the song in a different place than they started. Something shifted.
Memorability
Will this lyric persist? v1.2.0: refactored away from the single "60-minute test" (which privileged literate quotability over structural / cumulative / oral-tradition durability). Now reads across four signals: (1) hook integration — does the recurring phrase land harder each return? (2) phonemic distinctiveness — does the most-repeated line have a sonic shape that resists merging with the medium? (3) chorus-line repetition strategy — is the title earning its repeats? (4) one-listen recall — could a listener quote a line after one pass? No single signal carries the metric; cumulative durability is the goal.
What good looks like: A chorus that means something different by its third return than its first. A title-line whose phonemic shape distinguishes it from its neighbors. A hook that the listener hums involuntarily OR an incantatory refrain that compounds devotional / communal weight (call-and-response, ghazal radif, qawwali ostinato). The metric reads at S-band when the song's memorability mechanism is integral to its form, not bolted on.
Genre Authenticity
Does this honor its genre while extending it? Genre fluency without genre cliche.
What good looks like: A country song that sounds like country but doesn't sound like every country song. Respect and surprise.
Sub-criteria
Named sub-concepts inside the 12 metrics
Each of the 12 metrics evaluates multiple signals. Sub-criteria name the discrete signals the eval engine treats as load-bearing, making implicit rubric judgments legible. Cite a sub-criterion by metric + sub-criterion name (e.g. Economy.Restraint).
Restraint
Inside Economy →
Stress earns its position. Emphasis is placed where the meaning lands, not scattered across every line. Filler words ("just," "really," "kinda," "like") and reflexive intensifiers ("so very," "totally completely") drop a line out of the Economy band even when its other craft signals are intact. Profanity and explicit content, when present, follow the same rule: Eminem's "Stan" has profanity, but it's positioned; lazy mixtape filler has the same density spread randomly.
Signals: Per-line filler-word density; ratio of placed-vs-scattered emphasis (line endings + internal stress positions); repetition-with-meaning vs repetition-as-padding (cf. Memorability metric, which rewards repetition that earns its return); intensifier stacking. The signal applies to clean, mature, and explicit registers equally; lazy clean and lazy explicit fail the same way.
Failure looks like: A 4-line stanza where every line ends on "yeah," "though," or "y'know"; a chorus that swears every other word without any of them landing harder than the surrounding text; a verse where "just" or "really" appears in three of four lines as syllable padding.
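The per-line filler-word density signal named above can be approximated with a few lines of code. This is a rough sketch, not the eval engine's implementation: the function names are hypothetical, the word list is only the four examples quoted in the text, and counting every "like" as filler is a deliberate simplification (a real check would need to distinguish "like" as a verb or comparison).

```typescript
// Filler words quoted in the Restraint description. Assumption: exact-match only,
// and every occurrence counts, including legitimate uses of "like".
const FILLER_WORDS = new Set(["just", "really", "kinda", "like"]);

// Share of a line's words that are filler (0 for an empty line).
function fillerDensity(line: string): number {
  const words = line.toLowerCase().match(/[a-z']+/g) ?? [];
  if (words.length === 0) return 0;
  const fillers = words.filter((w) => FILLER_WORDS.has(w)).length;
  return fillers / words.length;
}

// Number of lines in a stanza that contain at least one filler word.
function linesWithFiller(stanza: string[]): number {
  return stanza.filter((line) => fillerDensity(line) > 0).length;
}

// Illustrative verse written for this example: filler in three of four lines,
// the failure pattern described above.
const paddedVerse = [
  "I just want to go home",
  "It really hurts to be alone",
  "I just can't pick up the phone",
  "The porch light burned out in June",
];
const paddedLineCount = linesWithFiller(paddedVerse);
```

Density alone cannot tell placed emphasis from padding, which is why the rubric pairs it with the placed-vs-scattered ratio; this sketch covers only the counting half.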
External-Internal Balance
Inside Specificity →
Lyrics ground emotion in observable detail (Stolpe: external) before naming the feeling (internal). A song that NAMES grief without showing it slips into thesis-shaped writing; a song that observes detail without ever naming what it means stays inert. The right ratio shifts by section: verses skew external (set the scene, ground the listener), choruses can absorb more internal weight (the named feeling carries the hook). Both extremes lose points; the middle path is craft.
Signals: Per-line external/internal/neutral classification (lyrics-detail-types.ts, B927); per-section ratio against Stolpe-derived targets (verse > 60% external; chorus 30-60% internal); fluctuation detection (a chorus that swings to 90% external is also out of band). Surfaces on dashboard FidelityPanel + scoring eval.
Failure looks like: A verse that names six feelings without observing one scene ("I'm so lonely, I'm so tired, I'm so lost…"); or a chorus that lists three weather observations and never lands the emotional weight the verse was building toward.
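The per-section targets quoted above (verse more than 60% external, chorus 30-60% internal) can be checked once each line has been classified. The sketch below assumes the classification already exists and only applies the thresholds. It is not the code in lyrics-detail-types.ts; the names are hypothetical, and computing shares over all lines (neutral included) is an assumption.

```typescript
type DetailType = "external" | "internal" | "neutral";
type SectionKind = "verse" | "chorus";

interface SectionBalance {
  externalShare: number;
  internalShare: number;
  inBand: boolean;
}

function checkSectionBalance(kind: SectionKind, lines: DetailType[]): SectionBalance {
  if (lines.length === 0) {
    throw new Error("section has no classified lines");
  }
  // Assumption: shares are taken over all lines, neutral lines included.
  const share = (t: DetailType): number =>
    lines.filter((l) => l === t).length / lines.length;
  const externalShare = share("external");
  const internalShare = share("internal");

  // Targets quoted in the text: verse > 60% external; chorus 30-60% internal.
  const inBand =
    kind === "verse"
      ? externalShare > 0.6
      : internalShare >= 0.3 && internalShare <= 0.6;

  return { externalShare, internalShare, inBand };
}

// A verse that names four feelings and observes nothing falls out of band.
const thesisVerse = checkSectionBalance("verse", [
  "internal", "internal", "internal", "internal",
]);
```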
POV Consistency
Inside Voice →
A narrator stays internally coherent across sections. The speaker who said "I take the long way home" in V1 cannot be the speaker who says "we built this city together" in V3 unless the song earned the shift. POV consistency tracks gender / age / relational tuple / diction register; intentional POV switches (gospel call-and-response, K-pop multi-voice, hip-hop features) count as consistent when each voice is internally stable and switches mark structural moments. v1.2.0 of the rubric formalized "intentional POV" as the metric's lens; this sub-criterion names the underlying signal.
Signals: narrator-profile.ts (B951) extracts gender / age / relational signals per section; cross-section contradictions flagged ("my wife" + "my husband", "I'm 16" + "I'm 40"). Diction-shift detection catches register drift (a mechanic doesn't suddenly quote Rilke). Intentional POV switches are recognized via section-marker patterns + collaborative-form signals.
Failure looks like: A speaker who's 24 in V1 ("I'm too young to know what I want") and 50 in the bridge ("I've lived enough lives to know"); or a narrator who speaks in workplace argot for two verses and suddenly invokes Stoic philosophy in the chorus without earning the shift; or a song that switches from first-person singular to first-person plural in the bridge for no structural reason.
At v1.2.0 the sub-criteria exist in the TypeScript source but not yet in the JSON spec; they will be canonized in the next coordinated release.
Composite Formula
composite = round(craftAvg * 0.25 + expressionAvg * 0.40 + impactAvg * 0.35)
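As a worked illustration of the formula, the sketch below averages four metric scores per tier and applies the 25/40/35 weights. It is a standalone example, not the package's computeComposite() helper, whose signature is not shown here. The grouping of four metrics per tier is inferred from the tier descriptions, and the input shape and function names are assumptions.

```typescript
type Tier = "craft" | "expression" | "impact";

// Tier weights as published: Craft 25%, Expression 40%, Impact 35%.
const TIER_WEIGHTS: Record<Tier, number> = {
  craft: 0.25,
  expression: 0.4,
  impact: 0.35,
};

// Hypothetical input shape: the metric scores (0-100) belonging to each tier.
type TierScores = Record<Tier, number[]>;

function average(values: number[]): number {
  if (values.length === 0) throw new Error("tier has no metric scores");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function compositeSketch(scores: TierScores): number {
  const weighted =
    average(scores.craft) * TIER_WEIGHTS.craft +
    average(scores.expression) * TIER_WEIGHTS.expression +
    average(scores.impact) * TIER_WEIGHTS.impact;
  return Math.round(weighted);
}

// Craft averages 80, Expression 60, Impact 72:
// 80 * 0.25 + 60 * 0.40 + 72 * 0.35 = 20 + 24 + 25.2 = 69.2, which rounds to 69.
const example = compositeSketch({
  craft: [80, 78, 82, 80],
  expression: [60, 62, 58, 60],
  impact: [72, 70, 74, 72],
});
```

Because Expression carries the largest weight, strong mechanics alone cannot lift a lyric far: here a Craft average of 80 still yields a composite of 69.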
Grade Scale
How to cite
When you implement the rubric, score against it, or reference it in a paper or blog post, attribute:
Lyric Scoring Standard v1.2.0
SongForgeAI (2026)
https://songforgeai.com/scoring/standard
Licensed under CC BY 4.0
Changelog
MINOR (v1.2.0): M8 (Voice & POV Integrity) refactored from "one coherent narrator" to "INTENTIONAL POV": deliberate switches (K-pop multi-voice, hip-hop features, gospel call-and-response) no longer score as drift failures. M11 (Memorability) refactored from the single 60-minute test to a 4-signal cumulative read (hook integration + phonemic distinctiveness + chorus repetition strategy + one-listen recall) so cumulative / oral-tradition / ritual-repetition forms aren't false-positive low-scored. Per Super Deep Audit §5 cuts #1 + #2. Score deltas on the golden-eval set: <3 points on average; the refactor is descriptive (clarifying when the metric applies), not prescriptive (changing what the metric values). Within MINOR threshold per RFC-0001.
MINOR (B1240): first MINOR bump shipped through the published cadence. Anti-Inflation rules expanded from 4 to 5 with the addition of the Anti-Platitude rule (lines that resolve with generic emotional summaries hit the lowest Specificity + Voice band regardless of surface polish). Antagonist Ceiling clarified to require evidence; Historical Context anchored to the published corpus. Score deltas on the golden-eval set: <2 points on average (within MINOR threshold). Migration: existing scored content is auto-rescored on next eval; the seal field's rubricVersion now reads '1.1.0'. RFC-0002 (anti-platitude formalization) drafted as the in-comment artifact for this bump.
PATCH (B1211): docs only. Reproducibility seal landed in /api/v1/score (B1199); model card published at /scoring/standard/model-card (B1197). No score deltas. Versioning policy formalized in RFC-0001 (in-comment until 2026-05-02): MAJOR for >5pt golden-eval delta, MINOR for clarifications, PATCH for docs/typos.
Initial public release. 12 metrics finalized, anti-inflation rules documented, grade scale locked.
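The versioning policy cited in the changelog (MAJOR for a golden-eval delta above 5 points, MINOR for clarifications, PATCH for docs and typos) can be written as a small decision function. This is an illustrative reading of that one sentence, not RFC-0001 itself; the input shape and names are hypothetical, and treating the delta as an average absolute change in points is an assumption.

```typescript
type Bump = "MAJOR" | "MINOR" | "PATCH";

// Hypothetical description of a proposed rubric change.
interface ProposedChange {
  goldenEvalDelta: number; // assumed: average composite change on the golden-eval set, in points
  changesRubricText: boolean; // true for clarifications to rules or metrics; false for docs/typos only
}

function classifyBump(change: ProposedChange): Bump {
  // MAJOR: golden-eval delta above 5 points, whatever else changed.
  if (Math.abs(change.goldenEvalDelta) > 5) return "MAJOR";
  // MINOR: clarifications that stay within the 5-point threshold.
  if (change.changesRubricText) return "MINOR";
  // PATCH: docs and typos only.
  return "PATCH";
}

// A clarification that moves the golden-eval set by under 3 points stays MINOR.
const bump = classifyBump({ goldenEvalDelta: 2.8, changesRubricText: true });
```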
21 hand-scored exemplars across the score spectrum.
Implementing the rubric? Calibrate against the corpus. Floor anchor at score 18 (F-band), ceiling at 95 (S-band). Any independent implementation that drifts more than 5 points on either has miscalibrated the Anti-Inflation rules.
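The drift rule above translates directly into a check an independent implementation could run against the two anchors. The anchor values (18 and 95) and the 5-point tolerance come from the text; the function and field names are hypothetical.

```typescript
const FLOOR_ANCHOR = 18; // F-band floor anchor
const CEILING_ANCHOR = 95; // S-band ceiling anchor
const MAX_DRIFT = 5; // drifting MORE than 5 points on either anchor signals miscalibration

interface CalibrationReport {
  floorDrift: number; // signed: implementation score minus anchor
  ceilingDrift: number;
  calibrated: boolean;
}

// floorScore / ceilingScore: the composites your implementation assigns
// to the corpus floor and ceiling exemplars.
function checkCalibration(floorScore: number, ceilingScore: number): CalibrationReport {
  const floorDrift = floorScore - FLOOR_ANCHOR;
  const ceilingDrift = ceilingScore - CEILING_ANCHOR;
  const calibrated =
    Math.abs(floorDrift) <= MAX_DRIFT && Math.abs(ceilingDrift) <= MAX_DRIFT;
  return { floorDrift, ceilingDrift, calibrated };
}

// An implementation that scores the floor exemplar at 31 has inflated the low end by 13 points.
const inflated = checkCalibration(31, 95);
```

Keeping the drift signed shows the direction of the error: a positive floor drift is the inflation the Anti-Inflation rules target.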
Read the reference corpus
The methodology behind the “X% human-authored” chip.
Every song carries a per-line provenance ledger recording who wrote each line. The aggregate %human surfaced on the seal is computed deterministically from that ledger — not estimated, not vibes. Required reading for anyone evaluating the authorship claim under US Copyright Office 2025 mixed-authorship guidance.
Read the Human Contribution Log methodology
See the runtime mechanics behind every score.
The companion model card documents which Anthropic model produces the score, what temperature it runs at, where the prompt lives, and the known limitations. Required reading if you’re implementing against the standard.
Read the model card
Verify any score signed by the standard.
Every score response carries an ed25519-signed envelope binding the rubric version, model id, temperature, and deploy SHA together with the score itself. Changing any field after the fact invalidates the signature. The public key is at /.well-known/songforgeai-pubkey.json; the full field schema, verification flow, and guarantees are documented at the seal spec page.
30 humans will score the corpus against the rubric.
The honest credibility test for any LLM-driven scoring system: do human graders applying the same rubric agree? This page pre-registers the cohort methodology — ICC(2,1) per Shrout & Fleiss (1979), Cicchetti (1994) banding, immutability constraints, target ICC 0.60+ on composite. Results auto-render from the public cohort_runs table when the first cohort publishes.
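For readers who want to see what the pre-registered statistic computes, here is a sketch of ICC(2,1) (two-way random effects, absolute agreement, single rater) built from the standard ANOVA mean squares, plus the Cicchetti (1994) bands. It is a generic implementation of the textbook formula, not the cohort pipeline's code; the function names and the songs-by-raters matrix layout are assumptions.

```typescript
// ratings[i][j] = score given to target (song) i by rater j.
function icc21(ratings: number[][]): number {
  const n = ratings.length; // targets
  const k = ratings[0]?.length ?? 0; // raters
  if (n < 2 || k < 2 || ratings.some((row) => row.length !== k)) {
    throw new Error("need a complete n x k matrix with n >= 2 and k >= 2");
  }
  const grand = ratings.flat().reduce((s, x) => s + x, 0) / (n * k);
  const rowMeans = ratings.map((row) => row.reduce((s, x) => s + x, 0) / k);
  const colMeans = Array.from({ length: k }, (_, j) =>
    ratings.reduce((s, row) => s + row[j], 0) / n,
  );

  // Two-way ANOVA sums of squares: targets (rows), raters (columns), residual.
  const ssRows = k * rowMeans.reduce((s, m) => s + (m - grand) ** 2, 0);
  const ssCols = n * colMeans.reduce((s, m) => s + (m - grand) ** 2, 0);
  const ssTotal = ratings.flat().reduce((s, x) => s + (x - grand) ** 2, 0);
  const ssError = ssTotal - ssRows - ssCols;

  const msr = ssRows / (n - 1);
  const msc = ssCols / (k - 1);
  const mse = ssError / ((n - 1) * (k - 1));

  // ICC(2,1) = (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
  return (msr - mse) / (msr + (k - 1) * mse + (k * (msc - mse)) / n);
}

// Cicchetti (1994) bands: < 0.40 poor, 0.40-0.59 fair, 0.60-0.74 good, >= 0.75 excellent.
function cicchettiBand(icc: number): "poor" | "fair" | "good" | "excellent" {
  if (icc < 0.4) return "poor";
  if (icc < 0.6) return "fair";
  if (icc < 0.75) return "good";
  return "excellent";
}

// Widely reproduced 6-target, 4-rater worked example; ICC(2,1) comes out near 0.29.
const exampleRatings = [
  [9, 2, 5, 8],
  [6, 1, 3, 2],
  [8, 4, 6, 8],
  [7, 1, 2, 6],
  [10, 5, 6, 9],
  [6, 2, 4, 7],
];
const exampleIcc = icc21(exampleRatings);
```

Because ICC(2,1) measures absolute agreement, a rater who is consistently harsher than the others lowers the coefficient even when the rank order matches. Under the Cicchetti bands, the stated target of 0.60 on composite sits at the bottom of "good".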
See the rubric applied to the real corpus.
Every song forged on SongForgeAI logs into forge_metrics. The aggregate report — composite-score distribution, Berklee batch lift, per-genre slices — updates hourly. Independently verifiable.
See the rubric in action.
Reading the spec is one thing. The 8-voice Crucible critique applies the rubric to your lyrics in ten seconds, free, no login. That's the fastest path from spec to felt experience.
Adjacent open artifacts
Full specification + scoring methodology.
The 362 AI signatures we filter from every lyric. Forkable, citable, audit-able.
How we use named artist references in prompts without imitating them or naming them in delivered output. Five-layer methodology.
Public registry of external implementations + citations of the standard.
The four cadence rituals (Quality Council, Trust Decay, Bet Reviews, External Audit) — append-only, public, never deleted.
Annual report on AI-assisted songwriting craft, drawn from the corpus this rubric scores.
How per-line authorship is tracked + receipt schema.