Open standard · v1.2.0 · 2026-05-02

Lyric Scoring Standard

Anchored at composite 95 against canonical professional craft — a country song from 1949 that survived seventy-five years. Read the calibration →

An evidence-based 12-metric rubric for scoring song lyrics. Weighted across three tiers (Craft 25%, Expression 40%, Impact 35%). Anti-inflation rules are load-bearing — a 50 is average; a 90+ is rare.

Reading the rubric's evolution? See the unified version + RFC changelog or the structural diff between every published version.

Looking for the orthogonal axis? Read the Fidelity Standard v0.4.0 — the seven-component composite measuring whether the lyric served the brief. (RFC-0010 — comment window closed 2026-05-26.)

Want the architectural commitment behind the rubric? Read Topic Sovereignty — the principle that says your idea is sovereign at the SUBJECT layer; the system varies the TREATMENT.

Jump to section

Grade Scale & Percentiles
Three Tiers
The 12 Metrics
Sub-criteria
Composite Formula
FAQ
How to cite
Changelog

Open license — CC BY 4.0

Adopt freely, cite as 'Lyric Scoring Standard v1.2.0 (SongForgeAI).' No charge, no permission needed.

License terms →

Open data — scoring-standard.json

Versioned, machine-readable specification. Stable shape; bumps documented in the version table below.

Download JSON →

Version 1.2.0 · 2026-05-02

Publication date. Major changes trigger a version bump plus an interim Songwriter Index edition documenting the diff.

Grade Scale & Percentiles

What a composite score means, in two tables: the letter-grade bands and the percentile each score threshold represents across the scored corpus. Both read directly from the versioned rubric JSON.

Grade scale

Grade · Score · Meaning
S+ · 96–100 · Immortal. Canon-defining.
S · 91–95 · Canonical. Future songs learn from this.
A+ · 86–90 · Exceptional. Remarkable craft and depth.
A · 80–85 · Excellent. Genuinely strong, worthy of replay.
B+ · 73–79 · Strong. Accomplished with clear strengths.
B · 65–72 · Good. Solid professional output.
C+ · 55–64 · Average-plus. Competent but unremarkable.
C · 45–54 · Average. Functional, nothing to remember.
D+ · 35–44 · Below average. Identifiable flaws.
D · 25–34 · Weak. Significant problems.
F · 0–24 · Poor. Fundamentally broken.

Percentile anchors

Composite · Percentile band
95+ · Top 1%
90+ · Top 3%
86+ · Top 8%
82+ · Top 12%
78+ · Top 18%
73+ · Top 25%
65+ · Top 35%
56+ · Above average

A composite at or above the threshold sits in the labeled percentile band. Anti-inflation rules keep the top bands rare — a 50 is average; a 90+ is top 3%.
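The two tables above reduce to a threshold lookup. The sketch below is illustrative only: `gradeFor` and `percentileLabelFor` are hypothetical local names, not the package's `scoreToGrade()` or `scoreToPercentileLabel()` helpers, whose signatures are not shown on this page. It assumes composites in the 0–100 range and returns `null` for scores below the lowest percentile anchor.

```typescript
// Lower bound of each band, highest first (copied from the grade-scale table).
const GRADE_BANDS: ReadonlyArray<readonly [number, string]> = [
  [96, "S+"], [91, "S"], [86, "A+"], [80, "A"], [73, "B+"], [65, "B"],
  [55, "C+"], [45, "C"], [35, "D+"], [25, "D"], [0, "F"],
];

// Thresholds copied from the percentile-anchor table, highest first.
const PERCENTILE_ANCHORS: ReadonlyArray<readonly [number, string]> = [
  [95, "Top 1%"], [90, "Top 3%"], [86, "Top 8%"], [82, "Top 12%"],
  [78, "Top 18%"], [73, "Top 25%"], [65, "Top 35%"], [56, "Above average"],
];

// Return the label of the first band whose threshold the score meets.
function lookup(
  bands: ReadonlyArray<readonly [number, string]>,
  score: number,
): string | null {
  for (const [threshold, label] of bands) {
    if (score >= threshold) return label;
  }
  return null;
}

// Hypothetical helper: letter grade for a composite score.
function gradeFor(composite: number): string {
  if (composite < 0 || composite > 100) {
    throw new RangeError("composite must be between 0 and 100");
  }
  return lookup(GRADE_BANDS, composite) ?? "F";
}

// Hypothetical helper: percentile label, or null below the lowest anchor.
function percentileLabelFor(composite: number): string | null {
  return lookup(PERCENTILE_ANCHORS, composite);
}
```

Keeping both tables as ordered data means a future version bump only changes the arrays, not the lookup logic.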

Install on npm

Adopt the standard in one command

Pure data + helpers. Zero runtime dependencies. MIT for the helper code, CC BY 4.0 for the rubric JSON itself. Cite the standard by name + version when scoring against it.

View on npm

$ npm install @songforgeai/scoring-rubric

Helpers exported: scoreToGrade(), scoreToPercentileLabel(), computeComposite(), isCompatibleRubricVersion(). Full rubric available as @songforgeai/scoring-rubric/rubric.json.

Published artifacts — four packages on npm

The standard and its companion infrastructure ship as four published packages under the @songforgeai scope. Every install is a citation; every download is on the public counter. The publish discipline is now CI-enforced — any built-but-not-published public artifact fails the check:built-not-published ratchet at the commit gate.

@songforgeai/scoring-rubric
The 12-metric Lyric Scoring Standard. Pure data + helpers (scoreToGrade, scoreToPercentileLabel, computeComposite, isCompatibleRubricVersion). This standard, installable.
@songforgeai/fidelity-standard
The seven-component Fidelity Standard — orthogonal to quality. Measures whether the lyric honored the brief, not whether the lyric is good. Specified in RFC-0010 (comment window closed 2026-05-26).
@songforgeai/agent-room
The multi-agent “writing room” consensus pattern extracted from the scoring pipeline. N panelists score in parallel; median + standard deviation surface the divergence. Reusable beyond lyric evaluation.
@songforgeai/client
The TypeScript SDK for the public scoring API. Includes ed25519 seal verification via verifySeal() so consumers can prove any score response was actually produced by the published rubric + named model + temperature.
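The consensus step described for @songforgeai/agent-room (median plus standard deviation across N panelist scores) can be sketched as follows. This is not the package's API: `consensus` and the `Consensus` shape are hypothetical names, only the aggregation step is shown (not the parallel scoring), and the use of the population standard deviation is an assumption.

```typescript
interface Consensus {
  median: number; // the agreed score
  stdDev: number; // how far the panelists diverged
}

// Aggregate panelist scores into a median and a population standard deviation.
function consensus(scores: readonly number[]): Consensus {
  if (scores.length === 0) {
    throw new Error("need at least one panelist score");
  }
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

  const mean = scores.reduce((sum, x) => sum + x, 0) / scores.length;
  const variance =
    scores.reduce((sum, x) => sum + (x - mean) ** 2, 0) / scores.length;

  return { median, stdDev: Math.sqrt(variance) };
}
```

The median resists a single outlier panelist, while a large `stdDev` flags a score the panel did not agree on.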

Cite this standard

The Lyric Scoring Standard is published under CC BY 4.0. Attribution is required when you cite it in a paper, blog post, or third-party tool. The BibTeX entry below suits academic use; a plain-text attribution appears under How to cite.

@misc{songforgeai_lyric_scoring_2026,
  author       = {{SongForgeAI}},
  title        = {Lyric Scoring Standard, version 1.2.0},
  year         = {2026},
  howpublished = {\url{https://songforgeai.com/scoring/standard}},
  note         = {Licensed CC BY 4.0}
}

Cited the standard in your work? Add yourself to /cited-by — the public registry of external implementations.

Machine-readable surfaces

Rubric JSON: /scoring-standard.json — canonical machine-readable rubric.
JSON Schema: /scoring-standard.schema.json — validate any adopted copy of the rubric (W3C-style publication; conforms to JSON Schema draft 2020-12).
Schema.org Dataset: ingested by Google Dataset Search + academic aggregators via the in-page <script type="application/ld+json"> block.

Anti-Inflation Philosophy

Five rules baked into the rubric so a 65 from us means what a 65 from a human craft critic would mean, not the 80 that most LLM-judged scoring inflates to.

Gravity Rule

The default is 50. Every point above 50 must be earned with specific evidence from the lyric itself.

Burden of Proof

Scores above 80 require the scorer to cite specific lines and explain why they justify the number.

Antagonist Ceiling

A dedicated critical voice challenges every score. If it finds a real weakness, the score drops. v1.1.0 clarification: the antagonist must produce evidence (specific lines, specific failure modes); vague disagreement does not lower the score.

Historical Context

Scores anchor to professional craft standards, not to other AI output. A 90+ means near-flawless execution across all 12 metrics. v1.1.0 clarification: 'professional craft' = the corpus published at /scoring/corpus, especially the Hank Williams S-band anchor at composite 95.

Anti-Platitude (added in v1.1.0)

A line that resolves with a generic emotional summary ('all I need is love', 'this is my truth', 'love wins') hits the rubric's lowest Specificity + Voice band regardless of surface polish. Documented inline so implementers can cite a published rule rather than discover this empirically.
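An implementer could enforce the Gravity Rule and Burden of Proof mechanically before accepting a metric score. The sketch below is one possible reading, not the published engine: the `MetricScore` shape and `checkEvidence` are hypothetical names, and treating "evidence" as a non-empty list of cited lines is an assumption.

```typescript
interface MetricScore {
  metric: string;
  score: number;        // 0–100
  citedLines: string[]; // lines quoted from the lyric as evidence
  rationale: string;    // why the cited lines justify the number
}

const GRAVITY_DEFAULT = 50;   // Gravity Rule: the default score
const BURDEN_THRESHOLD = 80;  // Burden of Proof: applies above this score

// Return a list of rule violations; an empty list means the score is admissible.
function checkEvidence(s: MetricScore): string[] {
  const problems: string[] = [];
  if (s.score > GRAVITY_DEFAULT && s.citedLines.length === 0) {
    problems.push(`${s.metric}: points above ${GRAVITY_DEFAULT} need cited lines`);
  }
  if (s.score > BURDEN_THRESHOLD && s.rationale.trim() === "") {
    problems.push(`${s.metric}: scores above ${BURDEN_THRESHOLD} need a rationale`);
  }
  return problems;
}
```

A scorer that rejects any response with a non-empty problem list cannot drift upward without leaving a paper trail.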

Three Tiers

Craft (25%)

Can this person write? Mechanics, structure, rhyme, and word choice.

Expression (40%)

Does it say something worth hearing? Specificity, originality, truth, and voice.

Impact (35%)

Will anyone remember it tomorrow? Transcendence, arc, stickiness, and genre fit.

The 12 Metrics

Prosody & Musicality

Meter, stress patterns, consonant and vowel clusters, intentional silence, and breath points. Does the lyric feel good in the mouth?

What good looks like: Natural rhythmic flow that a singer can inhabit without fighting the phrasing. Stressed syllables land on strong beats.

Structural Architecture

Song shape, arc, verse progression, chorus return, and bridge revelation. Does the structure serve the story?

What good looks like: Each section has a clear job. Verses build, choruses resolve, the bridge shifts perspective. Nothing feels arbitrary.

Rhyme Intelligence

Rhyme as craft servant: internal rhyme, slant rhyme, strategic non-rhyme. Does the rhyme scheme feel intentional rather than forced?

What good looks like: Rhymes land with purpose. A mix of perfect, slant, and internal rhyme that never bends meaning to satisfy a sound.

Economy of Language

Every word earning its place. No filler, no padding, no lines that exist only to set up a rhyme.

What good looks like: You cannot remove a word without losing something. Every syllable carries weight or music.

Lyrical Specificity

Concrete imagery, sensory detail, proper nouns, time anchors. The opposite of abstract generalities.

What good looks like: The song lives in a real place with real objects. "Tangerines and someone else's smile" instead of "memories of you."

Imagery Originality

Fresh metaphors, defamiliarized objects, governing images that haven't been written to death.

What good looks like: Images that surprise on first read and deepen on second. No shattered hearts, no oceans of tears, no wings of freedom.

Emotional Truth

The ring-test: does it feel true? Earned emotion, unforced vulnerability, no borrowed sentiment.

What good looks like: The emotion arrives through specificity and honesty, not through telling the listener what to feel.

Voice & POV Integrity

Does the narrator sound like a real person, with stance, diction, and reference frame consistent with the song's intent? Includes deliberate POV switches when the song earns them. v1.2.0: refactored from "one coherent narrator" to "INTENTIONAL POV" — collaborative-writing forms (K-pop multi-voice choruses, hip-hop featured verses, gospel call-and-response) score the same as single-narrator stability when the POV switches are deliberate. The metric still penalizes ACCIDENTAL drift; the change is that intentional multi-voice stops being a false positive.

What good looks like: A distinct human presence (or presences) the listener can locate. When POV stays single: word choices, diction, and references belong to one coherent narrator. When POV switches: each voice is internally consistent, the switches mark structural moments (verse → chorus, lead → feature, cantor → response), and the listener never wonders "who is talking now?" by accident.

The Transcendent Line

The unrepeatable line. Not necessarily the cleverest; the truest. The line someone would quote.

What good looks like: At least one line that stops a listener cold. The kind of line people screenshot and share.

Emotional Arc

Does the song move from state A to state B? Revelation, release, recalibration. Not just emotion, but emotional motion.

What good looks like: The listener ends the song in a different place than they started. Something shifted.

Memorability

Will this lyric persist? v1.2.0: refactored away from the single "60-minute test" (which privileged literate quotability over structural / cumulative / oral-tradition durability). Now reads across four signals: (1) hook integration — does the recurring phrase land harder each return? (2) phonemic distinctiveness — does the most-repeated line have a sonic shape that resists merging with the medium? (3) chorus-line repetition strategy — is the title earning its repeats? (4) one-listen recall — could a listener quote a line after one pass? No single signal carries the metric; cumulative durability is the goal.

What good looks like: A chorus that means something different by its third return than its first. A title-line whose phonemic shape distinguishes it from its neighbors. A hook that the listener hums involuntarily OR an incantatory refrain that compounds devotional / communal weight (call-and-response, ghazal radif, qawwali ostinato). The metric reads at S-band when the song's memorability mechanism is integral to its form, not bolted on.

Genre Authenticity

Does this honor its genre while extending it? Genre fluency without genre cliche.

What good looks like: A country song that sounds like country but doesn't sound like every country song. Respect and surprise.

Sub-criteria

Named sub-concepts inside the 12 metrics

Each of the 12 metrics evaluates multiple signals. Sub-criteria name the discrete signals the eval engine treats as load-bearing — making implicit rubric judgments legible. Cite a sub-criterion by metric + sub-criterion name (e.g. Economy.Restraint).

Restraint

Inside Economy →

Stress earns its position. Emphasis is placed where the meaning lands, not scattered across every line. Filler words ("just," "really," "kinda," "like") and reflexive intensifiers ("so very," "totally completely") drop a line out of the Economy band even when its other craft signals are intact. Profanity and explicit content, when present, follow the same rule — Eminem's "Stan" has profanity, but it's positioned; lazy mixtape filler has the same density spread randomly.

Signals: Per-line filler-word density; ratio of placed-vs-scattered emphasis (line endings + internal stress positions); repetition-with-meaning vs repetition-as-padding (cf. Memorability metric, which rewards repetition that earns its return); intensifier stacking. The signal applies to clean, mature, and explicit registers equally: lazy clean and lazy explicit fail the same way.

Failure looks like: A 4-line stanza where every line ends on "yeah," "though," or "y'know"; a chorus that swears every other word without any of them landing harder than the surrounding text; a verse where "just" or "really" appears in three of four lines as syllable padding.
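The per-line filler-word density signal can be approximated with a word count. This is a minimal sketch, not the eval engine's implementation: `fillerDensity` and `linesWithFiller` are hypothetical names, the word list is just the four fillers named above, and the matcher is deliberately naive (it counts "like" even when it is a verb or a simile, and only tokenizes ASCII letters).

```typescript
// The four filler words named in the Restraint sub-criterion.
const FILLER_WORDS = new Set(["just", "really", "kinda", "like"]);

// Fraction of words in a line that are filler words (0 for an empty line).
function fillerDensity(line: string): number {
  const words = line.toLowerCase().match(/[a-z']+/g) ?? [];
  if (words.length === 0) return 0;
  const fillers = words.filter((w) => FILLER_WORDS.has(w)).length;
  return fillers / words.length;
}

// Number of lines in a stanza that contain at least one filler word.
function linesWithFiller(stanza: readonly string[]): number {
  return stanza.filter((line) => fillerDensity(line) > 0).length;
}
```

Under this sketch, the "three of four lines" failure case corresponds to `linesWithFiller(stanza) >= 3` on a four-line verse.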

External-Internal Balance

Inside Specificity →

Lyrics ground emotion in observable detail (Stolpe: external) before naming the feeling (internal). A song that NAMES grief without showing it slips into thesis-shaped writing; a song that observes detail without ever naming what it means stays inert. The right ratio shifts by section — verses skew external (set the scene, ground the listener), choruses can absorb more internal weight (the named feeling carries the hook). Both extremes lose points; the middle path is craft.

Signals: Per-line external/internal/neutral classification (lyrics-detail-types.ts, B927); per-section ratio against Stolpe-derived targets (verse > 60% external; chorus 30–60% internal); fluctuation detection (a chorus that swings to 90% external is also out of band). Surfaces on dashboard FidelityPanel + scoring eval.

Failure looks like: A verse that names six feelings without observing one scene ("I'm so lonely, I'm so tired, I'm so lost…"); or a chorus that lists three weather observations and never lands the emotional weight the verse was building toward.
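Given per-line labels, the per-section ratio check against the stated targets (verse above 60% external, chorus 30–60% internal) might look like the sketch below. The names `DetailType`, `share`, and `inBand` are hypothetical, the line classification itself is taken as input rather than computed, and counting neutral lines in the denominator is an assumption.

```typescript
type DetailType = "external" | "internal" | "neutral";
type SectionKind = "verse" | "chorus";

// Fraction of lines in a section carrying the given label.
function share(labels: readonly DetailType[], kind: DetailType): number {
  if (labels.length === 0) return 0;
  return labels.filter((l) => l === kind).length / labels.length;
}

// True when the section's ratio meets the target stated for its kind.
function inBand(section: SectionKind, labels: readonly DetailType[]): boolean {
  if (section === "verse") {
    return share(labels, "external") > 0.6;
  }
  const internal = share(labels, "internal");
  return internal >= 0.3 && internal <= 0.6;
}
```

For example, a chorus labelled nine parts external to one part internal fails the check, matching the fluctuation case described in the signals.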

POV Consistency

Inside Voice →

A narrator stays internally coherent across sections. The speaker who said "I take the long way home" in V1 cannot be the speaker who says "we built this city together" in V3 unless the song earned the shift. POV consistency tracks gender / age / relational tuple / diction register; intentional POV switches (gospel call-and-response, K-pop multi-voice, hip-hop features) count as consistent when each voice is internally stable and switches mark structural moments. v1.2.0 of the rubric formalized "intentional POV" as the metric's lens; this sub-criterion names the underlying signal.

Signals: narrator-profile.ts (B951) extracts gender / age / relational signals per section; cross-section contradictions flagged ("my wife" + "my husband", "I'm 16" + "I'm 40"). Diction-shift detection catches register drift (a mechanic doesn't suddenly quote Rilke). Intentional POV switches are recognized via section-marker patterns + collaborative-form signals.

Failure looks like: A speaker who's 24 in V1 ("I'm too young to know what I want") and 50 in the bridge ("I've lived enough lives to know"); or a narrator who speaks in workplace argot for two verses and suddenly invokes Stoic philosophy in the chorus without earning the shift; or a song that switches from first-person singular to first-person plural in the bridge for no structural reason.

As of v1.2.0, sub-criteria are defined in the TypeScript source but not yet in the JSON spec; they will be canonized in the next coordinated release.

Composite Formula

composite = round(craftAvg * 0.25 + expressionAvg * 0.40 + impactAvg * 0.35)
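As a worked illustration of the formula, the sketch below averages the metric scores in each tier and applies the published weights. It is not the package's `computeComposite()`: `compositeOf` and `TierScores` are hypothetical names, the simple mean per tier is an assumption, and `Math.round` (which rounds halves up) stands in for `round` because the page does not specify tie-breaking.

```typescript
interface TierScores {
  craft: number[];      // e.g. prosody, structure, rhyme, economy
  expression: number[]; // e.g. specificity, originality, truth, voice
  impact: number[];     // e.g. transcendent line, arc, memorability, genre
}

// Tier weights as published: Craft 25%, Expression 40%, Impact 35%.
const WEIGHTS = { craft: 0.25, expression: 0.4, impact: 0.35 } as const;

const mean = (xs: readonly number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Weighted composite of the three tier averages, rounded to an integer.
function compositeOf(t: TierScores): number {
  return Math.round(
    mean(t.craft) * WEIGHTS.craft +
      mean(t.expression) * WEIGHTS.expression +
      mean(t.impact) * WEIGHTS.impact,
  );
}
```

With tier averages of 67.5 (Craft), 72.5 (Expression), and 62.5 (Impact), the weighted sum is 16.875 + 29 + 21.875 = 67.75, which rounds to a composite of 68.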

Frequently asked questions

How are song lyrics scored?

Song lyrics are scored against the Lyric Scoring Standard v1.2.0 — an open, evidence-based rubric of 12 metrics weighted across three tiers (Craft 25%, Expression 40%, Impact 35%). Each metric is graded 0–100 with cited evidence from the lyric itself, then combined into a weighted composite. Five anti-inflation rules are load-bearing: a 50 is average and a 90+ is rare, so a score means what a human craft critic would mean by it.

What is a good lyric score?

A composite of 65–72 (grade B) is good — solid professional output. 80+ (grade A) is excellent and genuinely strong; 91+ (grade S) is canonical work that future songs learn from. Under the standard's anti-inflation rules a 50 is average, so anything above the mid-60s already beats most of what the rubric sees. The full grade scale and percentile anchors are published on this page.

What are the 12 metrics?

The 12 metrics are Prosody & Musicality, Structural Architecture, Rhyme Intelligence, Economy of Language, Lyrical Specificity, Imagery Originality, Emotional Truth, Voice & POV Integrity, The Transcendent Line, Emotional Arc, Memorability, Genre Authenticity. They group into three weighted tiers — Craft 25%, Expression 40%, Impact 35% — so craft mechanics, emotional expression, and listener impact each carry explicit weight in the composite. Every metric ships with a definition and a "what good looks like" anchor in the machine-readable rubric JSON.

Who can use the Lyric Scoring Standard?

Anyone. The standard is published under CC BY 4.0 — adopt it, implement it, or score against it freely with attribution ("Lyric Scoring Standard v1.2.0, SongForgeAI"). No charge, no permission needed. The canonical rubric ships as machine-readable JSON with a JSON Schema for validation, an npm package, and a hand-scored reference corpus to calibrate independent implementations against.

How to cite

When you implement the rubric, score against it, or reference it in a paper or blog post, attribute:

Lyric Scoring Standard v1.2.0
SongForgeAI (2026)
https://songforgeai.com/scoring/standard
Licensed under CC BY 4.0

Changelog

v1.2.0 · 2026-05-02

MINOR: M8 (Voice & POV Integrity) refactored from "one coherent narrator" to "INTENTIONAL POV"; deliberate switches (K-pop multi-voice, hip-hop features, gospel call-and-response) no longer score as drift failures. M11 (Memorability) refactored from the single 60-minute test to a 4-signal cumulative read (hook integration + phonemic distinctiveness + chorus repetition strategy + one-listen recall) so cumulative / oral-tradition / ritual-repetition forms aren't false-positive low-scored. Per Super Deep Audit §5 cuts #1 + #2. Score deltas on the golden-eval set: <3 points on average. The refactor is descriptive (clarifying when the metric applies), not prescriptive (changing what the metric values). Within MINOR threshold per RFC-0001.

v1.1.0 · 2026-04-25

MINOR (B1240): first MINOR bump shipped through the published cadence. Anti-Inflation rules expanded from 4 to 5 with the addition of the Anti-Platitude rule (lines that resolve with generic emotional summaries hit the lowest Specificity + Voice band regardless of surface polish). Antagonist Ceiling clarified to require evidence; Historical Context anchored to the published corpus. Score deltas on the golden-eval set: <2 points on average (within MINOR threshold). Migration: existing scored content is auto-rescored on next eval; the seal field's rubricVersion now reads '1.1.0'. RFC-0002 (anti-platitude formalization) drafted as the in-comment artifact for this bump.

v1.0.1 · 2026-04-25

PATCH (B1211): docs only. Reproducibility seal landed in /api/v1/score (B1199); model card published at /scoring/standard/model-card (B1197). No score deltas. Versioning policy formalized in RFC-0001 (in-comment until 2026-05-02): MAJOR for >5pt golden-eval delta, MINOR for clarifications, PATCH for docs/typos.

v1.0.0 · 2026-04-20

Initial public release. 12 metrics finalized, anti-inflation rules documented, grade scale locked.

Translations

Español Français 日本語

Reference corpus

21 hand-scored exemplars across the score spectrum.

Implementing the rubric? Calibrate against the corpus. Floor anchor at score 18 (F-band), ceiling at 95 (S-band). Any independent implementation that drifts more than 5 points on either has miscalibrated the Anti-Inflation rules.
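The calibration rule above can be expressed as a drift check against the two anchors. In this sketch `isCalibrated` is a hypothetical helper, and it assumes an implementer has already scored the floor and ceiling exemplars with their own pipeline.

```typescript
// Anchor scores stated for the reference corpus.
const ANCHORS = { floor: 18, ceiling: 95 } as const;

// Maximum allowed drift on either anchor, in points.
const MAX_DRIFT = 5;

// True when both anchor scores are within MAX_DRIFT of the published values.
function isCalibrated(myFloorScore: number, myCeilingScore: number): boolean {
  return (
    Math.abs(myFloorScore - ANCHORS.floor) <= MAX_DRIFT &&
    Math.abs(myCeilingScore - ANCHORS.ceiling) <= MAX_DRIFT
  );
}
```

An implementation that scores the floor exemplar at 30, for instance, has drifted 12 points and fails the check even if its ceiling score is exact.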

Read the reference corpus

Human Contribution Log

The methodology behind the “X% human-authored” chip.

Every song carries a per-line provenance ledger recording who wrote each line. The aggregate %human surfaced on the seal is computed deterministically from that ledger — not estimated, not vibes. Required reading for anyone evaluating the authorship claim under US Copyright Office 2025 mixed-authorship guidance.
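To show what "computed deterministically from that ledger" could mean in practice, here is a minimal sketch. The `LedgerEntry` shape and `humanPercent` are hypothetical, and weighting every line equally is an assumption; the actual receipt schema and weighting are defined in the Human Contribution Log methodology, not here.

```typescript
type Author = "human" | "ai";

// Hypothetical ledger row: one entry per lyric line.
interface LedgerEntry {
  lineNumber: number;
  author: Author;
}

// Share of lines attributed to a human, as a whole-number percentage.
function humanPercent(ledger: readonly LedgerEntry[]): number {
  if (ledger.length === 0) return 0;
  const human = ledger.filter((entry) => entry.author === "human").length;
  return Math.round((human / ledger.length) * 100);
}
```

Because the result depends only on the ledger contents, the same ledger always yields the same percentage.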

Read the Human Contribution Log methodology

Model card

See the runtime mechanics behind every score.

The companion model card documents which Anthropic model produces the score, what temperature it runs at, where the prompt lives, and the known limitations. Required reading if you’re implementing against the standard.

Read the model card

Reproducibility seal

Verify any score signed by the standard.

Every score response carries an ed25519-signed envelope binding the rubric version, model id, temperature, and deploy SHA together with the score itself. Changing any field after the fact invalidates the signature. The public key is at /.well-known/songforgeai-pubkey.json; the full field schema, verification flow, and guarantees are documented at the seal spec page.
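The tamper-evidence property can be demonstrated with Node's built-in `crypto` module. This sketch does not reproduce the real seal format or the SDK's `verifySeal()`: the `SealedFields` names, the fixed-order JSON encoding, and the throwaway key pair are all assumptions made for illustration. The actual field schema and verification flow are in the seal spec.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical envelope fields; the real schema lives in the seal spec.
interface SealedFields {
  score: number;
  rubricVersion: string;
  modelId: string;
  temperature: number;
  deploySha: string;
}

// Assumption: a fixed-order JSON array stands in for the real canonical encoding.
function canonicalBytes(f: SealedFields): Buffer {
  return Buffer.from(
    JSON.stringify([f.score, f.rubricVersion, f.modelId, f.temperature, f.deploySha]),
    "utf8",
  );
}

// True when the signature covers exactly these field values.
function sealMatches(f: SealedFields, signature: Buffer, publicKey: KeyObject): boolean {
  // For Ed25519 keys, Node's verify() takes null as the algorithm argument.
  return verify(null, canonicalBytes(f), publicKey, signature);
}

// Demo with a throwaway key pair (NOT the published SongForgeAI key).
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const fields: SealedFields = {
  score: 72,
  rubricVersion: "1.2.0",
  modelId: "example-model",
  temperature: 0,
  deploySha: "abc1234",
};
const signature = sign(null, canonicalBytes(fields), privateKey);

const untouched = sealMatches(fields, signature, publicKey);
const tampered = sealMatches({ ...fields, score: 99 }, signature, publicKey);
```

Here `untouched` verifies and `tampered` does not, because changing any bound field changes the bytes the signature was computed over.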

Read the seal spec

Inter-rater agreement

30 humans will score the corpus against the rubric.

The honest credibility test for any LLM-driven scoring system: do human graders applying the same rubric agree? This page pre-registers the cohort methodology — ICC(2,1) per Shrout & Fleiss (1979), Cicchetti (1994) banding, immutability constraints, target ICC 0.60+ on composite. Results auto-render from the public cohort_runs table when the first cohort publishes.
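For readers unfamiliar with the statistic, ICC(2,1) is the two-way random-effects, single-rater, absolute-agreement coefficient from Shrout & Fleiss (1979), computed from the ANOVA mean squares as (BMS − EMS) / (BMS + (k − 1)·EMS + k·(JMS − EMS)/n). The sketch below implements that formula for a complete ratings matrix; `icc21` and `cicchettiBand` are hypothetical names and are not taken from the cohort pipeline. The band cut-offs follow Cicchetti (1994).

```typescript
// ratings[i][j] = score given to target i (song) by rater j; no missing cells.
function icc21(ratings: readonly (readonly number[])[]): number {
  const n = ratings.length;    // number of targets
  const k = ratings[0].length; // number of raters
  const sum = (xs: readonly number[]) => xs.reduce((s, x) => s + x, 0);

  const grand = sum(ratings.map((row) => sum(row))) / (n * k);
  const rowMeans = ratings.map((row) => sum(row) / k);
  const colMeans = Array.from({ length: k }, (_, j) =>
    sum(ratings.map((row) => row[j])) / n,
  );

  // Sums of squares for targets (rows), raters (columns), and total.
  const ssRows = k * sum(rowMeans.map((m) => (m - grand) ** 2));
  const ssCols = n * sum(colMeans.map((m) => (m - grand) ** 2));
  const ssTotal = sum(ratings.map((row) => sum(row.map((x) => (x - grand) ** 2))));

  const bms = ssRows / (n - 1);                                  // between-targets
  const jms = ssCols / (k - 1);                                  // between-raters
  const ems = (ssTotal - ssRows - ssCols) / ((n - 1) * (k - 1)); // residual

  return (bms - ems) / (bms + (k - 1) * ems + (k * (jms - ems)) / n);
}

// Cicchetti (1994) interpretation bands.
function cicchettiBand(icc: number): "poor" | "fair" | "good" | "excellent" {
  if (icc < 0.4) return "poor";
  if (icc < 0.6) return "fair";
  if (icc < 0.75) return "good";
  return "excellent";
}
```

Under these bands, the stated target of 0.60+ on composite corresponds to "good" agreement or better.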

Read the methodology

We publish our numbers

See the rubric applied to the real corpus.

Every song forged on SongForgeAI logs into forge_metrics. The aggregate report — composite-score distribution, Berklee batch lift, per-genre slices — updates hourly. Independently verifiable.

Read the latest craft-metrics report

See the rubric in action.

Reading the spec is one thing. The 8-voice Crucible critique applies the rubric to your lyrics in ten seconds, free, no login. That's the fastest path from spec to felt experience.

Score a draft free at /crucible

Adopt the standard: download JSON

Adjacent open artifacts

Standard whitepaper

Full specification + scoring methodology.

Banned-cliché list (CC BY 4.0)

The 362 AI signatures we filter from every lyric. Forkable, citable, audit-able.

What We Refuse To Build (CC BY 4.0)

The four things this product cannot do — no engagement feed, no audio cloning, no artist-identity forgery, a scorer built to under-praise. Each refusal cites the file that enforces it.

Cited by

Public registry of external implementations + citations of the standard.

Audit trail

The four cadence rituals (Quality Council, Trust Decay, Bet Reviews, External Audit) — append-only, public, never deleted.

The Songwriter Index 2026

Annual report on AI-assisted songwriting craft, drawn from the corpus this rubric scores.

Human Contribution Log methodology

How per-line authorship is tracked + receipt schema.