A song can score 90 on quality and still be the wrong song. Quality and fidelity are orthogonal questions. Today we publish the seven-component composite that measures the second one.

The accident that named the problem

Our internal Sacred Accidents log keeps growing. Number 17, logged earlier this year, reads:

"The system can write a good song without writing the same song."

The accident surfaced during a five-song stress test. We gave the forge five hard prompts, each with specific anchors — an old water tower, a dying Midwest town, seventeen years of maintenance, a mother's disappearance, the tower as a character. The forge wrote five songs. The 12-metric quality rubric scored all five in the 80s. They were craft-clean, image-rich, banned-cliché-free.

One of them was a breakup song. Another was a generic loneliness song. A third drifted into a Hozier-tier arena chorus that mentioned the tower exactly once. Only the fourth and fifth songs were actually about what the prompt asked for.

The quality scores were not wrong. The lyrics WERE good. They just weren't the songs the user asked for.

Two questions, two grades

The Lyric Scoring Standard answers is the song good? Twelve metrics across craft, expression, and impact. Weighted into one composite. Anchored at 95 against a 1949 country song that survived seventy-five years. It's the rubric SongForgeAI scores against on every forge.

The Fidelity Standard answers did the system write the song you asked for? Seven components. Weighted into a different composite. Published today at v0.1.0 under CC BY 4.0 at /scoring/standard/fidelity.

Both questions matter. Neither answers the other. A song that nails both is a song you'd actually use.

The seven components

Here is what the fidelity composite weighs and what each component asks:

Weight	Component	Question
30%	Premise match	Did the lyric serve the named premise?
25%	Anchor coverage	Did each required detail land in a vivid line?
15%	Structure compliance	Does the section map match what was asked?
15%	Style constraints	Were the style rules honored (sensory-only, etc.)?
5%	Forbidden language	Did the lyric avoid the banned words?
5%	Chorus evolution	Did the chorus actually shift across the song?
5%	Earned transcendence	Did a V1 image return with transformed meaning?

Premise match (30%) + anchor coverage (25%) = 55% load-bearing weight. The premise is the question the song is supposed to answer. The anchors are the specific details that prove the answer landed. Everything else is secondary discipline — important, but not the load-bearing axes.

Premise match is the only component that uses a Haiku judgment call. Six of the seven are pure regex / vocabulary / structural heuristics. The judgment lives on the question that can't be solved with pattern matching — "did the lyric serve the premise?" requires reading comprehension, not regex.

Complexity-gated UX

Not every song needs a fidelity chip. A user typing "a song about morning" with no anchors, no structure requirement, no style constraints — that brief is too light to score meaningfully. The composite still computes, but the dashboard chip stays hidden.

The standard defines three UX prominence buckets:

hide (complexity 0–2, light brief): chip suppressed unless score drops below 80. A low score on a light brief is the rescue path — the system writes badly enough that the user deserves to know even though they didn't ask for the constraint.
secondary (complexity 3–6, standard brief): chip renders alongside the quality grade at equal weight.
primary (complexity 7–10, heavy brief): fidelity is the headline. The user gave the system a hard target; the first signal they should see is whether the system hit it.

Brief complexity is calculated mechanically: anchor count (max 5) + style constraint count (max 3) + 1 if structure is set + 1 if forbidden language is set + 1 if premise is ≥40 characters. Maxes out at 10. The complexity is visible on the chip's hover-tooltip.

What this changes for users

Three surfaces shift with this release:

Every song's dashboard page now shows TWO grades. Quality (the legacy chip) and fidelity (new). Side by side. Complexity-gated. Linkable into a deep panel that breaks down each of the seven components, shows the Haiku's premise-echo + cited evidence lines, and surfaces any sensory-rewrite pairs the gauntlet executed.
The leaderboard now ranks by composite. 0.6 × quality + 0.4 × fidelity. Songs forged before this release fall back to quality-only ranking, so historical position is preserved. Songs with both signals compete on the combined axis. A great-quality song that ignored the brief drops; a faithful song with mediocre craft rises.
The forge runs a sensory-rewrite gauntlet when the brief asks for sensory imagery AND the thesis audit catches analytical lines. Targeted line-level Sonnet rewrites; only flagged lines change; the rest of the song stays intact.

RFC-0010 opens today

v0.1.0 isn't v1.0.0. The numbers — weights, mode multipliers, grade band edges, complexity formula, UX bucket thresholds — are operator-locked but open for public comment. RFC-0010 opens a seven-day window through 2026-05-26.

Five open questions for the comment window:

Is the 30/25 split between premise and anchors the right calibration?
Should strict mode cap at 0.95 or at 0.90?
Is the 'primary' bucket threshold (complexity ≥ 7) the right edge?
Should chorus evolution + earned transcendence weigh more than 5% each?
What's the right semantics for an N/A verdict — abstain or count negatively?

Comments via email to todd@songforgeai.com with subject "RFC-0010 comment." After the window closes, v1.0.0 ships with any incorporated feedback. The audit implementation in code tracks separately at FIDELITY_AUDIT_VERSION = 1.4.0 (post-B2788).

The bigger frame

The reason this matters isn't because fidelity is more important than quality. It isn't. Quality is the floor — a song that reads badly doesn't deserve a high fidelity score; the floor catches it first.

The reason fidelity matters is that no other AI lyric tool measures it. Suno doesn't. Udio doesn't. LyricStudio doesn't. The pattern across the category is to optimize the generation step and let "does the output match the prompt?" become an implicit, vibes-based check.

For a serious songwriter who writes against a brief — a sync placement, a film project, a country radio target, a worship anthem with theological constraints — that vibes-based check is not enough. They need a system that scores BOTH axes the way they think about them. Two questions: is it good? + did it answer my prompt?

That's the product we shipped today. The standard is open. The RFC is live. Use it, fork it, build on it. CC BY 4.0.

— Todd
Founder, SongForgeAI

Introducing the Fidelity Standard v0.1.0

The accident that named the problem

Two questions, two grades

The seven components

Complexity-gated UX

What this changes for users

RFC-0010 opens today

The bigger frame

SongForgeAI Launches to Solve AI Music's Lyric-Quality Problem

Not a Song — a Whole Album: Introducing the Showcase Albums

RFC-0010: Five Open Questions on the Fidelity Standard

Per-line authorship for AI-assisted lyrics. Receipts that hold up.