Product · 2026-05-19 · 7 min read · By Todd Nigro

RFC-0010: Five Open Questions on the Fidelity Standard

We published the Fidelity Standard v0.1.0 last week. The numbers are operator-locked but open for public comment. Here are the five questions we genuinely want outside input on before v1.0.0 ships.

We published the Fidelity Standard v0.1.0 last week. Seven components, weighted into a composite, with a calibrated grade scale and three UX prominence buckets. The numbers are operator-locked, but they are NOT settled. Five questions stayed deliberately open when v0.1.0 shipped; the public comment window (RFC-0010, through 2026-05-26) is here to resolve them before v1.0.0 ossifies.

If you write songs, work in AI products, or care about how AI lyric tools should be measured, these are the questions where outside input changes the answer.

1. Is the 30/25 split between premise and anchors the right calibration?

The composite weighs premise match at 30% and anchor coverage at 25%. Premise + anchors = 55% load-bearing. That's the question Sacred Accident #17 framed as the fundamental fidelity ask, so the load-bearing majority feels right. But the split between them — 30/25 vs say 25/30 or 35/20 — is open.

The case for 30/25 (current): the premise IS the question; anchors are the supporting details. A song that gets the premise right but misses one anchor is closer to fidelity than one that hits every anchor but misses the premise entirely. The premise should weigh more.

The case for 25/30: premise is Haiku-judged (semantic, noisier); anchors are mechanically verifiable via regex + vocabulary heuristics. We should weigh the more-precise signal higher. Anchors should weigh more.

The case for 35/20: the premise IS the answer to "what is this song about?" — everything else is supporting craft. Premise should weigh much more.

We've calibrated against the five-song stress test (which produced SA#17) + a small corpus of operator-rated edge cases, but neither sample is large enough to settle this confidently. We'd specifically value input from songwriters who've used AI tools long enough to have an opinion on which signal correlates better with their gut reading of "this is/isn't the song I asked for."
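To make the trade-off concrete, here is a minimal sketch of how the three candidate splits treat the two songs described in the case for 30/25. The two songs, their component scores, and the 0-100 component scale are illustrative assumptions, not data from the calibration corpus, and the function name is hypothetical.

```python
# Candidate premise/anchor weights (percent of the composite), from the three cases above.
SPLITS = {"30/25": (30, 25), "25/30": (25, 30), "35/20": (35, 20)}


def premise_anchor_points(premise_score, anchor_score, premise_w, anchor_w):
    """Composite points contributed by premise + anchors alone.

    Scores are assumed to be on a 0-100 scale; the maximum return value
    is premise_w + anchor_w (55 for every split considered here).
    """
    return (premise_w * premise_score + anchor_w * anchor_score) / 100


# Hypothetical (premise_score, anchor_score) pairs, invented for illustration.
SONG_A = (100, 60)   # premise right, some anchors missed
SONG_B = (40, 100)   # every anchor present, premise largely missed


def gap_by_split():
    """Points by which song A leads song B under each candidate split."""
    return {
        name: premise_anchor_points(*SONG_A, pw, aw)
        - premise_anchor_points(*SONG_B, pw, aw)
        for name, (pw, aw) in SPLITS.items()
    }


print(gap_by_split())  # {'30/25': 8.0, '25/30': 3.0, '35/20': 13.0}
```

With these made-up scores the premise-right song leads under all three splits, but its lead shrinks to 3 points at 25/30 and grows to 13 at 35/20. The split mostly decides how close a "right anchors, wrong premise" song is allowed to get.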

2. Constraint-mode multipliers: 0.95 cap on strict, or 0.90?

The Standard supports three constraint modes:

  • strict (× 0.95) — user demanded zero deviation; score is capped at 95
  • standard (× 1.00) — default neutral
  • loose (× 1.15) — user opted into deviation; capped at 100

The 0.95 multiplier on strict mode means a "perfect" strict score lands at A (95), not A+ (100). Two arguments:

For 0.95 (current): strict mode is an aspirational ceiling — the user says "I demand zero deviation" and the rubric responds "perfect work in this mode lands one band below the open ceiling." It's an honest acknowledgment that perfect is unreachable when you demand zero deviation.

For 0.90 (alternative): 0.95 is too generous. If the user demands zero deviation, the bar should be HIGHER, not lower. A 0.90 cap means even excellent strict work caps at A− territory; A+ is reserved for loose-mode brilliance.

We picked 0.95 because the calibration corpus suggested that is where excellent strict work actually lands; 0.90 would push too many genuinely faithful strict-mode songs into the B band. But we have not run the experiment formally.
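The arithmetic behind the two options is small enough to sketch. This assumes the multiplier is applied to a 0-100 raw composite and the result is capped at 100, which is how the mode list above reads; the function name, the `multipliers` parameter, and the sample raw scores are hypothetical.

```python
# Multipliers as listed in the Standard; the strict value is the one under debate.
MULTIPLIERS = {"strict": 0.95, "standard": 1.00, "loose": 1.15}


def apply_mode(raw_score, mode, multipliers=MULTIPLIERS):
    """Scale a 0-100 raw composite by the mode multiplier, capped at 100."""
    return round(min(100.0, raw_score * multipliers[mode]), 1)


# The 0.90 alternative from this question, with the other modes unchanged.
ALTERNATIVE = {**MULTIPLIERS, "strict": 0.90}

print(apply_mode(100, "strict"))               # 95.0  (current ceiling)
print(apply_mode(100, "strict", ALTERNATIVE))  # 90.0  (proposed ceiling)
print(apply_mode(90, "loose"))                 # 100.0 (103.5 before the cap)
```

Under either option a raw 100 in strict mode cannot reach 100. The open question is only whether the ceiling sits at 95 or 90.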

3. 'Primary' bucket threshold — complexity ≥ 7?

Brief complexity is a 0-10 integer based on anchor count + style constraint count + structure-set + forbidden-set + premise length. The UX prominence buckets:

  • hide (complexity 0-2): suppress chip unless score < 80 (rescue path)
  • secondary (complexity 3-6): equal-weight with quality
  • primary (complexity 7-10): fidelity is the headline; quality renders secondary

The boundary between secondary + primary is at 7. Open question: should it be 6, 7, or 8?

The 7 threshold filters to briefs with at least 4 of 5 constraint dimensions populated. That feels like the right line — the user gave the system a heavy brief, the headline answer should be "did we honor it." But the line is operator-set, not validated against user research.

Lower threshold (6): more songs become "primary" — the fidelity story dominates more often. Risk: light-but-meaningful briefs hit primary and the user reads the fidelity grade with more weight than the underlying brief justified.

Higher threshold (8): fewer "primary" songs. Risk: the headline-fidelity treatment becomes rare; users with heavy briefs don't get the visual emphasis the brief earned.
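A minimal sketch of the bucket rule, with the primary threshold as a parameter so the 6 / 7 / 8 options can be compared on the same brief. The function name, parameter names, and return shape are hypothetical. How the complexity integer itself is computed is not specified here, so the sketch takes it as an input.

```python
def prominence(complexity, fidelity_score, primary_threshold=7, rescue_below=80):
    """Return (bucket, show_chip) for a brief complexity of 0-10.

    Buckets follow the list above: hide (0-2), secondary (3 up to the
    threshold), primary (threshold-10). In the hide bucket the chip is
    only shown when the score falls below the rescue line.
    """
    if not 0 <= complexity <= 10:
        raise ValueError("complexity must be an integer from 0 to 10")
    if complexity >= primary_threshold:
        return ("primary", True)
    if complexity >= 3:
        return ("secondary", True)
    return ("hide", fidelity_score < rescue_below)


# The same complexity-6 brief under the current and the lower threshold.
print(prominence(6, 90))                       # ('secondary', True)
print(prominence(6, 90, primary_threshold=6))  # ('primary', True)
# Rescue path: a light brief with a low score still surfaces the chip.
print(prominence(2, 75))                       # ('hide', True)
```

Only briefs at complexity 6 or 7 change buckets across the three options, so those are the briefs worth looking at in any user research on this question.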

4. Chorus evolution + earned transcendence — should they weigh more than 5% each?

Both components are at 5%. Combined they're 10% of the composite. Some early feedback suggested the literary moves they measure (chorus that actually shifts; V1 image returning transformed) are load-bearing for re-listen quality and should weigh more — call it 8% each, totaling 16%.

The counter: 5% × 2 = 10% is already enough signal to move the composite, but not so much it crowds out the load-bearing premise + anchors + structure weights. Raising chorus + transcendence pushes the load-bearing components down proportionally, which feels backwards — those ARE the load-bearing axes per SA#17.

We're open to either calibration. The deciding evidence would be: do songs that score well on chorus + transcendence correlate with songs that listeners rate as more memorable / more re-listenable? Without that correlation data, we're guessing.
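For a sense of scale, here is a sketch of what "pushes the load-bearing components down proportionally" would look like at 8% each. Two assumptions: the weights sum to 100, so the three components not itemized in this post share the remaining 35% and are lumped into one placeholder entry; and the rescale is strictly proportional. All names are hypothetical.

```python
# Weights stated in this post, plus one lumped placeholder for the three
# components whose individual weights are not listed here (assumed 35 total).
CURRENT_WEIGHTS = {
    "premise": 30.0,
    "anchors": 25.0,
    "chorus_evolution": 5.0,
    "earned_transcendence": 5.0,
    "remaining_three": 35.0,
}


def raise_weights(weights, raised):
    """Set the weights in `raised`; shrink all others proportionally to keep the total at 100."""
    untouched_total = sum(w for name, w in weights.items() if name not in raised)
    scale = (100.0 - sum(raised.values())) / untouched_total
    return {
        name: raised[name] if name in raised else round(w * scale, 2)
        for name, w in weights.items()
    }


proposal = raise_weights(
    CURRENT_WEIGHTS, {"chorus_evolution": 8.0, "earned_transcendence": 8.0}
)
print(proposal["premise"], proposal["anchors"])  # 28.0 23.33
```

Under those assumptions premise + anchors would drop from 55% to roughly 51.3% of the composite: still a majority, but a thinner one.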

5. 'N/A' verdict semantics — abstain or count negatively?

When a fidelity component doesn't apply to a song (the brief didn't ask for that constraint OR the lyric lacks the structural prerequisite — e.g., transcendence on a verse-only lyric), the audit returns 'N/A'. The current behavior: N/A excludes the component from the composite; the remaining components' weights redistribute proportionally.

This is the principled behavior because "the bar IS the brief" — if the user didn't ask for a forbidden-language list, we shouldn't penalize them for not having one. But it's not the only defensible behavior.

Alternative: count N/A negatively. The system "missed" the chance to demonstrate fidelity on that axis, even though the user didn't ask. Under that math, a song that hits all 5 axes the brief named would score higher than a song that only addressed 2. The current null-redistribution math TIES those two cases at the same composite, which can feel wrong intuitively.

We picked null-redistribution because the "wrong-song" failure mode is specifically about songs that ignored what the user DID ask for. Penalizing songs for not addressing what the user DIDN'T ask for is a different problem. But the operator-rated edge cases we tested don't unambiguously favor one math over the other.
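The two behaviors differ by a single denominator, sketched below. Assumptions: component scores are on a 0-100 scale, N/A is represented as `None`, "count negatively" means an N/A component contributes zero, and the three components not itemized in this post are lumped into one 35% placeholder. The function name and the sample scores are hypothetical.

```python
WEIGHTS = {
    "premise": 30,
    "anchors": 25,
    "chorus_evolution": 5,
    "earned_transcendence": 5,
    "remaining_three": 35,  # lumped placeholder, assumed to total 35
}


def composite(scores, weights=WEIGHTS, na_counts_as_zero=False):
    """Weighted composite where None marks an N/A component.

    Default: N/A components are dropped and the remaining weights are
    renormalized (null-redistribution). With na_counts_as_zero=True the
    full weight total stays in the denominator, so every N/A costs points.
    """
    applicable = {name: s for name, s in scores.items() if s is not None}
    if not applicable:
        raise ValueError("every component is N/A")
    earned = sum(weights[name] * s for name, s in applicable.items())
    if na_counts_as_zero:
        total = sum(weights.values())
    else:
        total = sum(weights[name] for name in applicable)
    return round(earned / total, 2)


# Hypothetical songs: one scored on every axis, one where only two axes applied.
full_brief = {name: 90 for name in WEIGHTS}
light_brief = {**{name: None for name in WEIGHTS}, "premise": 90, "anchors": 90}

print(composite(full_brief), composite(light_brief))  # 90.0 90.0
print(composite(light_brief, na_counts_as_zero=True))  # 49.5
```

The first line is the tie described above. The second shows how steep the penalty becomes under the simplest "count negatively" rule, which suggests any version of the alternative would need a softer penalty than a flat zero.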

How to comment

Email todd@songforgeai.com with the subject "RFC-0010 comment." Specify which open question(s) you have a view on; include your reasoning. Pseudonymous comments welcome (we publish the reasoning, not the commenter, unless you opt in to the byline).

The window closes 2026-05-26. After that, v1.0.0 ships with whatever the calibration evidence supports. The audit implementation in code (currently FIDELITY_AUDIT_VERSION = 1.5.0) tracks separately from the public standard version — implementation can move via patch; the standard moves via RFC.

Why we're asking

The Fidelity Standard is a category-creation play. "Lyric fidelity" doesn't have an established noun in the songwriting + AI-product literature. We're claiming it, and we want the claim to be load-bearing — citable, defensible, externally validated.

The way categories become load-bearing is through public governance: the rubric is published under CC BY 4.0, the version history is append-only at /scoring/standard/fidelity/changelog, the RFC process is public, the comments inform the calibration. If v1.0.0 ships with weights the community didn't have a chance to argue, the category is just our product's calibration — which is what every other AI lyric tool already does.

We don't think we have all five answers right. That's why these five questions are open and not closed.

— Todd
Founder, SongForgeAI