How we measure chorus compression — and why it matters more than emotion.
External reviewers kept flagging the same gap: choruses that are emotionally correct but musically forgettable. So we built an analyzer that measures the structural compression that makes a chorus chant-able. Here is how it works, what it doesn’t do, and why we shipped instrumentation before scoring.
The pattern that kept showing up
Two external reviewers, two batches of 100 forged songs each, both flagged the same gap:
"The verses feel lived. The choruses sometimes become emotionally correct rather than musically inevitable. They work, but they do not always become the kind of line a listener remembers after one listen."
That sentence, "emotionally correct rather than musically inevitable," was the thing the rubric couldn't see. The 12-metric Lyric Scoring Standard measures specificity, voice, arc, transcendence, memorability, and seven others. Every metric is semantic. None of them measures whether a chorus has the structural compression that makes a phrase chant-able.
You can write a chorus that's specific (M2), truthful (M7), and cleanly arced (M3), and it still might not stick. The reviewer was right.
What "compression" means here
A great chorus is shorter per line than the verses around it. "I will always love you." "Dancing in the dark." "Hey Jude, don't make it bad." Four to seven words. Hard consonants. The phrase survives being shouted in a stadium.
The model's choruses, when they failed, failed by writing prose with line breaks. A specific lived-detail line that would have been a great verse moment becomes the chorus, and the chorus loses its job.
We needed a number. So we built three.
Three signals, one analyzer
The rhythmic-dimension analyzer (src/lib/rhythm/index.ts) is plain TypeScript. No model calls. It runs in under a millisecond and computes:
- Compression ratio. Average syllables per chorus line divided by average syllables per verse line. Below 0.85 is compressed (the typical chant-able pattern). Above 1.15 is bloated (the failure mode the reviewers named). Between is matched. (Sketched in code below.)
- End-rhyme architecture. Per-section rhyme scheme detection: AABB couplets, ABAB alternating, AAB blues stanzas, AAAA monorhyme, ABCB ballad. Letter-tail matching with silent-e and syllabic-le handling so "bone" and "stone" and "alone" all resolve to the same tail (sketched after this list).
- Internal-rhyme density. Rhymes within a single line, normalized per 100 words. Eminem-style multi-rhyme verses score "Dense." Prosaic narrative verses score "None."
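To make the tail-matching concrete, here is a minimal sketch of the letter-level approach. Helper names are hypothetical; the shipped analyzer in src/lib/rhythm/index.ts handles more edge cases than this:

```typescript
/** Comparable "tail" of a word: the final vowel group plus trailing consonants. */
function rhymeTail(word: string): string {
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  // Drop a silent trailing "e" ("bone" -> "bon"), but leave syllabic "-le"
  // words ("table") intact, since that "e" carries a syllable.
  const base = /[^aeiou]le$/.test(w) ? w : w.replace(/([^aeiou])e$/, "$1");
  const m = base.match(/[aeiouy]+[^aeiouy]*$/);
  return m ? m[0] : base;
}

const rhymes = (a: string, b: string): boolean => rhymeTail(a) === rhymeTail(b);

rhymes("bone", "stone"); // true: both tails resolve to "on"
rhymes("bone", "alone"); // true: "alone" -> "alon" -> "on"
rhymes("smoke", "oak");  // false: letter tails "ok" vs "oak" (the phonetic gap disclosed below)
```

Scheme detection then assigns letters (A, B, C, ...) per section by grouping line-ending words whose tails match.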
Then a fourth signal that's actually the headline: chorusCompressed, a boolean that's true when the ratio is below 0.85, false when above 1.15, null when there's not enough structure to compare. That's the signal the writer sees first on their forge result.
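Putting the ratio and the tri-state together, the headline computation is roughly this shape. A simplified sketch, not the shipped code; the thresholds are the published 0.85 / 1.15 bets, and in this sketch the matched band makes no claim either way:

```typescript
interface CompressionResult {
  ratio: number | null; // avg chorus syllables/line / avg verse syllables/line
  band: "compressed" | "matched" | "bloated" | null;
  chorusCompressed: boolean | null; // the headline tri-state
}

function compressionSignal(
  verseLines: string[],
  chorusLines: string[],
  syllablesInLine: (line: string) => number, // heuristic counter, sketched later
): CompressionResult {
  // Not enough structure to compare: report null, don't guess.
  if (verseLines.length === 0 || chorusLines.length === 0) {
    return { ratio: null, band: null, chorusCompressed: null };
  }
  const avg = (lines: string[]) =>
    lines.reduce((sum, line) => sum + syllablesInLine(line), 0) / lines.length;
  const ratio = avg(chorusLines) / avg(verseLines);

  if (ratio < 0.85) return { ratio, band: "compressed", chorusCompressed: true };
  if (ratio > 1.15) return { ratio, band: "bloated", chorusCompressed: false };
  return { ratio, band: "matched", chorusCompressed: null }; // matched: no claim either way
}
```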
What the analyzer doesn't do
Three things, named honestly:
- It doesn't measure semantic memorability. A specific lived-detail chorus and a generic catchphrase chorus both register as compressed if they're short. Phase 4 (later) wires this into M11 (Memorability) so the eval pipeline can punish "compressed but generic" the way the reviewer would.
- It doesn't catch phonetic-only rhymes. "Smoke" and "oak" sound the same; their letter tails are different. The current analyzer is letter-level. A CMU-dict integration is on the roadmap when absolute syllable-and-phoneme precision starts to matter.
- The syllable count is heuristic. Vowel-group counting with silent-e and syllabic-le handling (sketched below). Accurate to within ~10% for most English words. Sufficient for relative comparisons (chorus density vs. verse density), not sufficient for absolute meter analysis.
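For the curious, the heuristic is roughly this shape (a simplified sketch of the vowel-group approach, not the shipped counter):

```typescript
// Vowel-group syllable counting with silent-e and syllabic-le handling.
// Good enough for relative chorus-vs-verse comparison, not absolute meter.
function countSyllables(word: string): number {
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (w.length === 0) return 0;
  // One syllable per run of vowels: "memorable" -> e, o, a, e -> 4 groups.
  let count = (w.match(/[aeiouy]+/g) ?? []).length;
  // A silent trailing "e" doesn't count ("stone" -> 1, not 2)...
  if (/[^aeiou]e$/.test(w) && count > 1) count -= 1;
  // ...unless it's syllabic "-le" ("table" -> 2).
  if (/[^aeiou]le$/.test(w)) count += 1;
  return Math.max(count, 1);
}

// Line-level load is the per-word sum, which feeds the compression ratio.
const syllablesInLine = (line: string): number =>
  line.split(/\s+/).filter(Boolean).reduce((n, w) => n + countSyllables(w), 0);
```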
All three limitations are documented in the Lyric Scoring Standard whitepaper, Appendix A.9. The point of an open standard is honest disclosure, including what the standard doesn't yet do.
Telemetry before scoring
The whole reason we shipped this in three phases (analyzer → telemetry → UI) instead of jumping straight to "compression ratio is part of M11 now" is calibration discipline.
The thresholds (0.85 / 1.15) are an initial bet, derived from a 10-entry synthetic calibration corpus that ships with regression tests: five compressed chorus structures, five bloated ones. The calibration test asserts that every entry lands in its expected band AND that the bands are strictly separated (no overlap). When the analyzer drifts, the corpus catches it before users do.
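In test form, the two assertions look roughly like this (assuming a Vitest-style runner; import paths and corpus field names are hypothetical):

```typescript
import { describe, expect, it } from "vitest";
// Hypothetical import paths; the real corpus and analyzer are in the repo.
import { compressionSignal, syllablesInLine } from "../src/lib/rhythm";
import { calibrationCorpus, type CorpusEntry } from "./calibration-corpus";

const ratioOf = (e: CorpusEntry): number =>
  compressionSignal(e.verseLines, e.chorusLines, syllablesInLine).ratio!;

describe("compression calibration corpus", () => {
  it("lands every entry in its expected band", () => {
    for (const entry of calibrationCorpus) {
      const { band } = compressionSignal(entry.verseLines, entry.chorusLines, syllablesInLine);
      expect(band).toBe(entry.expectedBand);
    }
  });

  it("keeps the bands strictly separated", () => {
    const compressed = calibrationCorpus.filter((e) => e.expectedBand === "compressed");
    const bloated = calibrationCorpus.filter((e) => e.expectedBand === "bloated");
    // No overlap: the loosest compressed ratio stays below the tightest bloated one.
    expect(Math.max(...compressed.map(ratioOf))).toBeLessThan(Math.min(...bloated.map(ratioOf)));
  });
});
```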
But the corpus is synthetic. Real production lyrics span a wider distribution. So Phase 2 ships post-forge telemetry: every forge logs a structured event with the analyzer's output. After 14 days of production data, we'll know what compression ratios top-scoring songs actually cluster around. Phase 4 will flip the wound trigger using THAT distribution, not our intuition.
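The event itself is small. Something like this shape, with field names as illustrative guesses; the commitment is just "a structured event with the analyzer's output":

```typescript
// Hypothetical shape of the per-forge telemetry event; field names are
// illustrative, not the shipped schema.
interface RhythmTelemetryEvent {
  forgeId: string;
  loggedAt: string;                 // ISO 8601 timestamp
  compressionRatio: number | null;  // null when there was nothing to compare
  chorusCompressed: boolean | null; // the headline tri-state
  rhymeSchemes: string[];           // per-section, e.g. ["AABB", "ABAB"]
  internalRhymeDensity: string;     // banded label, "None" through "Dense"
}
```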
The honest constraint
We don't yet know if "structurally compressed chorus" actually correlates with "human-memorable chorus." The reviewer's intuition says yes. Our intuition says yes. But intuition is what gave us "verses feel lived, choruses feel emotionally correct" — a reviewer caught that, not the rubric. So we're going to wait for the data.
If 14 days of telemetry shows that compression ratio doesn't predict the recall scores from our delayed-recall test (a separate Phase-1-shipped instrument that asks Haiku to recall a chorus from working memory after a distractor), then the bet was wrong. The rubric stays at 12 metrics, the rhythmic dimension stays as a writer-facing signal, and we publish the negative result. That's the discipline.
If the data confirms the bet, Phase 4 wires it. The rubric grows a sub-signal inside M11. Honest disclosure stays the same.
Why this is in the open standard
The Lyric Scoring Standard is CC-BY-4.0. Every score reproducible. Every disclosure honest. The rhythmic dimension is now part of that.
If you want to verify any of this against your own lyric output, the analyzer is open and the calibration corpus is checked into the repo. If you build a competing scorer, your numbers should land in the same bands ours do for the same inputs. If they don't, one of us is calibrating wrong, and the published thresholds + corpus mean the disagreement is resolvable.
That's the whole point of the open standard. Subjective judgment is allowed. Mystery is not.
What ships next
- Phase 4 (data-gated). Wire compression ratio + recall score as M11 sub-signals once 14 days of production telemetry calibrate the thresholds.
- Phase 5 (precision lift). CMU-dict integration for absolute syllable + stress accuracy. Required for cross-language calibration as RFC-0009 Phase 2 ships.
- Phase 6 (per-genre thresholds). A bloated-chorus signal probably means one thing in country and a different thing in opera. Genre-keyed thresholds when the data supports them.
None of these get speculative ship dates. They get ratchets and telemetry gates. That's how we keep the standard honest.
For songwriters who care about the why
The shortest version: your chorus is shorter than your verses, or your chorus isn't doing its job. A chorus that bears the same syllable load as the surrounding verses isn't a chorus, it's another verse with a label.
You don't need an analyzer to feel this. Read your draft out loud. Read the verse, then the chorus, then the verse again. If the chorus doesn't feel like a release—if it doesn't compress, doesn't accelerate, doesn't land—you wrote a verse where a chorus belonged.
The analyzer just tells you the math behind the feeling. Sometimes that's useful. Sometimes the read-out-loud test is faster.
— Built by one operator + Claude Opus 4.7. Read the standard.