People ask what actually happens when they click Forge. Most AI lyric tools treat this as a trade secret. We don’t — the whole point of publishing the Lyric Scoring Standard was to be legible. This post does the same thing for the generation pipeline itself: one prompt, seven phases, two scoring passes, one final song. Traced end to end.

The example is a real prompt I ran while writing this: “a heartbreak on a Tuesday.” Country, female vocal. What follows is what the engine did, in order, from the click to the page.

Phase 1 — SuperPrompt (concept expansion)

The prompt you typed is rarely the prompt the forge sees. A 5-word prompt (“a heartbreak on a Tuesday”) carries a scene but no angle. Before anything else, SuperPrompt runs — a small 20-expert panel that rewrites your prompt into a richer brief.

For “a heartbreak on a Tuesday,” SuperPrompt came back with something like:

A woman sitting at her kitchen table on a Tuesday morning realizes the relationship is already over — not because of a fight the night before, but because she notices she’s relieved he’s not there. The song is about the quiet, unceremonious nature of when love actually ends, and how ordinary weekdays are the setting for most endings. Country, narrator-first, female vocal, conversational register.

That’s what the forge panel actually works from. The original prompt is never thrown away — SuperPrompt is additive — but the expanded brief is what gets handed to phase two.
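
To make "additive" concrete, here is a minimal sketch of a brief that carries both versions of the prompt. The `ForgeBrief` type and both function names are hypothetical, invented for illustration; they are not the engine's actual types.

```typescript
// Hypothetical shape; these names are illustrative, not the engine's own.
interface ForgeBrief {
  originalPrompt: string; // what the user typed, kept verbatim
  expandedBrief: string;  // SuperPrompt's richer rewrite
}

// Additive expansion: the original prompt rides along with the new brief.
function expandPrompt(originalPrompt: string, expandedBrief: string): ForgeBrief {
  return { originalPrompt, expandedBrief };
}

// Text handed to the next phase: expanded brief first, original prompt preserved.
function toForgeInput(brief: ForgeBrief): string {
  return `${brief.expandedBrief}\n\nOriginal prompt: ${brief.originalPrompt}`;
}
```

Keeping the typed prompt as its own field means any later phase can check its work against what the user actually asked for.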

Phase 2 — The forge (seven-voice war room)

The forge proper is a single Claude call running a 50-voice “war room” prompt. Seven voices from that panel drive this particular song (the selection is genre-weighted: a country prompt pulls more country-native writers). They each propose directions, a lead voice synthesizes, and a constraints layer (no banned cliches, no generic imagery, named specificity required) polices the output.
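
One way to picture genre-weighted selection is a ranking that puts genre-native voices first. The sketch below is an assumption throughout: the `Voice` type, the 2-versus-1 weights, and the deterministic top-N cut are all made up for illustration, and the real selection may well sample rather than rank.

```typescript
// Hypothetical types and weights; the real panel and its weighting are not shown in this post.
interface Voice {
  name: string;
  genres: string[]; // genres this voice is native to
}

// Rank genre-native voices ahead of the rest, then take the first `count`.
// Ties keep panel order so the result is deterministic.
function selectVoices(panel: Voice[], genre: string, count: number): Voice[] {
  return panel
    .map((voice, index) => ({
      voice,
      index,
      weight: voice.genres.indexOf(genre) !== -1 ? 2 : 1, // assumed weights
    }))
    .sort((a, b) => b.weight - a.weight || a.index - b.index)
    .slice(0, count)
    .map((entry) => entry.voice);
}
```

Calling `selectVoices(panel, "country", 7)` on a 50-voice panel would return the country-native voices first and fill any remaining seats in panel order.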

The draft that came back for our Tuesday prompt was 40 lines across two verses, two choruses, a bridge, and an outro. A title: “Tuesday Coffee.” A hook: “I poured you coffee out of habit / the mug’s still on the counter now.”

The first-draft hook is already doing what the rubric rewards. “Mug” is named. “Habit” is a concrete behavior. The narrator voice (“I poured”) is first-person. No cliches in the line.

Phase 3 — First scoring pass (Deep Eval)

Immediately after the forge, the song hits the 12-metric rubric. An 8-voice panel scores each metric 0-100, provides reasoning, and flags specific lines as “wounds” (metric failures) or “transcendent” (top-decile hits).

For “Tuesday Coffee” the first eval came back at composite 71. The tier breakdown:

Craft: 74 — prosody solid, rhyme scheme tight, a couple of economy violations in verse two.
Expression: 70 — specificity strong (coffee mug, the counter, a specific Tuesday), but originality flagged because “I’m fine now” in the bridge was too close to a hundred other country songs.
Impact: 69 — the chorus landed; the bridge didn’t reframe anything new; the song ended where it started.
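
The composite is consistent with a plain average of the three tiers: (74 + 70 + 69) / 3 = 71. The sketch below assumes exactly that, an unweighted mean rounded to the nearest integer. The real rubric may weight tiers or metrics differently, and the type and function names are hypothetical.

```typescript
// Hypothetical names; the three tiers are the ones listed above.
interface TierScores {
  craft: number;
  expression: number;
  impact: number;
}

// Assumption: composite = unweighted mean of the tiers, rounded to the nearest integer.
function compositeScore(tiers: TierScores): number {
  return Math.round((tiers.craft + tiers.expression + tiers.impact) / 3);
}

// First pass for "Tuesday Coffee": 213 / 3 = 71.
const firstPassComposite = compositeScore({ craft: 74, expression: 70, impact: 69 });
```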

Wounds the evaluator flagged:

Verse 2, line 3: “I didn’t cry when you left me” — generic, hedge-heavy.
Bridge line 2: “I’m fine now” — flagged as cliche; overused in the genre.
Chorus line 4: “And I realize it’s over” — tells the listener what the song has already shown them.

This is a reasonable first draft. 71 is above the 50 default but not yet in demo-credible territory. The gauntlet’s job is to close that gap.

Phase 4 — The gauntlet (targeted refinement)

The gauntlet takes the wounds from phase three and routes them through a fix-panel that’s prompted specifically to repair the flagged metrics. It doesn’t rewrite the whole song, because that would destroy the parts that worked. It does surgery on named lines.

For our three wounds:

“I didn’t cry when you left me” became “I waited for the crying and it never came.” Same emotional content; specific and active instead of generic and passive.
“I’m fine now” became “I’m still here on Tuesday.” Anchors the bridge back to the song’s concrete frame. Not cliche.
“And I realize it’s over” became “The coffee went cold and I didn’t make more.” Shows rather than tells.

Three lines changed. Everything else survived the gauntlet. That’s by design — most revisions should be local. The frame, the hook, the structure all earned their keep in phase two.
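
The "local revision" idea can be sketched as a line-level patch: swap the flagged lines, pass everything else through. This is an illustration only. `LineFix` and `applyFixes` are hypothetical names, and matching by exact text is a simplification of however the engine really addresses lines.

```typescript
// Hypothetical shape for a gauntlet repair: swap one flagged line for its fix.
interface LineFix {
  original: string;
  replacement: string;
}

// Replace only the flagged lines; every other line passes through untouched.
// Matching is by exact text, so a flagged line that repeats (a chorus, say)
// is fixed everywhere it appears. That is a choice of this sketch.
function applyFixes(lyrics: readonly string[], fixes: readonly LineFix[]): string[] {
  return lyrics.map((line) => {
    const fix = fixes.find((candidate) => candidate.original === line);
    return fix ? fix.replacement : line;
  });
}

// The three repairs described above.
const tuesdayFixes: LineFix[] = [
  { original: "I didn't cry when you left me", replacement: "I waited for the crying and it never came" },
  { original: "I'm fine now", replacement: "I'm still here on Tuesday" },
  { original: "And I realize it's over", replacement: "The coffee went cold and I didn't make more" },
];
```

Because `applyFixes` returns a new array and never mutates its input, the pre-gauntlet draft stays available for comparison.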

Phase 5 — Second scoring pass

The revised song goes back to the same 8-voice evaluator. This is not optional and it’s not cosmetic — the gauntlet is allowed to lose. If the revised song scores worse on any axis, the engine keeps the phase-two version. (See the multi-axis revert logic in forge-types.ts if you want the detail.)

In this case the re-score came back at composite 78. Up seven points. The tier moves:

Craft: 74 → 76 (economy cleaned up)
Expression: 70 → 79 (originality lifted hard when the cliche bridge went)
Impact: 69 → 78 (the new bridge and chorus endings give the song an arc)
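
The revert rule reduces to one comparison per axis. The sketch below is not the code in forge-types.ts. It assumes the axes are the three tiers (the real check may run over individual metrics) and that a tie on an axis counts as holding, and its names are hypothetical.

```typescript
// Assumption: the axes are the three tiers; names here are illustrative.
type Axis = "craft" | "expression" | "impact";
type AxisScores = Record<Axis, number>;

const AXES: Axis[] = ["craft", "expression", "impact"];

// True only if no axis got worse; a single regression rejects the revision.
function revisionHolds(before: AxisScores, after: AxisScores): boolean {
  return AXES.every((axis) => after[axis] >= before[axis]);
}

// Keep the revised version only when the revision holds on every axis.
function chooseVersion<T>(draft: T, revised: T, before: AxisScores, after: AxisScores): T {
  return revisionHolds(before, after) ? revised : draft;
}

const firstPassTiers: AxisScores = { craft: 74, expression: 70, impact: 69 };
const secondPassTiers: AxisScores = { craft: 76, expression: 79, impact: 78 };
// Every axis improved for "Tuesday Coffee", so the revised song is kept.
const revisedKept = revisionHolds(firstPassTiers, secondPassTiers);
```

Note that under this rule a revision that raised the composite but dropped Craft by one point would still be rejected.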

A 78 on the rubric is roughly the top 8% of forged output. The percentile anchor communicates that better than the raw number does. A 78 is demo-credible territory: a song a working writer might take into a session.
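
A percentile anchor of this kind can be computed as the share of past scores at or above a given score. The function below is a sketch under that assumption; `topPercent` is a hypothetical name, and the engine may define its percentile differently.

```typescript
// Share of the population scoring at or above `score`, as a percentage.
// "Top 8%" would mean this returns roughly 8 for a score of 78.
function topPercent(score: number, population: readonly number[]): number {
  if (population.length === 0) {
    throw new RangeError("population must not be empty");
  }
  const atOrAbove = population.filter((other) => other >= score).length;
  return (atOrAbove * 100) / population.length;
}
```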

Phase 6 — SuperStyle (Suno-ready string assembly)

Separately from the lyrics, the engine also produces a Suno style string — the comma-separated descriptor Suno uses to shape the audio. Things like “country, female vocal, fingerpicked acoustic, sparse drums, emotive vocal performance, 80 bpm.”

SuperStyle runs on Claude Haiku (a smaller, faster model — this is a formatting task, not a creative one). It takes the final lyrics and the genre context and produces a style string tuned to the song’s mood. For our Tuesday song: “country, female vocal, fingerpicked acoustic, brushed drums, Kacey Musgraves-adjacent, conversational delivery, 84 bpm, intimate room tone.”

You paste that string into Suno, along with the lyrics, and what comes out is shaped instead of generic.
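
The assembly step itself is simple string work once the descriptors are chosen. The sketch below shows only that last step: trim, drop duplicates, join with commas. Choosing the descriptors is the model's job and is not shown, and `buildStyleString` is a hypothetical name.

```typescript
// Join descriptors into a comma-separated style string,
// trimming whitespace and dropping empty or duplicate entries.
function buildStyleString(descriptors: readonly string[]): string {
  const seen = new Set<string>();
  const cleaned: string[] = [];
  for (const raw of descriptors) {
    const descriptor = raw.trim();
    const key = descriptor.toLowerCase(); // case-insensitive duplicate check
    if (descriptor.length === 0 || seen.has(key)) continue;
    seen.add(key);
    cleaned.push(descriptor);
  }
  return cleaned.join(", ");
}

// The descriptors from the Tuesday song's style string.
const tuesdayStyle = buildStyleString([
  "country",
  "female vocal",
  "fingerpicked acoustic",
  "brushed drums",
  "Kacey Musgraves-adjacent",
  "conversational delivery",
  "84 bpm",
  "intimate room tone",
]);
```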

Phase 7 — Persist + surface

The song saves to Supabase with its final composite score, its metric breakdown, its wounds and transcendent lines, its style string, and a generated cover-art concept. The page re-renders with a before/after score strip if you had already composed or pasted in lyrics, or with the radio player if you chose to auto-spotlight the forge. If the composite score clears the leaderboard threshold and the song is public, it shows up in the weekly top-10.
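
The saved record and the leaderboard check can be sketched as a type plus one predicate. Every field name below is hypothetical (the real Supabase schema is not shown in this post), the threshold is passed in because its value is not stated here, and treating "clears" as greater-than-or-equal is an assumption.

```typescript
// Hypothetical record shape mirroring the fields listed above.
interface SongRecord {
  title: string;
  compositeScore: number;
  metricBreakdown: Record<string, number>;
  wounds: string[];
  transcendentLines: string[];
  styleString: string;
  coverArtConcept: string;
  isPublic: boolean;
}

// A song is leaderboard-eligible only if it is public AND clears the threshold.
// Assumption: "clears" means greater than or equal to the threshold.
function isLeaderboardEligible(song: SongRecord, threshold: number): boolean {
  return song.isPublic && song.compositeScore >= threshold;
}
```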

The whole pipeline took about 90 seconds for this song. Most of that is in phase two (the forge itself) and phase five (the second eval). SuperPrompt is fast. The gauntlet runs in parallel with UI updates.

What the walkthrough actually shows

Three things this makes visible that a black-box AI song tool never would:

1. The revision is where the quality lives. A 71 first draft and a 78 final is a seven-point lift, but the lift happens on three lines. Specific lines. Named. Targeted. This is how songwriters work, not how language models work by default.

2. The rubric is load-bearing. The gauntlet doesn’t rewrite randomly — it rewrites against the flagged metric wounds. Without the scoring pass, there’s nothing to target. This is why we published the Lyric Scoring Standard: the rubric IS the refinement signal.

3. The engine loses sometimes. The multi-axis revert is not theater. If the gauntlet’s rewrite scores worse on even one axis, the engine reverts. The pre-gauntlet snapshot is kept for exactly this reason. (See the B1149 extraction if you want the code.)

Anatomy of a Forge: one song from prompt to final score