The lyric AI that gets sharper for you with every song
Most lyric tools are templates: they produce the same average output for everyone. SongForgeAI now ships a per-user weakness profile that is injected into every forge as an active bias, steering each generation toward that writer's specific blind spots. The compounding loop, end to end.
The headline
Most AI lyric tools — and most generative AI products full stop — are templates. They produce the same average output for everyone. The model knows craft in general; it does not know your craft, your weaknesses, your patterns.
As of Build 1530, SongForgeAI does. The platform now reads each user's catalog, computes a per-user weakness profile, renders that profile into a ~250-token prompt directive, and injects it into every new forge. The next song the user writes is biased toward addressing the dimensions where their previous songs scored lowest, while explicitly preserving the dimensions they already do well.
Build 1531 added the user-facing dashboard at /dashboard/insights so the user can see what's getting injected. Both halves of the loop — the WRITE side (auto-injection) and the READ side (visibility) — are now in production.
This post is the principle, the mechanism, and the proof.
The principle
A human professional songwriter improves with every song they write. Their craft compounds. Their weaknesses get smaller; their strengths get stronger. They build a personal style.
An AI tool that produces the same average output regardless of who is using it does not compound. It is a generator, not a trainer. The user gets better despite the tool, not because of it.
The compounding loop is the thing that turns a generator into a trainer. Without it, a lyric tool is forever a template. With it, the tool becomes useful in ways a template can't be — because every interaction makes the next one sharper for that specific user.
This is the kind of feature that's easy to claim and hard to actually build. The claim "AI that learns from you" appears on approximately every product page on the internet. The implementation that backs it up is rare. Most "personalization" is a checkbox preference (genre, voice, vibe). The compounding loop is something different: catalog-wide pattern detection feeding directly into the next generation as a structured prompt directive.
The mechanism
Four layers, deliberately separated:
1. Pure compute (src/lib/user-weakness-profile.ts)
Pure function. Takes a list of past scored songs (forge_score + per-metric eval data + flagged wounds). Returns a structured profile:
- weakestMetrics: bottom 3 metrics by avg score across the catalog (with a sample-size guard: minimum 5 song appearances)
- strongMetrics: top 3 metrics by avg score
- chronicWounds: wound families that appeared in >30% of past songs (wound texts grouped by 5-word prefix; same wound phrased differently still counts)
- overallAvg: catalog-wide composite (sanity reference)
- songsAnalyzed: how many songs contributed
14 unit tests lock the compute behavior. Returns an empty profile when songsAnalyzed < 5 — new users don't get a directive until there's signal worth biasing toward.
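A minimal sketch of the compute under those rules. The shapes and names here (ScoredSong, WeaknessProfile, computeWeaknessProfile) are illustrative stand-ins, not the real module's exports:

```typescript
// Hypothetical shapes assumed from the description above.
interface ScoredSong {
  forgeScore: number;                   // composite score for the song
  metricScores: Record<string, number>; // metric name -> 0..100
  wounds: string[];                     // flagged wound texts
}

interface WeaknessProfile {
  weakestMetrics: { metric: string; avg: number }[];
  strongMetrics: { metric: string; avg: number }[];
  chronicWounds: string[];
  overallAvg: number;
  songsAnalyzed: number;
}

const MIN_SONGS = 5;           // no directive until there's signal
const MIN_METRIC_SAMPLES = 5;  // a metric needs >= 5 song appearances
const CHRONIC_THRESHOLD = 0.3; // wound family recurs in > 30% of songs

export function computeWeaknessProfile(songs: ScoredSong[]): WeaknessProfile {
  if (songs.length < MIN_SONGS) {
    // New users: empty profile, no bias yet.
    return { weakestMetrics: [], strongMetrics: [], chronicWounds: [], overallAvg: 0, songsAnalyzed: songs.length };
  }

  // Average each metric across the catalog, with the sample-size guard.
  const sums = new Map<string, { total: number; n: number }>();
  for (const song of songs) {
    for (const [metric, score] of Object.entries(song.metricScores)) {
      const s = sums.get(metric) ?? { total: 0, n: 0 };
      s.total += score;
      s.n += 1;
      sums.set(metric, s);
    }
  }
  const averaged = [...sums.entries()]
    .filter(([, s]) => s.n >= MIN_METRIC_SAMPLES)
    .map(([metric, s]) => ({ metric, avg: s.total / s.n }))
    .sort((a, b) => a.avg - b.avg); // ascending: weakest first

  // Group wounds into families by 5-word prefix so rephrasings still count.
  const familyKey = (w: string) => w.toLowerCase().split(/\s+/).slice(0, 5).join(" ");
  const families = new Map<string, { text: string; n: number }>();
  for (const song of songs) {
    const seen = new Set<string>();
    for (const wound of song.wounds) {
      const key = familyKey(wound);
      if (seen.has(key)) continue; // count a family once per song
      seen.add(key);
      const f = families.get(key) ?? { text: wound, n: 0 };
      f.n += 1;
      families.set(key, f);
    }
  }
  const chronicWounds = [...families.values()]
    .filter((f) => f.n / songs.length > CHRONIC_THRESHOLD)
    .map((f) => f.text);

  const overallAvg = songs.reduce((t, s) => t + s.forgeScore, 0) / songs.length;

  return {
    weakestMetrics: averaged.slice(0, 3),
    strongMetrics: averaged.slice(-3).reverse(), // strongest first
    chronicWounds,
    overallAvg,
    songsAnalyzed: songs.length,
  };
}
```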
2. Pure prompt fragment (src/lib/claude/weakness-directive.ts)
Pure function. Takes a profile, returns a string. The string is the directive that gets injected into the war-room user prompt. Example output:
PERSONAL CRAFT BIAS — adjust this forge to address user's
documented weaknesses without sacrificing their strengths.
WEAKEST METRICS (across 47 scored songs):
- Sensory Specificity: avg 62/100 — push concrete detail
- Emotional Truth: avg 67/100 — let the wound show
- Hook Strength: avg 69/100 — earn the chorus
CHRONIC WOUNDS (recur in >30% of past songs):
- "weak line ending pattern on conjunctions"
- "abstract feeling without anchor"
PRESERVE STRENGTHS:
- Imagery: avg 84
- Rhyme & Meter: avg 81
- Structure: avg 80
Bias your craft choices toward addressing the weak metrics +
breaking the chronic patterns. Do NOT reduce the strengths.
18 unit tests + a token-budget guard (<2000 chars). Returns empty string when the profile lacks signal — users with no signal get byte-identical pre-bias prompts (this keeps the 12-snapshot golden-eval prompt-drift CI gate green by construction).
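A sketch of the renderer under those constraints, reusing the hypothetical WeaknessProfile shape from the sketch above; the exact wording and per-metric coaching hints in the real directive will differ:

```typescript
const MAX_DIRECTIVE_CHARS = 2000; // token-budget guard

export function renderWeaknessDirective(p: WeaknessProfile): string {
  // No signal -> empty string, so pre-bias prompts stay byte-identical.
  if (p.songsAnalyzed < 5 || p.weakestMetrics.length === 0) return "";

  const weak = p.weakestMetrics
    .map((m) => `- ${m.metric}: avg ${Math.round(m.avg)}/100`)
    .join("\n");
  const strong = p.strongMetrics
    .map((m) => `- ${m.metric}: avg ${Math.round(m.avg)}`)
    .join("\n");
  const wounds = p.chronicWounds.map((w) => `- "${w}"`).join("\n");

  const directive = [
    "PERSONAL CRAFT BIAS — adjust this forge to address user's",
    "documented weaknesses without sacrificing their strengths.",
    `WEAKEST METRICS (across ${p.songsAnalyzed} scored songs):`,
    weak,
    wounds ? `CHRONIC WOUNDS (recur in >30% of past songs):\n${wounds}` : "",
    "PRESERVE STRENGTHS:",
    strong,
    "Bias your craft choices toward addressing the weak metrics +",
    "breaking the chronic patterns. Do NOT reduce the strengths.",
  ].filter(Boolean).join("\n");

  // Hard cap so the directive never blows the prompt token budget.
  return directive.length <= MAX_DIRECTIVE_CHARS
    ? directive
    : directive.slice(0, MAX_DIRECTIVE_CHARS);
}
```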
3. Server bridge (src/app/api/songs/forge/fetch-weakness-bias.ts)
The I/O layer. Pulls the user's last 200 scored songs from Supabase, computes the profile, renders the directive. Bounded by a 250ms hard timeout via Promise.race, so a slow DB query never delays a forge response. If the timeout fires, the bridge returns an empty string and the forge runs unbiased; the bias simply kicks in on the next forge.
Skipped in batch mode (leanBatch): batch is throughput-optimized, and an up-to-250ms-per-song overhead would add up to 2.5s across a 10-song batch.
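A sketch of the timeout pattern. fetchRecentScoredSongs is a hypothetical Supabase query helper; the compute and render functions are carried over from the sketches above:

```typescript
// Hypothetical query helper; assumed, not the real module's export.
declare function fetchRecentScoredSongs(
  userId: string,
  limit: number
): Promise<ScoredSong[]>;

const BIAS_TIMEOUT_MS = 250;

export async function fetchWeaknessBias(
  userId: string,
  opts: { leanBatch?: boolean } = {}
): Promise<string> {
  // Batch mode is throughput-optimized: skip the bias lookup entirely.
  if (opts.leanBatch) return "";

  // Errors degrade to an unbiased forge, never a failed one.
  const compute = (async () => {
    const songs = await fetchRecentScoredSongs(userId, 200);
    return renderWeaknessDirective(computeWeaknessProfile(songs));
  })().catch(() => "");

  // Hard ceiling: if the DB is slow, lose the race and forge unbiased.
  const timeout = new Promise<string>((resolve) =>
    setTimeout(() => resolve(""), BIAS_TIMEOUT_MS)
  );

  return Promise.race([compute, timeout]);
}
```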
4. Forge wiring (src/lib/claude/forge-stream.ts)
The injection point. The directive (when non-empty) gets appended to the user prompt right after the language fragment. Empty-string default preserves byte-identical pre-1530 prompts for any code path that doesn't pass the bias.
Why the user prompt, not the system prompt? Per-user content would bust the system-prompt cache on every request; keeping it in the user prompt leaves the cached system prefix stable. Cache stability matters for cost and latency at scale.
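A sketch of the injection point; buildUserPrompt is a hypothetical stand-in for the forge-stream prompt assembly:

```typescript
// Hypothetical assembly step, not the real forge-stream signature.
function buildUserPrompt(
  base: string,
  languageFragment: string,
  weaknessDirective = "" // empty default keeps pre-1530 prompts byte-identical
): string {
  // The directive lives in the USER prompt, right after the language
  // fragment, so per-user content never varies the cached system prompt.
  const parts = [base, languageFragment];
  if (weaknessDirective) parts.push(weaknessDirective);
  return parts.join("\n\n");
}
```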
The visibility surface
Build 1531 added /dashboard/insights — the read side of the loop. The page surfaces the SAME compute that the forge bias uses. Three render states:
- No data: first-forge prompt with a CTA to /forge
- Insufficient data: "Forge 5+ scored songs to unlock insights"
- Rich profile: sample-size headline + weakest metrics cards + strong metrics grid + chronic wounds list + a violet/cyan callout explaining how the data shapes the next forge + collapsible raw-directive viewer
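The three states reduce to a small branch on the same profile; a sketch, with state names illustrative and thresholds mirroring the compute's 5-song minimum:

```typescript
type InsightsState = "no-data" | "insufficient" | "rich";

function insightsState(profile: WeaknessProfile): InsightsState {
  if (profile.songsAnalyzed === 0) return "no-data";    // first-forge CTA
  if (profile.songsAnalyzed < 5) return "insufficient"; // "Forge 5+ scored songs"
  return "rich";                                        // full profile + raw directive
}
```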
The collapsible raw directive shows the user the EXACT text injected into their forges. Full transparency. No black box.
The move was deliberate: closing the loop visually turns the compounding from "trust us, it's working" into "you can see it work." Most tools that claim personalization don't show you what's been personalized. This one does.
What this is NOT
Not a fine-tune. The base Claude model is unchanged. The bias is prompt-level — a structured directive injected per-request. No training, no per-user weights, no model storage. The compute happens at forge time from the user's catalog.
Not a preference. The user doesn't choose what gets biased. The system reads the catalog and identifies the weak dimensions automatically. Preferences are surface-level (genre, voice). This is craft-level (where your songwriting actually struggles).
Not magic. The directive is ~250 tokens. The model still produces what the model can produce. What changes is the brief — the model is told "focus on these dimensions for this user" instead of receiving the same default brief everyone else gets. Better brief, better output.
Not opaque. Every user can read the exact directive that's being injected on their behalf. The compute is auditable, the injection point is documented, the rendered text is downloadable.
What this enables
Compounding. The user's first 5 songs prime the bias. Their next 50 songs sharpen it. Their next 500 songs make it specific to them in a way no template can match.
Honest improvement claims. The platform can now say "you'll improve with us" and back it up with a measurable curve — weakness-metric averages should rise over time as the bias does its work. Future builds will surface that curve as a per-user trajectory chart.
Differentiation that can't be copied by adding a feature. A competitor who wants to ship the same loop has to build the catalog read + the eval data structure + the wound family detection + the prompt directive shape + the bias injection point + the safety timeout + the visibility surface + the empty-state handling. Each piece is small. The composition is the moat.
The honest limits
The bias only works for users with a catalog. New users get the unbiased default until they've forged ~5 songs with eval data. Until then the platform IS a generator for them — same average output as everyone else. The compounding starts on song 6.
The bias only operates on the metrics the rubric measures. If your craft weakness is in a dimension the 12-metric rubric doesn't cover, the bias won't catch it. Adding new metrics to the rubric (B1485 Stolpe Index, B1482 POV stability, B1483 Rhyme Complexity, B1484 Singability stress, B1473 Emergent Cliché) compounds the bias's coverage — every new analyzer expands what the system can train on.
The bias depends on your eval being honest. If the rubric grades a song generously, the bias points the next song at the wrong weakness. The Burden of Proof + Gravity Rule + Antagonist Ceiling + Historical Context Anchor + Anti-Platitude rules in the eval prompt are the defenses; they're not perfect.
The ask
If you have lyrics you've been writing, whether copied from a notebook, drafted in another tool, or anything else, paste them into the refine surface and let the platform read your catalog. After 5 scored songs, your insights page lights up and the bias starts shaping every new forge toward your specific craft.
The platform compounds with you. That's the whole pitch.