Lyric Scoring — Model Card
The honest mechanical truth of how a SongForgeAI score is produced. Models, temperatures, prompt provenance, and limitations — the companion document to the rubric itself.
Scoring pipeline (the official path)
Every published composite score is produced by eval-stream. The consensus + batch rows below are alternative entry points used by internal tooling — same rubric, different runtime profile.
| Model | Temp | Role | Source |
|---|---|---|---|
| claude-sonnet-4-20250514 | 0.7 | 12-metric deep evaluation. Assigns the composite score, surfaces wounds and transcendent lines. | src/lib/claude/eval.ts + src/skills/lyric-evaluator.md |
| claude-sonnet-4-20250514 | 0.5 | Conservative second pass at lower temperature. Used to detect score instability. | src/lib/claude/eval.ts (consensus block) |
| claude-haiku-4-5-20251001 | 0.7 | Fast non-streaming eval used by batch tooling. Same rubric, smaller model. Not the official score path. | src/lib/claude/eval.ts |
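For reviewers who want the call shape, here is a minimal sketch of what an eval-stream invocation looks like, assuming the @anthropic-ai/sdk client with the rubric skill inlined as the system prompt. The evalStream name and the file-read step are illustrative, not the actual src/lib/claude/eval.ts code.

```ts
// Hypothetical sketch of the official eval-stream call. Assumes the
// ANTHROPIC_API_KEY environment variable is set.
import Anthropic from "@anthropic-ai/sdk";
import { readFile } from "node:fs/promises";

const client = new Anthropic();

export async function evalStream(lyric: string): Promise<string> {
  // The rubric skill becomes the system prompt (illustrative load path).
  const rubric = await readFile("src/skills/lyric-evaluator.md", "utf8");

  const stream = client.messages.stream({
    model: "claude-sonnet-4-20250514", // official score path: always Sonnet
    temperature: 0.7,                  // fixed for the eval-stream task
    max_tokens: 4096,
    system: rubric,
    messages: [{ role: "user", content: lyric }],
  });

  const final = await stream.finalMessage();
  const block = final.content[0];
  return block.type === "text" ? block.text : "";
}
```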
Supporting pipeline (context only)
These tasks generate or refine the lyric being scored. They are not part of the scoring contract but are documented for full pipeline reproducibility.
| Model | Temp | Role | Source |
|---|---|---|---|
| claude-sonnet-4-20250514 | 0.85–1.0 (voltage-mapped) | Generates the original lyric. Not part of scoring; included so reviewers can see the full pipeline. | src/lib/claude/forge-stream.ts + src/skills/hit-song-forge.md |
| claude-sonnet-4-20250514 | 0.85 | Targeted lyric surgery against weak metrics. Produces a refined draft that gets re-scored. | src/lib/claude/gauntlet-stream.ts |
| claude-haiku-4-5-20251001 | 0.5 | Self-critique of gauntlet output. Grades whether each fix actually addressed its targeted wound. | src/lib/claude/gauntlet-self-critique.ts |
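The voltage-mapped row above implies a mapping from the user-facing slider into the 0.85–1.0 range. A minimal sketch, assuming a slider normalized to [0, 1] and a linear map; the real mapping in src/lib/claude/forge-stream.ts may differ.

```ts
// Illustrative voltage → temperature mapping (assumed linear over [0, 1]).
function voltageToTemperature(voltage: number): number {
  const v = Math.min(1, Math.max(0, voltage)); // clamp to [0, 1]
  return 0.85 + v * 0.15;                      // 0 → 0.85, 1 → 1.0
}
```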
Why the temperatures are what they are
Eval @ 0.7. Lower than forge (which runs hot for creative variance), higher than a classifier (which would run near 0). Scoring needs enough temperature to use the panel-of-judges framing in the eval skill, but not so much that two consecutive calls disagree by 5+ points.
Consensus @ 0.5. The "honest floor" pass. If the consensus number drops sharply from the headline number, the eval flags the score as unstable rather than reporting a confident high.
Forge @ 0.85–1.0. Mapped from the user-facing voltage slider. Generation needs creative variance; scoring does not.
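The "honest floor" logic under Consensus @ 0.5 reduces to a simple comparison between the two passes. A sketch, assuming a 5-point instability threshold taken from the "disagree by 5+ points" guidance above; the actual cutoff lives in the consensus block of src/lib/claude/eval.ts.

```ts
// Sketch of the score-instability check described above.
interface ConsensusResult {
  score: number;   // headline composite from the 0.7 pass
  stable: boolean; // false when the 0.5 floor pass drops sharply
  floor?: number;  // consensus score, reported when unstable
}

function reconcile(headline: number, consensus: number): ConsensusResult {
  const UNSTABLE_DELTA = 5; // assumed threshold, not the shipped value
  if (headline - consensus >= UNSTABLE_DELTA) {
    // Flag as unstable rather than reporting a confident high.
    return { score: headline, stable: false, floor: consensus };
  }
  return { score: headline, stable: true };
}
```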
Reproducibility fields (the seal)
Every score response carries these fields so a third party can verify which rubric version produced it — wired into /api/v1/score as the reproducibility seal.
| Field | Value | Why |
|---|---|---|
| rubricVersion | "1.1.0" | Pinned at the top of every scored response (pending #36 wire-up). |
| model | "claude-sonnet-4-20250514" | The eval-stream model. Always Sonnet for the official score path. |
| temperature | 0.7 | Fixed for the eval-stream task. Lowered to 0.5 for consensus pass. |
| publishedAt | "2026-04-25" | Date the rubric version was published. |
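In TypeScript terms, the seal reduces to a small interface mirroring the table. The type name and the idea that /api/v1/score returns exactly these four fields are assumptions; the live response may carry more.

```ts
// Assumed shape of the reproducibility seal described in the table above.
interface ReproducibilitySeal {
  rubricVersion: string; // e.g. "1.1.0", pinned per scored response
  model: string;         // always Sonnet on the official score path
  temperature: number;   // 0.7 (eval-stream) or 0.5 (consensus pass)
  publishedAt: string;   // date the rubric version was published
}

const seal: ReproducibilitySeal = {
  rubricVersion: "1.1.0",
  model: "claude-sonnet-4-20250514",
  temperature: 0.7,
  publishedAt: "2026-04-25",
};
```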
Known limitations
Model nondeterminism
Even at fixed temperature, the Anthropic API does not guarantee byte-identical output across calls. Expect ±1–2 points of score drift on the same lyric across repeated calls. The 12-metric breakdown is more stable than the single composite number.
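One way to quantify drift on a specific lyric is to score it repeatedly and look at the spread. scoreLyric below is a hypothetical wrapper over the eval call that parses out the composite number; it is not an exported helper in this codebase.

```ts
// Rough drift measurement: same lyric, fixed temperature, repeated calls.
async function measureDrift(
  lyric: string,
  scoreLyric: (l: string) => Promise<number>, // hypothetical wrapper
  runs = 5,
): Promise<{ scores: number[]; spread: number }> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await scoreLyric(lyric));
  }
  const spread = Math.max(...scores) - Math.min(...scores);
  // A spread much beyond ~2 points suggests an unusually unstable lyric.
  return { scores, spread };
}
```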
Genre coverage skew
The rubric was calibrated against pop, rock, country, hip-hop, folk, and R&B. Niche genres (drone, noise, spoken word) score against the same rubric, but the anti-inflation calibration was built only from that mainstream set.
Language coverage
English only at v1.0. Lyrics in other languages will be evaluated by Claude's multilingual capability, but the rubric prompt itself is in English. Scores on non-English lyrics should be treated as advisory until a v1.1 multilingual revision ships.
No human-in-the-loop
Every score is fully automated. There is no human reviewer, no manual override, no editorial pass. This is by design — the rubric must reproduce.
Self-evaluation limit
A model evaluating output it itself produced has a known bias toward self-favorability. We mitigate this with the anti-inflation rules in the prompt (Gravity / Burden of Proof / Antagonist Ceiling), but the bias is not zero.
The model registry is the source of truth.
If a value on this page disagrees with src/lib/model-registry.ts, trust the registry. This page is regenerated against it; drift is a bug worth filing.
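For illustration, the registry might export per-task model and temperature entries that this page is regenerated against. The export name and shape below are assumptions, not the actual src/lib/model-registry.ts.

```ts
// Hypothetical registry shape; the real module's exports may differ.
export const MODEL_REGISTRY = {
  evalStream: { model: "claude-sonnet-4-20250514", temperature: 0.7 },
  consensus:  { model: "claude-sonnet-4-20250514", temperature: 0.5 },
  batchEval:  { model: "claude-haiku-4-5-20251001", temperature: 0.7 },
} as const;

// If this page is regenerated from the registry, a mismatch here is the
// "drift is a bug worth filing" case.
const pageClaims = { model: "claude-sonnet-4-20250514", temperature: 0.7 };
console.assert(
  pageClaims.model === MODEL_REGISTRY.evalStream.model &&
    pageClaims.temperature === MODEL_REGISTRY.evalStream.temperature,
  "Model card has drifted from src/lib/model-registry.ts",
);
```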