Motivation
Today GPT-4o is used in exactly one place: the B1279 triangulation re-score. It produces a 12-metric scorecard on the same rubric the Sonnet primary already used; the divergence between the two becomes a cross-family corroboration signal. That answers ONE question — "do model families agree on the score?"
It does NOT answer a different and arguably more useful question: "what does a different model family actually NOTICE in this draft?" A second judge giving the same scorecard tells you about agreement; a second reader giving free-form craft notes tells you what the first reader missed.
This RFC proposes adding GPT-4o as a SECOND ROLE: a pre-gauntlet literary critic. Output shape is intentionally NOT another scorecard — it's per-line craft notes (cut / rewrite / preserve) plus an overall craft impression. The Sonnet gauntlet then incorporates or rejects each note as input to its existing decision-rule.
Proposal
Activation
- Off by default. Requires `SF_GPT4O_PREGAUNTLET=1` AND `OPENAI_API_KEY` set in the environment.
- When active, fires AFTER the initial Sonnet eval lands, BEFORE the gauntlet runs. Adds ~30 seconds + ~$0.04 per forge.
- Flag is independent of `SF_INTERNAL_EVAL_TRIANGULATION` (the re-scoring use). Both can be on simultaneously; both consume `OPENAI_API_KEY`.
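The activation rule above can be sketched as a small pure function. This is a minimal illustration, not the shipped gate: the function name `isPreGauntletCriticEnabled` and the `Env` type are hypothetical, and treating a whitespace-only key as unset is an assumption.

```typescript
// Hypothetical helper: the RFC names the two environment variables,
// but not this function or its signature.
type Env = Record<string, string | undefined>;

function isPreGauntletCriticEnabled(env: Env): boolean {
  // Off by default: only the exact value "1" turns the flag on.
  const flagOn = env.SF_GPT4O_PREGAUNTLET === "1";
  // Assumption: an empty or whitespace-only key counts as "not set".
  const hasKey = (env.OPENAI_API_KEY ?? "").trim().length > 0;
  return flagOn && hasKey;
}
```

In a Node process this would typically be called as `isPreGauntletCriticEnabled(process.env)`. Note that `SF_INTERNAL_EVAL_TRIANGULATION` is deliberately not consulted, since the two flags are independent.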
Critic contract (locked)
Returns a `CriticResult`:

```
interface CriticResult {
  overall: string;          // one-paragraph craft impression
  lines: CritiqueLine[];    // up to 12 per-line notes
  strengths: string[];      // 3 lines to preserve
  model: string;            // pinned: gpt-4o-2024-11-20
  durMs: number;
}

interface CritiqueLine {
  line: string;
  reason: string;
  action: 'cut' | 'rewrite';  // 'preserve' lives in strengths[]
}
```
Implementation: `runGpt4oCritic(lyrics, genre)` in `src/lib/triangulation/openai-critic.ts`. Failures are silent — a null return means the gauntlet runs as it does today.
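To make the "silent failure" behaviour concrete, here is one way the model's raw text could be checked against the contract before anything reaches the gauntlet. It is a sketch under assumptions: `parseCriticBody` and `ParseOutcome` are hypothetical names, the 12-note cap is enforced as a hard limit, and `model` / `durMs` are assumed to be filled in by the caller rather than by the model. The real `runGpt4oCritic` may validate differently.

```typescript
interface CritiqueLine { line: string; reason: string; action: "cut" | "rewrite"; }

// Only the fields the model itself is assumed to produce.
interface CriticBody { overall: string; lines: CritiqueLine[]; strengths: string[]; }

// Hypothetical result type: either a valid body or the telemetry event to log.
type ParseOutcome =
  | { ok: true; body: CriticBody }
  | { ok: false; event: "gpt4o_critic.parse_failed" | "gpt4o_critic.shape_invalid" };

function isCritiqueLine(v: unknown): v is CritiqueLine {
  if (typeof v !== "object" || v === null) return false;
  const o = v as Record<string, unknown>;
  return typeof o.line === "string"
    && typeof o.reason === "string"
    && (o.action === "cut" || o.action === "rewrite");
}

function parseCriticBody(raw: string): ParseOutcome {
  const invalid: ParseOutcome = { ok: false, event: "gpt4o_critic.shape_invalid" };
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    // Model returned non-JSON.
    return { ok: false, event: "gpt4o_critic.parse_failed" };
  }
  if (typeof data !== "object" || data === null) return invalid;
  const o = data as Record<string, unknown>;
  const overall = o.overall;
  const lines = o.lines;
  const strengths = o.strengths;
  if (typeof overall !== "string") return invalid;
  // Assumption: more than 12 notes is treated as a contract violation.
  if (!Array.isArray(lines) || lines.length > 12 || !lines.every(isCritiqueLine)) return invalid;
  if (!Array.isArray(strengths) || !strengths.every((s) => typeof s === "string")) return invalid;
  return {
    ok: true,
    body: { overall, lines: lines as CritiqueLine[], strengths: strengths as string[] },
  };
}
```

A caller would log `outcome.event` and return `null` whenever `outcome.ok` is false, so the gauntlet proceeds exactly as it does without the critic.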
How the Sonnet gauntlet uses it
When the critic returns non-null, the gauntlet prompt receives an additional section:
> CROSS-FAMILY CRITIQUE (advisory, not authoritative):
> A second model family (gpt-4o) read this draft. It flagged
> these lines as needing revision:
>   <line>: <reason> [action]
>   ...
> It identified these lines as load-bearing strengths to preserve:
>   <line>
>   ...
> Use this as input. Reject any flag that contradicts your own
> craft judgment. Cite which flags you accepted in your output
> notes.
The gauntlet decision-rule already requires the model to justify each cut + replacement; this just adds an external prior. The gauntlet remains the final authority — Sonnet's "main brain" decides accept or deny per the operator's framing.
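As an illustration of how the advisory section could be assembled from a `CriticResult`, the sketch below renders the template quoted above. The function name `buildCrossFamilySection`, the two-space indent, and the choice to return an empty string for a null critic are all assumptions; the production prompt builder may differ.

```typescript
interface CritiqueLine { line: string; reason: string; action: "cut" | "rewrite"; }
interface CriticResult {
  overall: string;
  lines: CritiqueLine[];
  strengths: string[];
  model: string;
  durMs: number;
}

// Hypothetical helper: returns "" when the critic produced nothing,
// so the gauntlet prompt is unchanged in that case.
function buildCrossFamilySection(critic: CriticResult | null): string {
  if (critic === null) return "";
  const parts: string[] = [
    "CROSS-FAMILY CRITIQUE (advisory, not authoritative):",
    "A second model family (gpt-4o) read this draft. It flagged",
    "these lines as needing revision:",
    ...critic.lines.map((l) => `  ${l.line}: ${l.reason} [${l.action}]`),
    "It identified these lines as load-bearing strengths to preserve:",
    ...critic.strengths.map((s) => `  ${s}`),
    "Use this as input. Reject any flag that contradicts your own",
    "craft judgment. Cite which flags you accepted in your output",
    "notes.",
  ];
  return parts.join("\n");
}
```

Keeping the builder pure (input in, string out) makes it easy to snapshot-test the prompt text independently of any model call.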
Telemetry contract
Four new structured logger events:
- `gpt4o_critic.complete` — successful run with line/strength counts + durMs
- `gpt4o_critic.parse_failed` — model returned non-JSON
- `gpt4o_critic.shape_invalid` — JSON shape didn't match the contract
- `gpt4o_critic.call_failed` — network/timeout
Plus a per-gauntlet outcome flag `gauntlet.cross_critic_applied`: boolean indicating whether the critic was active for that pass. Lets us measure "did adding the critic improve the gauntlet's score lift?"
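A possible shape for the success event and the per-gauntlet flag is sketched below. The event and flag names come from this RFC; the payload field names (`lineCount`, `strengthCount`) and both helper functions are hypothetical, and whether a failed critic call counts as "applied" is an assumption (here it does not).

```typescript
// Hypothetical payload shape for the success event.
interface CriticCompleteEvent {
  event: "gpt4o_critic.complete";
  lineCount: number;
  strengthCount: number;
  durMs: number;
}

// Minimal view of a CriticResult: only the fields the event needs.
interface CriticCounts { lines: unknown[]; strengths: unknown[]; durMs: number; }

function toCompleteEvent(result: CriticCounts): CriticCompleteEvent {
  return {
    event: "gpt4o_critic.complete",
    lineCount: result.lines.length,
    strengthCount: result.strengths.length,
    durMs: result.durMs,
  };
}

// Assumption: the flag is true only when the critic actually returned a result.
function gauntletOutcomeFlags(critic: CriticCounts | null): { "gauntlet.cross_critic_applied": boolean } {
  return { "gauntlet.cross_critic_applied": critic !== null };
}
```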
Cost + budget
- Per-call: ~$0.04 (2x the triangulation cost — larger output, no fixed JSON schema bound)
- 1000 active songs/month: $40
- 10000 active songs/month: $400
- Cost-per-song increase: +12% over current ~$0.30 forge cost
- Off by default, so cost only accrues when the operator chooses to flip the flag (test runs, A/B experiments, or full rollout)
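The budget figures above follow from simple arithmetic, shown here as a sketch. The constants are the rounded estimates quoted in this section; working in cents avoids floating-point drift, and the function names are illustrative only.

```typescript
// Rounded estimates taken from the list above.
const CRITIC_CENTS_PER_CALL = 4;   // ~$0.04
const FORGE_CENTS_PER_SONG = 30;   // ~$0.30

// Monthly critic spend in USD, assuming one critic call per active song.
function monthlyCriticCostUsd(activeSongs: number): number {
  return (activeSongs * CRITIC_CENTS_PER_CALL) / 100;
}

// Relative cost-per-song increase, in percent, using the rounded figures.
function relativeIncreasePct(): number {
  return (CRITIC_CENTS_PER_CALL / FORGE_CENTS_PER_SONG) * 100;
}
```

`monthlyCriticCostUsd(1000)` gives 40 and `monthlyCriticCostUsd(10000)` gives 400. With these rounded inputs the relative increase works out to roughly 13%, in the same range as the quoted figure; the small gap presumably reflects rounding in the per-call and per-forge estimates.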
Versioning consequence
The critic is NOT part of the Lyric Scoring Standard rubric. The rubric continues to be Sonnet-evaluated against the published 12 metrics. The critic is a PIPELINE component (like the gauntlet itself), not a scoring component. `rubricVersion` does not change when the critic toggles on or off.
However, scores produced WITH the critic active will systematically differ from scores produced without it (because the gauntlet's acceptance pattern shifts). To preserve reproducibility, the seal gains a new field: `pipeline.preGauntletCritic: 'gpt-4o' | 'none'`.
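One way to picture the new seal field and its use for reproducibility is sketched here. Only `rubricVersion` and `pipeline.preGauntletCritic` come from this RFC; the `Seal` type, the helper names, and the rule that a failed (null) critic run is recorded as `'none'` are assumptions.

```typescript
type PreGauntletCritic = "gpt-4o" | "none";

// Hypothetical minimal seal: real seals presumably carry more fields.
interface Seal {
  rubricVersion: string;
  pipeline: { preGauntletCritic: PreGauntletCritic };
}

// Assumption: if the critic was enabled but returned null, the gauntlet ran
// as it does today, so the seal records 'none'.
function preGauntletCriticFor(criticReturnedResult: boolean): PreGauntletCritic {
  return criticReturnedResult ? "gpt-4o" : "none";
}

// Two scores are directly comparable only if both the rubric and the
// pipeline configuration match.
function sameScoringContext(a: Seal, b: Seal): boolean {
  return a.rubricVersion === b.rubricVersion
    && a.pipeline.preGauntletCritic === b.pipeline.preGauntletCritic;
}
```

This keeps `rubricVersion` untouched by the toggle while still letting downstream analysis separate critic-on from critic-off scores.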
Acceptance criteria for resolution
This RFC accepts when ALL of:
1. `runGpt4oCritic` is shipped with the locked contract above (B1354, DONE this build)
2. The admin test endpoint `/api/admin/gpt4o-critic-test` proves the call path works (B1354, DONE)
3. The 7-day public comment window closes
4. The shadow-mode wiring (gauntlet calls the critic but does NOT yet weigh its output, only logs the result) ships in a follow-on build
5. After 30 days of shadow-mode telemetry, a measured comparison shows the critic-augmented gauntlet either improves mean composite by ≥1pt OR identifies ≥10% more legitimate cuts (judged by spot-check) without increasing failure rate
If criteria 4-5 don't land within 60 days of acceptance, the RFC is automatically withdrawn and the function stays as a callable-but-unused module.
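Criterion 5 can be read as a boolean rule over two sets of aggregate statistics. The sketch below is one interpretation, not a binding definition: the `ArmStats` fields and the function name are hypothetical, and it assumes the "without increasing failure rate" condition applies to both branches of the OR.

```typescript
// Hypothetical aggregate stats for one arm of the comparison.
interface ArmStats {
  meanComposite: number;   // mean composite score, in points
  legitimateCuts: number;  // count of cuts judged legitimate by spot-check
  failureRate: number;     // fraction of forges that failed, 0..1
}

function meetsCriterion5(baseline: ArmStats, withCritic: ArmStats): boolean {
  // Branch A: mean composite improves by at least 1 point.
  const compositeGain = withCritic.meanComposite - baseline.meanComposite >= 1;
  // Branch B: at least 10% more legitimate cuts. Integer comparison
  // (x * 10 >= y * 11) avoids floating-point error from multiplying by 1.1.
  const moreCuts = baseline.legitimateCuts > 0
    && withCritic.legitimateCuts * 10 >= baseline.legitimateCuts * 11;
  // Assumption: the failure-rate guard applies to both branches.
  const noRegression = withCritic.failureRate <= baseline.failureRate;
  return (compositeGain || moreCuts) && noRegression;
}
```

For example, a baseline of 100 legitimate cuts needs at least 110 in the critic arm to satisfy branch B, and either branch is void if the failure rate rises.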
Out of scope
- Replacing GPT-4o with a different family (Mixtral, Llama). Future work.
- Multiple critics with consensus (3+ model families voting). Future work — needs a separate RFC because the cost calculus changes materially.
- Wiring the critic into the cold-reader pass (B980) too. The cold reader runs AFTER the gauntlet; this RFC only governs the pre-gauntlet path.
Comment window
This RFC is open for comment until 2026-05-03. Email support@songforgeai.com with the subject `RFC-0005` to leave a comment.
Resolution
(Pending — will be filled in after 2026-05-03 with a summary of comments + the accepted text. Until then, the function is callable via `/api/admin/gpt4o-critic-test` for evaluation but NOT yet wired into the production gauntlet path.)