Motivation

Today GPT-4o is used in exactly one place: the B1279 triangulation re-score. It produces a 12-metric scorecard on the same rubric the Sonnet primary already used; the divergence between the two becomes a cross-family corroboration signal. That answers ONE question — "do model families agree on the score?"

It does NOT answer a different and arguably more useful question: "what does a different model family actually NOTICE in this draft?" A second judge giving the same scorecard tells you about agreement; a second reader giving free-form craft notes tells you what the first reader missed.

This RFC proposes adding GPT-4o as a SECOND ROLE: a pre-gauntlet literary critic. Output shape is intentionally NOT another scorecard — it's per-line craft notes (cut / rewrite / preserve) plus an overall craft impression. The Sonnet gauntlet then incorporates or rejects each note as input to its existing decision-rule.

Proposal

Activation

Off by default. Requires `SF_GPT4O_PREGAUNTLET=1` AND `OPENAI_API_KEY` set in the environment.
When active, fires AFTER the initial Sonnet eval lands, BEFORE the gauntlet runs. Adds ~30 seconds + ~$0.04 per forge.
Flag is independent of `SF_INTERNAL_EVAL_TRIANGULATION` (the re-scoring use). Both can be on simultaneously; both consume `OPENAI_API_KEY`.
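
The two activation conditions can be sketched as a single gate. This is a minimal illustration, not the shipped check: the helper name `isPreGauntletCriticActive` and the choice to pass the environment in as a plain object are assumptions, as is treating a whitespace-only key as unset.

```typescript
// Hypothetical helper; name and signature are assumptions for illustration.
type Env = Record<string, string | undefined>;

function isPreGauntletCriticActive(env: Env): boolean {
  // Off by default: only the exact value "1" turns the critic on.
  const flagOn = env["SF_GPT4O_PREGAUNTLET"] === "1";
  // Assumption: an empty or whitespace-only key counts as "not set".
  const hasKey = (env["OPENAI_API_KEY"] ?? "").trim().length > 0;
  return flagOn && hasKey;
}
```

Because the gate does not read `SF_INTERNAL_EVAL_TRIANGULATION`, toggling the re-scoring flag has no effect on it, which matches the independence described above.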

Critic contract (locked)

Returns a `CriticResult`:

```
interface CriticResult {
  overall: string;        // one-paragraph craft impression
  lines: CritiqueLine[];  // up to 12 per-line notes
  strengths: string[];    // 3 lines to preserve
  model: string;          // pinned: gpt-4o-2024-11-20
  durMs: number;
}

interface CritiqueLine {
  line: string;
  reason: string;
  action: 'cut' | 'rewrite'; // 'preserve' lives in strengths[]
}
```

Implementation: `runGpt4oCritic(lyrics, genre)` in `src/lib/triangulation/openai-critic.ts`. Failures are silent — a null return means the gauntlet runs as it does today.
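
One way the silent-failure behaviour could be structured is to parse and shape-check the raw model text before anything reaches the gauntlet. The sketch below is an assumption about how that might look, not the contents of `openai-critic.ts`: `parseCriticResponse` and `ParseOutcome` are hypothetical names, the sketch assumes `model` and `durMs` are stamped by the caller rather than returned by the model, and it treats more than 12 line notes as a shape violation.

```typescript
interface CritiqueLine { line: string; reason: string; action: "cut" | "rewrite"; }
// Model-authored fields only; model/durMs are assumed to be added by the caller.
interface CriticPayload { overall: string; lines: CritiqueLine[]; strengths: string[]; }

type ParseOutcome =
  | { ok: true; value: CriticPayload }
  | { ok: false; event: "gpt4o_critic.parse_failed" | "gpt4o_critic.shape_invalid" };

function isCritiqueLine(v: unknown): v is CritiqueLine {
  if (typeof v !== "object" || v === null) return false;
  const o = v as Record<string, unknown>;
  return typeof o["line"] === "string"
    && typeof o["reason"] === "string"
    && (o["action"] === "cut" || o["action"] === "rewrite");
}

function parseCriticResponse(raw: string): ParseOutcome {
  const invalid: ParseOutcome = { ok: false, event: "gpt4o_critic.shape_invalid" };
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    // The model returned something that is not JSON at all.
    return { ok: false, event: "gpt4o_critic.parse_failed" };
  }
  if (typeof data !== "object" || data === null) return invalid;
  const o = data as Record<string, unknown>;
  const overall = o["overall"];
  const lines = o["lines"];
  const strengths = o["strengths"];
  if (typeof overall !== "string") return invalid;
  // Assumption: exceeding the 12-note cap is rejected rather than truncated.
  if (!Array.isArray(lines) || lines.length > 12 || !lines.every(isCritiqueLine)) return invalid;
  if (!Array.isArray(strengths) || !strengths.every((s) => typeof s === "string")) return invalid;
  return { ok: true, value: { overall, lines, strengths } };
}
```

A caller would log the `event` on failure and hand `null` to the gauntlet, so every failure mode collapses to the same "critic absent" path.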

How the Sonnet gauntlet uses it

When the critic returns non-null, the gauntlet prompt receives an additional section:

> CROSS-FAMILY CRITIQUE (advisory, not authoritative):
> A second model family (gpt-4o) read this draft. It flagged
> these lines as needing revision:
> <line>: <reason> [action]
> ...
> It identified these lines as load-bearing strengths to preserve:
> <line>
> ...
> Use this as input. Reject any flag that contradicts your own
> craft judgment. Cite which flags you accepted in your output
> notes.

The gauntlet decision-rule already requires the model to justify each cut + replacement; this just adds an external prior. The gauntlet remains the final authority — Sonnet's "main brain" decides accept or deny per the operator's framing.
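
The prompt section quoted above could be rendered from a critic result roughly as follows. This is a sketch under assumptions: `buildCrossFamilySection` is a hypothetical name, and returning an empty string for a null critic is one possible way to leave the existing gauntlet prompt untouched.

```typescript
interface CritiqueLine { line: string; reason: string; action: "cut" | "rewrite"; }
interface CriticNotes { lines: CritiqueLine[]; strengths: string[]; }

// Hypothetical renderer for the advisory section of the gauntlet prompt.
function buildCrossFamilySection(critic: CriticNotes | null): string {
  // Null critic: add nothing, so the gauntlet prompt is exactly what it is today.
  if (critic === null) return "";
  const flagged = critic.lines.map((l) => `${l.line}: ${l.reason} [${l.action}]`);
  return [
    "CROSS-FAMILY CRITIQUE (advisory, not authoritative):",
    "A second model family (gpt-4o) read this draft. It flagged",
    "these lines as needing revision:",
    ...flagged,
    "It identified these lines as load-bearing strengths to preserve:",
    ...critic.strengths,
    "Use this as input. Reject any flag that contradicts your own",
    "craft judgment. Cite which flags you accepted in your output",
    "notes.",
  ].join("\n");
}
```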

Telemetry contract

Four new structured logger events:

`gpt4o_critic.complete` — successful run with line/strength counts + durMs
`gpt4o_critic.parse_failed` — model returned non-JSON
`gpt4o_critic.shape_invalid` — JSON shape didn't match the contract
`gpt4o_critic.call_failed` — network/timeout

Plus a per-gauntlet outcome flag `gauntlet.cross_critic_applied`: boolean indicating whether the critic was active for that pass. Lets us measure "did adding the critic improve the gauntlet's score lift?"
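
To make the score-lift question concrete, the flag can split gauntlet passes into two arms and compare their mean lift. The record shape below is hypothetical (the field names are assumptions, not the logger's schema), and the comparison is a plain difference of means with no significance testing.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface GauntletOutcome {
  crossCriticApplied: boolean; // mirrors gauntlet.cross_critic_applied
  compositeBefore: number;     // composite score entering the gauntlet
  compositeAfter: number;      // composite score leaving the gauntlet
}

// Mean score lift for one arm, or null when that arm has no samples.
function meanLift(outcomes: GauntletOutcome[], applied: boolean): number | null {
  const arm = outcomes.filter((o) => o.crossCriticApplied === applied);
  if (arm.length === 0) return null;
  const total = arm.reduce((sum, o) => sum + (o.compositeAfter - o.compositeBefore), 0);
  return total / arm.length;
}

// Critic-arm lift minus baseline-arm lift; null if either arm is empty.
function criticLiftDelta(outcomes: GauntletOutcome[]): number | null {
  const withCritic = meanLift(outcomes, true);
  const baseline = meanLift(outcomes, false);
  if (withCritic === null || baseline === null) return null;
  return withCritic - baseline;
}
```

A positive delta suggests the critic helped; a real analysis would also need comparable drafts in both arms.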

Cost + budget

Per-call: ~$0.04 (2x the triangulation cost — larger output, no fixed JSON schema bound)
1000 active songs/month: $40
10000 active songs/month: $400
Cost-per-song increase: +12% over current ~$0.30 forge cost
Off by default, so cost only accrues when the operator chooses to flip the flag (test runs, A/B experiments, or full rollout)
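
The monthly figures above follow from multiplying the per-call estimate by song volume. A small sketch, assuming exactly one critic call per active song; `songsWithinBudget` is a hypothetical helper added for budgeting, not something this RFC defines.

```typescript
// Per-call estimate from this section, held in integer cents to avoid float drift.
const CRITIC_COST_CENTS_PER_CALL = 4; // ~$0.04

// Projected monthly critic spend in USD (assumes one call per active song).
function monthlyCriticCostUsd(activeSongs: number): number {
  return (activeSongs * CRITIC_COST_CENTS_PER_CALL) / 100;
}

// Hypothetical helper: how many critic-enabled songs fit a monthly budget.
function songsWithinBudget(budgetUsd: number): number {
  return Math.floor((budgetUsd * 100) / CRITIC_COST_CENTS_PER_CALL);
}
```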

Versioning consequence

The critic is NOT part of the Lyric Scoring Standard rubric. The rubric continues to be Sonnet-evaluated against the published 12 metrics. The critic is a PIPELINE component (like the gauntlet itself), not a scoring component. `rubricVersion` does not change when the critic toggles on or off.

However, scores produced WITH the critic active will systematically differ from scores produced without it (because the gauntlet's acceptance pattern shifts). To preserve reproducibility, the seal gains a new field: `pipeline.preGauntletCritic: 'gpt-4o' | 'none'`.
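
A sketch of how the seal field might be derived. It assumes the value records whether critic output actually reached the gauntlet, so a null critic result maps to `'none'` even when the flag is on; `sealPipelineFor` and `SealPipeline` are hypothetical names, and shadow-mode passes (critic called but not weighed) would presumably also record `'none'`.

```typescript
type PreGauntletCritic = "gpt-4o" | "none";

// Hypothetical container for the new seal field.
interface SealPipeline {
  preGauntletCritic: PreGauntletCritic;
}

// criticResult is whatever the critic returned for this pass, or null if it
// was disabled, misconfigured, or failed.
function sealPipelineFor(criticResult: object | null): SealPipeline {
  return { preGauntletCritic: criticResult === null ? "none" : "gpt-4o" };
}
```

Keying the field on the result rather than on the flag keeps the seal accurate when the flag is set but the call never produced usable notes.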

Acceptance criteria for resolution

This RFC accepts when ALL of:

1. `runGpt4oCritic` is shipped with the locked contract above (B1354, DONE this build)
2. The admin test endpoint `/api/admin/gpt4o-critic-test` proves the call path works (B1354, DONE)
3. The 7-day public comment window closes
4. The shadow-mode wiring (gauntlet calls the critic but does NOT yet weigh its output, only logs the result) ships in a follow-on build
5. After 30 days of shadow-mode telemetry, a measured comparison shows the critic-augmented gauntlet either improves mean composite by ≥1pt OR identifies ≥10% more legitimate cuts (judged by spot-check) without increasing failure rate
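
The final criterion can be written as a predicate over shadow-mode aggregates. This is one reading of it, stated as an assumption: the failure-rate guard is applied to both branches of the OR, not only to the cut-count branch. All field names are hypothetical.

```typescript
// Hypothetical aggregate of 30 days of shadow-mode telemetry.
interface ShadowComparison {
  meanCompositeBaseline: number;
  meanCompositeWithCritic: number;
  legitimateCutsBaseline: number;   // spot-check-confirmed cuts, baseline gauntlet
  legitimateCutsWithCritic: number; // same measure, critic-augmented gauntlet
  failureRateBaseline: number;
  failureRateWithCritic: number;
}

function meetsFinalCriterion(c: ShadowComparison): boolean {
  // Branch A: mean composite improves by at least 1 point.
  const compositeGain = c.meanCompositeWithCritic - c.meanCompositeBaseline >= 1;
  // Branch B: at least 10% more legitimate cuts. Integer math avoids 1.1 * x rounding.
  const cutGain = c.legitimateCutsBaseline > 0
    && c.legitimateCutsWithCritic * 10 >= c.legitimateCutsBaseline * 11;
  // Assumed reading: failure rate must not rise in either case.
  const failureOk = c.failureRateWithCritic <= c.failureRateBaseline;
  return (compositeGain || cutGain) && failureOk;
}
```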

If criteria 4-5 don't land within 60 days of acceptance, the RFC is automatically withdrawn and the function stays as a callable-but-unused module.

Out of scope

Replacing GPT-4o with a different family (Mixtral, Llama). Future work.
Multiple critics with consensus (3+ model families voting). Future work — needs a separate RFC because the cost calculus changes materially.
Wiring the critic into the cold-reader pass (B980) too. The cold reader runs AFTER the gauntlet; this RFC only governs the pre-gauntlet path.

Comment window

This RFC is open for comment until 2026-05-03. Email support@songforgeai.com with the subject `RFC-0005` to leave a comment.

Resolution

**Accepted as-written, 2026-05-03.**

Comment window closed without proposed amendments. Two clarifying questions received:

Q: If `SF_GPT4O_PREGAUNTLET` is on but `OPENAI_API_KEY` is missing, what happens?
A: Silent no-op. `runGpt4oCritic` returns null; the gauntlet runs as it does today. The seal field reads `preGauntletCritic: 'none'` so reproducibility stays honest.

Q: Does the critic run on Crucible? Or only on the forge -> gauntlet path?
A: Forge -> gauntlet only. Crucible is a different surface with its own 8-voice contract; adding a GPT-4o critic there would require a separate RFC because the cost calculus + reproducibility shape is different.

The shadow-mode requirement (criterion 4) and the 30-day telemetry threshold (criterion 5) remain. The function is callable via `/api/admin/gpt4o-critic-test` for evaluation. Production wiring lands in a follow-on build only after shadow-mode telemetry meets the threshold; if criteria 4-5 have not landed by the 60-day mark, the RFC is withdrawn per its own auto-withdrawal clause.