Behind the Scenes · 2025-12-12 · 8 min read · By the SongForgeAI team

Meet the Antagonist: The Voice We Built to Drop Your Score

Self-evaluating AI converges on polite consensus. We built an adversarial voice into the scoring panel whose only job is finding what is wrong. Here is why it works, why users hate it at first, and why we kept it anyway.

When we first built the scoring system, we tested it the obvious way: hand the evaluator a lyric, let it grade. The results were promising — every lyric scored between 75 and 92. Promising, until we noticed the problem. Every lyric scored between 75 and 92.

No failures. No mediocre output. The evaluator — which was the same kind of model that wrote the lyric — had quietly decided that all its own work was good. Not in a malicious way. In the way a reasonable grader with no skin in the game grades generously. In the way any peer review without consequences drifts toward kindness.

We ran the same test with variations. Different evaluator prompts. Different temperature settings. Different explicit instructions to be harsh. The range shifted a little. The problem did not. Self-evaluation, in every form we tried, converged on consensus kindness. The lowest scores were still inside the range a serious grader would call "good work."

That is how we ended up building the Antagonist.

What the Antagonist is

The Antagonist is a dedicated voice in the scoring panel whose only job is to find what is wrong. It does not weigh in on what is working. Other voices handle that. The Antagonist reads the lyric looking specifically for: forced rhymes, filler syllables that exist only for meter, cliches that nobody caught, narrator inconsistencies, bridge lines that say the same thing the second verse said, and any moment where the song stops being specific and retreats into abstraction.

When the Antagonist finds one of these — and in most lyrics, it finds several — it flags the specific line, explains the specific problem, and proposes a score ceiling. A lyric with a genuinely broken meter in verse two cannot score above 78 no matter how strong everything else is. A lyric that rhymes "sky" with "why" in the chorus and "sky" with "goodbye" in the bridge cannot score above 74. The numbers come from the rubric. The evidence comes from the text.

Here is an example of an Antagonist report from a real evaluation:

Line 4 ("the sky turned grey like it always does"): "Always does" is filler. The narrator is telling us a universal instead of the specific grey of this specific sky on this specific afternoon. This is the third universal in the song and the song is not four minutes long. Specificity score cannot be above 65 while this line is present. Composite ceiling: 74.

Line 11 ("and I'll love you forever"): Dead phrase. It means nothing because it has been used to mean everything. The Transcendence metric cannot credit this line. The Imagery Originality metric takes a one-point penalty. Remove or rewrite.

Line 18 ("I don't know how to say goodbye"): The narrator has said this three different ways earlier in the song. The bridge should reveal something new, not restate the premise. Emotional Arc cannot credit a bridge that does not move. Arc score cap: 62.

This is specific. It names the line, names the problem, attaches the problem to a metric, and proposes a concrete score consequence. The other voices in the panel can argue with it, and sometimes do. But the Antagonist has the procedural power to cap scores, and a capped score is capped until the objection is actually resolved.
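The cap mechanic itself is easy to sketch. Here is a minimal, hypothetical model (the names `Objection` and `capped_composite` are ours for illustration, not SongForgeAI internals): each objection carries its evidence and a proposed ceiling, and the composite becomes the minimum of the panel's score and every unresolved ceiling.

```python
from dataclasses import dataclass

@dataclass
class Objection:
    """One Antagonist finding: the evidence line, the metric it maps to,
    and the ceiling it imposes while unresolved. Hypothetical structure."""
    line_no: int
    excerpt: str
    metric: str
    problem: str
    ceiling: int
    resolved: bool = False

def capped_composite(panel_score: int, objections: list[Objection]) -> int:
    """A capped score stays capped until the objection is actually resolved."""
    active = [o.ceiling for o in objections if not o.resolved]
    return min([panel_score, *active])

# The consensus panel wants an 86; an unresolved filler objection caps it at 74.
filler = Objection(4, "the sky turned grey like it always does",
                   "Specificity", "universal instead of a specific image", 74)
print(capped_composite(86, [filler]))   # → 74
filler.resolved = True
print(capped_composite(86, [filler]))   # → 86
```

The design point is that resolution, not argument, lifts the cap: the other voices can object, but the minimum stands until the objection's `resolved` flag flips.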

Why adversarial beats consensus

There is a specific failure mode in AI self-evaluation that we have started calling "congratulation drift." When the same general-purpose model both generates and evaluates, it perceives the generated output as the work of "a competent writer" (which, technically, it is; the competence is the whole point of the model) and grades up from there. Competent work starts at a high floor. Weaknesses are framed as minor, strengths are framed as foundational, and the composite lands in the band that feels like a respectable grade rather than the band that reflects the work's actual position in the distribution of all possible work.

Congratulation drift is not solved by telling the evaluator to be harsh. We tested that. The evaluator agrees to be harsh, then scores the next lyric 85 instead of 88. The frame has not shifted. The evaluator is still grading within a peer group it considers itself part of.

Adversarial evaluation works because it puts the evaluator on the opposite side of the table. The Antagonist's prompt is not "grade this lyric." It is "find the reasons this lyric does not deserve the score the consensus panel wants to give it." That reframe changes what the evaluator attends to. Instead of reading the lyric looking for competence and congratulating it, the Antagonist reads looking for failure and documenting it.
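The reframe lives entirely in the prompt. A toy illustration built from the post's own phrasing (these are illustrative templates, not the production prompts):

```python
# Hypothetical templates illustrating the adversarial reframe;
# not SongForgeAI's actual prompts.
CONSENSUS_PROMPT = "Grade this lyric against the rubric.\n\n{lyric}"
ANTAGONIST_PROMPT = (
    "Find the reasons this lyric does not deserve the score the consensus "
    "panel wants to give it. Cite specific lines as evidence.\n\n{lyric}"
)

def render(template: str, lyric: str) -> str:
    """Fill a template with the lyric under evaluation."""
    return template.format(lyric=lyric)
```

Same model, same lyric, same rubric; only the stance changes, and the stance changes what the evaluator attends to.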

The effect on the distribution is measurable. Without the Antagonist, first-pass scores cluster in the 72 to 85 range and feel meaningless. With the Antagonist, first-pass scores cluster in the 54 to 68 range and feel honest. A 78 from the Antagonist-checked panel means something. A 78 from self-evaluation means "I was there, I saw it."

The tradeoff: it makes scores feel mean

Users do not love the Antagonist at first. A lyric they think is pretty good comes back with a 66, and the explanation includes three specific line-level criticisms that are not wrong but are harder to read than a generic "nice work." We have had users email support to ask if the scoring was broken. It was not broken. It was working.

The thing we tell people, which they eventually see for themselves: a 66 from the Antagonist-checked panel is useful. It tells you exactly what to fix. An 84 from self-evaluation does not tell you anything. It is a compliment dressed as a number. One of those numbers is worth a refine pass. The other one is just a vibe.

This is why Refine mode became the most-used feature on the platform about three weeks after we installed the Antagonist. The adversarial feedback pairs naturally with iterative improvement: the critique names the weakness, Refine targets it, and the re-scored version typically lands 12 to 18 composite points higher than the first draft. Users learned to read the initial low score as the opening of a conversation instead of the end of one.

What the Antagonist will not do

Three guardrails keep the Antagonist from becoming a cynic:

  1. Evidence-based only. The Antagonist cannot propose a score cap without citing a specific line and explaining the specific problem. Vague grievances ("this just isn't very good") get rejected by the panel coordinator and returned for evidence. If the Antagonist cannot name the problem, the problem does not exist.
  2. Bound to the rubric. Every objection has to map to one of the twelve metrics. An objection that does not land on a metric ("I don't like this song's energy") is out of scope and gets dropped. This keeps the Antagonist from importing external taste under the cover of craft judgment.
  3. Cannot override transcendent lines. If another voice has marked a line as transcendent, the Antagonist can argue with that designation but cannot unilaterally remove it. Strong work is protected. Consensus about strong work is protected. The Antagonist exists to catch weakness, not to launder dislike.
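Together, the three guardrails amount to an admission filter sitting in front of the panel. A sketch under our own naming assumptions (the `admit` function and the metric set are illustrative; the post names only four of the twelve metrics):

```python
# Hypothetical sketch of the three guardrails as an admission filter.
RUBRIC_METRICS = {
    # Illustrative subset: the four metrics named in this post, of twelve total.
    "Specificity", "Transcendence", "Imagery Originality", "Emotional Arc",
}

def admit(objection: dict, transcendent_lines: set[int]) -> tuple[bool, str]:
    """Return (admitted, reason). Rejected objections go back for evidence."""
    # 1. Evidence-based only: a cited line and a named problem are required.
    if not objection.get("excerpt") or not objection.get("problem"):
        return False, "no evidence: returned by the panel coordinator"
    # 2. Bound to the rubric: objections must land on a metric.
    if objection.get("metric") not in RUBRIC_METRICS:
        return False, "out of scope: does not map to a rubric metric"
    # 3. Cannot override transcendent lines: argue, but no unilateral cap.
    if objection.get("line_no") in transcendent_lines:
        return False, "line marked transcendent: open a panel argument instead"
    return True, "admitted"
```

Note the ordering: the cheapest check (is there evidence at all?) runs first, and the protection for strong work runs last, so a transcendent line still gets its objection recorded as an argument rather than silently dropped.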

These guardrails mean the Antagonist is, in practice, a specific kind of reader: the one who always finds the broken bit, reports it precisely, but also honors what is working and does not pretend the good lines are not there. That kind of reader is rare in nature. It is also what every songwriter who has ever improved has had access to at some point — a trusted person who would actually tell them when something was not working, citing exactly what.

Why we kept it

The case against the Antagonist was real. Lower initial scores look bad. Users get frustrated. It makes the product feel harder to use. Some share of prospective users bounce off a 64 on their first attempt and do not come back.

The case for the Antagonist won on one argument: a product that tells users their work is good when it is not is not a craft tool. It is a compliment machine. Compliment machines exist. We did not want to be one.

The subtler argument: every user who sticks around past the first disappointing score becomes a better writer over time. The compliment machine's users get the same score forever because nothing tells them what to fix. Our users watch their scores climb as they learn to hear the specific weaknesses the Antagonist flagged in their early drafts. That progression is the actual product.

A note on the asymmetry

There is a structural asymmetry in the evaluation that we like. The system is biased toward low scores by design. Getting a high score requires passing every check. Getting a low score requires failing one. This means most lyrics score below their peak potential, and the peak is hard to reach.
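The arithmetic behind that asymmetry is worth seeing. Under the simplifying assumption that each of the twelve metric checks passes independently, the chance of an uncapped score shrinks geometrically:

```python
def p_uncapped(p_pass: float, n_checks: int = 12) -> float:
    """Chance a lyric clears every check, assuming independent checks.
    Failing any single one is enough to cap the composite."""
    return p_pass ** n_checks

# Even a lyric that passes each individual check 90% of the time
# clears all twelve only about 28% of the time.
print(round(p_uncapped(0.90), 2))   # → 0.28
```

Real metric failures are correlated (a cliche-heavy lyric tends to fail Specificity and Imagery Originality together), so the true number is less extreme, but the direction holds: the peak is hard to reach by construction.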

We think this is correct. Composite scores should be a rigorous estimate of where the lyric lands, not a flattering estimate. Flattery is free. Rigor is what you built the tool for.

The lyric that scored 64 on the first pass and 83 after refine is the same lyric either way. The number that gets you to do the work is the number that matters. If the system had given it a 78 initially, you would have shipped it. The Antagonist is why you did not.

— The SongForgeAI team