Behind the Scenes · 2026-04-30 · 6 min read · By the SongForgeAI team

The Hank Williams Test — why our scoring rubric is calibrated against a country song from 1949

Most AI scoring is calibrated against other AI output. The Lyric Scoring Standard anchors at 95 against Hank Williams' "I'm So Lonesome I Could Cry." Here is why that choice is load-bearing.

Most AI scoring rubrics are calibrated against other AI output. They take a sample of GPT or Claude or Gemini lyrics, average the perceived quality, and call that average 75. That feels honest until you remember: every model is calibrated against the same training distribution. "Average AI lyric" is a moving target that drifts with each model release.

The Lyric Scoring Standard doesn't do this. The rubric anchors at composite 95 against Hank Williams' "I'm So Lonesome I Could Cry," released in 1949. Every score the system produces is implicitly answering: how close is this to a song that survived seventy-five years of recordings, covers, lectures, and citation? The Hank Williams Test is the calibration.

Why a country song from 1949

Three reasons, none of them sentimental:

  1. It has been independently verified by time. Seventy-five years of survival is the strongest possible peer review. Every working songwriter since 1949 has heard this song. It has been covered by Patsy Cline, B.B. King, Cassandra Wilson, Diana Krall, and Frank Ocean. If your scoring rubric can't identify "I'm So Lonesome I Could Cry" as S-band output, the rubric is broken.
  2. The craft is structurally measurable. The lyric is a four-image stack: silence, train, robin, moon. Each image carries grief through physical observation. No emotional summary lines. No platitudes. The structure is verse-only (there is no chorus), and yet the song is more memorable than 99% of choruses ever written. The 12-metric rubric scores it at 95 because of per-metric scores like: Specificity 96 (every image concrete), Voice 95 (the stillness IS the narrator), Memorability 96 (no hook, total recall), Image Discipline 98 (zero abstractions), and so on. These numbers are sketched as a calibration record after this list.
  3. It pre-dates AI training data by half a century. No GPT or Claude or Gemini was trained on a corpus that excluded "I'm So Lonesome I Could Cry." Every model has memorized it. Anchoring at this song makes the test: how close does your generated lyric come to a target that every model knows but few can replicate? That gap is the craft surface.
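To make reason 2 concrete, here is a minimal sketch of what that calibration record could look like in TypeScript. Only the four cited metric scores come from the rubric as described above; the interface shape, field names, and metric keys are assumptions for illustration, not the published standard's actual schema.

```typescript
// Hypothetical shape for a calibration anchor. The four metric scores
// below are the ones cited in this post; the interface, field names,
// and everything else are illustrative assumptions.
interface CalibrationAnchor {
  title: string;
  artist: string;
  year: number;
  composite: number;               // the rubric's anchor composite
  metrics: Record<string, number>; // per-metric scores out of 100
}

const hankWilliamsAnchor: CalibrationAnchor = {
  title: "I'm So Lonesome I Could Cry",
  artist: "Hank Williams",
  year: 1949,
  composite: 95,
  metrics: {
    specificity: 96,     // every image concrete
    voice: 95,           // the stillness IS the narrator
    memorability: 96,    // no hook, total recall
    imageDiscipline: 98, // zero abstractions
    // ...the rest of the 12 metrics would be recorded here
  },
};
```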

What "calibrated at 95" actually means in practice

When the rubric scores a forged lyric at, say, 78, it’s saying: this lyric is 17 points below the calibrated S-band anchor. That gap is composed of measurable per-metric deficits — the Specificity score might be 82, the Voice score 75, the Memorability score 71. Each deficit points to a specific revision: the line that hides behind a generic noun, the verse where the narrator breaks character, the chorus that lands but doesn’t echo.
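A minimal sketch of that decomposition, reusing the example numbers above. The deficits helper, its output shape, and the revision ordering are illustrative assumptions, not the rubric's actual implementation.

```typescript
// Turn per-metric deficits against the anchor into an ordered revision
// list. The anchor and forged scores reuse the example numbers from
// the post; the helper itself is illustrative.
const anchorMetrics: Record<string, number> = {
  specificity: 96,
  voice: 95,
  memorability: 96,
};

const forgedScores: Record<string, number> = {
  specificity: 82,  // hides behind a generic noun
  voice: 75,        // narrator breaks character
  memorability: 71, // chorus lands but doesn't echo
};

function deficits(
  scores: Record<string, number>,
  target: Record<string, number>,
): Array<{ metric: string; gap: number }> {
  return Object.keys(target)
    .map((metric) => ({ metric, gap: target[metric] - (scores[metric] ?? 0) }))
    .filter((d) => d.gap > 0)
    .sort((a, b) => b.gap - a.gap); // largest deficit first: revise that first
}

console.log(deficits(forgedScores, anchorMetrics));
// [ { metric: 'memorability', gap: 25 },
//   { metric: 'voice', gap: 20 },
//   { metric: 'specificity', gap: 14 } ]
```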

The Anti-Inflation rules (Gravity Rule, Burden of Proof, Antagonist Ceiling, Historical Context, Anti-Platitude) make this calibration honest. Without them, an LLM-judged scoring system would score everything 80+ because LLMs are sycophantic by default. The Anti-Inflation rules force the rubric to argue UPWARD with evidence — a 90 requires citing specific lines and comparing to canonical-song references. Without those references, the score lands in the 60-75 band where most working AI output legitimately sits.
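One way the Burden of Proof rule could be enforced in code is a guard that refuses to pass a high score through without evidence. This is a sketch of the idea only; the threshold, field names, and clamping behavior are assumptions, not the standard's actual mechanism.

```typescript
// Hypothetical Burden of Proof guard. A composite of 90+ must be
// argued upward with quoted lines and canonical-song comparisons;
// an unsupported high score is clamped back into the 60-75 band.
interface JudgedScore {
  composite: number;
  citedLines: string[];          // specific lines quoted as evidence
  canonicalReferences: string[]; // canonical songs compared against
}

function enforceBurdenOfProof(score: JudgedScore): number {
  const hasEvidence =
    score.citedLines.length > 0 && score.canonicalReferences.length > 0;
  if (score.composite >= 90 && !hasEvidence) {
    return 75; // top of the band where unsupported output sits
  }
  return score.composite;
}
```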

Why this matters for any tool that scores AI lyrics

The published standard (CC BY 4.0, also on npm as @songforgeai/scoring-rubric) ships with the calibration documentation. Any third party implementing the rubric inherits the Hank Williams Test as the anchor. That’s the point of a published standard — the calibration is shared infrastructure, not proprietary positioning.
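Assuming the package exposes a scoring entry point, third-party usage could look roughly like this. The package name is real; scoreLyric, its signature, and the result shape are hypothetical, so consult the package documentation for the actual API.

```typescript
// Hypothetical third-party usage. Only the package name comes from
// the post; scoreLyric and the result shape are assumptions.
import { scoreLyric } from "@songforgeai/scoring-rubric";

const myLyric = "your forged lyric here";
const result = scoreLyric(myLyric);

console.log(result.composite);           // e.g. 78
console.log(result.metrics.specificity); // per-metric breakdown
```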

If you’re building an AI lyric tool and your scoring system anchors at "average GPT-4 output" or "median Suno render," you’re calibrating against a moving target. Anchor against something that has survived. The corpus is at /scoring/corpus; the methodology is at /scoring/standard/whitepaper; the rubric is on npm.

The test in one sentence

If your scoring rubric can’t produce a 95 for "I’m So Lonesome I Could Cry," your rubric isn’t honest. If it produces a 95 for an average AI render, your rubric is broken. The Hank Williams Test is the floor and the ceiling at the same time.
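Stated as a calibration test, under the same hypothetical scoreLyric API as above:

```typescript
// The Hank Williams Test as a calibration check. The two assertions
// mirror the two failure modes: not honest (anchor below 95) and
// broken (an average render reaching 95).
import { scoreLyric } from "@songforgeai/scoring-rubric";

declare const hankWilliamsLyric: string; // the 1949 lyric text
declare const averageAiRender: string;   // a median AI-generated lyric

console.assert(
  scoreLyric(hankWilliamsLyric).composite >= 95,
  "rubric is not honest",
);
console.assert(
  scoreLyric(averageAiRender).composite < 95,
  "rubric is broken",
);
```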