Tools · 2026-04-30 · 6 min read

Reproducible AI: What Scoring Should Actually Look Like

Most AI scoring tools work by assertion: "this output scored 87." There is no rubric you can audit, no implementation you can reproduce, no signature you can verify. The Lyric Scoring Standard exists because reproducible AI is a bar, not a slogan — and most of the field is below it. Here is what the bar actually looks like.

The four properties of a reproducible AI scoring system

An AI score is reproducible when an independent third party can answer four questions:

  1. What rubric? The metrics, weights, anti-inflation rules, and calibration anchors are PUBLISHED. Not "we use a multi-dimensional rubric" — the actual numbers, in a versioned document.
  2. What runtime? Which model produced the score? At what temperature? Against which build of the scoring code? Without these, the score is an artifact of an unspecified system, like a measurement reported without units.
  3. What signature? The score envelope is signed with a verifiable key, so a downstream consumer can prove the score came from the claimed system and wasn’t fabricated. Cryptographic, not "trust us."
  4. What corpus? If the system claims a score band corresponds to professional craft, the calibration corpus is PUBLIC. Anyone can verify the anchors.

Below the bar: any system missing one of these is operating on assertion, not reproducibility.
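
As a sketch, the four answers map onto a concrete data shape. The interface below is illustrative only; the field names are assumptions for exposition, not any standard's published schema.

```ts
// Illustrative sketch: what a reproducible score response must carry.
// Field names are hypothetical, chosen to mirror the four questions.
interface ReproducibleScore {
  score: number;                  // the number itself, e.g. 87
  rubric: {                       // 1. What rubric?
    version: string;              //    a versioned document with actual numbers
    url: string;
  };
  runtime: {                      // 2. What runtime?
    model: string;                //    which model produced the score
    temperature: number;          //    at what temperature
    buildSha: string;             //    against which build of the scoring code
  };
  signature: {                    // 3. What signature?
    algorithm: "ed25519";
    publicKeyUrl: string;
    value: string;                //    signature over everything above
  };
  corpus: {                       // 4. What corpus?
    url: string;                  //    public calibration corpus
    version: string;
  };
}
```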

How the Lyric Scoring Standard meets the bar

The Lyric Scoring Standard publishes all four:

  • Rubric: 12 metrics across 3 tiers, 5 anti-inflation rules, calibration band published at /scoring/standard/whitepaper. CC BY 4.0; cite the version when implementing.
  • Runtime: every score response carries a reproducibility seal listing model + temperature + buildSha + rubricVersion. The seal is part of the public response schema; consumers can read it (an illustrative example follows this list).
  • Signature: the seal is signed with an ed25519 key. A public verifier is coming to /developer (the verifier IS the standard-compliance check). For now, anyone with the public key can verify a seal manually.
  • Corpus: the calibration corpus lives at /scoring/corpus with hand-scored exemplars across the band. The S-band anchor is Hank Williams ("I’m So Lonesome I Could Cry") — a song that survived 75 years.
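
For concreteness, a seal of the kind described above might look like the object below. Every value is invented for illustration; only the field names (model, temperature, buildSha, rubricVersion) come from the bullet list.

```ts
// Hypothetical reproducibility seal. Field names follow the bullet
// above; all values are invented for illustration.
const exampleSeal = {
  model: "scorer-model-v3",       // which model produced the score
  temperature: 0,                 // sampling temperature it ran at
  buildSha: "3f9c2a17",           // build of the scoring code
  rubricVersion: "1.2.0",         // version of the published rubric
  signature: "<base64url ed25519 signature over the fields above>",
};
```

Because the seal travels inside the response, a consumer never has to ask the producer which rubric or model was used; the answer is pinned to the score itself.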

Why this matters for AI music tools beyond SongForgeAI

If you’re building anything that scores AI-generated text — lyrics, prose, code, anything — these four properties are the bar your system either meets or doesn’t. The current AI music tooling landscape mostly doesn’t meet it: scores are emitted by closed models, against unpublished rubrics, with no signature, against unspecified corpora.

That’s a problem the field can solve in a single coordination round: pick an open standard, ship implementations that emit signed seals, publish corpora. The Lyric Scoring Standard is one published candidate; there will be others; the goal isn’t our standard winning, it’s the field having ANY published standard that meets the bar.

How to verify a score yourself

Today, the verification flow looks like this (a code sketch follows the steps):

  1. Receive a SongForgeAI score response. Note the seal field — that’s the signature envelope.
  2. Look up the public key on /developer (also published in the npm package metadata).
  3. Verify the seal’s signature against the seal’s data using ed25519. Any standard cryptography library handles it.
  4. Cross-reference the seal’s rubricVersion against the version pinned in the published whitepaper. Mismatched versions = the score isn’t against the rubric you think it is.
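
Here is a minimal sketch of steps 3 and 4 in TypeScript, using Node's built-in crypto. It assumes the seal carries a canonical JSON payload plus a base64url-encoded signature, and that the public key is distributed as PEM; those framing details are assumptions, not the published schema.

```ts
// Minimal verification sketch. Assumptions: the seal's `payload` is
// canonical JSON (model, temperature, buildSha, rubricVersion) and
// `signature` is a base64url ed25519 signature over that payload.
import { createPublicKey, verify } from "node:crypto";

interface Seal {
  payload: string;   // canonical JSON string the producer signed
  signature: string; // base64url-encoded ed25519 signature
}

// Step 3: verify the signature against the seal's data.
function verifySeal(seal: Seal, publicKeyPem: string): boolean {
  const key = createPublicKey(publicKeyPem);
  // For ed25519, Node's verify() takes null as the digest algorithm.
  return verify(
    null,
    Buffer.from(seal.payload, "utf8"),
    key,
    Buffer.from(seal.signature, "base64url"),
  );
}

// Step 4: cross-reference the rubric version against the one you pinned
// from the published whitepaper. A valid signature over the wrong
// rubric version is still the wrong score.
function checkRubricVersion(seal: Seal, pinnedVersion: string): boolean {
  const { rubricVersion } = JSON.parse(seal.payload);
  return rubricVersion === pinnedVersion;
}
```

Both checks have to pass: the signature proves provenance, and the version pin proves the score was computed against the rubric you think it was.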

This is what reproducible AI looks like in practice: not "trust the number," but "verify the number." Every consumer should be able to run that loop without asking us anything. We’re publishing because it’s the only honest move.

The cost of NOT being reproducible

The cost is invisible until something matters. A label decides whether to clear a sample based on a "trust score" from an unspecified system. A music school grades AI-assisted student work against an unpublished rubric. A streaming platform de-prioritizes tracks scoring below a threshold computed by a black box. Without reproducibility, none of these decisions can be appealed, audited, or improved. The system is just an oracle.

Reproducibility doesn’t guarantee a score is RIGHT. It guarantees the score is CHECKABLE. That’s the bar.
