Inter-rater agreement

30 humans will score the corpus against the published rubric.

The Lyric Scoring Standard’s LLM evaluator runs the rubric against the calibration corpus and produces published scores. The honest credibility test for any such system is whether human graders applying the same rubric agree with each other, and with the machine. This page is the pre-registered methodology for a 30-person human cohort that scores the published corpus, plus the URL where the resulting agreement statistics (ICC) will land.

Status: recruitment in progress

The first 30-rater cohort is being recruited. Estimated scoring window: a 30-day stretch following cohort close. When the agreement statistics land, this page renders them automatically; the data flows from the public cohort_runs table when status flips to ‘published’ (B1962 scaffold). No code change is required at publish time; this page reads the already-public artifact.
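
For the curious, here is a minimal sketch of the read this page performs once results exist, assuming the public supabase-js client. The column names are illustrative; the real schema lives in the migration referenced later on this page.

```ts
import { createClient } from '@supabase/supabase-js';

// Public (anon) client: cohort_runs rows become readable only once status = 'published'.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

export async function fetchPublishedRun() {
  // Column names are illustrative, not the actual schema.
  const { data: run, error } = await supabase
    .from('cohort_runs')
    .select('rubric_version, composite_icc, per_metric_icc, published_at')
    .eq('status', 'published')
    .maybeSingle();

  // Nothing published yet: keep rendering the "recruitment in progress" state.
  if (error || !run) return null;
  return run;
}
```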

Apply to join the cohort

Looking for ~30 raters with mixed backgrounds: working songwriters, academic music-theory faculty, critics + A&R + publishing professionals. Honorarium $200/rater. Estimated commitment: ~3.5 hours (21 entries at ~10 minutes each). Methodology pre-registered above; rater tooling is built and ready.

Email a 1-paragraph application

Response within 7 days. The mailto link pre-fills a short application template; replace the bracketed fields with your details. No CV required; the background paragraph is enough.

Pre-registered methodology

The methodology is pinned before the cohort completes, following the same logic as a pre-registered study. Anyone can verify these claims against the published code, migration, and κ-computation library at the time the results land.

  1. Cohort size: 30 raters. Mix targeted at ~10 working songwriters (Nashville / LA / indie), ~10 academic music-theory faculty, ~10 critics + A&R + publishing professionals. Honorarium $200/rater.
  2. Corpus: the published 21-entry calibration set at /scoring-corpus-v1.json. Same hand-scored exemplars the LLM evaluator was calibrated against. Operating principle: humans score the same lyrics the machine scored.
  3. Rubric: pinned to v1.2.0 (or the version active at cohort opening). Agreement statistics are anchored to a specific rubric version; a future MINOR bump opens a new cohort run.
  4. Statistic: ICC(2,1) per Shrout & Fleiss (1979): two-way random effects, absolute agreement, single measurement. It is the right choice when raters are a representative sample drawn from a larger population (working songwriters / academics / critics) rather than the only graders we will ever have. Published as a single composite ICC plus per-metric ICCs (some metrics are easier to grade than others; the per-metric breakdown is itself useful published data). A worked sketch of the computation follows this list.
  5. Cicchetti (1994) banding: <0.40 poor, 0.40-0.59 fair, 0.60-0.74 good, 0.75-1.00 excellent. Our target on composite scores: 0.60+. Per-metric agreement will likely be lower for some metrics (Specificity easy; Memorability harder) and that is itself information worth publishing.
  6. Immutability: once a rater submits a score on a corpus entry, that submission cannot be revised. A database constraint enforces this (B1962 UNIQUE constraint on (cohort_run, rater, corpus_entry)), which preserves the integrity of the inter-rater claim.
  7. Publication: gated on cohort_runs.status = ‘published’. When the status flips, the row becomes publicly readable and this page renders the result automatically.
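
To make item 4 concrete, here is a minimal TypeScript sketch of ICC(2,1) over a complete n-entries-by-k-raters matrix, using the standard two-way ANOVA decomposition: ICC(2,1) = (MS_R − MS_E) / (MS_R + (k − 1)·MS_E + k·(MS_C − MS_E)/n). It is an illustration of the Shrout & Fleiss formula, not the published src/lib/cohort-kappa.ts implementation, and it assumes no missing cells.

```ts
// ratings[i][j] = rater j's score on corpus entry i (complete matrix, no missing cells).
export function icc21(ratings: number[][]): number {
  const n = ratings.length;       // targets (corpus entries)
  const k = ratings[0].length;    // raters
  const all = ratings.flat();
  const grand = all.reduce((s, x) => s + x, 0) / (n * k);

  const rowMeans = ratings.map(row => row.reduce((s, x) => s + x, 0) / k);
  const colMeans = Array.from({ length: k }, (_, j) =>
    ratings.reduce((s, row) => s + row[j], 0) / n
  );

  // Two-way ANOVA mean squares: targets (rows), raters (columns), residual.
  const ssTotal = all.reduce((s, x) => s + (x - grand) ** 2, 0);
  const ssRows = k * rowMeans.reduce((s, m) => s + (m - grand) ** 2, 0);
  const ssCols = n * colMeans.reduce((s, m) => s + (m - grand) ** 2, 0);
  const msR = ssRows / (n - 1);
  const msC = ssCols / (k - 1);
  const msE = (ssTotal - ssRows - ssCols) / ((n - 1) * (k - 1));

  // ICC(2,1): two-way random effects, absolute agreement, single measurement.
  return (msR - msE) / (msR + (k - 1) * msE + (k * (msC - msE)) / n);
}
```

The composite ICC would run this once on composite scores; per-metric ICCs run it once per rubric metric.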

What the result will mean

  • ICC ≥ 0.75 (excellent): the rubric is applicable across human graders to the standard of "agreement comparable to inter-rater agreement on published instruments in clinical psychology." A strong credibility result.
  • ICC 0.60-0.74 (good): applicable across graders, with some legitimate per-metric variance. This is the realistic target. Per-metric ICCs will highlight where the rubric needs further specification for next-version refactors.
  • ICC 0.40-0.59 (fair): modest agreement. Honest publication of this result triggers a methodology revision (clearer metric definitions, scoring rubric examples) before the next cohort.
  • ICC <0.40 (poor): the rubric in its current form does not produce reliable inter-rater agreement. Triggers a major-version revision. We publish the result anyway; the standard’s credibility comes from honest reporting, not selective publication.
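
As a small companion to the bands above, a TypeScript sketch of the Cicchetti banding applied to a published ICC; the function name and shape are illustrative, not part of the published library:

```ts
type Band = 'poor' | 'fair' | 'good' | 'excellent';

// Cicchetti (1994) bands as listed above; each band includes its lower boundary.
function cicchettiBand(icc: number): Band {
  if (icc >= 0.75) return 'excellent';
  if (icc >= 0.60) return 'good';
  if (icc >= 0.40) return 'fair';
  return 'poor';
}

// Example: cicchettiBand(0.68) === 'good', the realistic target band for the composite score.
```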

Timeline

  1. Recruitment: approximately 8 weeks. Mix of personal outreach + cold-email + warm intros through Berklee, ASCAP, and music journalism networks.
  2. Onboarding: 1 week. Each rater receives a unique scoring URL, a copy of the rubric, and a calibration entry walkthrough.
  3. Scoring window: 30 days. Raters score ~21 entries each at ~10 minutes per entry — ~3.5 hours total commitment.
  4. Analysis + publication: 1-2 weeks. ICC computation runs against the database; result is published to cohort_runs; this page renders the result automatically.
  5. Honorarium settlement: 1 week post-publish. Raters who completed ≥80% of entries receive their $200 honorarium.

For HN readers + skeptical reviewers

Yes: until this page renders results, the rubric’s inter-rater agreement is theoretical. We publish the methodology + recruitment status + κ-computation source code before the cohort completes, the same way a research lab pre-registers a study, so that when the result lands the analysis cannot be reverse-engineered to support a particular number. The κ computation lives at src/lib/cohort-kappa.ts (open source, 20 tests passing). The data tables are at supabase/migrations/add_cohort_scoring.sql. The scoring run is gated by an immutability constraint that prevents post-hoc revision.
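
To show what that immutability constraint means in practice, here is a hedged sketch of a rater submission hitting it. The table and column names are hypothetical; the real definitions live in supabase/migrations/add_cohort_scoring.sql.

```ts
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

// Hypothetical table and column names, for illustration only.
export async function submitScore(
  cohortRun: string,
  rater: string,
  corpusEntry: string,
  scores: Record<string, number>
) {
  const { error } = await supabase
    .from('cohort_scores')
    .insert({ cohort_run: cohortRun, rater, corpus_entry: corpusEntry, scores });

  // A second submission for the same (cohort_run, rater, corpus_entry) violates the
  // UNIQUE constraint and Postgres rejects it (error code 23505): once submitted,
  // a score cannot be revised after the fact.
  if (error?.code === '23505') {
    throw new Error('Score already submitted for this entry; revisions are not allowed.');
  }
  if (error) throw error;
}
```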

If you would like to be considered for the cohort, email support@songforgeai.com with a 1-paragraph background summary. Response within 7 days.
