The receipts

Every claim, traceable to a public artifact.

Most AI tools ask you to trust the marketing copy. SongForgeAI publishes the rubric, the model card, the reference corpus, the reproducibility seal contract, the inter-rater methodology, the RFCs, the incidents, the engineering report, the operating principles, and the performance budget. Every one of them is licensed under CC BY 4.0 or source-controlled in a public git repository. This page is the index.

The rubric is published.

"We score against a 12-metric rubric." Most AI tools say something like this. We publish ours under CC BY 4.0, with version, date, and complete metric definitions.

Lyric Scoring Standard v1.2.0

The model is named.

"AI-powered." We name the exact model id, the exact temperature, and the exact prompt files. The model card lists known limitations including the bias we cannot eliminate.

Model Card

The corpus is open.

"Calibrated scoring." Calibration without exemplars is unverifiable. We publish 21 hand-scored reference entries with composite + per-tier breakdowns, F-band and S-band anchors, and full rationale per entry.

Reference Corpus v0.1.2

Every score carries a seal.

"Reproducible." Every API response includes rubricVersion + model id + temperature + buildSha + build number. A third party can pin which deploy emitted any given score.

Reproducibility Seal · /api/v1/score
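
A minimal sketch of what pinning a score to a deploy could look like on the client side. It assumes the seal arrives as a `seal` object inside the JSON response and that its fields are named `rubricVersion`, `model`, `temperature`, `buildSha`, and `buildNumber`. Only `rubricVersion` and `buildSha` are spelled that way above; the rest, along with the `Seal` class and both function names, are illustrative. The authoritative shape is whatever /api/v1/score actually returns.

```python
from dataclasses import dataclass

# Assumed field names; only rubricVersion and buildSha appear verbatim on this page.
REQUIRED_FIELDS = ("rubricVersion", "model", "temperature", "buildSha", "buildNumber")


@dataclass(frozen=True)
class Seal:
    rubric_version: str
    model: str
    temperature: float
    build_sha: str
    build_number: int


def parse_seal(response: dict) -> Seal:
    """Extract the seal from a score response, failing loudly on missing fields."""
    seal = response.get("seal", {})  # the "seal" key is an assumption
    missing = [name for name in REQUIRED_FIELDS if name not in seal]
    if missing:
        raise ValueError(f"seal is missing fields: {missing}")
    return Seal(
        rubric_version=str(seal["rubricVersion"]),
        model=str(seal["model"]),
        temperature=float(seal["temperature"]),
        build_sha=str(seal["buildSha"]),
        build_number=int(seal["buildNumber"]),
    )


def same_deploy(a: Seal, b: Seal) -> bool:
    """Two scores came from the same pinned configuration only if every field matches."""
    return a == b
```

Comparing two scores is only meaningful when `same_deploy` holds for their seals; if it does not, a difference in composite may come from a rubric, model, or build change rather than from the lyric.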

Major changes ship after public RFC.

"Iterative." Most products iterate in private. Major changes to the rubric, scoring pipeline, public API, or pricing open as RFCs at /rfc with a 7-day public comment window.

Public RFC process

When something breaks, you read about it.

"Reliable." We publish a postmortem within the same week of every user-visible incident. Plainly worded. Root cause, detection, mitigation, prevention. No corporate fog.

Incidents + postmortems

Engineering activity is auditable.

"Active development." A live 14-day engineering report at /engineering is auto-generated from git history. Highlights, CI ratchets, file-touch heatmap, every commit linked to its GitHub SHA.

Engineering report (auto-generated)

How decisions get made is documented.

"Principled." 10 operating principles, each paired with the inversion we reject, plus 4 decision gates that translate the principles into operational rules.

Operating principles

Performance has a published budget.

"Fast." Most products say "fast." We commit to a public budget: total JS 5.81MB / ceiling 6.3MB, single chunk ≤ 512KB, Lighthouse floor: perf ≥ 75, a11y ≥ 90 (first measurement pending). CI fails the build on any regression. Numbers can only move down.

Performance budget
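
The budget above is mechanical enough to sketch as a CI check. This is an illustration under assumptions, not the project's actual gate: the `Budget` class, the `check_budget` function, the input shape (a mapping of chunk name to byte size plus two Lighthouse scores on a 0-100 scale), and the choice of binary units (1 MB = 1,048,576 bytes) are all hypothetical.

```python
from dataclasses import dataclass

KB = 1024       # assumption: binary units; the page does not say which it uses
MB = 1024 * KB


@dataclass(frozen=True)
class Budget:
    total_js_ceiling: int = int(6.3 * MB)  # ceiling on the sum of all JS chunks
    max_chunk: int = 512 * KB              # ceiling on any single chunk
    min_perf: int = 75                     # Lighthouse performance floor
    min_a11y: int = 90                     # Lighthouse accessibility floor


def check_budget(chunk_sizes, perf, a11y, budget=Budget()):
    """Return a list of violations; an empty list means the build passes."""
    violations = []
    total = sum(chunk_sizes.values())
    if total > budget.total_js_ceiling:
        violations.append(f"total JS {total} B exceeds {budget.total_js_ceiling} B")
    for name, size in sorted(chunk_sizes.items()):
        if size > budget.max_chunk:
            violations.append(f"chunk {name} is {size} B, over {budget.max_chunk} B")
    if perf < budget.min_perf:
        violations.append(f"Lighthouse perf {perf} is below {budget.min_perf}")
    if a11y < budget.min_a11y:
        violations.append(f"Lighthouse a11y {a11y} is below {budget.min_a11y}")
    return violations
```

A CI step would exit non-zero whenever the returned list is non-empty. Returning every violation, rather than stopping at the first, lets one failed build report everything that regressed.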

The seal contract is a published spec.

"Cryptographically signed." Every score response carries an ed25519-signed envelope binding rubric version + model + temperature + buildSha + composite. The full field schema, verification flow, public-key rotation policy, and what-it-does/doesn't-guarantee live in a published spec — not just inline source code. B2242: every key rotation also lands in a public append-only log at /seal-log so the rotation history is auditable, not just the current key.

Reproducibility Seal Specification + Rotation Log
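
For a signature to bind those fields, signer and verifier must agree byte for byte on what was signed. The sketch below shows one common way to get there: serialize only the bound fields as JSON with sorted keys and fixed separators. It is an assumption for illustration, not the published canonical form; the field names and the `signing_payload` helper are hypothetical, and the spec is the authority. The signature check itself is left out: a verifier would hand these bytes, the signature, and the published public key to an Ed25519 library.

```python
import json

# Assumed names for the five bound fields listed above.
BOUND_FIELDS = ("rubricVersion", "model", "temperature", "buildSha", "composite")


def signing_payload(envelope: dict) -> bytes:
    """Serialize the bound fields deterministically, ignoring everything else."""
    missing = [name for name in BOUND_FIELDS if name not in envelope]
    if missing:
        raise ValueError(f"envelope is missing bound fields: {missing}")
    bound = {name: envelope[name] for name in BOUND_FIELDS}
    # sort_keys and fixed separators make the output independent of dict order
    # and of whitespace choices in whatever produced the envelope.
    return json.dumps(bound, sort_keys=True, separators=(",", ":")).encode("utf-8")
```

Because only the bound fields enter the payload, changing the composite or the model changes the signed bytes and would invalidate the signature, while unrelated response fields can change freely.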

The inter-rater methodology is pre-registered.

"Validated." LLM-driven scoring without human inter-rater statistics is theoretical. We pre-publish the methodology — 30-rater cohort, ICC(2,1) per Shrout & Fleiss (1979), Cicchetti banding, immutability constraints — BEFORE the cohort completes. Same logic as a pre-registered scientific study. When the κ statistics land, this URL renders them automatically; the methodology can't be reverse-engineered from the result.

Inter-rater agreement methodology
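
For readers who want to see what ICC(2,1) measures, here is a small sketch of the Shrout & Fleiss (1979) two-way random-effects, single-rater, absolute-agreement coefficient, with the commonly cited Cicchetti (1994) bands. It is a from-scratch illustration, not the project's analysis code; the function names are hypothetical and it assumes a complete matrix in which every rater scored every item.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a list of rows: one row per rated item, one score per rater."""
    n = len(ratings)       # items (targets)
    k = len(ratings[0])    # raters (judges)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with at least 2 items and 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    item_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_items = k * sum((m - grand) ** 2 for m in item_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_error = ss_total - ss_items - ss_raters

    bms = ss_items / (n - 1)                 # between-items mean square
    jms = ss_raters / (k - 1)                # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))     # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


def cicchetti_band(icc):
    """Qualitative band per Cicchetti (1994)."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Worked example table from Shrout & Fleiss (1979): 6 targets, 4 judges.
SHROUT_FLEISS_EXAMPLE = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
```

On the example table, `icc_2_1(SHROUT_FLEISS_EXAMPLE)` comes out at roughly 0.29, which `cicchetti_band` labels "poor". The judges in that table rank items similarly but disagree on absolute level, and ICC(2,1) penalizes that because it measures absolute agreement rather than consistency.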

Every forge ships a 36-section release brief.

"AI-generated lyric." Most products stop there. Every song forged at SongForgeAI produces a Release Dossier — 36 sections of analysis covering the Executive Snapshot, Final Recommendation verdict, Release Readiness breakdown across 6 dimensions, color-coded Annotated Lyrics, Hook Lab with algorithmic Recommended Direction, Taste & Sensitivity Scan, Revision ROI projector, Competitive Placement lineage, Version Strategy (3 paths), and Song Identity Card. Ten CI gates run before any Dossier reaches the user — no contradictions ship. The Dossier IS the receipt of how the system thinks about every forge.

Release Dossier — 36 sections, 10 CI gates

Receipts prove the system; deliveries prove the output. See the before/after proof gallery at /proof.

What doesn’t work yet.

The twelve receipts above are real artifacts you can verify today. But the published-rubric story has gaps that won’t close until specific operator-side work lands. Honesty about what isn’t finished is a stronger trust signal than asserting completeness.

No human inter-rater statistics yet. The 30-rater cohort is in recruitment. Methodology is pre-registered at /scoring/standard/inter-rater; the agreement statistics auto-render there when the cohort completes (~12 weeks from recruitment close). Until then, the rubric’s inter-rater agreement claim is theoretical.
No academic citations yet. We’re positioned for ISMIR / NIME submission, but no peer-reviewed paper has cited the rubric. Operator-side outreach to relevant researchers is in progress.
No public customer testimonials yet. The TestimonialsSection (now on /reviews, cut from the homepage at B1864) renders an honest empty-state card when the registry is empty: “We don’t ship fabricated testimonials.” That stays up until a real opted-in quote lands. We’re not posting testimonials we don’t have; the empty state IS the trust posture.
The eval is currently LLM-driven. Claude Sonnet 4.6 scoring lyrics that were also produced by Claude Sonnet 4.6 is the elephant in the room. Cross-family triangulation (GPT-4o) ships as an optional path on /api/v1/score; default-on across the platform is on the roadmap. The published rubric is what makes the LLM-driven eval falsifiable in the first place: a human grader or a different model can apply the same rubric and disagree on specifics.
Calibration corpus is small (21 entries). Target: 100. Operator punch list A3 covers the expansion; it is blocked on native-speaker review for non-English entries and rights clearance for some Latin American + reggae catalog. Spanish, French, and Japanese banned-terms catalogs (RFC-0009 phase 2+) follow a similar dependency chain.
No real-time collaboration. Multiple windows / collaborators don’t see live forge progress. Engineering punch list #18; deferred until the single-user retention mechanic is proven.
Single-region deploy. Vercel serverless; Anthropic API calls cross-region per request. P99 latency is unknown outside the US. Multi-region edge deployment is on the roadmap; not blocking for v1.

When any of these items lands, this page updates and the item moves to the artifact list above. The published log is the audit trail.

What this page is, and isn’t.

This is not a marketing landing page. Every link goes to a public artifact — usually a CC BY 4.0–licensed document or an API response with a verifiable seal. If a competitor matches every one of these, we lose our differentiation, and that’s the right outcome — the category becomes more legible.

Until then, this page exists so you don’t have to take our word for any of it.