How we build

The engineering behind SongForgeAI.

The product is only as good as the process that builds it. These are the seven practices that keep quality from drifting.

Currently on Build 1041. Every claim below maps to a commit, a file, or a test in the repo.

Ratchet gates

Every CI run compares against a committed floor: passing-test count, strict-typecheck allowlist, forbidden-copy scanner. The number only moves in one direction.

  • Test-count floor lives in .github/test-count-floor.txt. Every build that drops below fails.
  • Strict-tsc allowlist in .github/strict-tsc-allowlist.txt. Files NOT on the list must stay clean of unused locals + params. New files cannot be added to the allowlist.
  • Forbidden-copy scanner catches regression of retired marketing terms (inflated adjectives we’ve banned, space-variant of the brand name).
  • Route-export validator catches non-route exports in app/**/route.ts before Next.js rejects them at build.
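The floor mechanics can be sketched in a few lines. This is a hypothetical sketch, not the real gate (which is a CI script reading .github/test-count-floor.txt); the function and result names here are illustrative:

```typescript
// Hypothetical sketch of a ratchet gate: compare the current passing-test
// count against a committed floor, and only ever let the floor move up.
type RatchetResult =
  | { ok: true; newFloor: number }
  | { ok: false; reason: string };

function checkTestCountRatchet(current: number, floor: number): RatchetResult {
  if (current < floor) {
    return {
      ok: false,
      reason: `test count ${current} fell below committed floor ${floor}`,
    };
  }
  // Passing run: the floor ratchets up to the new count, never down.
  return { ok: true, newFloor: Math.max(current, floor) };
}
```

The same shape generalizes to the allowlist and scanner gates: a committed baseline, a comparison, and a failure that can only be cleared by improving the number.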

EXTRACT-1 passes

21 disciplined extractions from the forge page monolith over 18 months. Every pass leaves behind a pure, testable module and a net-smaller root file.

  • Forge page: 2397 → 2096 lines. Examples page: 1825 → 926 lines. Dashboard SongDetail: 1641 → 1400 lines.
  • Pattern: identify a self-contained choreography (gauntlet loop, Super Boost iterator, batch stream-retry), lift it into a pure async function with injected deps, add ≥5 tests, wire the parent to consume a tagged outcome.
  • Extracted orchestrators: runGauntletStream, runBatchSuperBoost, runFixWoundsPipeline, runColdReaderPass, resolveBatchStreamResult. Each returns a discriminated union the caller pattern-matches.
  • Diminishing returns acknowledged: the next lever on `handleForge` is a state-machine rewrite, not another slice.
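The "tagged outcome the caller pattern-matches" idea looks roughly like this. The variant names below are a simplified assumption, not the real runGauntletStream contract:

```typescript
// Hypothetical shape of a tagged outcome returned by an extracted
// orchestrator (the real runGauntletStream is richer than this sketch).
type GauntletOutcome =
  | { tag: "completed"; survivingLines: number }
  | { tag: "aborted"; atRound: number }
  | { tag: "error"; message: string };

// The parent pattern-matches the tag instead of inspecting ad-hoc flags,
// so TypeScript's exhaustiveness checking enforces handling every case.
function describeOutcome(outcome: GauntletOutcome): string {
  switch (outcome.tag) {
    case "completed":
      return `gauntlet finished with ${outcome.survivingLines} lines`;
    case "aborted":
      return `gauntlet stopped at round ${outcome.atRound}`;
    case "error":
      return `gauntlet failed: ${outcome.message}`;
  }
}
```

Because the union is discriminated, adding a fourth variant makes every non-exhaustive switch a compile error rather than a silent gap.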

Golden evals

A regression harness that scores a fixed set of lyric prompts against the live rubric. Drift above threshold fails the build.

  • Pinned golden prompt set that exercises every genre + every craft axis.
  • Composite-score drift beyond a ± tolerance band trips a regression alert.
  • Combined with per-metric telemetry to distinguish "real quality shift" from "noise in one metric".
  • Paired with the prompt-tuner (Build 958) which consumes forge_metrics + regression alerts to propose A/B experiments.
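The drift check itself is simple. A minimal sketch, assuming a pinned baseline score and a symmetric tolerance band (field names and values here are assumptions):

```typescript
// Sketch of the golden-eval drift check: compare the golden set's current
// composite score against a pinned baseline; fail on drift past the band.
interface DriftCheck {
  baseline: number;  // pinned composite score for the golden prompt set
  tolerance: number; // allowed ± band before a regression alert fires
}

function scoreDrift(
  current: number,
  check: DriftCheck,
): { status: "within-band"; drift: number } | { status: "regression"; drift: number } {
  const drift = current - check.baseline;
  return Math.abs(drift) <= check.tolerance
    ? { status: "within-band", drift }
    : { status: "regression", drift };
}
```

Keeping the signed drift in the result is what lets the per-metric telemetry distinguish a genuine downward shift from noise concentrated in one metric.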

Public telemetry

Every forge writes a row to forge_metrics. We aggregate quarterly + publish the numbers. Nobody else in the category shows their distribution.

  • forge_metrics captures composite score, per-metric breakdown, prosody warnings, detail ratio, arc shape, fire-line peak, enrichment/gauntlet/boost flags, build number.
  • /metrics/quarterly aggregates with a MIN_PUBLIC_ROWS=30 suppression floor so no small-cell cross-referencing leaks identity.
  • Telemetry write-canary (Build 1018) alerts if forge_metrics gets zero rows while songs keep completing. That canary is what would have caught the batch-mode bug that silently dropped telemetry from Build 928 → 1013.
  • Nothing else in the AI lyric space publishes corpus metrics. This is a deliberate trust-and-category play.
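The suppression floor is the privacy half of publishing telemetry. A sketch of the rule, assuming a simple per-cell aggregate shape (MIN_PUBLIC_ROWS = 30 is from the text; the QuarterCell shape is hypothetical):

```typescript
// Small-cell suppression: any aggregate cell with fewer than the floor's
// worth of rows is withheld entirely rather than published, so nobody can
// cross-reference small cells to narrow results down to one user.
const MIN_PUBLIC_ROWS = 30;

interface QuarterCell {
  label: string;         // e.g. a genre or metric bucket
  rowCount: number;
  meanComposite: number;
}

function publishable(cells: QuarterCell[]): QuarterCell[] {
  return cells.filter((c) => c.rowCount >= MIN_PUBLIC_ROWS);
}
```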

Build-number discipline

BUILD_NUMBER bumps on every deploy. Currently at 1041. The number shows in the footer. The history reads as a build journal.

  • Current: Build 1041. The number goes up and never sideways.
  • Every substantive Build has a commit message and — where relevant — a note in .github/test-count-floor.txt explaining what shifted.
  • Footer StatusBadge component shows the number alongside real-time /api/health liveness across Supabase + Upstash + Stripe + Anthropic.
  • The visible-number-in-footer is a developer tell — charming to engineers, opaque to designers. It stays. Trust that compounds with specialists is worth more than the occasional "what is that number?" from a marketing reviewer.
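The badge's liveness logic reduces to a worst-of reduction over dependency checks. A hypothetical sketch (the service names come from the text; the shape of /api/health's response is an assumption):

```typescript
// Hypothetical reduction behind the footer StatusBadge: overall liveness
// is derived from per-dependency boolean health checks.
type Liveness = "up" | "degraded" | "down";

function overallStatus(services: Record<string, boolean>): Liveness {
  const states = Object.values(services);
  if (states.every((ok) => ok)) return "up";
  if (states.some((ok) => ok)) return "degraded";
  return "down";
}
```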

Safety + reliability patterns

SSE reliability HOF consolidates retry + safety-timeout + stream-open tracking. Sentry-style exception capture on every B2B API failure path.

  • withSSEStream (Build 881) centralizes the safeSend / criticalSend / 285s safety-timeout / streamOpen-flag pattern. Refine + eval routes share the same implementation; forge + Crucible each have specialized variants with domain hooks.
  • All B2B API failure paths call captureException with structured context. Sentry knows which key triggered which 5xx.
  • Per-user sliding-window rate limits with Upstash + in-memory fallback. 60 req/hour/key on the public /api/v1/score API.
  • 100KB body-size caps + CSRF origin validation on every state-changing route. 87-term banned-cliché scanner runs post-generation on every lyric.
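The in-memory fallback limiter can be sketched as a pure-ish sliding window. This is an illustrative sketch only, assuming the 60 req/hour/key figure from the text; the production path uses Upstash:

```typescript
// Sketch of a per-key sliding-window rate limit (in-memory fallback
// variant). Each key maps to its recent request timestamps; anything
// older than the window is dropped before counting.
const WINDOW_MS = 60 * 60 * 1000; // one hour
const LIMIT = 60;                 // requests per key per window

const hits = new Map<string, number[]>();

function allowRequest(key: string, now: number): boolean {
  const recent = (hits.get(key) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= LIMIT) {
    hits.set(key, recent);
    return false; // over the sliding-window budget
  }
  recent.push(now);
  hits.set(key, recent);
  return true;
}
```

Passing `now` explicitly keeps the function testable without mocking the clock, in the same spirit as the injected-deps pattern elsewhere on this page.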

Pure-module pattern

Business logic lives in pure functions with injected dependencies. React components consume tagged outcomes and update state. Tests don't need React.

  • Pattern: async orchestrator returning a discriminated-union outcome. Injected deps (onLiveQuote, runEvaluation) make the module testable without a React render.
  • Example: batch-super-boost.ts exports runBatchSuperBoost(input, deps): Promise<BatchBoostOutcome>. 6 tests hand-mock fetch + streamLyricResult; no React tree.
  • Integration tests mock Supabase via hand-rolled stubs (31 tests across the /api/v1/score surface). Covers auth + scope + rate-limit + body validation + CAS-claim invariants.
  • Anything that must touch React state stays in the component; anything that doesn't lives in a pure module.
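Put together, the pattern looks like this. A minimal sketch with illustrative names, not the real batch-super-boost API:

```typescript
// Sketch of the pure-module pattern: an async orchestrator whose side
// effects arrive as injected deps, so tests hand it stubs instead of
// mounting a React tree.
type BoostOutcome =
  | { tag: "boosted"; score: number }
  | { tag: "skipped"; reason: string };

interface BoostDeps {
  // Injected dependency; a test supplies a stub, the app supplies the
  // real evaluation call.
  runEvaluation: (lyric: string) => Promise<number>;
}

async function runBoostPass(lyric: string, deps: BoostDeps): Promise<BoostOutcome> {
  if (lyric.trim().length === 0) {
    return { tag: "skipped", reason: "empty lyric" };
  }
  const score = await deps.runEvaluation(lyric);
  return { tag: "boosted", score };
}
```

The React side only ever sees the tagged outcome: it pattern-matches the `tag` and sets state, while every branch of the business logic is exercised with plain stubs.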

Why publish this?

Most AI products hide their operating practices. That makes buyer due-diligence pure vibes — you decide whether to trust an AI tool based on its marketing copy, not its engineering. We reject that default.

The ratchet gates + golden evals + public telemetry + build-number discipline are the substrate the rubric sits on. Without them, “we score every song on 12 metrics” would drift into a marketing claim. With them, drift is a compile error.