How I cut 800 lines from a 2,800-line React component
Twenty-five extraction passes, six hooks, eight helper modules, one monolith that almost ate me. Field notes from a refactor that actually shipped.
For a long time, the forge page on SongForgeAI was 2,800 lines of TypeScript React in a single file.
It had drifted there over 18 months. Every feature, every audit, every midnight fix added a few more lines. The file grew the way all monoliths grow: one reasonable-seeming paste at a time. And then one day I opened it and realized I couldn’t find anything without Ctrl-F, which is the engineering equivalent of walking into your kitchen and not knowing which drawer the forks are in.
This post is about how I got it down to 1,940 lines across 25 extraction passes, and what I learned along the way that might actually be useful to somebody else trying the same thing.
First: why the file mattered
SongForgeAI has one user-facing workflow that touches every critical system: the forge. A user types a prompt, the app talks to Claude via SSE streaming, runs a 12-metric evaluation, optionally runs a refinement gauntlet, optionally runs a Super Boost iterative improvement loop, then renders results with cover art, audio, lyric editing, and share controls.
That whole surface — input handling, SSE consumption, state machines, retries, error classification, telemetry, persistence — lived in one file. src/app/forge/page.tsx. 2,800 lines. Every async handler used 10-15 pieces of state. Every React setter was potentially firing from any of four different handlers.
The actual business logic was fine. The organization was not.
Rule #1: Don’t refactor what users can see
The first rule I set for myself was that the user experience could not change. Not a pixel, not a rendering frame, not a loading sequence.
This sounds obvious. It’s not. Most big refactors I’ve seen in open-source repos come with a note like “while I was in there I also simplified the error messages,” and then you can’t tell whether the refactor broke things or the simplification did. Separating the work matters more than doing the work fast.
The test suite had to pass identically before and after every extraction. The footer build number had to bump every time. The Vercel deploy had to succeed on every push. If I broke those, I’d lose the ability to tell whether my 7th extraction pass introduced a regression or my 3rd pass did.
The three patterns that did the work
Twenty-five passes sounds like twenty-five different techniques. In practice, I kept reaching for three shapes.
Pattern 1: Pure validators
When a handler started with a cluster of guards — usage checks, minimum-length checks, “is the user authorized” checks — those came out first. Each became a pure function that took inputs and returned a discriminated union.
type BatchPreflightResult =
  | { ok: true }
  | { ok: false; reason: 'empty-queue' }
  | { ok: false; reason: 'no-songs'; upgradeType: 'songs' }
  | { ok: false; reason: 'batch-too-large'; upgradeType: 'batch' };
The call site replaces a 20-line if/else chain with a single function call and a switch statement on preflight.reason. The validator can be unit-tested without React. Adding a new gate is a 5-line change in the validator plus a new case in the switch — not a search-and-replace across three handlers.
I shipped validators for three handlers: preflightBatchForge, preflightRefine, preflightForge. All three share a shape, which means the next cross-cutting concern — rate limiting, feature flags, whatever — has a single shape to hook into.
Pattern 2: State-clustering hooks
The forge page had 56 pieces of React state at one point. Many of them were related — nine gauntlet-mode flags, nine batch-mode flags, seven creative-palette knobs — but each was a separate useState call scattered across a 60-line state-declaration block.
I wrote six custom hooks, one per concern:
useForgeSessionState — step machine + result + evaluation + live quotes + copied flag (10 fields)
useBatchState — batch-mode knobs + cancellation ref + wake-lock ref (11 fields)
useGauntletState — gauntlet + pre-gauntlet state (9 fields)
useGenreConfig — ghost, genre splice, voltage, clean-mode (7 fields)
useCoverArt — cover art descriptor + URL + loading + error + prompt UI (6 fields + 2 handlers)
useForgeUsage — usage snapshot + mount fetch + refresh (3 fields)
Each hook returns an object the page destructures into the same local names it already used. Every downstream reference stays identical; only the declarations changed.
Paradoxically, this wave added lines rather than removing them. Each destructuring block is wider than the 10 inline useState lines it replaces. That’s fine. The win isn’t line count — it’s that adding a new batch-mode flag is now a 2-line change in one hook file, and each hook exposes a reset() method that replaces what used to be eight inline setX(null) calls.
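A stripped-down sketch of the clustering shape, for readers who haven’t written one of these hooks. The field names are assumptions (the post only lists field counts, not names), and I’ve stubbed useState so the sketch runs outside React — in the real hook it is imported from 'react'.

```typescript
// Minimal stand-in for React's useState so this sketch runs outside a
// component. It just stores the value; it does not trigger re-renders.
function useState<T>(initial: T): [T, (v: T) => void] {
  let value = initial;
  return [value, (v: T) => { value = v; }];
}

// Hypothetical subset of the eleven batch fields, clustered in one hook.
function useBatchState() {
  const [batchMode, setBatchMode] = useState(false);
  const [batchQueue, setBatchQueue] = useState<string[]>([]);
  const [batchProgress, setBatchProgress] = useState(0);

  // One reset() replaces a storm of inline setX(null) calls at call sites.
  const reset = () => {
    setBatchMode(false);
    setBatchQueue([]);
    setBatchProgress(0);
  };

  return {
    batchMode, setBatchMode,
    batchQueue, setBatchQueue,
    batchProgress, setBatchProgress,
    reset,
  };
}
```

The page destructures the returned object into the same local names it already used — `const { batchMode, batchQueue, reset: resetBatch } = useBatchState();` — which is why every downstream reference survives unchanged.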
Pattern 3: Outcome dispatchers
Most forge handlers end with a switch on some tagged outcome. The gauntlet orchestrator returns { kind: 'kept' | 'reverted' | 'aborted' | 'skipped' }. The fix-wounds pipeline returns { kind: 'improved' | 'rejected' | 'no-output' | 'skipped' }. The batch-song processor returns a 7-way tagged union.
Each switch block was 25 to 60 lines of setter fanout + persistence + telemetry. Each got lifted into a pure-ish function:
applyGauntletOutcome(outcome, {
  result,
  preGauntletResult,
  preGauntletEvaluation,
  baselineScore,
  prompt,
  updateResult,
  setEvaluation,
  setGauntletReverted,
  persistSongUpdate,
  logClientForgeMetrics,
});
The dispatcher takes the outcome plus all the setters it might need. It pattern-matches on the kind, calls the appropriate setters and persistence functions, and returns void. Each dispatcher has its own test file that mocks every setter and asserts the right one fires for each outcome shape.
Three dispatchers: applyFixWoundsOutcome, dispatchBatchSongOutcome, applyGauntletOutcome. Together they pulled ~130 lines out of the monolith and into named, tested modules.
What stayed inline, and why
A few things I deliberately did NOT extract:
The guards at the top of each handler. They touch resultRef.current, fixingWounds, showUpgradeModal — all component-scope flags. Extracting them means threading three more props through the injection surface and the resulting call site is noisier, not cleaner.
Refs. activeForgeRef, activeGauntletRef, resultRef. These participate in the component’s lifecycle and the cancellation flow. Moving them means moving a piece of the component itself.
The primary render tree. It’s 143 lines of conditional JSX routing to five sub-components. Each sub-view is already in its own file; the page just orchestrates. Extracting the orchestrator would mean building a 60-prop interface — legitimate, but every rename ripples through both files. Not worth it today.
The LOC story was not a straight line
Raw numbers, committed to a journal:
Pre-extraction: ~2,800 lines.
After Passes 1-7 (earlier work): ~2,400 lines.
After Passes 10-18 (this session, wave one — extractions of outcome switches + helper modules): 1,884 lines.
After Passes 19-25 (this session, wave two — state hook clustering): 1,940 lines.
The line count went up in the second wave. I wrote a section in my extraction journal explaining why, because it matters: destructuring six fields is always more verbose than declaring six useStates. The gain is elsewhere — in the navigability of the file, in the reset() helpers that kill setter storms, in the testability each hook unlocks.
The real metric was never line count. It was: when I open this file, do I know where the forks are?
The CI gate that locks it in
After 25 passes, the last thing I wanted was for someone (including me, at 2am) to paste 50 lines back into forge/page.tsx and silently reinflate it.
So I added a new ratchet gate: .github/forge-loc-ceiling.txt. It holds a single number — the current file’s line count, plus five lines of headroom. CI runs check:forge-loc on every push. The file can shrink freely; growth above the ceiling fails the build.
When a future feature legitimately needs new logic, the CI failure is a forcing function: extract the logic into a sibling module, import it, done. The monolith can’t grow quietly any more.
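The core of such a ratchet is a single comparison. This is a hypothetical sketch of the logic behind a check like check:forge-loc — the post names the ceiling file and the script but doesn’t show the implementation, so treat the shape below as one plausible way to wire it.

```typescript
// Ratchet-gate core: the file may shrink freely; growth past the
// recorded ceiling fails the build.
function checkLocCeiling(
  currentLines: number,
  ceiling: number
): { ok: boolean; message: string } {
  if (currentLines > ceiling) {
    return {
      ok: false,
      message:
        `forge/page.tsx is ${currentLines} lines, ceiling is ${ceiling}. ` +
        `Extract the new logic into a sibling module instead.`,
    };
  }
  return {
    ok: true,
    message: `OK: ${ceiling - currentLines} lines of headroom remain.`,
  };
}
```

In CI, the wrapper would read the current line count of the component and the number stored in .github/forge-loc-ceiling.txt, call this function, and exit non-zero on failure.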
What I’d do differently
Start the journal earlier. I started tracking LOC only on Pass 10. Earlier counts are reconstructed from git log. A running journal would have let me spot the “line count is going up” inflection earlier and communicate it earlier, rather than writing the rationale section after the fact.
Write tests before extraction, not after. I extracted 25 times before writing a single test against the extracted code. It all worked, but I was running on npx tsc --noEmit as my only safety net. Tests for the hooks + dispatchers landed three builds after the extraction wave finished. If one had broken something invisible, I’d have found out at runtime, not CI.
Extract the render tree earlier. I put it off because of the 60-prop interface. In retrospect, that interface is the work — writing it forces the page to make explicit every piece of state the render tree reads. That’s a good outcome, not a bad one.
The quiet reason this matters
A 2,800-line React component is fine if you’re never going to touch it again. Most codebases have at least one — that file that senior engineers quietly avoid. Everyone knows what’s in it; nobody wants to edit it.
The problem isn’t the lines. It’s the compounding tax every new feature pays. Every new handler touches all the existing state. Every new render branch inherits the prop-thread. Every new engineer spends their first month afraid of the file. Eventually the team works around the monolith instead of through it, and that’s when the product stops shipping.
The extraction work above cost me about 30 hours across 25 builds. By the end, adding a new outcome branch is a 5-line change in one dispatcher + a new case in a switch. Adding a new state cluster is a new hook file. Adding a new CI gate is two files.
None of those felt heavy. They all used to.