The four-phase playbook for long-running engineering work

Shipping a product is the easy part — keeping it sharp over many years is what separates a studio from a software factory. This is the working playbook we use on every long-running client engagement, distilled from the ten or so projects we've shipped this year alone.

Why most engineering rewrites fail

When a codebase starts to feel slow to change, the instinct is to rewrite. This is almost always the wrong move. A rewrite reproduces every accidental decision the team has already made, only this time without the original commit messages explaining why.

The codebase you have is a record of every problem you've solved. The codebase you want is a record of every problem you haven't yet hit.

Three signals that a refactor is enough

The pain is concentrated in 3–5 files, not spread across the system.
You can describe the desired end state in a single paragraph.
The team agrees on what 'good' looks like without a 90-minute meeting.

If at least two of these are true, refactor. If none of them are, stop and run a discovery: you're not ready to make architectural decisions yet.
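As a rough sketch, the rule of thumb above could be written down as a small function. The type and function names here are hypothetical. Two assumptions are baked in: "two" is read as "at least two", and the case of exactly one true signal, which the rule does not cover, is reported as "unspecified" rather than guessed.

```typescript
// The three signals from the list above, answered as yes/no.
// All names in this sketch are hypothetical.
type RefactorSignals = {
  painIsConcentrated: boolean;       // pain sits in a handful of files
  endStateFitsOneParagraph: boolean; // desired end state fits a single paragraph
  teamAgreesOnGood: boolean;         // agreement without a 90-minute meeting
};

type Recommendation = "refactor" | "run-discovery" | "unspecified";

function recommend(signals: RefactorSignals): Recommendation {
  const trueCount = [
    signals.painIsConcentrated,
    signals.endStateFitsOneParagraph,
    signals.teamAgreesOnGood,
  ].filter(Boolean).length;

  // Assumption: "two of these" is read as "at least two".
  if (trueCount >= 2) return "refactor";
  if (trueCount === 0) return "run-discovery";
  // Assumption: the rule gives no guidance for exactly one signal,
  // so this sketch declines to pick an outcome.
  return "unspecified";
}
```

Returning an explicit "unspecified" keeps the gap in the rule visible instead of quietly defaulting to one side.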

The four-phase playbook

We split every long-running engagement into four phases. Each phase has an explicit exit criterion — moving to the next phase before hitting the criterion is the single largest predictor of timeline slip in our data.

Phase	Duration	Exit criterion
Discovery	1–2 weeks	A single-page brief signed by both engineering and product.
Spike	3–5 days	A throw-away prototype that proves the risky bit.
Build	4–8 weeks	Feature-complete in staging with monitoring wired up.
Harden	2+ weeks	Two weeks of zero P1 bugs before sign-off.
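One way to make the exit criteria hard to skip is to model the table as data and refuse to advance until the current criterion is met. This is a minimal sketch, not a description of our tooling: `PLAYBOOK`, `advance`, and the "Done" marker are hypothetical names, and deciding whether a criterion has been met is left to the caller.

```typescript
type PhaseName = "Discovery" | "Spike" | "Build" | "Harden";

type PhaseGate = { phase: PhaseName; exitCriterion: string };

// Phases in order, with the exit criteria copied from the table above.
const PLAYBOOK: readonly PhaseGate[] = [
  { phase: "Discovery", exitCriterion: "A single-page brief signed by both engineering and product." },
  { phase: "Spike", exitCriterion: "A throw-away prototype that proves the risky bit." },
  { phase: "Build", exitCriterion: "Feature-complete in staging with monitoring wired up." },
  { phase: "Harden", exitCriterion: "Two weeks of zero P1 bugs before sign-off." },
];

// Returns the next phase, or "Done" after the last one.
// Throws if the caller tries to leave a phase whose exit criterion is unmet.
function advance(current: PhaseName, exitCriterionMet: boolean): PhaseName | "Done" {
  const index = PLAYBOOK.findIndex((gate) => gate.phase === current);
  const gate = PLAYBOOK[index];
  if (gate === undefined) {
    throw new Error(`Unknown phase: ${current}`);
  }
  if (!exitCriterionMet) {
    throw new Error(`Cannot leave ${current}: exit criterion not met (${gate.exitCriterion})`);
  }
  const next = PLAYBOOK[index + 1];
  return next === undefined ? "Done" : next.phase;
}
```

For example, `advance("Discovery", true)` yields "Spike", while `advance("Build", false)` throws with the Build criterion in the message.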

Phase one: discovery in practice

We use a five-question framework to drive every discovery call. The questions are deliberately blunt because nuance comes later — the goal is to surface the implicit assumptions before they become contractual.

Who exactly is the user? Name a real person if you can.
What does this person do today, before our software exists?
What single metric will tell us this is working?
What is the smallest thing we can ship that moves that metric?
What does failure look like, concretely?
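If the answers are captured in a structured form, a small check can show which questions are still open before the brief goes out for signature. A sketch under assumptions: answers are stored as an array in the same order as the questions, and `unansweredQuestions` is a hypothetical helper name.

```typescript
// The five discovery questions, in the order listed above.
const DISCOVERY_QUESTIONS: readonly string[] = [
  "Who exactly is the user? Name a real person if you can.",
  "What does this person do today, before our software exists?",
  "What single metric will tell us this is working?",
  "What is the smallest thing we can ship that moves that metric?",
  "What does failure look like, concretely?",
];

// Returns every question whose answer is missing or blank.
// Assumption: answers[i] is the answer to DISCOVERY_QUESTIONS[i].
function unansweredQuestions(answers: readonly (string | undefined)[]): string[] {
  return DISCOVERY_QUESTIONS.filter((_, i) => (answers[i] ?? "").trim() === "");
}
```

An empty result means every question has at least some answer. It says nothing about whether the answers are any good.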

Phase two: the spike

Spike code is throw-away by definition. We literally start it on a branch named spike/<thing> and end it with git branch -D once the question is answered. The deliverable is a one-page memo, not the code.

A typical spike file ends up looking like this:

// spike: can we hit the legacy API under 200ms p95?
import { performance } from "node:perf_hooks";

// Sends n sequential requests and reports latency percentiles in milliseconds.
async function probe(url: string, n = 200) {
  const samples: number[] = [];
  for (let i = 0; i < n; i++) {
    const t0 = performance.now();
    const res = await fetch(url, { cache: "no-store" });
    // fetch() resolves once headers arrive; drain the body so each
    // sample covers the full response, not just time to first byte.
    await res.arrayBuffer();
    samples.push(performance.now() - t0);
  }
  samples.sort((a, b) => a - b);
  return {
    p50: samples[Math.floor(n * 0.5)],
    p95: samples[Math.floor(n * 0.95)],
    p99: samples[Math.floor(n * 0.99)],
  };
}

console.log(await probe("https://api.example.com/v1/things"));

The numbers go into the memo. The code goes in the bin. Resist the urge to clean it up — it has served its purpose.
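For the memo itself, it can help to turn recorded samples into a direct answer to the spike's question. The sketch below works on an array of latencies in milliseconds and uses the same index rule as the probe (sort ascending, take the element at floor(n × q)). `percentile` and `checkBudget` are hypothetical helper names, and the 200 ms default is taken from the comment at the top of the spike file.

```typescript
// Percentile using the same index rule as the probe:
// sort ascending, then take the element at floor(n * q).
function percentile(samples: readonly number[], q: number): number {
  if (samples.length === 0) {
    throw new Error("percentile needs at least one sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.floor(sorted.length * q));
  const value = sorted[index];
  if (value === undefined) {
    throw new Error("percentile index out of range");
  }
  return value;
}

type SpikeVerdict = { p95: number; budgetMs: number; withinBudget: boolean };

// Answers "are we under the budget at p95?" for a set of recorded samples.
// "Under" is read strictly: a p95 equal to the budget does not pass.
function checkBudget(samples: readonly number[], budgetMs = 200): SpikeVerdict {
  const p95 = percentile(samples, 0.95);
  return { p95, budgetMs, withinBudget: p95 < budgetMs };
}
```

The verdict object holds the three facts the memo needs: the measured p95, the budget it was judged against, and the yes/no answer.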

Operational discipline

Once a project is live, the entire shape of the work changes. The question is no longer 'can we build this?' but 'can we keep it healthy for the next three years?' We track four signals daily.

p95 response time across the top 5 endpoints
Error rate by surface (web, mobile, internal)
Time-to-merge for non-trivial PRs
Backlog of customer-reported bugs over 7 days old

Three of these are leading indicators of health. The fourth — backlog age — is a lagging indicator of team morale, and it's the one we miss most often when things start to slip.
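Because backlog age is the signal that slips by most easily, it is worth computing it mechanically. This is a minimal sketch that assumes a hypothetical `BugReport` shape; the field names are illustrative and do not come from any particular tracker. A bug counts toward the backlog when it is customer-reported, unresolved, and more than seven days old.

```typescript
// Hypothetical shape for a bug record; adapt the fields to your tracker.
type BugReport = {
  id: string;
  reportedAt: Date;
  customerReported: boolean;
  resolved: boolean;
};

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Open, customer-reported bugs that were filed more than 7 days before `now`.
function staleCustomerBugs(bugs: readonly BugReport[], now: Date): BugReport[] {
  return bugs.filter(
    (bug) =>
      bug.customerReported &&
      !bug.resolved &&
      now.getTime() - bug.reportedAt.getTime() > SEVEN_DAYS_MS,
  );
}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the daily number reproducible and easy to test.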

Tips and footnotes

CO₂ emissions per build are tracked in our staging dashboard. We aim for an O(log n) growth profile.

Want the full template? Read the studio handbook, or check our public engineering principles.

Studio Notes · Codeflee Engineering Team