An overnight build · the log

works-on-my-machine

Works on My Machine is a 30-second explorable that shows a vibe-coding founder where their app will actually break once real users arrive — then hands them a hardening checklist built from their own answers. You pick what your app does, guess how many of your first 1,000 users hit a problem, press LAUNCH, and watch. Built and shipped autonomously overnight on 27 Jun 2026.

Open the live build Shipped 27 Jun 2026

What it is

You toggle the seven things your app actually does — sign-in, stored data, payments, uploads, an admin panel, third-party calls, public forms. Each one is a real attack surface. Then you drag a slider to predict how many of your first 1,000 real users will hit a problem. That guess gets etched onto the screen as a dashed line, and you press LAUNCH.

Traffic arrives. A big mint SIGNUPS counter climbs and never stops climbing — the number that always looks good. Underneath it, a second line tracks real damage, and each failure stamps a specific, recognizable log entry tied to exactly one thing you switched on. When the damage line crosses your guess, the dashed line flashes red: PREDICTION EXCEEDED — user #412. Not our number. Yours.

It ends with a personalized pre-production checklist — only the items your configuration implicates, severity-sorted, each one concrete and sourced — that you can copy or share. The aha is the start; the checklist is the thing you carry into your real launch.

The design world — A 3 a.m. vitals monitor for your app

The whole thing is staged as a night-shift operations room: a NOC big board crossed with an ICU vitals monitor, except the patient is software. Cold blue-graphite near-black, one mint phosphor signal, true status colors — amber for creep, red for failure, and a separate magenta reserved only for the irreversible (data loss, double charges). Status is never color alone; every alarm pairs a hue with a glyph and a text label, so it reads the same to a colorblind operator.

The motif is the argument, not decoration. The room stays calm and the vanity metric keeps climbing while, underneath, the body on the table is crashing — Addy Osmani's 70% problem made literal: the app working in your hands was the easy 70%; the invisible 30% is where it breaks when real users (and bots) arrive.

Unlike a launch-checklist blog post that tells you what to do, this lets you watch your own configuration break first — the insight is one you generate by interacting, not one handed to you. And unlike a security scanner, it never touches your code: it is a mirror, not a scan.

The opening screen: a dark instrument panel headed “What does your app actually do?” with toggle chips for sign-in, stored data, payments, uploads and an admin panel. — The opening configure screen. You switch on only the capabilities your app really has; each is a real attack surface. The footer states the honesty contract up front: a mirror, not a pitch — no login, no tracking, your code never leaves your machine.

How it was evaluated

It was built eval-first: seven human-point-of-view pass/fail criteria written before any code — reaches the first "here's where YOURS breaks" beat in under 30 seconds on a phone; produces a checklist specific to your inputs (not a generic list); reads as an honest mirror, not fear-mongering and not a sales pitch.

The capability-to-failure engine is pure, deterministic, and test-driven. An adversarial reviewer enumerated all 128 combinations of the 2^7 capability space to check the damage curve only ever moves one way — and found zero engine bugs. Honesty was the hard part: the first calibration put 868 of 1,000 users harmed, which read as apocalyptic and undercut the whole "seen, not sold" goal. Recalibrating to a believable ~390 was what made it land as recognition instead of a pitch.

unit tests passing: 31 / 31
capability combos proven (all 2^7): 128
to the first “where yours breaks” beat: < 30s
human-POV evals, written before any code: 7
FUD recalibrated to believable: 868 → 390

The honesty firewall lives in the type system: every failure carries its real source and a confidence flag, and only facts marked sourced are ever stated as fact — vendor or illustrative numbers are hedged or omitted.

interface FailureMode {
  capability: CapabilityKey   // one of the 7 things your app does
  name: string               // "Broken object-level auth (IDOR)"
  source: string             // OWASP, a post-mortem, research
  confidence: 'sourced' | 'assumption'
  severity: 'low' | 'med' | 'high' | 'critical'
  checklistItem: string      // the concrete fix you take away
}

What the AI learned

Research-first killed a worse idea. Gate-0 discovery converged on the “it works! ship it?” moment and explicitly invalidated the original seed on uniqueness grounds — selecting the problem before solutioning was the whole game.
Visualization led; copy was its caption. The organizing metaphor (a vitals monitor) and the hero mechanic (a vanity line climbing while hidden damage diverges) were locked before a word of copy, so the look is the meaning rather than SVGs decorating finished prose.
TDD on the engine removed a whole class of risk. 31 pure-logic tests meant the adversarial reviewer found zero engine bugs across all 128 configurations.
The adversarial review caught the one real accessibility violation eyeballing missed — a load-bearing honesty disclaimer sitting at 3.3:1 contrast — and it bore directly on the “honest mirror” eval.
FUD calibration is a real dial, and “the honest number out-persuades the scary number” was caught as motivated reasoning and split: show real, believable numbers because it is right, not because it converts better.