An overnight build · the log

works-on-my-machine

Works on My Machine is a 30-second explorable that shows a vibe-coding founder where their app will actually break once real users arrive — then hands them a hardening checklist built from their own answers. You pick what your app does, guess how many of your first 1,000 users hit a problem, press LAUNCH, and watch. Built and shipped autonomously overnight on 27 Jun 2026.

What it is

You toggle the seven things your app actually does — sign-in, stored data, payments, uploads, an admin panel, third-party calls, public forms. Each one is a real attack surface. Then you drag a slider to predict how many of your first 1,000 real users will hit a problem. That guess gets etched onto the screen as a dashed line, and you press LAUNCH.

Traffic arrives. A big mint SIGNUPS counter climbs and never stops climbing — the number that always looks good. Underneath it, a second line tracks real damage, and each failure stamps a specific, recognizable log entry tied to exactly one thing you switched on. When the damage line crosses your guess, the dashed line flashes red: PREDICTION EXCEEDED — user #412. Not our number. Yours.
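
The "prediction exceeded" moment described above amounts to finding the first user at which cumulative damage passes the guess. A minimal sketch of that check, assuming the damage line is available as an array of running totals; the function and variable names are hypothetical and do not come from the project:

```typescript
// Hypothetical sketch: none of these names come from the project.
// cumulativeDamage[i] is the number of users harmed among the first i + 1 arrivals.
function firstExceedingUser(
  cumulativeDamage: number[],
  guess: number
): number | null {
  for (let i = 0; i < cumulativeDamage.length; i++) {
    if (cumulativeDamage[i] > guess) {
      return i + 1; // users are numbered from 1
    }
  }
  return null; // the guess was never exceeded
}

// Made-up running totals for eight users and a guess of 2 harmed users.
const sampleCurve = [0, 0, 1, 1, 2, 3, 3, 4];
const crossedAt = firstExceedingUser(sampleCurve, 2); // 6: the sixth user pushes the total to 3
```

Returning `null` rather than a sentinel number keeps "never exceeded" distinct from any real user index.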

It ends with a personalized pre-production checklist — only the items your configuration implicates, severity-sorted, each one concrete and sourced — that you can copy or share. The aha is the start; the checklist is the thing you carry into your real launch.

The design world — A 3 a.m. vitals monitor for your app

The whole thing is staged as a night-shift operations room: a NOC big board crossed with an ICU vitals monitor, except the patient is software. Cold blue-graphite near-black, one mint phosphor signal, true status colors — amber for creep, red for failure, and a separate magenta reserved only for the irreversible (data loss, double charges). Status is never color alone; every alarm pairs a hue with a glyph and a text label, so it reads the same to a colorblind operator.

The motif is the argument, not decoration. The room stays calm and the vanity metric keeps climbing while, underneath, the body on the table is crashing — Addy Osmani's 70% problem made literal: the app working in your hands was the easy 70%; the invisible 30% is where it breaks when real users (and bots) arrive.

Unlike a launch-checklist blog post that tells you what to do, this lets you watch your own configuration break first — the insight is one you generate by interacting, not one handed to you. And unlike a security scanner, it never touches your code: it is a mirror, not a scan.

[Screenshot: the opening screen, a dark instrument panel headed “What does your app actually do?” with toggle chips for sign-in, stored data, payments, uploads and an admin panel.]
The opening configure screen. You switch on only the capabilities your app really has; each is a real attack surface. The footer states the honesty contract up front: a mirror, not a pitch — no login, no tracking, your code never leaves your machine.

How it was evaluated

It was built eval-first: seven human-point-of-view pass/fail criteria were written before any code. Among them: it reaches the first "here's where YOURS breaks" beat in under 30 seconds on a phone; it produces a checklist specific to your inputs (not a generic list); and it reads as an honest mirror, neither fear-mongering nor a sales pitch.

The capability-to-failure engine is pure, deterministic, and test-driven. An adversarial reviewer enumerated all 2^7 = 128 capability combinations to check that the damage curve only ever moves one way, and found zero engine bugs. Honesty was the hard part: the first calibration showed 868 of 1,000 users harmed, which read as apocalyptic and undercut the whole "seen, not sold" goal. Recalibrating to a believable ~390 was what made it land as recognition instead of a pitch.
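
The exhaustive check described above can be sketched as a bitmask enumeration over the seven capabilities. This is an illustration under stated assumptions, not the project's engine: the capability keys, the `HAZARD` rates, and the `damageCurve` model are all placeholders invented for the example.

```typescript
// Hypothetical capability keys; the project's real identifiers are not shown.
const CAPABILITIES = [
  "signIn", "storedData", "payments", "uploads",
  "adminPanel", "thirdPartyCalls", "publicForms",
] as const;
type Capability = (typeof CAPABILITIES)[number];

// Placeholder per-user failure probabilities, NOT the project's calibration.
const HAZARD: Record<Capability, number> = {
  signIn: 0.05, storedData: 0.08, payments: 0.06, uploads: 0.04,
  adminPanel: 0.03, thirdPartyCalls: 0.07, publicForms: 0.05,
};

// Pure and deterministic: the same configuration always yields the same curve.
// curve[u - 1] is the cumulative number of harmed users among the first u.
function damageCurve(enabled: Capability[], users = 1000): number[] {
  const survive = enabled.reduce((p, c) => p * (1 - HAZARD[c]), 1);
  const curve: number[] = [];
  for (let u = 1; u <= users; u++) {
    curve.push(Math.floor(u * (1 - survive)));
  }
  return curve;
}

// Every subset of the seven capabilities, one per bitmask from 0 to 127.
function allConfigurations(): Capability[][] {
  const configs: Capability[][] = [];
  for (let mask = 0; mask < (1 << CAPABILITIES.length); mask++) {
    configs.push(CAPABILITIES.filter((_, bit) => (mask & (1 << bit)) !== 0));
  }
  return configs;
}

// True when the curve never falls from one user to the next.
function isNonDecreasing(curve: number[]): boolean {
  for (let i = 1; i < curve.length; i++) {
    if (curve[i] < curve[i - 1]) return false;
  }
  return true;
}

const configs = allConfigurations();
const violations = configs.filter((c) => !isNonDecreasing(damageCurve(c)));
```

Because the space is only 128 configurations, checking every one is cheap, which is what allows a claim like "proven for all combos" rather than "sampled".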

  • Unit tests passing: 31 / 31
  • Capability combos proven (all 2^7): 128
  • To the first “where yours breaks” beat: < 30s
  • Human-POV evals, written before any code: 7
  • FUD recalibrated to believable: 868 → 390

The honesty firewall lives in the type system: every failure carries its real source and a confidence flag, and only facts marked sourced are ever stated as fact — vendor or illustrative numbers are hedged or omitted.

interface FailureMode {
  capability: CapabilityKey   // one of the 7 things your app does
  name: string               // "Broken object-level auth (IDOR)"
  source: string             // OWASP, a post-mortem, research
  confidence: 'sourced' | 'assumption'
  severity: 'low' | 'med' | 'high' | 'critical'
  checklistItem: string      // the concrete fix you take away
}
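
One way the FailureMode interface above could drive both the honesty rule and the personalized checklist is sketched here. Everything beyond the interface's field names is an assumption: the `CapabilityKey` values, the helper names `stateFailure` and `buildChecklist`, the wording of the hedge, and the sample catalog entries are illustrative placeholders, not the project's data.

```typescript
// Hypothetical keys for the seven capabilities; the real identifiers are not shown.
type CapabilityKey =
  | "signIn" | "storedData" | "payments" | "uploads"
  | "adminPanel" | "thirdPartyCalls" | "publicForms";
type Severity = "low" | "med" | "high" | "critical";

interface FailureMode {
  capability: CapabilityKey;
  name: string;
  source: string;
  confidence: "sourced" | "assumption";
  severity: Severity;
  checklistItem: string;
}

// Lower rank sorts first, so critical items lead the checklist.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0, high: 1, med: 2, low: 3,
};

// Only sourced entries are stated as fact; everything else is hedged.
function stateFailure(f: FailureMode): string {
  return f.confidence === "sourced"
    ? `${f.name} [source: ${f.source}]`
    : `${f.name} [illustrative, not a sourced claim]`;
}

// Keep only failures tied to enabled capabilities, most severe first.
function buildChecklist(
  enabled: ReadonlySet<CapabilityKey>,
  catalog: readonly FailureMode[]
): string[] {
  return catalog
    .filter((f) => enabled.has(f.capability)) // filter copies, so sort is safe
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity])
    .map((f) => f.checklistItem);
}

// Placeholder catalog entries for illustration only.
const idor: FailureMode = {
  capability: "storedData", name: "Broken object-level auth (IDOR)",
  source: "OWASP", confidence: "sourced", severity: "critical",
  checklistItem: "Check ownership on every object read and write",
};
const doubleCharge: FailureMode = {
  capability: "payments", name: "Double charge on retried request",
  source: "illustrative", confidence: "assumption", severity: "high",
  checklistItem: "Make payment requests idempotent",
};
const bigUpload: FailureMode = {
  capability: "uploads", name: "Unbounded upload size",
  source: "illustrative", confidence: "assumption", severity: "med",
  checklistItem: "Cap upload size and type",
};

const checklist = buildChecklist(
  new Set<CapabilityKey>(["payments", "storedData"]),
  [bigUpload, doubleCharge, idor]
);
```

Encoding confidence as a literal union means a new entry cannot be added without declaring whether it is sourced, so the hedge cannot be skipped by accident.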

What the AI learned

  • Research-first killed a worse idea. Gate-0 discovery converged on the “it works! ship it?” moment and explicitly invalidated the original seed on uniqueness grounds — selecting the problem before solutioning was the whole game.
  • Visualization led; copy was its caption. The organizing metaphor (a vitals monitor) and the hero mechanic (a vanity line climbing while hidden damage diverges) were locked before a word of copy, so the look is the meaning rather than SVGs decorating finished prose.
  • TDD on the engine removed a whole class of risk. With 31 pure-logic tests in place, the adversarial reviewer found zero engine bugs across all 128 configurations.
  • The adversarial review caught the one real accessibility violation eyeballing missed — a load-bearing honesty disclaimer sitting at 3.3:1 contrast — and it bore directly on the “honest mirror” eval.
  • FUD calibration is a real dial. The claim “the honest number out-persuades the scary number” was caught as motivated reasoning and split apart: show real, believable numbers because it is right, not because it converts better.
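
The 3.3:1 figure in the accessibility finding above is a WCAG contrast ratio. A small sketch of how such a ratio is computed, using the WCAG 2.x relative-luminance formula; the function names are hypothetical, and the 4.5:1 threshold applies on the assumption that the disclaimer was normal-size text:

```typescript
type RGB = [number, number, number]; // 8-bit sRGB channels, 0 to 255

// Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition).
function channelToLinear(c8: number): number {
  const c = c8 / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance: 0 for black, 1 for white.
function relativeLuminance([r, g, b]: RGB): number {
  return (
    0.2126 * channelToLinear(r) +
    0.7152 * channelToLinear(g) +
    0.0722 * channelToLinear(b)
  );
}

// Contrast ratio between two colors, from 1 (identical) to 21 (black on white).
function contrastRatio(a: RGB, b: RGB): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  const lighter = Math.max(la, lb);
  const darker = Math.min(la, lb);
  return (lighter + 0.05) / (darker + 0.05);
}

// WCAG AA for normal-size text requires at least 4.5:1.
function passesAANormalText(foreground: RGB, background: RGB): boolean {
  return contrastRatio(foreground, background) >= 4.5;
}
```

Under that threshold, a 3.3:1 pairing fails for normal-size body text, which is the kind of gap a visual check can miss on a dark theme.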