I shipped a game solo. Here's what AI couldn't do.

Claude Code made the scope possible. It did not make the product obvious.

May 07, 2026

Steve Jobs called the computer “a bicycle for the mind”: a tool humans build to “amplify these inherent abilities that we have to spectacular magnitudes.”

But what happens when the tools get fast enough to outrun the rider?

Code has become the cheap part. The expensive parts are constraint definition, taste, validation, and figuring out when a technically impressive system makes the product worse.

the tl;dr: I built a word puzzle game called Right Words as a side project while working full-time. You trace two-word phrases in a grid and jump a gap between the two words. Every tile belongs to exactly one phrase.

The Problem Behind the Puzzle

A Right Words puzzle is a cluster of six two-word phrases crammed into a grid. The packing has to meet a series of constraints that are, frankly, deceptively convoluted:

every cell can belong to only one answer
consecutive tiles have to be grid-adjacent, including diagonals
the gap between word1 and word2 of a phrase must be a one-tile skip (cardinal or diagonal), giving at most 8 possible landing tiles
no alternative tile arrangement can spell the same phrase (requires exhaustive path enumeration)
if two tiles share a letter, swapping them can’t produce a valid alternative traversal
no repeated words across answers, no word1 that doubles as another answer’s word2
every phrase has to be a real two-word phrase people actually recognize, within a tight character limit
the difficulty system optionally requires one answer to have a readable “gimme” word while preventing all others from having straight-line words
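The adjacency and jump rules in that list come down to a little grid geometry. Here is a minimal sketch of it, not the game's actual code: the names `Cell`, `isAdjacent`, and `jumpTargets` are made up for illustration, and it assumes a rectangular grid where a jump skips exactly one tile in a straight line.

```typescript
// Hypothetical types and helper names; the post does not show the real code.
type Cell = { row: number; col: number };

// The eight compass directions: four cardinal, four diagonal.
const DIRECTIONS: ReadonlyArray<readonly [number, number]> = [
  [-1, -1], [-1, 0], [-1, 1],
  [0, -1],           [0, 1],
  [1, -1],  [1, 0],  [1, 1],
];

function inBounds(cell: Cell, rows: number, cols: number): boolean {
  return cell.row >= 0 && cell.row < rows && cell.col >= 0 && cell.col < cols;
}

// Consecutive tiles within a word must touch, diagonals included.
// A cell is not adjacent to itself.
function isAdjacent(a: Cell, b: Cell): boolean {
  const dr = Math.abs(a.row - b.row);
  const dc = Math.abs(a.col - b.col);
  return Math.max(dr, dc) === 1;
}

// Landing tiles for the jump between word1 and word2: two steps away
// in one of the eight directions, so at most eight candidates.
function jumpTargets(from: Cell, rows: number, cols: number): Cell[] {
  return DIRECTIONS
    .map(([dr, dc]) => ({ row: from.row + 2 * dr, col: from.col + 2 * dc }))
    .filter((cell) => inBounds(cell, rows, cols));
}
```

The cap of eight only holds away from the border. From a corner tile, most of the landing tiles fall off the grid, which is one reason edge placements are more constrained.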

The packer uses the MRV (minimum remaining values) heuristic for phrase selection, a seeded RNG for reproducibility, and restart limits to escape dead ends. False-positive detection runs a DFS against the full grid for every placed answer.
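For readers who haven't met these terms, here is a rough sketch of how MRV selection and a seeded RNG can fit together. It is an illustration under assumptions, not the packer itself: `mulberry32` is a small, widely used public-domain PRNG that I'm swapping in as an example (the post doesn't say which generator the game uses), and `pickNextPhrase` and its input shape are hypothetical.

```typescript
// mulberry32: a small seeded PRNG. Same seed in, same sequence out,
// which is what makes a packing run reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Hypothetical input: each unplaced phrase with the number of legal
// placements it still has on the current grid.
interface Candidate {
  phrase: string;
  placements: number;
}

// MRV: place the most constrained phrase next (fewest placements left).
// Ties are broken with the seeded RNG so reruns make the same choice.
function pickNextPhrase(
  candidates: ReadonlyArray<Candidate>,
  rng: () => number,
): string | undefined {
  let min = Infinity;
  let best: string[] = [];
  for (const { phrase, placements } of candidates) {
    if (placements < min) {
      min = placements;
      best = [phrase];
    } else if (placements === min) {
      best.push(phrase);
    }
  }
  if (best.length === 0) return undefined;
  return best[Math.floor(rng() * best.length)];
}
```

The point of placing the tightest phrase first is that dead ends show up early, before the grid is mostly full, which keeps restarts cheap.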

Friends keep asking when it’ll go daily. Producing six valid, non-overlapping, theme-coherent two-word phrases that pack into a grid with zero false positives, under all the constraints above, is not something I can do on that schedule (hello, Tracy Bennett? Sam Ezersky?!). 🥲

I hadn’t built a constraint-satisfaction solver before, so I designed the rules and architecture, then used Claude Code to accelerate the implementation. I defined what the solver needed to enforce; Claude built the packer, validator, and path generator. I tested the output and caught the cases it missed, like a path ambiguity bug where two same-letter tiles separated by a different letter could be swapped without breaking adjacency.

That division (I define, Claude builds, I verify) is what made the scope possible.
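The same-letter swap bug is the kind of case an exhaustive path count catches. Below is a hedged sketch of that check: a depth-first search that counts every distinct tile sequence spelling a phrase, jump included. The function name `countPaths`, the grid representation, and the choice not to inspect the skipped tile are all assumptions for illustration, not the game's validator.

```typescript
// Hypothetical representation: one uppercase letter per cell.
type Grid = string[][];

const DIRS: ReadonlyArray<readonly [number, number]> = [
  [-1, -1], [-1, 0], [-1, 1],
  [0, -1],           [0, 1],
  [1, -1],  [1, 0],  [1, 1],
];

// Counts distinct tile sequences that spell word1, jump one tile, then
// spell word2. Exactly 1 means the phrase is unambiguous; more than 1
// means an alternative traversal (a false positive) exists.
function countPaths(grid: Grid, word1: string, word2: string): number {
  const letters = word1 + word2;
  const jumpAt = word1.length; // index of the first letter of word2
  const rows = grid.length;
  const cols = grid[0]?.length ?? 0;
  const used = grid.map((row) => row.map(() => false));

  const dfs = (r: number, c: number, i: number): number => {
    if (r < 0 || r >= rows || c < 0 || c >= cols) return 0;
    if (used[r][c] || grid[r][c] !== letters[i]) return 0;
    if (i === letters.length - 1) return 1; // spelled the whole phrase
    used[r][c] = true; // a tile can appear only once in a path
    // Moving onto word2's first letter skips one tile: step 2, not 1.
    const step = i + 1 === jumpAt ? 2 : 1;
    let total = 0;
    for (const [dr, dc] of DIRS) {
      total += dfs(r + dr * step, c + dc * step, i + 1);
    }
    used[r][c] = false; // backtrack
    return total;
  };

  let total = 0;
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      total += dfs(r, c, 0);
    }
  }
  return total;
}
```

Because the search counts tile sequences rather than spelled strings, two same-letter tiles that can stand in for each other show up as two paths, even though both read as the same phrase.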

note: I previously discussed this dynamic in Judgment Doesn’t Scale With Speed

The Breadth Problem Solved

Justin Searls wrote about “full-breadth developers”, people who do both technical execution and product thinking. His take is that AI rewards this combination disproportionately. I’d say it more plainly: I could design Right Words before AI. But I couldn’t implement the full stack alone in any reasonable timeframe. It would’ve sat in the backlog of ideas piled high on my list of purchased and unhosted domain names.

None of the individual subsystems were impossible. In fact, all of them are pretty doable on their own (a constraint solver, puzzle packer, fun UI/UX interactions when selecting words and creating skip chains, and all the usual mobile game goodness in share cards, etc.). The hard part is keeping product intent coherent across all of them at once, and doing it in the margins of a full-time job. That’s the breadth problem. An AI collaborator that could move between the packer algorithm and a CSS animation timing bug in the same conversation is ultimately what made the scope feasible.

Peter Steinberger (of OpenClaw fame) has a technique for bulletproofing specs using two AI contexts: one generates, one critiques. I applied that combo to a layered doc hierarchy (SPEC.md → ARCHITECTURE.md) that the AI reads in every conversation (and applies to its own memory and CLAUDE.md), with each layer narrowing from broad product intent to specific code constraints. Precise specs meant precise output, which meant speed. Vague specs meant debugging the spec, not the code, and the bugs nearly always came from the vague parts.

The Human Element Remains

I even had AI try to generate the puzzle content. It returned a Greek mythology theme that included STAR FISH (theme incoherency aside, starfish is one word!). The same list had TROJAN HORSE twice and HERCULES STRENGTH (?), and a bunch of one-word “phrases.” You cannot delegate content curation.

Picking themes is harder than it sounds. The theme name itself is a difficulty lever: “Fast Food” is obvious, “Legendary Lines” requires you to realize these are all idioms from Greek mythology (after ideally spending no more than a couple of minutes looking for a “Here’s Johnny!”). Monday themes run more cryptic than Friday themes, by design. I choose each one knowing that a confused player bounces and a condescended player stops coming back.

Word selection is worse. Every phrase has to be immediately recognizable, but not so common that it shows up in multiple themes. The two words have to be long enough to create interesting paths (minimum three characters each, for now) but short enough to fit the grid. Some perfect phrases just don’t pack: too many common letters create false positives, and consecutively repeated letters leave the packer with no room to route other answers around them. I’ve thrown away dozens of great phrases because the algorithm couldn’t use them without creating an alternate valid path.

I hand-curate every word list. I hand-place the paths I care about. For example, I placed GORDIAN KNOT tile-by-tile in the visual designer, locked it, and told the packer to arrange five other phrases around my shape, a puzzlingly unnecessary use of my time to give one phrase of one puzzle a specific aesthetic. This required building shape-shifting, pre-placed chain support, and updating the CLI to support completion of a --partial pack. Claude built the machinery in an hour. I spent three days deciding what the puzzle should look like. It can build anything you want! It just won’t want it for you (and in an era of creative slop being dispensed like warm soft serve, that’s a good thing).

The Difficulty Trap

I spent entirely too much time building a hard mode that rejected any packing where a word read in a straight line. Maximum difficulty.

Then, a friend played it and said, “95% of the addressable market does these puzzles to feel clever. You get better uptake by adding confetti to winning than figuring out how to obscure things.”

I’d been building for my five most hardcore friends. Analytics revealed that the players who left did so because they had no foothold.

So I (sadly) scrapped it. Fridays get a word that reads left-to-right. Mondays get one readable vertically on an edge. The “gimme” system exists because someone with fresh eyes identified the difficulty I had grown accustomed to (and thus kept trying to crank higher). I cannot stress the importance of early user testing enough!

Fast Feels Free

The tutorial has a trampoline animation on the gap tile during a jump. Worked fine in the normal game. In the tutorial, it replayed on every subsequent tap.

I prioritized AI “correction” speed over diagnosis and went with the first fix: suppress the pulse on the gap tile. This killed the visual hint showing players which tile to tap next.

Second attempt: stabilize the key so pulse changes wouldn’t remount the tile. Pulse animations went out of sync.

Third: timed effect, apply the trampoline for 500ms then clear it. Rapid tapping re-triggered it.

Fourth: ref-based tracking. Didn’t fire at all on one answer due to setState timing.

Fifth: duplicate the Grid component for the tutorial, decouple the trampoline from React’s key system entirely.

The fifth worked. But I should have gotten there on attempt one. The bug only appeared in the tutorial because, in the normal game, a !state.jump gate empties validMoves after a jump, so the gap tile is never pulsed in normal gameplay. Asking “why only here?” instead of “how do I fix it?” would have saved four attempts.

Claude gave me five implementations in the time it would’ve taken to try two. But I was moving fast because fast felt free, and I skipped the part where you explain the bug to yourself before trying to fix it.

What Stayed Difficult

The hard parts were deciding what difficulty means for casual players. Or a puzzle feeling “less fun” until I traced a new version that “felt” right. Or deciding to duplicate a 250-line component rather than risk a small change to the Grid in production, now that there’s a small but loyal group of people who play every week.

That is the part I think people understate about AI-assisted development. It does not remove engineering judgment. It changes where judgment is spent.

Jobs said we’re tool builders, and the best tools amplify our inherent abilities to spectacular magnitudes. The bicycle for the mind assumed the rider knew the route. AI is not a bicycle. It’s an orchestra. You still need a conductor, a score worth playing, and the ear to know when it sounds like shit.

my side projects waiting for me to get to them, 2024 BC (Before Claude)

Without Claude Code, Right Words would still be a purchased domain name on my list of ideas I’ll get to someday. 💀 With it, the game shipped, people play it every week, and I spend my time on what the next puzzle should feel like instead of debugging a path validator at 2am. I mean… I still debug path validators at 2am. But now it’s a choice.

load-bearing printf
