Play it here: https://jwbatey.com/parallax/

Caution, it’s alpha at best.

This page is mostly AI text… in the spirit of this project. I am using this project as a way to test how far AI has advanced. I set up the structure and described the game I want, and the scaffolding (which the AI can edit) goes and actually generates the game.

Parallax: a point-and-click adventure built with model-generated content

I’m building Parallax, a 2D point-and-click adventure that borrows its pacing and tone from Monkey Island. The twist is practical: most of the writing and art starts as output from language models and image models, then gets assembled and tested through a repeatable pipeline.

I’m not trying to let a model “make a game on its own.” I treat models like content compilers. They take a small set of inputs and produce structured assets (JSON graphs, prompts, images) that I can inspect, validate, and rerun like any other build.

The bet is that game content can be handled like compiled output: you keep a small, reviewable source-of-truth, and you generate everything else in a way that’s repeatable and inspectable. Models are useful here because they’re fast at producing draft artifacts. The project only stays sane if those artifacts are constrained, validated, and easy to regenerate.


What I’m aiming for


Where it is right now

Parallax has two parts that cooperate.

I keep the project split into three layers:

That separation sounds mundane, but it’s the difference between “I can rerun this safely” and “I’m scared to touch anything.”

1) Content generation pipeline (Python)

One pipeline run can produce:

The loop is “edit inputs, regenerate outputs.” I try hard to avoid hand-editing downstream artifacts.

Under the hood, the pipeline behaves like a small build system:

That last point matters more than it sounds: when something gets weird, I can answer “what produced this?” without guessing.
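One way to make "what produced this?" answerable is a sidecar manifest written next to each generated artifact. This is a minimal sketch, not the project's actual code; the function names and manifest fields are my own illustration of the idea.

```python
import hashlib
import json
from pathlib import Path

def input_fingerprint(paths):
    """Hash the source inputs so each output can record what produced it."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(Path(p).read_bytes())
    return h.hexdigest()[:16]

def record_provenance(output_path, inputs, stage):
    """Write a sidecar manifest next to the generated artifact."""
    manifest = {
        "output": str(output_path),
        "stage": stage,
        "inputs": [str(p) for p in sorted(inputs)],
        "fingerprint": input_fingerprint(inputs),
    }
    Path(str(output_path) + ".manifest.json").write_text(
        json.dumps(manifest, indent=2)
    )
    return manifest
```

With something like this in place, a weird output can be traced back to the exact input set that produced it, and a changed fingerprint tells you a rerun is warranted.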

2) A playable debug environment (static HTML/JS)

The Parallax Debug Suite is a static HTML/JavaScript toolkit that renders the game from data:

A few debug affordances have been disproportionately valuable:

It also has a deliberately retro entry point (a DOS-style boot screen). That part is pure tone-setting, and it makes me smile.


The “LLM as compiler” workflow

Generation starts from a small set of stable inputs:

Typical flow:

One choice that has paid off: dialogue generation happens in two passes. Pass one is quick drafting. Pass two reshapes the draft into a stricter graph format and runs validation. It keeps the pace high while still giving me structure I can trust.
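The two-pass shape might look something like the sketch below. The `model_call` argument stands in for whichever client is actually used, and the node/edge field names are assumptions, not the project's real schema.

```python
import json

def draft_dialogue(scene_brief, model_call):
    """Pass one: a fast, free-form draft."""
    return model_call(f"Write a short dialogue for: {scene_brief}")

def to_strict_graph(draft, model_call):
    """Pass two: reshape the draft into a strict node/edge graph, then validate."""
    raw = model_call(
        "Convert this dialogue into JSON with `nodes` (id, speaker, text) "
        f"and `edges` (from, to):\n{draft}"
    )
    graph = json.loads(raw)
    validate_graph(graph)  # fail loudly before anything downstream runs
    return graph

def validate_graph(graph):
    """Minimal structural check: every edge must point at a real node."""
    ids = {n["id"] for n in graph["nodes"]}
    for e in graph["edges"]:
        if e["from"] not in ids or e["to"] not in ids:
            raise ValueError(f"dangling edge {e['from']} -> {e['to']}")
```

The important property is that pass one is allowed to be sloppy, and pass two is the gate: nothing unvalidated reaches the game data.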

Validation is mostly boring rules that prevent expensive debugging later:

When a run fails, I’d rather get a blunt error with a file/line pointer than discover it mid-playthrough.
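A validator in that spirit collects every problem with a pointer back to the offending file and index, then fails once with the whole list. The required-field names here are hypothetical; the pattern is the point.

```python
import json
from pathlib import Path

# Hypothetical schema: whatever fields the engine actually requires.
REQUIRED_HOTSPOT_FIELDS = ("id", "name", "mask_color")

def validate_scene(scene_path):
    """Blunt checks with a file pointer, run before anything is playable."""
    scene = json.loads(Path(scene_path).read_text())
    errors = []
    for i, hs in enumerate(scene.get("hotspots", [])):
        for field in REQUIRED_HOTSPOT_FIELDS:
            if field not in hs:
                errors.append(f"{scene_path}: hotspots[{i}] missing '{field}'")
    if errors:
        raise ValueError("\n".join(errors))
    return scene
```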

All generated artifacts land under story_specific_gen/. Keeping that boundary sharp has saved me from editing the wrong layer more than once.


Data-driven design and tooling choices

Everything important is data

Most of the world lives in JSON:

Both the debug suite and the runtime read these files. Story logic is not hard-coded.

Version control is part of the workflow: source inputs are reviewed like code, generated outputs are treated like build artifacts, and the game/debug suite stays thin. That keeps creative iteration compatible with normal engineering hygiene: diffs, rollbacks, reproducible runs.

Mask-first interaction debugging

Hotspots and navigation get tested through overlay masks, not only hand-tuned coordinates. One rule I learned the annoying way: mask files have to match the full-screen image's dimensions and naming conventions, or the overlays lie to you.

I also keep the mask-to-hotspot mapping explicit. A mask is only useful if it’s unambiguous how pixels map back to IDs in hotspots.json (palette indices, RGB codes, or a lookup table—pick one, document it, stick to it). “Looks right” is not a passing test when the engine is reading it.
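Concretely, an RGB-keyed lookup table is one way to keep that mapping explicit and checkable. The palette entries and hotspot IDs below are invented for illustration; the real project's conventions may differ.

```python
# Hypothetical documented table: one mask RGB value per hotspot ID.
MASK_PALETTE = {
    (255, 0, 0): "door_cellar",
    (0, 255, 0): "sign_tavern",
}

def check_mask(mask_size, screen_size):
    """Masks must match the full-screen image exactly, or overlays lie."""
    if mask_size != screen_size:
        raise ValueError(f"mask {mask_size} != screen {screen_size}")

def hotspot_at(pixels, x, y):
    """pixels: 2D grid of (r, g, b) tuples. Returns a hotspot ID or None."""
    return MASK_PALETTE.get(pixels[y][x])
```

Because the mapping is a plain table, "looks right" gets replaced by a test: every palette color maps to exactly one ID, and every pixel either resolves or is deliberately background.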

A web debugger with no build step

The debugger runs from a plain local server (for example, python -m http.server). Entry points:

Explicit state, persisted for testing

Debug state lives in localStorage under debugState. That makes it easy to refresh, reproduce an issue, and test gating logic across scenes and dialogue without rebuilding anything.

Generated outputs are treated as build artifacts

Renders can be skipped when outputs already exist, with explicit flags to redo work when needed. That avoids accidental rerenders and keeps costs predictable.
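The skip-if-exists behavior is a few lines of guard logic. This is a generic sketch with a hypothetical `force` flag, not the pipeline's actual interface.

```python
from pathlib import Path

def render_if_needed(out_path, render_fn, force=False):
    """Skip work when the artifact already exists, unless explicitly forced."""
    out = Path(out_path)
    if out.exists() and not force:
        print(f"skip {out} (exists)")
        return out
    out.write_bytes(render_fn())  # render_fn is the expensive model call
    return out
```

The default is the cheap path; spending money again requires opting in, which is what keeps rerun costs predictable.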

One unavoidable wrinkle is that models are stochastic. I handle that in three ways:

When something comes out worse, I don’t argue with it—I adjust the upstream constraint until the pipeline reliably produces something usable again.


Notes from the generation boundary

The hardest problems show up where text intent has to become a picture.

A practical lesson: image generation behaves better when I treat prompts like a contract, not a vibe. For each screen, I keep a short “must include” list (key interactables, doors/exits, landmarks), plus a “must not” list (floating objects, unreadable signage, cropped exits). I also keep camera rules stable—framing, horizon line, implied player height—so navigation doesn’t feel like teleporting between unrelated illustrations.
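Treating the prompt as a contract can be as literal as assembling it from those lists. A minimal sketch, with example content invented for illustration:

```python
def build_prompt(scene, must_include, must_not, camera_rules):
    """Assemble an image prompt from explicit contract lists plus stable camera rules."""
    return "\n".join([
        f"Scene: {scene}",
        "Must include: " + ", ".join(must_include),
        "Must NOT include: " + ", ".join(must_not),
        "Camera: " + camera_rules,
    ])

prompt = build_prompt(
    "harbor tavern exterior at dusk",
    ["tavern door", "hanging sign", "exit to the docks on the left"],
    ["floating objects", "unreadable signage", "cropped exits"],
    "eye-level, fixed horizon at one-third height, implied player height 1.7m",
)
```

Because the camera rules are a stable string shared across screens, regenerating one background can't silently change the framing conventions the rest of the game uses.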

One recurring failure mode: when a hotspot name encodes a container/content relationship or a state, often written like Container / Item, image generation may drop the container and draw only the item. The interaction ends up wrong because the picture is missing the thing the player is supposed to click.

The fix has been simple:

That gets visuals back in sync with the intended interaction.


What I’m working on next


Cost visibility is part of the design

I track spend and time per phase (arrange/plan/render/characters). Full runs have landed in the “tens of dollars” range across calls to OpenAI models and Google models, so tooling that lets me rerun only what changed is part of the work, not a nice extra.

For my own sanity, I record the boring details too: per-stage wall time, cache hits/misses, image counts, and token/call counts where I can get them. If I’m going to iterate like a developer, I want the feedback loops to look like developer tooling.
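A small context manager is enough to get per-stage wall time and run counts into one place; costs and token counts can be accumulated into the same table. A sketch of the pattern, not the project's actual instrumentation:

```python
import time
from contextlib import contextmanager

STATS = {}

@contextmanager
def stage(name):
    """Accumulate wall time and run count per pipeline phase."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        s = STATS.setdefault(name, {"wall_s": 0.0, "runs": 0})
        s["wall_s"] += time.perf_counter() - t0
        s["runs"] += 1
```

Wrapping each phase (`with stage("render"): ...`) makes "which phase ate the budget?" a lookup instead of a guess.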