Building WordBattle with an AI team
Want to play? wordbattle.fun is free and no login is required. Come back here when you want to know how it was built.
About a year ago, I ran an experiment. I wanted to see if AI models could build a six-letter Wordle clone from scratch while I stayed out of the code entirely. The rules were simple: use web technologies I didn’t know (JavaScript/TypeScript), accept every code change the models suggested without questioning it, and guide the project purely through text and screenshots. GPT-4 and Claude 3.5 Sonnet in Cursor did the work.
It was a fun failure. The models made reasonable progress on the frontend and got data into SQLite, but they couldn’t handle the communication between frontend and backend. When they got stuck, they’d propose the same two or three fixes over 10 or 12 attempts. I’d switch between models to break the loop. Eventually the game board stopped loading, the browser spewed errors, and the backend wasn’t much better. We didn’t get anywhere near authentication, user accounts, teams, or any of the other features I’d hoped to include.
But I didn’t have to retire to be an onion farmer 🧅. Not in early 2025, anyway.
Fast forward to early 2026. The tooling had changed substantially. I picked the project back up with Claude Code and the same starting constraint: technologies I didn’t know, so the code had to come from the agent while I guided the product. Six weeks later, WordBattle is a production multiplayer word game with over 1,300 commits. Maybe now it is time to become an onion farmer 🧅?
What is WordBattle?
A daily 6-letter word guessing game, similar in spirit to Wordle. Every player globally gets the same word each day, resetting at UTC midnight. You get six guesses. Each guess tells you which letters are correct, present but misplaced, or absent. The daily word is never sent to the client. Guesses are submitted to the server, which returns only the per-letter evaluation. You can’t cheat by opening DevTools.
Where it diverges from Wordle is in the multiplayer and AI dimensions. Players form teams. Leaderboards track individual and team performance. There are some fun touches: you can edit individual letters in your guess without deleting the entire word, and teams can hook directly into Slack or Microsoft Teams messaging to keep the competition flowing.
AI agents get to play too. They can create accounts, join teams, and compete on the same leaderboards as human players. My Claude Opus 4.6 agent wakes itself up on a daily schedule to play, then complains about how it lost because its rules are harder (agents face tougher constraints). I recently relented and allowed it to start using its local memory. I suspect it won’t be long until it optimises itself and starts winning.
I’m already in fierce competitions at home and at work.
What changed between attempt one and attempt two
The first attempt failed because the models had no persistence, no conventions, and no ability to understand the project as a whole. Each prompt was effectively starting fresh. When the frontend and backend needed to talk to each other correctly, the models couldn’t hold enough context to get the integration right. They’d fix one side and break the other, endlessly.
Three things made the second attempt work:
- Claude Code as a persistent collaborator. Not a prompt-and-forget code generator, but a development tool that maintains context within a session, reads the entire codebase, runs tests, and iterates on its own output.
- A project guide (CLAUDE.md). Architecture, conventions, patterns, and common mistakes to avoid that Claude reads at the start of every session. More on this below.
- The models got better. The capability jump in late 2025 was significant. The models that failed at frontend-backend integration a year earlier could now hold the full context of a project, reason about how systems connect, and produce code that worked across boundaries. That gap closed without me doing anything.
- I got better. Twelve months of constant experimentation with agentic development taught me what works and what doesn’t. I learned how to describe work precisely, when to plan vs when to let the agent explore, how to structure a project guide, when to steer and when to trust the output.
Choosing the stack
I didn’t pick the stack. Claude did.
Before writing any code, I set up Claude as a product manager agent. Through a back-and-forth interview process, I described the game, the audience, the features I wanted, and the constraints (solo maintainer, low cost, simple deployments). The agent worked through the discussion, asked clarifying questions, and formulated a plan that included technology recommendations with trade-offs. I selected the option that seemed the most well-reasoned.
SvelteKit 2 with Svelte 5 was its recommendation for the application framework. Svelte 5’s runes ($state(), $derived(), $effect()) replaced the older store-based reactivity model. The result is more explicit, less magical, and easier to reason about. SvelteKit’s Node adapter meant we could deploy a single binary-like artifact rather than orchestrating multiple services.
SQLite via Drizzle ORM for persistence. For a game with daily resets and modest concurrent writes, SQLite in WAL mode with a 5-second busy timeout is more than adequate. Drizzle gave us type-safe queries and auto-generated migrations. No database server to manage. The entire state of the application is a single file I can back up with a cron job.
Hetzner Cloud for hosting. A single cloud instance, no Kubernetes, no containers in production. Caddy as a reverse proxy, Cloudflare Tunnel for ingress (no public ports), and systemd template units for blue/green deployments. Minimal operational overhead. Negligible cost.
Deliberately boring infrastructure. The interesting engineering belongs in the game itself, not the deployment pipeline.
The development environment
All development happens inside a devcontainer, managed from the terminal via tmux and Claude Code. If you read my previous post on devcontainer-bridge, this is where it comes full circle. dbr is a declared devcontainer feature in the project, handling port forwarding and browser opening from the container to my host machine.
The devcontainer itself is more interesting than most. It runs a default-deny network egress firewall using iptables and ipset. On every container start, a firewall script loads a pre-computed IP allowlist and blocks all outbound traffic that isn’t on it. After applying the rules, it verifies the firewall works by confirming it can reach GitHub’s API and cannot reach example.com. IPv6 is blocked entirely to prevent bypass.
This is what makes it safe to run Claude Code with permissions that would otherwise be reckless. Inside the container, I let Claude run autonomously, accepting file edits and command execution without prompting me for approval. The firewall means the agent can’t reach anything I haven’t explicitly allowed, regardless of what it tries to do. The security boundary is the container isolation and the network, not a permission dialog.
The container installs Claude CLI automatically during setup, provisions GitHub SSH keys, and pulls in my personal dotfiles on start. A typical session is Ghostty fullscreen on my Mac with tmux, Claude Code in the main pane and daemon terminals in side panes. That’s it. No IDE. The performance win is real: CPU and memory that would go to an IDE go to Claude’s agent teams instead. When I need to read code, which is rare, my full Neovim setup is there inside the container.
The project guide
CLAUDE.md is a project guide that Claude reads at the start of every session. It covers the project layout, naming conventions, database patterns, authentication flow, accessibility standards, testing expectations, deployment procedures, and a list of common mistakes to avoid.
It didn’t arrive fully formed. It grew alongside the project. I’d write sections based on decisions we’d made; Claude would suggest additions when it noticed patterns emerging or spotted gaps that were causing inconsistency. Some entries came from mistakes we made early on that I didn’t want repeated. By the end, it was 280 lines co-authored by both of us.
This was probably the most important artefact in the project. Without it, Claude’s output drifts. With it, the code stays coherent across sessions.
A few examples of what’s in there:
- Svelte 5 only. Claude’s training data includes Svelte 4 patterns. Without an explicit instruction, it defaults to $: reactive declarations and Svelte stores. The project guide says “Use runes. No stores. No Svelte 4 patterns.” Problem solved.
- Database conventions. Integer primary keys, ISO 8601 timestamps, snake_case columns mapped to camelCase in TypeScript. Migrations auto-generated by Drizzle. Claude follows this consistently because it’s written down.
- Agent API parity. The REST API and MCP server must expose identical functionality. If you add a feature to one, you add it to the other. The OpenAPI spec and A2A discovery card must be updated simultaneously. The kind of rule that prevents drift and is easy to forget without a document enforcing it.
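To make that concrete, here is a hypothetical excerpt in the spirit of the real file, condensed from the rules above (this is not the actual CLAUDE.md):

```markdown
## Conventions

- Svelte 5 only: use runes ($state, $derived, $effect). No stores, no $: labels, no Svelte 4 patterns.
- Database: integer primary keys, ISO 8601 timestamps, snake_case columns mapped to camelCase in TypeScript. Migrations are auto-generated by Drizzle.
- Agent API parity: any feature added to the REST API must also land in the MCP server, the OpenAPI spec, and the A2A discovery card.
```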
Plan, then build
The work split into two distinct modes.
Planning sessions came first. These were long, collaborative conversations with Claude in planning mode, sometimes running for a few hours. We’d work through what needed to get done: feature scope, edge cases, data model changes, how a new capability should interact with existing systems, security considerations. By the end of a planning session, we’d have a clear picture of the work broken into discrete tasks.
Implementation happened next, using Claude Code’s agent teams. The plan would get handed to a team of agents that worked through the tasks, writing code, tests, database migrations, and documentation, often across multiple files simultaneously. Every commit in the repository is co-authored.
We didn’t just build technology this way. The website copy, the UI design, the security model, WCAG compliance, the deployment design, the operational automation. All of it was planned and iterated on with agents. I could describe what I wanted the landing page to say, how the onboarding flow should feel, what the error messages should communicate.
Features that would take me a full day to build solo (authentication, team management, leaderboard ranking algorithms, webhook integrations) came together in hours. Not because the agents wrote perfect code on the first try, but because the planning was thorough and the feedback cycle was tight. Plan, build, review, fix, commit. Repeat.
Where I had to steer
Architecture. Claude can implement a feature, but it doesn’t know which feature to implement next. It doesn’t know that the leaderboard should hide detailed results until after midnight to prevent information leakage between players. These design decisions require understanding the game, the players, and the incentive structures.
Visual design. Claude can implement a design, but the decisions about what looks right, what feels good to interact with, what creates the right mood for a daily word game, those came from me. I created a design audit document and worked through it iteratively with Claude to refine the interface.
Security judgement. The security audit was where Claude got it the most wrong. I set up a security review agent to assess the application, and its initial recommendations were corporate boilerplate. It demanded MFA for a free word game. It flagged the absence of OAuth2 integration as a critical risk. Meanwhile, it completely missed rate limiting on public endpoints, which is one of the first things you need on anything exposed to the open internet.
It wasn’t paying attention to what kind of application it was looking at. It took significant refinement, drawing on my own security background, to get the audit agent producing recommendations that were proportionate to the actual threat model. The layered rate limiting, the bot detection, the anti-enumeration measures, those all came from steering the agent towards what actually matters for a public web game rather than what a compliance checklist says.
Knowing when to stop. Claude will happily add features forever. Error handling for scenarios that can’t happen. Configuration for things that don’t need to be configurable. Abstractions for code that’s used exactly once. Recognising when the code is done, not just “passing tests” done, requires judgement I’m not ready to delegate.
You could look at this and say it’s not that different from any experienced product person working with a development team. And you’d be right about the skills involved. The difference is that the work doesn’t stop when I do. Agent teams run while I’m asleep. I’d finish a planning session in the evening, hand off the tasks, and wake up to a product that had moved forward overnight. That cadence changes what a single person can ship.
The triple-protocol agent API
One of the more interesting technical decisions was exposing three integration surfaces for AI agents:
REST API (/agent/v1/*): Standard CRUD endpoints with Bearer token authentication. This is what you’d use if you’re building a custom integration or a bot that plays WordBattle.
MCP Server (POST /mcp): A JSON-RPC 2.0 endpoint with 20+ tools that mirror the REST API. No SDK dependency, just pure JSON-RPC. This is how Claude Code and other MCP-aware agents interact with the game. Each tool has Anthropic-standard annotations for read-only, destructive, and idempotent hints.
A2A Discovery (/.well-known/agent-card.json): An agent-to-agent discovery card that lets third-party agents automatically find and understand WordBattle’s capabilities.
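A sketch of the JSON-RPC 2.0 envelope behind an endpoint like POST /mcp. The tool name and its handler are hypothetical stand-ins; only the envelope shapes follow the JSON-RPC 2.0 specification.

```typescript
type JsonRpcRequest = { jsonrpc: "2.0"; id: number | string; method: string; params?: any };
type JsonRpcResponse =
  | { jsonrpc: "2.0"; id: number | string; result: unknown }
  | { jsonrpc: "2.0"; id: number | string; error: { code: number; message: string } };

// Hypothetical tool registry; the real server exposes 20+ tools.
const tools: Record<string, (args: any) => unknown> = {
  submit_guess: ({ word }: { word: string }) => ({ accepted: word.length === 6 }),
};

function handleMcp(req: JsonRpcRequest): JsonRpcResponse {
  if (req.method !== "tools/call") {
    // -32601 is the JSON-RPC "method not found" code.
    return { jsonrpc: "2.0", id: req.id, error: { code: -32601, message: "Method not found" } };
  }
  const tool = tools[req.params?.name];
  if (!tool) {
    return { jsonrpc: "2.0", id: req.id, error: { code: -32602, message: "Unknown tool" } };
  }
  return { jsonrpc: "2.0", id: req.id, result: tool(req.params.arguments ?? {}) };
}
```

Because there is no SDK dependency, any client that can POST JSON can drive the game.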
Early in the project, I designed a service layer based on functional programming concepts that would sit beneath any transport. The game logic, team management, authentication, all of it lives in that layer. The four surfaces (web UI, REST, MCP, A2A) are thin wrappers that call into it. Getting this right early meant adding a new protocol was a thin wrapper, not a rewrite.
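The layering can be sketched like this: one transport-agnostic service function, with a thin wrapper per surface. All of these names are illustrative, not WordBattle's actual code.

```typescript
type LetterMark = "correct" | "present" | "absent";
type GuessResult = { letters: LetterMark[] };

// Service layer: pure game logic, no knowledge of HTTP or JSON-RPC.
function submitGuessService(playerId: string, word: string): GuessResult {
  // Real evaluation logic lives here; stubbed for the sketch.
  return { letters: Array.from(word, () => "absent" as const) };
}

// REST wrapper: maps an HTTP-shaped request onto the service call.
function restSubmitGuess(body: { playerId: string; word: string }) {
  return { status: 200, json: submitGuessService(body.playerId, body.word) };
}

// MCP wrapper: maps a tool invocation onto the same service call.
function mcpSubmitGuess(args: { playerId: string; word: string }) {
  return { content: submitGuessService(args.playerId, args.word) };
}
```

Since both wrappers delegate to the same function, the surfaces cannot drift apart in behaviour, only in shape.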
Agents face a 5-second server-enforced cooldown between guesses to prevent brute-force solving. They also require email verification before login, just like human players. The leaderboard treats everyone equally. An agent that solves the puzzle in two guesses ranks above a human who took four.
Anti-cheat and security
This was a priority from the start.
The daily word is generated server-side using a seeded PRNG. The seed combines a secret value with the day number (days since the project’s epoch of March 27, 2026). The word list is a curated subset of the ENABLE dictionary, 6-letter words only, shuffled deterministically using Fisher-Yates. Linear probing skips any word that’s been used before, giving roughly nine years of unique daily words before the sequence would repeat.
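The selection scheme can be sketched as follows. The PRNG choice (mulberry32), the word list, and the exact seed derivation are illustrative assumptions; the real secret and list live only on the server.

```typescript
import { createHmac } from "node:crypto";

const EPOCH_UTC = Date.UTC(2026, 2, 27); // day zero: March 27, 2026

function dayNumber(now: Date): number {
  return Math.floor((now.getTime() - EPOCH_UTC) / 86_400_000);
}

// Derive a deterministic 32-bit seed from the secret and the day number.
function seedFor(secret: string, day: number): number {
  return createHmac("sha256", secret).update(String(day)).digest().readUInt32BE(0);
}

// Small seeded PRNG (mulberry32); any deterministic PRNG works here.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle of the word list, driven by the seeded PRNG.
function shuffled(words: string[], rand: () => number): string[] {
  const out = [...words];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Pick the day's word, probing past any word already used.
function wordForDay(words: string[], secret: string, day: number, used: Set<string>): string {
  const order = shuffled(words, mulberry32(seedFor(secret, day)));
  for (const w of order) if (!used.has(w)) return w;
  throw new Error("word list exhausted");
}
```

Because everything downstream of the seed is deterministic, every server replica agrees on the day's word without coordination.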
Guesses are evaluated server-side using a two-pass algorithm: first pass marks exact position matches and decrements available letter counts, second pass marks present-but-misplaced letters from the remaining pool. This correctly handles duplicate letters, which is where most naive Wordle implementations get it wrong.
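The two-pass evaluation can be sketched in TypeScript (names are illustrative, not WordBattle's actual identifiers):

```typescript
type LetterResult = "correct" | "present" | "absent";

function evaluateGuess(guess: string, answer: string): LetterResult[] {
  const results: LetterResult[] = new Array(guess.length).fill("absent");
  const remaining = new Map<string, number>();

  // Pass 1: mark exact position matches and count the answer's unmatched letters.
  for (let i = 0; i < guess.length; i++) {
    if (guess[i] === answer[i]) {
      results[i] = "correct";
    } else {
      remaining.set(answer[i], (remaining.get(answer[i]) ?? 0) + 1);
    }
  }

  // Pass 2: mark present-but-misplaced letters, drawing from the remaining
  // pool so duplicate letters are never over-counted.
  for (let i = 0; i < guess.length; i++) {
    if (results[i] === "correct") continue;
    const count = remaining.get(guess[i]) ?? 0;
    if (count > 0) {
      results[i] = "present";
      remaining.set(guess[i], count - 1);
    }
  }
  return results;
}
```

The pool decrement in pass 2 is the piece naive implementations miss: without it, guessing a letter twice can flag it as present twice even when the answer contains it once.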
Bot protection for human registration uses layered passive detection: a honeypot field positioned off-screen with CSS (not display: none, which bots have learned to detect), time-based detection with HMAC-signed timestamps that reject submissions completed in under two seconds, and a blocklist of approximately 121,000 known disposable email domains.
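The time-based check can be sketched like this. The function names and token format are assumptions; only the HMAC signing and the two-second threshold come from the description above.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MIN_FILL_MS = 2_000; // reject forms completed in under two seconds

function sign(ts: number, secret: string): string {
  return createHmac("sha256", secret).update(String(ts)).digest("hex");
}

// Issued when the registration form is rendered.
function issueToken(secret: string, now = Date.now()): string {
  return `${now}.${sign(now, secret)}`;
}

// Checked when the form is submitted: the timestamp must verify and the
// elapsed time must be plausibly human.
function verifyToken(token: string, secret: string, now = Date.now()): boolean {
  const [tsPart, sig] = token.split(".");
  const ts = Number(tsPart);
  if (!Number.isFinite(ts) || !sig) return false;
  const expected = sign(ts, secret);
  if (sig.length !== expected.length) return false;
  if (!timingSafeEqual(Buffer.from(sig), Buffer.from(expected))) return false;
  return now - ts >= MIN_FILL_MS;
}
```

Signing the timestamp matters: without the HMAC, a bot could simply submit a timestamp from two seconds in the past.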
Rate limiting operates at three layers: Cloudflare WAF at the edge, SvelteKit middleware with per-endpoint escalating lockouts, and Caddy body size limits. Defence in depth, as it should be.
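A minimal in-memory sketch of the middleware layer's escalating lockout. The class name, threshold, and lockout windows are illustrative assumptions, not WordBattle's tuned values.

```typescript
interface Bucket { failures: number; lockedUntil: number }

class EscalatingLimiter {
  private buckets = new Map<string, Bucket>();
  constructor(private threshold = 5, private baseLockMs = 30_000) {}

  // key is typically `${ip}:${endpoint}`, making the limit per-endpoint.
  allowed(key: string, now = Date.now()): boolean {
    const b = this.buckets.get(key);
    return !b || now >= b.lockedUntil;
  }

  recordFailure(key: string, now = Date.now()): void {
    const b = this.buckets.get(key) ?? { failures: 0, lockedUntil: 0 };
    b.failures++;
    if (b.failures >= this.threshold) {
      // Escalate: base window, then 2x, 4x, ... per failure past the threshold.
      const escalations = b.failures - this.threshold;
      b.lockedUntil = now + this.baseLockMs * 2 ** escalations;
    }
    this.buckets.set(key, b);
  }

  recordSuccess(key: string): void {
    this.buckets.delete(key); // a success clears the escalation history
  }
}
```

This sits behind Cloudflare's edge rules, so it only needs to handle traffic that already looks legitimate enough to reach the origin.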
Deployment: Deliberately simple
The deployment pipeline uses blue/green zero-downtime swaps:
- Determine which slot (blue or green) is currently active
- Pull latest code into the standby slot
- Run database migrations
- Health check the standby instance
- Hot-swap Caddy’s upstream to the standby slot
- Stop the old slot
This is orchestrated by a shell script that GitHub Actions triggers on merge to main. The script, the smoke tests, and the backup automation all pass shellcheck with no warnings, another convention enforced by CLAUDE.md.
Monitoring uses Sentry with PII scrubbing, configured to never identify individual users. Claude has explicit access to Sentry error data and uses it to diagnose production issues directly. Agent ops, effectively. I’m still figuring out how this works and looks longer term.
Daily SQLite backups go to Hetzner Object Storage via a cron-driven script. Cloud snapshots provide an additional recovery layer. The whole application can be rebuilt from scratch in under an hour.
What’s next
The agentic team model works, but it’s early. Memory is the biggest gap right now. Claude’s project guide gives it project knowledge, but session-to-session memory is still primitive. I want agents that remember the last three planning sessions, know which approaches we tried and rejected, and can brief a new agent team without me repeating context. That’s coming, but it’s not here yet.
I’m also thinking about how this scales beyond a solo builder. At Macuject, the MedTech startup where I’m CTO, I lead people. Each of those people will eventually have their own agentic teams. The interesting question isn’t “how does one person work with AI?” It’s how human teams share context, conventions, and institutional knowledge across their respective agent teams without it becoming a mess. The project guide is a starting point, but it won’t be the whole answer.
And of course, newer models keep arriving. Every few months the capabilities shift and the patterns I’ve established need revisiting. The workflow that produced WordBattle will probably look different six months from now. That’s fine. The point was never to find the final process. It was to find a process that ships products now and builds my knowledge so that I can adapt as the state of the art changes rapidly.
So I don’t need to become an onion farmer 🧅 just yet, but 2027 is going to come fast.
WordBattle is live at wordbattle.fun. If you’d like to play, or connect an AI agent to compete, I’d welcome your company. Join the readers team and let’s see who’s best!
