MMO as production: Grafana, PostgreSQL and WebSocket for a game bot
Not every game lets you plug in your own monitoring. In Screeps the bot is plain JavaScript and the game world exposes WebSocket and REST. I built an external collector and watch the energy economy and global market like an exchange.
Screeps is the only MMO I know where the player's AI is plain JavaScript that runs 24/7 on the developer's servers. The bot lives in a shard, ticks at 1Hz, consumes real CPU from a fixed budget, and trades on a shared global market with other players.
Because of that, Screeps turns into a small production system. And like any production system, it needs monitoring. Not "open the client and look." But dashboards, time series, event annotations, alerts.

In most games this is impossible: internal state is closed, no API, at best a battle log. In Screeps there's a WebSocket to console.log and CPU per-tick, REST API to room state and Memory segments, and a global market with trade history. I plugged all of that into my own Grafana and now watch the bot the way I watch a prod service.
Why a game needs production-grade observability
A Screeps bot is a distributed agent in an unmanaged environment:
- CPU budget — 20 CPU/tick at GCL 6. Exceed it → bucket drops → creeps stall. You need avg/p99/trends per role and method.
- Energy economy — internal resource. A source regenerates 10 e/tick, an upgrader consumes 22 e/tick. If outflow steadily beats income — storage drains in hours.
- Global market — a real exchange: bid/ask, depressed orders, volume, volume filter. Buying 10K units of a mineral = a real ~2M credits. Without price monitoring I burned credits into the void.
- Combat and attacks on colonies — events you want to see as dots on a graph, not dig out of 24 hours of logs.
The Screeps client shows "here and now." What was on the Z-mineral exchange two hours ago — unavailable. You collect history yourself.
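The "storage drains in hours" claim is simple arithmetic, and a rough sketch makes it concrete. The function names and the sample numbers below are illustrative assumptions, not values taken from the bot, and tick duration is passed in explicitly because it differs between servers.

```javascript
// Illustrative only: names and sample numbers are assumptions, not bot data.

// Ticks until storage hits zero at a steady deficit.
function ticksUntilEmpty(stored, incomePerTick, spendPerTick) {
  const deficit = spendPerTick - incomePerTick;
  if (deficit <= 0) return Infinity; // storage is stable or growing
  return Math.ceil(stored / deficit);
}

// Same figure in wall-clock hours; secondsPerTick is supplied by the caller.
function hoursUntilEmpty(stored, incomePerTick, spendPerTick, secondsPerTick) {
  return (ticksUntilEmpty(stored, incomePerTick, spendPerTick) * secondsPerTick) / 3600;
}

// Two sources at 10 e/tick against 32 e/tick of spend, 300K energy in storage.
const drainTicks = ticksUntilEmpty(300000, 20, 32);
const drainHours = hoursUntilEmpty(300000, 20, 32, 3); // assumes 3 s per tick
console.log(drainTicks, drainHours.toFixed(1));
```

A deficit of only 12 e/tick empties a well-stocked storage in under a day at that assumed tick length, which is why the trend matters more than the current number.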
Architecture: WebSocket + REST → PostgreSQL → Grafana
[Screeps WS]   ──console+cpu──┐
                              ├──→ [screeps-logger] ──→ [PostgreSQL]
[Screeps REST] ──segments─────┘                              │
                                                             ▼
                                                         [Grafana]
                                                             ▲
                                                             │
                                                       [Claude Code]
                                                        MCP grafana
Three data sources:
- WebSocket wss://screeps.com/socket/websocket: subscriptions to user:{id}/console and user:{id}/cpu. Real-time stream: every console.log from the bot, runtime errors, CPU and memory every tick.
- REST GET /api/user/memory-segment: once a minute I poll 10 memory segments. The bot writes structured snapshots there: economy (roles, storage, credits), market (deals, fills), labs (reactions), spawns/deaths/attacks.
- REST GET /api/user/code: rarer, when I need to know the deployed code version.
The external collector is a FastAPI app in Docker. It dumps everything into PostgreSQL, which Grafana reads. No Kafka, no Loki: just a regular relational DB, because the data is structured and the volume is small (4-6 GB at 30-day retention).
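To make the WebSocket side concrete, here is a minimal sketch of turning one incoming frame into a record ready for storage. It is written in JavaScript for illustration (the real collector is a FastAPI app), and the frame shape is an assumption: a JSON array of [channel, payload], with cpu and memory fields on the CPU channel and messages.log on the console channel. parseFrame is a hypothetical name.

```javascript
// Hypothetical helper; the frame layout below is an assumption, not a spec.
function parseFrame(raw) {
  // Control frames such as "auth ok ..." are plain text, not JSON arrays.
  if (typeof raw !== 'string' || raw[0] !== '[') return null;

  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // malformed frame: skip it rather than crash the collector
  }
  if (!Array.isArray(parsed) || parsed.length !== 2) return null;

  const [channel, payload] = parsed;
  if (typeof channel !== 'string' || !payload || typeof payload !== 'object') return null;

  const kind = channel.split('/').pop(); // "user:{id}/cpu" -> "cpu"
  if (kind === 'cpu') {
    return { kind, cpu: payload.cpu, memory: payload.memory };
  }
  if (kind === 'console') {
    const logs = (payload.messages && payload.messages.log) || [];
    return { kind, logs, error: payload.error || null };
  }
  return null; // channel we did not subscribe to
}

const cpuRecord = parseFrame('["user:abc/cpu",{"cpu":12,"memory":34567}]');
const logRecord = parseFrame('["user:abc/console",{"messages":{"log":["hello"],"results":[]}}]');
console.log(cpuRecord, logRecord);
```

Returning null for anything unexpected keeps the ingest loop simple: the caller stores what it recognizes and ignores the rest.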
Why not a log file
The temptation to "write to a file and parse it" is strong, but doesn't work for two reasons:
- The bot doesn't write to disk. Screeps has no filesystem — only Memory (about 2 MB per user) and segments (10 × 100 KB). Anything you want to persist lives in that constrained budget and is wiped on reset.
- The WS channel is fragile. A connection lives ~5 ticks (15 sec), then drops. A console.log() larger than 5 KB kills the WS. So large snapshots go via segments (REST poll), not console.log (WebSocket). Short events go the opposite way, over WebSocket, because low latency is critical.
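The channel split can be sketched as a size check on the bot side. This is a simplification under stated assumptions: routing purely by serialized size is my shorthand for "short events vs large snapshots", chooseChannel is a hypothetical name, and string length (UTF-16 code units) is used as an approximation of bytes, which only holds for ASCII payloads. The 5 KB and 100 KB limits are the ones quoted above.

```javascript
const MAX_CONSOLE_CHARS = 5 * 1024;   // larger console messages drop the WS (from the text)
const MAX_SEGMENT_CHARS = 100 * 1024; // one memory segment (from the text)

// Hypothetical router: small payloads go to console.log, large ones to a segment.
function chooseChannel(event) {
  const json = JSON.stringify(event);
  if (json.length < MAX_CONSOLE_CHARS) return { channel: 'console', json };
  if (json.length <= MAX_SEGMENT_CHARS) return { channel: 'segment', json };
  throw new Error('payload exceeds one segment: ' + json.length + ' chars');
}

const smallEvent = chooseChannel({ type: 'attack', room: 'W1N1', tick: 12345 });
const bigSnapshot = chooseChannel({ type: 'eventLog', data: 'x'.repeat(6000) });
console.log(smallEvent.channel, bigSnapshot.channel);
```

Throwing on oversized payloads is deliberate in this sketch: silently truncating JSON would produce snapshots the collector cannot parse.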
PostgreSQL gives you SQL and jsonb — that's the format of "the bot wrote what it could, we'll figure it out later."
Known traps
WebSocket on Screeps lives by its own rules. No standard ping/pong from the server — you have to call recv() with a 15-second timeout and reconnect. Large console messages drop the connection — so the full Room.getEventLog() doesn't go to console, only to segments.
Another nuance: the host has DPI throttling against npm registry, and Docker builds were dying with exit 146 on RUN npm ci. Not CPU and not OOM — network discrimination. A proxy through privoxy 8118 (in the local infra) cures it. That story made it into this session's CLAUDE.md so the next session doesn't open it from scratch.
Energy economy: the dashboard as a P&L report
Energy in Screeps works like oil in a real economy:
- Regenerates at sources at a fixed rate (cap 10 e/tick per source).
- Spent on spawning creeps, controller upgrade, towers, wall repair, lab reactions.
- Transported via links (with a 3% tax) and haulers.
- Convertible into battery via factory and sellable.
If you can't see the energy flow split by where it comes from and what it is spent on, every economic dip turns into a needless detective story. "Storage dropped, no idea why" is a symptom of missing P&L.
In my dashboard each room has an "Energy Balance" panel — two stacked columns side by side per hour: income on the left, spend on the right.
Income (4 categories):
- sources: actual harvest from miners, incremented in code in a wrapper around creep.harvest()
- sk_deliver: deliveries from skHaulers in Source Keeper rooms (bonus sources)
- market_buy: energy purchases on the market (when things are bad)
- terminal_recv: transfers from my other rooms
Spend (11 categories):
- spawn: cost of spawning creeps (sum of body costs)
- tower: tower attacks and repairs (10e per action)
- upgrade: upgradeController (1e × WORK parts/tick)
- build, repair, fortify: construction, repair, wall fortification (precise counters from task.build.js and friends)
- renewal: creep renewal (exact formula ceil(cost / 2.5 / parts))
- link_tax: 3% tax on each link transfer
- terminal_send + term_fee: energy on outgoing sends and fees
- power: PowerSpawn.processPower (50e per op)
These counters live in Memory.rooms[name]._eo = {b, r, f, u, ...}. Every 100 ticks a snapshot goes into segment 0 → into PostgreSQL → into Grafana as a stacked bar per hour.
The instrumentation overhead is under 0.01 CPU per tick. The price of visibility is pennies.
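A minimal sketch of what such counters can look like, assuming a plain increment helper and a snapshot that resets the window. Only the Memory.rooms[name]._eo location comes from the text; addEnergy, takeSnapshot, the reset-after-snapshot behaviour, and reading the short keys u and f as upgrade and fortify are my assumptions.

```javascript
// Hypothetical helpers; a plain object stands in for the game's Memory here.

// Add `amount` energy to one counter of one room.
function addEnergy(memory, roomName, key, amount) {
  const rooms = memory.rooms || (memory.rooms = {});
  const room = rooms[roomName] || (rooms[roomName] = {});
  const eo = room._eo || (room._eo = {});
  eo[key] = (eo[key] || 0) + amount;
}

// Serialize all room counters and start a fresh accumulation window.
function takeSnapshot(memory, tick) {
  const snapshot = { tick, rooms: {} };
  for (const [name, room] of Object.entries(memory.rooms || {})) {
    if (!room._eo) continue;
    snapshot.rooms[name] = room._eo;
    room._eo = {};
  }
  return JSON.stringify(snapshot);
}

const fakeMemory = {};
addEnergy(fakeMemory, 'W1N1', 'u', 15);  // two upgrade ticks
addEnergy(fakeMemory, 'W1N1', 'u', 15);
addEnergy(fakeMemory, 'W1N1', 'f', 120); // one fortify action
const segmentBody = takeSnapshot(fakeMemory, 100);
console.log(segmentBody);

// In the game loop this would run roughly as:
//   if (Game.time % 100 === 0) RawMemory.segments[0] = takeSnapshot(Memory, Game.time);
```

Each call is one property lookup and one addition, which is consistent with the overhead staying far below a hundredth of a CPU per tick.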

What the dashboard showed the first night
I thought the main spend item in a developed room was the upgrader. The dashboard showed otherwise: fortify (wall reinforcement) at RCL 7 is 4× more expensive than upgrade. Towers eat more than they seemed to, because I wasn't accounting for wall repair. And link_tax steadily ate 3% of all energy.
You couldn't see this from the code. You can only see it in a stacked bar over time.
Global market: a trading terminal
Screeps has a full-blown exchange. Every player can post a BUY or SELL order on any resource with a price and volume. Trades happen between players, the market charges a fee based on the distance between terminals. Trade history exists, order-book depth exists.
And here's where it gets interesting: prices aren't static. They depend on who dumps big volume, who undercuts, which T3 compounds are needed for boosts right now. A week-long chart of the fair price for mineral H shows a real exchange pattern: rallies, corrections, occasional pumps from a single player with a big stockpile.
Anti-speculation: fair price instead of median
First naive implementation: bot sees the best SELL order at 50 cr/u, buys. An hour later it turns out that order was for 10 units from a speculator and the real price is 200. I overpaid.
The fix came from real exchanges: cumulative volume. Price is computed not from "best order" but like this:
fairSell = price at which cumulative order volume ≥ FAIR_PRICE_VOLUME (500 units)
We walk down from the best order, sum amount, and take the price of the order whose total crosses 500. Tiny solo orders of 10-20 units don't move fair price — they get filtered out.
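The walk described above can be sketched in a few lines. This is my reading of the rule, not the bot's actual code: fairPrice is a hypothetical name, and the orders are assumed to be objects with price and amount fields, as market orders in Screeps are. The sample book mirrors the 50 vs 200 story.

```javascript
const FAIR_PRICE_VOLUME = 500; // threshold from the text

// Price of the order at which cumulative volume first reaches minVolume.
// side: 'sell' walks up from the cheapest ask, 'buy' walks down from the highest bid.
function fairPrice(orders, side, minVolume = FAIR_PRICE_VOLUME) {
  const sorted = [...orders].sort((a, b) =>
    side === 'sell' ? a.price - b.price : b.price - a.price);

  let cumulative = 0;
  for (const order of sorted) {
    cumulative += order.amount;
    if (cumulative >= minVolume) return order.price;
  }
  return null; // book is too thin to quote a fair price
}

const sellBook = [
  { price: 50, amount: 10 },    // tiny bait order
  { price: 190, amount: 300 },
  { price: 200, amount: 400 },
  { price: 260, amount: 1000 },
];
console.log(fairPrice(sellBook, 'sell')); // 200
```

Returning null on a thin book lets the caller skip the trade instead of pricing off a handful of units.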
In Grafana there are three panels:
- Fair SELL of base minerals (H/O/Z/K/U/L/X) — raw material price trends
- Fair BUY of base minerals — at what level the market is willing to bid
- Fair SELL of T2/T3 compounds (LH2O, ZH2O, XLH2O...) — for evaluating reaction-chain profitability
And two — about real credit flows:
- Mineral purchases on the market/hour (units) — stacked bars, what's being bought and how much
- Spend on purchases/hour (cr) — the cash equivalent
This isn't a game anymore. It's a small algo-trading system with real P&L.
Ledger: profit accounting per room
Even closer to an exchange — an internal ledger. Every buy/sell goes into Memory.market.ledger[room][resource]:
{
  spent: 1_240_000,   // credits spent on buying this resource into this room
  bought: 8400,       // units bought
  revenue: 3_650_000, // revenue from selling this resource out of this room
  sold: 14_200        // units sold
}
Prod3.ledger() in the console prints an "income/spend by period, room and resource" report. In Grafana there's a "Sales revenue" panel (credits/hour), where every market.order_fill event (partial/full fill of my orders) is logged.
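How entries of that shape accumulate and turn into a report line can be sketched as follows. recordDeal and summarize are hypothetical helpers, not the internals of Prod3.ledger(); the figures are the ones from the example entry above, and the room and resource names are placeholders.

```javascript
// Hypothetical helpers around the ledger[room][resource] shape shown above.

// Accumulate one deal: side is 'buy' or 'sell'.
function recordDeal(ledger, room, resource, side, units, credits) {
  const byRoom = ledger[room] || (ledger[room] = {});
  const entry = byRoom[resource] ||
    (byRoom[resource] = { spent: 0, bought: 0, revenue: 0, sold: 0 });
  if (side === 'buy') {
    entry.spent += credits;
    entry.bought += units;
  } else {
    entry.revenue += credits;
    entry.sold += units;
  }
  return entry;
}

// Derive average prices and net credits from one entry.
function summarize(entry) {
  return {
    avgBuy: entry.bought ? entry.spent / entry.bought : null,
    avgSell: entry.sold ? entry.revenue / entry.sold : null,
    net: entry.revenue - entry.spent,
  };
}

const ledger = {};
recordDeal(ledger, 'W1N1', 'H', 'buy', 8400, 1_240_000);
recordDeal(ledger, 'W1N1', 'H', 'sell', 14_200, 3_650_000);
const report = summarize(ledger.W1N1.H);
console.log(report);
```

Keeping units and credits side by side is what makes average buy and sell prices recoverable later without replaying every deal.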
When I say in chat "the bot pulled 1.4M cr/hour purely from T2 LH2O", that's not a guess, that's a number from the dashboard.
Not just energy: tasks, towers, threats
Beyond the energy economy and the market, all the bot's "physical-world events" land on the dashboard. Every creep gets a task from a priority queue (fill_extensions, unload_link, surplus_to_terminal, repair, ...) — types and frequencies are written to a taskLog segment. Every tower trigger (attack/heal/repair) — to attacks. That gives a second observation plane: not "how much energy" but what is the bot actually doing.

When the bot acts strange, the first thing I do is look at these panels. A repair spike with no combat? Somebody is throwing themselves at a wall. A sharp drop in unload_link? Links are full, haulers are the bottleneck. fill_spawn at 0 with energy in storage? Haulers are stuck on another task.
Charts replace intuition. And that's what turns a gaming session into engineering work.
Claude Code as a trader on this dashboard
The final piece of architecture — Grafana's MCP integration with Claude Code on my laptop. Through mcp__grafana__* the assistant sees all 44 panels, runs SQL against PostgreSQL, sets event annotations, generates deep links.
When I ask in chat "why aren't credits growing?", the flow is:
- mcp__grafana__query_prometheus or direct SQL on screeps_logs.segment_records: pulls the current ledger
- Cross-checks against the rule in docs/decision-frameworks/market-decisions.md
- Answers with numbers, links to panels and an explanation
The dashboard is context. Without it the assistant would guess. With it — it sees what I see.
What's special about this architecture
The main thing — observability as a first-class concern. Not "we'll bolt logs on later" but from day one:
- Structured snapshots in Memory segments (json, not strings)
- Clear channel split: WS for low-latency events, REST poll for periodic snapshots
- PostgreSQL as universal storage for time-series + jsonb event blobs
- Grafana as visualization, not as single source of truth (the source is Postgres + bot code)
- 30-day retention — kills "yesterday's bug is uncatchable" as a class of problems
- MCP integration into the LLM — the assistant works with the same data I do
This is a pattern I carry over from B2B telecom and SaaS to any long-running process. A bot for an online game is, in this sense, no different from OLT monitoring or an NMS service.
Not every game allows this
Most online games are a closed black box for a bot. You can scrape the screen, parse client logs, plug into an API if the developer explicitly published one — but it's always a workaround.
Screeps was designed for programmers. WebSocket, REST, a JS runtime for AI, JSON Memory segments — that's part of the product, not a leak. So you can watch the bot with the same tools you use on a prod service, and optimize the same economy you'd optimize in real trading.
If you're ever building a game for developers — look at how Screeps did it. Not GraphQL, not gRPC, no proprietary protocol. Plain old WS + REST + JSON. And full freedom for whoever wants to observe.
In the next post — about the CPU crisis: how I hit the 20 CPU/tick ceiling with 43 creeps in 6 rooms, the decomposition I did (intents 8.25 hard ceiling vs 4.8 untracked = 90% of the optimization potential), and how a module-level cache in an ephemeral runtime took bucket from 2-22 to 13-52.