
MMO as production: Grafana, PostgreSQL and WebSocket for a game bot

Not every game lets you plug in your own monitoring. In Screeps the bot is plain JavaScript and the game world exposes WebSocket and REST. I built an external collector and watch the energy economy and global market like an exchange.

#screeps #observability #grafana #postgres #architecture

Screeps is the only MMO I know where the player's AI is plain JavaScript that runs 24/7 on the game developer's servers. The bot lives in a shard, ticks once every few seconds, consumes real CPU from a fixed budget, and trades on a shared global market with other players.

Because of that, Screeps turns into a small production system. And like any production system, it needs monitoring. Not "open the client and look." But dashboards, time series, event annotations, alerts.

[Figure: main Screeps window, room E49S39 on shard3, with sources, extensions, spawn, and creeps tagged spwn/idle/renew/pick]
This is the bot's 'reality' — room E49S39, my main (RCL 8). Yellow dots are energy extensions, spawn in the center, creeps under pick/idle/renew/spwn tags executing tasks from the priority queue. The game client shows the current tick but no history: what happened to this room an hour ago — the client doesn't remember. Hence everything below.

In most games this is impossible: internal state is closed, there's no API, and at best you get a battle log. Screeps gives you a WebSocket feed of console.log output and per-tick CPU, a REST API for room state and Memory segments, and a global market with trade history. I plugged all of that into my own Grafana and now watch the bot the way I watch a prod service.

Why a game needs production-grade observability

A Screeps bot is a distributed agent in an unmanaged environment:

  • CPU budget — 20 CPU/tick at GCL 6. Exceed it → bucket drops → creeps stall. You need avg/p99/trends per role and method.
  • Energy economy — internal resource. A source regenerates 10 e/tick, an upgrader consumes 22 e/tick. If outflow steadily beats income — storage drains in hours.
  • Global market — a real exchange: bid/ask, order-book depth, volume, volume filter. Buying 10K units of a mineral costs a real ~2M credits. Without price monitoring I was burning credits into the void.
  • Combat and attacks on colonies — events you want to see as dots on a graph, not dig out of 24 hours of logs.
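To put a number on "storage drains in hours", here is a minimal sketch of the arithmetic. The helper name hoursToDrain, the 50K starting storage and the 3-seconds-per-tick figure are illustrative assumptions (the tick length is inferred from the "5 ticks (15 sec)" remark later in this post), not values taken from the bot.

```javascript
// Assumption: roughly 3 seconds per tick, inferred from "5 ticks (15 sec)".
const SECONDS_PER_TICK = 3;

// Hypothetical helper: hours until storage is empty at a steady net outflow.
// Returns Infinity when income covers spend.
function hoursToDrain(storage, incomePerTick, spendPerTick, secondsPerTick = SECONDS_PER_TICK) {
  const netPerTick = incomePerTick - spendPerTick;
  if (netPerTick >= 0) return Infinity;
  const ticksLeft = storage / -netPerTick;
  return (ticksLeft * secondsPerTick) / 3600;
}

// One source at 10 e/tick against one upgrader at 22 e/tick, 50K in storage.
const exampleHours = hoursToDrain(50000, 10, 22);
```

With those inputs the storage lasts roughly 3.5 hours, which is why a trend line matters more than the current number.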

The Screeps client shows "here and now." What was on the Z-mineral exchange two hours ago — unavailable. You collect history yourself.

Architecture: WebSocket + REST → PostgreSQL → Grafana

                                       ┌──────────────────────────────┐
[Screeps WS]  ──console+cpu──┐         │                              │
                              ├──→ [screeps-logger] ──→ [PostgreSQL]  │
[Screeps REST] ──segments────┘                              │         │
                                                            ▼         │
                                                       [Grafana]      │
                                                            ▲         │
                                                            │         │
                                                       [Claude Code]  │
                                                       MCP grafana    │
                                                            │         │
                                                            └─────────┘

Three data sources:

  1. WebSocket wss://screeps.com/socket/websocket — subscriptions to user:{id}/console and user:{id}/cpu. Real-time stream: every console.log from the bot, runtime errors, CPU and memory every tick.
  2. REST GET /api/user/memory-segment — once a minute I poll 10 memory segments. The bot writes structured snapshots there: economy (roles, storage, credits), market (deals, fills), labs (reactions), spawns/deaths/attacks.
  3. REST GET /api/user/code — rarer, when I need to know the deployed code version.

The external collector is a FastAPI app in Docker. It dumps everything into PostgreSQL, which Grafana reads. No Kafka, no Loki: a regular relational DB is enough, because the data is structured and the volume is small (4-6 GB at 30-day retention).

Why not a log file

The temptation to "write to a file and parse it" is strong, but doesn't work for two reasons:

  • The bot doesn't write to disk. Screeps has no filesystem, only Memory (about 2 MB per user) and memory segments (100 KB each, at most 10 active per tick). Anything you want to persist lives in that constrained budget and is wiped on reset.
  • The WS channel is fragile. A connection lives ~5 ticks (15 sec), then drops. A console.log() larger than 5 KB kills the WS. So large snapshots go via segments (REST poll), not console.log (WebSocket). Short events go the other way, over WebSocket, because low latency is critical.
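The size side of that channel split can be sketched as a small routing function. This is an illustration under assumptions, not the bot's actual code: chooseChannel and both constants are hypothetical names, and the limits are simply the 5 KB and 100 KB figures from the text.

```javascript
// Limits taken from the text: console messages above ~5 KB drop the socket,
// and a single memory segment holds up to 100 KB.
const CONSOLE_LIMIT = 5 * 1024;
const SEGMENT_LIMIT = 100 * 1024;

// Hypothetical helper: serialize a payload and pick the channel it fits in.
// String length is used as a byte estimate, which holds for ASCII-only JSON.
function chooseChannel(payload) {
  const text = JSON.stringify(payload);
  if (text.length <= CONSOLE_LIMIT) return { channel: "console", text };
  if (text.length <= SEGMENT_LIMIT) return { channel: "segment", text };
  return { channel: "drop", text: "" }; // too big for either channel
}
```

Inside the game, the "console" branch would end in console.log(text) and the "segment" branch in an assignment to RawMemory.segments[id], where the segment id is up to the bot.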

PostgreSQL gives you SQL and jsonb — that's the format of "the bot wrote what it could, we'll figure it out later."

Known traps

WebSocket on Screeps lives by its own rules. The server sends no standard ping/pong, so you have to call recv() with a 15-second timeout and reconnect when it expires. Large console messages drop the connection, so the full Room.getEventLog() output never goes to the console, only to segments.

Another nuance: the host sits behind DPI throttling of the npm registry, and Docker builds were dying with exit 146 on RUN npm ci. Not CPU and not OOM, just network throttling. Routing through privoxy on port 8118 (in the local infra) cures it. That story made it into this session's CLAUDE.md so the next session doesn't have to rediscover it from scratch.

Energy economy: the dashboard as a P&L report

Energy in Screeps works like oil in a real economy:

  • Regenerates at sources at a fixed rate (cap 10 e/tick per source).
  • Spent on spawning creeps, controller upgrade, towers, wall repair, lab reactions.
  • Transported via links (with a 3% tax) and haulers.
  • Convertible into battery via factory and sellable.

If you don't see the energy flow split by where and on what, every economic dip turns into a detective story. "Storage dropped, no idea why" is a symptom of a missing P&L.

In my dashboard each room has an "Energy Balance" panel — two stacked columns side by side per hour: income on the left, spend on the right.

Income (4 categories):

  • sources — actual harvest from miners, incremented in code in a wrapper around creep.harvest()
  • sk_deliver — deliveries from skHaulers in Source Keeper rooms (bonus sources)
  • market_buy — energy purchases on the market (when things are bad)
  • terminal_recv — transfers from my other rooms

Spend (11 categories):

  • spawn — cost of spawning creeps (sum of body costs)
  • tower — tower attacks and repairs (10e per action)
  • upgrade — upgradeController (1e × WORK parts/tick)
  • build, repair, fortify — construction, repair, wall fortification (precise counters from task.build.js and friends)
  • renewal — creep renewal (exact formula ceil(cost / 2.5 / parts))
  • link_tax — 3% tax on each link transfer
  • terminal_send + term_fee — energy on outgoing sends and fees
  • power — PowerSpawn.processPower (50e per op)

These counters live in Memory.rooms[name]._eo = {b, r, f, u, ...}. Every 100 ticks a snapshot goes into segment 0 → into PostgreSQL → into Grafana as a stacked bar per hour.
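A minimal sketch of how such counters could be wired up, runnable outside the game. Everything named here is an assumption for illustration: memoryStub stands in for the real Memory object, the category keys are spelled out instead of the bot's short keys, and resetting counters after each snapshot is my guess, since the text doesn't say whether they are cumulative.

```javascript
// Stand-in for the game's Memory object so the sketch runs outside Screeps.
const memoryStub = { rooms: {} };
const SNAPSHOT_INTERVAL = 100; // ticks between snapshots, as in the text
const HARVEST_POWER = 2;       // energy per WORK part per tick (Screeps constant)
const OK = 0;                  // Screeps return code for success

// Add `amount` to one category counter of one room.
function count(roomName, category, amount) {
  const room = (memoryStub.rooms[roomName] = memoryStub.rooms[roomName] || {});
  const eo = (room._eo = room._eo || {});
  eo[category] = (eo[category] || 0) + amount;
}

// Wrapper around creep.harvest(): on success, book the harvested energy as income.
function trackedHarvest(creep, source) {
  const result = creep.harvest(source);
  if (result === OK) {
    const workParts = creep.body.filter((part) => part.type === "work").length;
    count(creep.room.name, "sources", Math.min(workParts * HARVEST_POWER, source.energy));
  }
  return result;
}

// Renewal cost per the formula in the list above: ceil(cost / 2.5 / parts).
function renewCost(bodyCost, bodyParts) {
  return Math.ceil(bodyCost / 2.5 / bodyParts);
}

// Every SNAPSHOT_INTERVAL ticks, return the counters as JSON and reset them
// (assumption: per-interval counters); otherwise return null.
function maybeSnapshot(tick) {
  if (tick % SNAPSHOT_INTERVAL !== 0) return null;
  const snapshot = { tick, rooms: {} };
  for (const [name, room] of Object.entries(memoryStub.rooms)) {
    snapshot.rooms[name] = room._eo || {};
    room._eo = {};
  }
  return JSON.stringify(snapshot);
}
```

In the game, the string returned by maybeSnapshot would be what gets assigned to segment 0 for the collector to poll.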

The instrumentation overhead is under 0.01 CPU per tick. The price of visibility is pennies.

[Figure: Grafana energy row with storage, terminal, controller upgrade progress and rate, spawn+ext, and creeps by role, all panels split across my six rooms]
One screen — all six rooms over a day. You can see E45S38 (green) pulling storage to 150K while the others sit at a steady 50K. Upgrade rate drops from 65 to 20 e/tick — that's the moment I cut booster supplies for upgraders and other tasks stopped eating my compound stock. Without the chart that transition would only be visible after the fact — through a 'controller is crawling' complaint.

What the dashboard showed the first night

I thought the main spend item in a developed room was the upgrader. The dashboard showed otherwise: fortify (wall reinforcement) at RCL 7 is 4× more expensive than upgrade. Towers eat more than they seemed to, because I wasn't accounting for wall repair. And link_tax steadily ate 3% of all energy.

You couldn't see this from the code. You can only see it in a stacked bar over time.

Global market: a trading terminal

Screeps has a full-blown exchange. Every player can post a BUY or SELL order on any resource with a price and volume. Trades happen between players, and the market charges a fee based on the distance between terminals. Trade history exists, order-book depth exists.

And here's where it gets interesting: prices aren't static. They depend on who dumps big volume, who undercuts, which T3 compounds are needed for boosts right now. A week-long chart of the fair price for mineral H shows a real exchange pattern: rallies, corrections, occasional pumps from a single player with a big stockpile.

Anti-speculation: fair price instead of median

First naive implementation: bot sees the best SELL order at 50 cr/u, buys. An hour later it turns out that order was for 10 units from a speculator and the real price is 200. I overpaid.

The fix came from real exchanges: cumulative volume. Price is computed not from "best order" but like this:

fairSell = price at which cumulative order volume ≥ FAIR_PRICE_VOLUME (500 units)

We walk down from the best order, sum amount, and take the price of the order whose total crosses 500. Tiny solo orders of 10-20 units don't move fair price — they get filtered out.
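The walk described above can be sketched in a few lines. This is an illustration rather than the bot's code: fairPrice and its parameters are hypothetical names, and orders are assumed to be plain objects with price and amount fields.

```javascript
const FAIR_PRICE_VOLUME = 500; // threshold from the text

// orders: [{ price, amount }] for one resource and one side of the book.
// side "sell" walks up from the cheapest ask, "buy" walks down from the highest bid.
// Returns the price at which cumulative volume reaches minVolume, or null if the book is too thin.
function fairPrice(orders, side, minVolume = FAIR_PRICE_VOLUME) {
  const sorted = [...orders].sort((a, b) =>
    side === "sell" ? a.price - b.price : b.price - a.price
  );
  let cumulative = 0;
  for (const order of sorted) {
    cumulative += order.amount;
    if (cumulative >= minVolume) return order.price;
  }
  return null;
}

// The speculator's 10 units at 50 no longer set the price.
const sellBook = [
  { price: 50, amount: 10 },
  { price: 190, amount: 300 },
  { price: 200, amount: 400 },
];
const fairSell = fairPrice(sellBook, "sell");
```

For this book fairSell comes out at 200: the cumulative volume is 10, then 310, and only crosses 500 on the third order. In the game the input would presumably come from Game.market.getAllOrders() filtered by resource and order type.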

In Grafana there are three panels:

  • Fair SELL of base minerals (H/O/Z/K/U/L/X) — raw material price trends
  • Fair BUY of base minerals — at what level the market is willing to bid
  • Fair SELL of T2/T3 compounds (LH2O, ZH2O, XLH2O...) — for evaluating reaction-chain profitability

And two more cover real credit flows:

  • Mineral purchases on the market/hour (units) — stacked bars, what's being bought and how much
  • Spend on purchases/hour (cr) — the cash equivalent

This isn't a game anymore. It's a small algo-trading system with real P&L.

Ledger: profit accounting per room

Even closer to an exchange — an internal ledger. Every buy/sell goes into Memory.market.ledger[room][resource]:

{
  spent: 1_240_000,   // credits spent on buying this resource into this room
  bought: 8400,       // units bought
  revenue: 3_650_000, // revenue from selling this resource out of this room
  sold: 14_200        // units sold
}

Prod3.ledger() in the console prints an "income/spend by period, room and resource" report. In Grafana there's a "Sales revenue" panel (credits/hour), where every market.order_fill event (partial/full fill of my orders) is logged.
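A minimal sketch of how a ledger with that shape could be maintained and summarized. The function names (recordBuy, recordSell, summarize) and the ledgerStub object are hypothetical and only illustrate the bookkeeping; they are not the internals of Prod3.ledger().

```javascript
// Stand-in for Memory.market.ledger: ledgerStub[room][resource] = { spent, bought, revenue, sold }.
const ledgerStub = {};

// Fetch (or lazily create) the entry for one room and resource.
function ledgerEntry(room, resource) {
  const byRoom = (ledgerStub[room] = ledgerStub[room] || {});
  return (byRoom[resource] =
    byRoom[resource] || { spent: 0, bought: 0, revenue: 0, sold: 0 });
}

// Book a purchase of `units` at `price` credits per unit.
function recordBuy(room, resource, units, price) {
  const entry = ledgerEntry(room, resource);
  entry.spent += units * price;
  entry.bought += units;
}

// Book a sale of `units` at `price` credits per unit.
function recordSell(room, resource, units, price) {
  const entry = ledgerEntry(room, resource);
  entry.revenue += units * price;
  entry.sold += units;
}

// Average prices and net credit flow for one room and resource.
function summarize(room, resource) {
  const entry = ledgerEntry(room, resource);
  return {
    avgBuy: entry.bought ? entry.spent / entry.bought : null,
    avgSell: entry.sold ? entry.revenue / entry.sold : null,
    net: entry.revenue - entry.spent,
  };
}
```

Note that net here is plain cash flow (revenue minus spend); it says nothing about the value of units still sitting in the terminal.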

When I say in chat "the bot pulled 1.4M cr/hour purely from T2 LHO2" — that's not a guess, that's a number from the dashboard.

Not just energy: tasks, towers, threats

Beyond the energy economy and the market, all the bot's "physical-world events" land on the dashboard. Every creep gets a task from a priority queue (fill_extensions, unload_link, surplus_to_terminal, repair, ...) — types and frequencies are written to a taskLog segment. Every tower trigger (attack/heal/repair) — to attacks. That gives a second observation plane: not "how much energy" but what is the bot actually doing.
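The "top tasks" view boils down to counting task types. A small sketch, assuming the task log is available as a flat array of task-type strings; that shape and the topTasks name are assumptions, not the actual layout of the taskLog segment.

```javascript
// Count task types and return the `limit` most frequent as [type, count] pairs,
// sorted by count in descending order.
function topTasks(taskLog, limit) {
  const counts = new Map();
  for (const task of taskLog) {
    counts.set(task, (counts.get(task) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit);
}
```

In this setup the same aggregation could just as well live in a SQL query behind the Grafana panel; the sketch only shows the idea.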

[Figure: Grafana panels with stacked bars of task types per hour, top tasks for 24h, tower triggers per room, and threat character]
Task stack over 24 hours: `to_storage` 6.94K, `unload_link` 4.72K, `fill_spawn` 4.71K — that's the pulse of hauler work. Tower triggers are security telemetry now: a steady background of solo enemy scouts and a peak at 02:00 with two towers in one room (intrusion in E49S38). The bottom 'threat character' panel separates real attacks from passers-by.

When the bot acts strange, the first thing I do is look at these panels. A repair spike with no combat? Somebody is throwing themselves at a wall. A sharp drop in unload_link? Links are full, haulers are the bottleneck. fill_spawn at 0 with energy in storage? Haulers are stuck on another task.

Charts replace intuition. And that's what turns a gaming session into engineering work.

Claude Code as a trader on this dashboard

The final piece of architecture — Grafana's MCP integration with Claude Code on my laptop. Through mcp__grafana__* the assistant sees all 44 panels, runs SQL against PostgreSQL, sets event annotations, generates deep links.

When I ask in chat "why aren't credits growing?", the flow is:

  1. mcp__grafana__query_prometheus or direct SQL on screeps_logs.segment_records — pulls the current ledger
  2. Cross-checks against the rule in docs/decision-frameworks/market-decisions.md
  3. Answers with numbers, links to panels and an explanation

The dashboard is context. Without it the assistant would guess. With it — it sees what I see.

What's special about this architecture

The main thing — observability as a first-class concern. Not "we'll bolt logs on later" but from day one:

  • Structured snapshots in Memory segments (json, not strings)
  • Clear channel split: WS — short-latency events, REST poll — periodic snapshots
  • PostgreSQL as universal storage for time-series + jsonb event blobs
  • Grafana as visualization, not as single source of truth (the source is Postgres + bot code)
  • 30-day retention — kills "yesterday's bug is uncatchable" as a class of problems
  • MCP integration into the LLM — the assistant works with the same data I do

This is a pattern I carry over from B2B telecom and SaaS to any long-running process. A bot for an online game is, in this sense, no different from OLT monitoring or an NMS service.

Not every game allows this

Most online games are a closed black box for a bot. You can scrape the screen, parse client logs, plug into an API if the developer explicitly published one — but it's always a workaround.

Screeps was designed for programmers. WebSocket, REST, a JS runtime for AI, JSON Memory segments — that's part of the product, not a leak. So you can watch the bot with the same tools you use on a prod service, and optimize the same economy you'd optimize in real trading.

If you're ever building a game for developers — look at how Screeps did it. Not GraphQL, not gRPC, no proprietary protocol. Plain old WS + REST + JSON. And full freedom for whoever wants to observe.


In the next post — about the CPU crisis: how I hit the 20 CPU/tick ceiling with 43 creeps in 6 rooms, the decomposition I did (intents 8.25 hard ceiling vs 4.8 untracked = 90% of the optimization potential), and how a module-level cache in an ephemeral runtime took bucket from 2-22 to 13-52.


MMO as production: Grafana, PostgreSQL and WebSocket for a game bot · Grigoriy Masich