Prompt is code: how I taught an LLM to stop inventing filters and to call SQL

I plugged a local LLM into OLT monitoring — and caught three categories of hallucinations. The fix wasn't a model upgrade, it was tool design and the system prompt.

8 May 2026#llm#prompt-engineering#mcp#tool-design

On the assistant's first day in the OLT monitoring system the bot returned a list of ONUs "in Almati." There's no city "Almati" in my database — there's "Almaty" (1200 ONUs) and "Alma-Ata" (the old name of the same location, 80 ONUs). The model decided they were one city, made up a middle name, and returned an empty result.

That's not a model bug. It's a gap in tool design and the system prompt. After three days of iteration, every "weirdness" of the bot turned into a commit to the prompt — versioned, testable, explainable.

Symptom 1: the model doesn't call SQL for aggregates

For the question "how many ports are over 75% full?" the bot honestly tried olt.list — pulled 1000 OLTs, counted in its head per OLT, and either hallucinated a number or said "a lot."

First thought — the model is dumb. Real cause — it had no aggregate tool. It knew about olt.list and olt.get but didn't know it could count without pulling a list. The behavior was reasonable: no hammer — bang with a shoe.

Fix — a tool for the job:

{
  "name": "olt.stats_by_olt",
  "description": "Top-N aggregate: for each OLT counts ports above threshold. Returns array (olt_id, name, port_count, total_ports), sorted descending.",
  "params": {
    "threshold": "double, 0..1, default 0.75",
    "limit": "int, default 10"
  }
}

Plus an inline schema in the system prompt: "for aggregates and top-N use *.stats_by_* tools, don't pull lists and don't count in your head." Two hours of iteration later the model stopped trying to sum olt.list by hand.

Lesson: if the model isn't calling the tool you need — usually the tool isn't there, or its description doesn't answer "when do I call this."

Symptom 2: filter hallucinations

"Show ONUs in Almati", "find OLTs by Huawei vendor", "how many devices does operator BaikalTel have" — my system has no "Almati", no Huawei among supported OLT vendors (only CDATA, BDCOM, GateRay), and no operator with that name. The bot made them up, then honestly returned "nothing found", hiding the fact that the filter was bogus.

Prompt patches like "don't make up cities and vendors" worked poorly — the model still merged similar values. The fix turned out to be architectural: give the model a way to learn the allowed values.

Tool olt.distinct_values:

{
  "name": "olt.distinct_values",
  "description": "Returns unique values for a field. Use BEFORE any filter so you know the allowed values. Never guess cities/vendors/operators.",
  "params": {
    "field": "vendor | operator | city | po_version"
  }
}

Plus a rule in the prompt: "before a filter on city/vendor/operator — always call olt.distinct_values, pick the closest, ask the user if ambiguous." Behavior changed instantly: now the bot first checks if "Almati" is in the list, doesn't find it, and asks "do you mean Almaty or Alma-Ata?"

Lesson: filter hallucinations don't get cured by prompt negations. They get cured by a source-of-truth tool and an explicit instruction to call it before filtering.

Symptom 3: internal IDs leaking out + domain knowledge

Two different but related bugs:

ID leak. The model showed olt_port_id: 4521, contract_id: 88273 to the user. Useful for debugging, useless for an operator, and a potential leak of contract ids. Cured by one line in the prompt: "never output fields ending in _id. Only human-readable: OLT name, port number, subscriber's full name."

Signal thresholds. The bot decided on its own that "-25 dBm is a bad signal." In our system bad < -27, warn -25..-27, good > -25. The model took numbers from its training data, not from the domain. I wrote thresholds straight into the system prompt:

ONU signal level thresholds (Rx Optical Power, dBm):
  good   : > -25
  warn   : -25..-27 inclusive
  bad    : < -27 (needs attention)
  fatal  : < -30 (likely a break)

Never decide "good/bad" yourself. Always check against the thresholds above.

Problem gone in one run.

Lesson: domain values, thresholds, formats — those are part of the spec, not part of the LLM's "general knowledge." They live in the system prompt and update with the product.

Prompt as code: what that means in practice

After three days of iteration my system prompt turned into a full artifact with its own discipline:

Lives in the repo at resources/prompts/system.md, not in a Java string
Versioned in git, reviewed on PRs like normal code
Covered by smoke tests: a set of "canonical questions" with expected tool calls and a check that the model didn't invent a filter
Each rule in the prompt has a comment "added 2026-05-02 — model confused Almaty and Alma-Ata"

This is closer to a feature config than to "creative text." Every phrase is a reaction to a specific observed bug.

What I learned

Tool design > model size. Answer accuracy went up more from adding the distinct_values tool than from any attempts to pick a beefier model.
"Model is dumb" usually means "the right tool is missing." If behavior is weird — look at what the model has at hand before blaming weights.
Hallucinations are cured by source-of-truth, not by prohibition. Give a tool that returns truth and require it to be called.
The domain lives in the system prompt. Thresholds, formats, security policies — anything specific to the product belongs in the prompt, not in hopes of "LLM common sense."
The system prompt is code. Version, test, release. Don't edit it live in prod.

The LLM doesn't "understand" your domain. It understands its own — general human. The distance between them is closed by the system prompt and tools, not by model selection.

Telegram X (Twitter)

Discussion

Comments are powered by Giscus + GitHub. Clicking transfers data to GitHub Inc. (USA). No click — no transfer.

Open discussion on GitHub ↗