The more beautiful the design, the more we doubt it with real data — BASTION’s verification discipline before going to production

// BASTION Technical Note ・ 2026-06-12 ・ Hideyuki Chinda / BESTNET LLC

“It probably works” doesn’t cut it for production infrastructure #

BASTION is an “AI Ops Platform” in which AI analyzes infrastructure logs, automatically blocks attacks, and controls bandwidth quality. Put that way it sounds flashy, but the thing we are most careful about in development is not adding features. It is the unglamorous discipline of “only putting into production what we can prove works.”

Putting automated control into production infrastructure means a mistake in the AI’s judgment turns directly into an incident. Misjudge non-attack traffic as an attack and block it, and legitimate customers get locked out. Throttle bandwidth when there is no congestion, and the service degrades. That is why we do not ship things that are only at the “should be correct in theory” or “probably works” stage to production.

In this article I describe what we actually do before putting new detection logic or control into production.

Engineers are drawn to “beautiful designs” — and that is the trap #

To be honest, we had a “theoretically elegant model” of our own too. On paper it was coherent and felt good to explain. Designs like that are seductive.

But production traffic does not behave according to assumptions made on paper. We verified that model against real data from our own infrastructure and found places where it behaved differently from what we expected. The action we took was simple: we rebuilt the model and scaled its claims back to what the data actually supports. The rule is not "adopt it because the diagram is beautiful" but "adopt only what the data supports."

This is not a defeat. On the contrary, being able to refute your own beautiful hypothesis with your own data is, we believe, indispensable for a product entrusted with production infrastructure.

The craft of verification — testing a hypothesis without touching production #

Concretely, we verify in this order.

1. Test the hypothesis by observation alone (READ-ONLY) #

New detection logic first runs in an “observation-only” mode that writes nothing to production. It never touches production blocking or control; it merely records, against real data, “how it would have decided had this been production.” This way, even when a hypothesis is wrong, no one is harmed.
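As a minimal sketch of the observation-only idea, the following Python records what a candidate detector would have decided without holding any reference to an enforcement API. The class names, the `failed_logins_per_min` feature, and the threshold of 20 are hypothetical and chosen only for illustration; they are not BASTION's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ShadowRecord:
    """One would-be decision, recorded instead of enforced."""
    source_ip: str
    would_block: bool


@dataclass
class ShadowEvaluator:
    """Runs candidate detection logic in observation-only mode.

    `detector` is a hypothetical callable returning True when it would block.
    The evaluator has no handle to any blocking or control API, so it
    cannot write to production by construction.
    """
    detector: Callable[[Dict[str, float]], bool]
    records: List[ShadowRecord] = field(default_factory=list)

    def observe(self, source_ip: str, features: Dict[str, float]) -> None:
        # Record the verdict only; nothing is blocked or throttled here.
        self.records.append(ShadowRecord(source_ip, self.detector(features)))

    def would_block_rate(self) -> float:
        """Fraction of observed sources the detector would have blocked."""
        if not self.records:
            return 0.0
        return sum(r.would_block for r in self.records) / len(self.records)


def candidate_detector(features: Dict[str, float]) -> bool:
    # Hypothetical rule; the feature name and threshold are illustrative.
    return features.get("failed_logins_per_min", 0.0) > 20.0


shadow = ShadowEvaluator(candidate_detector)
shadow.observe("192.0.2.10", {"failed_logins_per_min": 35.0})
shadow.observe("192.0.2.11", {"failed_logins_per_min": 2.0})
```

The recorded verdicts can then be compared with what actually happened before anyone considers giving the logic write access.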

2. Confirm it can tell a real anomaly from mere noise #

The value of detection logic is decided less by “being able to find anomalies” than by “not calling non-anomalies anomalies.” Can it pick out only the real anomalies from logs full of background noise? What we look at is not whether a single metric crossed a threshold, but whether several observation surfaces begin to deviate in step with one another (a coordinated onset, or co-onset) — captured as a relative change from each target’s own baseline. We apply simulated load and events in our own environment and confirm, with real data, whether the logic can separate a “genuine coordinated anomaly” from “unrelated fluctuations that merely happened at the same moment.” If it cannot, that detection is useless.
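The co-onset idea can be sketched roughly as follows. This is an illustration under stated assumptions, not BASTION's detection logic: the surface names, the 50% relative-change threshold, the one-sample lag tolerance, and the use of a simple mean as the baseline are all hypothetical. Each surface is measured as a relative change from its own baseline, and the function reports a coordinated onset only when enough surfaces begin deviating close together in time.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def relative_change(baseline: List[float], current: float) -> float:
    """Change of `current` relative to this target's own baseline mean."""
    base = mean(baseline)
    if base == 0:
        return 0.0 if current == 0 else float("inf")
    return (current - base) / base


def onset_index(baseline: List[float], window: List[float],
                threshold: float) -> Optional[int]:
    """Index of the first sample whose relative change exceeds the threshold."""
    for i, value in enumerate(window):
        if abs(relative_change(baseline, value)) > threshold:
            return i
    return None


def co_onset(surfaces: Dict[str, Tuple[List[float], List[float]]],
             threshold: float = 0.5, max_lag: int = 1,
             min_surfaces: int = 2) -> bool:
    """True when at least `min_surfaces` surfaces start deviating
    within `max_lag` samples of one another."""
    onsets = sorted(
        i for i in (onset_index(b, w, threshold) for b, w in surfaces.values())
        if i is not None
    )
    for start in range(len(onsets) - min_surfaces + 1):
        if onsets[start + min_surfaces - 1] - onsets[start] <= max_lag:
            return True
    return False


# Hypothetical surfaces: (baseline samples, current window samples).
surfaces = {
    "latency_ms": ([10, 10, 10, 10], [10, 11, 20, 22]),
    "error_rate": ([1, 1, 1, 1], [1, 1, 1, 3]),
    "conn_count": ([100, 100, 100, 100], [100, 105, 98, 102]),
}
coordinated = co_onset(surfaces)
```

A check like this only establishes that onsets coincide in time. Whether the coincidence is a genuine coordinated anomaly or unrelated fluctuation still has to be settled against real data, as described above.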

3. Pit the opposing view against your own hypothesis #

We deliberately throw conflicting interpretations at our own hypothesis: “Is this detection really doing the work? Did it just happen to be right?” Does the conclusion hold even if we change how the observation window is drawn? Could a different explanation produce the same result? Any claim that cannot survive refutation is taken out of the product’s description.
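One of these challenges, changing how the observation window is drawn, can be sketched as a small check. The two detectors and the synthetic series below are hypothetical, chosen only to show a rule whose verdict depends on the window size next to one whose verdict does not.

```python
from typing import Callable, Dict, Sequence


def verdicts_across_windows(series: Sequence[float], event_index: int,
                            detect: Callable[[Sequence[float]], bool],
                            half_widths: Sequence[int]) -> Dict[int, bool]:
    """Re-run `detect` on windows of different sizes around the same event."""
    results = {}
    for h in half_widths:
        lo = max(0, event_index - h)
        hi = min(len(series), event_index + h + 1)
        results[h] = detect(series[lo:hi])
    return results


def survives_window_change(results: Dict[int, bool]) -> bool:
    """The conclusion holds only if every window gives the same verdict."""
    return len(set(results.values())) == 1


# Synthetic series: flat at 10 with a single spike to 40 at index 10.
series = [10.0] * 10 + [40.0] + [10.0] * 10


def fragile(window: Sequence[float]) -> bool:
    # Compares the peak with the window's own mean, so the verdict
    # shifts with the window size (hypothetical rule).
    return max(window) > 2 * (sum(window) / len(window))


def robust(window: Sequence[float]) -> bool:
    # Compares the peak with a fixed baseline of 10.0 (hypothetical rule).
    return max(window) > 2 * 10.0


fragile_results = verdicts_across_windows(series, 10, fragile, (1, 3, 5))
robust_results = verdicts_across_windows(series, 10, robust, (1, 3, 5))
```

Here the fragile rule misses the spike in the narrowest window and fires in the wider ones, so its claim would not survive this challenge; the robust rule gives the same verdict for every window.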

What is proven goes to production; what is not goes to the observation layer #

Only what passes verification advances to production, and it does so in stages. Writes to production always pass through three stages: dry run → limited production → full production. An emergency-stop command is in place from the very beginning. And every production rollout passes a gate where "the AI proposes and a human approves."
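The staged gate can be pictured as a small state machine. The sketch below is an assumption-laden illustration, not BASTION's implementation: the `Rollout` class, its method names, and the choice to raise exceptions are all hypothetical.

```python
from enum import Enum


class Stage(Enum):
    DRY_RUN = 1
    LIMITED = 2
    FULL = 3


class Rollout:
    """Hypothetical rollout gate: three stages, human approval, kill switch."""

    def __init__(self) -> None:
        self.stage = Stage.DRY_RUN
        self.stopped = False

    def emergency_stop(self) -> None:
        # Available from the very first stage.
        self.stopped = True

    def promote(self, approved_by_human: bool) -> Stage:
        """Advance one stage; the AI may propose, but a human must approve."""
        if self.stopped:
            raise RuntimeError("rollout is stopped")
        if not approved_by_human:
            raise PermissionError("promotion requires human approval")
        if self.stage is not Stage.FULL:
            self.stage = Stage(self.stage.value + 1)
        return self.stage

    def may_write(self) -> bool:
        """Writes are allowed only past dry run and only while not stopped."""
        return not self.stopped and self.stage is not Stage.DRY_RUN


rollout = Rollout()
```

Because `promote` moves exactly one stage per approval, there is no path from dry run to full production that skips the limited stage.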

This discipline shows directly in how the product is presented. BASTION’s two modules, for instance, differ in operational status.

  • The security module has a track record of detecting and automatically blocking numerous attack campaigns in real-world operation, and runs in full auto in production.
  • The quality module (dynamic bandwidth control), by contrast, is implemented as a feature but is still in staged rollout, centered on observation. Whether the control acts as intended when a line genuinely congests is something we confirm with data from that very situation first — which is exactly why we do not write that it is “running at full capacity in production.”

“Implemented” is not the same as “works in production.” We refuse to blur that distinction, even on the product page.

We verify the AI (the LLM) itself, too — without trusting it #

This “verify, don’t trust” stance is aimed at the local LLM at BASTION’s core as well.

Once, a local LLM produced a number that did not exist in a monitoring report (we wrote up the episode and the countermeasure in “How a local LLM fabricated numbers in a monitoring report, and what we did about it”). Since then, the judgments that actually drive the system, namely the final decision on whether an IP is an attacker and the execution of a block, are not left to the LLM. They are made by deterministic, mathematical checks against actual machine logs. What the LLM is good at is summarizing and shaping natural language, not production decision-making. Both the LLM and the Agent (placed in a DMZ that may itself be compromised) are, for us, things to “verify, not trust.”
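One possible arrangement of "verify, don't trust" is sketched below: whatever IPs an LLM names are treated as hints at most, and a block is confirmed only when a deterministic count over the machine logs supports it. The log schema (`ip`, `event`), the `auth_failure` event name, and the threshold of three failures are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def failures_by_ip(logs: Iterable[Dict[str, str]]) -> Counter:
    """Count authentication failures per source IP from raw log entries."""
    return Counter(e["ip"] for e in logs if e["event"] == "auth_failure")


def confirmed_blocks(llm_candidates: Iterable[str],
                     logs: Iterable[Dict[str, str]],
                     min_failures: int = 3) -> Set[str]:
    """Keep only candidates the logs themselves support.

    The LLM's output never decides anything here; an IP it names is
    dropped unless the deterministic count reaches the threshold.
    """
    counts = failures_by_ip(logs)
    return {ip for ip in llm_candidates if counts[ip] >= min_failures}


# Hypothetical log entries and LLM output.
logs = [
    {"ip": "198.51.100.7", "event": "auth_failure"},
    {"ip": "198.51.100.7", "event": "auth_failure"},
    {"ip": "198.51.100.7", "event": "auth_failure"},
    {"ip": "203.0.113.5", "event": "auth_failure"},
    {"ip": "203.0.113.5", "event": "auth_success"},
]
llm_candidates = ["198.51.100.7", "203.0.113.5", "192.0.2.99"]
to_block = confirmed_blocks(llm_candidates, logs)
```

In this example the address that never appears in the logs and the one with a single failure are both rejected, regardless of what the LLM said about them.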

Why this unglamorous discipline is the real differentiator #

The AI Ops and AI security space is awash in flashy language. But what the people on the ground entrusted with production infrastructure truly care about is “will it cause an incident?” and “won’t a false detection or false control land us in trouble?”

So we speak not in buzzwords but in facts: attack-campaign detection that actually runs in production, a closed-network configuration in which no logs ever leave the premises, and the very way we build, which is to put into production only what we can prove.

Don’t decide by beauty. Don’t decide on paper. Be able to doubt your own beautiful hypothesis with data. That is our discipline for bringing BASTION to a state in which it may be placed into production infrastructure.


Updated on 2026-06-13
