AI agents in production: guardrails, evals, and when NOT to use them

What we learned shipping CyberFort Lab with 9 agents running 24/7. No theory — only the decisions that keep an agent from doing harm or burning money.

CyberFort Lab runs 9 specialized AI agents, 24/7, on real client infrastructure. We know what hurts when you ship agents to production. What follows is the stuff tutorials skip: what already cost us money or time to learn.

When NOT to use an agent

If the flow is deterministic (clear input → clear output, no ambiguity), do not use an agent. Use code. Agents are expensive, slow and non-deterministic. Good for: unstructured text classification, extraction, drafting, judgment-based routing. Bad for: calculations, data transformations, integrations with deterministic APIs (that is just code calling another API).

Output guardrails — always

Use three layers. Structural validation (Pydantic, Zod, JSON Schema): the output must parse or it is discarded. Semantic validation: a second model or a set of rules verifies that the output makes sense in context. Rate limits and kill switches: if an agent performs more than N actions per minute, it shuts down automatically.
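As a rough illustration of the first and third layers, here is a minimal sketch using only the Python standard library. It stands in for a schema library such as Pydantic; the action set, the field names (`action`, `reason`), and the `KillSwitch` class are assumptions made for the example, not part of any real framework.

```python
import json
import time
from collections import deque

ALLOWED_ACTIONS = {"reply", "escalate", "ignore"}  # hypothetical action set


def parse_agent_output(raw: str):
    """Structural check: return a validated dict, or None if the output must be discarded."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    reason = data.get("reason")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    if not isinstance(reason, str):
        return None
    return {"action": action, "reason": reason}


class KillSwitch:
    """Trips once more than max_actions happen inside a sliding window, then stays off."""

    def __init__(self, max_actions: int, window_s: float = 60.0, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_s = window_s
        self.clock = clock
        self.timestamps = deque()
        self.tripped = False

    def allow(self) -> bool:
        if self.tripped:
            return False
        now = self.clock()
        # Drop timestamps that have fallen out of the sliding window.
        while self.timestamps and now - self.timestamps[0] >= self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_actions:
            self.tripped = True  # stays off until a human resets it
            return False
        self.timestamps.append(now)
        return True
```

The agent loop would call `allow()` before every action and stop when it returns `False`. The switch deliberately does not reset itself: once tripped, someone has to look at what happened.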

Evals before changes — always

An agent without an eval suite is like code without tests. Build a dataset of 50–100 cases with expected outputs. Whenever you change the prompt, the model, or the tools, run the evals. If quality drops, do not deploy. It is laborious up front, and it is the investment that pays off best over the following 12 months.
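A minimal sketch of that gate might look like the following. It assumes the agent can be wrapped as a plain function from prompt to string and that exact match is an acceptable metric; real suites often need fuzzier scoring. `EvalCase`, `run_evals`, and `can_deploy` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases where the agent output matches the expected output."""
    if not cases:
        raise ValueError("eval dataset is empty")
    passed = sum(1 for case in cases if agent(case.prompt).strip() == case.expected)
    return passed / len(cases)


def can_deploy(candidate_score: float, baseline_score: float, tolerance: float = 0.0) -> bool:
    """Allow the deploy only if quality did not drop below the baseline minus a tolerance."""
    return candidate_score >= baseline_score - tolerance
```

In CI, the baseline score would come from the currently deployed prompt and model, and a `False` from `can_deploy` would fail the pipeline.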

Observability — higher than for normal code

For every model call, log the prompt, response, tokens, latency, cost, and tool called. Use LangSmith, Helicone, Langfuse, or whatever you prefer, but log all of it. Without this, debugging strange behavior becomes impossible.
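If no tracing tool is in place yet, a thin wrapper can capture the same fields. The sketch below is not the API of LangSmith, Helicone, or Langfuse; it assumes a hypothetical `model_fn` that returns a dict with `text`, `prompt_tokens`, `completion_tokens`, and an optional `tool`, and a flat price per 1,000 tokens.

```python
import json
import time
from typing import Callable, Dict, List


def logged_call(model_fn: Callable[[str], Dict], prompt: str, sink: List[str],
                usd_per_1k_tokens: float) -> Dict:
    """Call model_fn and append one JSON log line covering the fields listed above."""
    start = time.perf_counter()
    result = model_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    tokens = result["prompt_tokens"] + result["completion_tokens"]
    record = {
        "prompt": prompt,
        "response": result["text"],
        "tokens": tokens,
        "latency_ms": round(latency_ms, 2),
        "cost_usd": round(tokens / 1000 * usd_per_1k_tokens, 6),
        "tool": result.get("tool"),  # None when no tool was called
    }
    sink.append(json.dumps(record))
    return result
```

Here `sink` is just a list; in practice it would be a log file, a queue, or the client of whichever tracing service is in use.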

Cost control — the hidden problem

A poorly designed loop can burn 500 USD in an hour. Set soft caps per user and hard caps per agent, and alert as spending approaches a cap. We learned this the hard way.
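One way to sketch those caps, assuming costs are tracked in memory and `charge` is called with an estimated cost before each model call. The `Budget` class, the 80% alert ratio, and the choice to warn (not block) on the user soft cap are assumptions for illustration.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a charge would push an agent past its hard cap."""


class Budget:
    def __init__(self, user_soft_cap, agent_hard_cap, alert_ratio=0.8, alert=print):
        self.user_soft_cap = user_soft_cap
        self.agent_hard_cap = agent_hard_cap
        self.alert_ratio = alert_ratio
        self.alert = alert  # any callable that accepts a message string
        self.user_spend = defaultdict(float)
        self.agent_spend = defaultdict(float)

    def charge(self, user, agent, usd):
        # Hard cap: refuse the charge before it is recorded.
        if self.agent_spend[agent] + usd > self.agent_hard_cap:
            raise BudgetExceeded(f"agent {agent} would exceed {self.agent_hard_cap} USD")
        self.agent_spend[agent] += usd
        self.user_spend[user] += usd
        # Alert while there is still room to react.
        if self.agent_spend[agent] >= self.alert_ratio * self.agent_hard_cap:
            self.alert(f"agent {agent} at {self.agent_spend[agent]:.2f} USD, near hard cap")
        # Soft cap: warn but keep serving the user.
        if self.user_spend[user] > self.user_soft_cap:
            self.alert(f"user {user} over soft cap at {self.user_spend[user]:.2f} USD")
```

A production version would persist the counters and reset them per billing window; the point here is only that the hard cap raises before the spend happens.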

Primary model vs cheap model

Use the big model for the main task and a cheap model to validate or pre-classify. Cascade: if the cheap model is confident, do not call the big one. Typical reduction: 60–80% of total cost without losing eval-measured quality.
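The cascade can be sketched in a few lines. This assumes the cheap model returns a label together with a confidence score between 0 and 1, which not every model exposes directly; `cheap_model`, `big_model`, and the 0.9 threshold are placeholders to be tuned against the eval suite.

```python
from typing import Callable, Tuple


def cascade(text: str,
            cheap_model: Callable[[str], Tuple[str, float]],
            big_model: Callable[[str], str],
            threshold: float = 0.9) -> Tuple[str, str]:
    """Return (answer, which_model_answered). Call the big model only on low confidence."""
    label, confidence = cheap_model(text)
    if confidence >= threshold:
        return label, "cheap"
    return big_model(text), "big"
```

Logging which model answered makes it possible to measure the share of traffic the cheap model absorbs, and to check on the evals that quality holds at the chosen threshold.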

Human in the loop — define where

For reversible actions (sending a message to a user), the agent decides. For irreversible actions (transferring money, deleting data, contacting an important customer), always require human approval. The rule: if the agent is wrong, can the action be undone? If not, a human approves it first.
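In code, the rule reduces to a single branch at dispatch time. The sketch below assumes each action is tagged as reversible or not when it is defined; `Action`, `dispatch`, and the in-memory approval queue are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    name: str
    reversible: bool


def dispatch(action: Action, execute: Callable[[Action], None],
             approval_queue: List[Action]) -> str:
    """Run reversible actions immediately; park irreversible ones for a human."""
    if action.reversible:
        execute(action)
        return "executed"
    approval_queue.append(action)
    return "pending_approval"
```

The important design choice is that reversibility is declared by whoever defines the action, not inferred by the agent at run time.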

How we help at Athrun Data Intelligence

We build production AI agents under this model. We offer a 30-minute call to see whether your use case fits agents or is better solved with regular code. We talk honestly about when not to apply AI.