Building Production AI Agents: Architecture, Guardrails, and Evaluation

A prototype agent that works in a demo is easy. A production agent that is safe, reliable, observable, and cost-controlled is hard.

This article focuses on practical production design choices.


Production Agent Architecture

A robust architecture separates concerns:

  • input router
  • planner/executor loop
  • tool gateway
  • memory subsystem
  • policy engine
  • evaluator and telemetry pipeline

Separation helps enforce control boundaries and simplify incident response.


Tool Governance and Permissions

Every tool should define:

  • input schema
  • output schema
  • side-effect level (read-only vs write)
  • required permission scope
  • rate and retry policy

Never allow unconstrained arbitrary tool execution from raw model output.


Guardrail Layers

Use multiple layers:

  1. pre-action validation (policy, auth, schema)
  2. runtime checks (timeouts, quotas, anomaly checks)
  3. post-action validation (result sanity)
  4. final response filters (safety/compliance)

Single-layer guardrails are brittle.


Evaluation Framework for Agents

Evaluate agent quality across dimensions:

  • task completion rate
  • tool-call success rate
  • step efficiency (steps/task)
  • policy compliance rate
  • cost per successful task
  • human escalation rate

Agent evaluation is process-centric, not only answer-centric.


Test Matrix You Should Have

At minimum:

  • happy path tasks
  • ambiguous task definitions
  • tool outage simulation
  • malformed tool responses
  • adversarial prompts
  • high-cost action attempts

A strong test matrix catches most expensive failures before launch.


Observability and Traceability

Log at step level:

  • user goal
  • plan decisions
  • selected tools and args
  • tool outputs
  • policy decisions
  • final response and confidence

Step-level traces are essential for audits, debugging, and trust.


Cost and Latency Control

Agent systems can be expensive due to multi-step loops. Control with:

  • max step budget
  • model tiering by task complexity
  • tool result caching
  • deterministic shortcuts for common tasks

Measure cost per completed goal, not just cost per LLM call.


Incident Response for Agents

Agent incidents differ from chat incidents because actions can have side effects.

Playbook:

  1. stop risky action paths
  2. switch to read-only mode
  3. rollback planner/prompt/tool policy versions
  4. audit affected actions
  5. apply remediation and policy update

Have “kill switches” for high-risk tools.


End-to-End Code Example (Guardrailed Tool-Calling Agent)

from dataclasses import dataclass
from typing import Dict, Any, Callable

MAX_STEPS = 5
ALLOWED_TOOLS = {"search_kb", "create_ticket"}

@dataclass
class Decision:
    action: str
    args: Dict[str, Any]

# Mock planner (replace with LLM planner call)
def planner(goal: str, state: Dict[str, Any]) -> Decision:
    if not state.get("found_answer"):
        return Decision("search_kb", {"query": goal})
    if state.get("needs_escalation"):
        return Decision("create_ticket", {"title": "Escalation", "priority": "high"})
    return Decision("finish", {"answer": state.get("answer", "Done")})

# Mock tools

def search_kb(query: str) -> Dict[str, Any]:
    if "refund" in query.lower():
        return {"answer": "Refund allowed within 30 days", "needs_escalation": False}
    return {"answer": "No exact policy found", "needs_escalation": True}

def create_ticket(title: str, priority: str) -> Dict[str, Any]:
    return {"ticket_id": "SUP-1042", "status": "created", "title": title, "priority": priority}

TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "search_kb": search_kb,
    "create_ticket": create_ticket,
}

def validate_action(decision: Decision) -> None:
    if decision.action == "finish":
        return
    if decision.action not in ALLOWED_TOOLS:
        raise ValueError(f"Blocked tool: {decision.action}")
    if decision.action == "create_ticket":
        if decision.args.get("priority") not in {"low", "medium", "high"}:
            raise ValueError("Invalid priority")

def run_agent(goal: str) -> Dict[str, Any]:
    state: Dict[str, Any] = {}
    trace = []

    for step in range(1, MAX_STEPS + 1):
        d = planner(goal, state)
        trace.append({"step": step, "decision": d.action, "args": d.args})

        validate_action(d)

        if d.action == "finish":
            return {"ok": True, "result": d.args.get("answer"), "trace": trace}

        tool_fn = TOOLS[d.action]
        out = tool_fn(**d.args)
        trace[-1]["tool_output"] = out

        if d.action == "search_kb":
            state["answer"] = out.get("answer")
            state["found_answer"] = True
            state["needs_escalation"] = out.get("needs_escalation", False)
        elif d.action == "create_ticket":
            return {"ok": True, "result": f"Escalated: {out['ticket_id']}", "trace": trace}

    return {"ok": False, "error": "max steps reached", "trace": trace}

if __name__ == "__main__":
    result = run_agent("Customer asks for refund after purchase")
    print(result)

This example demonstrates schema checks, tool allowlisting, and step-budget control.


Rollout Strategy

For production release:

  1. shadow mode with full traces
  2. read-only tool mode
  3. limited write actions for low-risk cohorts
  4. canary rollout with strict guardrails
  5. full rollout after stability window

Progressive rollout is critical for side-effecting agent systems.


Key Takeaways

  • Production agents require architecture-level guardrails, not prompt-only safeguards.
  • Tool governance and permission scoping are core safety controls.
  • Evaluate process metrics (steps, tool success, compliance), not just final answers.
  • Step-level observability and rollback controls are mandatory.