AI Agent Guardrails: Detecting Drift Before It Ships

Here's how I learned this lesson. Inside a production multi-agent system I run, a QA agent, whose entire job was to reject content that broke the rules, started approving it. Nothing crashed. No error fired. It just quietly got more permissive over weeks, until I noticed the wrong things shipping. Agent-Constitution is the fix I built so it wouldn't happen again.

"Guardrails" is one of the most-hyped words in AI and one of the least-implemented. The reason agents drift is mundane: the rules live inside a prompt, and prompts get edited, trimmed, and "improved" over time. After enough iterations the agent no longer follows the original constraints — and because you never wrote those constraints down separately, you can't even tell when or how the drift happened.

The root cause: if your agent's rules exist only as prose inside its prompt, you have no baseline. Drift is invisible because there's nothing to diff against. The fix isn't a smarter model; it's a separate, versioned source of truth.

Step 1: Write the Constitution

A constitution is a separate, versioned document — not buried in the prompt — that defines the agent's rules in three sections:

Capabilities — what the agent does, as active statements. "Searches the web for relevant content on a given topic."
Constraints — what it must never do, as prohibitions. "Must never fabricate citations or invent URLs."
Decision rules — how to behave when instructions conflict. "If research conflicts with the user's stated goal, surface the conflict rather than suppress it."

Because it's a standalone file under version control, every change to the agent's rules is now a reviewable diff with a timestamp — not an invisible edit lost in a 2,000-word prompt.
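To make the three-section structure concrete, here is a minimal sketch of loading a constitution into a typed object. The section names come from the list above; the file syntax (`## ` headings with `- ` bullets), the `Constitution` class, and `parse_constitution` are assumptions made for illustration, not the format Agent-Constitution actually uses.

```python
from dataclasses import dataclass, field

# Hypothetical file layout: the section names come from the article, but the
# "## Heading" plus "- rule" syntax is an assumption made for this sketch.
EXAMPLE_CONSTITUTION = """\
## Capabilities
- Searches the web for relevant content on a given topic.

## Constraints
- Must never fabricate citations or invent URLs.

## Decision rules
- If research conflicts with the user's stated goal, surface the conflict rather than suppress it.
"""


@dataclass
class Constitution:
    capabilities: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    decision_rules: list[str] = field(default_factory=list)


def parse_constitution(text: str) -> Constitution:
    """Collect the bullet lines found under each of the three known headings."""
    constitution = Constitution()
    sections = {
        "capabilities": constitution.capabilities,
        "constraints": constitution.constraints,
        "decision rules": constitution.decision_rules,
    }
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            # Unknown headings map to None, so their bullets are ignored.
            current = sections.get(line[3:].strip().lower())
        elif line.startswith("- ") and current is not None:
            current.append(line[2:].strip())
    return constitution


if __name__ == "__main__":
    parsed = parse_constitution(EXAMPLE_CONSTITUTION)
    for rule in parsed.constraints:
        print("constraint:", rule)
```

Keeping one rule per line is a deliberate choice in this sketch: a dropped or reworded rule then shows up as a single changed line in the version-control diff.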

Step 2: Validate Outputs Against It

The validator checks an agent's response against the constitution's constraints before that response reaches a user:

Shell (Python CLI): validate a response

python src/validator.py \
  --constitution my-agent-constitution.md \
  --response "agent output here"

This is the eval-to-guardrail idea in practice: the same rules you'd check in pre-production become a runtime gate. A constraint violation doesn't get logged for later — it gets caught at the door.
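As a rough illustration of a runtime gate, the sketch below maps one constraint to a programmatic check and blocks the response when the check fails. Everything here is hypothetical: `no_invented_urls`, `CHECKS`, `validate`, `gate`, and the idea of passing in a set of known source URLs are assumptions for the example, and the real `src/validator.py` may work quite differently (for instance, by asking a model to judge the response).

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


@dataclass
class Violation:
    constraint: str  # the constitution rule that was broken
    detail: str      # what in the response broke it


class ConstraintViolationError(Exception):
    """Raised by the gate so a violating response never reaches the user."""


def no_invented_urls(response: str, known_sources: set[str]) -> Optional[str]:
    """Hypothetical check: every URL in the response must be a known source."""
    cited = {url.rstrip(".,;:") for url in URL_PATTERN.findall(response)}
    unknown = sorted(cited - known_sources)
    return f"URLs not found in known sources: {unknown}" if unknown else None


# Assumed mapping from constraint text to the function that enforces it.
CHECKS: dict[str, Callable[[str, set[str]], Optional[str]]] = {
    "Must never fabricate citations or invent URLs.": no_invented_urls,
}


def validate(response: str, known_sources: set[str]) -> list[Violation]:
    """Run every registered check and collect the violations."""
    violations = []
    for constraint, check in CHECKS.items():
        detail = check(response, known_sources)
        if detail is not None:
            violations.append(Violation(constraint, detail))
    return violations


def gate(response: str, known_sources: set[str]) -> str:
    """Return the response unchanged, or raise if any constraint is violated."""
    violations = validate(response, known_sources)
    if violations:
        raise ConstraintViolationError("; ".join(v.detail for v in violations))
    return response


if __name__ == "__main__":
    sources = {"https://example.com/report"}
    print(validate("Source: https://example.com/invented", sources))
```

Raising from `gate` rather than returning a flag is the design choice that makes this a gate: the caller cannot forward the response without explicitly handling the violation.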

Step 3: Detect Drift in the Prompt Itself

The subtler tool. The drift detector compares the current prompt against the original constitution and flags where they've diverged:

Shell (Python CLI): detect drift

python src/drift_detector.py \
  --constitution my-agent-constitution.md \
  --prompt current_prompt.txt

This catches the slow QA-agent failure directly: when someone "improves" the prompt and quietly drops a hard constraint, the detector sees the original rule is no longer represented and flags it — before the looser agent ships.
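One simple way to approximate "is this rule still represented in the prompt?" is lexical coverage: what fraction of a constraint's content words still appear in the current prompt. The sketch below uses that heuristic. It is an assumption chosen for illustration, not a description of how `src/drift_detector.py` works, and the stopword list, the `find_drift` name, and the 0.6 threshold are all hypothetical.

```python
import re

# Small, assumed stopword list. Note that negations such as "never" are
# stripped, so this heuristic detects dropped rules, not inverted ones.
STOPWORDS = {"a", "an", "and", "do", "it", "must", "never", "not", "of", "or", "that", "the", "to"}


def content_words(text: str) -> set[str]:
    """Lowercase the text and keep the words that carry meaning."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def coverage(rule: str, prompt: str) -> float:
    """Fraction of the rule's content words that still appear in the prompt."""
    rule_words = content_words(rule)
    if not rule_words:
        return 1.0
    return len(rule_words & content_words(prompt)) / len(rule_words)


def find_drift(constraints: list[str], prompt: str, threshold: float = 0.6) -> list[tuple[str, float]]:
    """Return (constraint, coverage) pairs whose coverage is below the threshold."""
    flagged = []
    for rule in constraints:
        score = coverage(rule, prompt)
        if score < threshold:
            flagged.append((rule, score))
    return flagged


if __name__ == "__main__":
    constraints = [
        "Must never fabricate citations or invent URLs.",
        "Must reject content that lacks a source.",
    ]
    edited_prompt = (
        "You are a helpful QA agent. Never fabricate citations or invent URLs. "
        "Prefer content with good sources."
    )
    for rule, score in find_drift(constraints, edited_prompt):
        print(f"possible drift ({score:.2f}): {rule}")
```

A word-overlap check like this is crude: it misses paraphrases and cannot tell "never do X" from "do X". A production detector would likely need semantic comparison, but even this version would flag the case where an edit softens "reject content that lacks a source" into a preference.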

Constitution: versioned rules (capabilities, constraints, decision rules)
Validator: gate outputs against the rules at runtime
Drift detector: diff the live prompt against the original to catch erosion early

Why This Beats a Generic Guardrails Library

Off-the-shelf guardrails check for toxicity, PII, and jailbreaks — important, but generic. They don't know your agent's job. A constitution encodes the domain-specific rules that actually matter for your use case, and the drift detector watches the one failure mode generic tools can't see: your own team slowly eroding the rules through ordinary prompt edits.

What I Built

Agent-Constitution ships the template, validator, drift detector, and example constitutions for research, content, and QA-gate agents. It pairs naturally with failure memory (catch repeats) and evals (catch regressions): three angles on the same goal, agents that stay reliable. More on that philosophy in "The boring infrastructure that actually ships."