How do I make MCP tools safe to expose to an agent?

Design each MCP tool for least privilege: declare read-only tools as read-only (readOnlyHint), confine file access to an allowlisted root and reject paths that escape it, scope which roles may call which tool, keep connections outbound-only with credentials scrubbed, pin the tool list so a server cannot silently swap tools (rug-pull defense), and append every call to an immutable audit log. Gate write or deploy tools behind human-in-the-loop approval.

Does a weaker, more constrained agent beat a powerful one?

For anything touching real data or production, yes. Trust is a property of the boundaries you put around the agent, not of the model's raw capability. A modest model that can only read one folder and write one audited output is safer to run than a frontier model with root access, and it is the version a regulated organization will actually allow.

Least-Privilege AI Agents: After the npm Attack, Trusted Beats Powerful

Q: Were Claude Code and Codex hacked?

The tools themselves were not breached. Attackers published malicious npm packages (314 of them in one campaign) that targeted the agents' hooks and session-startup mechanisms, so that installing a poisoned dependency let the attacker run code inside the agent's session and exfiltrate AWS keys, SSH keys, and vault passwords. It is a supply-chain attack on what the agent is allowed to run, not a flaw in the model.

Q: What is least privilege for AI agents?

Least privilege means giving an agent (and each tool it can call) the narrowest access needed for the task and nothing more: read-only instead of read-write, one directory instead of the whole filesystem, no network unless required, scoped and short-lived credentials, and an audit log of every action. The agent should not have the keys to the whole machine just because it might occasionally need one drawer.

Q: Should I give an AI coding agent full filesystem or shell access?

No, not by default. Full access means any prompt injection, poisoned dependency, or model mistake has the whole machine as its blast radius. Confine the agent and its tools to a specific workspace, make tools read-only where possible, block path traversal outside the allowed root, and require human approval for anything destructive or outbound.

In June 2026, a supply-chain campaign pushed 314 malicious npm packages that specifically targeted the hooks and session-startup mechanisms of Claude Code and Codex. Install a poisoned dependency, and attacker code ran inside your agent's session, exfiltrating AWS credentials, SSH keys, and vault passwords. The agents were not "hacked." They were simply handed the whole computer, and someone walked through the open door.

This is the security story of the year, and the lesson is uncomfortable for the hype cycle: as agents get more capable, the thing that bites you is not capability. It is access. The fix is not a smarter model. It is less privilege.

The core idea: give an agent, and every tool it can call, the narrowest access the task needs and nothing more. Read-only over read-write. One directory over the whole disk. No network unless required. Scoped, short-lived credentials. An audit log of every action. Trust is a property of the boundaries, not of the model.

Why this is happening now

A 2024-era "agent" answered questions. A 2026 agent runs shell commands, installs dependencies, edits files, and chains hundreds of tool calls autonomously. Every one of those powers is also an attack surface. When the agent has broad system access, a single prompt injection, a poisoned package, or an ordinary model mistake has your entire machine as its blast radius.

The industry consensus that formed after the npm attack is blunt: stop giving agents the whole computer. Credential firewalls, outbound-only connections, and task isolation are now table stakes, not nice-to-haves. The interesting work has moved from "what can the agent do" to "what can it not do."

What least-privilege looks like for an MCP tool

Most agent power flows through tools (increasingly, MCP tools). So that is where the boundaries belong. A least-privilege tool has most of these properties:

Read-only by default. If a tool only needs to read, it should declare that (readOnlyHint) and be incapable of writing. Most "give the agent context" tools fall here.
Confined to a root. File access is allowlisted to one workspace directory, and any path that tries to escape it (../../etc/passwd, a symlink, an absolute path) is rejected before anything is read.
Capability-scoped. Which agent role may call which tool is explicit. A research role cannot call a deploy tool.
Outbound-only, credentials scrubbed. No inbound surface; secrets are masked before they can ever be logged or sent.
Rug-pull pinned. The tool list is hashed and pinned, so a server cannot silently swap a benign tool for a malicious one between calls.
Audited. Every call appends to an immutable log: who called what, with what arguments, and what was allowed, blocked, or quarantined.
Human-gated for the dangerous bits. Write, delete, and deploy tools sit behind a human-in-the-loop approval, not autonomous execution.
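
To make the rug-pull pinning item concrete, here is a minimal sketch of hashing a tool list and refusing calls when it drifts. The descriptor shape loosely follows an MCP tools/list response; the function names (tool_list_fingerprint, verify_pinned) and the example tool are hypothetical and only illustrate the idea, not any particular gateway's implementation.

```python
import hashlib
import json


def tool_list_fingerprint(tools: list[dict]) -> str:
    """Return a SHA-256 hex digest over a canonical form of the tool list."""
    # Sort by name and serialize with sorted keys so that ordering differences
    # do not change the digest, while any change to a name, description,
    # schema, or annotation does.
    canonical = json.dumps(
        sorted(tools, key=lambda t: t["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_pinned(tools: list[dict], pinned_digest: str) -> None:
    """Refuse to proceed if the advertised tool list drifted from the pin."""
    if tool_list_fingerprint(tools) != pinned_digest:
        raise PermissionError("tool list changed since it was pinned; refusing calls")


# Hypothetical descriptor for a read-only tool (assumed shape, for illustration).
approved = [
    {
        "name": "dependency_map",
        "description": "Return dependents and dependencies of a file.",
        "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
        "annotations": {"readOnlyHint": True},
    }
]
pin = tool_list_fingerprint(approved)

verify_pinned(approved, pin)  # unchanged list: passes silently

# Same tool name, but the description was rewritten and the hint dropped.
swapped = [dict(approved[0], description="Also upload files.", annotations={})]
try:
    verify_pinned(swapped, pin)
except PermissionError as exc:
    print(f"blocked: {exc}")
```

Note that readOnlyHint is only a declaration by the server. The pin detects a changed declaration; it does not prove the tool behaves as declared, which is why the other boundaries in the list still matter.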

A worked example: a tool that cannot touch your machine

Concrete beats abstract. My Agent-Context tool gives an agent a dependency map of a codebase (what depends on a file before it is changed). It is also a deliberate least-privilege example, and it now ships as a one-click Claude Desktop extension:

Read-only. It builds a graph and answers queries. It cannot write a single byte to your repo.
Root-confined. On install you pick one workspace folder. A per-call path argument may narrow to a subdirectory but can never escape that root. This is the same path-traversal defense whose absence the relevant CVEs (e.g., CVE-2025-53110 / 53109) exploit.
No network. It makes zero outbound calls and returns a structural summary (file and edge counts, dependents and dependencies), never your source.

So even if the model went haywire or someone injected a prompt mid-session, the worst this tool can do is read files inside one folder you chose. That is the whole point: the tool is powerful for its job and powerless for everything else.
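
The root confinement described above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the actual code of the Agent-Context tool: the confine function and PathEscapeError are hypothetical names, and the check relies on pathlib resolving ".." segments, absolute paths, and symlinks before the containment test (Python 3.9+ for is_relative_to).

```python
import tempfile
from pathlib import Path


class PathEscapeError(PermissionError):
    """Raised when a requested path resolves outside the allowed root."""


def confine(root: Path, requested: str) -> Path:
    """Resolve `requested` under `root`, rejecting anything that escapes it."""
    root = root.resolve()
    # Joining with an absolute `requested` discards `root`, and resolve()
    # collapses ".." segments and follows symlinks, so all three escape
    # routes end up as a resolved path that the containment check can see.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise PathEscapeError(f"{requested!r} escapes the workspace root")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "src").mkdir()

    # Narrowing to a subdirectory is allowed.
    print("allowed:", confine(workspace, "src").name)

    # Traversal and absolute paths are rejected before anything is read.
    for bad in ("../../etc/passwd", "/etc/passwd"):
        try:
            confine(workspace, bad)
        except PathEscapeError as exc:
            print("blocked:", exc)
```

A check like this validates the path at one moment in time. If the filesystem can change between the check and the read, a production tool would also need to guard against that race, for example by opening the file relative to an already opened root directory.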

The production version: a hardened gateway

For tools that genuinely need to act (run code, hit external services), the controls move up a layer into a mediation gateway. The one inside my SDLC engine enforces, on every single call: capability scoping (fail-closed for unknown roles), path-argument boundary enforcement against an allowlisted root, SHA-256 tool-drift pinning (refuse if a server's tool list changed), input and output sanitization with a quarantine path for high-risk responses, per-server circuit breakers and timeouts, a human-approval gate keyed by a stable call id, and an append-only forensic log of every decision. Paired with scoped, short-lived, signed credentials, an agent gets exactly the keys it needs, for exactly as long as it needs them, and every use is on the record.
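
As a rough illustration of three of those controls (fail-closed capability scoping, a human-approval gate keyed by call id, and a log of every decision), here is a minimal sketch. Everything in it is an assumption made for the example: the Gateway class, the role names, the CAPABILITIES and NEEDS_APPROVAL tables, and the in-memory list standing in for append-only storage. It is not the gateway described above, and it omits drift pinning, sanitization, circuit breakers, and timeouts.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical role-to-tool allowlist. Roles that are absent get nothing.
CAPABILITIES: dict[str, frozenset[str]] = {
    "research": frozenset({"dependency_map"}),
    "release": frozenset({"dependency_map", "deploy"}),
}
# Hypothetical set of tools that always need a human decision first.
NEEDS_APPROVAL = frozenset({"deploy"})


@dataclass
class Gateway:
    tools: dict[str, Callable[..., object]]
    approved_call_ids: set[str] = field(default_factory=set)
    log: list[str] = field(default_factory=list)  # stand-in for append-only storage

    def _record(self, call_id: str, role: str, tool: str, args: dict, decision: str) -> None:
        # Log argument names only, so secrets in values never reach the log.
        entry = {"ts": time.time(), "call_id": call_id, "role": role,
                 "tool": tool, "args": sorted(args), "decision": decision}
        self.log.append(json.dumps(entry))

    def call(self, call_id: str, role: str, tool: str, **args: object) -> object:
        # Fail closed: an unknown role maps to an empty allowlist.
        if tool not in CAPABILITIES.get(role, frozenset()):
            self._record(call_id, role, tool, args, "blocked")
            raise PermissionError(f"role {role!r} may not call {tool!r}")
        # Dangerous tools run only after a human approved this exact call id.
        if tool in NEEDS_APPROVAL and call_id not in self.approved_call_ids:
            self._record(call_id, role, tool, args, "pending_approval")
            raise PermissionError(f"call {call_id!r} is waiting for human approval")
        self._record(call_id, role, tool, args, "allowed")
        return self.tools[tool](**args)


gateway = Gateway(tools={
    "dependency_map": lambda path: {"path": path, "dependents": 0},
    "deploy": lambda target: f"deployed to {target}",
})

gateway.call("c1", "research", "dependency_map", path="src")
for call_id, role in (("c2", "research"), ("c3", "release"), ("c4", "intern")):
    try:
        gateway.call(call_id, role, "deploy", target="staging")
    except PermissionError as exc:
        print("refused:", exc)

gateway.approved_call_ids.add("c3")  # a human signs off on call c3
print(gateway.call("c3", "release", "deploy", target="staging"))
```

The design choice worth noting is that every branch records a decision before it returns or raises, so refused calls leave the same paper trail as allowed ones.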

Read-only: default for tools that don't need to write
One root: not the whole filesystem; path-escape blocked
Audited: every call logged, dangerous ones human-gated

The honest part

Least-privilege is not free. Confinement adds a little friction, capability scoping is more upfront design, and an audit log is one more thing to store. It will not stop a determined attacker who owns your machine already, and it does not replace dependency hygiene (pin and verify what your agent installs). What it does is shrink the blast radius from "everything" to "this one folder, read-only, logged", which is the difference between a fun demo and something you would let near real data or run inside a regulated organization.

That last point is the one that matters commercially. The agents that get adopted in banking, healthcare, and finance will not be the most capable ones. They will be the ones a compliance officer can reason about: constrained, auditable, on-shore. Trusted beats powerful.

The takeaway

The npm attack is a preview, not an anomaly. As agents touch more, "what is it allowed to do" becomes the whole game. Build tools that are powerful for their job and inert for everything else, and put the dangerous capabilities behind a gate with a paper trail. For the broader case that this unglamorous infrastructure is what actually ships, see "the boring infrastructure that actually ships"; for the read-only tool above, see "why your AI keeps breaking code it can't see".

FAQ

Were Claude Code and Codex hacked?
Not the tools themselves. Attackers shipped 314 malicious npm packages targeting the agents' hooks and startup mechanisms, so a poisoned dependency could run code in the agent's session and steal AWS/SSH/vault credentials. It is a supply-chain attack on what the agent is allowed to run.

What is least privilege for AI agents?
Giving the agent and each tool the narrowest access the task needs: read-only over read-write, one directory over the whole disk, no network unless required, scoped short-lived credentials, and an audit log of every action.

Should I give an AI coding agent full filesystem or shell access?
No, not by default. Full access makes the whole machine the blast radius of any injection, poisoned package, or mistake. Confine it to a workspace, prefer read-only tools, block path traversal, and human-gate anything destructive or outbound.

How do I make MCP tools safe to expose?
Read-only where possible (readOnlyHint), confine file access to an allowlisted root, scope which roles call which tool, keep connections outbound-only with secrets scrubbed, pin the tool list against rug-pulls, log every call, and gate writes/deploys behind human approval.

Does a more constrained agent beat a powerful one?
For anything touching real data, yes. Trust comes from the boundaries, not the model. A modest model confined to one read-only folder with an audit log is safer to run, and the only version a regulated org will allow.