Stop Building Toy MCP Servers: A Blueprint for Production Integrations

The Model Context Protocol won. It is how agents talk to databases, filesystems, and internal APIs, and it is now screened for in AI engineering interviews. But search GitHub for "MCP server example" and you will find a thousand variations of the same forty-line script: one tool, a happy path, and nothing between the model and your data.

That script is a great way to learn the protocol and a catastrophic thing to point at a company's real data layer. The gap between the two is not the protocol; it is everything a toy omits. Here is the blueprint, and an honest account of which parts my own toolkit implements and which parts you must add.

The toy, and what it ignores

A toy MCP server registers a tool, dispatches a call, returns a result. It assumes the caller is trusted, the data is single-tenant, the call always succeeds, and nobody is watching. Point that at a heterogeneous production data layer and every one of those assumptions becomes an incident:

An agent acting for tenant A retrieves tenant B's rows, because the tool never scoped the query.
A reasoning loop calls your search tool 400 times in ten seconds and takes the database down.
A tool throws, the exception escapes, and the whole MCP session dies mid-conversation.
Something goes wrong in production and you have no trace, no request id, no structured log to reconstruct what the agent actually did.

None of these are exotic. They are Tuesday. A production-grade server is defined by how it handles them.

The five requirements

1. Per-tool error boundaries

A failure in one tool must never take down the session. Every tool call runs inside its own boundary that catches, logs, and returns a structured error the model can read and react to. This is the one piece my toolkit ships today, and it is the floor, not the ceiling:

JavaScript: every tool call is isolated (mcp-agent-toolkit/src/server.js)

server.setRequestHandler(CallToolRequestSchema, async (request) => {
    const { name, arguments: args } = request.params;
    try {
        let result;
        if      (name.startsWith('blackboard_')) result = handleBlackboard(db, name, args);
        else if (name.startsWith('scar_'))       result = handleScars(db, name, args);
        else if (name.startsWith('cache_'))      result = handleCache(db, name, args);
        else throw new Error(`Unknown tool: ${name}`);
        return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] };
    } catch (err) {
        // The session survives; the model receives a structured error.
        return { content: [{ type: 'text', text: `Error: ${err.message}` }], isError: true };
    }
});

2. Tenant isolation

This is the one that ends careers. The model must never be handed a raw connection to a shared data layer. Every tool receives a tenant context (resolved from the authenticated session, never from a model-supplied argument) and every query is scoped by it at the data-access layer, not in the prompt. The model can ask for "all open tickets"; the tool decides that means "all open tickets for this tenant," and there is no string the model can emit to escape that scope. Treat tenant id like a foreign key the model is structurally incapable of forging.
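
To make that concrete, here is a minimal sketch of tenant scoping at the data-access layer. It is not code from my toolkit. The in-memory TICKETS array stands in for a shared table, and createTicketStore, handleListTickets, and the shape of session are hypothetical names chosen for illustration. The point it tries to show is that the tenant id comes only from the authenticated session, and that a tenantId smuggled into the tool arguments is simply never read.

```javascript
// Hypothetical in-memory stand-in for a shared, multi-tenant table.
const TICKETS = [
  { id: 1, tenantId: 'acme', status: 'open', title: 'Login fails' },
  { id: 2, tenantId: 'globex', status: 'open', title: 'Export is slow' },
  { id: 3, tenantId: 'acme', status: 'closed', title: 'Typo on invoice' },
];

// Data-access layer: the tenant filter is applied here, unconditionally.
function createTicketStore(tenantId) {
  if (typeof tenantId !== 'string' || tenantId === '') {
    throw new Error('Refusing to create a store without a tenant id');
  }
  return Object.freeze({
    list({ status } = {}) {
      return TICKETS.filter(
        (t) => t.tenantId === tenantId && (status === undefined || t.status === status)
      );
    },
  });
}

// Tool handler: `session` comes from authentication, `args` from the model.
// Only whitelisted fields are read from `args`, so a model-supplied
// tenantId has no effect on the query.
function handleListTickets(session, args = {}) {
  const store = createTicketStore(session.tenantId);
  return store.list({ status: args.status });
}

// The model tries to reach another tenant's rows; the scope does not move.
const session = { tenantId: 'acme' };
const rows = handleListTickets(session, { status: 'open', tenantId: 'globex' });
console.log(rows.map((t) => t.id)); // [ 1 ]
```

With a real database the same idea would apply: the store would bind the tenant id as a query parameter in every statement it issues, and the tool handlers would never see an unscoped connection.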

3. Token-bucket rate limiting

An agent loop is an adversary you wrote yourself. It will call a tool as fast as the protocol allows. A token-bucket limiter per tenant (and ideally per tool) caps burst and sustained call rate, and returns a structured "rate limited, retry after N" the model can actually back off on. Without it, one runaway reasoning loop is a self-inflicted denial of service against your own database.
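
A token bucket is small enough to sketch in full. The version below is an illustration under assumptions, not something my toolkit ships: TokenBucket, checkRateLimit, and rateLimitedResult are hypothetical names, the default capacity and refill rate are arbitrary, and the buckets live in process memory, so this only works for a single server instance.

```javascript
class TokenBucket {
  // `now` is injectable so the bucket can be tested with a fake clock.
  constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start full, which permits an initial burst
    this.lastRefill = now();
  }

  tryTake() {
    // Refill lazily, based on the time elapsed since the last call.
    const current = this.now();
    const elapsedSec = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = current;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return { allowed: true };
    }
    // Time until one whole token is available again.
    const retryAfterMs = Math.ceil(((1 - this.tokens) / this.refillPerSecond) * 1000);
    return { allowed: false, retryAfterMs };
  }
}

// One bucket per (tenant, tool) pair. Options apply when the bucket is created.
const buckets = new Map();
function checkRateLimit(tenantId, toolName, options = { capacity: 5, refillPerSecond: 1 }) {
  const key = JSON.stringify([tenantId, toolName]);
  if (!buckets.has(key)) buckets.set(key, new TokenBucket(options));
  return buckets.get(key).tryTake();
}

// Structured error in the same shape as a tool result, so the model can back off.
function rateLimitedResult(retryAfterMs) {
  return {
    content: [{ type: 'text', text: JSON.stringify({ error: 'rate_limited', retryAfterMs }) }],
    isError: true,
  };
}
```

In a dispatcher, the check would run before the tool handler: if checkRateLimit returns allowed: false, return rateLimitedResult(retryAfterMs) instead of touching the database. A multi-instance deployment would need shared state (for example in Redis) rather than a local Map.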

4. Structured logging and observability

You cannot debug what an autonomous agent did from a stack trace. Every tool call should emit a structured JSON log line: timestamp, tenant, tool name, arguments (redacted), duration, outcome, and a trace id that ties the whole agent run together. Ship those to Datadog, Jaeger, or whatever you run. When an agent does something surprising in production, this is the only artifact that lets you reconstruct the decision path.
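
One way to get that log line is to wrap every handler call. The sketch below is an assumption about how you might do it, not part of my toolkit: callWithLogging, redact, the ctx object, and the list of sensitive keys are all hypothetical. Redaction here is shallow (top-level keys only), and the trace id is assumed to be created once per agent run and passed in.

```javascript
// Hypothetical list of argument names that must never reach the logs.
const SENSITIVE_KEYS = new Set(['password', 'token', 'apiKey', 'authorization']);

// Shallow redaction: replaces the values of sensitive top-level keys.
function redact(args = {}) {
  return Object.fromEntries(
    Object.entries(args).map(([k, v]) => [k, SENSITIVE_KEYS.has(k) ? '[REDACTED]' : v])
  );
}

// Runs one tool handler and emits exactly one JSON log line for it.
// `ctx` carries the tenant and the trace id for the whole agent run.
async function callWithLogging(ctx, toolName, args, handler, sink = console.log) {
  const started = performance.now();
  let outcome = 'ok';
  let errorMessage;
  try {
    return await handler(args);
  } catch (err) {
    outcome = 'error';
    errorMessage = err.message;
    throw err; // rethrow so the per-tool error boundary can handle it
  } finally {
    sink(JSON.stringify({
      timestamp: new Date().toISOString(),
      traceId: ctx.traceId,
      tenant: ctx.tenantId,
      tool: toolName,
      args: redact(args),
      durationMs: Math.round(performance.now() - started),
      outcome,
      ...(errorMessage !== undefined && { error: errorMessage }),
    }));
  }
}
```

Because the log line is written in finally, a failing call is logged with outcome "error" and then rethrown, so the error boundary from requirement 1 still decides what the model sees.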

5. Dual transports with auth on the remote one

Local tools speak stdio (the standard pattern, and what my toolkit uses) for Claude Desktop, Claude Code, and any local MCP client. But a server that backs a real product also needs a remote transport (Streamable HTTP in current versions of the spec, HTTP with SSE in older ones) so it can run as a service, and the moment it is remote it needs authentication and authorization on every connection. stdio trusts the local process; a remote transport trusts no one until they prove who they are.
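
The two checks a remote transport needs can be sketched independently of any HTTP framework. Everything below is illustrative and not from my toolkit: the TOKENS map, the token value, the tool names blackboard_read and cache_get, and the function names are hypothetical, and the code assumes header names arrive lowercased (as Node's http module delivers them). A real deployment would more likely validate signed or OAuth-issued tokens than look them up in a static map.

```javascript
// Hypothetical token store. In practice: a database or an identity provider.
const TOKENS = new Map([
  ['tok_acme_123', { tenantId: 'acme', allowedTools: ['blackboard_read', 'cache_get'] }],
]);

// Authentication: who is calling? Resolves a bearer token to a session.
function authenticate(headers) {
  const header = headers['authorization'] ?? '';
  const match = /^Bearer (\S+)$/.exec(header);
  const session = match ? TOKENS.get(match[1]) : undefined;
  return session
    ? { ok: true, session }
    : { ok: false, status: 401, error: 'missing or invalid bearer token' };
}

// Authorization: is this caller allowed to use this tool?
function authorize(session, toolName) {
  return session.allowedTools.includes(toolName)
    ? { ok: true }
    : { ok: false, status: 403, error: `tool not permitted: ${toolName}` };
}

// Gate for a remote tool call: both checks must pass before any handler runs.
function gateRemoteCall(headers, toolName) {
  const auth = authenticate(headers);
  if (!auth.ok) return auth;
  const permission = authorize(auth.session, toolName);
  if (!permission.ok) return permission;
  return { ok: true, session: auth.session };
}
```

The session returned by the gate is also where the tenant id for requirement 2 would come from, which is why it is resolved here and never taken from tool arguments.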

Honest scope of my toolkit

My MCP Agent Toolkit implements requirement 1 (per-tool error boundaries), durable persistence via node:sqlite, and the stdio half of requirement 5. It is a clean starter with a real tool layer, not a multi-tenant production server. Tenant isolation, rate limiting, structured logging, and the remote transport are the blueprint you build on top. I am telling you this because the whole point of the post is that toys pretend to be production. Mine does not.

The build order that matters

If you are taking a server from toy to production, the sequence is not arbitrary. Tenant isolation first, because a leak is unrecoverable and everything else is moot if data crosses tenants. Then error boundaries, so failures degrade instead of cascading. Then rate limiting, before the first runaway loop finds you. Then observability, so the next incident is debuggable. The remote transport and its auth come last, when you actually need the server to be a service rather than a local helper.

The one-line test

Ask of any MCP server: "if the model calls this tool 1,000 times in a loop, on behalf of the wrong tenant, and the third call throws, what happens?" A toy cannot answer. A production server answers in four boring, reassuring sentences.

What I Built

The MCP Agent Toolkit exposes three agent-infrastructure tools (a shared blackboard, SCAR failure memory, and an LLM response cache) over MCP, with per-tool error boundaries and node:sqlite persistence, covered by 13 tests. It is the honest starting point described above: the tool layer and the protocol done right, with the multi-tenant production hardening laid out as the next build. If you are learning MCP, clone it. If you are shipping MCP to customers, clone it and then build the remaining four pieces (tenant isolation, rate limiting, structured logging, and an authenticated remote transport) before you go live.