// AI Dev Notes

Building with AI in 2026

Practical notes from 18 months shipping multi-agent systems. Claude tricks, OpenAI patterns, image models, and the fundamentals nobody explains clearly.

The Token-Mix Fallacy: Why Processing More Tokens Can Cut Your LLM Bill

Reducing input tokens is the weakest lever on the board. Output tokens cost 3-5x more, cached input is up to 90% off, and a structured, reranked retrieval payload lowers the total bill even when it sends more input. The math, worked, with the real RRF code.
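
For readers who have not met RRF (reciprocal rank fusion) before, here is a minimal sketch of the general technique, not the code from the post. The document ids, the two input rankings, and the conventional constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from a lexical and a semantic retriever.
lexical = ["doc_a", "doc_b", "doc_c"]
semantic = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([lexical, semantic])
```

Documents that rank well in both lists rise to the top, which is why a fused, reranked payload can be both larger and more useful than a single retriever's output.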

Read post →

Why I Banned Probabilistic Control Flow in My Agent Systems

Letting an LLM decide what to do next wins on Twitter and fails in production: infinite loops, latency spikes, exploded token budgets. Control flow belongs in deterministic code. The model proposes; a DAG state machine with Kahn's topological sort disposes. With the real orchestration code.
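
A rough sketch of Kahn's algorithm as it might order the steps of an agent plan, assuming the plan is given as a mapping from each step to the steps it depends on. The step names are hypothetical, and this is not the orchestration code from the post.

```python
from collections import deque

def topological_order(deps):
    """Return the steps in an order that respects their dependencies.

    deps maps each step to the set of steps it depends on. Every
    dependency must itself be a key of deps. Raises ValueError on a cycle.
    """
    indegree = {step: len(parents) for step, parents in deps.items()}
    children = {step: [] for step in deps}
    for step, parents in deps.items():
        for parent in parents:
            children[parent].append(step)

    # Start with every step that has no unmet dependencies.
    ready = deque(sorted(step for step, n in indegree.items() if n == 0))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for child in sorted(children[step]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("cycle detected: the plan is not a DAG")
    return order

plan = {
    "fetch": set(),
    "summarize": {"fetch"},
    "review": {"summarize"},
    "publish": {"review", "summarize"},
}
order = topological_order(plan)
```

The cycle check is the part that matters for the argument: a plan the model proposes either sorts into a finite order or is rejected before anything runs.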

Read post →

Stop Building Toy MCP Servers: A Blueprint for Production Integrations

90% of MCP examples are forty-line toy scripts. Pointing one at a company's real data layer needs tenant isolation, token-bucket rate limiting, structured logging, per-tool error boundaries, and dual transports. The blueprint, plus an honest account of what my toolkit does and doesn't do.
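
To make one item on that list concrete, here is a minimal token-bucket sketch. The class name, parameters, and injectable clock are assumptions for illustration and are not taken from the toolkit the post describes.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_sec` tokens/second."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

One way to combine this with tenant isolation would be to keep a separate bucket per tenant id, so one noisy tenant cannot drain another's budget.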

Read post →

The Vibe Coding Hangover: Why AI Prototypes Don't Survive Enterprise SLAs

AI collapses the cost of producing code without collapsing the cost of owning it, and production is where ownership comes due. The antidote isn't abstinence, it's the discipline AI skips: circuit breakers, timeout ceilings, bounded loops, and guardrails. Named correctly, with real code.
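
As an illustration of the first item, a minimal circuit breaker might look like the sketch below. The thresholds, names, and half-open behavior are assumptions for the example, not the post's code.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is refused."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The breaker fails fast while open, so a dead dependency costs one exception instead of a full timeout on every request.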

Read post →

Zero-Cloud AI: Hardware-Encrypted Local RAG on Mobile

For regulated data, cloud RAG isn't slow or expensive, it's illegal. A privacy-first mobile architecture: SQLCipher AES-256 at rest with the key in the secure enclave, on-device FTS5 lexical search, and quantized local embeddings for semantic retrieval that never leaves the device.

Read post →

I'm Tired of AI Hype Too. Here's the Boring Infrastructure That Actually Ships.

Demos lie, agents drift, providers go down. No hype — just the unglamorous systems that make AI agents reliable in production: evals, failure memory, drift guardrails, provider fallbacks, and real context engineering.

Read post →

Context Engineering in 2026: Code Graphs, Blast Radius, and Token Budgets

Context engineering, minus the buzzword: feed the model the 200 tokens that matter, not 8,000 of noise. Build a dependency graph, score blast radius, and inject only the relevant slice into the prompt.
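
One simple reading of "blast radius" is the set of modules that transitively depend on a changed module. The sketch below assumes that definition and a hypothetical import map; the post may score it differently.

```python
from collections import deque

def blast_radius(imports, changed):
    """Return every module that directly or transitively imports `changed`.

    imports maps each module name to the set of modules it imports.
    """
    # Invert the graph: for each module, who depends on it?
    dependents = {}
    for module, deps in imports.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent != changed and parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected

# Hypothetical project layout.
imports = {
    "api": {"auth", "db"},
    "auth": {"db"},
    "db": set(),
    "cli": {"api"},
    "docs": set(),
}
affected = blast_radius(imports, "db")
```

Only the modules in the returned set, plus the changed module itself, would need to go into the prompt; everything else is noise for that change.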

Read post →

Giving AI Agents Memory: Failure Memory (SCARs) vs. Solution Recall

Everyone stores successes. The bigger win is storing failures so the agent stops repeating them. How failure memory (SCARs) and solution recall work, with the real blocks each one prepends to the prompt.

Read post →

LLM Fallback Chains: Building an AI Gateway with Circuit Breakers

Your LLM provider will go down. A multi-provider fallback chain with session circuit breakers, budget-aware routing, token trimming, and injection guardrails — the reliability layer no demo ever shows you.
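
Stripped of breakers, budgets, and trimming, the core of a fallback chain is small. This sketch assumes each provider is a plain callable that takes a prompt and either returns text or raises; the provider names are hypothetical.

```python
def complete_with_fallback(providers, prompt):
    """Try each (name, call) pair in order and return the first success.

    Returns (provider_name, response). Raises RuntimeError listing every
    failure if no provider succeeds.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider error moves on to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the response makes it easy to log which link in the chain actually answered.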

Read post →

AI Agent Guardrails: Detecting Drift Before It Ships

AI agents drift — a prompt edited over weeks quietly stops following its own rules. A versioned constitution plus a drift detector catches it before your users do. Born from a real 2am production incident.

Read post →

How to Actually Evaluate AI Agents (Evals That Catch Regressions)

Most AI demos are cherry-picked. Evals are how you tell a robust system from a lucky screenshot — run as regression tests: a fixed set, LLM-as-judge, RAGAS metrics, and a CI threshold that blocks bad changes.

Read post →

I Built the Chatbot You're Talking To: Grounded RAG on Serverless

How the assistant in the corner of this site works: a prebuilt static index (no vector DB), in-memory cosine retrieval, an anti-hallucination contract, a free-provider LLM failover chain, and per-IP rate limiting that fails open. The demo is the product.
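
In-memory cosine retrieval needs no vector database at all. The sketch below assumes a prebuilt index held as a plain dict of chunk id to embedding; the ids and three-dimensional vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical prebuilt index, loaded once at startup.
index = {
    "pricing": [1.0, 0.0, 0.0],
    "setup": [0.0, 1.0, 0.0],
    "billing": [0.9, 0.1, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], index, k=2)
```

A linear scan like this is plausible for a site-sized index of a few hundred chunks; it would not be the right choice for millions of vectors.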

Read post →

Claude Tricks Every AI Developer Should Know in 2026

Prompt caching, extended thinking, tool use with streaming, context window management, and the subtle patterns that separate production-grade Claude integrations from demos.

Read post →

OpenAI in 2026: o3, GPT-4o, and Structured Outputs That Actually Work

When to use o3 vs GPT-4o vs GPT-4o-mini. Structured outputs with JSON schema enforcement. Batch API for 50% cost reduction. Vision, function calling, and streaming patterns that hold up in production.

Read post →

Image Models in 2026: FLUX, Ideogram, Midjourney v7 and When to Use Each

A practical comparison of every major image model available right now. FLUX for control, Midjourney for artistry, Ideogram for text, Imagen 4 for photorealism. With prompting techniques for each.

Read post →

The Concepts Behind Multi-Agent Systems: What This Portfolio Actually Uses

Blackboard pattern, DAG schedulers, circuit breakers, LLM routing with 9 providers, SHA-256 response caching, session state, and why each one exists. Explained through real production code.
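
As one example from that list, SHA-256 response caching can be sketched as hashing a canonical form of the request and storing the response under that key. The request shape and class name are assumptions, not the portfolio's actual code.

```python
import hashlib
import json

def cache_key(model, messages, **params):
    """Hash a canonical JSON form of the request so equal requests share a key."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ResponseCache:
    def __init__(self):
        self._store = {}

    def get_or_call(self, model, messages, call, **params):
        """Return the cached response, or invoke `call` once and remember it."""
        key = cache_key(model, messages, **params)
        if key not in self._store:
            self._store[key] = call(model, messages, **params)
        return self._store[key]
```

Sorting keys before hashing matters: two requests with the same parameters in a different order should hit the same cache entry.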

Read post →

AI Dev Tips & Tricks 2026: From Vibe Coding to Production-Ready Agents

18 months of hard-won patterns. Eval-driven development, async everything, cost tracking, rate limit management, local LLM fallbacks, prompt versioning, and the anti-patterns that will burn you.

Read post →

Build Your First MCP Server in 2026 (Node.js Tutorial)

MCP now comes up in AI engineer interview screens. A practical walkthrough: tools, SQLite persistence via node:sqlite, stdio transport, and wiring into Claude Desktop. With working code.

Read post →

RAG in Production: What Actually Works in 2026

Naive RAG is dead. 80% of failures start in chunking. Hybrid search + reranking is the production standard. Qdrant vs Pinecone tradeoffs. Anti-hallucination contracts. RAGAS for evals.

Read post →

How I Build: A Vibe Coder's Workflow in 2026

I can't pass a LeetCode hard. I ship production-grade AI systems. Here's exactly how: the tools, the mental models, the limits, and why "understanding systems by experience" turns out to be enough.

Read post →