The OpenAI model lineup has moved fast in 2026. GPT-4o is no longer the flagship — the lineup is now GPT-5.5 at the top, GPT-5.4 as the production workhorse, and the o-series reasoning models with dramatically lower prices than a year ago. Here is the current picture and what to do with it.
1. The Model Lineup: June 2026
Budget options: GPT-4.1 nano at $0.10/$0.40 per 1M tokens (input/output) for the highest-volume simple tasks. GPT-5.4 Nano at $0.20/$1.25 as an upgraded cheap option. At the other end of the range, GPT-5.5 Pro at $30/$180 exists for the highest-stakes reasoning tasks where cost is secondary.
2. The Real Decision Matrix
High-stakes reasoning, complex multi-step? → GPT-5.5 or GPT-5.5 Pro
Demanding extraction, messy documents? → GPT-5.5 or GPT-5.5 Pro
Standard production workload, agents? → GPT-5.4
Reasoning with budget constraint? → o3 ($2 per 1M input tokens, 87% cheaper than o1)
Cheap reasoning for routing / triage? → o4-mini ($1.10 per 1M input tokens)
Simple classification, bulk labeling? → GPT-5.4 Nano or GPT-4.1 nano
Highest volume, simplest tasks? → GPT-4.1 nano ($0.10 per 1M input tokens)
The mental model that OpenAI now recommends: use GPT-5.5 or GPT-5.5 Pro for demanding extraction or messy documents. Use smaller GPT-5.x models for simple classification, routing, and normalized field extraction to reduce latency and cost. The right model is the cheapest one that produces correct output for your specific task.
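The matrix above can be encoded as a small lookup so that model choice lives in one place. This is a minimal sketch: the task labels are hypothetical, and the model ID strings are assumptions based on the names used in this article, so check them against the model list for your account.

```python
# Hypothetical task labels mapped to the models named in the decision matrix.
# The model ID strings are assumptions, not verified identifiers.
MODEL_BY_TASK = {
    "high_stakes_reasoning": "gpt-5.5",
    "messy_extraction": "gpt-5.5",
    "standard_production": "gpt-5.4",
    "budget_reasoning": "o3",
    "routing_triage": "o4-mini",
    "simple_classification": "gpt-5.4-nano",
    "bulk_simple": "gpt-4.1-nano",
}


def pick_model(task: str, default: str = "gpt-5.4") -> str:
    """Return the model for a task label, falling back to the workhorse model."""
    return MODEL_BY_TASK.get(task, default)


print(pick_model("routing_triage"))
print(pick_model("unknown_task"))
```

Keeping the mapping in data rather than scattered through call sites makes it easy to downgrade a task to a cheaper model once you have evidence that the cheaper model is still correct.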
3. Schema-First Structured Outputs
The paradigm shift in 2026 is schema-first development: define your Pydantic or Zod schema first, then write the prompt around it. Not the other way around. This forces you to think about the output contract before writing the system prompt, which improves prompt quality and eliminates schema drift.
Structured Outputs enforce schema compliance at the model level: output is constrained to your schema as it is generated, so any response that completes without a refusal or truncation matches it. No more parsing failures from hallucinated fields, extra keys, or missing required fields.
from openai import OpenAI
from pydantic import BaseModel
from typing import List, Literal, Optional

# Step 1: Define the schema (before the prompt)
class BugReport(BaseModel):
    severity: Literal["critical", "high", "medium", "low"]
    file: str
    line: int
    description: str
    suggested_fix: Optional[str] = None  # nullable: the model returns null when it has no fix


class AuditResult(BaseModel):
    passed: bool
    issues: List[BugReport]
    summary: str


# Step 2: Build the prompt around the schema
# `code` is assumed to hold the source text to audit.
client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": "You are a code auditor. Return findings matching AuditResult schema."},
        {"role": "user", "content": f"Audit this code:\n\n{code}"}
    ],
    response_format=AuditResult,
)
result: AuditResult = response.choices[0].message.parsed
# result.issues is a typed list of BugReport objects; no parsing needed
4. The 16k Output Token Limit
Structured Outputs now support up to 16k output tokens for complex extractions. This matters for tasks like extracting all issues from a large codebase audit, pulling structured data from long documents, or generating detailed JSON reports. Set max_tokens high enough to fit the full response — a truncated JSON mid-object still counts as a failure.
response = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=messages,
    response_format=AuditResult,
    max_tokens=16000  # allow full output for complex schemas
)
message = response.choices[0].message

# Handle refusals as first-class errors (not exceptions)
if message.refusal:
    print(f"Model refused: {message.refusal}")
else:
    result = message.parsed
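Refusal is one failure mode; truncation is the other. The sketch below labels a completion from three plain values rather than SDK objects, so it can be unit-tested without network access. It assumes the Chat Completions convention that `finish_reason` is `"length"` when output hit the token limit; the function name and the label strings are hypothetical.

```python
import json


def classify_outcome(finish_reason, refusal, content):
    """Label one completion as 'refused', 'truncated', 'invalid', or 'ok'."""
    if refusal:
        return "refused"
    if finish_reason == "length":
        # The token limit cut generation short, so the JSON is incomplete.
        return "truncated"
    try:
        json.loads(content or "")
    except json.JSONDecodeError:
        return "invalid"
    return "ok"


print(classify_outcome("stop", None, '{"passed": true}'))
print(classify_outcome("length", None, '{"passed": tr'))
```

A "truncated" result is usually a signal to raise the output token limit or to split the input, not to retry the same request unchanged.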
5. Function Calling vs Structured Outputs
Both produce structured data, but serve different purposes. Know which to use:
- Function calling: The model decides when and whether to call a function during conversation. It returns a function name and arguments. You execute and return the result. Use this for agentic tool use where the model drives the workflow.
- Structured outputs via response_format: You want the final response in a guaranteed schema. The model always produces the schema. No decision-making. Use this for typed responses at the end of a task.
In a multi-agent pipeline: use function calling for agents that call tools (search, code execution, file access). Use structured outputs for agents that produce final artifacts (reports, audit results, classifications).
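On the function-calling side, your code is responsible for executing whatever the model asks for and handing the result back. A minimal sketch of that dispatch step follows, assuming the Chat Completions tool-call shape (an `id`, a function `name`, and `arguments` delivered as a JSON string). The `search_docs` tool and the `TOOLS` registry are hypothetical stand-ins.

```python
import json


# Hypothetical local tool; a real one would query a search backend.
def search_docs(query: str) -> str:
    return f"results for {query!r}"


TOOLS = {"search_docs": search_docs}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build the message that returns its result."""
    fn = TOOLS[tool_call["function"]["name"]]
    # Arguments arrive as a JSON string, not as a dict.
    args = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": fn(**args),
    }


example_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "search_docs", "arguments": '{"query": "rate limits"}'},
}
tool_message = run_tool_call(example_call)
print(tool_message)
```

The returned message is appended to the conversation and the model is called again; a structured-output agent has no such loop, which is the practical difference between the two.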
6. The Batch API: 50% Off Async Work
Any work that does not need real-time results should go through the Batch API. It processes within 24 hours at 50% of the standard price. Thousands of classification jobs, documentation generation, data extraction at scale: all at half cost.
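To see what the discount means for a concrete job, here is a back-of-envelope calculation. It uses the GPT-5.4 Nano prices quoted earlier ($0.20 input, $1.25 output per 1M tokens); the workload figures (100,000 documents, 500 input and 60 output tokens each) are illustrative assumptions.

```python
def job_cost(n_docs, in_tokens, out_tokens, in_price, out_price, batch=False):
    """Dollar cost of a job; prices are per 1M tokens."""
    total_in = n_docs * in_tokens
    total_out = n_docs * out_tokens
    cost = (total_in * in_price + total_out * out_price) / 1_000_000
    # The Batch API is billed at half the standard rate.
    return cost * 0.5 if batch else cost


standard = job_cost(100_000, 500, 60, 0.20, 1.25)
batched = job_cost(100_000, 500, 60, 0.20, 1.25, batch=True)
print(f"standard: ${standard:.2f}, batch: ${batched:.2f}")
```

Under these assumptions the job costs about $17.50 at the standard rate and about $8.75 through the Batch API.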
import json
from openai import OpenAI

client = OpenAI()

# `documents` (a list of strings) and `classification_schema` (a JSON Schema dict)
# are assumed to be defined earlier.
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5.4-nano",  # cheap model for bulk work
            "messages": [
                {"role": "user", "content": f"Classify this document: {doc}"}
            ],
            # The schema sits inside a named "json_schema" wrapper
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "classification",
                    "strict": True,
                    "schema": classification_schema,
                },
            },
            "max_tokens": 128
        }
    })

# Write one JSON request per line
with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

# Upload the file, then create the batch against it
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
print(f"Batch ID: {batch.id}")
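Once the batch finishes, results come back as a JSONL file, and lines should be matched to inputs by `custom_id` rather than by position. The sketch below parses such a file from a string. The record shape (`custom_id`, `error`, and a `response` holding `status_code` and `body`) reflects the documented batch output format as I understand it; treat it as an assumption and verify it against a real output file. The `parse_batch_output` name is hypothetical.

```python
import json


def parse_batch_output(jsonl_text: str) -> dict:
    """Map each custom_id to its message content, or to None on failure."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        body = response["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results


# Two illustrative records: one success, one failure.
sample = "\n".join([
    json.dumps({"custom_id": "doc-0", "error": None,
                "response": {"status_code": 200, "body": {
                    "choices": [{"message": {"content": '{"label": "invoice"}'}}]}}}),
    json.dumps({"custom_id": "doc-1", "response": None,
                "error": {"code": "server_error", "message": "failed"}}),
])
parsed = parse_batch_output(sample)
print(parsed)
```

Entries that map to `None` are the ones to collect into a follow-up batch.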
7. o3 in a Multi-Model Pipeline
o3 is not a general-purpose model — it is a reasoning model. The practical pattern from OpenAI's own docs: use GPT-5.4 to triage order details and identify issues, then feed structured data into o3 to make the final decision. o3 thinks longer and produces more reliable answers on ambiguous, multi-step decisions. It is overkill for simple extraction.
# `ExtractedData` (a Pydantic model) and `raw_document` are assumed to be defined earlier.

# Step 1: Extract structured data cheaply
extraction = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=[{"role": "user", "content": f"Extract key fields from: {raw_document}"}],
    response_format=ExtractedData,
)
data = extraction.choices[0].message.parsed

# Step 2: Use o3 for the actual decision (feeds on structured input)
decision = client.chat.completions.create(
    model="o3",
    messages=[
        {"role": "system", "content": "Given structured data, make a final approval decision."},
        {"role": "user", "content": data.model_dump_json()}
    ]
)
8. Rate Limits: TPM Is Often the Real Bottleneck
OpenAI rate limits run on two axes: requests per minute (RPM) and tokens per minute (TPM). In agent pipelines with large prompts, TPM is usually the actual bottleneck, not RPM. Track both. Use exponential backoff with jitter on 429s, and implement a sliding window counter to pre-rate-limit yourself before hitting the API limit.
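import random
import time

import openai


def call_with_retry(fn, max_retries=5):
    """Call fn(), retrying on rate-limit errors with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # 1s, 2s, 4s, ... plus up to 1s of random jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)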
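The sliding window counter mentioned above can be sketched with a deque of timestamped token counts. This is one possible client-side design, not an OpenAI-provided utility: the `TokenWindow` class, its parameters, and the injectable `clock` (there to make it testable) are all hypothetical, and the token count passed in is your own estimate of prompt plus expected output.

```python
import time
from collections import deque


class TokenWindow:
    """Tracks tokens spent in the last `window` seconds against a TPM budget."""

    def __init__(self, tpm_limit, window=60.0, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.window = window
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) pairs, oldest first
        self.used = 0

    def _evict(self, now):
        # Drop entries that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_acquire(self, tokens):
        """Reserve tokens if they fit in the current window; return success."""
        now = self.clock()
        self._evict(now)
        if self.used + tokens > self.tpm_limit:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True


limiter = TokenWindow(tpm_limit=1000)
print(limiter.try_acquire(600))
print(limiter.try_acquire(500))
```

When `try_acquire` returns `False`, the caller waits briefly and tries again instead of sending the request and absorbing a 429. Set `tpm_limit` somewhat below your actual quota to leave room for estimation error.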