The OpenAI model lineup has moved fast in 2026. GPT-4o is no longer the flagship — the lineup is now GPT-5.5 at the top, GPT-5.4 as the production workhorse, and the o-series reasoning models with dramatically lower prices than a year ago. Here is the current picture and what to do with it.

1. The Model Lineup: June 2026

GPT-5.5
$5 / $30 per 1M tokens (input / output). Flagship. High-stakes extraction, demanding reasoning.
GPT-5.4
$2.50 / $15 per 1M tokens (input / output). Production workhorse. Best cost-quality trade-off for most workloads.
o3
$2 / 1M input. 87% cheaper than o1 with better performance. Multi-step reasoning.
o4-mini
$1.10 / 1M input. Cheapest reasoning. Classification with chain-of-thought.

Budget options: GPT-4.1 nano at $0.10/$0.40 per 1M tokens for the highest-volume simple tasks, and GPT-5.4 Nano at $0.20/$1.25 as an upgraded cheap option. At the other end of the range, GPT-5.5 Pro at $30/$180 exists for the highest-stakes reasoning tasks where cost is secondary.
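To compare these prices on a real workload, it helps to turn them into a per-request figure. The following is a minimal sketch that only encodes the prices quoted above. The lowercase keys are illustrative labels rather than confirmed API model IDs, and o3 and o4-mini are left out because only their input prices are listed here.

```python
# USD per 1M tokens as (input, output), copied from the lineup above.
# The keys are illustrative labels, not confirmed API model IDs.
PRICES_PER_MILLION = {
    "gpt-5.5-pro": (30.00, 180.00),
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.4-nano": (0.20, 1.25),
    "gpt-4.1-nano": (0.10, 0.40),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def monthly_cost(model: str, input_tokens: int, output_tokens: int, requests: int) -> float:
    """Estimate the USD cost of `requests` identical requests."""
    return request_cost(model, input_tokens, output_tokens) * requests


if __name__ == "__main__":
    # Same workload (2,000 input and 500 output tokens) across every tier
    for name in PRICES_PER_MILLION:
        print(f"{name:>13}: ${monthly_cost(name, 2000, 500, 100_000):,.2f} per 100k requests")
```

Running the same token counts through every tier makes the spread between the flagship and the nano models concrete before any quality testing begins.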

2. The Real Decision Matrix

Model selection in June 2026

High-stakes reasoning, complex multi-step? → GPT-5.5 or GPT-5.5 Pro
Demanding extraction, messy documents? → GPT-5.5 or GPT-5.5 Pro
Standard production workload, agents? → GPT-5.4
Reasoning with budget constraint? → o3 ($2/M, 87% cheaper than o1)
Cheap reasoning for routing / triage? → o4-mini ($1.10/M)
Simple classification, bulk labeling? → GPT-5.4 Nano or GPT-4.1 nano
Highest volume, simplest tasks? → GPT-4.1 nano ($0.10/M)

The mental model that OpenAI now recommends: use GPT-5.5 or GPT-5.5 Pro for demanding extraction or messy documents, and use smaller GPT-5.x models for simple classification, routing, and normalized field extraction to reduce latency and cost. The right model is the cheapest one that produces correct output for your specific task.
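One way to make "the cheapest model that produces correct output" operational is to walk the tiers from cheapest to most expensive against a small labeled evaluation set. This is a sketch under stated assumptions: `run` is a hypothetical callable that wraps your own API client, exact string match stands in for whatever correctness check fits your task, and the 95% bar is arbitrary.

```python
from typing import Callable, Optional, Sequence, Tuple


def cheapest_passing_model(
    models_by_cost: Sequence[str],
    run: Callable[[str, str], str],
    eval_set: Sequence[Tuple[str, str]],
    min_accuracy: float = 0.95,
) -> Optional[str]:
    """Return the first model, in cheapest-first order, that meets the accuracy bar.

    `run(model, prompt)` is a hypothetical wrapper around your API client.
    `eval_set` holds (prompt, expected_output) pairs.
    """
    if not eval_set:
        raise ValueError("eval_set must contain at least one example")
    for model in models_by_cost:
        correct = sum(1 for prompt, expected in eval_set if run(model, prompt) == expected)
        if correct / len(eval_set) >= min_accuracy:
            return model
    return None  # no tier passed: revisit the prompt or the schema first
```

Returning `None` instead of silently falling back to the largest model keeps the failure visible, which matters when the real problem is the prompt rather than the model.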

3. Schema-First Structured Outputs

The paradigm shift in 2026 is schema-first development: define your Pydantic or Zod schema first, then write the prompt around it. Not the other way around. This forces you to think about the output contract before writing the system prompt, which improves prompt quality and eliminates schema drift.

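Structured Outputs enforce schema compliance at the model level: with strict mode on, the response is constrained to your schema. That removes parsing failures from hallucinated fields, extra keys, or missing required fields. Two failure cases remain, refusals and output cut off by the token limit, and both need explicit handling.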

Python: schema-first structured outputs

```python
from typing import List, Literal, Optional

from openai import OpenAI
from pydantic import BaseModel


# Step 1: Define the schema (before the prompt)
class BugReport(BaseModel):
    severity: Literal["critical", "high", "medium", "low"]
    file: str
    line: int
    description: str
    suggested_fix: Optional[str] = None  # Optional fields need = None


class AuditResult(BaseModel):
    passed: bool
    issues: List[BugReport]
    summary: str


# Placeholder input; in practice `code` holds the source text to audit
code = "def divide(a, b):\n    return a / b"

# Step 2: Build the prompt around the schema
client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": "You are a code auditor. Return findings matching AuditResult schema."},
        {"role": "user", "content": f"Audit this code:\n\n{code}"},
    ],
    response_format=AuditResult,
)

result: AuditResult = response.choices[0].message.parsed
# result.issues is a typed list of BugReport objects; no parsing needed
```

JSON Mode is Legacy

Use strict: true (Structured Outputs) exclusively. JSON Mode only asks the model to try to produce valid JSON; it does not enforce your schema. In 2026, treating JSON Mode and Structured Outputs as interchangeable is a reliability bug waiting to happen.
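At the request level, the two modes differ in the `response_format` payload. The sketch below builds the strict variant from a raw JSON Schema and checks the top-level constraints that strict mode imposes (an object root, `additionalProperties: false`, every property listed in `required`). The helper name and the sample schema are illustrative, and the check does not recurse into nested objects.

```python
# JSON Mode, for contrast: asks for valid JSON, carries no schema.
JSON_MODE = {"type": "json_object"}


def strict_response_format(name: str, schema: dict) -> dict:
    """Wrap a JSON Schema in the Structured Outputs request shape.

    Illustrative helper: validates only the top level of the schema.
    """
    if schema.get("type") != "object":
        raise ValueError("top-level schema must be an object")
    if schema.get("additionalProperties") is not False:
        raise ValueError("strict mode requires additionalProperties: false")
    missing = set(schema.get("properties", {})) - set(schema.get("required", []))
    if missing:
        raise ValueError(f"strict mode requires every property in 'required': {sorted(missing)}")
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


# Hypothetical schema for a one-field classification result
SEVERITY_SCHEMA = {
    "type": "object",
    "properties": {"severity": {"type": "string", "enum": ["critical", "high", "medium", "low"]}},
    "required": ["severity"],
    "additionalProperties": False,
}

STRICT_FORMAT = strict_response_format("severity_result", SEVERITY_SCHEMA)
```

Failing fast on a schema that strict mode would reject keeps the error on your side of the API call, where it is cheaper to debug.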

4. The 16k Output Token Limit

Structured Outputs now support up to 16k output tokens for complex extractions. This matters for tasks like extracting all issues from a large codebase audit, pulling structured data from long documents, or generating detailed JSON reports. Set the output-token cap (max_completion_tokens, which replaces the deprecated max_tokens parameter in the Chat Completions API) high enough to fit the full response: JSON truncated mid-object still counts as a failure.

Python: handle schema refusals

```python
# `client` and `AuditResult` come from the earlier example;
# `messages` is the same system + user message list.
response = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=messages,
    response_format=AuditResult,
    max_completion_tokens=16000,  # allow full output for complex schemas
)

message = response.choices[0].message

# Handle refusals as first-class errors (not exceptions)
if message.refusal:
    print(f"Model refused: {message.refusal}")
else:
    result = message.parsed
```
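Truncation deserves the same explicit handling as refusals. The sketch below works on plain values (the `finish_reason`, `content`, and `refusal` fields of a chat completion choice) rather than SDK objects, so the function name and the status labels are illustrative. Recent versions of the Python SDK's parse helper may raise an exception on a length-limited finish instead of returning a partial object, so check the behavior of your installed version.

```python
import json
from typing import Any, Optional, Tuple


def interpret_completion(
    finish_reason: str, content: Optional[str], refusal: Optional[str]
) -> Tuple[str, Any]:
    """Classify one completion as ok, refused, truncated, or invalid."""
    if refusal:
        return ("refused", refusal)
    if finish_reason == "length":
        # Output hit the token cap; the JSON is almost certainly cut mid-object
        return ("truncated", None)
    try:
        return ("ok", json.loads(content or ""))
    except json.JSONDecodeError:
        return ("invalid", None)
```

A "truncated" result is a signal to raise the output cap or split the extraction into smaller requests, not to retry the same call unchanged.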

5. Function Calling vs Structured Outputs

Both produce structured data, but they serve different purposes. Know which to use.

In a multi-agent pipeline: use function calling for agents that call tools (search, code execution, file access). Use structured outputs for agents that produce final artifacts (reports, audit results, classifications).
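On the function-calling side, the model returns a tool name plus a JSON string of arguments, and your code is responsible for running the tool. A minimal sketch of that dispatch step follows, assuming a single hypothetical `search_docs` tool. The tool spec uses the Chat Completions `tools` shape, and the registry and helper names are illustrative.

```python
import json
from typing import Any, Callable, Dict


def search_docs(query: str) -> str:
    """Hypothetical tool; a real one would query a search index."""
    return f"results for {query!r}"


# Tool definition in the shape passed via the `tools` request parameter
SEARCH_TOOL_SPEC = {
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search internal documentation.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
            "additionalProperties": False,
        },
        "strict": True,
    },
}

# Registry mapping tool names to local Python callables
TOOLS: Dict[str, Callable[..., Any]] = {"search_docs": search_docs}


def dispatch_tool_call(name: str, arguments_json: str) -> Any:
    """Run the local function for one tool call returned by the model."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**json.loads(arguments_json))
```

An agent that produces a final artifact needs none of this machinery: it sets a response schema and reads back the parsed object.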

6. The Batch API: 50% Off Async Work

Any work that does not need real-time results should go through the Batch API. It processes within 24 hours at 50% of the standard price. Thousands of classification jobs, documentation generation, data extraction at scale: all at half cost.

Python: batch API with current models

```python
import json

from openai import OpenAI

client = OpenAI()

# `documents` (a list of strings) and `classification_schema` (a JSON Schema
# dict) are assumed to be defined elsewhere.
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5.4-nano",  # cheap model for bulk work
            "messages": [
                {"role": "user", "content": f"Classify this document: {doc}"}
            ],
            # Structured Outputs nests name, strict, and schema under "json_schema"
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "classification",
                    "strict": True,
                    "schema": classification_schema,
                },
            },
            "max_completion_tokens": 128,
        },
    })

with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

# Upload inside a context manager so the file handle is closed afterwards
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")
```
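Submitting the batch is only half the job. Results come back as a JSONL file whose lines are not guaranteed to follow input order, so each one has to be matched to its request through `custom_id`. The following is a sketch of that step, assuming the output text has already been downloaded (for example by retrieving the batch and reading the file referenced by its `output_file_id`). The function name is illustrative.

```python
import json
from typing import Any, Dict, Tuple


def index_batch_results(output_jsonl: str) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Split batch output lines into successes and failures, keyed by custom_id."""
    succeeded: Dict[str, str] = {}
    failed: Dict[str, Any] = {}
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failed[custom_id] = record.get("error") or response
            continue
        # Chat Completions body: the model output is the first choice's content
        succeeded[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return succeeded, failed
```

Keeping failures in their own dictionary makes it straightforward to build a smaller follow-up batch from just the `custom_id` values that need a retry.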

7. o3 in a Multi-Model Pipeline

o3 is not a general-purpose model — it is a reasoning model. The practical pattern from OpenAI's own docs: use GPT-5.4 to triage order details and identify issues, then feed structured data into o3 to make the final decision. o3 thinks longer and produces more reliable answers on ambiguous, multi-step decisions. It is overkill for simple extraction.

Python: model tiering with o3

```python
# `client` is an OpenAI() instance; `ExtractedData` (a Pydantic model) and
# `raw_document` (a string) are assumed to be defined elsewhere.

# Step 1: Extract structured data cheaply
extraction = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=[{"role": "user", "content": f"Extract key fields from: {raw_document}"}],
    response_format=ExtractedData,
)
data = extraction.choices[0].message.parsed

# Step 2: Use o3 for the actual decision (feeds on structured input)
decision = client.chat.completions.create(
    model="o3",
    messages=[
        {"role": "system", "content": "Given structured data, make a final approval decision."},
        {"role": "user", "content": data.model_dump_json()},
    ],
)
```

8. Rate Limits: TPM Is Often the Real Bottleneck

OpenAI rate limits run on two axes: requests per minute (RPM) and tokens per minute (TPM). In agent pipelines with large prompts, TPM is usually the actual bottleneck, not RPM. Track both. Use exponential backoff with jitter on 429s, and implement a sliding window counter to pre-rate-limit yourself before hitting the API limit.

Python: backoff with jitter

```python
import random
import time

import openai


def call_with_retry(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
```
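Backoff handles the 429 after it happens. The sliding window counter mentioned above avoids most of them in the first place. Here is a minimal single-threaded sketch, assuming you can estimate a request's tokens up front (prompt tokens plus the output cap is a conservative choice). The class name is illustrative, and the injectable clock exists only to make the logic easy to test.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SlidingWindowTokenBudget:
    """Tracks tokens spent in the trailing window and refuses requests over budget."""

    def __init__(
        self,
        tokens_per_minute: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = tokens_per_minute
        self.window = window_seconds
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self.used = 0

    def _evict(self, now: float) -> None:
        # Drop spends that have aged out of the window
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_acquire(self, tokens: int) -> bool:
        """Reserve `tokens` if they fit in the current window."""
        now = self.clock()
        self._evict(now)
        if self.used + tokens > self.limit:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True

    def seconds_until_available(self, tokens: int) -> float:
        """Return how long to wait before `tokens` would fit."""
        if tokens > self.limit:
            raise ValueError("request is larger than the whole per-window budget")
        now = self.clock()
        self._evict(now)
        remaining = self.used
        if remaining + tokens <= self.limit:
            return 0.0
        for timestamp, spent in self.events:
            remaining -= spent
            if remaining + tokens <= self.limit:
                return max(0.0, timestamp + self.window - now)
        return 0.0
```

In practice the budget would be set somewhat below the account's actual TPM limit, and a caller would sleep for `seconds_until_available(...)` before sending, keeping backoff as the fallback for the cases the local estimate misses.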
"Use the cheapest model that produces correct output. Then optimize the prompt. Then consider fine-tuning. In that order."

Key Takeaway

GPT-5.4 is the production default in June 2026, not GPT-4o. o3 dropped 87% in price versus o1 and is now viable for complex reasoning tasks. JSON Mode is legacy: use strict Structured Outputs. And schema-first development means defining the Pydantic model before writing the prompt.