OpenAI in 2026: GPT-5.5, o3, and Structured Outputs That Actually Work

The OpenAI model lineup has moved fast in 2026. GPT-4o is no longer the flagship: GPT-5.5 sits at the top, GPT-5.4 is the production workhorse, and the o-series reasoning models cost far less than a year ago (o3 is 87% cheaper than o1). Here is the current picture and what to do with it.

1. The Model Lineup: June 2026

GPT-5.5

$5 / $30 per 1M tokens. Flagship. High-stakes extraction, demanding reasoning.

GPT-5.4

$2.50 / $15 per 1M. Production workhorse. Best cost-quality for most workloads.

o3

$2 / 1M input. 87% cheaper than o1 with better performance. Multi-step reasoning.

o4-mini

$1.10 / 1M input. Cheapest reasoning. Classification with chain-of-thought.

Budget options: GPT-4.1 nano at $0.10/$0.40 per 1M for highest-volume simple tasks. GPT-5.4 Nano at $0.20/$1.25 as an upgraded cheap option. GPT-5.5 Pro at $30/$180 exists for highest-stakes reasoning tasks where cost is secondary.
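To make the price gaps concrete, here is a minimal sketch of a per-request cost estimate built from the list prices quoted above. The `PRICES` table and the `estimate_cost` helper are illustrative names, not part of any SDK, and the `batch` flag assumes the 50% Batch API discount described in section 6. o3 and o4-mini are left out because only their input prices are given here.

```python
# Illustrative price table in USD per 1M tokens (input, output),
# copied from the lineup above. Keys are display names used in this
# article, not verified API model IDs.
PRICES = {
    "GPT-5.5 Pro": (30.00, 180.00),
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.4 Nano": (0.20, 1.25),
    "GPT-4.1 nano": (0.10, 0.40),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  batch: bool = False) -> float:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    input_price, output_price = PRICES[model]
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    # Assumption: Batch API work is billed at half price (see section 6).
    return cost * 0.5 if batch else cost


if __name__ == "__main__":
    # Compare every model on the same request size: 10k in, 1k out.
    for name in PRICES:
        print(f"{name:13s} ${estimate_cost(name, 10_000, 1_000):.4f}")
```

At 10k input and 1k output tokens, this works out to roughly $0.08 per request on GPT-5.5 and $0.04 on GPT-5.4, which is the kind of gap that adds up quickly at production volume.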

2. The Real Decision Matrix

            Model selection in June 2026
            
High-stakes reasoning, complex multi-step?   → GPT-5.5 or GPT-5.5 Pro
Demanding extraction, messy documents?       → GPT-5.5 or GPT-5.5 Pro
Standard production workload, agents?        → GPT-5.4
Reasoning with budget constraint?            → o3 ($2/M, 87% cheaper than o1)
Cheap reasoning for routing / triage?        → o4-mini ($1.10/M)
Simple classification, bulk labeling?        → GPT-5.4 Nano or GPT-4.1 nano
Highest volume, simplest tasks?              → GPT-4.1 nano ($0.10/M)
            
        

The mental model that OpenAI now recommends: use GPT-5.5 or GPT-5.5 Pro for demanding extraction or messy documents, and smaller GPT-5.x models for simple classification, routing, and normalized field extraction, where they cut latency and cost. The right model is the cheapest one that produces correct output for your specific task.
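One way to encode the matrix is a plain lookup table from task type to model, so the choice lives in one place and can be revised after evaluation. The task labels and the `pick_model` helper below are hypothetical. Only `gpt-5.4`, `gpt-5.4-nano`, and `o3` appear as model IDs in this article's examples; the other identifiers are assumed spellings and should be checked against the live models list.

```python
# Hypothetical task labels mapped to models, following the decision matrix.
# "gpt-5.5" is an assumed ID spelling; verify before use.
ROUTES = {
    "high_stakes_reasoning": "gpt-5.5",
    "messy_extraction": "gpt-5.5",
    "production_default": "gpt-5.4",
    "budget_reasoning": "o3",
    "routing_triage": "o4-mini",
    "bulk_classification": "gpt-5.4-nano",
    "highest_volume": "gpt-4.1-nano",
}


def pick_model(task_type: str) -> str:
    """Return the model for a task type, falling back to the production default."""
    return ROUTES.get(task_type, ROUTES["production_default"])


if __name__ == "__main__":
    print(pick_model("budget_reasoning"))
    print(pick_model("something_unlisted"))
```

Keeping the mapping in data rather than in scattered `if` branches makes it easy to downgrade a route to a cheaper model once an evaluation shows the output is still correct.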

3. Schema-First Structured Outputs

The paradigm shift in 2026 is schema-first development: define your Pydantic or Zod schema first, then write the prompt around it. Not the other way around. This forces you to think about the output contract before writing the system prompt, which improves prompt quality and eliminates schema drift.

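Structured Outputs enforce schema compliance at decoding time: with strict mode on, the model can only emit tokens that keep the output valid against your schema. That removes parsing failures from hallucinated fields, extra keys, or missing required fields. Two failure modes remain, refusals and truncation at the output token limit, and both are covered in section 4.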

            Python — schema-first structured outputs
            
from openai import OpenAI
from pydantic import BaseModel
from typing import List, Literal, Optional

# Step 1: Define the schema (before the prompt)
class BugReport(BaseModel):
    severity: Literal["critical", "high", "medium", "low"]
    file: str
    line: int
    description: str
    suggested_fix: Optional[str] = None   # nullable: strict mode always emits the key, but the value may be null

class AuditResult(BaseModel):
    passed: bool
    issues: List[BugReport]
    summary: str

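# Step 2: Build the prompt around the schema
client = OpenAI()  # reads OPENAI_API_KEY from the environment
# `code` below is assumed to hold the source text to audit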

response = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": "You are a code auditor. Return findings matching AuditResult schema."},
        {"role": "user",   "content": f"Audit this code:\n\n{code}"}
    ],
    response_format=AuditResult,
)

result: AuditResult = response.choices[0].message.parsed
# result.issues is a typed list of BugReport objects — no parsing needed
            
        

JSON Mode is Legacy. Use strict: true (Structured Outputs) exclusively. JSON Mode only asks the model to try to produce valid JSON; it does not enforce your schema. In 2026, treating JSON Mode and Structured Outputs as interchangeable is a reliability bug waiting to happen.
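The difference is visible in the `response_format` payload itself. The sketch below puts the two shapes side by side for the Chat Completions API; the `ticket_label` schema is an invented example, not something from this article.

```python
import json

# JSON mode: the model is asked for syntactically valid JSON, with no schema.
json_mode_format = {"type": "json_object"}

# Structured Outputs: a named JSON Schema with strict enforcement.
# The schema contents here are purely illustrative.
strict_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket_label",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "label": {"type": "string", "enum": ["bug", "feature", "question"]},
                "confidence": {"type": "number"},
            },
            # Strict mode expects every property to be listed as required
            # and additionalProperties to be False.
            "required": ["label", "confidence"],
            "additionalProperties": False,
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(strict_format, indent=2))
```

When a Pydantic model is passed to the SDK's `parse` helper, as in the earlier example, the SDK builds the second shape for you; writing it by hand is mainly needed for raw HTTP calls and batch request files.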

4. The 16k Output Token Limit

Structured Outputs now support up to 16k output tokens for complex extractions. This matters for tasks like extracting all issues from a large codebase audit, pulling structured data from long documents, or generating detailed JSON reports. Set max_completion_tokens (the successor to the deprecated max_tokens parameter) high enough to fit the full response: JSON truncated mid-object still counts as a failure.

            Python — handle schema refusals
            
response = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=messages,
    response_format=AuditResult,
    max_completion_tokens=16000   # successor to max_tokens; allow full output for complex schemas
)

message = response.choices[0].message

# Handle refusals as first-class errors (not exceptions)
if message.refusal:
    print(f"Model refused: {message.refusal}")
else:
    result = message.parsed
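Truncation deserves the same first-class treatment as refusals. Here is a minimal sketch that classifies a Chat Completions response by checking `message.refusal` and `finish_reason`. `classify_outcome` is a hypothetical helper, and the stand-in object only mimics the shape of an SDK response. It is written for responses from the plain `create()` call; depending on SDK version, the `parse()` helper may raise an exception on a truncated response instead of returning it.

```python
from types import SimpleNamespace


def classify_outcome(response) -> str:
    """Classify a Chat Completions response as 'ok', 'refused', or 'truncated'."""
    choice = response.choices[0]
    if getattr(choice.message, "refusal", None):
        return "refused"
    # finish_reason == "length" means the output hit the token cap,
    # so any JSON in message.content is likely cut off mid-object.
    if choice.finish_reason == "length":
        return "truncated"
    return "ok"


if __name__ == "__main__":
    # Stand-in object shaped like an SDK response, for illustration only.
    fake = SimpleNamespace(
        choices=[SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(refusal=None, content='{"passed": tr'),
        )]
    )
    print(classify_outcome(fake))  # prints "truncated"
```

A truncated result is usually worth retrying with a higher token cap or a smaller input chunk, whereas a refusal should be surfaced to the caller rather than retried blindly.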
            
        

5. Function Calling vs Structured Outputs

Both produce structured data, but serve different purposes. Know which to use:

Function calling: The model decides when and whether to call a function during conversation. It returns a function name and arguments. You execute and return the result. Use this for agentic tool use where the model drives the workflow.
Structured outputs via response_format: You want the final response in a guaranteed schema. The model always produces the schema. No decision-making. Use this for typed responses at the end of a task.

In a multi-agent pipeline: use function calling for agents that call tools (search, code execution, file access). Use structured outputs for agents that produce final artifacts (reports, audit results, classifications).
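To illustrate the function-calling half, here is a sketch of a tool definition in the shape the Chat Completions `tools` parameter accepts, plus a small dispatcher that runs whichever tool the model names. The `get_order_status` tool, the `REGISTRY` mapping, and the `dispatch` helper are all hypothetical.

```python
import json


def get_order_status(order_id: str) -> dict:
    """Hypothetical local tool; a real one would query a database or service."""
    return {"order_id": order_id, "status": "shipped"}


# Tool definition passed to the API so the model knows what it may call.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
            "additionalProperties": False,
        },
        "strict": True,
    },
}]

# Registry mapping tool names to local callables.
REGISTRY = {"get_order_status": get_order_status}


def dispatch(name: str, arguments_json: str) -> str:
    """Run the tool the model asked for and return its result as a JSON string."""
    args = json.loads(arguments_json)  # the model sends arguments as a JSON string
    result = REGISTRY[name](**args)
    return json.dumps(result)


if __name__ == "__main__":
    print(dispatch("get_order_status", '{"order_id": "A1"}'))
```

In a real loop, the name and arguments come from each entry in `message.tool_calls` (`.function.name` and `.function.arguments`), and the dispatcher's return value goes back to the model as a message with role `tool` and the matching `tool_call_id`. The final answer can then be requested with `response_format`, which is where structured outputs take over.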

6. The Batch API: 50% Off Async Work

Any work that does not need real-time results should go through the Batch API. Batches complete within a 24-hour window at 50% of the standard price. Thousands of classification jobs, documentation generation, data extraction at scale: all at half cost.

            Python — batch API with current models
            
import json
from openai import OpenAI

client = OpenAI()

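# Assumed to be defined elsewhere:
#   documents: list[str] of texts to classify
#   classification_schema: a JSON Schema dict describing the output
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5.4-nano",   # cheap model for bulk work
            "messages": [
                {"role": "user", "content": f"Classify this document: {doc}"}
            ],
            # Structured Outputs nest name, strict, and schema under "json_schema".
            # A name is required; "classification" is an arbitrary label.
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "classification",
                    "strict": True,
                    "schema": classification_schema,
                },
            },
            "max_completion_tokens": 128
        }
    })

# Write one JSON request per line
with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

# Upload the input file, closing the handle afterwards
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)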

print(f"Batch ID: {batch.id}")
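Once the batch finishes, results arrive as a JSONL file in which each line carries the `custom_id` from the request. Output order is not guaranteed to match input order, so results should be matched by `custom_id`. The sketch below parses such a file, assuming each record has `custom_id`, `response` (with `status_code` and `body`), and `error` fields, and that the message content is the JSON produced under the structured-output schema. `parse_batch_output` is a hypothetical helper name.

```python
import json


def parse_batch_output(jsonl_text: str) -> tuple[dict, dict]:
    """Split batch output lines into successes and failures, keyed by custom_id."""
    successes, failures = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response")
        if record.get("error") or response is None or response.get("status_code") != 200:
            failures[custom_id] = record.get("error") or response
            continue
        # The body is a normal chat completion object.
        content = response["body"]["choices"][0]["message"].get("content")
        if content is None:
            # No content, for example because the model refused.
            failures[custom_id] = response["body"]
            continue
        successes[custom_id] = json.loads(content)
    return successes, failures
```

In practice the text would come from downloading the batch's output file, for example with `client.files.content(batch.output_file_id).text` after `client.batches.retrieve(batch.id)` reports a status of `completed`.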
            
        

7. o3 in a Multi-Model Pipeline

o3 is not a general-purpose model — it is a reasoning model. The practical pattern from OpenAI's own docs: use GPT-5.4 to triage order details and identify issues, then feed structured data into o3 to make the final decision. o3 thinks longer and produces more reliable answers on ambiguous, multi-step decisions. It is overkill for simple extraction.

            Python — model tiering with o3
            
# Assumed defined elsewhere: ExtractedData (a Pydantic model) and raw_document (str)
# Step 1: Extract structured data cheaply
extraction = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=[{"role": "user", "content": f"Extract key fields from: {raw_document}"}],
    response_format=ExtractedData,
)
data = extraction.choices[0].message.parsed

# Step 2: Use o3 for the actual decision (feeds on structured input)
decision = client.chat.completions.create(
    model="o3",
    messages=[
        {"role": "system", "content": "Given structured data, make a final approval decision."},
        {"role": "user",   "content": data.model_dump_json()}
    ]
)
            
        

8. Rate Limits: TPM Is Often the Real Bottleneck

OpenAI rate limits run on two axes: requests per minute (RPM) and tokens per minute (TPM). In agent pipelines with large prompts, TPM is usually the actual bottleneck, not RPM. Track both. Use exponential backoff with jitter on 429s, and implement a sliding window counter to pre-rate-limit yourself before hitting the API limit.

            Python — backoff with jitter
            
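import random
import time

import openai

def call_with_retry(fn, max_retries=5):
    """Call fn(), retrying on 429 rate-limit errors with exponential backoff plus jitter."""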
    for attempt in range(max_retries):
        try:
            return fn()
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
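Backoff handles 429s after they happen; the sliding window counter mentioned above tries to avoid them. Here is a minimal client-side sketch, assuming you can estimate the token count of each request before sending it. `SlidingWindowLimiter` is a hypothetical class, and its count is only an approximation of what the API measures on its side.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Track tokens spent in the last `window` seconds (client-side estimate)."""

    def __init__(self, tpm_limit: int, window: float = 60.0, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.window = window
        self.clock = clock      # injectable so tests can control time
        self.events = deque()   # (timestamp, tokens) pairs, oldest first
        self.used = 0

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_acquire(self, tokens: int) -> bool:
        """Reserve `tokens` if they fit in the current window; otherwise return False."""
        now = self.clock()
        self._evict(now)
        if self.used + tokens > self.tpm_limit:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True

    def wait_time(self) -> float:
        """Seconds until the oldest event leaves the window (0.0 if none)."""
        now = self.clock()
        self._evict(now)
        if not self.events:
            return 0.0
        return max(0.0, self.window - (now - self.events[0][0]))
```

A caller would check `try_acquire(estimated_tokens)` before each request and sleep for `wait_time()` when it returns False. Setting `tpm_limit` somewhat below the account's real limit leaves headroom for estimation error.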
            
        

"Use the cheapest model that produces correct output. Then optimize the prompt. Then consider fine-tuning. In that order."

Key Takeaway: GPT-5.4 is the production default in June 2026, not GPT-4o. o3 dropped 87% in price versus o1 and is now viable for complex reasoning tasks. JSON Mode is legacy; use strict Structured Outputs. And schema-first development means defining the Pydantic model before writing the prompt.