OpenAI’s o3 and o4‑mini: Bold Reasoning with a Side of Hallucinations
Exploring the features, performance, and pitfalls of OpenAI’s latest “reasoning” models


OpenAI rolled out two new “reasoning” models—o3 and its streamlined counterpart o4‑mini—that for the first time let ChatGPT “think with images” by integrating visual inputs into their internal chain of thought (The Verge, Apr 16, 2025). These models also agentically invoke ChatGPT’s full suite of tools—web browsing, Python execution, image analysis, and file interpretation—to tackle complex, multi‑step tasks. Yet OpenAI’s own benchmarks reveal a surprising spike in hallucinations: o3 errs on about 33% of PersonQA queries, while o4‑mini misses the mark nearly half the time at 48%, compared with roughly 16% or less for earlier reasoning models. This trade‑off between bold new capabilities and factual reliability raises critical questions for AI development and the future path toward genuine Artificial General Intelligence (AGI).
Introduction
OpenAI’s latest update marks a watershed in large‑language‑model development. For the first time, ChatGPT can not only read and write but also see and reason over images—annotating diagrams, rotating photos, and spotlighting details as part of its logical deductions. Alongside these visual upgrades, o3 and o4‑mini embed deep tool integration, autonomously browsing the web, running Python code, and parsing files—all within a single conversational thread.
What Are o3 and o4‑mini?
o3: OpenAI’s flagship reasoning model, designed to devote extra “think time” to complex prompts by internally generating longer chains of intermediate reasoning steps before answering.
o4‑mini: A cost‑ and speed‑optimized variant that retains the core reasoning and visual capabilities of o3 but at lower latency and computational expense, making it suitable for high‑volume or real‑time tasks (Wikipedia, accessed Apr 21, 2025).
Both models replace earlier “mini” offerings—o1‑mini and o3‑mini—and join the company’s paid‑tier lineup alongside GPT‑4.1 and the soon‑arriving GPT‑5.
Key Features
1. Visual Chain‑of‑Thought
o3 and o4‑mini can natively process images during their reasoning phase—zooming in on circuit diagrams, highlighting sections of a data chart, or rotating a whiteboard sketch to better understand the user’s intent.
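For developers who want to try this visual reasoning path outside ChatGPT, the sketch below shows one plausible way to pass an image to o4‑mini through the OpenAI Python SDK’s Chat Completions endpoint. The prompt and image URL are hypothetical placeholders, and the exact multimodal payload the o‑series models accept should be checked against OpenAI’s current API reference.

```python
# Minimal sketch: asking o4-mini to reason over an image.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY in the environment.
# The prompt and image URL below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which component looks mislabeled in this circuit diagram?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/circuit.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```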
2. Autonomous Tool Use
The models automatically invoke ChatGPT’s extended toolkit:
Live Web Browsing for up‑to‑the‑minute data and news.
Python Execution to analyze datasets and generate charts on the fly.
Image Analysis & Generation to interpret or create visual assets.
File Interpretation for extracting structured information from PDFs, spreadsheets, and more.
This unified workflow eliminates the need to juggle multiple tools manually, offering a seamless AI assistant experience.
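Inside ChatGPT this tool selection happens behind the scenes, but API users can approximate the same agentic loop with function calling. The sketch below registers a single made‑up get_stock_price tool and lets o4‑mini decide whether to call it; the tool name and schema are illustrative, not part of OpenAI’s built‑in toolkit, and executing the call and returning its result to the model is omitted for brevity.

```python
# Minimal function-calling sketch: the model chooses whether to invoke a tool.
# `get_stock_price` is a made-up example tool, not an OpenAI built-in.
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Look up the latest price for a stock ticker.",
            "parameters": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "How is NVDA trading today?"}],
    tools=tools,
)

# If the model decided a tool is needed, it returns a structured call instead of prose.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```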
Performance & Hallucination Rates
Defining “Hallucinations”
A hallucination occurs when the model confidently delivers incorrect or fabricated information—a critical concern for domains like medicine, law, and academia.
Benchmark Outcomes
OpenAI’s internal PersonQA benchmark—designed to probe knowledge about people—revealed:
o3: 33% hallucination rate.
o1 (prior reasoning baseline): 16%.
o3‑mini: 14.8%.
o4‑mini: 48%.
According to OpenAI’s system card:
“o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims. More research is needed to understand the cause of this result.”
Experts suggest that as models extend their internal reasoning trajectories—making more intermediate assertions—they also compound risks of error at each step.
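A rough back‑of‑envelope illustration of that compounding effect: if each intermediate claim were independently wrong with some small probability, the chance of at least one error rises quickly as the chain grows. The per‑claim error rate and claim counts below are illustrative assumptions, not figures from OpenAI.

```python
# Back-of-envelope: how a small per-claim error rate compounds over longer reasoning chains.
# The 2% per-claim error rate and the claim counts are illustrative assumptions only.
per_claim_error = 0.02

for n_claims in (5, 20, 50):
    # Probability that at least one of n independent claims is wrong.
    p_any_error = 1 - (1 - per_claim_error) ** n_claims
    print(f"{n_claims:>2} claims -> {p_any_error:.1%} chance of at least one error")
```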
The AGI Debate & Broader Implications
Some within OpenAI have touted o3’s performance as a step toward AGI—a system matching human‑level problem‑solving across domains. However, AGI remains a loosely defined aspiration, generally requiring robust transfer learning and self‑directed goal‑setting well beyond current LLM capabilities.
The hallucination spike underscores a persistent challenge: scaling reasoning prowess often outpaces safeguards on factual accuracy. Future research may explore hybrid architectures that ground neural outputs in symbolic verification or real‑time fact‑checking to tether AI claims to verifiable data.
GPT‑5 and Safety Measures
GPT‑5’s timeline has been adjusted to integrate lessons from the o‑series rollout, particularly around mitigating hallucinations. OpenAI’s updated preparedness framework aims to rigorously audit model behavior in diverse scenarios, but the unexpected rise in errors calls for continued transparency and third‑party evaluation.
Conclusion
OpenAI’s o3 and o4‑mini models herald an era of multi‑modal AI assistants that can see, think, and act across texts, images, and tools. Yet, the sharp uptick in hallucinations serves as a caution: true progress demands that new capabilities be matched by robust reliability measures. As the AI community watches closely, the balance between bold innovation and trustworthy performance will define the next frontier of intelligent systems.
Sources
https://openai.com/index/introducing-o3-and-o4-mini/
https://platform.openai.com/docs/models/o4-mini