
The other day I was brainstorming with ChatGPT and all of a sudden it veered into a long fantasy story that had nothing to do with my queries. It was so ridiculous that it made me laugh. Lately, I haven't seen mistakes like this as often with text prompts, but I still see them pretty regularly with image generation.
These random moments when a chatbot strays from the task are known as hallucinations. What's odd is how confident the chatbot sounds while giving the wrong answer, which is one of the biggest weaknesses of today's AI assistants. However, a new study from OpenAI argues these failures aren't random, but a direct result of how models are trained and evaluated.
Why chatbots keep guessing when they shouldn’t

The research points to a structural issue behind hallucinations: the benchmarks and leaderboards used to rank AI models reward confident answers.
In other words, when a chatbot says “I don’t know,” it gets penalized in testing. That means the models are effectively encouraged to always provide an answer, even if they’re not sure it’s right.
In practice, that makes your AI assistant more likely to guess than admit uncertainty. For everyday queries, this can be harmless. But in higher-stakes cases, from medical questions to financial advice, those confident errors can quickly turn dangerous.
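To see why that incentive exists, here's a toy sketch in Python (my own illustration, not scoring code from the OpenAI paper): under a simple accuracy-only grader, even a low-confidence guess has a better expected score than admitting uncertainty, which earns nothing.

```python
# Toy illustration: expected benchmark score for guessing vs. abstaining
# under an accuracy-only grader (1 point if correct, 0 otherwise).
# The 20% confidence figure is made up for the example.

def expected_score_guess(p_correct: float) -> float:
    """A guess earns 1 point with probability p_correct, 0 otherwise."""
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0

def expected_score_abstain() -> float:
    """Saying 'I don't know' earns zero under accuracy-only scoring."""
    return 0.0

p = 0.2  # the model is only 20% sure of its answer
print(expected_score_guess(p))   # 0.2
print(expected_score_abstain())  # 0.0 -> guessing always scores at least as well
```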
That's why, as a power user, I always fact-check and ask the chatbot to cite its sources. Sometimes when the information seems too far-fetched and I ask for a source, the chatbot will reply with something like "Good catch!" without ever admitting it was wrong.
Newer models aren’t immune

Interestingly, OpenAI’s paper found that reasoning-focused models like o3 and o4-mini actually hallucinate more often than some older models. Why? Because they produce more claims overall, which means more chances to be wrong.
So a model being "smarter" at reasoning doesn't necessarily make it more honest about what it doesn't know.
What can fix this problem?

Researchers argue that the solution is to change how we score and benchmark AI. Instead of punishing models for saying "I'm not sure," the most valuable tests would reward calibrated responses, uncertainty flags or the ability to defer to other sources.
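One way a benchmark could do that, sketched below in the same toy style (again, my illustration rather than OpenAI's actual proposal), is to make a confident wrong answer cost more than saying "I don't know," so blind guessing no longer pays off.

```python
# Sketch of a "calibrated" grader: wrong answers cost more than abstaining,
# so guessing only pays off when the model is fairly confident.
# The penalty value is arbitrary and chosen purely for illustration.

CORRECT_SCORE = 1.0
ABSTAIN_SCORE = 0.0    # "I'm not sure"
WRONG_PENALTY = -1.0   # confident wrong answer

def expected_score(p_correct: float, answers: bool) -> float:
    """Expected score if the model answers (True) or abstains (False)."""
    if not answers:
        return ABSTAIN_SCORE
    return p_correct * CORRECT_SCORE + (1.0 - p_correct) * WRONG_PENALTY

# With this rule, answering only beats abstaining when p_correct > 0.5.
for p in (0.2, 0.5, 0.8):
    print(p, expected_score(p, answers=True), expected_score(p, answers=False))
```

With a rule like this, answering only beats staying quiet once the model's confidence clears a threshold, which is exactly the kind of calibration the researchers are asking for.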
That could mean your future chatbot might hedge more often: less "here's the answer" and more "here's what I think, but I'm not certain." It may feel slower, but it could dramatically reduce harmful errors, and it's a reminder that critical thinking on our part is still important.
Why it matters for you

If you're using popular chatbots like ChatGPT, Gemini, Claude or Grok, you've almost certainly seen a hallucination. This research suggests it's not entirely the model's fault, but the way models are tested, as if evaluation were a game of which model can be right most often.
For users, that means we need to be diligent and treat AI answers as a first suggestion, not the final word. And for developers, it's a sign that it's time to rethink how we measure success so that future AI assistants can admit what they don't know instead of getting things completely wrong.