Here are a few references on the “reliability” of GPT-5, including its “hallucinations”, since OpenAI claimed the model is significantly better in this respect. (Note that this doesn’t cover image synthesis, which is a separate module from GPT-5; in fact, I think image-generation requests get handed off to the separate gpt-image-1 model.) Hallucination is clearly a problem with many facets, so improvement in one facet doesn’t necessarily mean improvement in the others; that said, the model does seem improved in several of the facets where people pointed out failings in the past. And it’s especially important to get this right if we ever want to trust AIs to make accurate medical diagnoses.
For example, you may have heard about the problem of counting the r’s in “strawberry”, which previous models tended to get wrong. See this tweet by Daniel Litt, a University of Toronto mathematician:
https://x.com/littmath/status/1954978847759405501#m
Just asked GPT-5 to count letters in various words (b’s in blueberry, y’s in syzygy, r’s in strawberry, i’s in antidisestablishmentarianism, etc.) and it got 10/10 right. I have no doubt one can elicit poor performance in this kind of question but it’s atypical.
Also see this tweet, where someone ran a more thorough test:
https://x.com/ctjlewis/status/1955202094131974438#m
there’s no point showing a heatmap here because i measured a straight diagonal line. i couldn’t produce an error at all for strings up to 50 characters over API. lengths 1-50 chars, 10 samples each, 500 samples total, zero errors.
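For what it’s worth, here is a minimal sketch of how one might reproduce that second test with the OpenAI Python SDK. The model name, prompt wording, and random-string generation are my assumptions, not details taken from the tweet:

```python
# Hypothetical reproduction of the letter-counting test described above.
# Assumes the `openai` package (v1+) is installed, OPENAI_API_KEY is set,
# and "gpt-5" is the model identifier; the prompt wording is a guess.
import random
import string

from openai import OpenAI

client = OpenAI()

def letter_count_errors(length: int, samples: int = 10) -> int:
    """Ask the model to count a letter in `samples` random strings; return error count."""
    errors = 0
    for _ in range(samples):
        s = "".join(random.choices(string.ascii_lowercase, k=length))
        target = random.choice(string.ascii_lowercase)
        truth = s.count(target)
        resp = client.chat.completions.create(
            model="gpt-5",  # assumed model name
            messages=[{
                "role": "user",
                "content": f"How many occurrences of the letter '{target}' "
                           f"are in the string '{s}'? Answer with just a number.",
            }],
        )
        try:
            answer = int((resp.choices[0].message.content or "").strip())
        except ValueError:
            answer = -1  # an unparseable reply counts as an error
        if answer != truth:
            errors += 1
    return errors

# Lengths 1-50, 10 samples each: 500 samples total, as in the tweet.
total = sum(letter_count_errors(n) for n in range(1, 51))
print(f"errors over 500 samples: {total}")
```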
Also see this tweet from OpenRouter about their testing of various models:
https://x.com/OpenRouterAI/status/1956030489900560769#m
After one week, GPT-5 has topped our proprietary model charts for tool calling accuracy🥇
In second is Claude 4.1 Opus, at 99.5%
GPT-5 is in first-place with a score of 99.9%, which is a 5x lower error-rate than Claude 4.1 Opus.
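(To spell out the arithmetic behind the “5x” claim: 99.9% accuracy means a 0.1% error rate, versus 0.5% for Claude 4.1 Opus at 99.5%, and 0.5 / 0.1 = 5.)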
Then there are some of the benchmark results in OpenAI’s GPT-5 System Card technical report:
https://openai.com/index/gpt-5-system-card/
They ran tests using some of their own new benchmarks, but also used ones by Meta, Google DeepMind (LongFact), and the University of Washington (FActScore), and saw large improvements. They saw less improvement – though still significant – on SimpleQA, which is OpenAI’s own older benchmark. However, on SimpleQA they only showed the “no web” results. In this setting, “accuracy” mainly measures how good the model’s latent memory of facts is, since it can’t look anything up to verify. And the “hallucination rate” here is the error rate minus the refusal rate. You see an improvement here too, though it’s less impressive than the other reliability gains.
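To illustrate that definition with made-up numbers: if a model answers 60% of questions correctly and refuses to answer 10% of them, its error rate is 40%, and its hallucination rate is 40% − 10% = 30% – the share of questions where it confidently asserted something false.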
Finally, it’s worth pointing out that, even if the technique isn’t used in GPT-5, OpenAI does seem to know how to substantially mitigate hallucinations in certain specific domains like math. This is evident from their system that won a gold medal on the 2025 IMO without using an external verifier, just text – producing pages and pages of natural-language math without a single error.