Updated Friday August 8, 5:21 pm ET: shortly after this post’s publication, OpenAI co-founder and CEO Sam Altman announced the company would restore access to GPT-4o and other old models for selected users, admitting the GPT-5 launch was “more bumpy than we hoped for.”
The launch of OpenAI’s long-anticipated new model, GPT-5, is off to a rocky start, to say the least.
Even setting aside the errors in charts and voice demos during yesterday’s livestreamed presentation of the new model (actually four separate models, plus a ‘Thinking’ mode that can be engaged for three of them), a number of user reports have emerged since GPT-5’s release showing it erring badly on relatively simple problems that preceding OpenAI models, and rivals from competing AI labs, answer correctly.
For example, data scientist Colin Fraser posted screenshots showing GPT-5 getting a math proof wrong: whether 8.888 repeating is equal to 9 (it is, of course, not).
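The underlying math is easy to verify. As a quick illustrative sketch (not from Fraser’s post), Python’s exact rational arithmetic shows that 8.888 repeating equals 8 + 8/9 = 80/9, which falls short of 9 by exactly 1/9:

```python
from fractions import Fraction

# 8.888... repeating is 8 + 8/9 = 80/9 as an exact fraction
repeating = Fraction(8) + Fraction(8, 9)

print(repeating)                 # 80/9
print(repeating == 9)            # False: 80/9 is not 9
print(Fraction(9) - repeating)   # 1/9, the exact gap
```

(By contrast, 8.999 repeating really would equal 9, which may be the intuition that trips models up.)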
It also failed on a simple algebra problem that elementary schoolers could probably nail: 5.9 = x + 5.11, where x = 0.79.
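This one, too, is trivially checkable. A minimal sketch (my own, not from the user reports) using exact decimal fractions sidesteps even binary floating-point rounding noise:

```python
from fractions import Fraction

# Solve 5.9 = x + 5.11 exactly: x = 5.9 - 5.11
x = Fraction("5.9") - Fraction("5.11")

print(x)         # 79/100
print(float(x))  # 0.79
```

Plain floats (5.9 - 5.11) would carry a tiny binary rounding error, which is one plausible reason decimal arithmetic like this trips up language models trained on tokenized text rather than exact computation.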
Using GPT-5 to judge OpenAI’s own erroneous presentation charts also did not yield helpful or correct responses.
It also failed on this trickier math word problem below (which, to be fair, stumped this human at first, though Elon Musk’s Grok 4 AI answered it correctly). For a hint, note that the flagstones in this case can’t be divided into smaller portions; they must remain intact as 80 separate units, so no halves or quarters.
The older 4o model performed better for me on at least one of these math problems. Unfortunately, OpenAI is slowly deprecating those older models — including the former default GPT-4o and the powerful reasoning model o3 — for users of ChatGPT, though they’ll continue to be available in the application programming interface (API) for developers for the foreseeable future.
Not as good at coding as benchmarks indicate
Even though OpenAI’s internal benchmarks and some third-party ones have shown GPT-5 to outperform all other models at coding, in real-world usage Anthropic’s recently updated Claude Opus 4.1 seems to do a better job at “one-shotting” certain tasks, that is, completing the user’s desired application or software build to their specifications. See an example below from developer Justin Sun posted to X:
Opus 4.1’s one-shot attempt at “create a 3d capybara petting zoo” – 8 minutes total. This was honestly pretty insane, not only are the capybaras way cuter and moving, there are individual pet affinity levels, a day/night switcher, feeding, and even a screenshot feature. pic.twitter.com/FiKTO3FKK4 — justin (@justinsunyt) August 7, 2025
In addition, a report from security firm SPLX found that OpenAI’s internal safety layer left major gaps in areas like business alignment and vulnerability to prompt injection and obfuscated logic attacks.
While anecdotal, checking the tempera …