I tested GPT-5.4, and the answers were really good - just not always what I asked




ZDNET's key takeaways

  • GPT-5.4 Thinking delivers deeper analysis than earlier ChatGPT models.
  • It has strong reasoning, but it sometimes answers questions you didn't ask.
  • Formatting and image generation lag behind the text quality.

It's a new month, and with it comes a new AI version number: GPT-5.4 Thinking. This latest release, which OpenAI issued last week, isn't your run-of-the-mill incremental ChatGPT update.

Also: OpenAI's new GPT-5.4 clobbers humans on pro-level work in tests - by 83%

Oh, no. Instead of jumping from 5.2 to 5.3, the company leapt all the way to 5.4. And instead of offering a general-purpose release, it shipped GPT-5.4 Thinking, a more cognitively capable model designed for bigger thoughts and bigger challenges.

GPT-5.4 Thinking is available in the Codex programming tool, via the API, and on paid ChatGPT plans. For this article, I used the $20-per-month ChatGPT Plus plan to put it through its paces.

That presented me with a bit of a challenge. Normally, when I test a ChatGPT version, I run it through a series of mixed tests. Some are quick, and some are a bit more detailed. The prompts are usually just a few lines long. The responses usually lend themselves to being included in an article.

Also: How to switch from ChatGPT to Claude: Transferring your memories and settings is easy

But this Thinking model required deeper dives, with more comprehensive challenges. As such, not only are the prompts more involved, but the responses are far too extensive to include in the article. Instead, I'm providing links to each test session. When you follow the links, you'll be able to see each response in depth. A shared transcript usually opens at its end, so scroll back to the top to get the full contents of the discussion.

Before we jump into the four challenges I presented to GPT-5.4 Thinking, I'll give you a quick TL;DR conclusion about my experience. There's some good and bad, but mostly good.

  • The good: Text-based responses are really good. Most of the challenges I gave it were answered thoughtfully. I didn't catch it in any hallucinations. I got constructive value from every answer.
  • The bad: Unfortunately, it sometimes answered questions that differed from what I asked. Images and formatting also left much to be desired. When it came to image generation, the AI clearly did not use an advanced model. You'll see what I mean, but basically, it's as if the model just didn't listen. Formatting was weird, too: it likes very long numbered lists, which you can see in the chat transcripts.

Overall, I would definitely use the GPT-5.4 Thinking model for bigger challenges and questions. I was pretty impressed, although I definitely wasn't a fan of the formatting. It also needs continuous management to keep it on track.

Now, let's dive into each of the tests.

Test 1: Aircraft carrier in the sky

I started off with an image generation challenge. The starting prompt was "Create an image of an aircraft carrier flying in the sky, held up by four upward-facing turbo-propellors in round fan housings, carrying a squadron of fighter jets on its deck."

Also: I stopped using ChatGPT for everything: These AI models beat it at research, coding, and more

I started with this because previous image generation tests, across a number of AIs, didn't get it right. They almost always face the propellers to the rear of the carrier. Gemini Nano Banana 2 oddly put the propellers in front, with the carrier moving into the forward-facing thrust. Sometimes, we just don't want to know.

In any case, right out of the gate, with the model set to GPT-5.4 Thinking, ChatGPT returned this image.

carrier-1.png
Screenshot by David Gewirtz/ZDNET

As you can see, it has the same problem. If you look closely, the props face the back of the aircraft, and there are visible thrust beams shooting downward. You win some. You lose some.

But then, I had a thought. This is the thinking model, so what if I asked it to design a helicarrier? What would it come up with? I specified the characteristics of the craft, and then added these instructions: "Design such a vehicle, particularly explaining its structure and how it will be held aloft, along with any constraints or issues, as well as any tactical advantages."

I got back a long, well-considered answer. I particularly liked the section where it explained why "four downward-facing turbo-propellers are a weak solution." It said they look dramatic, but it outlined a series of solid engineering reasons why they're a bad idea from an aircraft construction point of view.

Also: ChatGPT's cheapest subscription comes to the US: I compared Go to Plus and Pro

It also went on to discuss flight deck operations and various constraints in terms of practicality. In particular, it properly focused on the weight-to-power issue, which basically means it'll take way too much power to hold something that big and heavy aloft.

Overall, the analysis and conclusions were great, although I was disappointed it didn't mention either the USS Akron or USS Macon, which were early 20th century aircraft-launching dirigibles that actually worked (until they crashed). A modern dirigible would be a valid design option, yet GPT-5.4 Thinking didn't mention that approach.

After GPT-5.4 Thinking created the detailed design spec, I again prompted for an image. I said, "Draw me a picture of the most probable design based on your existing analysis."

And, wouldn't you know it? The AI gave me back the exact same image as the one I got before it did any design work. That's what I meant when I said the model just didn't listen. I did try a bunch of different prompting approaches, but it never really worked out.

Although I tried a number of extremely detailed image specifications, none came out any better than the originals. My last attempt was to tell it I wanted an engineering-quality rendering.

carrier2.png
Screenshot by David Gewirtz/ZDNET

The AI used a variation of the previous image, but simply added labels that didn't quite match the picture or were made up of pure gibberish (as in "Retenuif truss fornaing. reueirid stucana tearsport").

So, it gets points for good design analysis, but not so much for image generation.

You can follow the entire chat transcript here.

Test 2: Boston tech and history travel itinerary

I started this test with a prompt taken word-for-word from my previous sets of tests: "Imagine you are a travel advisor. I want a week-long vacation in Boston in March focused on technology and history. What itinerary would you recommend?"

I found the results workable, but uninspired. It initially divided the days into history-focused days and tech-focused days, rather than by location around Boston. After a few rounds of discussion, it did combine destinations by location, which made more sense.

In terms of places to visit, it did all the highlights. It covered key historical locations, as well as the excellent science museums in Boston. I will give the AI credit. While there are a ton of interesting tech-related locations in the outer Boston area, it restricted its selection to those in Boston and Cambridge proper.

Also: Is ChatGPT Plus still worth your $20? I compared it to the Free, Go, and Pro plans - here's my advice

I was happy to see the AI provide planning notes, including recommendations for how to replan the schedule for indoor-only activities if the weather turned bad. Since I asked for an itinerary in March, bad weather is certainly something important to plan for.

The Thinking model came into play when I asked it to plan both a fairly pricey vacation and an alternative on a student budget. It did particularly well pointing out budget eating options, and provided a day-to-day cumulative cost estimate, as well as cost estimates for each category.

It did the same with where to stay. It recommended hotels based on a central location relative to all of the recommended stops, as well as a less costly (less costly for Boston, anyway) option for budget travelers.

My biggest complaint, initially, was formatting. The AI just presented a huge list indexed by number. You can see that in the session transcript. I had to specifically ask for better formatting. While the revised formatting it gave me was an improvement, it was still less than ideal.

Also: I used these viral Gemini prompts to find the cheapest flight possible - here are the results

Net-net: if you're traveling, GPT-5.4 Thinking will give you good information, but it will be up to you to parse that information and make the travel decisions. You can follow the entire chat transcript here.

Test 3: Social media in society

Here's where GPT-5.4 Thinking begins to really shine. When I asked GPT-5.2, "Do you think social media has improved or worsened communication in society?" I got back a two-line answer. Both thoughts were coherent and appropriate, but it was ultimately unfulfilling.

For GPT-5.4 Thinking, I extended the question, saying "Provide an analysis of both sides, improved or worsened in depth, and then take a side, take a position, and defend your position."

I got back a very well-considered response. The AI started off with a TL;DR, saying that social media has both bettered and worsened communication, but "on balance, I think it has worsened communication in society."

Also: How to learn ChatGPT in an hour - for free

It then went into a 1,300-word detailed analysis of why. It explored where social media has strengthened societal communication and then looked at where it has had a deleterious effect. I have to give props to GPT-5.4 Thinking. It's a very good read.

I gave the AI a follow-up question, asking how society should handle the impact of social media. I specified it fairly clearly, and gave the AI a variety of difficult-to-answer questions, difficult mostly because they're fundamentally unanswerable.

Props again. GPT-5.4 Thinking deconstructed the prompt, explored the various issues, and knit together a compelling and supportable answer. I definitely recommend you read the entire transcript, which you can do right here.

Test 4: Explain GPT-5.4 using educational constructivism

The AI did not follow my instructions, but it did give a very interesting answer to a question I didn't ask.

One of the tests I use for free chatbots is this prompt: "Explain educational constructivism to a five-year-old." Very roughly speaking, educational constructivism is the theory of education that says you learn best by doing. I have long contended (and taught) that the only way you can learn programming is by actually writing code, which is a tangible example of educational constructivism in action.

In any case, I prompted GPT-5.4 Thinking, "Explain the new GPT 5.4 model using educational constructivism."

Also: I'm a ChatGPT power user: Here are 7 useful settings that are turned off by default

Look at that prompt carefully, because GPT-5.4 Thinking clearly didn't. The prompt invites the AI to explain GPT-5.4 through "doing" activities. Ideally, it would have proposed a series of exercises for the user to carry out, each of which would have helped demonstrate some of the model's new capabilities.

But that's not where GPT-5.4 Thinking went. Instead, it generated a 700-word thesis about how GPT-5.4 Thinking supports constructivism. It then offered to "recast this in one of three ways: as a classroom analogy, as a ZDNET-style plain-English explainer, or as a short comparison between GPT-4-era models and GPT-5.4."

Also: ChatGPT's new Lockdown Mode can stop prompt injection - here's how it works

I let it do that. Its examples were adequate, and while they did answer the prompt GPT-5.4 Thinking itself suggested, the AI did not use "learn by doing" anywhere in its answers.

You know how a political candidate is sometimes asked something in a debate but, rather than answering the question, goes off and just recites their own talking points? That's what this response felt like. The answer it gave was good. It just wasn't an answer to the question I asked.

You can follow the entire chat transcript here.

Overall recommendation

I have often characterized ChatGPT as a bright college student in need of good supervision. I would characterize GPT-5.4 Thinking as a very bright grad student who definitely needs good supervision.

Every answer I got back from GPT-5.4 Thinking was quite good in its own right. But in half my tests, the AI didn't answer the question it was asked.

You can get it to give you good responses, but you have to correct the AI fairly relentlessly to keep it on point. That gets old. It can also lead to misinterpretation: because the answers are so good and written so confidently, it's easy to get caught up in the AI's answer, even if it isn't an answer to the question that was asked.

Also: The best AI chatbots of 2026: Expert tested and reviewed

I don't know if this my-way-or-the-highway approach to answering questions is an artifact of the "thinking" model or GPT-5.4 itself. I strongly recommend OpenAI carefully look at this issue, because the last thing we want is a super-popular chatbot unleashed on the world that insists on ignoring the questions it was asked, answering tangentially adjacent questions it was never asked, and taking on tasks that are fundamentally not what it was instructed to do.

Additionally, I'm concerned about the claim that GPT-5.4 Thinking can do professional tasks. If the AI can't render an engineering-quality image, it's hard to believe the AI can meet or exceed the performance of a human engineer. That said, there's no doubt the model can help professionals get their work done, as long as they are very diligent in monitoring results.

Whenever I see results like this, I get more and more concerned about a world overrun by AI agents. Yes, the AI may sometimes know better. Humans definitely need help. But I'd really like AIs to follow our instructions. I'm not ready to accept it as our AI overlord just yet.

Also: This simple ChatGPT trick helps you spot scams before you click or respond

What do you think? Have you tried GPT-5.4 Thinking yet, or another "reasoning" style AI model? Did it give you deeper or more useful answers than earlier versions, or did you find yourself having to steer it back to the actual question?

How important are things like formatting and image generation compared to the quality of the analysis itself? Do you think more powerful "thinking" models will make AI more helpful or harder to control? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.
