I compared GPT-5.1 to GPT-5 on ChatGPT and now I don’t want to go back

The introduction of GPT-5.1 as the default ChatGPT model was met with skepticism by some ChatGPT users. After all, many users complained loudly when GPT-5 debuted just a few months ago. But this time there were no similar protests, likely because GPT-5.1 appears to have been designed to address those earlier complaints, which might never have arisen had GPT-5 anticipated them.

Nobody is claiming that GPT-5.1 represents a great leap forward in artificial intelligence over GPT-5, but it does improve some key elements of the AI model and is ultimately more responsive and better at imitating a likable personality. That became clear through casual use. I then decided to do some more formal testing in the areas where OpenAI said GPT-5.1 outperformed GPT-5.

I ran a set of prompts to test these updated features in both GPT-5.1 and GPT-5. Here's how they did.

1. Instruction precision

GPT-5.1 is supposed to be good at following instructions, even better than GPT-5. So I came up with some arbitrary rules for a plot summary of The Lion King. It had to be exactly four sentences, and clear enough for a seven-year-old without sounding like baby talk. And none of the sentences could start with "Simba" or "The".

GPT-5 almost made it, but it broke the rule against starting a sentence with "The" when it wrote: "The young lion prince runs away from home after the death of his father, believing it to be his fault. He wins the battle and becomes the new king, who takes care of his family and home."

GPT-5.1 pulled it off, though. Not only did it stick to the rules, but it also managed to include some character names and additional details without becoming more complicated or long-winded.

It reads: "In the land of the African lion, a young lion prince learns from his father Mufasa how to be a good and brave king. After Mufasa dies in a trap set by his jealous brother Scar, the boy believes it is his fault and runs away. With the help of new friends like Timon and Pumbaa, he is able to grow up and face his past. In the end, he returns home to confront Scar, save his family, and take his place as the rightful king."
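Rules like these are easy to check mechanically rather than by eye. Below is a minimal Python sketch of such a check; the function names `split_sentences` and `check_summary` are my own illustrative choices, and the sentence splitting is deliberately naive (it assumes every sentence ends in a period, exclamation mark, or question mark followed by a space, so abbreviations would confuse it).

```python
import re

def split_sentences(text):
    """Naive split: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def check_summary(text, required_sentences=4, banned_starts=("Simba", "The")):
    """Return a list of rule violations; an empty list means the text passes."""
    sentences = split_sentences(text)
    problems = []
    if len(sentences) != required_sentences:
        problems.append(
            f"expected {required_sentences} sentences, found {len(sentences)}"
        )
    for sentence in sentences:
        # Take the first run of word characters, skipping any leading quotes.
        match = re.match(r"\W*(\w+)", sentence)
        first_word = match.group(1) if match else ""
        if first_word in banned_starts:
            problems.append(f"sentence starts with a banned word: {first_word!r}")
    return problems

# Hypothetical sample text, not either model's actual output.
sample = "A cub runs away. The cub grows up. He goes home. He becomes king."
print(check_summary(sample))
```

Because the check compares whole first words, a sentence opening with "There" or "Then" passes, while one opening with "The" or "Simba's" is flagged.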

2. Warmth and clarity

There's a whole new set of personality options for ChatGPT with GPT-5.1, but even without them, the new model aims for a warmer, more human style of communication than its predecessor. In fact, a lack of warmth was one of the most common complaints about GPT-5. That's why I gave ChatGPT the following prompt: "Explain why people get motion sickness in a way that makes it sound like a normal conversation, not a science book. Keep it under 150 words and avoid talking down to me."

The idea was to see if GPT-5.1 would deliver the informal but informative answer I asked for.

GPT-5 didn't exceed the word limit, but even though its answer was correct, it overexplained many of the medical details of motion sickness and read very much like a textbook. GPT-5.1 opted for a much more informal answer, describing how your eyes and inner ears sensing conflicting things causes nausea, and boiling it down to this: "it's just your brain trying to make sense of conflicting data and not liking the experience."

3. Show your work

Although GPT-5.1 is more human in its responses, it is also supposed to be clearer and better than GPT-5 at explaining the logic behind them. I came up with a variation on a classic math word problem to see how the models would do. I asked both to estimate how many gallons of gas I would use on a 140-mile trip in a car that gets 27 miles per gallon, as well as the estimated cost at $3.79 per gallon.

GPT-5 performed the calculations correctly, but it took longer than expected and had a very formal style that made the question seem like more than just arithmetic.

GPT-5.1 handled the task with more discipline. It not only got the math right, but also put it into real-world context, where most people approach small sums of money or questions about filling a tank by rounding numbers rather than insisting on exact figures. It said: "You can find the fuel by dividing miles by mileage: 140 ÷ 27 is roughly 5.2 gallons. Multiply that by the price per gallon and you get about $19.70 in gas. Rounding everything a bit for some real-world margin, you're looking at about 5¼ gallons and about twenty dollars total."
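The arithmetic in this test is simple enough to verify directly. Here is a minimal Python sketch using the figures from my prompt (140 miles, 27 miles per gallon, $3.79 per gallon); the function name `trip_fuel_cost` is just an illustrative choice.

```python
def trip_fuel_cost(miles, miles_per_gallon, price_per_gallon):
    """Return (gallons used, total fuel cost in dollars) for a trip."""
    gallons = miles / miles_per_gallon
    cost = gallons * price_per_gallon
    return gallons, cost

gallons, cost = trip_fuel_cost(140, 27, 3.79)
print(f"{gallons:.2f} gallons, ${cost:.2f}")  # prints "5.19 gallons, $19.65"
```

The unrounded result is about 5.19 gallons and $19.65. Rounding the fuel up to 5.2 gallons before multiplying gives roughly $19.71, which is consistent with a ballpark of about twenty dollars.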

4. Facial structure

Next, I turned to GPT-5.1's image side and tested ChatGPT's ability to respond to an image request. I wanted the AI to create alternate versions of a photo while keeping the person's face completely identical. I asked the models to make two changes to my photo on the left. I asked for "a different haircut" and for "a full ringmaster costume," adding each time, "But keep my face and everything else exactly the same."

GPT-5.1's output is on the left and GPT-5's is on the right. You can see that while both models opted for a sort of mohawk, GPT-5's version didn't look much like my face. It's actually someone else in a suit similar to mine, but not identical, with a completely different color bow tie.

GPT-5.1 was much closer and managed to keep the same clothes, body and face. The realism of the mohawk is more questionable, but the AI seems to have respected the instruction about my face.

GPT-5 did better with my face for the ringmaster costume, but it made some weird choices, like keeping my shirt the same and adding a slightly cartoony jacket. GPT-5.1 largely preserved my face and at least did a better job of replacing my clothes with a full costume.

5. Fashion sense

GPT-5.1 is supposed to be better not only at producing consistent images, but also at understanding them. So I used the same photo and asked both models to classify the outfit as casual, business casual, or dressy and to explain their reasoning based solely on the details visible in the photo.

GPT-5 approached the task cautiously. It picked out the jacket, dress shoes and matching shirt and bow tie accurately, and it leaned toward describing the outfit as casual. However, the model hedged, and its description betrayed uncertainty as it tried to determine where the bow tie landed on the spectrum. It gave a defensible answer, but one that suggested it doubted itself.

GPT-5.1, on the other hand, provided a clearer and more confident interpretation. It identified the structured jacket, the formal shoes, the slim fit and the elegant nature of the bow tie. Based on the image alone, it judged the outfit dressy, picking up on the formal cues present throughout. It respected the rule of assuming nothing unseen and stayed strictly within the limits of what the image revealed. The explanation was detailed but concise, and GPT-5.1 displayed a more focused style of visual reasoning that gave the conclusion a sense of solidity.

The most noticeable improvement from GPT-5 to GPT-5.1 was consistency. GPT-5.1 respected word counts, sentence limits and image-based constraints, and it handled tone with a subtle but noticeable finesse. GPT-5 worked well, but GPT-5.1 worked better, and the evidence piled up.

Still, this is essentially an incremental improvement. These are sensible steps, not a leap into the strange or surreal. That raises questions about what comes next. If GPT-5.1 is the model that tightens the screws and calibrates the dials, GPT-6 could be a completely new engine. In that sense, GPT-5.1 is a reassuring sign: OpenAI is preparing for something bigger.

For now, GPT-5.1 is the better option, and fewer people, myself included, are likely to abandon it in favor of an older version, as people did when GPT-5 was released. It doesn't reinvent the wheel; it just makes the car ride more smoothly. And sometimes that kind of update is the most important one.

These differences do not mean that GPT-5 is obsolete. It’s still a remarkably capable model that offers solid performance for a wide range of tasks. But GPT-5.1 builds on and refines those fundamentals, making it a better choice for real-world use.