AI is bad at Sudoku. It’s even worse at showing its work

Chatbots are really impressive when you see them do things they’re good at, like writing a basic email or creating strange, futuristic-looking images. But ask generative AI to solve one of those puzzles at the back of a newspaper, and things can quickly go off the rails.

That’s what researchers at the University of Colorado at Boulder discovered when they challenged large language models to solve Sudoku. And not even the typical 9×9 puzzles. A simpler 6×6 puzzle was often beyond the capabilities of an LLM without outside help (in this case, specific puzzle-solving tools).

A more important finding came when the models were asked to show their work. For the most part, they couldn’t. Sometimes they lied. Sometimes they explained things in ways that didn’t make sense. Sometimes they would hallucinate and start talking about the weather.

If generative AI tools can’t explain their decisions accurately or transparently, that should make us cautious about giving them more control over our lives and decisions, said Ashutosh Trivedi, a computer science professor at the University of Colorado at Boulder and one of the authors of the paper, published in July in Findings of the Association for Computational Linguistics.

“We would really like those explanations to be transparent and reflect why the AI made that decision, and not the AI trying to manipulate the human by giving an explanation that a human might like,” Trivedi said.

The paper is part of a growing body of research on the behavior of large language models. Other recent studies have found, for example, that models hallucinate in part because their training procedures incentivize them to produce results the user will like rather than accurate ones, or that people who use LLMs to help them write essays are less likely to remember what they wrote. As AI becomes a bigger part of our daily lives, the implications of how this technology works and how we behave when using it become enormously important.

When you make a decision, you can try to justify it, or at least explain how you arrived at it. An AI model may not be able to do the same accurately or transparently. Would you trust it?

Why LLMs have difficulty with Sudoku

We’ve seen AI models fail at basic games and puzzles before. OpenAI’s ChatGPT (among others) has been completely crushed at chess by the computer opponent in a 1979 Atari game. A recent Apple research paper found that the models may have problems with other puzzles, such as the Tower of Hanoi.

It has to do with the way LLMs work and fill in gaps in information. These models try to fill those gaps based on what happened in similar cases in their training data or other things they’ve seen in the past. With a Sudoku, the problem is one of logic. The AI might try to fill each gap in order, based on what seems like a reasonable answer, but to solve it properly, it has to look at the whole picture and find a logical order that changes from puzzle to puzzle.
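To see the difference, here is a minimal sketch (mine, not from the paper or the study’s tooling) of how a conventional solver treats a 6×6 Sudoku as a pure logic problem: try a candidate, check the row, column, and box constraints, and backtrack the moment a choice leads to a dead end. The puzzle in the demo is made up for illustration.

```python
# Minimal 6x6 Sudoku solver sketch: backtracking over row, column, and 2x3 box
# constraints. Illustrative only; not the puzzle-solving tools used in the study.

def fits(grid, r, c, v):
    """Return True if value v can go at (r, c) without breaking any rule."""
    if v in grid[r]:                                   # row constraint
        return False
    if any(grid[i][c] == v for i in range(6)):         # column constraint
        return False
    br, bc = (r // 2) * 2, (c // 3) * 3                # top-left corner of the 2x3 box
    return all(grid[br + i][bc + j] != v for i in range(2) for j in range(3))

def solve(grid):
    """Fill zeros in place; backtrack when a partial assignment hits a dead end."""
    for r in range(6):
        for c in range(6):
            if grid[r][c] == 0:
                for v in range(1, 7):
                    if fits(grid, r, c, v):
                        grid[r][c] = v
                        if solve(grid):
                            return True
                        grid[r][c] = 0                 # undo and try the next value
                return False                           # nothing fits here: backtrack
    return True                                        # no empty cells left

if __name__ == "__main__":
    # A made-up 6x6 puzzle (0 = empty), blanked from a known-valid solution.
    puzzle = [
        [1, 0, 3, 0, 5, 0],
        [0, 5, 0, 1, 0, 3],
        [2, 0, 1, 0, 6, 0],
        [0, 6, 0, 2, 0, 1],
        [3, 0, 2, 0, 4, 0],
        [0, 4, 0, 3, 0, 2],
    ]
    if solve(puzzle):
        for row in puzzle:
            print(row)
```

The point isn’t the code itself but the discipline it encodes: every placement is checked against the whole grid, and a wrong guess is undone, rather than papered over with whatever looks plausible next.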

Chatbots are bad at chess for a similar reason. They find logical next moves, but they don’t necessarily think three, four, or five moves ahead, the fundamental skill needed to play chess well. Chatbots also sometimes tend to move chess pieces in ways that don’t really follow the rules or put pieces in meaningless danger.

You might expect LLMs to be able to solve Sudoku because they are computers and the puzzle consists of numbers, but the puzzles themselves are not really mathematical; they are symbolic. “Sudoku is famous for being a number puzzle that can be done with anything other than numbers,” said Fabio Somenzi, a CU professor and one of the authors of the research paper.
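A tiny illustration of that point, again my own rather than the researchers’: relabel the digits as letters and the puzzle is unchanged, because Sudoku’s rules only ever ask whether the symbols in a row, column, or box are all different; they never add or compare them.

```python
# Sudoku's only rule is "all different" -- it works for any symbols, not just digits.

def all_different(cells):
    return len(set(cells)) == len(cells)

digits_row = [1, 4, 2, 6, 3, 5]
letters_row = ["A", "D", "B", "F", "C", "E"]   # the same row, digits relabeled as letters

print(all_different(digits_row))   # True
print(all_different(letters_row))  # True
```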

I took a sample prompt from the researchers’ paper and gave it to ChatGPT. The tool showed its work and repeatedly told me it had the answer before displaying a puzzle that didn’t work, then going back and correcting it. It was like the bot was turning in a presentation that kept getting last-second edits: This is the final answer. No, actually, never mind. This is the final answer. It eventually got the answer through trial and error. But trial and error isn’t a practical way for a person to solve a Sudoku in the newspaper. That’s too much erasing, and it ruins the fun.

AI struggles to show its work

The Colorado researchers didn’t just want to see whether the bots could solve puzzles. They asked for explanations of how the bots worked through them. Things didn’t go well.

When testing OpenAI’s o1-preview reasoning model, the researchers saw that the explanations, even for correctly solved puzzles, did not accurately explain or justify the moves and got basic terms wrong.

“One thing they are good at is giving explanations that seem reasonable,” said Maria Pacheco, an assistant professor of computer science at CU. “They align themselves with humans, so they learn to talk the way we like, but whether they are true to the actual steps that need to be taken to solve the problem is where we’re struggling a little bit.”

Sometimes the explanations were completely irrelevant. Since work on the paper was completed, the researchers have continued testing new models as they are released. Somenzi said that when he and Trivedi were running OpenAI’s o4 reasoning model through the same tests, at one point it seemed to give up entirely.

“The next question we asked, the answer was the weather forecast for Denver,” he said.

Explaining yourself is an important skill

When you solve a puzzle, you can almost certainly walk someone else through your thinking. The fact that these LLMs failed so spectacularly at that basic task is not a trivial problem. With AI companies constantly talking about “AI agents” that can take actions on your behalf, the ability to explain yourself is essential.

Let’s consider the types of jobs that are currently assigned to AI, or are planned for the near future: driving, doing taxes, deciding business strategies, and translating important documents. Imagine what would happen if you, a person, did one of those things and something went wrong.

“When humans have to put a face to their decisions, they better be able to explain what led them to that decision,” Somenzi said.

It’s not just about getting an answer that seems reasonable. It has to be precise. One day, an AI’s explanation of itself might have to hold up in court, but how can its testimony be taken seriously if it is known to lie? You wouldn’t trust a person who didn’t explain themselves, and you wouldn’t trust someone who told you what you wanted to hear instead of the truth.

“Having an explanation is very close to manipulation if it’s done for the wrong reason,” Trivedi said. “We have to be very careful with the transparency of these explanations.”