An earlier model scored just 13% on the qualifying exam for the International Mathematical Olympiad (the top high school math competition). The new model, called o1, raised that score to 83%.

Niloofar Mireshghallah, a University of Washington postdoctoral scholar in the Paul G. Allen School of Computer Science & Engineering, studies the privacy and societal implications of large language models, such as ChatGPT.
UW News spoke with her about why math and reasoning have so challenged these artificial intelligence models and what the public should know about OpenAI's new release.

There are two main reasons. One is that it is hard to figure out rules and principles when a model only does next-word prediction; to do math, you need to go back and forth a bit and deduce. Regarding more logical or commonsense reasoning, another reason for the difficulty is that, as my advisor Yejin Choi says, common sense is like dark matter. It's there, but we don't see it or say it.
We know that the door to the fridge shouldn't be left open, but there is little text saying that. If there is no text for something, models won't pick it up. The same goes for social norms and other forms of reasoning.

This whole "take its time" framing is a simplification of what is happening, which we call test-time computation. Up until now, big companies would scale models by sizing up both the models and the training data. But the companies may have reached saturation there: there is no more pre-training data, and sizing up models may not help us much more.
This investment in test time helps the model do internal reasoning, so it can try to decompose problems and do multiple iterations. This is called chain-of-thought reasoning, which is like showing your work in a math problem, but for language and thinking tasks. Instead of just giving a final answer, the AI works step by step, writing down each step of its reasoning process.
Imagine you're asked to solve a word problem: "If Sally has 3 apples and gives 2 to her friend, how many does she have left?" A normal AI response might just say, "1 apple."
But with chain-of-thought reasoning, it would look more like this:
Sally starts with 3 apples
She gives away 2 apples
To find out how many are left, we subtract: 3 − 2 = 1
Therefore, Sally has 1 apple left
This step-by-step process helps in a few ways: It makes the AI’s reasoning more transparent, so we can see how it arrived at its answer and, in the case of a mistake, potentially spot where things went wrong.
Chain-of-thought reasoning is especially useful for more complicated tasks, such as answering multi-step questions, solving math problems or analyzing situations that require several logical steps.
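The Sally example can be sketched as toy code. The "solver" below is plain arithmetic, not a real language model; it only mimics the two response formats being contrasted, answer-only versus step-by-step, and every function name here is a made-up stand-in for illustration.

```python
# Toy illustration of answer-only output versus chain-of-thought style
# output. The "solver" is plain arithmetic, not a real language model;
# it only mimics the two response formats described above.

def direct_answer(start: int, given_away: int) -> str:
    """Answer-only style: state the result with no intermediate steps."""
    return f"{start - given_away} apple"

def chain_of_thought(start: int, given_away: int) -> str:
    """Step-by-step style: write down each reasoning step, then conclude."""
    left = start - given_away
    steps = [
        f"Sally starts with {start} apples",
        f"She gives away {given_away} apples",
        f"To find out how many are left, we subtract: {start} - {given_away} = {left}",
        f"Therefore, Sally has {left} apple left",
    ]
    return "\n".join(steps)

print(direct_answer(3, 2))       # -> 1 apple
print(chain_of_thought(3, 2))
```

The point of the step-by-step version is not the arithmetic but the transparency: each intermediate line is visible, so a mistake in any one step can be spotted and traced.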
In a sense, the model can test its own response, as opposed to just doing next-word prediction. One problem before was that if a model predicted one word wrong, it kind of had to commit, and it would get derailed because all its following predictions are based in part on that wrong prediction.
This form of chain-of-thought reasoning and response generation is the closest procedure we have to human thinking so far. We are not entirely sure how this internal reasoning fully works, but now the model can take the time to test its own response.
Researchers have shown models finding their own mistakes and ranking their own responses when offered multiple choices. For instance, in a recent paper, we showed that LLMs would spoil birthday surprises when generating a response, but when asked if their response was appropriate, they would realize the mistake. So this self-testing can help the model come up with a more logical response.

I think one thing people should be careful about is to still fact-check the model's outputs, and not be fooled by the model "thinking" and taking its time. Yes, we are getting better responses, but there are still failure modes.
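One simple way to picture this kind of test-time computation is self-consistency: sample several candidate answers and keep the one the model most often agrees with. The sketch below is a toy stand-in, not a real model; the hard-coded candidate list and the function names are assumptions made purely for illustration.

```python
from collections import Counter

# Toy sketch of test-time computation via self-consistency: sample
# several candidate answers, then keep the majority answer. The
# "model" here is a hard-coded list, not a real language model.

def sample_candidate_answers() -> list[str]:
    # Hypothetical candidates a model might produce for "3 - 2 = ?";
    # one sample goes wrong, but the majority agrees on "1".
    return ["1", "1", "2", "1", "1"]

def majority_vote(candidates: list[str]) -> str:
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer

print(majority_vote(sample_candidate_answers()))  # -> 1
```

The design choice mirrors the interview's point: a single next-word-prediction pass can get derailed by one wrong step, but spending extra computation on multiple attempts, and letting the model check or rank them, makes a single early mistake less likely to decide the final answer.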