Apple says generative AI cannot think like a human - research paper pours cold water on reasoning models

(Image credit: Apple)

Apple researchers have tested advanced AI reasoning models — known as large reasoning models (LRMs) — in controlled puzzle environments and found that while they outperform 'standard' large language models (LLMs) on moderately complex tasks, both fail completely as complexity increases.

The researchers from Apple, which is not exactly at the forefront of AI development, believe that current LRMs and LLMs have fundamental limits in their ability to generalize reasoning, or rather to think the way humans do.

Apple researchers studied how advanced AI models — the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs — handle increasingly complex problem-solving tasks. They moved beyond standard math and coding benchmarks and designed controlled puzzle environments, such as Tower of Hanoi and River Crossing, where they could precisely adjust problem complexity. Their goal was to evaluate not just final answers but also the internal reasoning processes of these models, comparing them to standard large language models under equal computational conditions. Through the puzzles, they aimed to uncover the true strengths and fundamental limits of AI reasoning.
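For illustration, a controlled puzzle environment of this sort can be very small: complexity is dialed up simply by adding disks, and a candidate answer is checked mechanically rather than judged by another model. The Python sketch below is a minimal, hypothetical verifier in that spirit, not Apple's actual test harness.

```python
# Minimal sketch of a controlled Tower of Hanoi environment (illustrative only).
# Complexity scales with the number of disks; a proposed move list is checked
# for legality and completion instead of being graded by another model.

def is_valid_solution(num_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Check whether a list of (source_peg, target_peg) moves solves the puzzle."""
    pegs = [list(range(num_disks, 0, -1)), [], []]  # peg 0 starts with all disks, largest at bottom
    for src, dst in moves:
        if not pegs[src]:                       # cannot move from an empty peg
            return False
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:  # cannot place a larger disk on a smaller one
            return False
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(num_disks, 0, -1))  # solved when every disk sits on peg 2
```

Calling is_valid_solution(3, moves) with the optimal seven-move sequence returns True, and adding a single disk roughly doubles the length of the shortest solution, which is what makes the difficulty knob so easy to turn.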

Apple researchers discovered that LRMs perform differently depending on problem complexity. On simple tasks, standard LLMs without explicit reasoning mechanisms were more accurate and efficient, delivering better results with fewer compute resources. As problem complexity increased to a moderate level, however, models equipped with structured reasoning, such as Chain-of-Thought prompting, gained the advantage and outperformed their non-reasoning counterparts. When complexity grew further still, both types of models failed completely: their accuracy dropped to zero regardless of the available compute resources. (Keep in mind that the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs have limitations when it comes to their training.)
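To make the idea of a complexity sweep concrete, the sketch below shows how accuracy could be tallied as the disk count rises. The `ask_model` and `parse_moves` placeholders stand in for a model API call and its output parser; neither comes from the paper.

```python
# Hedged sketch of a complexity sweep, in the spirit of the study's setup.
# `ask_model` and `parse_moves` are hypothetical placeholders, not real APIs.

def accuracy_by_complexity(ask_model, parse_moves, max_disks: int = 10, trials: int = 5):
    """Return {num_disks: fraction of trials the model's move list actually solved}."""
    results = {}
    for n in range(1, max_disks + 1):
        solved = 0
        for _ in range(trials):
            prompt = (f"Solve Tower of Hanoi with {n} disks. "
                      "Reply only with (source_peg, target_peg) moves, pegs numbered 0 to 2.")
            moves = parse_moves(ask_model(prompt))   # model-specific parsing, assumed
            solved += is_valid_solution(n, moves)    # verifier from the sketch above
        results[n] = solved / trials
    return results
```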

A deeper analysis of the reasoning traces revealed inefficiencies and unexpected behavior. Initially, reasoning models used longer thought sequences as problems became harder, but near the failure point, they surprisingly shortened their reasoning effort even when they had sufficient compute capacity left. Moreover, even when explicitly provided with correct algorithms, the models failed to reliably execute step-by-step instructions on complex tasks, exposing weaknesses in logical computation. The study also found that model performance varied significantly between familiar and less-common puzzles, suggesting that success often depended on training data familiarity rather than true generalizable reasoning skills.
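For context, a correct Tower of Hanoi procedure is only a few lines long, yet the move list it produces grows exponentially with the number of disks, which is exactly the kind of long, exact execution the study says the models could not sustain. The recursive version below is the textbook formulation, not necessarily the one the researchers supplied to the models.

```python
# The textbook recursive Tower of Hanoi procedure (for reference). The code is
# tiny, but the move list it emits grows as 2**n - 1 with the number of disks.

def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the full move list for n disks from peg `src` to peg `dst`."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)      # park n-1 disks on the auxiliary peg
            + [(src, dst)]                         # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst))   # stack the n-1 disks back on top of it

assert len(hanoi_moves(10)) == 2**10 - 1  # 1,023 moves for just ten disks
```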



Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and the latest fab tools to high-tech industry trends.
