Windows Central
Technology
Kevin Okemwa

Apple researchers claim OpenAI's o3 model is an "illusion of thinking", inconsistent with a human's thought process

(Image caption: A man records a live video about new Apple hardware following Apple's "It's Glowtime" event in Cupertino, California, September 9, 2024.)

We've come a long way from the early days of OpenAI's ChatGPT launch and Bing Chat (now Microsoft Copilot). Key players in the AI landscape, including OpenAI, Google, and Anthropic, are seemingly leaning more toward reasoning models as their core focus.

Last year, a report emerged claiming these top AI labs would soon be unable to develop more advanced AI models due to a shortage of high-quality content for model training. However, OpenAI CEO Sam Altman dismissed the claims, stating that "there's no wall," while former Google CEO Eric Schmidt said there's no evidence that scaling laws have begun to stall.

A new research paper by Apple raises concerns about the reasoning capabilities of the latest large reasoning models (LRMs). The findings reveal that while LRMs outperform standard LLMs on moderately complex tasks, both types of model struggle to produce the desired results as task complexity increases.

The paper specifically examines Anthropic's Claude 3.7 Sonnet Thinking, OpenAI's o3, Google's Gemini, and DeepSeek's R1, assessing their reasoning capabilities beyond standard math and coding benchmarks by designing controlled puzzle environments, including the "Tower of Hanoi."
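The Tower of Hanoi is a useful test case because its optimal solution is a well-known recursive algorithm, yet the number of required moves grows exponentially with the number of disks. A minimal sketch of that standard algorithm (this is illustrative only, not the evaluation harness from Apple's paper):

```python
def hanoi(n, source, target, spare, moves):
    """Append the optimal move sequence for n disks to `moves`.

    Moving n disks takes 2**n - 1 moves, so task complexity
    grows exponentially as disks are added.
    """
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the way
    moves.append((source, target))               # move the largest disk
    hanoi(n - 1, spare, target, source, moves)   # re-stack on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves))  # 7 moves for 3 disks (2**3 - 1)
```

The exponential move count is what lets researchers dial task complexity up smoothly while keeping the underlying rule set fixed.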

The researchers wanted to evaluate the models' reasoning process itself, rather than their ability to arrive at the desired outcome or answer. According to the Apple researchers' findings:

"While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood."

Apple's researchers described large reasoning models as presenting merely an "illusion of thinking." The study found that standard LLMs and LRMs produced similar results and responses on simple queries, but LRMs gained a competitive edge as the queries became more complex.

The researchers attributed the LRMs' competitive edge to their structured reasoning mechanisms, such as "Chain-of-Thought" prompting. Notably, both LRMs and standard LLMs without reasoning capabilities failed to deliver results once the queries became too complex.

Perhaps more interestingly, the research found that the reasoning models spent longer processing complex queries; however, as they edged closer to failure, they surprisingly shortened their reasoning effort despite "having an adequate token budget."

Even when provided with correct algorithms, the LRMs struggled to execute complex tasks through the conventional step-by-step reasoning process, highlighting flaws and inconsistencies in their logical computation.

This news follows a separate report suggesting that Apple might be two years behind OpenAI's ChatGPT. Microsoft CEO Satya Nadella has argued that the ChatGPT maker enjoyed a two-year runway to build and develop ChatGPT uncontested.

Apple continues to lag behind in the AI race, delaying its Apple Intelligence AI strategy to 2026, prompting multiple users to brand it as vaporware and an afterthought designed to push iPhone 16 sales.
