Back in engineering school, I had a professor who used to glory in the misleading assignment. He would ask questions containing elements of dubious relevance to the topic at hand in the hopes that it would knock us off our focus or that it would provide a distraction that would send us down a rabbit hole of unnecessary research.
Here's an example of the type of question he would ask. His questions were much harder and engineering-focused, but I've used this exact question because it's directly related to the study we'll be discussing:
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
My professor's goal was to help us identify what was relevant to the project at hand, and help us learn to ignore or set aside all the natural distractions that come with doing research.
It was initially a very painful -- but ultimately very useful -- set of lessons for first-year engineers.
I'm reminded of this challenge because of a research paper that came out this month from a team of Apple AI and machine learning researchers led by Samy Bengio, senior director, and Oncel Tuzel, distinguished scientist.
Their paper, "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," included the math problem shown above. If you look at the question, the phrase "but five of them were a bit smaller than average," should have no impact on the overall kiwi count.
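Stripped of the distractor clause, the arithmetic is trivial; a few lines of Python make the correct answer explicit:

```python
# Oliver's kiwi count. The "five were a bit smaller than average" clause
# doesn't change the total -- smaller kiwis are still kiwis.
friday = 44
saturday = 58
sunday = 2 * friday          # double Friday's pick
total = friday + saturday + sunday
print(total)                 # 190
```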
The researchers found that large language models (LLMs) like OpenAI's GPT-4o-mini, GPT-4o, o1-mini, and o1-preview fall prey to the sorts of questions that require actual reasoning, as distinguished from very high-level text processing.
Now, to be fair, I ran that query against ChatGPT GPT-4o, which answered correctly. I wouldn't take that to mean that Apple's conclusions are incorrect, only that ChatGPT correctly handled this one.
On the other hand, we all know that the AI could have just as easily answered with some discussion of the number of actual kiwi birds waddling through the nighttime forests of Otorohanga, New Zealand.
It makes sense, then, that the ultimate conclusion of Apple's research is that LLMs are incapable of true reasoning and rely on pattern matching instead.
To some degree, this is a tell-us-something-we-don't-know sort of conclusion. Even so, it is good to have researchers -- of the caliber that Apple has applied to this problem -- confirm it scientifically. And with that, let's dive into the science.
Benchmark datasets
As it turns out, asking ChatGPT to translate the Constitution into pirate-speak is not a comprehensive way to test LLMs, even if it does result in a rollicking good time.
Instead, researchers have developed far less amusing but more effective AI test frameworks designed to measure how well language models evaluate math problems.
In 2021, OpenAI introduced GSM8K, a benchmark dataset used to evaluate the reasoning of LLMs. The acronym tells you what the dataset contains: Grade School Math 8K, a set of roughly 8,500 grade-school math word problems.
The dataset, when applied to an AI, helps researchers determine how accurate the AI is, and whether it can work out reasoning problems as well as basic math. GSM8K is considered the gold standard for evaluating the mathematical reasoning capabilities of LLMs, particularly with arithmetic and word problems.
Because it's open source, GSM8K also has been widely used in the AI field (both inside and outside of OpenAI) to test tasks requiring step-by-step reasoning. It has a clear problem structure, which has made it a trusted tool for AI researchers doing early-stage testing on their LLMs.
The Apple researchers, on the other hand, consider this dataset fundamentally flawed. They contend that the test results from GSM8K may present an overly positive view of a given LLM's capabilities. That's because the test set is based on fixed, familiar questions that may have been used in the LLM's training set.
The paper cited above introduces a new dataset, GSM-Symbolic, that the researchers say overcomes the limitations in GSM8K. GSM-Symbolic offers more varied and complex problems, which prevent the LLMs from working off of stored training data.
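To give a flavor of the idea (this is my own illustrative sketch, not Apple's code), a symbolic template replaces the fixed names and numbers of a GSM8K-style question with variables, so every generated variant is novel to the model but still has a ground-truth answer that can be computed exactly:

```python
import random

# Hypothetical illustration of symbolic templating in the spirit of
# GSM-Symbolic: names and values vary per instance, while the answer
# is computed deterministically from the same template.
TEMPLATE = ("{name} picks {a} kiwis on Friday. Then {name} picks {b} kiwis "
            "on Saturday. On Sunday, {name} picks double the number of kiwis "
            "picked on Friday. How many kiwis does {name} have?")

def make_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Oliver", "Mia", "Ravi", "Chen"])
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    question = TEMPLATE.format(name=name, a=a, b=b)
    answer = a + b + 2 * a  # ground truth, derived from the template
    return question, answer

question, answer = make_variant(random.Random(0))
print(question)
print("answer:", answer)
```

Because the answer is derived from the same variables that fill the template, a model that merely memorized the original fixed question gets no help from its training data.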
The paper mentions that some models, like Google's Gemma2-9B, showed markedly different results using the two benchmark datasets. Gemma2-9B was able to solve the problems in OpenAI's GSM8K dataset correctly, but accuracy dropped by 15% when it was subjected to Apple's GSM-Symbolic set of tests.
The Apple researchers found that as questions increased in complexity (they called it "adding clauses"), accuracy dropped. This effect couldn't be measured with GSM8K because its questions are fixed. According to Apple, models that scored in the high-80% to 90% range could drop to around 40% as the number of clauses increased.
Apple contends that there is some risk of data contamination in GSM8K, meaning that the models might have been trained on parts of the dataset. GitHub, which hosts the GSM8K dataset, has been used to help train LLMs.
Using GitHub for training data has never seemed like a good idea to me. I have old code in my GitHub repository and I'm very well aware of how buggy it is. I wouldn't want to use that as example code to train my students, let alone use it to teach the AIs we rely upon for good answers.
In any case, Apple's GSM-Symbolic does not appear to be open source. So while Apple's researchers contend it's the better solution for testing LLMs, you can't have access to it unless you work at Apple in the right group and bleed in six colors.
What does it all mean?
A part of me is suspicious about Apple's motivation for this paper, in that it seems like some sort of super-nerd competitive comparison beat-down of OpenAI, especially since Apple is coming out with its own AI offerings.
On the other hand, Apple is planning to include ChatGPT in its Apple Intelligence offerings, so it doesn't seem appropriate to attribute sheer competitive orneriness as the justification for producing a paper like this. Therefore, I believe that the motivations were probably just what they seem: genuine academic interest in improving understanding of learning model performance and accuracy.
The research proves what we pretty much knew all along: LLMs perform better at pattern matching than they do at logical reasoning. They use pattern recognition in their training and processing, rather than actual deduction. The fact that so much of the world's information can be convincingly portrayed simply out of pattern recognition is startling, but it still doesn't get us computers that can really reason.
Mathematical reasoning is spotty. The example that Apple's researchers used as a failed test passed during my tests. That's not to say Apple's team is wrong, but it reinforces the premise that AIs are inconsistent and ever-evolving. Therefore, relying on LLMs for mathematical results isn't necessarily a practical approach. If you want good math, use old-school algorithms and traditional software engineering test and validation methods, or at least double-check the results the AI gives you.
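As a minimal sketch of that double-checking approach (the function names here are my own, not from any particular library), you can refuse an LLM's numeric claim unless it matches a deterministic computation:

```python
def kiwi_total(friday: int, saturday: int) -> int:
    """Deterministic ground truth: Sunday's pick is double Friday's."""
    return friday + saturday + 2 * friday

def check_llm_answer(llm_text: str, friday: int, saturday: int) -> bool:
    """Accept the model's answer only if it matches the computed total."""
    try:
        claimed = int(llm_text.strip())
    except ValueError:
        return False  # the model didn't even return a number
    return claimed == kiwi_total(friday, saturday)

print(check_llm_answer("190", 44, 58))   # correct total is accepted
print(check_llm_answer("185", 44, 58))   # distractor-influenced answer is rejected
```

The point isn't the kiwi math; it's the pattern of treating the LLM's output as a claim to be verified rather than a result to be trusted.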
Another concern for those considering relying on LLM output in production scenarios is the drop in accuracy as complexity increases. While that pattern does roughly mirror how humans deal with data (the more complex it gets, the more headaches we get), the difference between LLMs and us is that we do practice actual reasoning.
So what are the business implications of the research results in Apple's paper? That's next.
Business implications and risk mitigation
The implications are obvious unless you've been looking at AI through rose-colored glasses. AI is a helpful tool, but don't rely on it to handle complex decisions. It's just not wise to abdicate all responsibility to an AI or an LLM because it's a promising new technology.
I've shown a few times how I used AI to help give me some insights based on corporate data, but I always sanity-tested the results by thinking through the analysis, checking it against what I already knew, and -- ultimately -- making my own determinations and decisions. The AI was an interesting supporting tool, but my own management background was key to making decisions for my own business.
AIs are full of potential. I've used them to help me program, for instance. I'm sure ChatGPT saved me a month of programming time last year. But I didn't rely on the AI to design my code or write the business logic sections. I used it simply to give me interfaces into very common APIs which I would have otherwise had to spend time looking up, and which were easy to test.
Don't expect AI to replace your subject matter experts. AI can support the efforts of human experts, but when it comes to deep reasoning or critical thinking, AIs are fallible. Look at it this way: If you wouldn't trust a college freshman or your neighbor's kid to make decisions about your business, don't trust an AI.
We know that AIs hallucinate. We know that they sometimes come up with completely nutball conclusions based on the data they've been given. If your business is relying on data to make decisions, don't assume an AI will give you the right data.
That brings us to risk mitigation: Invest in AI cautiously. Look for strategic areas where it excels.
For example, in my day-to-day work, I find AI delivers high returns in the photo editing capabilities of Photoshop to remove backgrounds, or the gimbal that points the camera at me no matter where I am in the room when recording a YouTube video. I also use it for generative text and generative images, but never for mission-critical projects.
Make absolutely sure you have systems in place to verify that human oversight is actually happening and not slipping. Keep human intelligence in the loop, especially for critical operations.
Extend that caution to your team. Everyone's been reading and hearing about the wonders of generative AI, but may not be aware of its limitations. Be sure all your team members know that tools like LLMs are just that: tools. Resist the temptation of complacency.
Apple's research conclusions
It's interesting that Apple, which has put so much marketing hype into Apple Intelligence, is also showcasing the limits of the technology. In a way, that kind of transparency is encouraging.
Apple has been using machine learning as a tool for regularly improving its photo processing capabilities. But while those technologies use a great deal of math, they do not require independent human reasoning.
Expect to continue seeing Apple invest heavily in AI technologies where AI is strong, even along the company's supply chain. But I don't expect that Apple's executive team will cede decision making to an LLM.
This research shows both that LLMs have notable limitations as project complexity increases and that Apple is investing in testing the limits of LLMs and factoring those results into how much it relies upon these new technologies.
For a company rarely transparent about its underlying decision making, this paper is compelling insight into the detailed research Apple is doing to help it understand the strengths and limits of the hottest new technology of the decade.
What do you think? Did Apple come to the right conclusions? Have you tried to use AI for decision making? What are you using LLMs for now, and what do you hope to use them for in the future? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.