Recent findings from Apple researchers have cast doubt on the mathematical prowess of large language models (LLMs), challenging the notion that artificial intelligence (AI) is on the brink of human-like reasoning.
In a test of 20 state-of-the-art LLMs, performance on grade-school math problems plummeted when questions were slightly reworded or irrelevant information was added, Apple found. In one example from the paper, tacking on the innocuous detail that a few of the kiwis being counted were “a bit smaller than average” was enough to push models into wrongly subtracting them. Accuracy dropped by as much as 65.7%, revealing a startling fragility in AI systems when faced with tasks requiring robust logical reasoning.
This weakness could have far-reaching implications for businesses that rely on AI for complex decision-making. Financial institutions, in particular, may need to reassess their use of AI in tasks involving intricate calculations or risk assessment.
At the heart of this debate lies the concept of artificial general intelligence (AGI), the holy grail of AI: a system that could match or surpass human intelligence across a wide range of tasks. While some tech leaders predict AGI’s imminent arrival, these findings suggest that goal may be further off than previously thought.
“Any real-world application that requires reasoning of the sort that can be definitively verified (or not) is basically impossible for an LLM to get right with any degree of consistency,” Selmer Bringsjord, professor at Rensselaer Polytechnic Institute, told PYMNTS.
Bringsjord draws a clear line between AI and traditional computing: “What a calculator can do on your smartphone is something an LLM can’t do — because if someone really wanted to make sure that the result of a calculation you called for from your iPhone is correct, it would be possible, ultimately and invariably, for Apple to verify or falsify that result.”
Not all experts view the limitations exposed in the Apple paper as equally problematic. “The limitations outlined in this study are likely to have minimal impact on real-world applications of LLMs. This is because most real-world applications of LLMs do not require advanced mathematical reasoning,” Aravind Chandramouli, head of AI at data science company Tredence, told PYMNTS.
Potential solutions exist, such as fine-tuning or prompt-engineering pre-trained models for specific domains. Specialized models like WizardMath and MathGPT, designed for mathematical tasks, could enhance AI’s capabilities in areas requiring rigorous logical thinking.
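To picture the prompt-engineering route, consider the minimal sketch below. It structures the request to counter the exact failure mode Apple observed, distracting but irrelevant details; the `complete` callable and the `MATH_PROMPT` template are illustrative assumptions, not any vendor’s actual API.

```python
# A minimal sketch of the prompt-engineering approach. The `complete`
# callable stands in for any text-completion client; it and MATH_PROMPT
# are illustrative assumptions, not a specific product's API.

MATH_PROMPT = """You are solving a grade-school math word problem.
1. List only the quantities that affect the answer, and explicitly
   set aside any details that do not.
2. Show each arithmetic step on its own line.
3. Finish with a line of the form: Answer: <number>

Problem: {problem}"""


def solve(problem: str, complete) -> str:
    """Send the structured prompt to any prompt-in, text-out callable."""
    return complete(MATH_PROMPT.format(problem=problem))


# Usage, with a hypothetical client:
# answer = solve("Oliver picks 44 kiwis on Friday...", my_model.complete)
```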
The debate extends beyond math to a fundamental question: Do these AIs truly understand anything? This issue is central to discussions about AGI and machine cognition.
“LLMs have no understanding whatsoever of what they do. They are just searching for sub-linguistic patterns from among those that are in the stored data that are statistically analogous to those in that data,” Bringsjord said.
Said Chandramouli: “While their coherent answers can create the illusion of understanding, the ability to map statistical correlations in data does not imply that they genuinely understand the tasks they are performing.” This insight highlights the challenge of distinguishing between sophisticated pattern recognition and true comprehension in AI systems.
Eric Bravick, CEO of The Lifted Initiative, acknowledged the current limitations but pointed to potential solutions. “Large language models (LLMs) are not equipped to perform mathematical calculations. They don’t understand mathematics,” he said. However, he suggested that pairing LLMs with specialized AI sub-systems could lead to more accurate results.
“When paired with specialized AI sub-systems that are trained in mathematics, they can retrieve accurate answers rather than generating them based on their statistical models trained for language production,” Bravick said. Emerging technologies like retrieval-augmented generation (RAG) systems and multimodal AI could address current limitations in AI reasoning.
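The pattern Bravick describes can be pictured as a simple division of labor: the model only translates a word problem into a bare expression, and a deterministic calculator produces the number. The sketch below illustrates this under stated assumptions; the `call_llm` stub, the `calculator` helper and the `answer` function are hypothetical names, not a real provider’s API.

```python
# Sketch of the "LLM + specialized sub-system" pattern: the model is
# asked only to translate a word problem into a bare arithmetic
# expression, and a deterministic calculator computes the result. The
# `call_llm` stub is a placeholder assumption, not a real provider API.
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def calculator(expression: str):
    """Exactly evaluate a pure arithmetic expression (numbers and
    operators only; names and function calls are rejected)."""

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


def call_llm(prompt: str) -> str:
    """Placeholder: a real system would ask the model to rewrite the
    word problem as a bare expression, e.g. '44 + 58 + 2 * 44'."""
    raise NotImplementedError("wire in a model provider here")


def answer(question: str) -> str:
    expression = call_llm(f"Rewrite as one arithmetic expression: {question}")
    return f"{expression} = {calculator(expression)}"  # deterministic step


# The deterministic half works on its own and is independently checkable:
print(calculator("44 + 58 + 2 * 44"))  # 190
```

The split is the point: the final arithmetic step is exactly the kind of result that, in Bringsjord’s terms, can be definitively verified, however the model phrased its translation.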
The field of AI continues to evolve rapidly, with LLMs showing remarkable language processing and generation capabilities. However, their struggles with logical reasoning and mathematical understanding show that significant work is still needed to reach AGI.
Careful evaluation and testing of AI systems remain crucial, particularly for high-stakes applications requiring reliable reasoning. Researchers and developers may find promising paths in approaches like fine-tuning, specialized models and multimodal AI systems as they work to bridge the gap between current AI capabilities and the robust, general intelligence many envision.