Artificial intelligence (AI) chatbots that dazzle millions with eloquent poetry and computer code are operating with fundamental gaps in their grasp of how the world works, according to new research from MIT.
The findings raise urgent questions for businesses racing to integrate large language models into their operations. Experts warn that these AI systems may excel at pattern matching and language tasks while lacking the coherent worldview to make reliable decisions about complex real-world scenarios.
“While LLMs [large language models] excel at finding patterns within datasets and can make predictions about other patterns, they are limited to the data that they have access to, which may limit the accuracy of a decision in real time,” Nick Rioux, co-founder and CTO of AI-powered purchasing software company Labviva, told PYMNTS. “Humans do this,” he added.
LLMs can appear remarkably capable while lacking a genuine understanding of what they’re doing, according to MIT researchers who developed new ways to test AI systems’ grasp of the world. They found that while language models could expertly navigate New York City streets and play complex games, they failed basic tests of understanding the underlying systems they were working with.
The researchers developed evaluation metrics based on the Myhill-Nerode theorem, a classic result from automata theory, to assess whether AI models truly comprehend the structure of tasks like navigation and game-playing. When put to these more rigorous tests, even models that performed impressively on standard benchmarks revealed major gaps; for instance, a model that could reliably plot routes through Manhattan turned out to hold an incoherent internal map of the city that bore little resemblance to reality.
Most strikingly, GPT-4 and other leading language models could solve complex logic puzzles while failing basic tests of understanding the puzzle’s constraints and rules. The research suggests that AI systems are primarily pattern-matching rather than building coherent world models, raising questions about their reliability for real-world applications where genuine understanding is crucial.
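The core of the researchers' test can be illustrated with a toy probe. The sketch below is not the MIT team's code; `model.is_valid` and `true_world.state_after` are invented interfaces standing in for a model's judgment of legal moves and a ground-truth simulator. The Myhill-Nerode idea is that two action sequences ending in the same true state must allow exactly the same continuations, so a model that disagrees about those continuations cannot hold a coherent world model.

```python
# A minimal Myhill-Nerode-style probe (an illustrative sketch, not the
# MIT researchers' code). Two prefixes that lead to the same true state
# must admit exactly the same set of valid next actions; a mismatch is
# evidence the model's internal "map" is incoherent.

def valid_next_actions(model, prefix, actions):
    """Hypothetical helper: which actions does the model judge legal
    after `prefix`? `model.is_valid` is an assumed interface."""
    return {a for a in actions if model.is_valid(prefix + [a])}

def myhill_nerode_probe(model, true_world, prefix_a, prefix_b, actions):
    """Return True if the model respects the equivalence of two prefixes
    that the ground-truth world maps to the same state, False if it
    breaks that equivalence, and None if the probe doesn't apply."""
    if true_world.state_after(prefix_a) != true_world.state_after(prefix_b):
        return None  # the prefixes aren't equivalent in the true world
    return (valid_next_actions(model, prefix_a, actions)
            == valid_next_actions(model, prefix_b, actions))
```

In the Manhattan example, two different routes to the same intersection should leave exactly the same legal turns available; it is checks of this kind that exposed the navigation model's incoherent map.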
According to Paul Ferguson, founder of Clearlead AI and an AI consultant, how well an AI model handles a given task depends heavily on the specific model being used, which is often why tailored, customized AI solutions outperform generic ones.
“While general-purpose LLMs are impressive, I’ve seen significantly better results when we develop and fine-tune systems specifically for a company’s particular customer base and domain,” he said.
“However, customer needs and demands naturally evolve, so building mechanisms that allow the system to adapt and learn continuously is essential. This might include regular retraining cycles based on new customer interactions or implementing feedback loops that help the system improve its responses over time.”
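A feedback loop of the kind Ferguson describes can be as simple as logging each interaction with a user rating and triggering a retraining job once enough new examples accumulate. The sketch below uses invented names (`FEEDBACK_LOG`, `RETRAIN_THRESHOLD`, the caller-supplied `fine_tune_job`) and shows one possible shape for such a mechanism, not a prescribed implementation.

```python
# A minimal sketch of a feedback loop: log each interaction with its
# user rating, then retrain once enough fresh examples accumulate.
# All names here are illustrative assumptions.

import json
import os
import time

FEEDBACK_LOG = "feedback.jsonl"
RETRAIN_THRESHOLD = 500  # assumed batch size for one retraining cycle

def record_interaction(prompt, response, user_rating):
    """Append one interaction and its feedback signal to the log."""
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "prompt": prompt,
            "response": response,
            "rating": user_rating,  # e.g., a thumbs up/down from the user
        }) + "\n")

def maybe_retrain(fine_tune_job):
    """Run a caller-supplied retraining routine once the log holds
    enough new interactions, then rotate the log for the next batch."""
    if not os.path.exists(FEEDBACK_LOG):
        return
    with open(FEEDBACK_LOG) as f:
        examples = [json.loads(line) for line in f]
    if len(examples) >= RETRAIN_THRESHOLD:
        fine_tune_job(examples)
        os.remove(FEEDBACK_LOG)
```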
Ferguson said model explainability tools are among the best ways to ensure LLMs make accurate decisions on business tasks like pricing and inventory. These tools let businesses look inside a model's decision-making and confirm that it relies on legitimate factors rather than shortcuts picked up from its training data.
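For models with structured inputs, one concrete form such a check can take is permutation importance, which scikit-learn ships out of the box: shuffle one input feature at a time and measure how much the model's accuracy drops. (The tooling for probing LLMs directly is different, but the principle is the same.) The example below uses synthetic pricing data and invented feature names purely for illustration.

```python
# A hedged illustration of an explainability check: permutation
# importance on a toy pricing model. The data and feature names are
# synthetic; only the technique is the point.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Toy features: cost, demand index, competitor price, and day-of-week
# (the last one is deliberately pure noise, a placebo).
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

for name, importance in zip(["cost", "demand", "competitor_price", "day_of_week"],
                            result.importances_mean):
    print(f"{name}: {importance:.3f}")
# If the placebo "day_of_week" scored high here, the model would be
# leaning on a shortcut rather than a legitimate pricing factor.
```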
“We also need robust monitoring systems that can alert us to any unusual patterns or data drift: it’s like having an early warning system that tells you when your AI might be starting to make less reliable decisions,” he added. “This is particularly important in dynamic environments where conditions are constantly changing.”
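The "early warning system" Ferguson describes often reduces to comparing the distribution of live inputs against a reference sample from training time. A two-sample Kolmogorov-Smirnov test from SciPy is one simple way to do that; the threshold below is an assumed value to be tuned per feature, not a standard.

```python
# A minimal drift monitor: flag a feature when its live distribution
# diverges from the training-time reference distribution.

from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # assumed alert threshold; tune per feature

def check_drift(reference_values, live_values):
    """Return True (raise an alert) if the live values differ
    significantly from the reference sample for this feature."""
    stat, p_value = ks_2samp(reference_values, live_values)
    return p_value < DRIFT_P_VALUE

# Example: alert if this week's order values look unlike training data.
# check_drift(training_order_values, this_weeks_order_values)
```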
Ben Lengerich, assistant professor in the Department of Statistics at the University of Wisconsin-Madison, told PYMNTS that while LLMs don't update their knowledge on their own, connecting them to real-time data sources such as APIs and inventory databases helps them handle changing situations accurately, keeping them useful even as business needs shift. He pointed to customer service, where an LLM's conversational skills combined with current inventory data can give customers up-to-date product availability information.
“Building LLMs into AI systems rather than using them as standalone models can improve their reliability in dynamic tasks,” he said.
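In code, the pattern Lengerich describes looks roughly like the sketch below: fetch the live number from an inventory system and hand it to the model as context, rather than trusting the model's frozen training data. `get_inventory` and `llm_complete` are placeholders for whatever inventory API and model client a business actually uses.

```python
# A sketch of grounding an LLM in live data. `get_inventory` and
# `llm_complete` are assumed placeholders, not a real library's API.

def get_inventory(sku: str) -> int:
    """Placeholder for a real call to an inventory database or API."""
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an LLM provider."""
    raise NotImplementedError

def answer_availability(question: str, sku: str) -> str:
    """Answer a customer question using live stock data, not whatever
    the model memorized during training."""
    stock = get_inventory(sku)  # real-time data source, not model memory
    prompt = (
        f"Current stock for item {sku}: {stock} units.\n"
        f"Using only the stock figure above, answer the customer:\n"
        f"{question}"
    )
    return llm_complete(prompt)
```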