For all the talk of “societal-scale” risk, artificial intelligence’s (AI’s) biggest threat may be not to humanity, but to itself.
Copies of copies tend to degrade, and if the training data used to power future generative AI engines, including large language models (LLMs), Gaussian mixture models (GMMs) and variational autoencoders (VAEs), continues to be scraped from the internet, those engines will inevitably be trained on content produced by today’s generative AI tools.
And that isn’t good news for the reliability and usability of those AI models.
A group of leading academic researchers has published a paper titled “The Curse of Recursion: Training on Generated Data Makes Models Forget” showing that AI models’ output progressively degrades and loses intelligibility when the models are successively trained on data produced by other models.
“[T]he use of LLMs at scale to publish content on the internet will pollute the collection of data to train them,” the paper stated.
“Furthermore, we show that this process is inevitable,” it added.
The researchers have termed this phenomenon “model collapse.”
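The dynamic is simple to reproduce in miniature. The Python sketch below is a purely illustrative toy, not the researchers’ actual experiments: it uses the crudest possible “generative model,” one that simply resamples the previous generation’s output, and shows how diversity drains out of a data pool when each generation is trained only on what the previous generation produced.

```python
# Toy illustration of "model collapse": each generation "trains" only on
# data generated by the previous one. The "model" here is the simplest
# possible generative model, the empirical distribution, so producing the
# next generation's training data is just resampling with replacement.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # generation 0: "human-generated" data

for gen in range(1, 101):
    # The next generation sees only content produced by the current one.
    data = rng.choice(data, size=data.size, replace=True)
    if gen % 20 == 0:
        print(f"generation {gen:3d}: {np.unique(data).size:4d} distinct "
              f"values, std = {data.std():.3f}")
```

Run long enough, the pool converges toward a handful of endlessly repeated values; the rare, tail-of-the-distribution samples disappear first, which mirrors the “forgetting” of the true data distribution the paper describes.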
It holds far-reaching implications for organizations attempting to leverage generative AI’s revolutionary business capabilities. It also underscores the importance of data sovereignty and future-fit governance processes as they relate to AI-powered integrations across corporate operations.
See also: 10 Insiders on Generative AI’s Impact Across the Enterprise
As the research paper noted, nearly all the material stored online was originally produced and curated by humans.
But the internet revolutionized the way information could be shared, and it created new modes of communication, allowing text to be analyzed, modified and surfaced by search platforms and other layered-on solutions.
Now, generative AI, so called because of its ability to produce content on its own, is building another information layer atop the online landscape.
“At a high level, generative AI has the potential to create a new data layer, like when HTTP was created and gave rise to the internet beginning in the 1990s. As with any new data layer or protocol, governance, rules and standards must apply,” Shaunt Sarkissian, founder and CEO of AI-ID, told PYMNTS last month.
At the center of many business concerns around integrating generative AI solutions lie ongoing questions about the integrity of the data and information fed to AI models, as well as the provenance and security of those inputs.
The finding that AI models trained on data produced by other AI models undergo a degenerative process, over time forgetting the true underlying data distribution and producing increasingly simplistic outputs, only serves to emphasize the importance of using good data when developing AI tools meant for enterprise settings.
After all, achieving the kind of competitive advantage that translates to sustainable business success frequently boils down to a firm’s ability to access and leverage best-in-class data.
Read also: Preparing for a Generative AI World
“We’re about to fill the internet with blah,” wrote Ross Anderson, one of the Cambridge scientists behind the research paper and the founder of the Foundation for Information Policy Research, on his personal blog. “This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that…”
“LLMs are like fire — a useful tool, but one that pollutes the environment,” Anderson added. “How will we cope with it?”
To avoid model collapse, access to genuine human-generated content is essential, as are effective governance standards that provide a firm go-forward infrastructure for AI training.
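That claim can be checked in the same toy setting used above. In the sketch below, a modest slice of each generation’s training mix is kept genuinely “human,” i.e., drawn fresh from the true distribution; the 10% figure is an arbitrary assumption for illustration, not a sourced recommendation.

```python
# Toy mitigation sketch: recycle model output, but refresh part of every
# generation's training data from the original human source.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # generation 0: human-generated data

for gen in range(1, 101):
    # 90% of the next training set is recycled, model-generated content;
    # 10% is fresh human-generated data from the true distribution.
    recycled = rng.choice(data, size=900, replace=True)
    fresh = rng.normal(size=100)
    data = np.concatenate([recycled, fresh])
    if gen % 20 == 0:
        print(f"generation {gen:3d}: std = {data.std():.3f}")
```

In this toy setting the pool’s statistics stay anchored near the true distribution, because every generation is partly refreshed from the original source, which is exactly why continued access to genuine human-generated content matters.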
“The future [of AI model development] will be … more of a continual dance where there is active learning and reinforcement learning where multiple, highly-trained experts are part of the workflow to continually improve a model,” Erik Duhaime, co-founder and CEO of Centaur Labs, told PYMNTS in May.
The world is now at a tipping point where the sweeping digitization of the business ecosystem has gifted firms with untold terabytes of proprietary data about their customers and their operations, providing a fertile foundation for AI models that avoids the need to scrape the internet for inputs.
But before looking to tap wholly owned company data whose provenance is never in question, businesses must first ensure that they have the appropriate data infrastructure in place to propel their business processes into the 21st century while steering clear of pitfalls like model collapse.