Training data is the key competitive moat when it comes to generative artificial intelligence (AI).
The winner-take-most dynamics of digital technology create network effects that are frequently captured by huge platforms, where products and services become more valuable as the user base grows.
With generative AI added to the mix, those large user networks create a positive feedback loop: platform-native public data that can be harvested and fed into increasingly data-hungry AI models.
Microsoft CEO Satya Nadella said Monday (Oct. 2), per Reuters, while testifying during the U.S. Justice Department’s antitrust fight against Google, that Big Tech companies are competing for vast troves of content needed to train their AI models. Without naming Microsoft competitor and Google parent Alphabet, he complained other companies are locking up exclusive deals in a way that is “problematic.”
While the data landscape is becoming increasingly competitive and expensive as tech firms large and small alike ramp up their efforts to build differentiated content libraries, John Schmidtlein, Google’s lead lawyer, argued that Microsoft’s data deficit is due to a series of strategic errors and an inferior search product, according to the report.
Schmidtlein noted that even when Microsoft pre-installed its Bing search engine on Windows devices, it still failed to gain more than a 20% market share, with users consistently switching to Google, the report said. Adding to Microsoft’s inability to gain a foothold in the search market was the company’s failure to prepare for the mobile revolution, Schmidtlein said, as well as its failure to appropriately invest in resources like servers and engineers to build a market-leading search product.
Google’s dominance in search, and its position as the internet’s first stop for many users, also gives it a competitive advantage when it comes to building out its AI models.
Read also: Does Meta’s 3.59 Billion Users Give It Competitive AI Moat?
Generative AI models are typically opaque black boxes trained on publicly available web data and further refined using licensed content.
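For readers who want a concrete picture of that two-stage pattern, here is a minimal, purely illustrative sketch in Python: a toy language model is first trained on stand-in “public web” text and then briefly refined on stand-in “licensed” text. The corpora, model size and hyperparameters are hypothetical placeholders, not a description of any vendor’s actual pipeline.

```python
# Minimal sketch of the two-stage pattern described above: pretrain a tiny
# language model on "public web" text, then refine it on "licensed" text.
# All data, sizes and hyperparameters here are toy placeholders.
import torch
import torch.nn as nn

# Stand-ins for the two data sources the article describes.
PUBLIC_WEB_TEXT = "scraped public web pages form the bulk of pretraining data. " * 20
LICENSED_TEXT = "licensed, higher-quality content is used to refine the model. " * 20

def encode(text: str) -> torch.Tensor:
    """Byte-level 'tokenizer': every byte is a token (vocabulary of 256)."""
    return torch.tensor(list(text.encode("utf-8")), dtype=torch.long)

class TinyLM(nn.Module):
    """A tiny LSTM language model standing in for a large transformer."""
    def __init__(self, vocab_size: int = 256, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)

def train_on(model: nn.Module, text: str, steps: int, lr: float) -> None:
    """Next-token prediction on one long token stream."""
    tokens = encode(text).unsqueeze(0)           # shape: (1, seq_len)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        logits = model(inputs)                   # (1, seq_len - 1, vocab)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optim.zero_grad()
        loss.backward()
        optim.step()
    print(f"final loss: {loss.item():.3f}")

model = TinyLM()
train_on(model, PUBLIC_WEB_TEXT, steps=100, lr=1e-3)  # stage 1: "pretraining" on public data
train_on(model, LICENSED_TEXT, steps=30, lr=1e-4)     # stage 2: "refinement" on licensed data
```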
An analysis by The Washington Post in April of just one data set used for training AI found that nearly the entire 30-year history of the internet has been scraped by tech companies looking to add to the billions, even trillions, of data points their models are trained on.
The Atlantic reported Aug. 19 that a dataset known as “Books3,” which included more than 191,000 books — many of them pirated — was used to train generative AI models later published by Meta, Bloomberg and others.
Today’s neural networks would not have had their breakthroughs without these digital landfills that pockmark the internet.
Increasingly, nearly every activity that takes place on the internet will eventually feed an AI model as part of its vast sea of training data. And once a piece of data has been fed into a model, it is nearly impossible to extricate that information or un-teach the AI, Fortune reported Aug. 30.
See also: Peeking Under the Hood of AI’s High-Octane Technical Needs
The latest generation of Meta’s AI products, announced Wednesday (Sept. 27), was trained using public posts on the company’s own Facebook and Instagram platforms.
Using public data from wholly owned platforms is becoming a popular way for tech companies to build out their own distinct AI content libraries, and more companies are updating their privacy policies to allow user data to be collected for training AI models.
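As a rough sketch of what “public posts only” filtering could look like in practice, the snippet below drops non-public posts and opted-out users before any text reaches a training corpus. The record format and the is_public and opted_out fields are assumptions made for illustration, not any platform’s real schema.

```python
# Hypothetical sketch of the "public posts only" filtering the article describes:
# posts that are not public, or whose authors opted out, never reach the corpus.
from dataclasses import dataclass

@dataclass
class Post:
    author_id: str
    text: str
    is_public: bool          # assumed visibility flag
    opted_out: bool = False  # assumed user opt-out of AI training

def select_training_posts(posts: list[Post]) -> list[str]:
    """Keep only text from public posts whose authors have not opted out."""
    return [p.text for p in posts if p.is_public and not p.opted_out]

posts = [
    Post("u1", "Public product review.", is_public=True),
    Post("u2", "Private message to a friend.", is_public=False),
    Post("u3", "Public post, but user opted out.", is_public=True, opted_out=True),
]
print(select_training_posts(posts))  # -> ['Public product review.']
```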
“The algorithm is only as good as the data that it’s trained on,” Erik Duhaime, co-founder and CEO of healthcare AI startup Centaur Labs, told PYMNTS in an interview published in May.
Companies including X (formerly Twitter), Microsoft, Instacart, Meta and Zoom are apparently of the mind that their own data is best.
Elon Musk-owned X’s latest privacy policy informed users that their biometric data and their job and education history will be used to train AI and machine learning models, Bloomberg reported Aug. 30.
“We may use the information we collect and publicly available information to help train our machine learning or artificial intelligence models for the purposes outlined in this policy,” the company’s updated policy stated.
Musk has threatened to sue Microsoft over the “illegal” use of data from X.
Meanwhile, per a Sept. 28 Reuters report, Meta said it purposefully excluded public LinkedIn data from its training set because of privacy concerns. LinkedIn is owned by Microsoft.
“A lot of people are using OpenAI’s [application programming interfaces (APIs)] to test and sort of play with [generative AI] technology,” Taylor Lowe, CEO and co-founder of AI developer platform Metal, told PYMNTS in an interview published in July. “This won’t last forever … and enterprises are right to push for data privacy, up to and including running the whole stack on their own network.”
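The kind of API-based experimentation Lowe describes can be as simple as the request below, which sends a single prompt to OpenAI’s hosted chat completions endpoint. It assumes an OPENAI_API_KEY environment variable and a model name that was commonly available at the time of writing.

```python
# A minimal example of "playing with" generative AI through OpenAI's API:
# one prompt sent to the hosted chat completions endpoint.
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",  # assumed model name; swap for whatever is current
        "messages": [
            {"role": "user", "content": "Summarize why training data matters for AI."}
        ],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

The same prompt could later be pointed at a self-hosted model, which is the shift toward running “the whole stack on their own network” that Lowe anticipates.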