The ability of some of the largest and most valuable businesses in the world to differentiate their generative artificial intelligence (AI) products is inherently limited.
As AI firms look to stand out in today’s crowded landscape, they are increasingly realizing that differentiation isn’t just an option; it’s the key to survival. Product-level differentiation can only come from differences in the training data their AI models are built from.
A healthy dose of baked-in scalability helps too.
But as generative AI platforms build out more capabilities across voice, image and text, they will need more — and more varied — data. Access to that data is only getting tighter and more expensive.
Meta Platforms introduced a suite of new AI products Wednesday (Sept. 27), including AI-powered chatbots with distinct personality profiles. The announcement revealed that Meta’s latest AI products were built atop a foundational model trained using public posts on the company’s own Facebook and Instagram platforms, including both text-based content and photos collected from users posting publicly.
Meta has the built-in advantage that more than 77% of internet users are active on at least one of its platforms. But for firms without native access to public data from billions of users there for the scraping, a mini-war is brewing over licensing access to safe, well-structured data sets that can be used to build the next generation of increasingly multimodal AI models.
These future-fit models, some of which are already coming to market, can work across and generate multiple content types, moving seamlessly from text queries to image creation to describing an image out loud.
They are sophisticated black boxes, and to do their job effectively and efficiently, they must be trained on the best data available.
Read also: Peeking Under the Hood of AI’s High-Octane Technical Needs
Generative AI technologies are trained using large quantities of data, with companies scouring the entire corpus of internet history as best they are able. AI models can identify and learn the patterns in the data, enabling them to subsequently generate new content when queries match the observed patterns gleaned from training data.
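The pattern-learning idea described above can be illustrated with a deliberately tiny sketch: a bigram model "trains" by counting which word follows which in a corpus, then generates new text by sampling from those observed patterns. The corpus and function names here are invented for demonstration; real generative models learn far richer patterns with neural networks, but the train-on-data, generate-from-patterns loop is the same in spirit.

```python
from collections import defaultdict
import random

# Invented toy corpus standing in for scraped web text.
corpus = (
    "the model learns patterns from data and "
    "the model generates text from patterns"
).split()

# "Training": tally which words were observed to follow each word.
follows = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word].append(next_word)

def generate(start, length=5, seed=0):
    """Generate text by repeatedly sampling an observed continuation."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        options = follows.get(words[-1])
        if not options:  # no observed continuation; stop early
            break
        words.append(rng.choice(options))
    return " ".join(words)

print(generate("the"))
```

Every word the sketch emits was seen following its predecessor in the training corpus, which is also why the quality and breadth of the data matter so much: the model can only recombine patterns it has observed.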
Lifting up the hood of the biggest generative AI platforms is only possible by working backward from the data on which the models have been trained.
OpenAI said in a company blog post that its models, including the models that power ChatGPT, are developed using three primary sources of information: information that is publicly available on the internet; information that it licenses from third parties; and information that users or human trainers provide.
Most AI firms take a similar approach — and similar, publicly available web data — to building out their foundational models. It is the addition of proprietary fine-tuning techniques, such as Anthropic’s constitutional AI or the use of Reinforcement Learning from Human Feedback (RLHF), that can set apart AI platforms in minute ways that increasingly scale up and compound.
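At the heart of RLHF is a reward-modeling step: given pairs of responses where a human labeled one as preferred, fit a scorer so that preferred responses receive higher scores (a logistic, Bradley-Terry-style preference loss). The sketch below is a minimal illustration under invented data: the 2-d feature vectors stand in for response embeddings, and the linear scorer stands in for a neural reward model; production RLHF pipelines are far more involved.

```python
import math

# Each example pairs (features_of_preferred, features_of_rejected).
# Vectors are invented stand-ins for response embeddings.
preferences = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.8, 0.3], [0.3, 1.0]),
]

def score(w, x):
    """Linear reward model: higher score means more preferred."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train(preferences, steps=500, lr=0.1):
    w = [0.0, 0.0]
    for _ in range(steps):
        for preferred, rejected in preferences:
            # Model's probability of agreeing with the human label.
            margin = score(w, preferred) - score(w, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on the log-likelihood of the preference.
            for i in range(len(w)):
                w[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return w

w = train(preferences)
```

After training, the scorer ranks the human-preferred response above the rejected one in each pair; in a full RLHF pipeline, that learned reward signal then steers further tuning of the language model itself.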
See also: Voice Tech’s Renaissance: The Conversation Gets Real with Generative AI Boost
As consumer and platform awareness of data privacy and personally identifiable information grows, many websites have tightened their privacy rules and cracked down on third-party scraping access.
“Publicly available on the web” can mean a lot of different things, with a lot of different abuses hidden along the journey.
And AI firms are increasingly finding themselves under fire for the ways in which they collect the corpora of data used to train their models.
A federal lawsuit in June claimed OpenAI trained its ChatGPT tool using millions of people’s stolen data, including private information and conversations, medical data and information about children, all harvested without owners’ knowledge or permission.
The lawsuit added to the growing number of legal actions against generative AI providers, including Meta Platforms and Stability AI, over the use of data to train their AI systems.
Elon Musk has also threatened to sue Microsoft over “illegal” use of data from X, formerly Twitter.
“They trained illegally using Twitter data,” Musk tweeted. “Lawsuit time.”
As the industry matures, firms are rushing to lock up licensing agreements for the datasets they need to build image recognition capabilities and more.
OpenAI has already inked deals with the Associated Press (AP) and Shutterstock to use their content to train its models.
However, some data holders are deciding that rather than license out their valuable libraries, they will build their own AI models using them.
Getty Images launched Monday (Sept. 25) what it termed a commercially safe generative AI tool that combines Getty Images’ creative content with AI technology to provide a responsible and reliable solution for commercial use.
Getty is one of the few companies with millions of high-quality images that can potentially be licensed in a single transaction, but if its own AI offering becomes popular enough, it may decide not to give up control of its own intellectual property.
In the long run, data may very well end up being the scarcest and most valuable input for training AI systems.
For all PYMNTS AI coverage, subscribe to the daily AI Newsletter.