The demand for the high-quality data that powers artificial intelligence (AI) conversational tools like OpenAI’s ChatGPT may soon outstrip supply, potentially stalling AI progress, industry analysts warn.
This growing reliance on ever-larger datasets is a double-edged sword for AI development. While the data is necessary for enhancing the sophistication of models like ChatGPT, the impending shortage is raising alarms within the tech community.
The shortage of AI training data stems from the need for large volumes of high-quality, diverse and accurately labeled data that is representative of the real-world scenarios the model will encounter. Acquiring such data is often a time-consuming task, as it may involve manual annotation by domain experts, collection from a wide range of sources and careful curation to ensure data quality and eliminate biases.
Additionally, AI companies face complex copyright challenges when acquiring training data, requiring careful navigation of legal provisions, permissions and content filtering processes.
“Humanity can’t replenish that stock faster than LLM companies drain it,” Jignesh Patel, a computer science professor at Carnegie Mellon University and co-founder of DataChat, a generative AI platform for analytics, told PYMNTS. “Specialized LLMs, on the other hand, depend much less on publicly available data. Think of an LLM that automates a risk review process in finance. That process might be unique to one bank or investment firm, and there may be little or no documentation about it.”
Authors and publishers have sued AI companies, accusing them of unlawfully using their works to develop the technology, highlighting the importance of training data in the AI sector.
ChatGPT and other AI chatbots pull from a vast web library, including articles and Wikipedia, to fuel their responses. They digest this data to master the intricacies of human language, undergoing rigorous training on diverse text to grasp the context and interpret queries accurately.
This immersion in online content enables them to simulate realistic conversations. However, experts warned in a 2022 paper on the preprint server arXiv that there might not be enough high-quality information to learn from within the next few years, making it challenging to improve these AIs.
Researchers are using several strategies to navigate the challenge of data scarcity. One is generating synthetic data computationally. This approach enriches datasets, providing machine learning models with a diverse array of scenarios for training, akin to an extensive preparatory course before a final exam.
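In its simplest form, the idea can be sketched in a few lines: fit simple statistics to a small set of real records, then sample new, artificial records from those statistics. This is a minimal, hypothetical illustration (the function name and Gaussian assumption are the author's own, not from any product mentioned in the article):

```python
import random
import statistics

def synthesize_rows(real_rows, n, seed=0):
    """Generate n synthetic rows by sampling each numeric feature
    from a Gaussian fitted to the real data (a toy illustration)."""
    rng = random.Random(seed)
    cols = list(zip(*real_rows))  # column-wise view of the real rows
    fitted = [(statistics.mean(c), statistics.stdev(c)) for c in cols]
    return [
        tuple(rng.gauss(mu, sigma) for mu, sigma in fitted)
        for _ in range(n)
    ]

# Four real records with two numeric features each
real = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (1.1, 10.5)]
synthetic = synthesize_rows(real, n=100)
```

Real synthetic-data pipelines use far richer generative models, but the principle is the same: the artificial samples inherit whatever structure, and whatever biases, the real data exhibits.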
Another strategy includes employing human supervision in the data generation process. While artificial intelligence has made significant strides, it still lacks the nuanced understanding and ethical discernment inherent to human judgment.
Patel explained that large language models (LLMs) can generate artificial examples to train themselves in a process called “self-improvement.”
“But the full implications of that tactic are still unknown,” he said. “If the LLM has biases, then its artificial training data could carry those biases, resulting in a detrimental feedback loop.”
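The feedback loop Patel warns about can be demonstrated with a toy simulation. Here the "model" is just a probability of emitting label A versus B, and the quality filter carries a subtle bias; the specific numbers are invented purely for illustration:

```python
import random

def self_improve(p_a, rounds, n=10000, seed=0):
    """Toy self-improvement loop: each round the model trains on its
    own filtered output. A filter that drops 10% of B samples but no
    A samples multiplies the odds of A by 1/0.9 every round, so a
    tiny bias compounds into a large skew."""
    rng = random.Random(seed)
    for _ in range(rounds):
        samples = ["A" if rng.random() < p_a else "B" for _ in range(n)]
        # Biased "quality" filter: keeps every A, drops 10% of B
        kept = [s for s in samples if s == "A" or rng.random() < 0.9]
        p_a = kept.count("A") / len(kept)  # "retrain" on own output
    return p_a

print(self_improve(0.5, rounds=1))   # slightly skewed toward A
print(self_improve(0.5, rounds=20))  # the skew has compounded heavily
```

An unbiased model drifts toward near-total imbalance after a few dozen rounds, which is exactly the detrimental loop Patel describes: biases in the filter become biases in the training data, which become stronger biases in the next model.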
One example of the challenges with synthetic data involves a project creating data for Google’s Project Starline, which focuses on the human body, facial expressions and movements. Tony Fernandes, CEO of UEGroup, told PYMNTS that his team is actively supplying this data, using a recording device to collect information across diverse skin tones.
“Any artificially created version of that source would create risks, both from a market and regulatory standpoint, because not enough work was done in that area in the past,” he added.
One solution to the data problem might be finding better ways to share data. Nikolaos Vasiloglou, the vice president of research in machine learning at RelationalAI, shared with PYMNTS that there’s a significant issue with content creators being reluctant to make their high-quality data available.
He pointed out that these creators either don’t want to share their data without getting paid or are unsure about selling it because they feel the offered prices don’t reflect the data’s actual value.
“If OpenAI were to add attribution to the responses that are provided, for example, companies would likely be more eager to contribute free content in exchange for putting their brand name in front of millions of users or for a dividend,” he said. “Attribution will help us create a fair market where content creators and LLM providers would be able to monetize data.”
Some experts believe concerns about data scarcity are exaggerated. Studies indicate that data quality significantly outweighs quantity, even though quantity retains its importance, Ilia Badeev, head of data science at Trevolution Group, told PYMNTS. He noted that as the volume of data increases, the complexity and cost of training also rise, alongside an enhanced likelihood of the model overlooking crucial information during training — a scenario akin to searching for a needle in a haystack, but with data.
“My subjective opinion is that we will soon move from the idea of ‘feeding the model all possible data’ to a more careful selection of data for training,” he said. “In which initially training data will be carefully cleaned, deduplicated, verified (also by AI models), and then based on them, new models will be trained — generative ones to generate new data and verification ones to check the generated data. A closed circle of quality improvement.”
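The clean-deduplicate-verify pipeline Badeev describes can be sketched in miniature. The heuristics below (whitespace normalization, exact-match dedup, a minimum-length check) are illustrative stand-ins; production curation would use fuzzy dedup and model-based quality scoring:

```python
import re

def curate(corpus):
    """Minimal sketch of a clean -> deduplicate -> verify pipeline."""
    # Clean: collapse whitespace and trim each document
    cleaned = [re.sub(r"\s+", " ", doc).strip() for doc in corpus]
    # Deduplicate: exact match after lowercase normalization
    seen, deduped = set(), []
    for doc in cleaned:
        key = doc.lower()
        if key not in seen:
            seen.add(key)
            deduped.append(doc)
    # Verify: keep documents passing a simple quality check; in practice
    # this step could itself be an AI classifier, as the quote suggests
    return [d for d in deduped if len(d.split()) >= 3]

docs = ["The  cat sat.", "the cat sat.", "ok", "A clean, useful sentence."]
print(curate(docs))
```

Even this toy version shows the shift Badeev anticipates: rather than feeding a model everything, each document must earn its place in the training set.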