In data centers worldwide, trillions of words sit ready to teach machines how to talk.
This collection of text — known as a corpus in artificial intelligence — is behind the AI chatbots and assistants that have captured the public’s imagination.
The scale of these modern corpora is staggering. When OpenAI released its GPT-3 model in 2020, it made waves not just for its ability to generate human-like text but also for the sheer size of its training data: approximately 500 billion tokens, or pieces of text.
That number, once thought enormous, has since been surpassed. Google’s PaLM model, announced in 2022, upped the ante with a corpus of over 780 billion tokens. The race for larger datasets continues, driven by the observed correlation between data size and model performance.
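A "token" is not exactly a word: most models split text into subword pieces using a learned vocabulary, so the same passage can yield different counts under different tokenizers. Below is a minimal sketch of how a corpus's size in tokens might be measured, using the open-source tiktoken library and its GPT-2 encoding; the sample documents are purely illustrative.

```python
import tiktoken

# Load the byte-pair-encoding vocabulary used by GPT-2 (tiktoken also ships newer encodings).
enc = tiktoken.get_encoding("gpt2")

documents = [
    "In data centers worldwide, trillions of words sit ready to teach machines how to talk.",
    "Tokenization splits text into subword pieces that a model can process.",
]

# A corpus's "size in tokens" is simply the sum of token counts across all documents.
total_tokens = sum(len(enc.encode(doc)) for doc in documents)
print(f"{total_tokens} tokens across {len(documents)} documents")

# A single word often maps to more than one token.
ids = enc.encode("tokenization")
print(ids)                              # integer token IDs
print([enc.decode([i]) for i in ids])   # the subword pieces they represent
```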
The quantity of data is just one aspect of the challenge. The diversity of these datasets is equally crucial. Common Crawl, a non-profit organization that archives web pages, provides a snapshot of the internet that many AI researchers use as a starting point. Its monthly archives contain about 3 billion web pages — a trove of human communication in all its forms.
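Researchers generally do not crawl the web themselves; they stream Common Crawl's published WARC archives. The sketch below shows that pattern using the requests and warcio libraries; the archive path is a placeholder standing in for one listed in a crawl's index file, and the record limit is arbitrary.

```python
import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder path: real paths are listed in each crawl's warc.paths.gz index file.
WARC_URL = "https://data.commoncrawl.org/crawl-data/.../example.warc.gz"

def iter_pages(warc_url, limit=100):
    """Stream a WARC archive and yield (url, raw_html) pairs for HTTP responses."""
    resp = requests.get(warc_url, stream=True)
    resp.raise_for_status()
    yielded = 0
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "response":
            continue  # skip request and metadata records
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()
        yield url, html
        yielded += 1
        if yielded >= limit:
            break

for url, html in iter_pages(WARC_URL, limit=5):
    print(url, len(html), "bytes")
```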
Building these datasets is a monumental task. Web scraping algorithms comb through billions of web pages, filtering out spam and low-quality content. Teams of linguists and data scientists clean and preprocess the text, turning raw data into a format that AI models can digest.
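The filtering step usually begins with simple heuristics before any heavier machine-learning classifiers are applied. The following is a minimal sketch of the kinds of rules such pipelines use; the thresholds and checks here are illustrative rather than drawn from any particular project.

```python
import re

def keep_document(text, min_words=50, max_word_len=1000, min_alpha_ratio=0.6):
    """Return True if a scraped document passes basic quality heuristics."""
    words = text.split()
    if len(words) < min_words:
        return False                      # too short to be useful prose
    if any(len(w) > max_word_len for w in words):
        return False                      # likely minified code or binary junk
    alpha = sum(c.isalpha() for c in text)
    if alpha / max(len(text), 1) < min_alpha_ratio:
        return False                      # mostly symbols, numbers or markup
    if "lorem ipsum" in text.lower():
        return False                      # boilerplate placeholder text
    if re.search(r"\{\s*\"|function\s*\(", text):
        return False                      # looks like leftover JSON or JavaScript
    return True

docs = ["Short spammy page", "A longer, well-formed article about AI corpora. " * 20]
clean = [d for d in docs if keep_document(d)]
print(len(clean), "of", len(docs), "documents kept")
```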
Some specialized fields require even more curation. The MIMIC-III database for medical AI research contains de-identified health data from over 40,000 critical care patients. Each record is anonymized and formatted to protect patient privacy while preserving medical information.
In the legal domain, the Case Law Access Project corpus, developed by Harvard Law School, provides access to all official, book-published United States case law from 1658 to 2018. This dataset of over 6.5 million cases is a crucial resource for AI systems that assist with legal research and analysis.
Creating these massive corpora has its challenges. Web scraping can sometimes capture copyrighted material, raising legal questions. The internet isn’t always a perfect reflection of human knowledge or values.
Many research teams are developing more sophisticated filtering techniques to address these issues. Some are even exploring the creation of synthetic datasets generated by other AI models to fill gaps in their training data or to create more balanced representations of specific topics.
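One widely used filtering technique is near-duplicate detection, since web-scraped corpora contain many lightly edited copies of the same page. Here is a hedged sketch using the open-source datasketch library's MinHash implementation; the similarity threshold and the tiny document set are illustrative only.

```python
from datasketch import MinHash, MinHashLSH

def signature(text, num_perm=128):
    """Build a MinHash signature from a document's set of lowercase words."""
    m = MinHash(num_perm=num_perm)
    for word in set(text.lower().split()):
        m.update(word.encode("utf8"))
    return m

docs = {
    "a": "Artificial intelligence models are trained on enormous collections of text gathered from across the public web",
    "b": "Artificial intelligence models are trained on huge collections of text gathered from across the public web",
    "c": "Central banks weigh the impact of AI on payments",
}

# Index documents one by one; query each new one against what is already indexed.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for key, text in docs.items():
    sig = signature(text)
    if lsh.query(sig):          # a similar document is already in the index
        continue
    lsh.insert(key, sig)
    kept.append(key)

print(kept)  # "b" should be dropped as a near-duplicate of "a"
```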
Preparing data for AI training involves several key steps. Tokenization breaks the text into individual words or subwords the model can process. Normalization standardizes text by converting it to lowercase, removing extra whitespace and handling special characters. Sentence segmentation identifies sentence boundaries, which is important for many natural language processing tasks. Part-of-speech tagging labels words with their grammatical categories. Named entity recognition identifies and categorizes entities like people, organizations and locations. Dependency parsing analyzes the grammatical structure of sentences.
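Most of these steps are available off the shelf in open-source NLP toolkits. Below is a brief sketch using spaCy's small English pipeline (it assumes the en_core_web_sm model has been downloaded); normalization is usually a simple custom step, such as lowercasing and collapsing whitespace, so it is not shown here.

```python
import spacy

# Load a pretrained English pipeline: tokenizer, tagger, parser and NER in one object.
nlp = spacy.load("en_core_web_sm")

text = "OpenAI released GPT-3 in 2020. Google announced PaLM two years later."
doc = nlp(text)

# Sentence segmentation
for sent in doc.sents:
    print("SENTENCE:", sent.text)

# Tokenization, part-of-speech tagging and dependency parsing
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entity recognition
for ent in doc.ents:
    print("ENTITY:", ent.text, ent.label_)
```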
Text-based corpora represent just the beginning. Researchers are now looking to incorporate other types of data — images, audio and even sensor readings — to create more well-rounded AI systems.
The Pile, a dataset created by EleutherAI, is an example of this more diverse approach. It contains English text from 22 sources, including academic papers, code repositories and web content. That diversity allows AI models to learn from a broader range of human knowledge and communication styles.
For multilingual models, datasets like mC4 (used to train Google’s mT5 model) and OSCAR (Open Super-large Crawled Aggregated coRpus) provide text in over 100 languages. These resources are crucial for developing AI systems that can understand and generate text across multiple languages.
The future of AI corpora may involve real-time updates, allowing models to stay current with the latest information without requiring full retraining. Researchers are also exploring ways to make training more efficient, potentially reducing the volume of data needed while maintaining or improving model performance.
AI continues to evolve, and the development and refinement of these text collections remain a critical challenge. Today’s corpora — their size, diversity and quality — will shape the capabilities of tomorrow’s AI systems, determining how these digital entities understand and interact with the world around them.