Nearly half of the companies in the S&P 500 have reported their most recent quarterly earnings.
The terms “artificial intelligence” or “AI” were mentioned a total of 827 times, nearly double the number of mentions from the prior quarter, Reuters reported Tuesday (Aug. 1).
The generative AI industry itself is expected to grow to $1.3 trillion by 2032.
But while tens of millions of individuals use generative AI tools like OpenAI’s ChatGPT, Google’s Bard, and other solutions from Microsoft, Meta and more, few people understand how the foundational large language models (LLMs) behind today’s leading AI platforms work.
Understanding how LLMs function is essential for using them effectively and discerning their strengths and limitations.
The foundational language models commercialized today are typically trained using deep learning algorithms built on neural networks that ingest billions of words of ordinary language.
The specific sources of training data are often undisclosed by the companies behind these models. However, much of the data comes from publicly available information on the web that has been scraped and fed into the models.
Digging into how exactly LLMs process vast amounts of data to “learn” language patterns and perform language-related tasks is where things get tricky and a little inscrutable.
See also: Peeking Under the Hood of AI’s High-Octane Technical Needs
To delve into the workings of LLMs, it is important to first understand how they represent words. Unlike humans, who represent words with sequences of letters, LLMs use word vectors: long lists of numbers that place each word in an imaginary “word space.”
At a highly simplified level, word vectors work much like the oldest of all spatial reference systems, the geographic coordinate system of latitude and longitude.
Just as Paris or Tokyo can be translated into coordinates that inherently encode their distance from each other, LLMs use word vectors to spatially locate individual words across vector spaces with hundreds or even thousands of dimensions. That landscape is far too complex for the human mind to picture, but easy enough for today’s computers to handle.
In this space, words with similar meanings sit closer together, enabling the model to perform various operations on numerical representations rather than on traditional strings of letters.
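To make the idea concrete, here is a minimal Python sketch using invented four-dimensional vectors (real models learn vectors with hundreds or thousands of dimensions), showing how cosine similarity reads “closeness” in that space:

```python
import numpy as np

# Toy 4-dimensional word vectors. The values are invented for
# illustration; real models learn them from data.
vectors = {
    "paris":  np.array([0.9, 0.1, 0.8, 0.2]),
    "tokyo":  np.array([0.8, 0.2, 0.9, 0.1]),
    "banana": np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine_similarity(a, b):
    """How close two words sit in the vector space (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["paris"], vectors["tokyo"]))   # high: related words
print(cosine_similarity(vectors["paris"], vectors["banana"]))  # low: unrelated words
```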
That’s step one.
On a fundamental level, LLMs aim to generate responses that seem plausible and natural and that match the data they’ve been trained on. They are compute-based algorithms and have no operational parallel to what humans would consider perception, much less a sense of what’s accurate and what isn’t outside the bounds of a given query.
That’s why one of the key innovations of LLMs is the transformer architecture, which lies at the heart of today’s language models and (not to mix metaphors) serves as their connective tissue.
Today’s LLMs are powered by dozens of transformer layers running in sequence, churning through billions, if not trillions, of parameters to rapidly capture the complex relationships between data elements (word vectors) before generating a relevant response.
Each transformer layer takes word vectors as input and adds information to them; that multilayered structure is what allows the model to develop a high-level understanding of context.
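A rough sketch of that layered flow, using random stand-in weights rather than anything a real model would learn, looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq_len = 8, 4, 5       # toy sizes; real models are far larger

def layer_computation(x):
    """Stand-in for one layer's attention + feed-forward work
    (random weights here, purely to show shape and flow)."""
    W = rng.normal(scale=0.1, size=(d_model, d_model))
    return np.tanh(x @ W)

x = rng.normal(size=(seq_len, d_model))    # one vector per word in the input
for _ in range(n_layers):
    # Each layer adds its contribution to the running word vectors
    # (a residual connection) rather than replacing them outright.
    x = x + layer_computation(x)
print(x.shape)                             # still (5, 8): same words, richer vectors
```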
At the core of the transformer is the attention mechanism, which acts as a matchmaking service for words. Each word produces a query vector describing the characteristics of the words it is looking for, along with a key vector describing its own attributes. The model compares these vectors to find the best matches and transfers information between words accordingly.
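That matchmaking is known as scaled dot-product attention. The sketch below uses random toy matrices; the names Q, K and V for queries, keys and values follow common convention, not any particular model’s code:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: match each word's query against
    every word's key, then blend value vectors by the match weights."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # query-key match scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V                                 # weighted information transfer

rng = np.random.default_rng(1)
seq_len, d_k = 4, 8                                    # toy sizes, for illustration
Q = rng.normal(size=(seq_len, d_k))                    # what each word is looking for
K = rng.normal(size=(seq_len, d_k))                    # what each word offers
V = rng.normal(size=(seq_len, d_k))                    # the information itself
print(attention(Q, K, V).shape)                        # (4, 8): one updated vector per word
```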
The attention mechanism lets the model compute a weighted sum over its input sequence, focusing on different parts of the input simultaneously and weighing the aspects of the data most relevant to the desired output. That capability removed the need for the complex recurrent and convolutional neural networks that powered most AI models before 2017.
It is a two-part process. First, the LLM uses attention to identify and establish contextual relationships between words. Then a feed-forward layer uses that attention-weighted context to transform each word’s representation, helping the model predict and generate the most likely and relevant output.
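A minimal sketch of that second step, the position-wise feed-forward layer, again with toy sizes and random weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_hidden, seq_len = 8, 32, 4        # toy sizes

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer: applied to each word's
    attention-updated vector independently (step two)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU, then project back down

x = rng.normal(size=(seq_len, d_model))      # output of the attention step (step one)
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape) # (4, 8): same shape, refined content
```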
Read also: Why Crypto Needed Celebrity Influencers and Generative AI Does Not
Ultimately, the success of language models lies in their ability to predict the next word.
The models learn through repeated exposure, gradually adjusting their weight parameters (the strengths of the connections between artificial neurons) to improve predictions, which is a big reason why training LLMs requires an enormous amount of computation and data.
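A drastically simplified illustration of that loop, using a toy four-word “model” and gradient descent rather than a real LLM’s machinery:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["the", "cat", "sat", "down"]
corpus = [0, 1, 2, 3]                        # "the cat sat down", as word ids
W = rng.normal(scale=0.1, size=(len(vocab), len(vocab)))   # the model's weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

learning_rate = 0.5
for _ in range(100):                         # repeated exposure to the text
    for current, nxt in zip(corpus[:-1], corpus[1:]):
        probs = softmax(W[current])          # predicted next-word distribution
        grad = probs.copy()
        grad[nxt] -= 1.0                     # gradient of the cross-entropy loss
        W[current] -= learning_rate * grad   # nudge weights toward better predictions

print(vocab[int(np.argmax(W[0]))])           # after "the", the model now expects "cat"
```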
A key innovation of LLMs is that, rather than needing hand-labeled data to coach them, they learn by trying to predict the next word in ordinary passages of text. Almost any written material, from Wikipedia pages to news articles to computer code, is suitable for training, and a great deal of what’s publicly available on the web has already been scraped for the purpose.
For that reason, it’s hard to overstate the sheer number of examples these models see. OpenAI’s GPT-3, which has since been vastly eclipsed by GPT-4, was trained on roughly 500 billion words. That is several orders of magnitude more than the number of words a typical human encounters in the first decade of life.
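A back-of-envelope comparison bears that out; the daily word-exposure figure below is an assumption, not a measured statistic:

```python
# Rough check of "several orders of magnitude." Assumes a person
# hears/reads on the order of 15,000 words per day.
gpt3_training_words = 500e9                    # ~500 billion, per the article
human_words_per_decade = 15_000 * 365 * 10     # ~55 million words
print(gpt3_training_words / human_words_per_decade)   # ~9,000x, about 4 orders of magnitude
```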
As LLMs have grown in size and complexity, they have displayed increasingly impressive feats of language processing. However, researchers acknowledge that fully comprehending how these models achieve such feats remains an ongoing challenge. Some argue that LLMs are beginning to genuinely understand language, while others view them as merely advanced pattern-matching systems.
One of the more pressing dangers of LLMs is that they can encode relational biases into their word vectors, for example vector arithmetic in which “doctor” minus “man” plus “woman” lands closest to “nurse.” Those biases are then scaled up across entire foundation models via the transformer architecture as the model trains.
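The effect is easy to reproduce with hand-built toy vectors; the numbers below are invented to bake in the skew, not taken from any real model:

```python
import numpy as np

# Hand-built 3-d vectors that deliberately encode a gender skew,
# mimicking the classic "doctor - man + woman ≈ nurse" finding.
vecs = {
    "man":     np.array([ 1.0, 0.0, 0.0]),
    "woman":   np.array([-1.0, 0.0, 0.0]),
    "doctor":  np.array([ 0.9, 1.0, 0.0]),   # skewed toward "man"
    "surgeon": np.array([ 0.8, 1.0, 0.0]),   # also skewed toward "man"
    "nurse":   np.array([-0.9, 1.0, 0.0]),   # skewed toward "woman"
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = vecs["doctor"] - vecs["man"] + vecs["woman"]
candidates = [w for w in vecs if w not in ("doctor", "man", "woman")]
print(max(candidates, key=lambda w: cosine(query, vecs[w])))   # -> "nurse"
```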
That’s why understanding the mysterious inner workings of these models is crucial for their oversight.