STATSML 301: A Concept Course on Language Models
Learn the key ideas behind language models
Introduction
As a Data Distiller user, understanding the power and potential of Large Language Models (LLMs) and deep learning breakthroughs is essential because these technologies represent the future of AI. LLMs, powered by advanced neural networks, have revolutionized the way AI models process vast amounts of data, enabling them to generate human-like text, learn from complex patterns, and adapt across various domains. These capabilities are reshaping industries—from marketing and customer engagement to product development and beyond.
Knowledge Representation Holds the Key to Artificial Intelligence
To address the complex problems encountered in artificial intelligence, we need both a large amount of knowledge and effective mechanisms for manipulating that knowledge in order to create solutions for new challenges.
Human memory stores an immense amount of knowledge about the world and serves as the foundation for higher forms of learning. Systems that cannot learn can, in practical terms, exhibit only basic common sense (providing straightforward answers to simple questions). While we haven't yet developed a complete theory of human memory, neural networks—such as the Hopfield network—offer a close analogy to how neural memory might function.
Psychological research highlights several distinctions in human memory. One key distinction is between short-term memory (STM) and long-term memory (LTM). LTM is relatively permanent, while STM, or working memory, holds perceptual information temporarily. In LTM, production rules—stored as knowledge—match themselves against items in STM, firing to modify STM and repeating this process. LTM is further divided into episodic memory and semantic memory. Episodic memory stores personal experiences from an autobiographical perspective, while semantic memory holds facts like “birds fly,” which are not linked to personal experiences.
In the context of neural networks and AI, "memory" typically refers to a model’s ability to store and retrieve information from past experiences or training data. This is achieved through various mechanisms, such as recurrent neural networks (RNNs) or long short-term memory networks (LSTMs). These networks include memory cells that can capture and retain information over sequences, allowing the model to learn from and apply knowledge across time and data.
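To make this concrete, here is a minimal sketch of that idea using PyTorch's built-in LSTM (an illustrative stand-in, not part of any Data Distiller API). The cell state the network returns is the "memory" it carries across a sequence:

```python
# A minimal sketch of sequence "memory" with an LSTM, using PyTorch.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# A batch of 1 sequence, 5 time steps, 8 features per step.
x = torch.randn(1, 5, 8)

# h_n and c_n are the hidden and cell states: the "memory" carried
# across the sequence. The cell state c_n is what lets LSTMs retain
# information over long spans.
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([1, 5, 16]) — one hidden state per step
print(c_n.shape)     # torch.Size([1, 1, 16]) — final cell state ("memory")
```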
Facts & Representation Mappings
In AI, different methods are used to represent knowledge (facts) within a program.
Facts: These are the truths about the world that we want to capture, such as "birds fly."
Representation: This is how we encode those facts in a way that the AI program can work with.
There are two key levels involved in representing knowledge:
Knowledge level: This is where we describe the actual facts and behaviors, including the goals of an agent.
Symbolic level: At this level, we take those facts and represent them with symbols that the AI can manipulate.
There are two types of mappings that happen in AI:
Forward mapping: This is where we map facts from the real world into the representation the AI uses.
Backward mapping: This goes in the opposite direction, mapping the representation back to the real-world facts.
However, these mappings aren't always perfect or one-to-one. A single fact might have several possible representations, and multiple facts might share the same representation.
What an AI program does is manipulate these internal representations. The goal is for the AI to create new structures from the information it has, which can also be interpreted as solutions to the problem it's trying to solve. In other words, the AI uses the facts it knows, manipulates them, and generates new facts or answers.
Sometimes, finding the right representation makes solving a problem much easier, even for humans. Think about how changing the way you approach a problem can make it much simpler to solve. The same is true for AI—finding a good representation can turn a complex problem into a trivial one.
If there isn't a good way to represent a problem, no matter how advanced the AI program is, it won’t be able to come up with the right solution. In some cases, it may not be possible to find a perfect representation, so we have to settle for something less ideal.
In AI, we haven't found a single system that works perfectly for every type of knowledge. As a result, multiple methods of knowledge representation are used, each with its strengths and weaknesses depending on the situation.
Building a Knowledge Representation
A knowledge representation should answer the following questions:
How should sets of objects be represented?
Are there any attributes so basic that they occur in almost every problem domain?
Are there any important relationships that exist among attributes of objects?
Given a large amount of knowledge stored in a database, how can relevant parts be accessed when they are needed?
At what level should knowledge be represented? Is there a good set of primitives into which all knowledge can be broken down? Is it helpful to use such primitives?
At what level of detail should the world be represented? Another way this question is often phrased is: what should our primitives be? Should there be a small number of low-level ones, or a larger number covering a range of granularities?
What knowledge structure should we choose so that it consumes fewer resources?
Kinds of Knowledge Representation
Here are the various knowledge representations:
Simple Relational Knowledge: Represent declarative facts as a set of relations of the same sort used in database systems. This representation is simple but provides very weak inferential capabilities. However, knowledge represented in this form may serve as the input to more powerful inference engines. Providing support for relational knowledge is what database systems are designed to do.
Inheritable Knowledge: One of the most useful forms of inference is property inheritance, in which elements of specific classes inherit attributes and values from the more general classes in which they are included. To support property inheritance, objects must be organized into classes, and classes must be arranged in a generalization hierarchy (sketched after this list).
Inferential Knowledge: Sometimes the full power of traditional logic (and occasionally even more) is needed to describe the required inferences. Many procedures exist, some of which reason forward from the knowledge present in the system. One of the most useful is resolution, which exploits a proof-by-contradiction strategy.
Procedural Knowledge: The most commonly used technique for representing procedural knowledge in AI programs is the use of production rules.
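To make two of these concrete, here is a toy Python sketch (all names are made up for illustration): inheritable knowledge as a class hierarchy, and procedural knowledge as production rules firing against a working memory, echoing the STM/LTM cycle described earlier.

```python
# Inheritable knowledge: a generalization hierarchy where specific
# classes inherit attributes from more general ones.
class Bird:
    can_fly = True

class Penguin(Bird):
    can_fly = False  # more specific knowledge overrides the inherited default

print(Bird.can_fly, Penguin.can_fly)  # True False

# Procedural knowledge: production rules as (condition, action) pairs
# matched against working memory (short-term memory).
working_memory = {"sky": "cloudy", "umbrella": False}

rules = [
    (lambda wm: wm["sky"] == "cloudy", lambda wm: wm.update(umbrella=True)),
]

for condition, action in rules:
    if condition(working_memory):
        action(working_memory)  # the rule "fires" and modifies working memory

print(working_memory)  # {'sky': 'cloudy', 'umbrella': True}
```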
Knowledge in Large Language Models (LLMs) is typically represented in the form of pre-trained language models. These models are trained on vast amounts of text data from the internet, which helps them capture a broad spectrum of human knowledge. Here's how knowledge is represented in LLMs:
Word Embeddings: LLMs represent words as dense vectors in high-dimensional spaces, where similar words are closer in the vector space. These word embeddings capture semantic relationships between words, helping the model understand word meanings (a toy sketch follows this list).
Contextual Embeddings: LLMs go beyond simple word embeddings by considering the context in which words appear. They generate contextual embeddings that change based on the surrounding words. This allows the model to understand how word meanings shift depending on context.
Structured Knowledge: LLMs may include structured knowledge in their training data, such as facts, entities, and relationships. This information can be used to answer factual questions and generate coherent responses.
Commonsense Knowledge: LLMs are trained on a diverse range of texts, enabling them to capture common knowledge about the world. They can answer general knowledge questions and make predictions based on this information.
Attention Mechanisms: LLMs employ attention mechanisms that highlight relevant parts of the input text when generating responses. This helps them focus on the most informative parts of the text.
External Knowledge Sources: LLMs can access external knowledge sources, such as databases or knowledge graphs, to retrieve information during inference. This allows them to provide up-to-date and accurate answers.
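As a toy illustration of the word-embedding idea above, the sketch below treats words as dense vectors and uses cosine similarity as the notion of "closeness". The 3-dimensional vectors are made up for illustration; real LLM embeddings have hundreds or thousands of dimensions learned from data.

```python
import numpy as np

# Hand-made toy vectors: related words point in similar directions.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.3]),
    "dog": np.array([0.8, 0.2, 0.35]),
    "sky": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: related words
print(cosine_similarity(embeddings["cat"], embeddings["sky"]))  # low: unrelated words
```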
Understanding Language
The good news: human languages have structure and patterns that, thankfully, can be learned.
What makes language unique is that a finite set of sounds can be combined in infinitely many ways. These basic sounds are called phonemes. For example, in English, the words "cat" and "bat" differ by one phoneme (the sound 'k' vs. the sound 'b'), and changing this phoneme changes the meaning of the word.
Morphemes are composed of phonemes and are the building blocks of words. There are a variety of ways you can spot these in English:
cat and bat are examples where the word is a single morpheme.
The word telephone is composed of two morphemes: tele and phone.
The word unhappy is composed of two morphemes: un- and happy.
Each language in the world has a distinctive set of phonemes and rules for combining them into morphemes, and morphemes into words. Words then combine according to the rules of grammar to create any number of sentences.
In order to learn any language (you should try this with a foreign language), you need to learn the following patterns:
Phonemes: the basic sounds of the words themselves. You need these to pronounce and understand spoken language.
Morphemes: how phonemes combine to create morphemes.
Vocabulary of words: how morphemes combine to create words. In practice, most of us start with whole words and infer the phonemes and morphemes from them, which is much faster.
Combining words to create sentences that have meaning: you need to know the rules of grammar. 'Meaning' has different levels of abstraction: the words you choose and the order in which you present them decide whether the result reads as poetry or prose.
The pattern of learning a language is universal: every infant across the world goes through the same stages until they start specializing in a specific language. Infants possess the remarkable ability to discern subtle sound differences that signify distinctions in the languages spoken around them. In a very short time, they undergo a rapid learning process that enables them to detect statistical patterns in the language they are exposed to. This allows them to establish phonetic categories, recognize words within the continuous flow of speech, and grasp the structural patterns of their native language, all before they reach the age of 10 months.
A parallel journey unfolds in speech production, with infants exhibiting universal speech patterns during their early months, followed by increasing differentiation by around the age of 10 months. By the end of their first year, when they begin uttering their initial words, the process of language acquisition transitions from universal speech perception and production patterns to language-specific patterns. At the age of 10 months, if you expose the infant to a different language, they can pick it up easily.
In the research around LLMs (Large Language Models), the key is to provide enough examples so that the model can learn the structure of language, including word relationships, sentence syntax, and implicit meanings. When you train on vast amounts of text data and use a truly large model with numerous neural layers and parameters, the model begins to exhibit human-like abilities in generating sentences. However, this training process requires significant computational resources and is costly due to the sheer volume of data and model complexity.
A Note on Emergent Abilities
One of the most fascinating aspects of learning is the idea that latent abilities can emerge from mastering simple tasks. This phenomenon is similar to how we perceive certain "geniuses" as being exceptionally creative. Their unexpected or unplanned abilities often arise from interactions of simpler patterns within their brain, which lead to remarkable outcomes that leave us in awe.
When we examine their creative instincts, a few key characteristics stand out:
Complexity and Non-Linearity: Creativity often emerges from non-linear interactions, where small changes or combinations at one level can lead to significant and sometimes unpredictable outcomes at a higher level. Knowledge from one domain can suddenly resonate and apply to another in unexpected ways.
Self-Organization: Creative individuals often cannot fully explain their creativity. Their brains exhibit self-organizing properties, where new patterns or behaviors spontaneously arise without direct external influence or control.
Unpredictability: Many geniuses are known for their unpredictability. Their creative output is often inconsistent and hard to forecast, making it difficult to predict when or how their next breakthrough will occur.
These emergent abilities arise from the interaction of simpler processes, much like how creative genius in humans can spring from complex, self-organizing brain functions.
Understanding Large Language Models
Introduction
Training large language models (LLMs) is like teaching a computer to understand and generate language, just like humans do when they learn to speak or write. These models use vast amounts of text data to learn patterns in language and can then generate responses, predict what comes next in a sentence, or even hold conversations. This chapter breaks down how these models are trained, evaluated, and optimized, making complex concepts accessible while diving deeper into the key steps.
What is Language Modeling?
The Basic Idea: Language modeling involves predicting the next word in a sequence, similar to guessing the next line in a song based on the lyrics you've heard so far. For example, if a sentence starts with "The sun rises in the ___," the model is likely to predict "east" because it has seen many similar sentences during training.
How Do Language Models Work? LLMs analyze massive amounts of text data to learn patterns, like which words often appear together or in certain sequences. They use these patterns to predict the most likely next word in a sentence. It’s not just about memorizing; it's about understanding the probabilities of word combinations. For instance, "cat" is more likely to follow "The black" than "sky" would be.
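Here is a deliberately tiny version of this idea: a bigram model that counts which word follows which and predicts the most frequent follower. It is a sketch of the principle, not of how LLMs are actually built.

```python
from collections import Counter, defaultdict

corpus = "the sun rises in the east . the sun sets in the west .".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    # The most frequent follower of `word` in the training text.
    return bigrams[word].most_common(1)[0][0]

print(predict_next("sun"))  # 'rises' — the first of the equally likely followers
```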
The Model’s Brain: The Transformer
Why Transformers? Transformers revolutionized language modeling by introducing a way to look at relationships between all the words in a sentence simultaneously, rather than processing them one by one. Think of it as reading an entire paragraph at once to understand its meaning, rather than going word by word.
Key Components of the Transformer
Attention Mechanism: This mechanism allows the model to focus on the important words in a sentence, similar to how we focus on key details when reading. For example, in the sentence "The cat, which was small and fluffy, climbed the tree," the attention mechanism helps the model focus more on "cat" and "climbed the tree" than the details about the cat's appearance.
Positional Encoding: Since word order matters in language (e.g., "John hit the ball" is different from "The ball hit John"), the model needs to understand the positions of words. Positional encoding helps the model recognize word order in sentences.
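The sketch below (plain NumPy, one attention head, no learned projections) shows both ideas in miniature: scaled dot-product attention mixing word vectors by relevance, and sinusoidal positional encoding injecting word order. It is a simplified illustration of the Transformer's core, not a full implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how much each word attends to each other word
    weights = softmax(scores)        # rows sum to 1: an attention distribution
    return weights @ V               # weighted mix of the value vectors

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding, as in "Attention Is All You Need".
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 "words", 8-dimensional stand-in embeddings
x = x + positional_encoding(4, 8)  # inject word-order information
out = attention(x, x, x)           # self-attention: Q = K = V = x
print(out.shape)                   # (4, 8): one context-aware vector per word
```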
How Do We Teach the Model?
Step 1: Pre-training: Pre-training involves feeding the model a vast amount of text from books, articles, and websites. It learns general language rules, just like how a person learns basic grammar by reading. The goal is for the model to get good at predicting the next word across various contexts. During pre-training, the model might see a sentence like "The dog ___" and learn to guess "barked" or "ran" based on the context in similar sentences it has encountered.
Step 2: Fine-Tuning: Fine-tuning is like specialized training. After learning general language rules, the model is further trained on specific tasks, such as answering questions or writing code. This is done using smaller, more focused datasets that are relevant to the task. Fine-tuning helps the model adapt to particular types of content or writing styles.
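Schematically, pre-training boils down to next-token prediction with a cross-entropy loss. In this PyTorch sketch, `model` is a made-up stand-in for any network that maps token IDs to one score (logit) per vocabulary word:

```python
import torch
import torch.nn as nn

vocab_size = 100
# An illustrative stand-in, not a real LLM architecture.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 6))    # a toy "sentence" of 6 token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each next token

logits = model(inputs)                           # shape (1, 5, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # gradients nudge the model toward better next-token guesses
```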
Breaking Down Text: Tokenization
What is Tokenization? Tokenization splits text into smaller pieces called tokens, which can be words, parts of words, or even individual characters. For example, "reading" might be split into "read" and "ing" or just treated as a single token.
Why Tokenization Matters: The model processes text more efficiently when it works with tokens. It also allows the model to handle typos, slang, and compound words better. For instance, tokenizing "unhappiness" into "un," "happy," and "ness" helps the model understand the components of the word.
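The toy tokenizer below illustrates the idea with a greedy longest-match lookup over a hand-picked subword vocabulary. Real tokenizers (e.g., byte-pair encoding) learn their vocabularies from data, so the exact pieces shown here are only illustrative.

```python
# A hand-picked subword vocabulary, purely for illustration.
subwords = {"un", "happi", "ness", "read", "ing", "cat"}

def tokenize(word):
    tokens, start = [], 0
    while start < len(word):
        # Greedily take the longest known subword starting at `start`.
        for end in range(len(word), start, -1):
            if word[start:end] in subwords:
                tokens.append(word[start:end])
                start = end
                break
        else:
            tokens.append(word[start])  # unknown piece: fall back to one character
            start += 1
    return tokens

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(tokenize("reading"))      # ['read', 'ing']
```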
How Do We Know It’s Working? Evaluation
Perplexity: Perplexity measures how well a model predicts a sample of text. Think of it as the number of different choices the model hesitates between when guessing the next word. Lower perplexity indicates the model is making more confident predictions (a small worked example follows this list).
Human Preference Ratings: Human evaluators review the model's output for tasks such as summarizing an article or writing an essay. They rate the responses based on criteria like coherence, relevance, and accuracy. This helps improve the model by giving feedback on what it did well and where it struggled.
Aggregated Benchmarks: LLMs are also tested against standardized benchmarks, which consist of various language tasks. This helps compare the model’s performance against other models or previous versions.
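For instance, perplexity can be computed as the exponential of the average negative log-probability the model assigned to the true next tokens. The probabilities below are made up for illustration:

```python
import math

# The model's probability for each actual next token in a sample (made up).
token_probs = [0.25, 0.5, 0.1, 0.8]

avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_prob)
print(round(perplexity, 2))  # ~3.16: the model "hesitates between" about 3 choices
```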
Making the Model Better: Scaling Up
More Data, Better Results: The more data the model is trained on, the better it can learn language patterns. It’s similar to how a person who reads a lot can improve their vocabulary and understanding. Larger models with more data can generate more accurate and nuanced responses.
Optimizing Resources: Training LLMs requires significant computing power, often using thousands of GPUs (graphics processing units). Techniques like mixed precision training (using smaller numbers to speed up calculations) and parallel processing (splitting the training across many GPUs) help make training more efficient.
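As a sketch of the first technique, here is the standard PyTorch automatic mixed precision (AMP) pattern, assuming a CUDA-capable GPU: lower-precision (float16) math speeds up the forward and backward passes, while the gradient scaler guards against numerical underflow.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 512, device="cuda")
with torch.cuda.amp.autocast():    # run this region in mixed precision
    loss = model(x).square().mean()  # a toy loss, just for illustration

scaler.scale(loss).backward()      # scale the loss so gradients stay representable
scaler.step(optimizer)
scaler.update()
```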
Challenges and Future Directions
Handling Mistakes and Hallucinations: LLMs can sometimes generate incorrect or made-up information, known as "hallucinations." Researchers are working on ways to reduce these errors by improving how models are fine-tuned and evaluated.
Multimodal Language Models: The future of LLMs may involve combining text with other forms of data, like images or audio, to create models that understand and generate content across different types of media.
Ethical Considerations: Issues like data privacy, bias in training data, and the ethical use of AI are significant challenges. As models get smarter, addressing these concerns will be crucial to ensure responsible development.
Emergent Abilities in LLMs and Human Analogies
The key breakthrough in LLMs (Large Language Models) is that, as the models were scaled up to a large number of parameters, they began to display emergent abilities—behaviors that weren't observed at smaller scales. These emergent properties become apparent only when the models surpass a certain size threshold. The combination of the model’s architecture, extensive training data, and fine-tuning enables these abilities. The parallels to human cognitive abilities are remarkable, and this scaling has unlocked behaviors that smaller models cannot exhibit.
Complex Neural Architecture: Large Language Models (LLMs) are often built using sophisticated neural architectures like Transformers, which excel at learning complex patterns and relationships from vast amounts of data. Analogy: Just as we each have our own creative strengths, we're able to recognize patterns and make sense of sequences and their interconnections. More importantly, we know how to focus on the most relevant details, which enhances our ability to learn and create.
Large and Diverse Training Corpus: LLMs are trained on enormous and diverse datasets, pulling from a wide array of sources, topics, writing styles, and languages. This variety helps the models develop a broad linguistic and factual understanding. Analogy: Before mastering a field, we all have to immerse ourselves in the works that came before us. It's a rite of passage that prepares us to eventually become experts.
Unsupervised Learning: LLMs primarily rely on unsupervised (more precisely, self-supervised) learning: they predict the next word in a sentence without needing explicit human labels. This method allows the models to discover complex structures in language on their own. Analogy: Many creative geniuses are self-taught, and in the same way, you'll find yourself teaching and learning independently in whatever field you want to master.
Transfer Learning: LLMs use transfer learning, which allows them to apply knowledge learned in one domain to other areas. Analogy: There are universal patterns in the world that enable us to apply skills across different domains, just like how expertise in one area can translate to another.
Few-Shot and Zero-Shot Learning: LLMs, such as GPT-3, have the remarkable ability to perform tasks with few examples (few-shot learning) or even without examples (zero-shot learning), leveraging their prior knowledge and language understanding. Analogy: Have you ever heard of "improvisation"? It's the ability to create something new on the spot with minimal or no prior preparation, much like how LLMs tackle tasks with little guidance.
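In practice, few-shot learning is often just prompt construction: show the model a handful of worked examples, then the new input, and let it continue the pattern. The prompt below is illustrative; any completion-style API could consume it.

```python
# Build a few-shot sentiment-classification prompt (illustrative data).
examples = [
    ("I loved this movie!", "positive"),
    ("The plot was dull and slow.", "negative"),
]
new_input = "An absolute masterpiece of storytelling."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {new_input}\nSentiment:"  # the model completes from here

print(prompt)
```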