Large language models are everywhere now. You’ve probably used one to write an email, debug code, summarize a document, or just have an interesting conversation. But have you ever stopped to wonder what’s actually happening when you send a message and get a thoughtful, coherent response back?
The technology behind these systems is genuinely fascinating, and you don’t need a mathematics degree to understand the core ideas. Here’s a clear explanation of how LLMs work — from the training data all the way to the response you read on your screen.
Start With a Simple Framing: Next-Word Prediction
At its most fundamental level, a large language model is trained to predict what comes next in a sequence of text. Given the words “The cat sat on the,” a good model should predict “mat” (or “floor,” or “couch”) is likely to follow. Given “The capital of France is,” the model should predict “Paris.”
This sounds trivially simple. It isn’t. To predict the next word accurately across billions of examples of human writing — novels, scientific papers, code, forum posts, news articles, legal documents, poetry — the model has to develop a deep implicit understanding of language, facts about the world, reasoning patterns, and context. The prediction task is simple; what the model learns to solve it is extraordinarily rich.
Everything you’ve seen LLMs do — answering questions, writing stories, translating languages, explaining concepts, writing code — emerges from this single training objective applied at enormous scale.
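To make the prediction idea concrete, here is a deliberately tiny sketch: a frequency-count "model" that learns which word tends to follow which from a toy corpus. Real LLMs use neural networks over billions of examples, not lookup tables, but the objective — score likely continuations — is the same.

```python
from collections import Counter, defaultdict

# Toy illustration (not a real LLM): learn next-word statistics
# from a tiny corpus, then predict the most likely continuation.
corpus = "the cat sat on the mat . the cat sat on the couch .".split()

# Count which word follows each word in the corpus.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the word most frequently observed after `word`."""
    return following[word].most_common(1)[0][0]

print(predict_next("cat"))  # "sat" — the only word ever seen after "cat"
```

Even this toy version shows why scale matters: with only fourteen words of training data, it can answer almost nothing; with trillions, the same objective starts to capture facts and reasoning patterns.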
What Are Tokens?
Before a model can process text, the text needs to be converted into a format the model can work with mathematically. That process is called tokenization.
A token is roughly a chunk of text — sometimes a full word, sometimes part of a word, sometimes a punctuation mark. The sentence “I love learning about AI” might become five tokens: “I”, “love”, “learning”, “about”, “AI”. But the word “unbelievable” might become three tokens: “un”, “believ”, “able” — because it’s more efficient to break uncommon or long words into meaningful pieces that appear more frequently in the training data.
GPT-4 and similar models typically work with about 100,000 different tokens. Every piece of text going in or coming out is represented as a sequence of these tokens. The model sees numbers, not words — each token is represented by a number, and those numbers are converted into high-dimensional vectors (lists of hundreds or thousands of decimal numbers) that capture the meaning and relationships between tokens.
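The splitting itself can be sketched with a greedy longest-match tokenizer. This is an illustration only — the vocabulary below is invented for the example, and production tokenizers (such as byte-pair encoding) learn their vocabularies from frequency statistics over huge corpora — but it shows how a word becomes subword pieces and then integer IDs:

```python
# Hypothetical miniature vocabulary mapping each piece to an integer ID —
# the IDs are what the model actually sees.
vocab = {"un": 0, "believ": 1, "able": 2, "I": 3, "love": 4, "AI": 5}

def tokenize(word):
    """Greedily match the longest vocabulary piece at each position."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no token for {word[i:]!r}")
    return tokens

print(tokenize("unbelievable"))                       # ['un', 'believ', 'able']
print([vocab[t] for t in tokenize("unbelievable")])   # [0, 1, 2]
```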
The Transformer Architecture: Why It Changed Everything
Before 2017, language models existed but struggled with one critical problem: they couldn’t effectively handle long-range dependencies. A model might understand the first sentence of a paragraph just fine, but by the time it reaches sentence ten, it has largely forgotten what sentence one said. That made such models useful for short snippets but unreliable for anything requiring sustained understanding across a longer document.
The Transformer architecture, introduced in the 2017 paper “Attention Is All You Need,” solved this with a mechanism called self-attention. Self-attention lets the model look at every part of the input simultaneously and decide how much each part matters for understanding any other part.
Here’s the intuition: when reading the sentence “The trophy didn’t fit in the suitcase because it was too large,” you instinctively know that “it” refers to the trophy, not the suitcase. You made that judgment by understanding the relationship between “it,” “large,” “trophy,” and “suitcase” — not by processing the sentence left-to-right like an assembly line.
Self-attention works similarly. For every token in a sequence, it calculates how much attention to pay to every other token. This allows the model to understand relationships and context across any distance in the text, regardless of how far apart the relevant pieces are.
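The core computation — scaled dot-product attention — is compact enough to write out. This is a minimal single-head sketch in NumPy (real Transformers stack many such heads and layers, with learned weights):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: learned projections producing queries, keys, and values
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V  # each output is a weighted mixture of all value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per input token
```

Note that every output row mixes information from every input position at once — that simultaneous all-pairs comparison is what lets relationships span any distance.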
Modern LLMs like GPT-4, Claude, and Gemini are all based on the Transformer architecture. It is arguably the most important technical development in the history of natural language processing.
Training: How the Model Learns
Training a large language model is a massive computational process that happens in three broad stages.
Pre-training: Learning From the Internet
During pre-training, the model is exposed to an enormous corpus of text — hundreds of billions to trillions of words scraped from books, websites, code repositories, scientific papers, and more. For each piece of text, the model makes predictions about what comes next, compares its predictions to what actually came next, and adjusts its internal parameters (billions of numbers called weights) to make better predictions next time.
This process runs continuously on thousands of specialized AI chips (GPUs or TPUs) for weeks or months. Training GPT-4 reportedly cost over $100 million in compute alone. The result is a model that has compressed an enormous amount of human knowledge and language patterns into its weights — but that isn’t yet optimized to be helpful, honest, or safe as a conversational assistant.
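The "compares its predictions and adjusts" step is driven by a loss function — typically cross-entropy, which penalizes the model in proportion to how little probability it assigned to the token that actually came next. A minimal sketch with made-up numbers:

```python
import numpy as np

def cross_entropy_loss(logits, target_id):
    """Next-token prediction loss: minus the log of the probability
    the model assigned to the token that actually came next."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # softmax: raw scores become probabilities
    return -np.log(probs[target_id])

# Toy vocabulary of five tokens; these are the model's raw scores (logits).
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
loss_good = cross_entropy_loss(logits, target_id=0)  # model favored the right token
loss_bad = cross_entropy_loss(logits, target_id=3)   # model disfavored the right token
print(loss_good < loss_bad)  # True: better predictions yield lower loss
```

Training is, at heart, nudging billions of weights in whatever direction lowers this number, averaged over the whole corpus.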
Fine-tuning: Teaching It to Be Helpful
After pre-training, the model goes through fine-tuning — a second training phase on a much smaller, carefully curated dataset designed to make the model respond well in conversations. Human trainers write examples of good responses to various prompts. The model learns to follow this format.
RLHF: Learning From Human Preferences
Most advanced LLMs then go through a process called Reinforcement Learning from Human Feedback (RLHF). Human evaluators compare different model responses and rate which is better — more helpful, more accurate, less harmful. A separate model (the reward model) learns to predict these human preferences. The LLM is then further trained to generate responses that score highly according to the reward model.
This is a large part of why modern LLMs are so much more useful than their raw pre-trained counterparts. The pre-trained model knows a lot; the RLHF-tuned model knows how to be genuinely helpful with what it knows.
Context Windows: The Model’s Working Memory
One key limitation of LLMs is the context window — the maximum amount of text they can consider at once. Early GPT models had a context window of about 4,000 tokens (roughly 3,000 words). Modern models like GPT-4 Turbo and Claude 3 can handle 128,000 tokens or more — about 90,000 words, or a full novel.
Everything you’ve sent in a conversation, plus the model’s responses, needs to fit within this context window. When you have a very long conversation, older parts of it may get dropped. The model isn’t “remembering” your previous conversations the way a human colleague would — it’s processing everything within a single fixed-length context window each time it generates a response.
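A common (assumed — implementations differ) truncation strategy is to keep the most recent messages that fit and drop the oldest. A sketch, using a crude one-token-per-word counter as a stand-in for a real tokenizer:

```python
def fit_to_context(messages, max_tokens, count_tokens):
    """Keep the most recent messages that fit in the context window,
    dropping the oldest first."""
    kept, total = [], 0
    for msg in reversed(messages):  # walk newest-first
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order

# Crude stand-in token counter: one token per whitespace-separated word.
count = lambda text: len(text.split())

history = ["first message here", "a second message", "the latest message"]
print(fit_to_context(history, max_tokens=6, count_tokens=count))
# ['a second message', 'the latest message'] — the oldest message was dropped
```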
Inference: What Happens When You Send a Message
When you send a message to ChatGPT or Claude, here’s roughly what happens, step by step:
Your message is tokenized — converted from text into a sequence of numbers. Those numbers are fed through the model’s neural network, which involves passing the token representations through dozens of Transformer layers, each one applying self-attention and other operations. The final layer outputs a probability distribution over all possible next tokens — essentially a scored ranking of every token in the vocabulary.
The model samples from this distribution to pick the next token, adds it to the sequence, runs the whole thing again to pick the next token, and repeats until it generates a complete response. This token-by-token generation is why responses stream out word by word rather than appearing all at once — each word requires a separate forward pass through the model.
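The generation loop itself is simple to sketch. Here a hypothetical `toy_model` function stands in for the billions-of-parameters network; everything else — sample a token, append it, run the model again — mirrors the real loop:

```python
import numpy as np

def generate(model, prompt_ids, n_tokens, temperature=1.0, rng=None):
    """Token-by-token generation: each new token requires a full
    forward pass over the entire sequence so far."""
    rng = rng or np.random.default_rng(0)
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        logits = model(ids)                        # scores over the whole vocabulary
        probs = np.exp(logits / temperature)
        probs /= probs.sum()                       # softmax into a distribution
        next_id = rng.choice(len(probs), p=probs)  # sample the next token
        ids.append(next_id)
    return ids

# Hypothetical stand-in "model" with a 4-token vocabulary that
# strongly prefers token 2 regardless of input.
toy_model = lambda ids: np.array([0.0, 0.0, 5.0, 0.0])
print(generate(toy_model, [1], n_tokens=3))
```

The `temperature` parameter controls how adventurous the sampling is: low values concentrate probability on the top-ranked tokens, high values flatten the distribution.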
Why LLMs Make Things Up (Hallucinations)
The generation process above reveals why LLMs sometimes confidently state false information. The model is always generating the next most probable token given everything before it — it’s not retrieving stored facts, it’s pattern-matching at every step. When it encounters a question about something obscure or at the edge of its training data, it generates plausible-sounding text that fits the expected pattern of an answer — even if the content is wrong.
This is an active research area. Techniques like Retrieval-Augmented Generation (RAG) help by giving the model access to verified external information before generating a response, reducing its reliance on pattern-matching alone.
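At its simplest, RAG means fetching relevant passages first and prepending them to the prompt. The sketch below uses a crude keyword-overlap retriever as a stand-in — production systems use vector search over embeddings — and all the names here are invented for illustration:

```python
def build_rag_prompt(question, documents, retrieve):
    """Retrieval-Augmented Generation sketch: fetch relevant passages,
    then ask the model to ground its answer in them."""
    passages = retrieve(question, documents)
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the sources below.\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )

def keyword_retrieve(question, documents, top_k=2):
    """Crude retriever: rank documents by word overlap with the question."""
    words = set(question.lower().split())
    scored = sorted(documents, key=lambda d: -len(words & set(d.lower().split())))
    return scored[:top_k]

docs = [
    "The Eiffel Tower is 330 metres tall.",
    "Paris is the capital of France.",
    "Transformers were introduced in 2017.",
]
print(build_rag_prompt("What is the capital of France?", docs, keyword_retrieve))
```

Because the answer now sits verbatim in the context window, the model can copy a verified fact instead of reconstructing one from patterns in its weights.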
The Scale That Makes It All Work
One of the most surprising discoveries in LLM research is that scale doesn’t just improve performance incrementally — it unlocks entirely new capabilities. Models at certain size thresholds suddenly become able to do things smaller models simply cannot: multi-step reasoning, following complex instructions, learning new tasks from just a few examples (few-shot learning). These are called emergent capabilities, and they’ve driven much of the excitement (and concern) around large language models over the past few years.
Understanding this basic architecture — tokens, transformers, attention, training, and inference — gives you a solid foundation for understanding why LLMs behave the way they do: their strengths, their limitations, and the directions the field is moving.
