How LLMs Work

March 28, 2026
#genai #ai #llm #machine-learning #transformers

In What is Generative AI?, we covered the big picture of GenAI and where Large Language Models fit in. Now let’s go a level deeper. How does an LLM actually generate text? What’s happening when ChatGPT or Claude “thinks” about your question?

You don’t need a PhD in machine learning to understand this. We’ll walk through the core ideas — transformers, attention, and next-token prediction — with enough depth to make you a more effective developer when working with these models.

The Core Idea: Next-Token Prediction

At its heart, an LLM does one thing: given a sequence of tokens, predict the most likely next token.

That’s it. Every impressive thing you’ve seen an LLM do — writing essays, debugging code, translating languages — comes down to repeatedly predicting “what word comes next?”

Here’s a simplified example. Given the input:

The capital of France is

The model assigns probabilities to possible next tokens:

  Token   Probability
  ------  -----------
  Paris   0.92
  the     0.03
  a       0.02
  Lyon    0.01

It picks “Paris” (or samples from the distribution, depending on the temperature setting), appends it to the sequence, and then predicts the next token after “The capital of France is Paris”. This process repeats until the model produces a stop token or hits the maximum length.

This is called autoregressive generation — each new token is generated based on all the tokens that came before it.
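The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real model: the hard-coded `TOY_MODEL` dictionary stands in for the network's forward pass, and the `<stop>` token name is made up for the example.

```python
# Toy "model": maps a context string to a next-token distribution.
# In a real LLM this would be a forward pass through the network.
TOY_MODEL = {
    "The capital of France is": {"Paris": 0.92, "the": 0.03, "a": 0.02, "Lyon": 0.01},
    "The capital of France is Paris": {"<stop>": 0.95, ".": 0.05},
}

def generate(prompt, max_tokens=10):
    """Greedy autoregressive generation: pick the highest-probability
    token, append it to the context, and repeat until a stop token
    or the length limit."""
    context = prompt
    for _ in range(max_tokens):
        dist = TOY_MODEL.get(context)
        if dist is None:
            break
        token = max(dist, key=dist.get)  # greedy decoding (temperature 0)
        if token == "<stop>":
            break
        context = context + " " + token
    return context

print(generate("The capital of France is"))  # -> The capital of France is Paris
```

The key point is that the model is called once per generated token, each time with everything produced so far as input.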

The Transformer Architecture

The breakthrough that made modern LLMs possible is the transformer, introduced in the 2017 paper “Attention Is All You Need” by researchers at Google. Before transformers, language models used recurrent neural networks (RNNs) that processed text one word at a time, sequentially. This was slow and made it hard to capture relationships between words that were far apart.

Transformers solved both problems. Here’s how they work at a high level:

Input: Tokenization and Embeddings

Before the model can process text, it needs to convert it into numbers. This happens in two steps:

  1. Tokenization — The input text is split into tokens (roughly word fragments). “Programming is fun” might become ["Program", "ming", " is", " fun"]. We cover this in detail in Tokens, Context Windows & Model Parameters.

  2. Embeddings — Each token is converted into a high-dimensional vector (a list of numbers). These vectors encode the meaning of each token in a way the model can work with. Words with similar meanings end up with similar vectors.

  3. Positional encoding — Since transformers process all tokens simultaneously (not sequentially), they need a way to know the order of tokens. Positional encodings are added to the embeddings to give the model information about where each token appears in the sequence.
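The sinusoidal positional encoding from the original transformer paper can be sketched directly (real embedding tables are learned during training; the one below is a made-up stand-in):

```python
import math

def positional_encoding(position, d_model=8):
    """Sinusoidal positional encoding from "Attention Is All You Need":
    even dimensions use sine, odd dimensions use cosine, at geometrically
    spaced frequencies, so every position gets a unique pattern."""
    enc = []
    for i in range(d_model):
        freq = 1.0 / (10000 ** (2 * (i // 2) / d_model))
        angle = position * freq
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

# Toy embedding table: each token id maps to a vector (learned in practice).
embeddings = {0: [0.1] * 8, 1: [0.2] * 8}

# Model input for token id 1 at position 3 = embedding + positional encoding.
x = [e + p for e, p in zip(embeddings[1], positional_encoding(3))]
```

Because the encoding is added element-wise, the same token gets a slightly different input vector at each position, which is how the model tells "dog bites man" from "man bites dog".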

The Attention Mechanism

The attention mechanism is the key innovation of transformers. It allows the model to look at all tokens in the input simultaneously and figure out which ones are most relevant to each other.

Consider this sentence:

The cat sat on the mat because it was tired.

What does “it” refer to? A human immediately knows it refers to “the cat,” not “the mat.” The attention mechanism lets the model make this same connection by computing how strongly each token should “attend to” every other token.

Here’s the intuition:

  • For each token, the model creates three vectors: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information do I provide?”).
  • The model compares each token’s Query against every other token’s Key to compute attention scores — essentially, “how relevant is token B to token A?”
  • These scores are used to create a weighted combination of the Value vectors, producing a new representation for each token that incorporates context from the entire sequence.

In practice, transformers use multi-head attention, which runs multiple attention computations in parallel. Each “head” can learn to focus on different types of relationships — one head might focus on syntactic relationships, another on semantic ones, another on positional proximity.
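The Query/Key/Value recipe above can be written out as a single attention head in NumPy. This is a minimal sketch with random toy vectors, not the real model's learned weights:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: weights = softmax(Q @ K.T / sqrt(d_k)),
    output = weights @ V. Each output row is a context-weighted mix
    of the Value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

# Three toy tokens, each with a 4-dimensional Query, Key, and Value.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
# Each row of `weights` sums to 1: a probability distribution over tokens.
```

Multi-head attention simply runs several copies of this computation in parallel, each with its own learned projections for Q, K, and V, and concatenates the results.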

Stacking Layers

A transformer doesn’t just run attention once. Modern LLMs stack dozens or even hundreds of transformer layers (also called blocks). Each layer refines the model’s understanding:

  • Early layers tend to capture basic patterns — syntax, grammar, common phrases
  • Middle layers capture more complex relationships — semantics, entity references, logical structure
  • Later layers capture high-level reasoning and task-specific patterns

GPT-4 is rumored to have over 100 layers. Claude and other frontier models are similar in scale.

Output: Predicting the Next Token

After passing through all the transformer layers, the model produces a probability distribution over its entire vocabulary (typically 30,000–100,000+ tokens). The token with the highest probability — or a token sampled from the distribution — becomes the output.
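Temperature-scaled sampling over that distribution can be sketched like this (the logit values below are made up for illustration):

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Turn raw model scores (logits) into probabilities with a
    temperature-scaled softmax, then sample one token. Lower temperature
    sharpens the distribution toward the top token; higher flattens it."""
    scaled = [score / temperature for score in logits.values()]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(list(logits), weights=probs)[0]

# Hypothetical logits over a tiny vocabulary.
logits = {"Paris": 5.0, "the": 1.5, "a": 1.0, "Lyon": 0.5}
```

At a very low temperature this behaves like greedy decoding (always "Paris"); at a high temperature the long tail of the vocabulary gets a real chance of being picked.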

The Training Process

How does a model learn all these patterns? Through training on massive datasets.

Pre-Training

During pre-training, the model is shown enormous amounts of text and learns to predict the next token. The process looks like this:

  1. Take a chunk of text from the training data
  2. Show the model the text up to some position
  3. Have it predict the next token
  4. Compare the prediction to the token that actually appears next in the text
  5. Adjust the model’s parameters (via backpropagation) to make the prediction more accurate
  6. Repeat — billions of times

This is called self-supervised learning because the training data provides its own labels. The model doesn’t need humans to manually label anything — it just learns from the structure of the text itself.
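The training objective itself is simple to state: cross-entropy loss on the next token. Here's a minimal sketch of the loss for a single prediction, reusing the earlier toy probabilities (parameter updates via backpropagation are omitted):

```python
import math

def next_token_loss(probs, target):
    """Cross-entropy loss for one prediction: -log(probability the model
    assigned to the token that actually came next in the training text).
    Training adjusts the parameters to push this loss down."""
    return -math.log(probs[target])

# The text supplies its own label: after "The capital of France is",
# the next token in the data is "Paris".
probs = {"Paris": 0.92, "the": 0.03, "a": 0.02, "Lyon": 0.01, "<other>": 0.02}

low = next_token_loss(probs, "Paris")  # small loss: confident and correct
high = next_token_loss(probs, "Lyon")  # large loss: little mass on the answer
```

This is the "self-supervised" part in code form: the target token comes from the text itself, so no human labeling is needed.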

The scale of pre-training is staggering. Models like GPT-4 and Claude are trained on trillions of tokens from books, websites, code repositories, academic papers, and more, using thousands of GPUs running for months.

Fine-Tuning and RLHF

A pre-trained model is good at predicting text, but it’s not necessarily good at being helpful. It might complete your prompt with something that’s statistically likely but not actually useful.

To make models useful as assistants, they go through additional training:

  • Supervised Fine-Tuning (SFT) — The model is trained on examples of high-quality conversations: human questions paired with ideal responses.
  • Reinforcement Learning from Human Feedback (RLHF) — Human raters rank different model responses from best to worst. A reward model is trained on these rankings, and then the LLM is fine-tuned to produce responses that the reward model scores highly.
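The reward-model step in RLHF is commonly trained with a pairwise (Bradley–Terry style) loss on the human rankings. A minimal sketch, assuming scalar reward scores for a preferred and a rejected response:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Minimizing this over many ranked pairs teaches the reward model to agree with human raters; the LLM is then fine-tuned to maximize that learned reward.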

This is why ChatGPT and Claude behave like helpful assistants rather than just autocomplete engines — they’ve been specifically trained to be helpful, harmless, and honest.

Parameters: What the Model “Knows”

When people say a model has “70 billion parameters,” they’re referring to the number of adjustable weights in the model’s neural network. These parameters are the numbers that get tuned during training.

Think of parameters as the model’s learned knowledge, encoded as numbers. More parameters generally means:

  • More capacity to store patterns and knowledge
  • Better performance on complex tasks
  • More computational resources needed to run

Here’s a rough sense of scale:

  Model            Parameters
  ---------------  ----------------------
  GPT-2 (2019)     1.5 billion
  GPT-3 (2020)     175 billion
  GPT-4 (2023)     Estimated 1+ trillion
  LLaMA 3 (2024)   8B – 405B
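Parameter counts translate directly into hardware requirements. A rough back-of-the-envelope, assuming 16-bit (2-byte) weights and ignoring activations and the KV cache:

```python
def model_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory needed just to hold the weights
    (fp16/bf16 = 2 bytes per parameter). Actual inference needs
    additional memory for activations and the KV cache."""
    return num_params * bytes_per_param / 1e9

print(model_memory_gb(70e9))  # -> 140.0 GB for a 70B model in fp16
print(model_memory_gb(8e9))   # -> 16.0 GB for an 8B model
```

This is why small models run on a laptop while frontier models need clusters of GPUs.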

Why This Matters for Developers

Understanding how LLMs work isn’t just academic — it directly impacts how effectively you use them:

  • Prompt design — Knowing that models predict tokens sequentially explains why giving clear context upfront produces better results. The model can only attend to what’s in the prompt.
  • Temperature and sampling — Understanding the probability distribution over tokens explains what temperature does: low temperature picks the highest-probability token (more deterministic), high temperature samples more broadly (more creative).
  • Context window limits — The attention mechanism computes relationships between all tokens, which means computation grows quadratically with input length. This is why context windows have limits.
  • Hallucinations — The model is always predicting the most likely next token. If it doesn’t have relevant knowledge, it will still produce something plausible-sounding — because that’s what’s statistically likely.
  • Token costs — API pricing is based on tokens processed. Understanding tokenization helps you estimate costs and optimize prompts.
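The quadratic-cost point above is easy to see with a little arithmetic: attention compares every token with every other token, so the score matrix grows with the square of the context length.

```python
def attention_pairs(context_tokens):
    """Number of entries in the attention score matrix (per head,
    per layer): every token attends to every token."""
    return context_tokens ** 2

print(attention_pairs(1_000))  # -> 1000000
print(attention_pairs(2_000))  # -> 4000000 (doubling context quadruples the work)
```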

What’s Next?

Now that you understand the architecture, the next step is to get practical with the building blocks. In Tokens, Context Windows & Model Parameters, we’ll look at how text gets split into tokens, what context windows mean for your applications, and how settings like temperature affect output — all things you’ll work with directly when building with LLM APIs.
