How LLMs Work

March 28, 2026
#genai #ai #llm #machine-learning #transformers

In What is Generative AI?, we covered the big picture of GenAI and where Large Language Models fit in. Now let’s go a level deeper. How does an LLM actually generate text? What’s happening when ChatGPT or Claude “thinks” about your question?

You don’t need a PhD in machine learning to understand this. We’ll walk through the core ideas — transformers, attention, and next-token prediction — with enough depth to make you a more effective developer when working with these models.

The Core Idea: Next-Token Prediction

At its heart, an LLM does one thing: given a sequence of tokens, predict the most likely next token.

That’s it. Every impressive thing you’ve seen an LLM do — writing essays, debugging code, translating languages — comes down to repeatedly predicting “what word comes next?”

Here’s a simplified example. Given the input:

The capital of France is

The model assigns probabilities to possible next tokens:

  Token   Probability
  ------  -----------
  Paris   0.92
  the     0.03
  a       0.02
  Lyon    0.01

It picks “Paris” (or samples from the distribution, depending on the temperature setting), appends it to the sequence, and then predicts the next token after “The capital of France is Paris”. This process repeats until the model produces a stop token or hits the maximum length.

This is called autoregressive generation — each new token is generated based on all the tokens that came before it.
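The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real model: the hard-coded `TOY_MODEL` dictionary stands in for the network's forward pass, and the `<stop>` token name is made up for the example.

```python
# Toy "model": maps a context string to a next-token distribution.
# In a real LLM this would be a forward pass through the network.
TOY_MODEL = {
    "The capital of France is": {"Paris": 0.92, "the": 0.03, "a": 0.02, "Lyon": 0.01},
    "The capital of France is Paris": {"<stop>": 0.95, ".": 0.05},
}

def generate(prompt, max_tokens=10):
    """Greedy autoregressive generation: pick the highest-probability
    token, append it to the context, and repeat until a stop token
    or the length limit."""
    context = prompt
    for _ in range(max_tokens):
        dist = TOY_MODEL.get(context)
        if dist is None:
            break
        token = max(dist, key=dist.get)  # greedy decoding (temperature 0)
        if token == "<stop>":
            break
        context = context + " " + token
    return context

print(generate("The capital of France is"))  # -> The capital of France is Paris
```

The key point is that the model is called once per generated token, each time with everything produced so far as input.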

The Transformer Architecture

The breakthrough that made modern LLMs possible is the transformer, introduced in the 2017 paper “Attention Is All You Need” by researchers at Google. Before transformers, language models used recurrent neural networks (RNNs) that processed text one word at a time, sequentially. This was slow and made it hard to capture relationships between words that were far apart.

Transformers solved both problems. Here’s how they work at a high level:

Input: Tokenization and Embeddings

Before the model can process text, it needs to convert it into numbers. This happens in two steps:

  1. Tokenization — The input text is split into tokens (roughly word fragments). “Programming is fun” might become ["Program", "ming", " is", " fun"]. We cover this in detail in Tokens, Context Windows & Model Parameters.

  2. Embeddings — Each token is converted into a high-dimensional vector (a list of numbers). These vectors encode the meaning of each token in a way the model can work with. Words with similar meanings end up with similar vectors.

  3. Positional encoding — Since transformers process all tokens simultaneously (not sequentially), they need a way to know the order of tokens. Positional encodings are added to the embeddings to give the model information about where each token appears in the sequence.
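The sinusoidal positional encoding from the original transformer paper can be sketched directly (real embedding tables are learned during training; the one below is a made-up stand-in):

```python
import math

def positional_encoding(position, d_model=8):
    """Sinusoidal positional encoding from "Attention Is All You Need":
    even dimensions use sine, odd dimensions use cosine, at geometrically
    spaced frequencies, so every position gets a unique pattern."""
    enc = []
    for i in range(d_model):
        freq = 1.0 / (10000 ** (2 * (i // 2) / d_model))
        angle = position * freq
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

# Toy embedding table: each token id maps to a vector (learned in practice).
embeddings = {0: [0.1] * 8, 1: [0.2] * 8}

# Model input for token id 1 at position 3 = embedding + positional encoding.
x = [e + p for e, p in zip(embeddings[1], positional_encoding(3))]
```

Because the encoding is added element-wise, the same token gets a slightly different input vector at each position, which is how the model tells "dog bites man" from "man bites dog".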

The Attention Mechanism

The attention mechanism is the key innovation of transformers. It allows the model to look at all tokens in the input simultaneously and figure out which ones are most relevant to each other.

Consider this sentence:

The cat sat on the mat because it was tired.

What does “it” refer to? A human immediately knows it refers to “the cat,” not “the mat.” The attention mechanism lets the model make this same connection by computing how strongly each token should “attend to” every other token.

Here’s the intuition:

  • For each token, the model creates three vectors: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information do I provide?”).
  • The model compares each token’s Query against every other token’s Key to compute attention scores — essentially, “how relevant is token B to token A?”
  • These scores are used to create a weighted combination of the Value vectors, producing a new representation for each token that incorporates context from the entire sequence.

In practice, transformers use multi-head attention, which runs multiple attention computations in parallel. Each “head” can learn to focus on different types of relationships — one head might focus on syntactic relationships, another on semantic ones, another on positional proximity.
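The Query/Key/Value recipe above can be written out as a single attention head in NumPy. This is a minimal sketch with random toy vectors, not the real model's learned weights:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: weights = softmax(Q @ K.T / sqrt(d_k)),
    output = weights @ V. Each output row is a context-weighted mix
    of the Value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

# Three toy tokens, each with a 4-dimensional Query, Key, and Value.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
# Each row of `weights` sums to 1: a probability distribution over tokens.
```

Multi-head attention simply runs several copies of this computation in parallel, each with its own learned projections for Q, K, and V, and concatenates the results.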

Stacking Layers

A transformer doesn’t just run attention once. Modern LLMs stack dozens or even hundreds of transformer layers (also called blocks). Each layer refines the model’s understanding:

  • Early layers tend to capture basic patterns — syntax, grammar, common phrases
  • Middle layers capture more complex relationships — semantics, entity references, logical structure
  • Later layers capture high-level reasoning and task-specific patterns

GPT-4 is rumored to have over 100 layers. Claude and other frontier models are similar in scale.

Output: Predicting the Next Token

After passing through all the transformer layers, the model produces a probability distribution over its entire vocabulary (typically 30,000–100,000+ tokens). The token with the highest probability — or a token sampled from the distribution — becomes the output.
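Temperature-scaled sampling over that distribution can be sketched like this (the logit values below are made up for illustration):

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Turn raw model scores (logits) into probabilities with a
    temperature-scaled softmax, then sample one token. Lower temperature
    sharpens the distribution toward the top token; higher flattens it."""
    scaled = [score / temperature for score in logits.values()]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(list(logits), weights=probs)[0]

# Hypothetical logits over a tiny vocabulary.
logits = {"Paris": 5.0, "the": 1.5, "a": 1.0, "Lyon": 0.5}
```

At a very low temperature this behaves like greedy decoding (always "Paris"); at a high temperature the long tail of the vocabulary gets a real chance of being picked.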

The Training Process

How does a model learn all these patterns? Through training on massive datasets.

Pre-Training

During pre-training, the model is shown enormous amounts of text and learns to predict the next token. The process looks like this:

  1. Take a chunk of text from the training data
  2. Show the model the text up to some position
  3. Have it predict the next token
  4. Compare the prediction to the token that actually appears next in the text
  5. Adjust the model’s parameters (via backpropagation) to make the prediction more accurate
  6. Repeat — billions of times

This is called self-supervised learning because the training data provides its own labels. The model doesn’t need humans to manually label anything — it just learns from the structure of the text itself.
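The training objective itself is simple to state: cross-entropy loss on the next token. Here's a minimal sketch of the loss for a single prediction, reusing the earlier toy probabilities (parameter updates via backpropagation are omitted):

```python
import math

def next_token_loss(probs, target):
    """Cross-entropy loss for one prediction: -log(probability the model
    assigned to the token that actually came next in the training text).
    Training adjusts the parameters to push this loss down."""
    return -math.log(probs[target])

# The text supplies its own label: after "The capital of France is",
# the next token in the data is "Paris".
probs = {"Paris": 0.92, "the": 0.03, "a": 0.02, "Lyon": 0.01, "<other>": 0.02}

low = next_token_loss(probs, "Paris")  # small loss: confident and correct
high = next_token_loss(probs, "Lyon")  # large loss: little mass on the answer
```

This is the "self-supervised" part in code form: the target token comes from the text itself, so no human labeling is needed.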

The scale of pre-training is staggering. Models like GPT-4 and Claude are trained on trillions of tokens from books, websites, code repositories, academic papers, and more, using thousands of GPUs running for months.

Fine-Tuning and RLHF

A pre-trained model is good at predicting text, but it’s not necessarily good at being helpful. It might complete your prompt with something that’s statistically likely but not actually useful.

To make models useful as assistants, they go through additional training:

  • Supervised Fine-Tuning (SFT) — The model is trained on examples of high-quality conversations: human questions paired with ideal responses.
  • Reinforcement Learning from Human Feedback (RLHF) — Human raters rank different model responses from best to worst. A reward model is trained on these rankings, and then the LLM is fine-tuned to produce responses that the reward model scores highly.
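The reward-model step in RLHF is commonly trained with a pairwise (Bradley–Terry style) loss on the human rankings. A minimal sketch, assuming scalar reward scores for a preferred and a rejected response:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Minimizing this over many ranked pairs teaches the reward model to agree with human raters; the LLM is then fine-tuned to maximize that learned reward.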

This is why ChatGPT and Claude behave like helpful assistants rather than just autocomplete engines — they’ve been specifically trained to be helpful, harmless, and honest.

Parameters: What the Model “Knows”

When people say a model has “70 billion parameters,” they’re referring to the number of adjustable weights in the model’s neural network. These parameters are the numbers that get tuned during training.

Think of parameters as the model’s learned knowledge, encoded as numbers. More parameters generally means:

  • More capacity to store patterns and knowledge
  • Better performance on complex tasks
  • More computational resources needed to run

Here’s a rough sense of scale:

  Model            Parameters
  ---------------  ----------------------
  GPT-2 (2019)     1.5 billion
  GPT-3 (2020)     175 billion
  GPT-4 (2023)     Estimated 1+ trillion
  LLaMA 3 (2024)   8B – 405B
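Parameter counts translate directly into hardware requirements. A rough back-of-the-envelope, assuming 16-bit (2-byte) weights and ignoring activations and the KV cache:

```python
def model_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory needed just to hold the weights
    (fp16/bf16 = 2 bytes per parameter). Actual inference needs
    additional memory for activations and the KV cache."""
    return num_params * bytes_per_param / 1e9

print(model_memory_gb(70e9))  # -> 140.0 GB for a 70B model in fp16
print(model_memory_gb(8e9))   # -> 16.0 GB for an 8B model
```

This is why small models run on a laptop while frontier models need clusters of GPUs.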

Why This Matters for Developers

Understanding how LLMs work isn’t just academic — it directly impacts how effectively you use them:

  • Prompt design — Knowing that models predict tokens sequentially explains why giving clear context upfront produces better results. The model can only attend to what’s in the prompt.
  • Temperature and sampling — Understanding the probability distribution over tokens explains what temperature does: low temperature picks the highest-probability token (more deterministic), high temperature samples more broadly (more creative).
  • Context window limits — The attention mechanism computes relationships between all tokens, which means computation grows quadratically with input length. This is why context windows have limits.
  • Hallucinations — The model is always predicting the most likely next token. If it doesn’t have relevant knowledge, it will still produce something plausible-sounding — because that’s what’s statistically likely.
  • Token costs — API pricing is based on tokens processed. Understanding tokenization helps you estimate costs and optimize prompts.
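The quadratic-cost point above is easy to see with a little arithmetic: attention compares every token with every other token, so the score matrix grows with the square of the context length.

```python
def attention_pairs(context_tokens):
    """Number of entries in the attention score matrix (per head,
    per layer): every token attends to every token."""
    return context_tokens ** 2

print(attention_pairs(1_000))  # -> 1000000
print(attention_pairs(2_000))  # -> 4000000 (doubling context quadruples the work)
```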

What’s Next?

Now that you understand the architecture, the next step is to get practical with the building blocks. In Tokens, Context Windows & Model Parameters, we’ll look at how text gets split into tokens, what context windows mean for your applications, and how settings like temperature affect output — all things you’ll work with directly when building with LLM APIs.
