Tokens, Context Windows & Model Parameters
When you work with LLM APIs, three concepts come up constantly: tokens, context windows, and model parameters like temperature. These aren’t abstract theory — they directly affect your costs, the quality of responses, and what you can build. This tutorial covers all three with practical examples.
If you haven’t already, read How LLMs Work first for the underlying architecture.
Tokens
LLMs don’t read text the way humans do. They break input into tokens — chunks that are roughly word fragments. Tokenization is the first step in every LLM interaction, and understanding it helps you write better prompts, estimate costs, and debug unexpected behavior.
How Tokenization Works
Most modern LLMs use a tokenization algorithm called Byte Pair Encoding (BPE). The idea is:
- Start with individual characters as tokens
- Find the most frequently occurring pair of adjacent tokens in the training data
- Merge that pair into a new token
- Repeat thousands of times
The result is a vocabulary of typically 30,000–100,000 tokens that efficiently represents text. Common words become single tokens, while rare words get split into pieces.
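The merge loop above can be sketched in a few lines. This is a toy illustration only — real BPE tokenizers such as tiktoken operate on bytes and learn tens of thousands of merges — but it shows the core operation: count adjacent pairs, pick the most frequent, merge.

```typescript
// Find the most frequent adjacent pair of tokens (null if fewer than 2 tokens).
function mostFrequentPair(tokens: string[]): [string, string] | null {
  const counts = new Map<string, number>();
  for (let i = 0; i < tokens.length - 1; i++) {
    const key = tokens[i] + "\u0000" + tokens[i + 1];
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best: string | null = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  }
  return best ? (best.split("\u0000") as [string, string]) : null;
}

// Merge every occurrence of the pair into a single new token.
function mergePair(tokens: string[], pair: [string, string]): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < tokens.length) {
    if (i < tokens.length - 1 && tokens[i] === pair[0] && tokens[i + 1] === pair[1]) {
      out.push(pair[0] + pair[1]);
      i += 2;
    } else {
      out.push(tokens[i]);
      i += 1;
    }
  }
  return out;
}

// One training step on "banana", starting from individual characters:
let tokens = "banana".split(""); // ["b", "a", "n", "a", "n", "a"]
const pair = mostFrequentPair(tokens); // ["a", "n"] — appears twice
tokens = mergePair(tokens, pair!); // ["b", "an", "an", "a"]
```

Repeating this step thousands of times on a large corpus produces the final vocabulary, with the most common sequences merged into single tokens.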
Here are some examples of how text gets tokenized (using GPT-style tokenization):
| Text | Tokens | Count |
|---|---|---|
| Hello world | ["Hello", " world"] | 2 |
| Tokenization | ["Token", "ization"] | 2 |
| GPT-4o | ["G", "PT", "-", "4", "o"] | 5 |
| こんにちは | ["こん", "にち", "は"] | 3 |
| function add(a, b) { return a + b; } | ["function", " add", "(", "a", ",", " b", ")", " {", " return", " a", " +", " b", ";", " }"] | 14 |
A few things to notice:
- Spaces are part of tokens. The space before “world” is included in the token " world". This is why tokenizers sometimes produce unexpected results.
- Common words are single tokens. Words like “the”, “function”, and “return” are each one token.
- Rare or compound words get split. “Tokenization” becomes two tokens.
- Code is token-heavy. Punctuation, brackets, and operators each consume tokens.
- Non-English text uses more tokens. Languages with non-Latin scripts typically require more tokens per word.
Counting Tokens in Practice
When building applications, you’ll often need to count tokens to stay within limits and estimate costs. Here’s how to do it with the popular tiktoken library (used by OpenAI models):
import { encoding_for_model } from "tiktoken";
const enc = encoding_for_model("gpt-4o");
const tokens = enc.encode("How many tokens is this?");
console.log(tokens.length); // 6
enc.free();
Why Tokens Matter
- Pricing — API providers charge per token (both input and output). A typical rate might be $2.50 per million input tokens and $10 per million output tokens for a frontier model.
- Speed — More tokens = longer generation time. Output tokens are generated sequentially, so a 1,000-token response takes roughly 10x longer than a 100-token response.
- Context limits — Every model has a maximum number of tokens it can process at once (the context window). Your input and the model’s output must fit within this limit.
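Estimating cost from token counts is simple arithmetic. A minimal sketch using the illustrative rates mentioned above ($2.50 per million input tokens, $10 per million output tokens — check your provider's current pricing):

```typescript
// Illustrative rates in dollars per token ($2.50 / $10 per million tokens).
const INPUT_RATE = 2.5 / 1_000_000;
const OUTPUT_RATE = 10.0 / 1_000_000;

// Estimate the dollar cost of a single request.
function estimateCost(inputTokens: number, outputTokens: number): number {
  return inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE;
}

// A 2,000-token prompt with a 500-token response:
const cost = estimateCost(2_000, 500);
// 2,000 × $0.0000025 + 500 × $0.00001 = $0.005 + $0.005 = $0.01
```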
Context Windows
The context window is the maximum number of tokens a model can handle in a single request — including both your input (the prompt) and the model’s output (the completion).
Context Window Sizes
Context windows have grown dramatically:
| Model | Context Window |
|---|---|
| GPT-3 (2020) | 4,096 tokens |
| GPT-4 (2023) | 8K / 32K tokens (128K with GPT-4 Turbo) |
| Claude 3.5 (2024) | 200K tokens |
| Gemini 1.5 Pro (2024) | 1M+ tokens |
| GPT-4o (2024) | 128K tokens |
To put this in perspective, 128K tokens is roughly 96,000 words — about the length of a full novel. A 200K context window can hold several books or an entire codebase.
Working Within Context Limits
Even with large context windows, you need to manage them carefully:
Input + output must fit. If your model has a 128K context window and your prompt uses 120K tokens, the model can only generate 8K tokens of output. Most APIs let you set a max_tokens parameter to control output length.
Longer contexts cost more. Both in money (more tokens = higher cost) and latency (the attention mechanism scales quadratically with sequence length).
Information can get “lost” in the middle. Research has shown that LLMs pay more attention to information at the beginning and end of the context window. If you’re stuffing a lot of documents into a prompt, put the most important content first or last.
// Setting max output tokens in an API call
const response = await client.chat.completions.create({
model: "gpt-4o",
max_tokens: 500,
messages: [{ role: "user", content: prompt }],
});
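When packing many documents into a prompt, a simple guard is to add them in order of importance until a token budget is exhausted. A minimal sketch, using a rough characters-per-token heuristic as a stand-in — swap in a real tokenizer like tiktoken for accurate counts:

```typescript
// Rough heuristic: ~4 characters per token for English text.
// Replace with a real tokenizer (e.g. tiktoken) for accurate budgeting.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Add documents (already sorted most-important-first) until the budget runs out.
function packDocuments(docs: string[], budgetTokens: number): string[] {
  const packed: string[] = [];
  let used = 0;
  for (const doc of docs) {
    const cost = approxTokens(doc);
    if (used + cost > budgetTokens) break;
    packed.push(doc);
    used += cost;
  }
  return packed;
}
```

Sorting documents by importance before packing also works with the "lost in the middle" effect rather than against it: the content most likely to matter ends up near the start of the prompt.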
Model Parameters
When calling an LLM API, you can tune several parameters that control how the model generates text. These don’t change the model itself — they change the sampling strategy used to pick tokens from the probability distribution.
Temperature
Temperature controls the randomness of the output. It adjusts the probability distribution before a token is selected:
- Temperature 0 — Always picks the highest-probability token. Output is deterministic and focused. Good for factual questions, code generation, and structured output.
- Temperature 0.5–0.7 — A balanced middle ground. Some variety while staying coherent.
- Temperature 1.0 — Samples proportionally from the full distribution. More creative and varied, but also more likely to produce unexpected or off-topic results.
- Temperature > 1.0 — Flattens the distribution further, making unlikely tokens more probable. Rarely useful in practice.
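The adjustment itself is a one-liner: divide each logit by the temperature before applying softmax. A minimal sketch (note that temperature 0 is handled as a special case in practice — the API simply picks the argmax rather than dividing by zero):

```typescript
// Convert raw logits to probabilities, scaled by temperature.
// Lower temperature sharpens the distribution; higher flattens it.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [2.0, 1.0, 0.1];
softmaxWithTemperature(logits, 0.5); // sharper: the top token dominates
softmaxWithTemperature(logits, 1.0); // the model's raw distribution
softmaxWithTemperature(logits, 2.0); // flatter: unlikely tokens gain mass
```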
// Low temperature for factual/deterministic output
const factual = await client.chat.completions.create({
model: "gpt-4o",
temperature: 0,
messages: [{ role: "user", content: "What is the capital of France?" }],
});
// Higher temperature for creative writing
const creative = await client.chat.completions.create({
model: "gpt-4o",
temperature: 0.9,
messages: [{ role: "user", content: "Write a haiku about debugging." }],
});
Top-p (Nucleus Sampling)
Top-p is an alternative to temperature for controlling randomness. Instead of adjusting the distribution, it limits which tokens are considered:
- Top-p 0.1 — Only considers tokens in the top 10% of probability mass. Very focused.
- Top-p 0.9 — Considers tokens that make up 90% of the probability mass. More varied.
- Top-p 1.0 — Considers all tokens (default).
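The candidate-selection step can be sketched directly: sort tokens by probability and keep the smallest set whose cumulative probability reaches top-p. Sampling then happens only within that "nucleus":

```typescript
// Return the indices of the tokens kept under nucleus (top-p) sampling:
// the smallest most-likely-first set whose cumulative probability >= topP.
function nucleus(probs: number[], topP: number): number[] {
  const indexed = probs.map((p, i) => ({ p, i }));
  indexed.sort((a, b) => b.p - a.p); // most likely first
  const kept: number[] = [];
  let cumulative = 0;
  for (const { p, i } of indexed) {
    kept.push(i);
    cumulative += p;
    if (cumulative >= topP) break;
  }
  return kept;
}

nucleus([0.5, 0.25, 0.125, 0.125], 0.75); // [0, 1] — top two cover 75%
nucleus([0.5, 0.25, 0.125, 0.125], 1.0); // all four tokens considered
```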
Other Common Parameters
| Parameter | What It Does |
|---|---|
| max_tokens | Maximum number of tokens to generate in the response |
| stop | Sequences that tell the model to stop generating (e.g., ["\n\n"]) |
| frequency_penalty | Reduces repetition by penalizing tokens that have already appeared |
| presence_penalty | Encourages the model to talk about new topics by penalizing tokens that have appeared at all |
Choosing the Right Settings
Here are practical starting points for common use cases:
| Use Case | Temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0 | 1.0 | Deterministic, correct code |
| Factual Q&A | 0–0.3 | 1.0 | Consistent, accurate answers |
| Summarization | 0.3 | 1.0 | Faithful to source material |
| Creative writing | 0.7–0.9 | 0.95 | Varied, interesting output |
| Brainstorming | 0.9–1.0 | 0.95 | Maximum variety |
Putting It All Together
Here’s a practical example that demonstrates token counting, context management, and parameter tuning:
import OpenAI from "openai";
import { encoding_for_model } from "tiktoken";
const client = new OpenAI();
const MODEL = "gpt-4o";
const MAX_CONTEXT = 128000;
const MAX_OUTPUT = 1000;
function countTokens(text) {
const enc = encoding_for_model(MODEL);
const count = enc.encode(text).length;
enc.free();
return count;
}
async function chat(prompt) {
const inputTokens = countTokens(prompt);
if (inputTokens + MAX_OUTPUT > MAX_CONTEXT) {
throw new Error(`Prompt too long: ${inputTokens} tokens`);
}
return client.chat.completions.create({
model: MODEL,
temperature: 0.3,
max_tokens: MAX_OUTPUT,
messages: [{ role: "user", content: prompt }],
});
}
What’s Next?
With a solid understanding of tokens, context windows, and model parameters, you’re ready to start crafting effective prompts. In Prompt Engineering Fundamentals, we’ll cover the techniques and patterns that get consistently better results from LLMs.