Tokens, Context Windows & Model Parameters
When you work with LLM APIs, three concepts come up constantly: tokens, context windows, and model parameters like temperature. These aren’t abstract theory — they directly affect your costs, the quality of responses, and what you can build. This tutorial covers all three with practical examples.
If you haven’t already, read How LLMs Work first for the underlying architecture.
Tokens
LLMs don’t read text the way humans do. They break input into tokens — chunks that are roughly word fragments. Tokenization is the first step in every LLM interaction, and understanding it helps you write better prompts, estimate costs, and debug unexpected behavior.
How Tokenization Works
Most modern LLMs use a tokenization algorithm called Byte Pair Encoding (BPE). The idea is:
- Start with individual characters as tokens
- Find the most frequently occurring pair of adjacent tokens in the training data
- Merge that pair into a new token
- Repeat thousands of times
The result is a vocabulary of typically 30,000–100,000 tokens that efficiently represents text. Common words become single tokens, while rare words get split into pieces.
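The merge loop above can be sketched in a few lines. This is a toy illustration only — real BPE tokenizers such as tiktoken operate on bytes and learn tens of thousands of merges — but it shows the core operation: count adjacent pairs, pick the most frequent, merge.

```typescript
// Find the most frequent adjacent pair of tokens (null if fewer than 2 tokens).
function mostFrequentPair(tokens: string[]): [string, string] | null {
  const counts = new Map<string, number>();
  for (let i = 0; i < tokens.length - 1; i++) {
    const key = tokens[i] + "\u0000" + tokens[i + 1];
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best: string | null = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  }
  return best ? (best.split("\u0000") as [string, string]) : null;
}

// Merge every occurrence of the pair into a single new token.
function mergePair(tokens: string[], pair: [string, string]): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < tokens.length) {
    if (i < tokens.length - 1 && tokens[i] === pair[0] && tokens[i + 1] === pair[1]) {
      out.push(pair[0] + pair[1]);
      i += 2;
    } else {
      out.push(tokens[i]);
      i += 1;
    }
  }
  return out;
}

// One training step on "banana", starting from individual characters:
let tokens = "banana".split(""); // ["b", "a", "n", "a", "n", "a"]
const pair = mostFrequentPair(tokens); // ["a", "n"] — appears twice
tokens = mergePair(tokens, pair!); // ["b", "an", "an", "a"]
```

Repeating this step thousands of times on a large corpus produces the final vocabulary, with the most common sequences merged into single tokens.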
Here are some examples of how text gets tokenized (using GPT-style tokenization):
| Text | Tokens | Count |
|---|---|---|
| Hello world | ["Hello", " world"] | 2 |
| Tokenization | ["Token", "ization"] | 2 |
| GPT-4o | ["G", "PT", "-", "4", "o"] | 5 |
| こんにちは | ["こん", "にち", "は"] | 3 |
| function add(a, b) { return a + b; } | ["function", " add", "(", "a", ",", " b", ")", " {", " return", " a", " +", " b", ";", " }"] | 14 |
A few things to notice:
- Spaces are part of tokens. The space before “world” is included in the token " world". This is why tokenizers sometimes produce unexpected results.
- Common words are single tokens. Words like “the”, “function”, and “return” are each one token.
- Rare or compound words get split. “Tokenization” becomes two tokens.
- Code is token-heavy. Punctuation, brackets, and operators each consume tokens.
- Non-English text uses more tokens. Languages with non-Latin scripts typically require more tokens per word.
Counting Tokens in Practice
When building applications, you’ll often need to count tokens to stay within limits and estimate costs. Here’s how to do it with the popular tiktoken library (used by OpenAI models):
import { encoding_for_model } from "tiktoken";
const enc = encoding_for_model("gpt-4o");
const tokens = enc.encode("How many tokens is this?");
console.log(tokens.length); // 6
enc.free();
Why Tokens Matter
- Pricing — API providers charge per token (both input and output). A typical rate might be $2.50 per million input tokens and $10 per million output tokens for a frontier model.
- Speed — More tokens = longer generation time. Output tokens are generated sequentially, so a 1,000-token response takes roughly 10x longer than a 100-token response.
- Context limits — Every model has a maximum number of tokens it can process at once (the context window). Your input and the model’s output must fit within this limit.
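Estimating cost from token counts is simple arithmetic. A minimal sketch using the illustrative rates mentioned above ($2.50 per million input tokens, $10 per million output tokens — check your provider's current pricing):

```typescript
// Illustrative rates in dollars per token ($2.50 / $10 per million tokens).
const INPUT_RATE = 2.5 / 1_000_000;
const OUTPUT_RATE = 10.0 / 1_000_000;

// Estimate the dollar cost of a single request.
function estimateCost(inputTokens: number, outputTokens: number): number {
  return inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE;
}

// A 2,000-token prompt with a 500-token response:
const cost = estimateCost(2_000, 500);
// 2,000 × $0.0000025 + 500 × $0.00001 = $0.005 + $0.005 = $0.01
```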
Context Windows
The context window is the maximum number of tokens a model can handle in a single request — including both your input (the prompt) and the model’s output (the completion).
Context Window Sizes
Context windows have grown dramatically:
| Model | Context Window |
|---|---|
| GPT-3 (2020) | 4,096 tokens |
| GPT-4 (2023) | 8K / 32K tokens (128K with GPT-4 Turbo) |
| Claude 3.5 (2024) | 200K tokens |
| Gemini 1.5 Pro (2024) | 1M+ tokens |
| GPT-4o (2024) | 128K tokens |
To put this in perspective, 128K tokens is roughly 96,000 words — about the length of a full novel. A 200K context window can hold several books or an entire codebase.
Working Within Context Limits
Even with large context windows, you need to manage them carefully:
Input + output must fit. If your model has a 128K context window and your prompt uses 120K tokens, the model can only generate 8K tokens of output. Most APIs let you set a max_tokens parameter to control output length.
Longer contexts cost more. Both in money (more tokens = higher cost) and latency (the attention mechanism scales quadratically with sequence length).
Information can get “lost” in the middle. Research has shown that LLMs pay more attention to information at the beginning and end of the context window. If you’re stuffing a lot of documents into a prompt, put the most important content first or last.
// Setting max output tokens in an API call
const response = await client.chat.completions.create({
model: "gpt-4o",
max_tokens: 500,
messages: [{ role: "user", content: prompt }],
});
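When packing many documents into a prompt, a simple guard is to add them in order of importance until a token budget is exhausted. A minimal sketch, using a rough characters-per-token heuristic as a stand-in — swap in a real tokenizer like tiktoken for accurate counts:

```typescript
// Rough heuristic: ~4 characters per token for English text.
// Replace with a real tokenizer (e.g. tiktoken) for accurate budgeting.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Add documents (already sorted most-important-first) until the budget runs out.
function packDocuments(docs: string[], budgetTokens: number): string[] {
  const packed: string[] = [];
  let used = 0;
  for (const doc of docs) {
    const cost = approxTokens(doc);
    if (used + cost > budgetTokens) break;
    packed.push(doc);
    used += cost;
  }
  return packed;
}
```

Sorting documents by importance before packing also works with the "lost in the middle" effect rather than against it: the content most likely to matter ends up near the start of the prompt.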
Model Parameters
When calling an LLM API, you can tune several parameters that control how the model generates text. These don’t change the model itself — they change the sampling strategy used to pick tokens from the probability distribution.
Temperature
Temperature controls the randomness of the output. It adjusts the probability distribution before a token is selected:
- Temperature 0 — Always picks the highest-probability token. Output is deterministic and focused. Good for factual questions, code generation, and structured output.
- Temperature 0.5–0.7 — A balanced middle ground. Some variety while staying coherent.
- Temperature 1.0 — Samples proportionally from the full distribution. More creative and varied, but also more likely to produce unexpected or off-topic results.
- Temperature > 1.0 — Flattens the distribution further, making unlikely tokens more probable. Rarely useful in practice.
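The adjustment itself is a one-liner: divide each logit by the temperature before applying softmax. A minimal sketch (note that temperature 0 is handled as a special case in practice — the API simply picks the argmax rather than dividing by zero):

```typescript
// Convert raw logits to probabilities, scaled by temperature.
// Lower temperature sharpens the distribution; higher flattens it.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [2.0, 1.0, 0.1];
softmaxWithTemperature(logits, 0.5); // sharper: the top token dominates
softmaxWithTemperature(logits, 1.0); // the model's raw distribution
softmaxWithTemperature(logits, 2.0); // flatter: unlikely tokens gain mass
```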
// Low temperature for factual/deterministic output
const factual = await client.chat.completions.create({
model: "gpt-4o",
temperature: 0,
messages: [{ role: "user", content: "What is the capital of France?" }],
});
// Higher temperature for creative writing
const creative = await client.chat.completions.create({
model: "gpt-4o",
temperature: 0.9,
messages: [{ role: "user", content: "Write a haiku about debugging." }],
});
Top-p (Nucleus Sampling)
Top-p is an alternative to temperature for controlling randomness. Instead of adjusting the distribution, it limits which tokens are considered:
- Top-p 0.1 — Only considers tokens in the top 10% of probability mass. Very focused.
- Top-p 0.9 — Considers tokens that make up 90% of the probability mass. More varied.
- Top-p 1.0 — Considers all tokens (default).
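The candidate-selection step can be sketched directly: sort tokens by probability and keep the smallest set whose cumulative probability reaches top-p. Sampling then happens only within that "nucleus":

```typescript
// Return the indices of the tokens kept under nucleus (top-p) sampling:
// the smallest most-likely-first set whose cumulative probability >= topP.
function nucleus(probs: number[], topP: number): number[] {
  const indexed = probs.map((p, i) => ({ p, i }));
  indexed.sort((a, b) => b.p - a.p); // most likely first
  const kept: number[] = [];
  let cumulative = 0;
  for (const { p, i } of indexed) {
    kept.push(i);
    cumulative += p;
    if (cumulative >= topP) break;
  }
  return kept;
}

nucleus([0.5, 0.25, 0.125, 0.125], 0.75); // [0, 1] — top two cover 75%
nucleus([0.5, 0.25, 0.125, 0.125], 1.0); // all four tokens considered
```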
Other Common Parameters
| Parameter | What It Does |
|---|---|
| max_tokens | Maximum number of tokens to generate in the response |
| stop | Sequences that tell the model to stop generating (e.g., ["\n\n"]) |
| frequency_penalty | Reduces repetition by penalizing tokens that have already appeared |
| presence_penalty | Encourages the model to talk about new topics by penalizing tokens that have appeared at all |
Choosing the Right Settings
Here are practical starting points for common use cases:
| Use Case | Temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0 | 1.0 | Deterministic, correct code |
| Factual Q&A | 0–0.3 | 1.0 | Consistent, accurate answers |
| Summarization | 0.3 | 1.0 | Faithful to source material |
| Creative writing | 0.7–0.9 | 0.95 | Varied, interesting output |
| Brainstorming | 0.9–1.0 | 0.95 | Maximum variety |
Putting It All Together
Here’s a practical example that demonstrates token counting, context management, and parameter tuning:
import OpenAI from "openai";
import { encoding_for_model } from "tiktoken";
const client = new OpenAI();
const MODEL = "gpt-4o";
const MAX_CONTEXT = 128000;
const MAX_OUTPUT = 1000;
function countTokens(text) {
const enc = encoding_for_model(MODEL);
const count = enc.encode(text).length;
enc.free();
return count;
}
async function chat(prompt) {
const inputTokens = countTokens(prompt);
if (inputTokens + MAX_OUTPUT > MAX_CONTEXT) {
throw new Error(`Prompt too long: ${inputTokens} tokens`);
}
return client.chat.completions.create({
model: MODEL,
temperature: 0.3,
max_tokens: MAX_OUTPUT,
messages: [{ role: "user", content: prompt }],
});
}
What’s Next?
With a solid understanding of tokens, context windows, and model parameters, you’re ready to start crafting effective prompts. In Prompt Engineering Fundamentals, we’ll cover the techniques and patterns that get consistently better results from LLMs.