Error Handling & Rate Limits
LLM API calls fail. Servers go down, rate limits get hit, tokens exceed context windows, and networks time out. If your application doesn’t handle these failures gracefully, your users get cryptic errors or broken experiences.
This tutorial covers the common failure modes, how to detect them, and how to build retry logic that keeps your application running. You should have read Calling LLM APIs with JavaScript or Calling LLM APIs with Python first.
Common Error Types
| HTTP Status | Error | Cause | Retryable? |
|---|---|---|---|
| 400 | Bad Request | Invalid parameters, prompt too long | No — fix the request |
| 401 | Authentication Error | Invalid or missing API key | No — fix your key |
| 403 | Permission Denied | Key doesn’t have access to the model | No — check permissions |
| 404 | Not Found | Wrong model name or endpoint | No — fix the request |
| 429 | Rate Limit | Too many requests or token quota exceeded | Yes — wait and retry |
| 500 | Server Error | Provider-side issue | Yes — retry |
| 503 | Service Unavailable | Provider overloaded | Yes — retry |
The key distinction: 4xx errors (except 429) mean your request is wrong — fix it, don’t retry. 429 and 5xx errors are transient — retry with backoff.
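That rule is small enough to encode directly (a minimal sketch; the isRetryable name is just for illustration):

function isRetryable(status) {
  // 429 (rate limit) and 5xx (server-side) errors are transient; anything else means the request itself needs fixing
  return status === 429 || status >= 500;
}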
Rate Limits
LLM providers enforce rate limits on two dimensions:
- Requests per minute (RPM) — How many API calls you can make
- Tokens per minute (TPM) — How many tokens you can process
When you exceed either limit, you get a 429 response. The response headers usually tell you when you can retry:
retry-after: 2
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 2s
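If you’re calling the API directly over HTTP rather than through an SDK, you can read these headers off the 429 response before scheduling a retry. A sketch using fetch, where requestBody stands in for your real payload:

const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(requestBody),
});

if (res.status === 429) {
  // Honor the provider's hint; fall back to 2 seconds if the header is missing
  const retryAfter = Number(res.headers.get("retry-after") ?? 2);
  console.warn(`Rate limited, retrying in ${retryAfter}s`);
}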
Rate Limit Tiers
Limits vary by provider and plan. Here’s a rough sense of scale:
| Tier | RPM | TPM |
|---|---|---|
| Free tier | 3–20 | 10K–40K |
| Paid tier 1 | 500 | 200K |
| Paid tier 2+ | 5,000+ | 2M+ |
Check your provider’s documentation for exact limits. They change frequently and vary by model.
Exponential Backoff
The standard pattern for handling retryable errors is exponential backoff: wait a short time after the first failure, then double the wait time with each subsequent retry.
Attempt 1: fails → wait 1s
Attempt 2: fails → wait 2s
Attempt 3: fails → wait 4s
Attempt 4: give up
Adding jitter (a small random delay) prevents multiple clients from retrying at exactly the same time and causing another spike.
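Here is a minimal sketch of backoff with jitter wrapped around an arbitrary async call (the withBackoff helper and its defaults are illustrative, not part of any SDK):

async function withBackoff(fn, { retries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts, give up
      // Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 1000;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

In practice you only want to retry errors that are actually retryable; the complete example at the end of this tutorial adds that check.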
Handling Context Length Errors
If your prompt exceeds the model’s context window, you’ll get a 400 error. This is common when building applications with conversation history or RAG, where the input grows over time.
Strategies for managing context length:
Truncate Conversation History
Keep only the most recent messages:
function trimMessages(messages, maxMessages = 20) {
  if (messages.length <= maxMessages) return messages;
  // Always keep the system message, trim oldest user/assistant messages
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-maxMessages + system.length)];
}
Estimate Tokens Before Sending
Check the token count before making the API call:
import { encoding_for_model } from "tiktoken";

function estimateTokens(messages) {
  const enc = encoding_for_model("gpt-4o");
  let total = 0;
  for (const msg of messages) {
    total += enc.encode(msg.content).length + 4; // ~4 tokens overhead per message
  }
  enc.free();
  return total;
}
const MAX_CONTEXT = 128000;
const MAX_OUTPUT = 4096;
if (estimateTokens(messages) + MAX_OUTPUT > MAX_CONTEXT) {
  messages = trimMessages(messages);
}
Timeouts
LLM calls can be slow, especially for long responses. Set timeouts to prevent your application from hanging.
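With the OpenAI Node SDK you can set a client-wide timeout and override it for individual requests (a sketch; the 30-second and 10-second values are arbitrary, tune them for your workload):

import OpenAI from "openai";

// Client-wide: any request taking longer than 30s is aborted and throws,
// which your retry logic can then catch.
const client = new OpenAI({ timeout: 30000 });

// Per-request override via the second (options) argument
const response = await client.chat.completions.create(
  { model: "gpt-4o", messages },
  { timeout: 10000 }
);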
Handling Incomplete Responses
Sometimes the model stops generating before completing its response. Check the finish_reason:
| finish_reason | Meaning | Action |
|---|---|---|
| stop | Model finished naturally | Normal — use the response |
| length | Hit max_tokens limit | Increase max_tokens or continue in a follow-up call |
| content_filter | Content was filtered | Rephrase the prompt |
const response = await client.chat.completions.create({
  model: "gpt-4o",
  max_tokens: 100,
  messages,
});

if (response.choices[0].finish_reason === "length") {
  console.warn("Response was truncated — consider increasing max_tokens");
}
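If you can’t simply raise max_tokens, the other option from the table above is to continue in a follow-up call: feed the partial answer back as an assistant message and ask the model to pick up where it left off. A sketch of that pattern:

let text = response.choices[0].message.content;

if (response.choices[0].finish_reason === "length") {
  const followUp = await client.chat.completions.create({
    model: "gpt-4o",
    max_tokens: 1000,
    messages: [
      ...messages,
      { role: "assistant", content: text }, // the truncated answer so far
      { role: "user", content: "Continue exactly where you left off." },
    ],
  });
  text += followUp.choices[0].message.content;
}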
Cost Protection
Runaway costs are a real risk, especially during development or if your application goes viral. Protect yourself:
Set spending limits in your provider’s dashboard. Every major provider lets you set monthly budget caps.
Track usage per request:
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages,
});

const { prompt_tokens, completion_tokens } = response.usage;
console.log(`Used ${prompt_tokens} input + ${completion_tokens} output tokens`);
Set max_tokens on every request to prevent unexpectedly long (and expensive) responses:
const response = await client.chat.completions.create({
  model: "gpt-4o",
  max_tokens: 1000, // Never generate more than 1000 tokens
  messages,
});
Putting It All Together
Here’s a wrapper you can use as a production starting point. It combines retry logic with exponential backoff and jitter, a client-side timeout, truncation detection, and per-request usage logging:
import OpenAI from "openai";
const client = new OpenAI({ timeout: 30000 }); // abort any request that takes longer than 30s

async function llmCall(messages, { maxTokens = 1000, retries = 3 } = {}) {
  for (let i = 0; i <= retries; i++) {
    try {
      const res = await client.chat.completions.create({
        model: "gpt-4o",
        max_tokens: maxTokens,
        messages,
      });
      if (res.choices[0].finish_reason === "length") {
        console.warn("Response truncated");
      }
      console.log(`Tokens: ${res.usage.total_tokens}`);
      return res.choices[0].message.content;
    } catch (err) {
      // Only 429s and 5xx server errors are worth retrying; anything else is rethrown immediately
      const retryable =
        err instanceof OpenAI.RateLimitError ||
        (err instanceof OpenAI.APIError && err.status >= 500);
      if (!retryable || i === retries) throw err;
      // Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
      const delay = 1000 * Math.pow(2, i) + Math.random() * 1000;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
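Using the wrapper then looks like any other async call (the messages here are just an example):

const reply = await llmCall([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain exponential backoff in one sentence." },
]);
console.log(reply);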
What’s Next?
With error handling covered, you have everything you need to build robust LLM-powered applications. In Comparing LLM Providers, we’ll look at the major providers — OpenAI, Anthropic, Google, and open-source options — to help you choose the right model for your use case.