Error Handling & Rate Limits
LLM API calls fail. Servers go down, rate limits get hit, tokens exceed context windows, and networks time out. If your application doesn’t handle these failures gracefully, your users get cryptic errors or broken experiences.
This tutorial covers the common failure modes, how to detect them, and how to build retry logic that keeps your application running. You should have read Calling LLM APIs with JavaScript or Calling LLM APIs with Python first.
Common Error Types
| HTTP Status | Error | Cause | Retryable? |
|---|---|---|---|
| 400 | Bad Request | Invalid parameters, prompt too long | No — fix the request |
| 401 | Authentication Error | Invalid or missing API key | No — fix your key |
| 403 | Permission Denied | Key doesn’t have access to the model | No — check permissions |
| 404 | Not Found | Wrong model name or endpoint | No — fix the request |
| 429 | Rate Limit | Too many requests or token quota exceeded | Yes — wait and retry |
| 500 | Server Error | Provider-side issue | Yes — retry |
| 503 | Service Unavailable | Provider overloaded | Yes — retry |
The key distinction: 4xx errors (except 429) mean your request is wrong — fix it, don’t retry. 429 and 5xx errors are transient — retry with backoff.
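That rule is small enough to encode directly (a minimal sketch; the isRetryable name is just for illustration):

function isRetryable(status) {
  // 429 (rate limit) and 5xx (server-side) errors are transient; anything else means the request itself needs fixing
  return status === 429 || status >= 500;
}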
Rate Limits
LLM providers enforce rate limits on two dimensions:
- Requests per minute (RPM) — How many API calls you can make
- Tokens per minute (TPM) — How many tokens you can process
When you exceed either limit, you get a 429 response. The response headers usually tell you when you can retry:
retry-after: 2
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 2s
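If you’re calling the API directly over HTTP rather than through an SDK, you can read these headers off the 429 response before scheduling a retry. A sketch using fetch, where requestBody stands in for your real payload:

const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(requestBody),
});

if (res.status === 429) {
  // Honor the provider's hint; fall back to 2 seconds if the header is missing
  const retryAfter = Number(res.headers.get("retry-after") ?? 2);
  console.warn(`Rate limited, retrying in ${retryAfter}s`);
}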
Rate Limit Tiers
Limits vary by provider and plan. Here’s a rough sense of scale:
| Tier | RPM | TPM |
|---|---|---|
| Free tier | 3–20 | 10K–40K |
| Paid tier 1 | 500 | 200K |
| Paid tier 2+ | 5,000+ | 2M+ |
Check your provider’s documentation for exact limits. They change frequently and vary by model.
Exponential Backoff
The standard pattern for handling retryable errors is exponential backoff: wait a short time after the first failure, then double the wait time with each subsequent retry.
Attempt 1: fails → wait 1s
Attempt 2: fails → wait 2s
Attempt 3: fails → wait 4s
Attempt 4: give up
Adding jitter (a small random delay) prevents multiple clients from retrying at exactly the same time and causing another spike.
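Here is a minimal sketch of backoff with jitter wrapped around an arbitrary async call (the withBackoff helper and its defaults are illustrative, not part of any SDK):

async function withBackoff(fn, { retries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts, give up
      // Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 1000;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

In practice you only want to retry errors that are actually retryable; the complete example at the end of this tutorial adds that check.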
Handling Context Length Errors
If your prompt exceeds the model’s context window, you’ll get a 400 error. This is common when building applications with conversation history or RAG, where the input grows over time.
Strategies for managing context length:
Truncate Conversation History
Keep only the most recent messages:
function trimMessages(messages, maxMessages = 20) {
  if (messages.length <= maxMessages) return messages;
  // Always keep the system message, trim oldest user/assistant messages
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-maxMessages + system.length)];
}
Estimate Tokens Before Sending
Check the token count before making the API call:
import { encoding_for_model } from "tiktoken";

function estimateTokens(messages) {
  const enc = encoding_for_model("gpt-4o");
  let total = 0;
  for (const msg of messages) {
    total += enc.encode(msg.content).length + 4; // ~4 tokens overhead per message
  }
  enc.free();
  return total;
}
const MAX_CONTEXT = 128000;
const MAX_OUTPUT = 4096;
if (estimateTokens(messages) + MAX_OUTPUT > MAX_CONTEXT) {
  messages = trimMessages(messages);
}
Timeouts
LLM calls can be slow, especially for long responses. Set timeouts to prevent your application from hanging.
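With the OpenAI Node SDK you can set a client-wide timeout and override it for individual requests (a sketch; the 30-second and 10-second values are arbitrary, tune them for your workload):

import OpenAI from "openai";

// Client-wide: any request taking longer than 30s is aborted and throws,
// which your retry logic can then catch.
const client = new OpenAI({ timeout: 30000 });

// Per-request override via the second (options) argument
const response = await client.chat.completions.create(
  { model: "gpt-4o", messages },
  { timeout: 10000 }
);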
Handling Incomplete Responses
Sometimes the model stops generating before completing its response. Check the finish_reason:
| finish_reason | Meaning | Action |
|---|---|---|
| stop | Model finished naturally | Normal — use the response |
| length | Hit max_tokens limit | Increase max_tokens or continue in a follow-up call |
| content_filter | Content was filtered | Rephrase the prompt |
const response = await client.chat.completions.create({
  model: "gpt-4o",
  max_tokens: 100,
  messages,
});

if (response.choices[0].finish_reason === "length") {
  console.warn("Response was truncated — consider increasing max_tokens");
}
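If you can’t simply raise max_tokens, the other option from the table above is to continue in a follow-up call: feed the partial answer back as an assistant message and ask the model to pick up where it left off. A sketch of that pattern:

let text = response.choices[0].message.content;

if (response.choices[0].finish_reason === "length") {
  const followUp = await client.chat.completions.create({
    model: "gpt-4o",
    max_tokens: 1000,
    messages: [
      ...messages,
      { role: "assistant", content: text }, // the truncated answer so far
      { role: "user", content: "Continue exactly where you left off." },
    ],
  });
  text += followUp.choices[0].message.content;
}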
Cost Protection
Runaway costs are a real risk, especially during development or if your application goes viral. Protect yourself:
Set spending limits in your provider’s dashboard. Every major provider lets you set monthly budget caps.
Track usage per request:
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages,
});

const { prompt_tokens, completion_tokens } = response.usage;
console.log(`Used ${prompt_tokens} input + ${completion_tokens} output tokens`);
Set max_tokens on every request to prevent unexpectedly long (and expensive) responses:
const response = await client.chat.completions.create({
  model: "gpt-4o",
  max_tokens: 1000, // Never generate more than 1000 tokens
  messages,
});
Putting It All Together
Here’s a wrapper you can use as a production starting point. It combines retry logic with exponential backoff and jitter, a client-side timeout, truncation detection, and per-request usage logging:
import OpenAI from "openai";
const client = new OpenAI({ timeout: 30000 }); // abort any request that takes longer than 30s

async function llmCall(messages, { maxTokens = 1000, retries = 3 } = {}) {
  for (let i = 0; i <= retries; i++) {
    try {
      const res = await client.chat.completions.create({
        model: "gpt-4o",
        max_tokens: maxTokens,
        messages,
      });
      if (res.choices[0].finish_reason === "length") {
        console.warn("Response truncated");
      }
      console.log(`Tokens: ${res.usage.total_tokens}`);
      return res.choices[0].message.content;
    } catch (err) {
      // Only 429s and 5xx server errors are worth retrying; anything else is rethrown immediately
      const retryable =
        err instanceof OpenAI.RateLimitError ||
        (err instanceof OpenAI.APIError && err.status >= 500);
      if (!retryable || i === retries) throw err;
      // Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
      const delay = 1000 * Math.pow(2, i) + Math.random() * 1000;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
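Using the wrapper then looks like any other async call (the messages here are just an example):

const reply = await llmCall([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain exponential backoff in one sentence." },
]);
console.log(reply);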
What’s Next?
With error handling covered, you have everything you need to build robust LLM-powered applications. In Comparing LLM Providers, we’ll look at the major providers — OpenAI, Anthropic, Google, and open-source options — to help you choose the right model for your use case.