Error Handling & Rate Limits

March 28, 2026
#genai #ai #llm #javascript #python

LLM API calls fail. Servers go down, rate limits get hit, tokens exceed context windows, and networks time out. If your application doesn’t handle these failures gracefully, your users get cryptic errors or broken experiences.

This tutorial covers the common failure modes, how to detect them, and how to build retry logic that keeps your application running. You should have read Calling LLM APIs with JavaScript or Calling LLM APIs with Python first.

Common Error Types

| HTTP Status | Error | Cause | Retryable? |
| --- | --- | --- | --- |
| 400 | Bad Request | Invalid parameters, prompt too long | No — fix the request |
| 401 | Authentication Error | Invalid or missing API key | No — fix your key |
| 403 | Permission Denied | Key doesn’t have access to the model | No — check permissions |
| 404 | Not Found | Wrong model name or endpoint | No — fix the request |
| 429 | Rate Limit | Too many requests or token quota exceeded | Yes — wait and retry |
| 500 | Server Error | Provider-side issue | Yes — retry |
| 503 | Service Unavailable | Provider overloaded | Yes — retry |

The key distinction: 4xx errors (except 429) mean your request is wrong — fix it, don’t retry. 429 and 5xx errors are transient — retry with backoff.
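That rule of thumb fits in a one-line helper (a sketch — `isRetryable` is my name for it, not part of any SDK):

```javascript
// Classify an HTTP status code: retry only on 429 and 5xx.
// All other 4xx codes mean the request itself needs fixing.
function isRetryable(status) {
  return status === 429 || (status >= 500 && status < 600);
}
```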

Rate Limits

LLM providers enforce rate limits on two dimensions:

  • Requests per minute (RPM) — How many API calls you can make
  • Tokens per minute (TPM) — How many tokens you can process

When you exceed either limit, you get a 429 response. The response headers usually tell you when you can retry:

retry-after: 2
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 2s
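Header names vary slightly by provider, but a small helper can turn `retry-after` into a wait time (a sketch — `retryDelayMs` and the 1-second fallback are my own choices):

```javascript
// Read the suggested wait time (in ms) from a 429 response's headers.
// Works with both a Headers/Map-like object and a plain object;
// falls back to a default when the header is absent or unparsable.
function retryDelayMs(headers, fallbackMs = 1000) {
  const retryAfter = headers.get?.("retry-after") ?? headers["retry-after"];
  const seconds = Number(retryAfter);
  return Number.isFinite(seconds) && seconds > 0 ? seconds * 1000 : fallbackMs;
}
```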

Rate Limit Tiers

Limits vary by provider and plan. Here’s a rough sense of scale:

| Tier | RPM | TPM |
| --- | --- | --- |
| Free tier | 3–20 | 10K–40K |
| Paid tier 1 | 500 | 200K |
| Paid tier 2+ | 5,000+ | 2M+ |

Check your provider’s documentation for exact limits. They change frequently and vary by model.

Exponential Backoff

The standard pattern for handling retryable errors is exponential backoff: wait a short time after the first failure, then double the wait time with each subsequent retry.

Attempt 1: fails → wait 1s
Attempt 2: fails → wait 2s
Attempt 3: fails → wait 4s
Attempt 4: give up

Adding jitter (a small random delay) prevents multiple clients from retrying at exactly the same moment and causing another traffic spike.
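The schedule above, with jitter added, can be sketched as a small helper (the function names and the 30-second cap are my choices):

```javascript
// Exponential backoff with jitter: base * 2^attempt plus a random
// offset of up to one base interval, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  const exponential = baseMs * Math.pow(2, attempt);
  const jitter = Math.random() * baseMs;
  return Math.min(exponential + jitter, maxMs);
}

// Promise-based sleep, for use inside an async retry loop.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
```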

Handling Context Length Errors

If your prompt exceeds the model’s context window, you’ll get a 400 error. This is common when building applications with conversation history or RAG, where the input grows over time.

Strategies for managing context length:

Truncate Conversation History

Keep only the most recent messages:

function trimMessages(messages, maxMessages = 20) {
  if (messages.length <= maxMessages) return messages;
  // Always keep the system message, trim oldest user/assistant messages
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-maxMessages + system.length)];
}

Estimate Tokens Before Sending

Check the token count before making the API call:

import { encoding_for_model } from "tiktoken";

function estimateTokens(messages) {
  const enc = encoding_for_model("gpt-4o");
  let total = 0;
  for (const msg of messages) {
    total += enc.encode(msg.content).length + 4; // ~4 tokens overhead per message
  }
  enc.free();
  return total;
}

const MAX_CONTEXT = 128000;
const MAX_OUTPUT = 4096;

if (estimateTokens(messages) + MAX_OUTPUT > MAX_CONTEXT) {
  messages = trimMessages(messages);
}

Timeouts

LLM calls can be slow, especially for long responses. Set timeouts so your application doesn’t hang waiting on a response that never arrives.
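The OpenAI SDK accepts a `timeout` option directly (the wrapper at the end of this post uses `new OpenAI({ timeout: 30000 })`); for clients without one, a generic sketch using `Promise.race` works (the helper name is mine):

```javascript
// Reject any promise-returning call after `ms` milliseconds.
// The timer is cleared once the race settles either way.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that racing against a timeout abandons the underlying request rather than cancelling it; for true cancellation, use your HTTP client’s `AbortSignal` support.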

Handling Incomplete Responses

Sometimes the model stops generating before completing its response. Check the finish_reason:

| finish_reason | Meaning | Action |
| --- | --- | --- |
| stop | Model finished naturally | Normal — use the response |
| length | Hit max_tokens limit | Increase max_tokens or continue in a follow-up call |
| content_filter | Content was filtered | Rephrase the prompt |

const response = await client.chat.completions.create({
  model: "gpt-4o",
  max_tokens: 100,
  messages,
});

if (response.choices[0].finish_reason === "length") {
  console.warn("Response was truncated — consider increasing max_tokens");
}

Cost Protection

Runaway costs are a real risk, especially during development or if your application goes viral. Protect yourself:

Set spending limits in your provider’s dashboard. Every major provider lets you set monthly budget caps.

Track usage per request:

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages,
});

const { prompt_tokens, completion_tokens } = response.usage;
console.log(`Used ${prompt_tokens} input + ${completion_tokens} output tokens`);

Set max_tokens on every request to prevent unexpectedly long (and expensive) responses:

const response = await client.chat.completions.create({
  model: "gpt-4o",
  max_tokens: 1000, // Never generate more than 1000 tokens
  messages,
});

Putting It All Together

Here’s a production-ready wrapper that combines retry logic, timeout, token estimation, and cost tracking:

import OpenAI from "openai";

const client = new OpenAI({ timeout: 30000 });

async function llmCall(messages, { maxTokens = 1000, retries = 3 } = {}) {
  for (let i = 0; i <= retries; i++) {
    try {
      const res = await client.chat.completions.create({
        model: "gpt-4o",
        max_tokens: maxTokens,
        messages,
      });

      if (res.choices[0].finish_reason === "length") {
        console.warn("Response truncated");
      }

      console.log(`Tokens: ${res.usage.total_tokens}`);
      return res.choices[0].message.content;
    } catch (err) {
      const retryable =
        err instanceof OpenAI.RateLimitError ||
        (err instanceof OpenAI.APIError && err.status >= 500);

      if (!retryable || i === retries) throw err;

      const delay = 1000 * Math.pow(2, i) + Math.random() * 1000;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}

What’s Next?

With error handling covered, you have everything you need to build robust LLM-powered applications. In Comparing LLM Providers, we’ll look at the major providers — OpenAI, Anthropic, Google, and open-source options — to help you choose the right model for your use case.
