Streaming Responses
When you make a standard LLM API call, you wait for the entire response to be generated before you see anything. For short answers that’s fine, but for longer responses the user stares at a blank screen for seconds. Streaming fixes this by sending tokens to the client as they’re generated, creating the “typing” effect you see in ChatGPT and other AI chat interfaces.
This tutorial covers streaming in depth — how it works under the hood, implementation in both JavaScript and Python, and how to integrate streaming into web applications. You should have read Calling LLM APIs with JavaScript or Calling LLM APIs with Python first.
Why Streaming Matters
Consider a response that takes 5 seconds to generate:
- Without streaming — The user waits 5 seconds, then sees the full response all at once. It feels slow and unresponsive.
- With streaming — The first tokens appear within ~200ms. The user starts reading immediately while the rest generates. Same total time, but it feels much faster.
Streaming also lets you:
- Display partial results in real time
- Cancel generation early if the user navigates away
- Process output incrementally (e.g., parse JSON as it arrives)
How Streaming Works
Most LLM APIs stream over Server-Sent Events (SSE). The server keeps the HTTP connection open and sends chunks of data as they become available; each chunk carries one or a few tokens.
A raw SSE stream looks like this:
```
data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[{"delta":{"content":" world"}}]}

data: {"choices":[{"delta":{"content":"!"}}]}

data: [DONE]
```
The SDK handles parsing this for you — you just iterate over chunks.
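If you're curious what that parsing involves, it's little more than stripping the `data: ` prefix, stopping at `[DONE]`, and reading the delta out of each JSON payload. A minimal sketch (the chunk format mirrors the raw stream shown above; real SDKs also handle reconnection and multi-line events):

```python
import json

def parse_sse_lines(lines):
    """Extract content tokens from raw SSE lines like the example above."""
    tokens = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # sentinel marking the end of the stream
        event = json.loads(payload)
        delta = event["choices"][0]["delta"]
        # Some chunks (e.g. the first, which carries the role) have no content
        if "content" in delta:
            tokens.append(delta["content"])
    return tokens

raw = [
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" world"}}]}',
    'data: {"choices":[{"delta":{"content":"!"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_lines(raw)))  # Hello world!
```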
Streaming in JavaScript
Basic Streaming
```javascript
import OpenAI from "openai";

const client = new OpenAI();

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  stream: true,
  messages: [{ role: "user", content: "Explain closures in JavaScript." }],
});

for await (const chunk of stream) {
  const text = chunk.choices[0]?.delta?.content;
  if (text) process.stdout.write(text);
}
```
Each chunk’s delta contains the new content. The first chunk often includes the role field instead of content, which is why the ?. check is important.
Collecting the Full Response
If you need both streaming output and the complete response (e.g., to save to conversation history):
```javascript
let fullResponse = "";

for await (const chunk of stream) {
  const text = chunk.choices[0]?.delta?.content;
  if (text) {
    process.stdout.write(text);
    fullResponse += text;
  }
}

messages.push({ role: "assistant", content: fullResponse });
```
Streaming to a Web Client
In a web application, you typically stream from your backend to the browser. Here’s a minimal Express.js endpoint:
```javascript
import express from "express";
import OpenAI from "openai";

const app = express();
const client = new OpenAI();

app.use(express.json());

app.post("/api/chat", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const stream = await client.chat.completions.create({
    model: "gpt-4o",
    stream: true,
    messages: req.body.messages,
  });

  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content;
    if (text) res.write(`data: ${JSON.stringify({ text })}\n\n`);
  }

  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);
```
And the browser-side code to consume it:
```javascript
async function streamChat(messages) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    // Network chunks can split an SSE line in half, so only parse complete
    // lines and carry any trailing partial line over to the next read
    const lines = buffer.split("\n");
    buffer = lines.pop();
    for (const line of lines) {
      if (line.startsWith("data: ") && line !== "data: [DONE]") {
        const { text: token } = JSON.parse(line.slice(6));
        document.getElementById("output").textContent += token;
      }
    }
  }
}
```

Note the buffering: a chunk from the network is not guaranteed to end on a line boundary, so parsing `decoder.decode(value)` directly can throw on a half-received JSON payload.
Streaming in Python
Basic Streaming
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    stream=True,
    messages=[{"role": "user", "content": "Explain decorators in Python."}],
)

for chunk in stream:
    text = chunk.choices[0].delta.content
    if text:
        print(text, end="", flush=True)
```
Async Streaming
```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_chat(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        stream=True,
        messages=[{"role": "user", "content": prompt}],
    )
    async for chunk in stream:
        text = chunk.choices[0].delta.content
        if text:
            print(text, end="", flush=True)

asyncio.run(stream_chat("Explain generators in Python."))
```
Streaming with FastAPI
```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()

class ChatRequest(BaseModel):
    messages: list[dict]

@app.post("/api/chat")
async def chat(req: ChatRequest):
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o", stream=True, messages=req.messages
        )
        for chunk in stream:
            text = chunk.choices[0].delta.content
            if text:
                # JSON-encode each token: a raw token containing a newline
                # would otherwise break the SSE "data: ...\n\n" framing
                yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
```
Streaming with Anthropic
Anthropic’s SDK has a slightly different streaming interface:
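In the Python SDK, streaming is a context manager on the Messages API rather than a flag on the create call. A minimal sketch (the model name is illustrative; substitute whichever Claude model you use):

```python
import anthropic

client = anthropic.Anthropic()

# messages.stream() handles SSE parsing and connection cleanup for you
with client.messages.stream(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,                   # required by the Messages API
    messages=[{"role": "user", "content": "Explain closures in JavaScript."}],
) as stream:
    # text_stream yields only text deltas, skipping other event types
    for text in stream.text_stream:
        print(text, end="", flush=True)

# After the loop, the fully assembled message is available
message = stream.get_final_message()
```

The `get_final_message()` helper saves you from accumulating the text yourself when you also need the complete response for conversation history.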
Handling Cancellation
Users might navigate away or click “stop generating.” You should abort the stream to avoid wasting tokens and compute:
```javascript
const controller = new AbortController();

// User clicks "stop"
document.getElementById("stop-btn").onclick = () => controller.abort();

try {
  const stream = await client.chat.completions.create(
    { model: "gpt-4o", stream: true, messages },
    { signal: controller.signal }
  );
  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content;
    if (text) appendToUI(text);
  }
} catch (err) {
  // The OpenAI SDK wraps aborts in APIUserAbortError; a raw fetch abort
  // surfaces as a DOMException named "AbortError"
  if (err instanceof OpenAI.APIUserAbortError || err.name === "AbortError") {
    console.log("Stream cancelled by user");
  } else {
    throw err;
  }
}
```
When Not to Stream
Streaming isn’t always the right choice:
- Structured output — If you need to parse the full response as JSON, you have to collect all chunks first anyway. Non-streaming is simpler.
- Batch processing — When processing many requests programmatically with no user watching, streaming adds complexity without benefit.
- Short responses — If the response is just a few tokens, the overhead of setting up a stream isn’t worth it.
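The structured-output point is concrete: json.loads only succeeds on the complete document, so you must buffer the entire stream before parsing. A sketch with stubbed chunks standing in for a real stream:

```python
import json

# Stand-ins for streamed text chunks; a real stream would yield these pieces
chunks = ['{"name": "Ada"', ', "role": "admin"}']

# Parsing any prefix would fail, so collect everything first, then parse once
full = "".join(chunks)
parsed = json.loads(full)
print(parsed["role"])  # admin
```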
What’s Next?
Streaming handles the happy path, but API calls can fail. In Error Handling & Rate Limits, we’ll cover building resilient applications that handle failures, rate limits, and timeouts gracefully.