Streaming Responses
When you make a standard LLM API call, you wait for the entire response to be generated before you see anything. For short answers that’s fine, but for longer responses the user stares at a blank screen for seconds. Streaming fixes this by sending tokens to the client as they’re generated, creating the “typing” effect you see in ChatGPT and other AI chat interfaces.
This tutorial covers streaming in depth — how it works under the hood, implementation in both JavaScript and Python, and how to integrate streaming into web applications. You should have read Calling LLM APIs with JavaScript or Calling LLM APIs with Python first.
Why Streaming Matters
Consider a response that takes 5 seconds to generate:
- Without streaming — The user waits 5 seconds, then sees the full response all at once. It feels slow and unresponsive.
- With streaming — The first tokens appear within ~200ms. The user starts reading immediately while the rest generates. Same total time, but it feels much faster.
Streaming also lets you:
- Display partial results in real time
- Cancel generation early if the user navigates away
- Process output incrementally (e.g., parse JSON as it arrives)
How Streaming Works
Most LLM APIs stream over Server-Sent Events (SSE). The server keeps the HTTP connection open and sends chunks of data as they become available; each chunk carries one or a few tokens.
A raw SSE stream looks like this:
```
data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[{"delta":{"content":" world"}}]}

data: {"choices":[{"delta":{"content":"!"}}]}

data: [DONE]
```
The SDK handles parsing this for you — you just iterate over chunks.
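If you're curious what that parsing involves, it's little more than stripping the `data: ` prefix, stopping at `[DONE]`, and reading the delta out of each JSON payload. A minimal sketch (the chunk format mirrors the raw stream shown above; real SDKs also handle reconnection and multi-line events):

```python
import json

def parse_sse_lines(lines):
    """Extract content tokens from raw SSE lines like the example above."""
    tokens = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # sentinel marking the end of the stream
        event = json.loads(payload)
        delta = event["choices"][0]["delta"]
        # Some chunks (e.g. the first, which carries the role) have no content
        if "content" in delta:
            tokens.append(delta["content"])
    return tokens

raw = [
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" world"}}]}',
    'data: {"choices":[{"delta":{"content":"!"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_lines(raw)))  # Hello world!
```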
Streaming in JavaScript
Basic Streaming
```javascript
import OpenAI from "openai";

const client = new OpenAI();

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  stream: true,
  messages: [{ role: "user", content: "Explain closures in JavaScript." }],
});

for await (const chunk of stream) {
  const text = chunk.choices[0]?.delta?.content;
  if (text) process.stdout.write(text);
}
```
Each chunk’s delta contains the new content. The first chunk often includes the role field instead of content, which is why the ?. check is important.
Collecting the Full Response
If you need both streaming output and the complete response (e.g., to save to conversation history):
```javascript
let fullResponse = "";

for await (const chunk of stream) {
  const text = chunk.choices[0]?.delta?.content;
  if (text) {
    process.stdout.write(text);
    fullResponse += text;
  }
}

messages.push({ role: "assistant", content: fullResponse });
```
Streaming to a Web Client
In a web application, you typically stream from your backend to the browser. Here’s a minimal Express.js endpoint:
```javascript
import express from "express";
import OpenAI from "openai";

const app = express();
const client = new OpenAI();

app.use(express.json());

app.post("/api/chat", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const stream = await client.chat.completions.create({
    model: "gpt-4o",
    stream: true,
    messages: req.body.messages,
  });

  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content;
    if (text) res.write(`data: ${JSON.stringify({ text })}\n\n`);
  }

  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);
```
And the browser-side code to consume it:
```javascript
async function streamChat(messages) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    // Network chunks can split an SSE line in half, so only parse complete
    // lines and carry any trailing partial line over to the next read
    const lines = buffer.split("\n");
    buffer = lines.pop();
    for (const line of lines) {
      if (line.startsWith("data: ") && line !== "data: [DONE]") {
        const { text: token } = JSON.parse(line.slice(6));
        document.getElementById("output").textContent += token;
      }
    }
  }
}
```

Note the buffering: a chunk from the network is not guaranteed to end on a line boundary, so parsing `decoder.decode(value)` directly can throw on a half-received JSON payload.
Streaming in Python
Basic Streaming
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    stream=True,
    messages=[{"role": "user", "content": "Explain decorators in Python."}],
)

for chunk in stream:
    text = chunk.choices[0].delta.content
    if text:
        print(text, end="", flush=True)
```
Async Streaming
```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_chat(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        stream=True,
        messages=[{"role": "user", "content": prompt}],
    )
    async for chunk in stream:
        text = chunk.choices[0].delta.content
        if text:
            print(text, end="", flush=True)

asyncio.run(stream_chat("Explain generators in Python."))
```
Streaming with FastAPI
```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()

class ChatRequest(BaseModel):
    messages: list[dict]

@app.post("/api/chat")
async def chat(req: ChatRequest):
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o", stream=True, messages=req.messages
        )
        for chunk in stream:
            text = chunk.choices[0].delta.content
            if text:
                # JSON-encode each token: a raw token containing a newline
                # would otherwise break the SSE "data: ...\n\n" framing
                yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
```
Streaming with Anthropic
Anthropic’s SDK has a slightly different streaming interface:
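In the Python SDK, streaming is a context manager on the Messages API rather than a flag on the create call. A minimal sketch (the model name is illustrative; substitute whichever Claude model you use):

```python
import anthropic

client = anthropic.Anthropic()

# messages.stream() handles SSE parsing and connection cleanup for you
with client.messages.stream(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,                   # required by the Messages API
    messages=[{"role": "user", "content": "Explain closures in JavaScript."}],
) as stream:
    # text_stream yields only text deltas, skipping other event types
    for text in stream.text_stream:
        print(text, end="", flush=True)

# After the loop, the fully assembled message is available
message = stream.get_final_message()
```

The `get_final_message()` helper saves you from accumulating the text yourself when you also need the complete response for conversation history.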
Handling Cancellation
Users might navigate away or click “stop generating.” You should abort the stream to avoid wasting tokens and compute:
```javascript
const controller = new AbortController();

// User clicks "stop"
document.getElementById("stop-btn").onclick = () => controller.abort();

try {
  const stream = await client.chat.completions.create(
    { model: "gpt-4o", stream: true, messages },
    { signal: controller.signal }
  );
  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content;
    if (text) appendToUI(text);
  }
} catch (err) {
  // The OpenAI SDK wraps aborts in APIUserAbortError; a raw fetch abort
  // surfaces as a DOMException named "AbortError"
  if (err instanceof OpenAI.APIUserAbortError || err.name === "AbortError") {
    console.log("Stream cancelled by user");
  } else {
    throw err;
  }
}
```
When Not to Stream
Streaming isn’t always the right choice:
- Structured output — If you need to parse the full response as JSON, you have to collect all chunks first anyway. Non-streaming is simpler.
- Batch processing — When processing many requests programmatically with no user watching, streaming adds complexity without benefit.
- Short responses — If the response is just a few tokens, the overhead of setting up a stream isn’t worth it.
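The structured-output point is concrete: json.loads only succeeds on the complete document, so you must buffer the entire stream before parsing. A sketch with stubbed chunks standing in for a real stream:

```python
import json

# Stand-ins for streamed text chunks; a real stream would yield these pieces
chunks = ['{"name": "Ada"', ', "role": "admin"}']

# Parsing any prefix would fail, so collect everything first, then parse once
full = "".join(chunks)
parsed = json.loads(full)
print(parsed["role"])  # admin
```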
What’s Next?
Streaming handles the happy path, but API calls can fail. In Error Handling & Rate Limits, we’ll cover building resilient applications that handle failures, rate limits, and timeouts gracefully.