Streaming LLM Responses: Building Real-Time User Experiences
Implement streaming responses that make AI agents feel fast and responsive - covering SSE setup, token-by-token rendering, progress indicators, and handling partial failures gracefully.
Click send, wait 3 seconds, see nothing. Wait another 2 seconds, still nothing. Finally - a wall of text appears. That's the non-streaming experience. Users don't know if the system is working, broken, or just slow.
Streaming responses change this fundamentally. The first token appears within 200-400ms. Users see the response being generated character by character, giving immediate feedback that the system is working. Even though the total time to complete is similar, perceived performance improves dramatically.
This guide covers implementing streaming from LLM APIs through to frontend rendering, with patterns that handle real-world complications like errors, tool calls, and mobile connectivity.
Key takeaways
- Streaming is primarily a UX improvement - total time to full response is often the same or slightly longer.
- Time to First Token (TTFT) is the metric that matters; aim for <500ms.
- Don't render every token individually - batch updates every 50-100ms for smooth animation.
- Plan for stream interruptions: save partial responses, show graceful errors, offer retry.
Research from Stanford's HCI lab found that users perceive streaming chat interfaces as 40% faster than non-streaming interfaces, even when total response time is identical (Stanford HCI, 2024). The psychology is simple: visible progress feels faster than invisible waiting.
Non-streaming request:
User sends message
→ Network latency (~50ms)
→ LLM processing (2,000-8,000ms)
→ Network latency (~50ms)
→ Render response (~10ms)
Total time to any feedback: 2,100-8,100ms
Streaming request:
User sends message
→ Network latency (~50ms)
→ LLM starts generating (200-400ms)
→ First token arrives
→ Tokens stream continuously
→ Final token arrives (2,000-8,000ms total)
Total time to first feedback: 250-450ms
The user sees activity almost immediately with streaming. That changes the entire perception of speed.
| Scenario | Benefit | Priority |
|---|---|---|
| Long responses (500+ tokens) | High - visible progress during generation | Essential |
| User-facing chat | High - immediate feedback | Essential |
| Slow models (GPT-4, Claude Opus) | High - reduces perceived wait | High |
| Fast models (GPT-4o-mini) | Medium - already fast | Nice to have |
| Background processing | Low - user isn't watching | Skip |
| Batch operations | None - no UI to update | Skip |
Three main approaches exist for getting tokens from the LLM to the browser.
Server-Sent Events (SSE): a one-way stream from server to client, and the simplest option for most use cases.
[Client] → HTTP POST /api/chat (message)
[Server] ← Opens SSE connection
[Server] → data: {"token": "Hello"}
[Server] → data: {"token": " world"}
[Server] → data: [DONE]
[Server] ← Connection closes
Pros:
- Works over plain HTTP - no extra infrastructure or protocol upgrade.
- Simple to implement on both the server and the client.
- The browser's built-in EventSource API handles parsing and reconnection (for GET endpoints).
Cons:
- One-way only - the client can't send anything over the same connection.
- EventSource only supports GET, so POST-based chat endpoints need fetch plus manual stream parsing.
- Some proxies and load balancers buffer responses unless configured for streaming.
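If the endpoint can be a GET, EventSource gives you SSE parsing and automatic reconnection for free. A minimal sketch - the /api/stream endpoint, q parameter, and appendToMessage helper are assumptions for illustration:
// Hypothetical GET endpoint emitting the same {type, content} events
const source = new EventSource(`/api/stream?q=${encodeURIComponent('Hello')}`);

source.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'token') appendToMessage(data.content); // your render helper
  if (data.type === 'done') source.close();
};

source.onerror = () => source.close(); // EventSource retries by default; close() stops it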
WebSockets: a bidirectional persistent connection. More complex but more flexible.
[Client] ↔ WebSocket connection established
[Client] → {"type": "message", "content": "Hello"}
[Server] → {"type": "token", "content": "Hi"}
[Server] → {"type": "token", "content": " there"}
[Client] → {"type": "cancel"} (user can interrupt)
[Server] → {"type": "done"}
Pros:
- Bidirectional - the client can cancel generation or send follow-ups mid-stream.
- A single persistent connection can carry an entire conversation.
Cons:
- More moving parts: connection lifecycle, reconnection logic, and scaling (sticky sessions or a pub/sub layer).
- Not plain HTTP, so some proxies and corporate networks need extra configuration.
- Overkill when all you need is server-to-client streaming.
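A minimal browser-side sketch of the flow above - the wss://example.com/ws endpoint, message shapes, and appendToMessage helper are assumptions for illustration:
const socket = new WebSocket('wss://example.com/ws');

socket.onopen = () => {
  socket.send(JSON.stringify({ type: 'message', content: 'Hello' }));
};

socket.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'token') appendToMessage(data.content); // your render helper
  if (data.type === 'done') socket.close();
};

// The payoff over SSE: the client can interrupt generation mid-stream
function cancelGeneration() {
  socket.send(JSON.stringify({ type: 'cancel' }));
}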
HTTP streaming with chunked transfer encoding: uses standard HTTP, so compatibility is good.
[Client] → POST /api/chat
[Server] ← Transfer-Encoding: chunked
[Server] → chunk: Hello
[Server] → chunk: world
[Server] → 0 (end of chunks)
Pros:
- Plain HTTP - works anywhere fetch works, with broad proxy and CDN compatibility.
- Minimal server changes: write chunks to the response as they are generated.
Cons:
- No built-in event framing - you define and parse your own message boundaries.
- No automatic reconnection, and some intermediaries buffer chunks.
Recommendation: Use SSE for most applications. It's simple, well-supported, and handles 95% of use cases. Switch to WebSockets only if you need bidirectional communication (e.g., user can cancel mid-stream).
Let's build a complete streaming implementation with SSE.
// app/api/chat/route.ts
import { OpenAI } from 'openai';
export const runtime = 'nodejs';
export async function POST(request: Request) {
const { messages } = await request.json();
const openai = new OpenAI();
// Create streaming response
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
stream: true
});
// Convert to SSE format
const encoder = new TextEncoder();
const readable = new ReadableStream({
async start(controller) {
try {
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
// SSE format: "data: {json}\n\n"
const data = JSON.stringify({ type: 'token', content });
controller.enqueue(encoder.encode(`data: ${data}\n\n`));
}
// Check for finish reason
if (chunk.choices[0]?.finish_reason) {
const done = JSON.stringify({
type: 'done',
reason: chunk.choices[0].finish_reason
});
controller.enqueue(encoder.encode(`data: ${done}\n\n`));
}
}
} catch (error) {
const errorData = JSON.stringify({
type: 'error',
message: error instanceof Error ? error.message : 'Stream failed'
});
controller.enqueue(encoder.encode(`data: ${errorData}\n\n`));
} finally {
controller.close();
}
}
});
return new Response(readable, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive'
}
});
}
// hooks/useStreamingChat.ts
import { useState, useCallback } from 'react';
interface StreamState {
content: string;
isStreaming: boolean;
error: string | null;
}
export function useStreamingChat() {
const [state, setState] = useState<StreamState>({
content: '',
isStreaming: false,
error: null
});
const sendMessage = useCallback(async (message: string) => {
setState({ content: '', isStreaming: true, error: null });
try {
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
messages: [{ role: 'user', content: message }]
})
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}`);
}
const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// Parse SSE events from buffer
const events = buffer.split('\n\n');
buffer = events.pop() || ''; // Keep incomplete event in buffer
for (const event of events) {
if (!event.startsWith('data: ')) continue;
const json = event.slice(6); // Remove "data: " prefix
const data = JSON.parse(json);
if (data.type === 'token') {
setState(prev => ({
...prev,
content: prev.content + data.content
}));
} else if (data.type === 'done') {
setState(prev => ({ ...prev, isStreaming: false }));
} else if (data.type === 'error') {
setState(prev => ({
...prev,
isStreaming: false,
error: data.message
}));
}
}
}
} catch (error) {
setState(prev => ({
...prev,
isStreaming: false,
error: error instanceof Error ? error.message : 'Request failed'
}));
}
}, []);
return { ...state, sendMessage };
}
Rendering every token individually causes janky animations. Batch updates for smooth rendering.
// hooks/useBufferedContent.ts
import { useState, useEffect, useRef } from 'react';
export function useBufferedContent(
streamContent: string,
bufferInterval: number = 50
) {
const [displayContent, setDisplayContent] = useState('');
const pendingRef = useRef(streamContent);
useEffect(() => {
pendingRef.current = streamContent;
}, [streamContent]);
useEffect(() => {
const interval = setInterval(() => {
if (pendingRef.current !== displayContent) {
setDisplayContent(pendingRef.current);
}
}, bufferInterval);
return () => clearInterval(interval);
}, [displayContent, bufferInterval]);
return displayContent;
}
// Usage in component
function ChatMessage({ streamContent, isStreaming }) {
const displayContent = useBufferedContent(streamContent, 50);
return (
<div className="message">
<ReactMarkdown>{displayContent}</ReactMarkdown>
{isStreaming && <span className="cursor-blink">▋</span>}
</div>
);
}
When agents use tools, handle them gracefully in the stream.
interface StreamEvent {
type: 'token' | 'tool_call' | 'tool_result' | 'done' | 'error';
content?: string;
toolCall?: {
id: string;
name: string;
arguments: string;
};
toolResult?: {
id: string;
result: any;
};
}
async function* processStreamWithTools(stream: AsyncIterable<any>): AsyncGenerator<StreamEvent> {
const pendingToolCalls: Map<string, any> = new Map();
let accumulatedContent = '';
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta;
// Handle content tokens
if (delta?.content) {
yield { type: 'token', content: delta.content };
accumulatedContent += delta.content;
}
// Handle tool calls
if (delta?.tool_calls) {
for (const toolCall of delta.tool_calls) {
const existing = pendingToolCalls.get(toolCall.index) || {
id: '',
name: '',
arguments: ''
};
if (toolCall.id) existing.id = toolCall.id;
if (toolCall.function?.name) existing.name = toolCall.function.name;
if (toolCall.function?.arguments) {
existing.arguments += toolCall.function.arguments;
}
pendingToolCalls.set(toolCall.index, existing);
}
}
// Check if tool calls are complete
const finishReason = chunk.choices[0]?.finish_reason;
if (finishReason === 'tool_calls') {
for (const [_, toolCall] of pendingToolCalls) {
yield {
type: 'tool_call',
toolCall: {
id: toolCall.id,
name: toolCall.name,
arguments: toolCall.arguments
}
};
// Execute the tool (executeTool is your app's own dispatcher) and yield the result
const result = await executeTool(toolCall.name, JSON.parse(toolCall.arguments));
yield {
type: 'tool_result',
toolResult: { id: toolCall.id, result }
};
}
pendingToolCalls.clear();
}
if (finishReason === 'stop') {
yield { type: 'done' };
}
}
}
Technical implementation is half the battle. Good UX patterns make streaming feel polished.
Show a typing indicator during the 200-400ms before the first token.
function ChatInterface() {
const [showTyping, setShowTyping] = useState(false);
const { content, isStreaming, sendMessage } = useStreamingChat();
const handleSend = async (message: string) => {
setShowTyping(true);
await sendMessage(message);
};
// Hide typing indicator once content starts arriving
useEffect(() => {
if (content.length > 0) {
setShowTyping(false);
}
}, [content]);
return (
<div>
{showTyping && <TypingIndicator />}
{content && <StreamingMessage content={content} />}
</div>
);
}
A blinking cursor at the end of streaming content feels natural.
.cursor-blink {
animation: blink 1s step-end infinite;
}
@keyframes blink {
50% { opacity: 0; }
}
/* Fade out cursor when streaming stops */
.cursor-fade {
animation: fadeOut 0.3s ease-out forwards;
}
@keyframes fadeOut {
to { opacity: 0; }
}
For responses that include processing steps, show progress.
function StreamingResponse({ events }) {
return (
<div>
{events.map((event, i) => {
if (event.type === 'token') {
return <span key={i}>{event.content}</span>;
}
if (event.type === 'tool_call') {
return (
<div key={i} className="tool-indicator">
<Spinner size="sm" />
<span>Searching: {event.toolCall.name}...</span>
</div>
);
}
if (event.type === 'tool_result') {
return (
<div key={i} className="tool-complete">
<CheckIcon />
<span>Found {event.toolResult.result.count} results</span>
</div>
);
}
return null;
})}
</div>
);
}
When streams fail mid-response, preserve partial content.
function useResilientStream() {
  // Sketch: partialContent mirrors the streamed content from useStreamingChat,
  // and sendMessage comes from that same hook (wiring omitted for brevity)
  const [partialContent, setPartialContent] = useState('');
  const [error, setError] = useState<string | null>(null);
  const sendWithRecovery = async (message: string) => {
setError(null);
try {
await sendMessage(message);
} catch (error) {
// Preserve what we received
if (partialContent.length > 50) {
setError(`Response interrupted. Showing partial response (${partialContent.length} characters received).`);
// Don't clear partialContent - show what we have
} else {
setError('Failed to get response. Please try again.');
setPartialContent('');
}
}
};
return { partialContent, error, sendWithRecovery };
}
// Measure Time to First Token (TTFT): record how long it takes for the first
// streamed content to appear in the message container after sending
async function measureTTFT(sendMessage: () => Promise<void>) {
const start = performance.now();
let ttft: number | null = null;
const observer = new MutationObserver(() => {
if (ttft === null) {
ttft = performance.now() - start;
observer.disconnect();
console.log(`TTFT: ${ttft.toFixed(0)}ms`);
}
});
observer.observe(document.querySelector('.message-container')!, {
childList: true,
subtree: true,
characterData: true
});
await sendMessage();
}
Mobile networks have higher latency. Adjust buffering and show loading states earlier.
const isMobile = /iPhone|iPad|iPod|Android/i.test(navigator.userAgent);
const bufferInterval = isMobile ? 100 : 50; // More buffering on mobile
const loadingTimeout = isMobile ? 500 : 300; // Show loading earlier on mobile
Short responses: for replies under 100 tokens, streaming adds complexity without much UX benefit. Use a threshold - stream responses expected to be long, return short ones in a single response.
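A minimal sketch of that decision - the inputs are illustrative assumptions, not part of the API above:
// Stream only when a user is watching and the reply is likely to be long
function shouldStream(opts: { expectedMaxTokens: number; userIsWatching: boolean }): boolean {
  return opts.userIsWatching && opts.expectedMaxTokens > 100;
}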
Markdown rendering: render incrementally, but be careful with incomplete markup. Either wait for complete blocks or use a parser that tolerates partial markdown.
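One low-effort safeguard, sketched below: if the streamed text contains an unclosed code fence, temporarily close it before handing it to the renderer.
// Temporarily close an unfinished ``` block so the renderer doesn't treat
// the rest of the partial message as code
function stabilizeMarkdown(partial: string): string {
  const fenceCount = (partial.match(/```/g) ?? []).length;
  return fenceCount % 2 === 1 ? partial + '\n```' : partial;
}

// Usage: <ReactMarkdown>{stabilizeMarkdown(displayContent)}</ReactMarkdown>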
Structured output: partial JSON isn't valid JSON. Either stream the raw text and parse it once the response completes, or use function calling, where the complete result arrives atomically.
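A sketch of the parse-at-the-end approach, assuming the model was asked to return a single JSON object:
// Accumulate the raw text while streaming; parse only once the stream ends
async function collectJson(tokens: AsyncIterable<string>): Promise<unknown> {
  let raw = '';
  for await (const token of tokens) {
    raw += token; // optionally show the raw text as a live preview
  }
  try {
    return JSON.parse(raw);
  } catch {
    throw new Error('Model returned malformed or truncated JSON');
  }
}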
Rate limits and server resources: a streamed request counts against provider rate limits just like a non-streamed one, but open streams hold connections and consume server resources. Set reasonable timeouts and close idle connections.
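One way to enforce that, sketched under the assumption of a 30-second idle window: wrap the provider stream in an async generator that aborts if no chunk arrives in time.
// Abort the stream if no chunk arrives within the idle window
async function* withIdleTimeout<T>(
  stream: AsyncIterable<T>,
  ms = 30_000
): AsyncGenerator<T> {
  const iterator = stream[Symbol.asyncIterator]();
  while (true) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('Stream idle timeout')), ms);
    });
    try {
      const result = await Promise.race([iterator.next(), timeout]);
      if (result.done) return;
      yield result.value;
    } finally {
      clearTimeout(timer);
    }
  }
}

// Usage in the route handler: for await (const chunk of withIdleTimeout(stream)) { ... }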
Testing: create mock streams that emit tokens at controlled intervals, and test both happy paths and interruptions.
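A minimal mock whose chunk shape mirrors the OpenAI SDK's streaming chunks - the delay and failure parameters are illustrative:
// Emits word-sized "tokens" at a fixed interval; optionally fails partway
// through to simulate an interrupted connection
async function* mockTokenStream(
  text: string,
  { delayMs = 30, failAfter }: { delayMs?: number; failAfter?: number } = {}
) {
  const tokens = text.split(/(?<=\s)/);
  for (let i = 0; i < tokens.length; i++) {
    if (failAfter !== undefined && i === failAfter) {
      throw new Error('Simulated stream interruption');
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield { choices: [{ delta: { content: tokens[i] }, finish_reason: null }] };
  }
  yield { choices: [{ delta: {}, finish_reason: 'stop' }] };
}

// Usage: pass mockTokenStream('Hello world ...') anywhere an OpenAI stream is expected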
Streaming transforms how users perceive AI agent speed. The implementation is straightforward with SSE, but details like buffering, error handling, and progress indicators separate good implementations from great ones.
Implementation checklist:
- Stream from the LLM API with stream: true and convert chunks to SSE events on the server.
- Parse SSE events on the client, keeping incomplete events in the buffer.
- Batch rendering updates every 50-100ms instead of re-rendering per token.
- Show a typing indicator until the first token arrives and a cursor while streaming.
- Surface tool calls and results as inline progress indicators.
- Preserve partial content and offer a retry when a stream is interrupted.
- Measure TTFT and keep it under 500ms.