Streaming LLM Responses: Building Real-Time User Experiences
Implement streaming responses that make AI agents feel fast and responsive - covering SSE setup, token-by-token rendering, progress indicators, and handling partial failures gracefully.
Click send, wait 3 seconds, see nothing. Wait another 2 seconds, still nothing. Finally - a wall of text appears. That's the non-streaming experience. Users don't know if the system is working, broken, or just slow.
Streaming responses change this fundamentally. The first token appears within 200-400ms. Users see the response being generated character by character, giving immediate feedback that the system is working. Even though the total time to complete is similar, perceived performance improves dramatically.
This guide covers implementing streaming from LLM APIs through to frontend rendering, with patterns that handle real-world complications like errors, tool calls, and mobile connectivity.
Key takeaways
- Streaming is primarily a UX improvement - total time to full response is often the same or slightly longer.
- Time to First Token (TTFT) is the metric that matters; aim for <500ms.
- Don't render every token individually - batch updates every 50-100ms for smooth animation.
- Plan for stream interruptions: save partial responses, show graceful errors, offer retry.
Research from Stanford's HCI lab found that users perceive streaming chat interfaces as 40% faster than non-streaming interfaces, even when total response time is identical (Stanford HCI, 2024). The psychology is simple: visible progress feels faster than invisible waiting.
Non-streaming request:
User sends message
→ Network latency (~50ms)
→ LLM processing (2,000-8,000ms)
→ Network latency (~50ms)
→ Render response (~10ms)
Total time to any feedback: 2,100-8,100ms
Streaming request:
User sends message
→ Network latency (~50ms)
→ LLM starts generating (200-400ms)
→ First token arrives
→ Tokens stream continuously
→ Final token arrives (2,000-8,000ms total)
Total time to first feedback: 250-450ms
The user sees activity almost immediately with streaming. That changes the entire perception of speed.
| Scenario | Benefit | Priority |
|---|---|---|
| Long responses (500+ tokens) | High - visible progress during generation | Essential |
| User-facing chat | High - immediate feedback | Essential |
| Slow models (GPT-4, Claude Opus) | High - reduces perceived wait | High |
| Fast models (GPT-4o-mini) | Medium - already fast | Nice to have |
| Background processing | Low - user isn't watching | Skip |
| Batch operations | None - no UI to update | Skip |
Three main approaches exist for getting tokens from the LLM to the browser.
Server-Sent Events (SSE): a one-way stream from server to client, and the simplest option for most use cases.
[Client] → HTTP POST /api/chat (message)
[Server] ← Opens SSE connection
[Server] → data: {"token": "Hello"}
[Server] → data: {"token": " world"}
[Server] → data: [DONE]
[Server] ← Connection closes
Pros:
- Works over plain HTTP - no extra infrastructure or protocol upgrade.
- Simple to implement on both the server and the client.
- The browser's built-in EventSource API handles parsing and reconnection (for GET endpoints).
Cons:
- One-way only - the client can't send anything over the same connection.
- EventSource only supports GET, so POST-based chat endpoints need fetch plus manual stream parsing.
- Some proxies and load balancers buffer responses unless configured for streaming.
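If the endpoint can be a GET, EventSource gives you SSE parsing and automatic reconnection for free. A minimal sketch - the /api/stream endpoint, q parameter, and appendToMessage helper are assumptions for illustration:
// Hypothetical GET endpoint emitting the same {type, content} events
const source = new EventSource(`/api/stream?q=${encodeURIComponent('Hello')}`);

source.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'token') appendToMessage(data.content); // your render helper
  if (data.type === 'done') source.close();
};

source.onerror = () => source.close(); // EventSource retries by default; close() stops it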
WebSockets: a bidirectional persistent connection. More complex but more flexible.
[Client] ↔ WebSocket connection established
[Client] → {"type": "message", "content": "Hello"}
[Server] → {"type": "token", "content": "Hi"}
[Server] → {"type": "token", "content": " there"}
[Client] → {"type": "cancel"} (user can interrupt)
[Server] → {"type": "done"}
Pros:
- Bidirectional - the client can cancel generation or send follow-ups mid-stream.
- A single persistent connection can carry an entire conversation.
Cons:
- More moving parts: connection lifecycle, reconnection logic, and scaling (sticky sessions or a pub/sub layer).
- Not plain HTTP, so some proxies and corporate networks need extra configuration.
- Overkill when all you need is server-to-client streaming.
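A minimal browser-side sketch of the flow above - the wss://example.com/ws endpoint, message shapes, and appendToMessage helper are assumptions for illustration:
const socket = new WebSocket('wss://example.com/ws');

socket.onopen = () => {
  socket.send(JSON.stringify({ type: 'message', content: 'Hello' }));
};

socket.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'token') appendToMessage(data.content); // your render helper
  if (data.type === 'done') socket.close();
};

// The payoff over SSE: the client can interrupt generation mid-stream
function cancelGeneration() {
  socket.send(JSON.stringify({ type: 'cancel' }));
}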
HTTP streaming with chunked transfer encoding: uses standard HTTP, so compatibility is good.
[Client] → POST /api/chat
[Server] ← Transfer-Encoding: chunked
[Server] → chunk: Hello
[Server] → chunk: world
[Server] → 0 (end of chunks)
Pros:
- Plain HTTP - works anywhere fetch works, with broad proxy and CDN compatibility.
- Minimal server changes: write chunks to the response as they are generated.
Cons:
- No built-in event framing - you define and parse your own message boundaries.
- No automatic reconnection, and some intermediaries buffer chunks.
Recommendation: Use SSE for most applications. It's simple, well-supported, and handles 95% of use cases. Switch to WebSockets only if you need bidirectional communication (e.g., user can cancel mid-stream).
Let's build a complete streaming implementation with SSE.
// app/api/chat/route.ts
import { OpenAI } from 'openai';
export const runtime = 'nodejs';
export async function POST(request: Request) {
const { messages } = await request.json();
const openai = new OpenAI();
// Create streaming response
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
stream: true
});
// Convert to SSE format
const encoder = new TextEncoder();
const readable = new ReadableStream({
async start(controller) {
try {
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
// SSE format: "data: {json}\n\n"
const data = JSON.stringify({ type: 'token', content });
controller.enqueue(encoder.encode(`data: ${data}\n\n`));
}
// Check for finish reason
if (chunk.choices[0]?.finish_reason) {
const done = JSON.stringify({
type: 'done',
reason: chunk.choices[0].finish_reason
});
controller.enqueue(encoder.encode(`data: ${done}\n\n`));
}
}
} catch (error) {
const errorData = JSON.stringify({
type: 'error',
message: error instanceof Error ? error.message : 'Stream failed'
});
controller.enqueue(encoder.encode(`data: ${errorData}\n\n`));
} finally {
controller.close();
}
}
});
return new Response(readable, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive'
}
});
}
// hooks/useStreamingChat.ts
import { useState, useCallback } from 'react';
interface StreamState {
content: string;
isStreaming: boolean;
error: string | null;
}
export function useStreamingChat() {
const [state, setState] = useState<StreamState>({
content: '',
isStreaming: false,
error: null
});
const sendMessage = useCallback(async (message: string) => {
setState({ content: '', isStreaming: true, error: null });
try {
const response = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
messages: [{ role: 'user', content: message }]
})
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}`);
}
const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// Parse SSE events from buffer
const events = buffer.split('\n\n');
buffer = events.pop() || ''; // Keep incomplete event in buffer
for (const event of events) {
if (!event.startsWith('data: ')) continue;
const json = event.slice(6); // Remove "data: " prefix
const data = JSON.parse(json);
if (data.type === 'token') {
setState(prev => ({
...prev,
content: prev.content + data.content
}));
} else if (data.type === 'done') {
setState(prev => ({ ...prev, isStreaming: false }));
} else if (data.type === 'error') {
setState(prev => ({
...prev,
isStreaming: false,
error: data.message
}));
}
}
}
} catch (error) {
setState(prev => ({
...prev,
isStreaming: false,
error: error instanceof Error ? error.message : 'Request failed'
}));
}
}, []);
return { ...state, sendMessage };
}
Rendering every token individually causes janky animations. Batch updates for smooth rendering.
// hooks/useBufferedContent.ts
import { useState, useEffect, useRef } from 'react';
export function useBufferedContent(
streamContent: string,
bufferInterval: number = 50
) {
const [displayContent, setDisplayContent] = useState('');
const pendingRef = useRef(streamContent);
useEffect(() => {
pendingRef.current = streamContent;
}, [streamContent]);
useEffect(() => {
const interval = setInterval(() => {
if (pendingRef.current !== displayContent) {
setDisplayContent(pendingRef.current);
}
}, bufferInterval);
return () => clearInterval(interval);
}, [displayContent, bufferInterval]);
return displayContent;
}
// Usage in component
function ChatMessage({ streamContent, isStreaming }) {
const displayContent = useBufferedContent(streamContent, 50);
return (
<div className="message">
<ReactMarkdown>{displayContent}</ReactMarkdown>
{isStreaming && <span className="cursor-blink">▋</span>}
</div>
);
}
When agents use tools, handle them gracefully in the stream.
interface StreamEvent {
type: 'token' | 'tool_call' | 'tool_result' | 'done' | 'error';
content?: string;
toolCall?: {
id: string;
name: string;
arguments: string;
};
toolResult?: {
id: string;
result: any;
};
}
async function* processStreamWithTools(stream: AsyncIterable<any>): AsyncGenerator<StreamEvent> {
const pendingToolCalls: Map<string, any> = new Map();
let accumulatedContent = '';
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta;
// Handle content tokens
if (delta?.content) {
yield { type: 'token', content: delta.content };
accumulatedContent += delta.content;
}
// Handle tool calls
if (delta?.tool_calls) {
for (const toolCall of delta.tool_calls) {
const existing = pendingToolCalls.get(toolCall.index) || {
id: '',
name: '',
arguments: ''
};
if (toolCall.id) existing.id = toolCall.id;
if (toolCall.function?.name) existing.name = toolCall.function.name;
if (toolCall.function?.arguments) {
existing.arguments += toolCall.function.arguments;
}
pendingToolCalls.set(toolCall.index, existing);
}
}
// Check if tool calls are complete
const finishReason = chunk.choices[0]?.finish_reason;
if (finishReason === 'tool_calls') {
for (const [_, toolCall] of pendingToolCalls) {
yield {
type: 'tool_call',
toolCall: {
id: toolCall.id,
name: toolCall.name,
arguments: toolCall.arguments
}
};
// Execute the tool (executeTool is your app's own dispatcher) and yield the result
const result = await executeTool(toolCall.name, JSON.parse(toolCall.arguments));
yield {
type: 'tool_result',
toolResult: { id: toolCall.id, result }
};
}
pendingToolCalls.clear();
}
if (finishReason === 'stop') {
yield { type: 'done' };
}
}
}
Technical implementation is half the battle. Good UX patterns make streaming feel polished.
Show a typing indicator during the 200-400ms before the first token.
function ChatInterface() {
const [showTyping, setShowTyping] = useState(false);
const { content, isStreaming, sendMessage } = useStreamingChat();
const handleSend = async (message: string) => {
setShowTyping(true);
await sendMessage(message);
};
// Hide typing indicator once content starts arriving
useEffect(() => {
if (content.length > 0) {
setShowTyping(false);
}
}, [content]);
return (
<div>
{showTyping && <TypingIndicator />}
{content && <StreamingMessage content={content} />}
</div>
);
}
A blinking cursor at the end of streaming content feels natural.
.cursor-blink {
animation: blink 1s step-end infinite;
}
@keyframes blink {
50% { opacity: 0; }
}
/* Fade out cursor when streaming stops */
.cursor-fade {
animation: fadeOut 0.3s ease-out forwards;
}
@keyframes fadeOut {
to { opacity: 0; }
}
For responses that include processing steps, show progress.
function StreamingResponse({ events }) {
return (
<div>
{events.map((event, i) => {
if (event.type === 'token') {
return <span key={i}>{event.content}</span>;
}
if (event.type === 'tool_call') {
return (
<div key={i} className="tool-indicator">
<Spinner size="sm" />
<span>Searching: {event.toolCall.name}...</span>
</div>
);
}
if (event.type === 'tool_result') {
return (
<div key={i} className="tool-complete">
<CheckIcon />
<span>Found {event.toolResult.result.count} results</span>
</div>
);
}
return null;
})}
</div>
);
}
When streams fail mid-response, preserve partial content.
function useResilientStream() {
  // Sketch: partialContent mirrors the streamed content from useStreamingChat,
  // and sendMessage comes from that same hook (wiring omitted for brevity)
  const [partialContent, setPartialContent] = useState('');
  const [error, setError] = useState<string | null>(null);
  const sendWithRecovery = async (message: string) => {
setError(null);
try {
await sendMessage(message);
} catch (error) {
// Preserve what we received
if (partialContent.length > 50) {
setError(`Response interrupted. Showing partial response (${partialContent.length} characters received).`);
// Don't clear partialContent - show what we have
} else {
setError('Failed to get response. Please try again.');
setPartialContent('');
}
}
};
return { partialContent, error, sendWithRecovery };
}
// Measure Time to First Token (TTFT): record how long it takes for the first
// streamed content to appear in the message container after sending
async function measureTTFT(sendMessage: () => Promise<void>) {
const start = performance.now();
let ttft: number | null = null;
const observer = new MutationObserver(() => {
if (ttft === null) {
ttft = performance.now() - start;
observer.disconnect();
console.log(`TTFT: ${ttft.toFixed(0)}ms`);
}
});
observer.observe(document.querySelector('.message-container')!, {
childList: true,
subtree: true,
characterData: true
});
await sendMessage();
}
Mobile networks have higher latency. Adjust buffering and show loading states earlier.
const isMobile = /iPhone|iPad|iPod|Android/i.test(navigator.userAgent);
const bufferInterval = isMobile ? 100 : 50; // More buffering on mobile
const loadingTimeout = isMobile ? 500 : 300; // Show loading earlier on mobile
Short responses: for replies under 100 tokens, streaming adds complexity without much UX benefit. Use a threshold - stream responses expected to be long, return short ones in a single response.
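A minimal sketch of that decision - the inputs are illustrative assumptions, not part of the API above:
// Stream only when a user is watching and the reply is likely to be long
function shouldStream(opts: { expectedMaxTokens: number; userIsWatching: boolean }): boolean {
  return opts.userIsWatching && opts.expectedMaxTokens > 100;
}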
Markdown rendering: render incrementally, but be careful with incomplete markup. Either wait for complete blocks or use a parser that tolerates partial markdown.
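One low-effort safeguard, sketched below: if the streamed text contains an unclosed code fence, temporarily close it before handing it to the renderer.
// Temporarily close an unfinished ``` block so the renderer doesn't treat
// the rest of the partial message as code
function stabilizeMarkdown(partial: string): string {
  const fenceCount = (partial.match(/```/g) ?? []).length;
  return fenceCount % 2 === 1 ? partial + '\n```' : partial;
}

// Usage: <ReactMarkdown>{stabilizeMarkdown(displayContent)}</ReactMarkdown>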
Structured output: partial JSON isn't valid JSON. Either stream the raw text and parse it once the response completes, or use function calling, where the complete result arrives atomically.
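A sketch of the parse-at-the-end approach, assuming the model was asked to return a single JSON object:
// Accumulate the raw text while streaming; parse only once the stream ends
async function collectJson(tokens: AsyncIterable<string>): Promise<unknown> {
  let raw = '';
  for await (const token of tokens) {
    raw += token; // optionally show the raw text as a live preview
  }
  try {
    return JSON.parse(raw);
  } catch {
    throw new Error('Model returned malformed or truncated JSON');
  }
}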
Rate limits and server resources: a streamed request counts against provider rate limits just like a non-streamed one, but open streams hold connections and consume server resources. Set reasonable timeouts and close idle connections.
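One way to enforce that, sketched under the assumption of a 30-second idle window: wrap the provider stream in an async generator that aborts if no chunk arrives in time.
// Abort the stream if no chunk arrives within the idle window
async function* withIdleTimeout<T>(
  stream: AsyncIterable<T>,
  ms = 30_000
): AsyncGenerator<T> {
  const iterator = stream[Symbol.asyncIterator]();
  while (true) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('Stream idle timeout')), ms);
    });
    try {
      const result = await Promise.race([iterator.next(), timeout]);
      if (result.done) return;
      yield result.value;
    } finally {
      clearTimeout(timer);
    }
  }
}

// Usage in the route handler: for await (const chunk of withIdleTimeout(stream)) { ... }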
Testing: create mock streams that emit tokens at controlled intervals, and test both happy paths and interruptions.
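A minimal mock whose chunk shape mirrors the OpenAI SDK's streaming chunks - the delay and failure parameters are illustrative:
// Emits word-sized "tokens" at a fixed interval; optionally fails partway
// through to simulate an interrupted connection
async function* mockTokenStream(
  text: string,
  { delayMs = 30, failAfter }: { delayMs?: number; failAfter?: number } = {}
) {
  const tokens = text.split(/(?<=\s)/);
  for (let i = 0; i < tokens.length; i++) {
    if (failAfter !== undefined && i === failAfter) {
      throw new Error('Simulated stream interruption');
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield { choices: [{ delta: { content: tokens[i] }, finish_reason: null }] };
  }
  yield { choices: [{ delta: {}, finish_reason: 'stop' }] };
}

// Usage: pass mockTokenStream('Hello world ...') anywhere an OpenAI stream is expected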
Streaming transforms how users perceive AI agent speed. The implementation is straightforward with SSE, but details like buffering, error handling, and progress indicators separate good implementations from great ones.
Implementation checklist:
- Stream from the LLM API with stream: true and convert chunks to SSE events on the server.
- Parse SSE events on the client, keeping incomplete events in the buffer.
- Batch rendering updates every 50-100ms instead of re-rendering per token.
- Show a typing indicator until the first token arrives and a cursor while streaming.
- Surface tool calls and results as inline progress indicators.
- Preserve partial content and offer a retry when a stream is interrupted.
- Measure TTFT and keep it under 500ms.