Academy · 5 Sept 2025 · 14 min read

Vector Database Optimization: From Slow Queries to Sub-50ms

Optimize vector database performance with indexing strategies, query planning, hardware tuning, and caching to achieve sub-50ms retrieval latency at scale.

Max Beech
Head of Content

TL;DR

  • HNSW indexing reduces query time from 200ms to 30ms for datasets with 100K+ vectors.
  • Use vector_cosine_ops instead of vector_l2_ops for normalized embeddings (OpenAI, Cohere).
  • Pre-filter with B-tree indexes on metadata before vector search to cut candidates by 70-90%.
  • Cache frequent query embeddings and results to reduce compute costs by 60%.

Jump to: Performance bottlenecks · Indexing strategies · Query optimization · Hardware tuning


Vector databases power RAG systems, semantic search, and recommendation engines. But as your dataset grows from 10K to 1M+ vectors, query latency creeps from a comfortable 50ms to an unacceptable 500ms+. Users abandon searches, agents time out, and infrastructure costs spiral.

This guide covers production-tested optimization strategies that took our pgvector database from 180ms p95 latency to 42ms while handling 4× more traffic. These patterns apply to pgvector, Pinecone, Weaviate, and Qdrant.

Key takeaways

  • Indexing choice matters: HNSW delivered 1.6-4× faster queries than IVFFlat in our benchmarks, and both are orders of magnitude faster than no index.
  • Pre-filtering metadata before vector search reduces search space by 70-90% for most queries.
  • Connection pooling and prepared statements cut query overhead by 30-40%.
  • Monitor query planning with EXPLAIN ANALYZE to identify slow paths.

Performance bottlenecks

Vector database performance degrades due to four primary causes.

1. Linear scan without indexes

Without indexes, vector similarity search compares the query vector against every stored vector. For 1M vectors at 1536 dimensions (OpenAI embeddings), that is roughly 1.5 billion floating-point operations per query.

Symptoms:

  • Query time increases linearly with dataset size
  • High CPU usage during queries
  • p95 latency >200ms for datasets >100K vectors

Fix: Add HNSW or IVFFlat indexes (covered below).

2. Metadata filtering after vector search

Naive implementations retrieve the top-100 vectors first, then filter by metadata (date, category, author). This wastes computation on candidates that get discarded anyway.

Example:

-- BAD: runs the vector search over all rows, then filters the results
SELECT * FROM (
  SELECT * FROM embeddings
  ORDER BY embedding <=> $1
  LIMIT 100
) candidates
WHERE category = 'technical-docs'
LIMIT 10;

Fix: Pre-filter with B-tree indexes before vector search.

3. Cold start and connection overhead

Opening new database connections for each query adds 20-50ms latency. Repeated authentication and TLS handshakes compound the cost.

Symptoms:

  • High variability in query times (50ms to 150ms)
  • Connection pool exhaustion under load
  • Increasing latency with concurrent users

Fix: Use connection pooling with persistent connections.

4. Inefficient distance metrics

Using L2 distance (<->) for normalized vectors (like OpenAI embeddings) performs unnecessary square-root work. Cosine distance (<=>) is cheaper and produces the same ranking for unit-length vectors.

| Metric | Operation | When to use | Performance |
| --- | --- | --- | --- |
| Cosine (<=>) | Dot product / norms | Normalized embeddings (OpenAI, Cohere) | Fast |
| L2 (<->) | Euclidean distance | Unnormalized vectors, spatial data | Moderate |
| Inner product (<#>) | Negative dot product | Maximum similarity | Fastest |

According to pgvector benchmarks, cosine distance queries run 15-20% faster than L2 for OpenAI text-embedding-3-small vectors (pgvector docs, 2024).
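
One practical consequence: the distance operator in the query has to match the operator class the index was built with, otherwise the planner cannot use the index at all. A minimal illustration (assuming an embeddings table indexed with vector_cosine_ops, as in the HNSW example below):

-- Matches a vector_cosine_ops index, so the index scan is used
SELECT id FROM embeddings ORDER BY embedding <=> $1 LIMIT 10;

-- Does not match a vector_cosine_ops index; pgvector falls back to a sequential scan
SELECT id FROM embeddings ORDER BY embedding <-> $1 LIMIT 10;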

Indexing strategies

Indexes trade disk space and build time for dramatically faster queries.

HNSW (Hierarchical Navigable Small World)

HNSW builds a multi-layer graph where each node connects to nearby vectors. Queries traverse the graph from top to bottom, pruning distant candidates.

Characteristics:

  • Build time: Moderate (2-5 minutes for 1M vectors)
  • Query speed: Very fast (10-50ms)
  • Memory: High (50-100% of vector storage)
  • Recall: Excellent (>95% with default params)

When to use: Production systems prioritizing query speed over build time.

Create HNSW index:

CREATE INDEX ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

Parameters:

  • m: Connections per layer (higher = better recall, slower builds) – range 8-64, default 16
  • ef_construction: Candidate list size during build (higher = better recall, slower builds) – range 32-400, default 64

Query-time tuning:

SET hnsw.ef_search = 100; -- Candidate list size during search (default 40)

Higher ef_search improves recall at the cost of latency. Tune based on your precision requirements.
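
hnsw.ef_search can also be scoped to a single transaction with SET LOCAL, which lets a latency-sensitive endpoint and a recall-sensitive batch job share the same database without fighting over a global setting. A small sketch:

BEGIN;
SET LOCAL hnsw.ef_search = 200; -- higher recall for this query only
SELECT id, chunk_text, embedding <=> $1 AS distance
FROM embeddings
ORDER BY embedding <=> $1
LIMIT 10;
COMMIT;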

IVFFlat (Inverted File Index)

IVFFlat clusters vectors into partitions. Queries search only the nearest partitions.

Characteristics:

  • Build time: Fast (30s - 2 min for 1M vectors)
  • Query speed: Moderate (50-120ms)
  • Memory: Low (10-20% of vector storage)
  • Recall: Good (85-92% with tuning)

When to use: Memory-constrained systems or frequently rebuilt indexes.

Create IVFFlat index:

CREATE INDEX ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);

Parameters:

  • lists: Number of partitions – the pgvector docs suggest roughly rows/1000 for up to 1M rows and sqrt(rows) above that

Query-time tuning:

SET ivfflat.probes = 10; -- Number of partitions to search (default 1)

More probes = better recall, slower queries.
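
For example, sizing for roughly 1M rows (a sketch; treat the numbers as starting points and tune probes against your recall target):

-- lists ≈ sqrt(1,000,000) = 1000 partitions
CREATE INDEX ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);

-- probes ≈ sqrt(lists); raise it if recall is too low, lower it if latency matters more
SET ivfflat.probes = 32;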

Benchmark comparison

| Dataset size | No index | IVFFlat | HNSW |
| --- | --- | --- | --- |
| 10K vectors | 25ms | 18ms | 12ms |
| 100K vectors | 180ms | 45ms | 28ms |
| 1M vectors | 1,850ms | 120ms | 42ms |
| 10M vectors | N/A (timeout) | 380ms | 95ms |

Benchmarked on pgvector with OpenAI text-embedding-3-small (1536 dim), AWS RDS db.r6g.xlarge, querying for top-10 results.

Our choice: HNSW with m=16, ef_construction=64 for production. Build time is acceptable (4 minutes for 1.2M vectors), and query latency averages 38ms p50, 72ms p95.

Query optimization

Beyond indexing, optimize query patterns to reduce unnecessary work.

Pre-filter metadata before vector search

Most queries have implicit filters: date ranges, categories, authors. Apply these before vector search.

Pattern: Metadata + vector hybrid query

-- Create B-tree indexes on filterable columns
CREATE INDEX idx_created_at ON embeddings(created_at);
CREATE INDEX idx_category ON embeddings(category);

-- Query with pre-filtering
SELECT id, chunk_text, embedding <=> $1 AS distance
FROM embeddings
WHERE
  category = 'technical-docs'
  AND created_at > NOW() - INTERVAL '6 months'
ORDER BY embedding <=> $1
LIMIT 10;

The query planner can use the B-tree indexes to narrow candidates from 1.2M vectors to roughly 85K before computing vector distances.

Impact: Reduced p95 latency from 120ms to 45ms for category-filtered queries.

Use prepared statements

Prepared statements reduce parsing overhead by compiling queries once.

// node-postgres prepared statement: giving the query a name makes the server
// parse and plan it once per connection, then reuse the plan
const vectorSearchQuery = {
  name: 'vector_search',
  text: `
    SELECT id, chunk_text, embedding <=> $1 AS distance
    FROM embeddings
    WHERE category = $2
    ORDER BY embedding <=> $1
    LIMIT $3
  `,
};

// Execute repeatedly (pool is the pg Pool from the connection pooling section below);
// pgvector accepts the JSON array text format for vector parameters
const results = await pool.query({
  ...vectorSearchQuery,
  values: [JSON.stringify(embedding), 'technical-docs', 10],
});

Impact: 8-12ms latency reduction per query.

Limit result count aggressively

Retrieving top-100 results when you only use top-5 wastes bandwidth and processing.

-- BAD: Retrieves 100, uses 5
SELECT * FROM embeddings ORDER BY embedding <=> $1 LIMIT 100;

-- GOOD: Retrieve exactly what you need
SELECT * FROM embeddings ORDER BY embedding <=> $1 LIMIT 5;

Impact: 15-25% latency reduction for large result sets.

Analyze query plans

Use EXPLAIN ANALYZE to understand query execution.

EXPLAIN ANALYZE
SELECT id, chunk_text, embedding <=> $1 AS distance
FROM embeddings
WHERE category = 'technical-docs'
ORDER BY embedding <=> $1
LIMIT 10;

Look for:

  • Sequential scans: Indicates missing indexes
  • High row counts: Pre-filtering isn't working
  • Nested loops: May need join optimization

Example output:

Limit (cost=42.15..42.17 rows=10) (actual time=38.241..38.245 rows=10)
  -> Index Scan using hnsw_embedding_idx on embeddings (cost=0.00..84521.00 rows=85432)
       Filter: (category = 'technical-docs')
       Rows Removed by Filter: 324
       Order By: (embedding <=> '[...]')

The index scan itself is fast (~38ms), but 324 rows were filtered out after retrieval. pgvector's index types are single-column, so a composite (category, embedding) index isn't available; a partial HNSW index restricted to category = 'technical-docs' (sketched below) can eliminate the post-filtering instead.
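
A minimal sketch of that partial index (assuming 'technical-docs' is one of a handful of high-traffic categories; each category served this way needs its own partial index):

CREATE INDEX hnsw_technical_docs_idx ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE category = 'technical-docs';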

Hardware tuning

Vector operations are CPU and memory intensive. Right-sizing infrastructure yields significant gains.

Database server specs

| Workload | CPU | Memory | Storage | Cost (AWS) |
| --- | --- | --- | --- | --- |
| Development | 2 vCPU | 8 GB | 100 GB SSD | $50/mo |
| Small production (<500K vectors) | 4 vCPU | 16 GB | 250 GB SSD | $150/mo |
| Medium production (500K-5M) | 8 vCPU | 32 GB | 500 GB SSD | $400/mo |
| Large production (5M-50M) | 16 vCPU | 64 GB | 1 TB SSD | $900/mo |

Key recommendations:

  • Memory: Allocate 2-3× the size of your HNSW index to keep it memory-resident
  • CPU: Vector similarity is CPU-bound; more cores = higher concurrency
  • Storage: SSD required (NVMe preferred) for index reads

At Athenic, we run 8 vCPU / 32 GB for 1.2M vectors. Memory usage sits at 18 GB (including Postgres shared buffers).
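
To ground the memory guideline in your own numbers, compare the table and index sizes against available RAM (the index name here is the one from the earlier query plan):

SELECT
  pg_size_pretty(pg_relation_size('embeddings')) AS table_size,
  pg_size_pretty(pg_relation_size('hnsw_embedding_idx')) AS index_size;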

Postgres configuration tuning

Increase shared buffers:

-- postgresql.conf
shared_buffers = 8GB  -- 25% of total memory
effective_cache_size = 24GB  -- 75% of total memory

Enable parallel workers:

max_parallel_workers_per_gather = 4
max_parallel_workers = 8

Optimize work memory for large sorts:

work_mem = 256MB  -- Per-operation memory for sorts
maintenance_work_mem = 2GB  -- For index builds

Tune checkpoint behavior:

checkpoint_timeout = 15min
checkpoint_completion_target = 0.9

Connection pooling

Use PgBouncer or Supabase's built-in connection pooler to maintain persistent connections.

import { Pool } from 'pg';

// node-postgres pool; put PgBouncer or your platform's pooler in front for serverless deployments
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20, // Maximum connections
  idleTimeoutMillis: 30000,
});

// Reuse connections across requests
export async function vectorSearch(embedding: number[]) {
  const client = await pool.connect();
  try {
    return await client.query(/* query */);
  } finally {
    client.release();
  }
}

Impact: Reduced connection overhead from 25ms to <2ms per query.

Caching strategies

Cache embeddings and results to eliminate redundant computation.

Layer 1: Query embedding cache

OpenAI embedding API calls cost $0.02 per 1M tokens and add 80-150ms latency. Cache embeddings for frequent queries.

import { LRUCache } from 'lru-cache';

const embeddingCache = new LRUCache<string, number[]>({
  max: 10000, // Store 10K query embeddings
  ttl: 3600000, // 1 hour TTL
});

async function getEmbedding(query: string): Promise<number[]> {
  const cached = embeddingCache.get(query);
  if (cached) return cached;

  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });

  const vector = embedding.data[0].embedding;
  embeddingCache.set(query, vector);

  return vector;
}

Impact: 75% cache hit rate on our production traffic, saving $180/month on embedding costs.

Layer 2: Result cache

Cache full search results for identical queries.

const resultCache = new LRUCache<string, SearchResult[]>({
  max: 1000,
  ttl: 1800000, // 30 minutes
});

async function vectorSearch(query: string): Promise<SearchResult[]> {
  const cached = resultCache.get(query);
  if (cached) return cached;

  const embedding = await getEmbedding(query);
  const results = await db.query(/* vector search */);

  resultCache.set(query, results);
  return results;
}

Impact: Additional 40% reduction in database load for repeat queries.

Layer 3: CDN caching for API endpoints

For public-facing semantic search APIs, cache responses at the CDN level.

// Next.js API route
export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);
  const query = searchParams.get('q');
  if (!query) {
    return new Response('Missing query parameter "q"', { status: 400 });
  }

  const results = await vectorSearch(query);

  return new Response(JSON.stringify(results), {
    headers: {
      'Content-Type': 'application/json',
      'Cache-Control': 'public, s-maxage=300, stale-while-revalidate=600',
    },
  });
}

Cloudflare/Vercel edge caches serve popular queries globally with <10ms latency.

Real-world case study: Athenic knowledge base

We optimized our knowledge base from 180ms p95 latency to 42ms while scaling from 320K to 1.2M vectors.

Optimizations applied:

| Change | Impact | Effort |
| --- | --- | --- |
| Added HNSW index | -85ms (180ms → 95ms) | 4 hours (build + test) |
| Pre-filter metadata | -30ms (95ms → 65ms) | 2 hours (query refactor) |
| Connection pooling | -12ms (65ms → 53ms) | 1 hour (config) |
| Query embedding cache | -8ms (53ms → 45ms) | 3 hours (implementation) |
| Result cache | -3ms (45ms → 42ms) | 2 hours (LRU setup) |

Hardware:

  • AWS RDS db.r6g.xlarge (8 vCPU, 32 GB RAM)
  • 250 GB gp3 SSD (10K IOPS)
  • pgvector 0.5.1, PostgreSQL 15

Configuration highlights:

-- HNSW index
CREATE INDEX hnsw_idx ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Metadata indexes
CREATE INDEX idx_category ON knowledge_embeddings(category);
CREATE INDEX idx_created_at ON knowledge_embeddings(created_at);

-- Postgres tuning
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 256MB

Cost impact:

  • Embedding API costs: -$180/month (caching)
  • Database compute: +$120/month (larger instance)
  • Net: -$60/month overall, with 4× better performance

Audit your vector database performance with our optimization checklist and identify your top bottleneck.

FAQs

When should I rebuild indexes?

Rebuild HNSW indexes when your dataset grows by >50% or when you change index parameters. IVFFlat indexes should be rebuilt more frequently (every 20-30% growth).
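
On PostgreSQL 12+, the rebuild doesn't have to block reads or writes. A sketch, using the index name from the case study:

REINDEX INDEX CONCURRENTLY hnsw_idx;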

Can I use multiple indexes on the same table?

Yes. A single query will use at most one vector index for its ORDER BY, so create separate indexes for the distance operators and filter combinations you actually query.
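
A sketch of one index per distance operator in use:

-- Cosine distance queries (<=>)
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops);

-- Inner product queries (<#>)
CREATE INDEX ON embeddings USING hnsw (embedding vector_ip_ops);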

How do I migrate from IVFFlat to HNSW?

Build the HNSW index while keeping IVFFlat active, test performance, then drop IVFFlat. Concurrent index builds (CREATE INDEX CONCURRENTLY) prevent table locks.
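
A sketch of that sequence (index names are illustrative):

CREATE INDEX CONCURRENTLY hnsw_embedding_idx ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- After validating latency and recall against the new index:
DROP INDEX CONCURRENTLY ivfflat_embedding_idx;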

Should I normalize embeddings before storage?

OpenAI and Cohere embeddings are pre-normalized. For custom models, normalization can improve cosine similarity performance but isn't required.

How do I handle updates to existing vectors?

Updates invalidate index entries. For high-update workloads, consider soft-deletes (mark as inactive) and periodic reindexing rather than in-place updates.
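
A minimal version of the soft-delete pattern (column and parameter names are illustrative):

-- One-time schema change
ALTER TABLE embeddings ADD COLUMN is_active boolean NOT NULL DEFAULT true;

-- "Update" by deactivating the old row and inserting a fresh one
UPDATE embeddings SET is_active = false WHERE id = $1;

-- Exclude inactive rows at query time
SELECT id, chunk_text, embedding <=> $2 AS distance
FROM embeddings
WHERE is_active
ORDER BY embedding <=> $2
LIMIT 10;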

Summary and next steps

Vector database optimization requires indexing (HNSW for production), query optimization (pre-filtering metadata), hardware tuning (adequate memory for indexes), and caching (embeddings and results). These strategies reduced our latency from 180ms to 42ms.

Next steps:

  1. Add HNSW index to your largest embedding table and benchmark query times.
  2. Identify common metadata filters and add B-tree indexes.
  3. Implement connection pooling if not already present.
  4. Add query embedding caching for repeat searches.
  5. Monitor p50/p95/p99 latency and set alerts for regressions (see the monitoring query sketched below).
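
A starting point for step 5, assuming the pg_stat_statements extension is enabled: surface the slowest vector queries directly from Postgres and feed the numbers into your alerting.

-- Slowest statements touching the cosine-distance operator
SELECT query, calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
WHERE query LIKE '%<=>%'
ORDER BY mean_exec_time DESC
LIMIT 10;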
