Vector Database Optimization: From Slow Queries to Sub-50ms
Optimize vector database performance with indexing strategies, query planning, hardware tuning, and caching to achieve sub-50ms retrieval latency at scale.
TL;DR
Add an HNSW index, pre-filter with metadata before the vector search, pool your connections, and use vector_cosine_ops instead of vector_l2_ops for normalized embeddings (OpenAI, Cohere).
Vector databases power RAG systems, semantic search, and recommendation engines. But as your dataset grows from 10K to 1M+ vectors, query latency creeps from comfortable 50ms to unacceptable 500ms+. Your users abandon searches, agents time out, and infrastructure costs spiral.
This guide covers production-tested optimization strategies that took our pgvector database from 180ms p95 latency to 42ms while handling 4× more traffic. These patterns apply to pgvector, Pinecone, Weaviate, and Qdrant.
Key takeaways
- Indexing choice matters: HNSW provides 6-8× faster queries than IVFFlat for most workloads.
- Pre-filtering metadata before vector search reduces search space by 70-90% for most queries.
- Connection pooling and prepared statements cut query overhead by 30-40%.
- Monitor query planning with EXPLAIN ANALYZE to identify slow paths.
Performance bottlenecks

Vector database performance degrades due to four primary causes.
Without indexes, vector similarity search compares the query vector against every stored vector. For 1M vectors at 1536 dimensions (OpenAI embeddings), this means 1.5 billion floating-point comparisons per query.
Symptoms:
- Query latency grows roughly linearly with dataset size
- High CPU during every search, with EXPLAIN showing a sequential scan instead of an index scan
Fix: Add HNSW or IVFFlat indexes (covered below).
Naive implementations retrieve top-100 vectors, then filter by metadata (date, category, author). This wastes computation on irrelevant candidates.
Example:
-- BAD: Retrieves the top 100 nearest vectors first, then filters the results
SELECT * FROM (
  SELECT * FROM embeddings
  ORDER BY embedding <=> $1
  LIMIT 100
) AS candidates
WHERE category = 'technical-docs'
LIMIT 10;
Fix: Pre-filter with B-tree indexes before vector search.
Opening new database connections for each query adds 20-50ms latency. Repeated authentication and TLS handshakes compound the cost.
Symptoms:
- A fixed 20-50ms of overhead on every query, regardless of dataset size or index choice
- Latency spikes under bursty traffic as new connections are opened
Fix: Use connection pooling with persistent connections.
Using L2 distance (<->) on normalized vectors (like OpenAI embeddings) performs unnecessary square-root work. Cosine distance (<=>) is faster and produces the same ranking for unit-length vectors.
| Metric | Operation | When to use | Performance |
|---|---|---|---|
| Cosine (<=>) | Dot product / norms | Normalized embeddings (OpenAI, Cohere) | Fast |
| L2 (<->) | Euclidean distance | Unnormalized vectors, spatial data | Moderate |
| Inner product (<#>) | Negative dot product | Maximum similarity | Fastest |
According to pgvector benchmarks, cosine distance queries run 15-20% faster than L2 for OpenAI text-embedding-3-small vectors (pgvector docs, 2024).
Indexing strategies

Indexes trade disk space and build time for dramatically faster queries.
HNSW builds a multi-layer graph where each node connects to nearby vectors. Queries traverse the graph from top to bottom, pruning distant candidates.
Characteristics:
- Fast queries with high recall, even at millions of vectors
- Slower index builds and higher memory usage than IVFFlat
When to use: Production systems prioritizing query speed over build time.
Create HNSW index:
CREATE INDEX ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
Parameters:
- m: Connections per layer (higher = better recall, slower builds) – range 8-64, default 16
- ef_construction: Candidate list size during build (higher = better recall, slower builds) – range 32-400, default 64

Query-time tuning:
SET hnsw.ef_search = 100; -- Candidate list size during search (default 40)
Higher ef_search improves recall at the cost of latency. Tune based on your precision requirements.
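For per-request tuning from application code, one option is to scope the setting to a single transaction so other queries keep the default. This is a minimal sketch, assuming node-postgres (pg) and the embeddings table used in the examples above; the function name and ef_search value are illustrative.

import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Scope ef_search to one transaction; SET cannot take bind parameters,
// so the value is coerced to an integer before interpolation.
export async function searchWithHigherRecall(embedding: number[], efSearch = 100) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query(`SET LOCAL hnsw.ef_search = ${Math.floor(efSearch)}`);
    const { rows } = await client.query(
      `SELECT id, chunk_text, embedding <=> $1 AS distance
       FROM embeddings
       ORDER BY embedding <=> $1
       LIMIT 10`,
      [JSON.stringify(embedding)], // pgvector accepts the '[0.1,0.2,...]' text format
    );
    await client.query('COMMIT');
    return rows;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}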
IVFFlat clusters vectors into partitions. Queries search only the nearest partitions.
Characteristics:
- Faster, cheaper index builds and lower memory usage than HNSW
- Slower queries, with recall that depends heavily on the probes setting
When to use: Memory-constrained systems or frequently rebuilt indexes.
Create IVFFlat index:
CREATE INDEX ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
Parameters:
- lists: Number of partitions – use sqrt(total_rows) for best balance

Query-time tuning:
SET ivfflat.probes = 10; -- Number of partitions to search (default 1)
More probes = better recall, slower queries.
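To apply the sqrt heuristic automatically, you can compute lists from the live row count before building the index. A sketch assuming node-postgres; the index name ivfflat_idx is illustrative.

import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Size lists ≈ sqrt(total rows), then build the index without blocking writes.
export async function buildIvfflatIndex() {
  const { rows } = await pool.query('SELECT count(*)::int AS n FROM embeddings');
  const lists = Math.max(10, Math.round(Math.sqrt(rows[0].n)));
  // CREATE INDEX CONCURRENTLY must run as a standalone statement,
  // outside an explicit transaction block.
  await pool.query(
    `CREATE INDEX CONCURRENTLY ivfflat_idx ON embeddings
     USING ivfflat (embedding vector_cosine_ops)
     WITH (lists = ${lists})`,
  );
}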
| Dataset size | No index | IVFFlat | HNSW |
|---|---|---|---|
| 10K vectors | 25ms | 18ms | 12ms |
| 100K vectors | 180ms | 45ms | 28ms |
| 1M vectors | 1,850ms | 120ms | 42ms |
| 10M vectors | N/A (timeout) | 380ms | 95ms |
Benchmarked on pgvector with OpenAI text-embedding-3-small (1536 dim), AWS RDS db.r6g.xlarge, querying for top-10 results.
Our choice: HNSW with m=16, ef_construction=64 for production. Build time is acceptable (4 minutes for 1.2M vectors), and query latency averages 38ms p50, 72ms p95.
Query optimization

Beyond indexing, optimize query patterns to reduce unnecessary work.
Most queries have implicit filters: date ranges, categories, authors. Apply these before vector search.
Pattern: Metadata + vector hybrid query
-- Create B-tree indexes on filterable columns
CREATE INDEX idx_created_at ON embeddings(created_at);
CREATE INDEX idx_category ON embeddings(category);
-- Query with pre-filtering
SELECT id, chunk_text, embedding <=> $1 AS distance
FROM embeddings
WHERE
category = 'technical-docs'
AND created_at > NOW() - INTERVAL '6 months'
ORDER BY embedding <=> $1
LIMIT 10;
The query planner uses B-tree indexes to narrow candidates from 1.2M vectors to 85K before engaging the HNSW index.
Impact: Reduced p95 latency from 120ms to 45ms for category-filtered queries.
Prepared statements reduce parsing overhead by compiling queries once.
// Using node-postgres named statements: the query is parsed once per
// connection and reused; `pool` is the pg Pool from the pooling section below
const VECTOR_SEARCH = {
  name: 'vector_search',
  text: `
    SELECT id, chunk_text, embedding <=> $1 AS distance
    FROM embeddings
    WHERE category = $2
    ORDER BY embedding <=> $1
    LIMIT $3
  `,
};

// Execute repeatedly with different parameters
const results = await pool.query({
  ...VECTOR_SEARCH,
  values: [JSON.stringify(embedding), 'technical-docs', 10], // vector sent as '[...]' text
});
Impact: 8-12ms latency reduction per query.
Retrieving top-100 results when you only use top-5 wastes bandwidth and processing.
-- BAD: Retrieves 100, uses 5
SELECT * FROM embeddings ORDER BY embedding <=> $1 LIMIT 100;
-- GOOD: Retrieve exactly what you need
SELECT * FROM embeddings ORDER BY embedding <=> $1 LIMIT 5;
Impact: 15-25% latency reduction for large result sets.
Use EXPLAIN ANALYZE to understand query execution.
EXPLAIN ANALYZE
SELECT id, chunk_text, embedding <=> $1 AS distance
FROM embeddings
WHERE category = 'technical-docs'
ORDER BY embedding <=> $1
LIMIT 10;
Look for:
- Whether the query uses the vector index (Index Scan) or falls back to a Seq Scan
- How many rows the filter removes after retrieval
- How actual time compares to the planner's cost estimate
Example output:
Limit (cost=42.15..42.17 rows=10) (actual time=38.241..38.245 rows=10)
-> Index Scan using hnsw_embedding_idx on embeddings (cost=0.00..84521.00 rows=85432)
Filter: (category = 'technical-docs')
Rows Removed by Filter: 324
Order By: (embedding <=> '[...]')
The index scan is fast (38ms), but 324 rows were discarded by the category filter after retrieval. A partial HNSW index restricted to the hot category, or a tighter pre-filter, can eliminate that wasted work, as sketched below.
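One way to remove the post-filter work, assuming most filtered traffic targets a single hot category, is a partial HNSW index. A sketch using node-postgres; the index name and category value are illustrative, and the planner only uses the index when the query's WHERE clause matches the predicate.

import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical partial HNSW index: only rows in the hot category are indexed,
// so matching queries traverse a smaller graph with no post-filtering.
export async function createPartialHnswIndex() {
  await pool.query(
    `CREATE INDEX CONCURRENTLY hnsw_technical_docs_idx
     ON embeddings USING hnsw (embedding vector_cosine_ops)
     WITH (m = 16, ef_construction = 64)
     WHERE category = 'technical-docs'`,
  );
}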
Hardware tuning

Vector operations are CPU and memory intensive. Right-sizing infrastructure yields significant gains.
| Workload | CPU | Memory | Storage | Cost (AWS) |
|---|---|---|---|---|
| Development | 2 vCPU | 8 GB | 100 GB SSD | $50/mo |
| Small production (<500K vectors) | 4 vCPU | 16 GB | 250 GB SSD | $150/mo |
| Medium production (500K-5M) | 8 vCPU | 32 GB | 500 GB SSD | $400/mo |
| Large production (5M-50M) | 16 vCPU | 64 GB | 1 TB SSD | $900/mo |
Key recommendations:
- Size memory so the vector index and hot working set stay resident in RAM
- Use SSD storage at every tier
- Scale CPU with query concurrency and parallel index builds
At Athenic, we run 8 vCPU / 32 GB for 1.2M vectors. Memory usage sits at 18 GB (including Postgres shared buffers).
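To check whether your index still fits in memory as the dataset grows, you can compare its on-disk size against shared_buffers. A sketch assuming node-postgres; the index and table names are illustrative.

import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Report index size, table size, and shared_buffers to judge whether
// the hot set can stay resident in memory.
export async function reportMemoryFit() {
  const { rows } = await pool.query(`
    SELECT
      pg_size_pretty(pg_relation_size('hnsw_idx'))         AS index_size,
      pg_size_pretty(pg_total_relation_size('embeddings')) AS table_size,
      current_setting('shared_buffers')                     AS shared_buffers
  `);
  return rows[0];
}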
Increase shared buffers:
-- postgresql.conf
shared_buffers = 8GB -- 25% of total memory
effective_cache_size = 24GB -- 75% of total memory
Enable parallel workers:
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
Optimize work memory for large sorts:
work_mem = 256MB -- Per-operation memory for sorts
maintenance_work_mem = 2GB -- For index builds
Tune checkpoint behavior:
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
Use PgBouncer or Supabase's built-in connection pooler, or an application-side pool such as node-postgres's Pool (shown below), to maintain persistent connections.
import { Pool } from 'pg';

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20, // Maximum connections
  idleTimeoutMillis: 30000,
});
// Reuse connections across requests
export async function vectorSearch(embedding: number[]) {
const client = await pool.connect();
try {
return await client.query(/* query */);
} finally {
client.release();
}
}
Impact: Reduced connection overhead from 25ms to <2ms per query.
Cache embeddings and results to eliminate redundant computation.
OpenAI embedding API calls cost $0.02 per 1M tokens and add 80-150ms latency. Cache embeddings for frequent queries.
import { LRUCache } from 'lru-cache';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const embeddingCache = new LRUCache<string, number[]>({
max: 10000, // Store 10K query embeddings
ttl: 3600000, // 1 hour TTL
});
async function getEmbedding(query: string): Promise<number[]> {
const cached = embeddingCache.get(query);
if (cached) return cached;
const embedding = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query,
});
const vector = embedding.data[0].embedding;
embeddingCache.set(query, vector);
return vector;
}
Impact: 75% cache hit rate on our production traffic, saving $180/month on embedding costs.
Cache full search results for identical queries.
// Shape mirrors the columns returned by the vector search query
type SearchResult = { id: string; chunk_text: string; distance: number };

const resultCache = new LRUCache<string, SearchResult[]>({
max: 1000,
ttl: 1800000, // 30 minutes
});
async function vectorSearch(query: string): Promise<SearchResult[]> {
const cached = resultCache.get(query);
if (cached) return cached;
const embedding = await getEmbedding(query);
const results = await db.query(/* vector search */);
resultCache.set(query, results);
return results;
}
Impact: Additional 40% reduction in database load for repeat queries.
For public-facing semantic search APIs, cache responses at the CDN level.
// Next.js API route
export async function GET(request: Request) {
const { searchParams } = new URL(request.url);
const query = searchParams.get('q');
if (!query) return new Response('Missing "q" query parameter', { status: 400 });
const results = await vectorSearch(query);
return new Response(JSON.stringify(results), {
headers: {
'Content-Type': 'application/json',
'Cache-Control': 'public, s-maxage=300, stale-while-revalidate=600',
},
});
}
Cloudflare/Vercel edge caches serve popular queries globally with <10ms latency.
We optimized our knowledge base from 180ms p95 latency to 42ms while scaling from 320K to 1.2M vectors.
Optimizations applied:
| Change | Impact | Effort |
|---|---|---|
| Added HNSW index | -85ms (180ms → 95ms) | 4 hours (build + test) |
| Pre-filter metadata | -30ms (95ms → 65ms) | 2 hours (query refactor) |
| Connection pooling | -12ms (65ms → 53ms) | 1 hour (config) |
| Query embedding cache | -8ms (53ms → 45ms) | 3 hours (implementation) |
| Result cache | -3ms (45ms → 42ms) | 2 hours (LRU setup) |
Hardware: 8 vCPU / 32 GB Postgres instance with SSD storage (see the sizing table above).
Configuration highlights:
-- HNSW index
CREATE INDEX hnsw_idx ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Metadata indexes
CREATE INDEX idx_category ON knowledge_embeddings(category);
CREATE INDEX idx_created_at ON knowledge_embeddings(created_at);
-- Postgres tuning
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 256MB
Cost impact: the query-embedding cache alone saves roughly $180/month in OpenAI embedding costs.
Audit your vector database performance with our optimization checklist and identify your top bottleneck.
How often should I rebuild vector indexes?

Rebuild HNSW indexes when your dataset grows by >50% or when you change index parameters. IVFFlat indexes should be rebuilt more frequently (every 20-30% growth).
Can I create multiple vector indexes on the same table?

Yes, but Postgres will only use one index per query. Create separate indexes for different use cases (different distance metrics or filter combinations).
How do I migrate from IVFFlat to HNSW without downtime?

Build the HNSW index while keeping IVFFlat active, test performance, then drop IVFFlat. Concurrent index builds (CREATE INDEX CONCURRENTLY) prevent table locks.
Do I need to normalize my embeddings?

OpenAI and Cohere embeddings are pre-normalized. For custom models, normalization can improve cosine similarity performance but isn't required.
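If you do normalize custom embeddings client-side, here is a minimal dependency-free TypeScript sketch; the tolerance value is an arbitrary choice.

// L2-normalize a raw embedding so cosine and inner-product orderings agree.
// OpenAI and Cohere vectors are already ~unit length, so this is only needed
// for custom models.
function normalize(vector: number[]): number[] {
  const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) throw new Error('Cannot normalize a zero vector');
  return vector.map((x) => x / norm);
}

// Quick sanity check before choosing vector_cosine_ops vs vector_ip_ops.
function isUnitLength(vector: number[], tolerance = 1e-3): boolean {
  const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  return Math.abs(norm - 1) < tolerance;
}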
How do frequent updates affect vector indexes?

Updates invalidate index entries. For high-update workloads, consider soft-deletes (mark rows as inactive) and periodic reindexing rather than in-place updates.
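A sketch of the soft-delete pattern, assuming node-postgres, a hypothetical is_active column on the embeddings table, and the hnsw_idx index name from the case study above.

import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Soft-delete: flip the is_active flag instead of rewriting the vector,
// and filter on it at query time (WHERE is_active = true).
export async function retireEmbedding(id: string) {
  await pool.query('UPDATE embeddings SET is_active = false WHERE id = $1', [id]);
}

// Periodic maintenance: rebuild the index without blocking reads or writes
// (Postgres 12+).
export async function reindexEmbeddings() {
  await pool.query('REINDEX INDEX CONCURRENTLY hnsw_idx');
}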
Vector database optimization requires indexing (HNSW for production), query optimization (pre-filtering metadata), hardware tuning (adequate memory for indexes), and caching (embeddings and results). These strategies reduced our latency from 180ms to 42ms.
Next steps: run EXPLAIN ANALYZE on your slowest vector queries, add or tune an HNSW index, pre-filter with metadata, and layer in embedding and result caching.