TL;DR
- Static tool lists don't scale - when agents have access to 50+ tools, they waste tokens reasoning about irrelevant options.
- Vector-based tool selection surfaces relevant tools dynamically based on task requirements.
- Capability scoring ranks tools by fit: semantic similarity, cost, latency, and reliability all factor in.
- Build fallback chains: if the primary tool fails, automatically route to alternatives.
AI Agent Tool Selection: Building Dynamic Routing Systems That Scale
An agent with access to 100 tools faces a paradox: more capability means worse performance. Every tool in the agent's prompt consumes tokens, and LLMs struggle to reason effectively when presented with too many options. Research from Anthropic's tool use team found that accuracy drops 23% when agents have more than 15 tools available in context (Anthropic, 2024).
Dynamic tool selection solves this by surfacing only relevant tools for each task. Instead of dumping all tools into every prompt, you match task requirements to tool capabilities and inject only the top candidates.
This guide covers building tool selection systems that scale to hundreds of tools while maintaining fast, accurate routing. We'll implement the pattern we use at Athenic, where our orchestrator routes across 80+ tools dynamically.
Key takeaways
- Tool selection is a retrieval problem: index tools by capability, query by task requirements.
- Semantic matching outperforms keyword matching - users describe needs differently than tool descriptions.
- Rank by multiple factors: semantic fit, execution cost, historical success rate.
- Build graceful degradation: primary tool fails → fallback tool → human escalation.
The tool selection problem
Traditional agent architectures include all available tools in every prompt:
// Traditional approach - all tools, all the time
const agent = new Agent({
tools: [
webSearchTool,
databaseQueryTool,
emailSendTool,
slackMessageTool,
calendarBookTool,
crmLookupTool,
documentAnalysisTool,
codeExecutionTool,
// ... 50 more tools
]
});
This creates three problems:
Problem 1: Token waste
Each tool definition consumes 50-200 tokens. With 50 tools, you're spending 2,500-10,000 tokens just describing capabilities - before any actual work happens.
Real cost: At GPT-4o pricing ($2.50/1M input tokens), 50 tools × 150 tokens × 10,000 requests/day = 75M tokens, or $187.50/day just on tool descriptions.
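A small helper makes that overhead easy to recompute for your own numbers (a sketch - the per-tool token count, request volume, and pricing are assumptions to replace with your own measurements):
// Rough daily spend on static tool descriptions
function toolDescriptionCostPerDay(
  toolCount: number,
  tokensPerTool: number,
  requestsPerDay: number,
  pricePerMillionTokens: number
): number {
  const tokensPerDay = toolCount * tokensPerTool * requestsPerDay;
  return (tokensPerDay / 1_000_000) * pricePerMillionTokens;
}
// 50 tools × 150 tokens × 10,000 requests/day at $2.50/1M ≈ $187.50/day
console.log(toolDescriptionCostPerDay(50, 150, 10_000, 2.5));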
Problem 2: Decision paralysis
LLMs perform worse with more options. Given 50 tools, the model spends significant reasoning capacity deciding which tool to use rather than using it well.
Observed behaviour: Agents with large tool sets often select reasonable-but-suboptimal tools, or oscillate between options before committing.
Problem 3: Irrelevant context
A task about sending emails doesn't need database tools in context. Including them dilutes the prompt with irrelevant information.
Architecture overview
Dynamic tool selection treats tools as a searchable index rather than a static list.
User request
↓
[Task Analysis]
↓
[Tool Registry Query] → Returns top-K relevant tools
↓
[Agent Execution] ← Only relevant tools in context
↓
[Result + Feedback] → Updates tool success metrics
Components
Tool Registry: Database of all available tools with semantic descriptions, capability metadata, and embeddings for similarity search.
Selection Engine: Queries the registry based on task requirements, ranks results, and returns the top candidates.
Feedback Loop: Tracks which tool selections succeed or fail, improving future routing accuracy.
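At the type level, the three components expose small surfaces. A sketch of the contracts (the concrete classes in the implementation guide below carry these signatures; the referenced types are defined step by step):
// Illustrative contracts - see the implementation guide for the concrete classes
interface ToolRegistry {
  registerTool(tool: ToolDefinition): Promise<void>;
}
interface ToolSelector {
  selectTools(context: SelectionContext, topK?: number): Promise<SelectedTool[]>;
}
interface ToolFeedback {
  recordExecution(result: ToolExecutionResult): Promise<void>;
}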
Implementation guide
Let's build a production-ready tool selection system step by step.
Step 1: Tool registry schema
Store tools with rich metadata that enables intelligent matching.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE tools (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT UNIQUE NOT NULL,
display_name TEXT NOT NULL,
description TEXT NOT NULL, -- Human-readable description
capability_summary TEXT NOT NULL, -- What this tool can do (for embedding)
-- Embedding for semantic search
embedding vector(1536),
-- Capability tags for filtering
categories TEXT[] DEFAULT '{}',
input_types TEXT[] DEFAULT '{}',
output_types TEXT[] DEFAULT '{}',
-- Performance metadata
avg_latency_ms INTEGER,
success_rate FLOAT DEFAULT 1.0,
cost_per_call DECIMAL(10, 6),
-- Access control
requires_auth BOOLEAN DEFAULT false,
allowed_scopes TEXT[] DEFAULT '{}',
-- Tool definition for agent consumption
schema JSONB NOT NULL, -- OpenAI function schema format
-- Tracking
call_count INTEGER DEFAULT 0,
last_used_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index for semantic search
CREATE INDEX ON tools USING ivfflat (embedding vector_cosine_ops) WITH (lists = 50);
-- Index for category filtering
CREATE INDEX ON tools USING gin (categories);
Step 2: Tool registration
When registering tools, generate embeddings from capability descriptions.
import { OpenAI } from 'openai';
interface ToolDefinition {
name: string;
displayName: string;
description: string;
capabilitySummary: string;
categories: string[];
inputTypes: string[];
outputTypes: string[];
schema: object;
requiresAuth?: boolean;
allowedScopes?: string[];
costPerCall?: number;
}
class ToolRegistry {
private openai = new OpenAI();
async registerTool(tool: ToolDefinition): Promise<void> {
// Generate embedding from capability summary
const embeddingResponse = await this.openai.embeddings.create({
model: 'text-embedding-3-small',
input: `${tool.displayName}: ${tool.capabilitySummary}`
});
const embedding = embeddingResponse.data[0].embedding;
await db.tools.upsert({
name: tool.name,
displayName: tool.displayName,
description: tool.description,
capabilitySummary: tool.capabilitySummary,
embedding,
categories: tool.categories,
inputTypes: tool.inputTypes,
outputTypes: tool.outputTypes,
schema: tool.schema,
requiresAuth: tool.requiresAuth ?? false,
allowedScopes: tool.allowedScopes ?? [],
costPerCall: tool.costPerCall ?? 0
});
}
}
// Example: Register an email tool
await registry.registerTool({
name: 'send_email',
displayName: 'Send Email',
description: 'Send an email to one or more recipients with subject and body.',
capabilitySummary: 'Send emails, compose messages, email outreach, contact people via email, mail delivery',
categories: ['communication', 'email'],
inputTypes: ['text', 'email_address'],
outputTypes: ['confirmation'],
schema: {
name: 'send_email',
description: 'Send an email message',
parameters: {
type: 'object',
properties: {
to: { type: 'array', items: { type: 'string' }, description: 'Recipient email addresses' },
subject: { type: 'string', description: 'Email subject line' },
body: { type: 'string', description: 'Email body content' }
},
required: ['to', 'subject', 'body']
}
}
});
Key insight: The capabilitySummary field matters more than the technical description. Include synonyms and variations - "send email", "email outreach", "contact via email" all map to the same tool.
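For example, a terse technical description and a retrieval-friendly capability summary for the same database tool might look like this (illustrative values):
// Illustrative: the description documents the tool; the capability summary is written
// for retrieval, packed with the phrasings users actually type
const description = 'Execute a read-only SQL query against the analytics warehouse.';
const capabilitySummary =
  'Query the database, look up records, run reports, fetch analytics data, ' +
  'count rows, check metrics, pull numbers from the warehouse';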
Step 3: Selection engine
Query the registry based on task requirements and rank results.
interface SelectionContext {
taskDescription: string;
categories?: string[];
requiredInputTypes?: string[];
requiredOutputTypes?: string[];
maxCost?: number;
scope: string;
userId: string;
}
interface SelectedTool {
name: string;
displayName: string;
description: string;
schema: object;
score: number;
scoreBreakdown: {
semanticSimilarity: number;
categoryMatch: number;
successRate: number;
costScore: number;
};
}
class ToolSelector {
  private openai = new OpenAI();
async selectTools(
context: SelectionContext,
topK: number = 5
): Promise<SelectedTool[]> {
// Generate embedding for task description
const taskEmbedding = await this.generateEmbedding(context.taskDescription);
// Query with filters
const candidates = await db.query(`
SELECT
t.*,
1 - (t.embedding <=> $1) AS semantic_similarity
FROM tools t
WHERE
-- Scope check
($2 = ANY(t.allowed_scopes) OR cardinality(t.allowed_scopes) = 0)
-- Category filter (if specified)
AND ($3::text[] IS NULL OR t.categories && $3)
-- Cost filter
AND ($4::decimal IS NULL OR t.cost_per_call <= $4)
ORDER BY t.embedding <=> $1
LIMIT $5
`, [
taskEmbedding,
context.scope,
context.categories || null,
context.maxCost || null,
topK * 2 // Get extra candidates for re-ranking
]);
// Re-rank with multi-factor scoring
const scored = candidates.map(tool => ({
...tool,
score: this.calculateScore(tool, context),
scoreBreakdown: this.getScoreBreakdown(tool, context)
}));
// Sort by composite score and return top-K
scored.sort((a, b) => b.score - a.score);
return scored.slice(0, topK).map(tool => ({
name: tool.name,
displayName: tool.display_name,
description: tool.description,
schema: tool.schema,
score: tool.score,
scoreBreakdown: tool.scoreBreakdown
}));
}
private calculateScore(tool: any, context: SelectionContext): number {
const weights = {
semanticSimilarity: 0.5,
categoryMatch: 0.2,
successRate: 0.2,
costScore: 0.1
};
const scores = this.getScoreBreakdown(tool, context);
return (
weights.semanticSimilarity * scores.semanticSimilarity +
weights.categoryMatch * scores.categoryMatch +
weights.successRate * scores.successRate +
weights.costScore * scores.costScore
);
}
private getScoreBreakdown(tool: any, context: SelectionContext) {
// Semantic similarity (already computed)
const semanticSimilarity = tool.semantic_similarity;
// Category match
const categoryMatch = context.categories
? tool.categories.filter(c => context.categories.includes(c)).length / context.categories.length
: 0.5; // Neutral if no categories specified
// Success rate from historical data
const successRate = tool.success_rate;
// Cost score (lower cost = higher score)
const maxCost = context.maxCost || 0.01; // Default max
const costScore = 1 - Math.min(tool.cost_per_call / maxCost, 1);
return {
semanticSimilarity,
categoryMatch,
successRate,
costScore
};
}
private async generateEmbedding(text: string): Promise<number[]> {
const response = await this.openai.embeddings.create({
model: 'text-embedding-3-small',
input: text
});
return response.data[0].embedding;
}
}
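Calling the selector for a concrete task returns a small, ranked slate. A usage sketch (the task text, scope, and logged output are illustrative):
const selector = new ToolSelector();
const selected = await selector.selectTools({
  taskDescription: 'Email the weekly metrics report to the leadership team',
  scope: 'workspace:acme',
  userId: 'user_123'
}, 3);
for (const tool of selected) {
  // e.g. send_email (0.87), create_document (0.61), slack_message (0.55)
  console.log(`${tool.name} (${tool.score.toFixed(2)})`, tool.scoreBreakdown);
}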
Step 4: Integration with agent execution
Inject selected tools into agent context dynamically.
class DynamicToolAgent {
private selector: ToolSelector;
private llm: OpenAI;
async execute(
userMessage: string,
context: ExecutionContext
): Promise<AgentResponse> {
// Select relevant tools for this task
const selectedTools = await this.selector.selectTools({
taskDescription: userMessage,
scope: context.scope,
userId: context.userId,
maxCost: context.costBudget
});
// Convert to OpenAI function format
const tools = selectedTools.map(tool => ({
type: 'function' as const,
function: tool.schema
}));
// Execute with only relevant tools
const response = await this.llm.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: this.systemPrompt },
{ role: 'user', content: userMessage }
],
tools,
tool_choice: 'auto'
});
// Track tool usage for feedback
if (response.choices[0].message.tool_calls) {
for (const call of response.choices[0].message.tool_calls) {
await this.recordToolUsage(call.function.name, context);
}
}
return this.processResponse(response);
}
}
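The recordToolUsage and processResponse helpers are elided above - recordToolUsage would delegate to the feedback loop in Step 5. Wiring the agent up looks roughly like this (a sketch; constructor injection of the selector, LLM client, and system prompt is an assumption):
const agent = new DynamicToolAgent(/* selector, llm client, system prompt */);
const response = await agent.execute(
  'Find the three latest mentions of our product and email a digest to marketing',
  { scope: 'workspace:acme', userId: 'user_123', costBudget: 0.05 }
);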
Step 5: Feedback loop
Track selection quality and update success rates.
interface ToolExecutionResult {
toolName: string;
success: boolean;
latencyMs: number;
errorType?: string;
}
class ToolFeedback {
async recordExecution(result: ToolExecutionResult): Promise<void> {
// Update tool metrics
await db.query(`
UPDATE tools
SET
call_count = call_count + 1,
last_used_at = NOW(),
avg_latency_ms = (COALESCE(avg_latency_ms, 0) * call_count + $2) / (call_count + 1),
success_rate = (success_rate * call_count + $3) / (call_count + 1)
WHERE name = $1
`, [result.toolName, result.latencyMs, result.success ? 1 : 0]);
// Log for analysis
await db.toolExecutions.insert({
toolName: result.toolName,
success: result.success,
latencyMs: result.latencyMs,
errorType: result.errorType,
executedAt: new Date()
});
}
async getToolHealth(): Promise<ToolHealthReport[]> {
return db.query(`
SELECT
name,
call_count,
success_rate,
avg_latency_ms,
CASE
WHEN success_rate < 0.8 THEN 'unhealthy'
WHEN success_rate < 0.95 THEN 'degraded'
ELSE 'healthy'
END AS status
FROM tools
WHERE call_count > 10
ORDER BY success_rate ASC
`);
}
}
Production patterns
Pattern 1: Fallback chains
Define alternative tools for reliability.
interface ToolWithFallbacks {
primary: string;
fallbacks: string[];
}
const fallbackChains: Record<string, ToolWithFallbacks> = {
'send_email': {
primary: 'sendgrid_email',
fallbacks: ['ses_email', 'smtp_email']
},
'web_search': {
primary: 'perplexity_search',
fallbacks: ['tavily_search', 'serper_search']
}
};
async function executeWithFallback(
capability: string,
parameters: Record<string, any>
): Promise<ToolResult> {
const chain = fallbackChains[capability];
if (!chain) {
throw new Error(`No tool chain for capability: ${capability}`);
}
const tools = [chain.primary, ...chain.fallbacks];
for (const toolName of tools) {
try {
const result = await executeTool(toolName, parameters);
return result;
} catch (error) {
console.warn(`Tool ${toolName} failed, trying fallback:`, error);
await feedback.recordExecution({
toolName,
success: false,
latencyMs: 0,
errorType: error.code || 'unknown'
});
}
}
throw new Error(`All tools failed for capability: ${capability}`);
}
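One refinement: the chain above records a latency of 0 for failed attempts. A thin timing wrapper captures real latency for both outcomes (a sketch reusing the same executeTool and feedback objects):
// Time every attempt so avg_latency_ms reflects failures too
async function timedExecute(toolName: string, parameters: Record<string, any>) {
  const start = Date.now();
  try {
    const result = await executeTool(toolName, parameters);
    await feedback.recordExecution({ toolName, success: true, latencyMs: Date.now() - start });
    return result;
  } catch (error: any) {
    await feedback.recordExecution({
      toolName,
      success: false,
      latencyMs: Date.now() - start,
      errorType: error.code || 'unknown'
    });
    throw error;
  }
}
Swapping timedExecute in for the direct executeTool call inside executeWithFallback keeps the metrics honest without changing the chain logic.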
Pattern 2: Capability abstraction
Let agents request capabilities, not specific tools.
// Instead of: "Use the SendGrid tool to send an email"
// Agent says: "I need to send an email"
interface CapabilityRequest {
capability: 'send_email' | 'web_search' | 'query_database' | 'create_document';
parameters: Record<string, any>;
}
async function fulfillCapability(
request: CapabilityRequest,
context: ExecutionContext
): Promise<any> {
// Map the capability to the best available tool
const matches = await selector.selectTools({
  taskDescription: request.capability,
  scope: context.scope,
  userId: context.userId
}, 1);
if (!matches.length) {
  throw new Error(`No tool available for: ${request.capability}`);
}
// Route through the capability's fallback chain when one exists;
// otherwise execute the best match directly
if (fallbackChains[request.capability]) {
  return executeWithFallback(request.capability, request.parameters);
}
return executeTool(matches[0].name, request.parameters);
}
Pattern 3: Context-aware selection
Adjust selection based on execution context.
async function selectWithContext(
task: string,
context: ExecutionContext
): Promise<SelectedTool[]> {
// Adjust weights based on context
const weights = {
semanticSimilarity: 0.5,
categoryMatch: 0.2,
successRate: 0.2,
costScore: 0.1
};
// If on budget, weight cost higher
if (context.costSensitive) {
weights.costScore = 0.3;
weights.semanticSimilarity = 0.4;
}
// If time-critical, weight latency (via success rate proxy)
if (context.timeCritical) {
weights.successRate = 0.35;
weights.costScore = 0.05;
}
return selector.selectTools({
taskDescription: task,
scope: context.scope,
userId: context.userId,
// Pass custom weights (assumes the selector is extended to accept them - see the sketch below)
weights
});
}
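For this to compile, the selector needs to accept per-request weights, which the Step 3 implementation doesn't yet do. A minimal extension (a sketch; the optional weights field is an addition to the earlier SelectionContext):
// Extend SelectionContext with optional per-request weight overrides
interface SelectionContext {
  // ...existing fields from Step 3...
  weights?: Partial<{
    semanticSimilarity: number;
    categoryMatch: number;
    successRate: number;
    costScore: number;
  }>;
}
// In ToolSelector.calculateScore, merge the overrides over the defaults:
// const weights = { semanticSimilarity: 0.5, categoryMatch: 0.2,
//                   successRate: 0.2, costScore: 0.1, ...context.weights };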
Pattern 4: Tool composition
Combine multiple tools for complex operations.
interface CompositeOperation {
name: string;
description: string;
steps: {
tool: string;
inputMapping: Record<string, string>;
outputMapping: string;
}[];
}
const compositeOperations: CompositeOperation[] = [
{
name: 'research_and_summarise',
description: 'Search for information and create a summary',
steps: [
{
tool: 'web_search',
inputMapping: { query: 'input.topic' },
outputMapping: 'searchResults'
},
{
tool: 'document_analysis',
inputMapping: { documents: 'searchResults' },
outputMapping: 'analysis'
},
{
tool: 'generate_summary',
inputMapping: { content: 'analysis' },
outputMapping: 'output'
}
]
}
];
async function executeComposite(
operationName: string,
input: Record<string, any>
): Promise<any> {
const operation = compositeOperations.find(o => o.name === operationName);
if (!operation) throw new Error(`Unknown operation: ${operationName}`);
const state: Record<string, any> = { input };
for (const step of operation.steps) {
const parameters = resolveMapping(step.inputMapping, state);
const result = await executeTool(step.tool, parameters);
state[step.outputMapping] = result;
}
return state.output;
}
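The resolveMapping helper used above walks dotted paths like 'input.topic' through the accumulated state. A minimal sketch (it assumes mappings only reference keys that exist in state):
// Resolve { query: 'input.topic' } against state like { input: { topic: 'AI agents' } }
function resolveMapping(
  mapping: Record<string, string>,
  state: Record<string, any>
): Record<string, any> {
  const resolved: Record<string, any> = {};
  for (const [param, path] of Object.entries(mapping)) {
    resolved[param] = path.split('.').reduce((value, key) => value?.[key], state);
  }
  return resolved;
}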
Monitoring and optimisation
Track selection quality to identify improvement opportunities.
Key metrics
| Metric | Target | Indicates |
|---|---|---|
| Selection accuracy | >90% | First choice is correct |
| Fallback rate | <10% | Primary tools are reliable |
| Avg tools per request | 3-5 | Selection is selective enough |
| Selection latency | <100ms | Routing is fast |
Dashboard queries
-- Tool selection accuracy (first choice usage)
SELECT
DATE_TRUNC('day', executed_at) AS day,
COUNT(*) FILTER (WHERE selection_rank = 1) * 100.0 / COUNT(*) AS first_choice_pct
FROM tool_executions
WHERE executed_at > NOW() - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;
-- Tools with high fallback rates
SELECT
t.name,
t.success_rate,
COUNT(te.id) FILTER (WHERE te.success = false) AS failures,
COUNT(te.id) AS total_calls
FROM tools t
JOIN tool_executions te ON t.name = te.tool_name
WHERE te.executed_at > NOW() - INTERVAL '7 days'
GROUP BY t.name, t.success_rate
HAVING COUNT(te.id) FILTER (WHERE te.success = false) > 5
ORDER BY t.success_rate ASC;
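Both queries assume a tool_executions log table written by the feedback loop. A possible shape (a sketch - the selection_rank column, recording where the chosen tool sat in the selector's ranked output, is an assumption the first query relies on):
-- Sketch: execution log backing the dashboard queries above
CREATE TABLE tool_executions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tool_name TEXT NOT NULL REFERENCES tools(name),
  success BOOLEAN NOT NULL,
  latency_ms INTEGER,
  error_type TEXT,
  selection_rank INTEGER, -- position of this tool in the selector's ranked output
  executed_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON tool_executions (tool_name, executed_at);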
FAQs
How many tools can this approach handle?
We've tested with 200+ tools without degradation. Vector search scales well - the limitation is more about tool quality than quantity.
Should I embed tool names or descriptions?
Embed capability summaries - they capture what the tool does, not what it's called. Tool names are often technical; users describe tasks conversationally.
How do I handle new tools?
Register them with good capability summaries; they start at the schema's default success rate of 1.0, which is an optimistic prior. The system learns actual reliability through usage. Consider a "cold start" period that damps their ranking until sufficient data exists, as sketched below.
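One way to implement that damping is to smooth the observed success rate toward a prior until a tool has earned enough calls (a sketch; the prior and pseudo-count are tuning assumptions):
// Bayesian-style smoothing so new tools neither dominate nor vanish
function smoothedSuccessRate(
  observedRate: number, // success_rate from the registry
  callCount: number,    // call_count from the registry
  prior = 0.9,          // assumed prior for an unproven tool
  pseudoCalls = 20      // how many calls the prior is "worth"
): number {
  return (observedRate * callCount + prior * pseudoCalls) / (callCount + pseudoCalls);
}
Feeding smoothedSuccessRate into getScoreBreakdown instead of the raw success_rate keeps a brand-new tool from outranking a proven one on its optimistic default.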
What about MCP tools from external servers?
Same approach. When discovering MCP tools, extract descriptions and generate embeddings. Store in the same registry with metadata indicating the source server.
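Concretely, if discovery hands you the standard MCP tool shape (name, description, JSON Schema input), registration is a thin mapping layer (a sketch; the server-name prefix and category tagging are assumptions to adapt):
// Fold discovered MCP tools into the same registry
interface DiscoveredMcpTool {
  name: string;
  description?: string;
  inputSchema: object; // JSON Schema, as exposed by MCP servers
}
async function registerMcpTools(serverName: string, tools: DiscoveredMcpTool[]) {
  for (const tool of tools) {
    await registry.registerTool({
      name: `${serverName}__${tool.name}`, // namespace by source server
      displayName: tool.name,
      description: tool.description ?? tool.name,
      capabilitySummary: tool.description ?? tool.name, // ideally rewritten with synonyms
      categories: ['mcp', serverName],
      inputTypes: [],
      outputTypes: [],
      schema: {
        name: tool.name,
        description: tool.description,
        parameters: tool.inputSchema
      }
    });
  }
}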
How do I prevent selection gaming?
Don't let tool providers control their own capability summaries. Review registrations manually and verify claims against actual behaviour.
Summary and next steps
Dynamic tool selection transforms agent capabilities from "what tools exist" to "what tools help with this task". The combination of semantic matching, multi-factor scoring, and feedback loops creates systems that improve with use.
Implementation checklist:
- Design tool registry schema with semantic fields
- Register existing tools with rich capability summaries
- Build selection engine with vector search
- Integrate with agent execution pipeline
- Add fallback chains for reliability
- Deploy monitoring for selection quality
Quick wins:
- Start with semantic search only - multi-factor scoring can come later
- Register 5-10 core tools and validate selection quality before scaling
- Add feedback loop early - historical data improves future selection