TL;DR
- Static tool lists don't scale - when agents have access to 50+ tools, they waste tokens reasoning about irrelevant options.
- Vector-based tool selection surfaces relevant tools dynamically based on task requirements.
- Capability scoring ranks tools by fit: semantic similarity, cost, latency, and reliability all factor in.
- Build fallback chains: if the primary tool fails, automatically route to alternatives.
AI Agent Tool Selection: Building Dynamic Routing Systems That Scale
An agent with access to 100 tools faces a paradox: more capability means worse performance. Every tool in the agent's prompt consumes tokens, and LLMs struggle to reason effectively when presented with too many options. Research from Anthropic's tool use team found that accuracy drops 23% when agents have more than 15 tools available in context (Anthropic, 2024).
Dynamic tool selection solves this by surfacing only relevant tools for each task. Instead of dumping all tools into every prompt, you match task requirements to tool capabilities and inject only the top candidates.
This guide covers building tool selection systems that scale to hundreds of tools while maintaining fast, accurate routing. We'll implement the pattern we use at Athenic, where our orchestrator routes across 80+ tools dynamically.
Key takeaways
- Tool selection is a retrieval problem: index tools by capability, query by task requirements.
- Semantic matching outperforms keyword matching - users describe needs differently than tool descriptions.
- Rank by multiple factors: semantic fit, execution cost, historical success rate.
- Build graceful degradation: primary tool fails → fallback tool → human escalation.
The tool selection problem
Traditional agent architectures include all available tools in every prompt:
// Traditional approach - all tools, all the time
const agent = new Agent({
tools: [
webSearchTool,
databaseQueryTool,
emailSendTool,
slackMessageTool,
calendarBookTool,
crmLookupTool,
documentAnalysisTool,
codeExecutionTool,
// ... 50 more tools
]
});
This creates three problems:
Problem 1: Token waste
Each tool definition consumes 50-200 tokens. With 50 tools, you're spending 2,500-10,000 tokens just describing capabilities - before any actual work happens.
Real cost: At GPT-4o pricing ($2.50/1M input tokens), 50 tools × 150 tokens × 10,000 requests/day = 75M tokens, or $187.50/day just on tool descriptions.
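A small helper makes that overhead easy to recompute for your own numbers (a sketch - the per-tool token count, request volume, and pricing are assumptions to replace with your own measurements):
// Rough daily spend on static tool descriptions
function toolDescriptionCostPerDay(
  toolCount: number,
  tokensPerTool: number,
  requestsPerDay: number,
  pricePerMillionTokens: number
): number {
  const tokensPerDay = toolCount * tokensPerTool * requestsPerDay;
  return (tokensPerDay / 1_000_000) * pricePerMillionTokens;
}
// 50 tools × 150 tokens × 10,000 requests/day at $2.50/1M ≈ $187.50/day
console.log(toolDescriptionCostPerDay(50, 150, 10_000, 2.5));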
Problem 2: Decision paralysis
LLMs perform worse with more options. Given 50 tools, the model spends significant reasoning capacity deciding which tool to use rather than using it well.
Observed behaviour: Agents with large tool sets often select reasonable-but-suboptimal tools, or oscillate between options before committing.
Problem 3: Irrelevant context
A task about sending emails doesn't need database tools in context. Including them dilutes the prompt with irrelevant information.
Architecture overview
Dynamic tool selection treats tools as a searchable index rather than a static list.
User request
↓
[Task Analysis]
↓
[Tool Registry Query] → Returns top-K relevant tools
↓
[Agent Execution] ← Only relevant tools in context
↓
[Result + Feedback] → Updates tool success metrics
Components
Tool Registry: Database of all available tools with semantic descriptions, capability metadata, and embeddings for similarity search.
Selection Engine: Queries the registry based on task requirements, ranks results, and returns the top candidates.
Feedback Loop: Tracks which tool selections succeed or fail, improving future routing accuracy.
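At the type level, the three components expose small surfaces. A sketch of the contracts (the concrete classes in the implementation guide below carry these signatures; the referenced types are defined step by step):
// Illustrative contracts - see the implementation guide for the concrete classes
interface ToolRegistry {
  registerTool(tool: ToolDefinition): Promise<void>;
}
interface ToolSelector {
  selectTools(context: SelectionContext, topK?: number): Promise<SelectedTool[]>;
}
interface ToolFeedback {
  recordExecution(result: ToolExecutionResult): Promise<void>;
}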
Implementation guide
Let's build a production-ready tool selection system step by step.
Step 1: Tool registry schema
Store tools with rich metadata that enables intelligent matching.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE tools (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT UNIQUE NOT NULL,
display_name TEXT NOT NULL,
description TEXT NOT NULL, -- Human-readable description
capability_summary TEXT NOT NULL, -- What this tool can do (for embedding)
-- Embedding for semantic search
embedding vector(1536),
-- Capability tags for filtering
categories TEXT[] DEFAULT '{}',
input_types TEXT[] DEFAULT '{}',
output_types TEXT[] DEFAULT '{}',
-- Performance metadata
avg_latency_ms INTEGER,
success_rate FLOAT DEFAULT 1.0,
cost_per_call DECIMAL(10, 6),
-- Access control
requires_auth BOOLEAN DEFAULT false,
allowed_scopes TEXT[] DEFAULT '{}',
-- Tool definition for agent consumption
schema JSONB NOT NULL, -- OpenAI function schema format
-- Tracking
call_count INTEGER DEFAULT 0,
last_used_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index for semantic search
CREATE INDEX ON tools USING ivfflat (embedding vector_cosine_ops) WITH (lists = 50);
-- Index for category filtering
CREATE INDEX ON tools USING gin (categories);
Step 2: Tool registration
When registering tools, generate embeddings from capability descriptions.
import { OpenAI } from 'openai';
interface ToolDefinition {
name: string;
displayName: string;
description: string;
capabilitySummary: string;
categories: string[];
inputTypes: string[];
outputTypes: string[];
schema: object;
requiresAuth?: boolean;
allowedScopes?: string[];
costPerCall?: number;
}
class ToolRegistry {
private openai = new OpenAI();
async registerTool(tool: ToolDefinition): Promise<void> {
// Generate embedding from capability summary
const embeddingResponse = await this.openai.embeddings.create({
model: 'text-embedding-3-small',
input: `${tool.displayName}: ${tool.capabilitySummary}`
});
const embedding = embeddingResponse.data[0].embedding;
await db.tools.upsert({
name: tool.name,
displayName: tool.displayName,
description: tool.description,
capabilitySummary: tool.capabilitySummary,
embedding,
categories: tool.categories,
inputTypes: tool.inputTypes,
outputTypes: tool.outputTypes,
schema: tool.schema,
requiresAuth: tool.requiresAuth ?? false,
allowedScopes: tool.allowedScopes ?? [],
costPerCall: tool.costPerCall ?? 0
});
}
}
// Example: Register an email tool
await registry.registerTool({
name: 'send_email',
displayName: 'Send Email',
description: 'Send an email to one or more recipients with subject and body.',
capabilitySummary: 'Send emails, compose messages, email outreach, contact people via email, mail delivery',
categories: ['communication', 'email'],
inputTypes: ['text', 'email_address'],
outputTypes: ['confirmation'],
schema: {
name: 'send_email',
description: 'Send an email message',
parameters: {
type: 'object',
properties: {
to: { type: 'array', items: { type: 'string' }, description: 'Recipient email addresses' },
subject: { type: 'string', description: 'Email subject line' },
body: { type: 'string', description: 'Email body content' }
},
required: ['to', 'subject', 'body']
}
}
});
Key insight: The capabilitySummary field matters more than the technical description. Include synonyms and variations - "send email", "email outreach", "contact via email" all map to the same tool.
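For example, a terse technical description and a retrieval-friendly capability summary for the same database tool might look like this (illustrative values):
// Illustrative: the description documents the tool; the capability summary is written
// for retrieval, packed with the phrasings users actually type
const description = 'Execute a read-only SQL query against the analytics warehouse.';
const capabilitySummary =
  'Query the database, look up records, run reports, fetch analytics data, ' +
  'count rows, check metrics, pull numbers from the warehouse';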
Step 3: Selection engine
Query the registry based on task requirements and rank results.
interface SelectionContext {
taskDescription: string;
categories?: string[];
requiredInputTypes?: string[];
requiredOutputTypes?: string[];
maxCost?: number;
scope: string;
userId: string;
}
interface SelectedTool {
name: string;
displayName: string;
description: string;
schema: object;
score: number;
scoreBreakdown: {
semanticSimilarity: number;
categoryMatch: number;
successRate: number;
costScore: number;
};
}
class ToolSelector {
  private openai = new OpenAI();
async selectTools(
context: SelectionContext,
topK: number = 5
): Promise<SelectedTool[]> {
// Generate embedding for task description
const taskEmbedding = await this.generateEmbedding(context.taskDescription);
// Query with filters
const candidates = await db.query(`
SELECT
t.*,
1 - (t.embedding <=> $1) AS semantic_similarity
FROM tools t
WHERE
-- Scope check
($2 = ANY(t.allowed_scopes) OR cardinality(t.allowed_scopes) = 0)
-- Category filter (if specified)
AND ($3::text[] IS NULL OR t.categories && $3)
-- Cost filter
AND ($4::decimal IS NULL OR t.cost_per_call <= $4)
ORDER BY t.embedding <=> $1
LIMIT $5
`, [
taskEmbedding,
context.scope,
context.categories || null,
context.maxCost || null,
topK * 2 // Get extra candidates for re-ranking
]);
// Re-rank with multi-factor scoring
const scored = candidates.map(tool => ({
...tool,
score: this.calculateScore(tool, context),
scoreBreakdown: this.getScoreBreakdown(tool, context)
}));
// Sort by composite score and return top-K
scored.sort((a, b) => b.score - a.score);
return scored.slice(0, topK).map(tool => ({
name: tool.name,
displayName: tool.display_name,
description: tool.description,
schema: tool.schema,
score: tool.score,
scoreBreakdown: tool.scoreBreakdown
}));
}
private calculateScore(tool: any, context: SelectionContext): number {
const weights = {
semanticSimilarity: 0.5,
categoryMatch: 0.2,
successRate: 0.2,
costScore: 0.1
};
const scores = this.getScoreBreakdown(tool, context);
return (
weights.semanticSimilarity * scores.semanticSimilarity +
weights.categoryMatch * scores.categoryMatch +
weights.successRate * scores.successRate +
weights.costScore * scores.costScore
);
}
private getScoreBreakdown(tool: any, context: SelectionContext) {
// Semantic similarity (already computed)
const semanticSimilarity = tool.semantic_similarity;
// Category match
const categoryMatch = context.categories
? tool.categories.filter(c => context.categories.includes(c)).length / context.categories.length
: 0.5; // Neutral if no categories specified
// Success rate from historical data
const successRate = tool.success_rate;
// Cost score (lower cost = higher score)
const maxCost = context.maxCost || 0.01; // Default max
const costScore = 1 - Math.min(tool.cost_per_call / maxCost, 1);
return {
semanticSimilarity,
categoryMatch,
successRate,
costScore
};
}
private async generateEmbedding(text: string): Promise<number[]> {
const response = await this.openai.embeddings.create({
model: 'text-embedding-3-small',
input: text
});
return response.data[0].embedding;
}
}
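Calling the selector for a concrete task returns a small, ranked slate. A usage sketch (the task text, scope, and logged output are illustrative):
const selector = new ToolSelector();
const selected = await selector.selectTools({
  taskDescription: 'Email the weekly metrics report to the leadership team',
  scope: 'workspace:acme',
  userId: 'user_123'
}, 3);
for (const tool of selected) {
  // e.g. send_email (0.87), create_document (0.61), slack_message (0.55)
  console.log(`${tool.name} (${tool.score.toFixed(2)})`, tool.scoreBreakdown);
}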
Step 4: Integration with agent execution
Inject selected tools into agent context dynamically.
class DynamicToolAgent {
private selector: ToolSelector;
private llm: OpenAI;
async execute(
userMessage: string,
context: ExecutionContext
): Promise<AgentResponse> {
// Select relevant tools for this task
const selectedTools = await this.selector.selectTools({
taskDescription: userMessage,
scope: context.scope,
userId: context.userId,
maxCost: context.costBudget
});
// Convert to OpenAI function format
const tools = selectedTools.map(tool => ({
type: 'function' as const,
function: tool.schema
}));
// Execute with only relevant tools
const response = await this.llm.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: this.systemPrompt },
{ role: 'user', content: userMessage }
],
tools,
tool_choice: 'auto'
});
// Track tool usage for feedback
if (response.choices[0].message.tool_calls) {
for (const call of response.choices[0].message.tool_calls) {
await this.recordToolUsage(call.function.name, context);
}
}
return this.processResponse(response);
}
}
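The recordToolUsage and processResponse helpers are elided above - recordToolUsage would delegate to the feedback loop in Step 5. Wiring the agent up looks roughly like this (a sketch; constructor injection of the selector, LLM client, and system prompt is an assumption):
const agent = new DynamicToolAgent(/* selector, llm client, system prompt */);
const response = await agent.execute(
  'Find the three latest mentions of our product and email a digest to marketing',
  { scope: 'workspace:acme', userId: 'user_123', costBudget: 0.05 }
);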
Step 5: Feedback loop
Track selection quality and update success rates.
interface ToolExecutionResult {
toolName: string;
success: boolean;
latencyMs: number;
errorType?: string;
}
class ToolFeedback {
async recordExecution(result: ToolExecutionResult): Promise<void> {
// Update tool metrics
await db.query(`
UPDATE tools
SET
call_count = call_count + 1,
last_used_at = NOW(),
avg_latency_ms = (COALESCE(avg_latency_ms, 0) * call_count + $2) / (call_count + 1),
success_rate = (success_rate * call_count + $3) / (call_count + 1)
WHERE name = $1
`, [result.toolName, result.latencyMs, result.success ? 1 : 0]);
// Log for analysis
await db.toolExecutions.insert({
toolName: result.toolName,
success: result.success,
latencyMs: result.latencyMs,
errorType: result.errorType,
executedAt: new Date()
});
}
async getToolHealth(): Promise<ToolHealthReport[]> {
return db.query(`
SELECT
name,
call_count,
success_rate,
avg_latency_ms,
CASE
WHEN success_rate < 0.8 THEN 'unhealthy'
WHEN success_rate < 0.95 THEN 'degraded'
ELSE 'healthy'
END AS status
FROM tools
WHERE call_count > 10
ORDER BY success_rate ASC
`);
}
}
Production patterns
Pattern 1: Fallback chains
Define alternative tools for reliability.
interface ToolWithFallbacks {
primary: string;
fallbacks: string[];
}
const fallbackChains: Record<string, ToolWithFallbacks> = {
'send_email': {
primary: 'sendgrid_email',
fallbacks: ['ses_email', 'smtp_email']
},
'web_search': {
primary: 'perplexity_search',
fallbacks: ['tavily_search', 'serper_search']
}
};
async function executeWithFallback(
capability: string,
parameters: Record<string, any>
): Promise<ToolResult> {
const chain = fallbackChains[capability];
if (!chain) {
throw new Error(`No tool chain for capability: ${capability}`);
}
const tools = [chain.primary, ...chain.fallbacks];
for (const toolName of tools) {
try {
const result = await executeTool(toolName, parameters);
return result;
} catch (error) {
console.warn(`Tool ${toolName} failed, trying fallback:`, error);
await feedback.recordExecution({
toolName,
success: false,
latencyMs: 0,
errorType: error.code || 'unknown'
});
}
}
throw new Error(`All tools failed for capability: ${capability}`);
}
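One refinement: the chain above records a latency of 0 for failed attempts. A thin timing wrapper captures real latency for both outcomes (a sketch reusing the same executeTool and feedback objects):
// Time every attempt so avg_latency_ms reflects failures too
async function timedExecute(toolName: string, parameters: Record<string, any>) {
  const start = Date.now();
  try {
    const result = await executeTool(toolName, parameters);
    await feedback.recordExecution({ toolName, success: true, latencyMs: Date.now() - start });
    return result;
  } catch (error: any) {
    await feedback.recordExecution({
      toolName,
      success: false,
      latencyMs: Date.now() - start,
      errorType: error.code || 'unknown'
    });
    throw error;
  }
}
Swapping timedExecute in for the direct executeTool call inside executeWithFallback keeps the metrics honest without changing the chain logic.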
Pattern 2: Capability abstraction
Let agents request capabilities, not specific tools.
// Instead of: "Use the SendGrid tool to send an email"
// Agent says: "I need to send an email"
interface CapabilityRequest {
capability: 'send_email' | 'web_search' | 'query_database' | 'create_document';
parameters: Record<string, any>;
}
async function fulfillCapability(
request: CapabilityRequest,
context: ExecutionContext
): Promise<any> {
// Map the capability to the best available tool
const matches = await selector.selectTools({
  taskDescription: request.capability,
  scope: context.scope,
  userId: context.userId
}, 1);
if (!matches.length) {
  throw new Error(`No tool available for: ${request.capability}`);
}
// Route through the capability's fallback chain when one exists;
// otherwise execute the best match directly
if (fallbackChains[request.capability]) {
  return executeWithFallback(request.capability, request.parameters);
}
return executeTool(matches[0].name, request.parameters);
}
Pattern 3: Context-aware selection
Adjust selection based on execution context.
async function selectWithContext(
task: string,
context: ExecutionContext
): Promise<SelectedTool[]> {
// Adjust weights based on context
const weights = {
semanticSimilarity: 0.5,
categoryMatch: 0.2,
successRate: 0.2,
costScore: 0.1
};
// If on budget, weight cost higher
if (context.costSensitive) {
weights.costScore = 0.3;
weights.semanticSimilarity = 0.4;
}
// If time-critical, weight latency (via success rate proxy)
if (context.timeCritical) {
weights.successRate = 0.35;
weights.costScore = 0.05;
}
return selector.selectTools({
taskDescription: task,
scope: context.scope,
userId: context.userId,
// Pass custom weights (assumes the selector is extended to accept them - see the sketch below)
weights
});
}
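For this to compile, the selector needs to accept per-request weights, which the Step 3 implementation doesn't yet do. A minimal extension (a sketch; the optional weights field is an addition to the earlier SelectionContext):
// Extend SelectionContext with optional per-request weight overrides
interface SelectionContext {
  // ...existing fields from Step 3...
  weights?: Partial<{
    semanticSimilarity: number;
    categoryMatch: number;
    successRate: number;
    costScore: number;
  }>;
}
// In ToolSelector.calculateScore, merge the overrides over the defaults:
// const weights = { semanticSimilarity: 0.5, categoryMatch: 0.2,
//                   successRate: 0.2, costScore: 0.1, ...context.weights };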
Pattern 4: Tool composition
Combine multiple tools for complex operations.
interface CompositeOperation {
name: string;
description: string;
steps: {
tool: string;
inputMapping: Record<string, string>;
outputMapping: string;
}[];
}
const compositeOperations: CompositeOperation[] = [
{
name: 'research_and_summarise',
description: 'Search for information and create a summary',
steps: [
{
tool: 'web_search',
inputMapping: { query: 'input.topic' },
outputMapping: 'searchResults'
},
{
tool: 'document_analysis',
inputMapping: { documents: 'searchResults' },
outputMapping: 'analysis'
},
{
tool: 'generate_summary',
inputMapping: { content: 'analysis' },
outputMapping: 'output'
}
]
}
];
async function executeComposite(
operationName: string,
input: Record<string, any>
): Promise<any> {
const operation = compositeOperations.find(o => o.name === operationName);
if (!operation) throw new Error(`Unknown operation: ${operationName}`);
const state: Record<string, any> = { input };
for (const step of operation.steps) {
const parameters = resolveMapping(step.inputMapping, state);
const result = await executeTool(step.tool, parameters);
state[step.outputMapping] = result;
}
return state.output;
}
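The resolveMapping helper used above walks dotted paths like 'input.topic' through the accumulated state. A minimal sketch (it assumes mappings only reference keys that exist in state):
// Resolve { query: 'input.topic' } against state like { input: { topic: 'AI agents' } }
function resolveMapping(
  mapping: Record<string, string>,
  state: Record<string, any>
): Record<string, any> {
  const resolved: Record<string, any> = {};
  for (const [param, path] of Object.entries(mapping)) {
    resolved[param] = path.split('.').reduce((value, key) => value?.[key], state);
  }
  return resolved;
}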
Monitoring and optimisation
Track selection quality to identify improvement opportunities.
Key metrics
| Metric | Target | Indicates |
|---|---|---|
| Selection accuracy | >90% | First choice is correct |
| Fallback rate | <10% | Primary tools are reliable |
| Avg tools per request | 3-5 | Selection is selective enough |
| Selection latency | <100ms | Routing is fast |
Dashboard queries
-- Tool selection accuracy (first choice usage)
SELECT
DATE_TRUNC('day', executed_at) AS day,
COUNT(*) FILTER (WHERE selection_rank = 1) * 100.0 / COUNT(*) AS first_choice_pct
FROM tool_executions
WHERE executed_at > NOW() - INTERVAL '30 days'
GROUP BY 1
ORDER BY 1;
-- Tools with high fallback rates
SELECT
t.name,
t.success_rate,
COUNT(te.id) FILTER (WHERE te.success = false) AS failures,
COUNT(te.id) AS total_calls
FROM tools t
JOIN tool_executions te ON t.name = te.tool_name
WHERE te.executed_at > NOW() - INTERVAL '7 days'
GROUP BY t.name, t.success_rate
HAVING COUNT(te.id) FILTER (WHERE te.success = false) > 5
ORDER BY t.success_rate ASC;
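Both queries assume a tool_executions log table written by the feedback loop. A possible shape (a sketch - the selection_rank column, recording where the chosen tool sat in the selector's ranked output, is an assumption the first query relies on):
-- Sketch: execution log backing the dashboard queries above
CREATE TABLE tool_executions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tool_name TEXT NOT NULL REFERENCES tools(name),
  success BOOLEAN NOT NULL,
  latency_ms INTEGER,
  error_type TEXT,
  selection_rank INTEGER, -- position of this tool in the selector's ranked output
  executed_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON tool_executions (tool_name, executed_at);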
FAQs
How many tools can this approach handle?
We've tested with 200+ tools without degradation. Vector search scales well - the limitation is more about tool quality than quantity.
Should I embed tool names or descriptions?
Embed capability summaries - they capture what the tool does, not what it's called. Tool names are often technical; users describe tasks conversationally.
How do I handle new tools?
Register them with good capability summaries; they start at the schema's default success rate of 1.0, which is an optimistic prior. The system learns actual reliability through usage. Consider a "cold start" period that damps their ranking until sufficient data exists, as sketched below.
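One way to implement that damping is to smooth the observed success rate toward a prior until a tool has earned enough calls (a sketch; the prior and pseudo-count are tuning assumptions):
// Bayesian-style smoothing so new tools neither dominate nor vanish
function smoothedSuccessRate(
  observedRate: number, // success_rate from the registry
  callCount: number,    // call_count from the registry
  prior = 0.9,          // assumed prior for an unproven tool
  pseudoCalls = 20      // how many calls the prior is "worth"
): number {
  return (observedRate * callCount + prior * pseudoCalls) / (callCount + pseudoCalls);
}
Feeding smoothedSuccessRate into getScoreBreakdown instead of the raw success_rate keeps a brand-new tool from outranking a proven one on its optimistic default.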
What about MCP tools from external servers?
Same approach. When discovering MCP tools, extract descriptions and generate embeddings. Store in the same registry with metadata indicating the source server.
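Concretely, if discovery hands you the standard MCP tool shape (name, description, JSON Schema input), registration is a thin mapping layer (a sketch; the server-name prefix and category tagging are assumptions to adapt):
// Fold discovered MCP tools into the same registry
interface DiscoveredMcpTool {
  name: string;
  description?: string;
  inputSchema: object; // JSON Schema, as exposed by MCP servers
}
async function registerMcpTools(serverName: string, tools: DiscoveredMcpTool[]) {
  for (const tool of tools) {
    await registry.registerTool({
      name: `${serverName}__${tool.name}`, // namespace by source server
      displayName: tool.name,
      description: tool.description ?? tool.name,
      capabilitySummary: tool.description ?? tool.name, // ideally rewritten with synonyms
      categories: ['mcp', serverName],
      inputTypes: [],
      outputTypes: [],
      schema: {
        name: tool.name,
        description: tool.description,
        parameters: tool.inputSchema
      }
    });
  }
}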
How do I prevent selection gaming?
Don't let tool providers control their own capability summaries. Review registrations manually and verify claims against actual behaviour.
Summary and next steps
Dynamic tool selection transforms agent capabilities from "what tools exist" to "what tools help with this task". The combination of semantic matching, multi-factor scoring, and feedback loops creates systems that improve with use.
Implementation checklist:
- Design tool registry schema with semantic fields
- Register existing tools with rich capability summaries
- Build selection engine with vector search
- Integrate with agent execution pipeline
- Add fallback chains for reliability
- Deploy monitoring for selection quality
Quick wins:
- Start with semantic search only - multi-factor scoring can come later
- Register 5-10 core tools and validate selection quality before scaling
- Add feedback loop early - historical data improves future selection