Academy · 15 Oct 2024 · 11 min read

Function Calling with LLMs: Complete Implementation Guide for AI Agents

Master function calling in production AI agents: implementation patterns, error handling, security considerations, and real examples with OpenAI, Anthropic, and open-source models.

MB
Max Beech
Head of Content

TL;DR

  • Function calling lets LLMs execute actions beyond text generation (API calls, database queries, tool use).
  • Works by: Define tools in JSON schema → LLM decides which to call → You execute → Return result to LLM.
  • OpenAI: Best-in-class function calling, parallel calls supported, strict schema validation.
  • Anthropic Claude: "Tool use" pattern, excellent reliability, supports parallel tools.
  • Open-source: Llama 3.1+ and Mistral support function calling, but with lower reliability (60-75% vs 95%+).
  • Critical: Validate all LLM tool calls before execution (security), handle errors gracefully, use retry logic.
  • Cost: Function calling adds 15-30% token overhead but unlocks 10× more agent capabilities.


Without function calling:

User: "What's the weather in London?"
LLM: "I don't have access to real-time weather data."

With function calling:

User: "What's the weather in London?"
LLM: [Calls get_weather(city="London")]
System: [Executes API call, returns: {"temp": 15, "condition": "Cloudy"}]
LLM: "It's currently 15°C and cloudy in London."

Function calling transforms LLMs from text generators into agents that interact with the real world.

What Is Function Calling?

Definition: Instead of just generating text, the LLM returns structured JSON describing which function to call and with what arguments.

You provide:

  1. Tool definitions (function name, description, parameters)
  2. User query

LLM returns:

  1. Decision on which tool to call
  2. Parameters to pass

You execute:

  1. Run the actual function
  2. Return result to LLM
  3. LLM incorporates result into final response

Example Flow

1. Define Tool:

{
  "name": "get_weather",
  "description": "Get current weather for a city",
  "parameters": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "City name, e.g. 'London'"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature units"
      }
    },
    "required": ["city"]
  }
}

2. User Query:

"What's the weather like in Paris today?"

3. LLM Response:

{
  "tool_calls": [{
    "id": "call_abc123",
    "type": "function",
    "function": {
      "name": "get_weather",
      "arguments": "{\"city\": \"Paris\", \"units\": \"celsius\"}"
    }
  }]
}

4. Execute Function:

import requests

def get_weather(city, units="celsius"):
    # Illustrative endpoint; substitute your real weather API
    response = requests.get(f"https://api.weather.com/current?city={city}&units={units}", timeout=10)
    return response.json()

result = get_weather("Paris", "celsius")
# {"temp": 18, "condition": "Partly cloudy", "humidity": 65}

5. Return to LLM:

{
  "role": "tool",
  "tool_call_id": "call_abc123",
  "content": "{\"temp\": 18, \"condition\": \"Partly cloudy\", \"humidity\": 65}"
}

6. Final LLM Response:

"The weather in Paris today is partly cloudy with a temperature of 18°C and 65% humidity."

OpenAI Function Calling

Model support: GPT-4o, GPT-4 Turbo, GPT-4, and GPT-3.5 Turbo (June 2023 releases onward)

Basic Implementation

import json

from openai import OpenAI

client = OpenAI()

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search customer database by name or email",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query (name or email)"
                    },
                    "limit": {
                        "type": "integer",
                        "description": "Max results to return",
                        "default": 10
                    }
                },
                "required": ["query"]
            }
        }
    }
]

# Initial request
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "user", "content": "Find customer john.doe@example.com"}
    ],
    tools=tools,
    tool_choice="auto"  # Let model decide
)

# Check if tool was called
message = response.choices[0].message

if message.tool_calls:
    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        
        # Execute function
        if function_name == "search_database":
            result = search_database(**arguments)
            
            # Send result back to LLM
            messages = [
                {"role": "user", "content": "Find customer john.doe@example.com"},
                message,  # Assistant's tool call
                {
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result)
                }
            ]
            
            final_response = client.chat.completions.create(
                model="gpt-4-turbo",
                messages=messages
            )
            
            print(final_response.choices[0].message.content)

Parallel Function Calling

OpenAI supports calling multiple functions in one turn.

tools = [
    {"type": "function", "function": {"name": "get_weather", ...}},
    {"type": "function", "function": {"name": "get_news", ...}},
    {"type": "function", "function": {"name": "get_stock_price", ...}}
]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "What's the weather, top news, and AAPL stock price?"}],
    tools=tools
)

# LLM might return 3 tool calls at once
message = response.choices[0].message

results = []
for tool_call in message.tool_calls:
    function_name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)
    
    # Execute each function
    if function_name == "get_weather":
        result = get_weather(**arguments)
    elif function_name == "get_news":
        result = get_news(**arguments)
    elif function_name == "get_stock_price":
        result = get_stock_price(**arguments)
    
    results.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result)
    })

# Return all results together
final_response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "user", "content": "..."},
        message,
        *results  # All tool results
    ]
)

Performance: Parallel calling reduces latency by 2-3× for multi-tool queries.
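Note that the API hands you the tool calls together, but the loop above still executes them one at a time. To capture the latency win on the execution side too, run them concurrently. A minimal sketch using the standard library, assuming the same get_weather, get_news, and get_stock_price functions and the message object from the code above:

import json
from concurrent.futures import ThreadPoolExecutor

FUNCTIONS = {"get_weather": get_weather, "get_news": get_news, "get_stock_price": get_stock_price}

def run_tool(tool_call):
    # Dispatch one tool call to its Python implementation
    args = json.loads(tool_call.function.arguments)
    result = FUNCTIONS[tool_call.function.name](**args)
    return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}

# Tool execution is I/O-bound (external APIs), so threads parallelize it well
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_tool, message.tool_calls))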

Strict Schema Validation

New feature (August 2024): Enforce exact schema compliance.

tools = [{
    "type": "function",
    "function": {
        "name": "book_flight",
        "strict": True,  # Enforce schema
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}$"}
            },
            "required": ["origin", "destination", "date"],
            "additionalProperties": False
        }
    }
}]

Benefit: Guaranteed valid JSON, no parsing errors. Improves reliability from 98% to 99.9%+.
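For models or schema features that strict mode does not cover (regex patterns, for example, were not supported in strict schemas at launch), you can approximate the guarantee client-side. A sketch using the jsonschema package, which is an assumption here (pip install jsonschema):

import json
from jsonschema import validate, ValidationError

BOOK_FLIGHT_SCHEMA = {
    "type": "object",
    "properties": {
        "origin": {"type": "string"},
        "destination": {"type": "string"},
        "date": {"type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
    },
    "required": ["origin", "destination", "date"],
    "additionalProperties": False,
}

def parse_arguments(tool_call):
    # Reject malformed or schema-violating arguments before execution
    args = json.loads(tool_call.function.arguments)
    try:
        validate(instance=args, schema=BOOK_FLIGHT_SCHEMA)
    except ValidationError as e:
        return {"error": f"Schema validation failed: {e.message}"}
    return args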

Anthropic Claude Tool Use

Model support: Claude 3 family (Opus, Sonnet, Haiku) and Claude 3.5 Sonnet

Implementation

import json

import anthropic

client = anthropic.Anthropic()

# Define tools
tools = [
    {
        "name": "get_customer_info",
        "description": "Retrieves customer information from database",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "description": "Unique customer ID"
                }
            },
            "required": ["customer_id"]
        }
    }
]

# Initial request
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "Get info for customer C12345"}
    ]
)

# Check for tool use
for content in response.content:
    if content.type == "tool_use":
        tool_name = content.name
        tool_input = content.input
        tool_use_id = content.id
        
        # Execute tool
        if tool_name == "get_customer_info":
            result = get_customer_info(**tool_input)
            
            # Return result to Claude
            follow_up = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                tools=tools,
                messages=[
                    {"role": "user", "content": "Get info for customer C12345"},
                    {"role": "assistant", "content": response.content},
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "tool_result",
                                "tool_use_id": tool_use_id,
                                "content": json.dumps(result)
                            }
                        ]
                    }
                ]
            )
            
            print(follow_up.content[0].text)

Multi-Tool Support

Claude can also call multiple tools in one response:

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    tools=[
        {"name": "get_weather", ...},
        {"name": "search_flights", ...}
    ],
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo and are there flights available tomorrow?"}
    ]
)

# Response may contain multiple tool_use blocks
tool_results = []
for content in response.content:
    if content.type == "tool_use":
        result = execute_tool(content.name, content.input)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": content.id,
            "content": json.dumps(result)
        })
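The snippet stops at collecting tool_results; to finish the turn, return them all to Claude in a single user message. A sketch continuing the example above (assuming the abbreviated tool list is bound to tools):

# Send every tool result back in one follow-up user message
follow_up = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    tools=tools,
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo and are there flights available tomorrow?"},
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": tool_results},
    ],
)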

Open-Source Model Function Calling

Supported models:

  • Llama 3.1 (8B, 70B, 405B)
  • Mistral Large, Mistral Medium
  • Mixtral 8x7B (limited support)

Llama 3.1 Example

from transformers import AutoTokenizer, AutoModelForCausalLM
import json

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Define tools
tools = [
    {
        "name": "calculate",
        "description": "Perform mathematical calculation",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression to evaluate"}
            },
            "required": ["expression"]
        }
    }
]

# Format prompt with tools
messages = [
    {"role": "system", "content": f"You have access to these tools:\n{json.dumps(tools, indent=2)}"},
    {"role": "user", "content": "What's 127 * 89?"}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Parse the tool call from the response. Prompted this way, Llama 3.1 typically emits a JSON
# object such as {"name": "calculate", "arguments": {"expression": "127 * 89"}}; the exact
# wrapper varies with the chat template and system prompt.
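Because the tool call arrives as plain text, you must extract and parse it yourself, and be ready for that to fail. A minimal, defensive parser sketch for the JSON-object format described above:

import json
import re

def parse_tool_call(text):
    # Extract the first JSON object from the model output; return None on any failure
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call if "name" in call else None

The failure paths in this parser are exactly where the reliability gap below comes from: when parsing fails, retry the generation or fall back to a plain-text answer.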

Reliability: Open-source models have 60-75% tool calling accuracy vs 95%+ for GPT-4/Claude.

When to use: Self-hosted requirements, or cost sensitivity (no per-token API fees once you are already paying for hosting).

Function Calling Patterns

Pattern 1: Single Tool, Simple Flow

Use case: One clear action (search, calculate, lookup).

def simple_tool_agent(user_query):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": user_query}],
        tools=[search_tool],
        tool_choice="auto"
    )
    
    if response.choices[0].message.tool_calls:
        tool_call = response.choices[0].message.tool_calls[0]
        result = execute_tool(tool_call)
        
        # Return result to LLM for natural language response
        final = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "user", "content": user_query},
                response.choices[0].message,
                {"role": "tool", "tool_call_id": tool_call.id, "content": result}
            ]
        )
        return final.choices[0].message.content
    
    return response.choices[0].message.content

Pattern 2: Multi-Step Workflow

Use case: Chain multiple tool calls (retrieve data → process → store).

def multi_step_agent(user_query, max_iterations=5):
    messages = [{"role": "user", "content": user_query}]
    
    for _ in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=messages,
            tools=all_tools
        )
        
        message = response.choices[0].message
        messages.append(message)
        
        # If no tool calls, agent is done
        if not message.tool_calls:
            return message.content
        
        # Execute all tool calls
        for tool_call in message.tool_calls:
            result = execute_tool(tool_call)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })
    
    return "Max iterations reached"

Example:

User: "Find top 3 customers by revenue and email them a thank you note"

Iteration 1: [Calls get_top_customers(limit=3)]
Iteration 2: [Calls send_email(to=..., subject=..., body=...)] × 3
Iteration 3: [Returns "Sent thank you emails to top 3 customers"]

Pattern 3: Conditional Tool Selection

Use case: Different tools based on context.

tools = [
    {"name": "search_web", "description": "Search the internet"},
    {"name": "search_database", "description": "Search internal database"},
    {"name": "calculate", "description": "Perform calculations"}
]

# LLM automatically selects appropriate tool based on query
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": user_query}],
    tools=tools,
    tool_choice="auto"  # Model decides which tool
)

Query: "What's our Q3 revenue?" → Calls search_database
Query: "What's the latest on AI regulation?" → Calls search_web
Query: "What's 15% of $2,400?" → Calls calculate

Error Handling & Security

Validation Before Execution

Critical: Never blindly execute tool calls. Validate first.

def execute_tool_safely(tool_call):
    function_name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)
    
    # 1. Whitelist check
    ALLOWED_FUNCTIONS = ["search_database", "get_weather", "calculate"]
    if function_name not in ALLOWED_FUNCTIONS:
        return {"error": "Unauthorized function"}
    
    # 2. Parameter validation
    if function_name == "search_database":
        # Type-check here; use parameterized queries downstream to prevent SQL injection
        if not isinstance(arguments.get("query"), str):
            return {"error": "Invalid query parameter"}
        
        # Limit query length
        if len(arguments["query"]) > 200:
            return {"error": "Query too long"}
    
    # 3. Rate limiting
    if is_rate_limited(function_name):
        return {"error": "Rate limit exceeded"}
    
    # 4. Execute with timeout
    try:
        result = timeout_execute(globals()[function_name], arguments, timeout=10)
        return result
    except TimeoutError:
        return {"error": "Function execution timeout"}
    except Exception as e:
        return {"error": f"Execution failed: {str(e)}"}

Retry Logic

import json
import time

import openai

def call_with_retry(messages, tools, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4-turbo",
                messages=messages,
                tools=tools,
                timeout=30
            )
            
            # Confirm any tool-call arguments parse as valid JSON before returning
            message = response.choices[0].message
            for tool_call in message.tool_calls or []:
                json.loads(tool_call.function.arguments)
            return response
                
        except json.JSONDecodeError as e:
            if attempt == max_retries - 1:
                raise
            # Invalid JSON in tool arguments, retry
            time.sleep(2 ** attempt)  # Exponential backoff
            
        except openai.APIError as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    
    raise Exception("Max retries exceeded")

Cost Analysis

Token overhead: Function definitions added to every request.

Example:

User query: 50 tokens
Function definitions (3 tools): 200 tokens
Total input: 250 tokens (vs 50 without functions)
Cost multiplier: 5×

But: the 5× multiplier only bites on short queries like this one; over longer conversations the definitions average out to the 15-30% overhead cited above, and they enable capabilities worth far more than the cost.
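To measure the overhead for your own tool definitions, count the tokens directly. A rough sketch with the tiktoken package (an assumption: pip install tiktoken; the API serializes tools internally in a slightly different format, so treat the number as an estimate):

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models

def estimate_tool_tokens(tools):
    # Approximate the per-request token overhead of the tool definitions
    return len(enc.encode(json.dumps(tools)))

print(estimate_tool_tokens(tools))  # e.g. roughly 200 for three small tools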

Cost Optimization

Strategy                                         Token Reduction   Trade-off
Define only relevant tools per query             60%               Requires query classification
Use shorter descriptions                         30%               Less LLM guidance
Lazy tool loading (add tools mid-conversation)   50%               More complexity
Cache tool definitions (Claude)                  90%               Requires prompt caching

Recommendation: Use Claude's prompt caching for tool definitions. Cuts cost by 90% for repeated tool use.

# Claude prompt caching
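# Note: at the time of writing, prompt caching was in beta and required the request
# header anthropic-beta: prompt-caching-2024-07-31.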
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant.",
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": f"Available tools:\n{json.dumps(tools)}",
            "cache_control": {"type": "ephemeral"}  # Cache tools
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)

Result: The first call pays full price; subsequent calls within the cache window (roughly five minutes of inactivity by default) pay about 10% of the input price for the cached tool definitions.

Production Checklist

Before deploying function calling to production:

  • Whitelist allowed functions - Prevent arbitrary execution
  • Validate all parameters - Check types, ranges, SQL injection
  • Implement timeouts - Prevent hanging on slow external APIs
  • Add retry logic - Handle transient failures gracefully
  • Log all tool calls - Audit trail for debugging and compliance
  • Rate limit - Prevent abuse, manage external API quotas
  • Monitor costs - Track token usage, set budget alerts
  • Test edge cases - Missing parameters, invalid inputs, API failures
  • Human approval for sensitive ops - Deletions, payments, emails to customers (see the sketch after this list)
  • Graceful degradation - Fallback behavior if tools unavailable
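The human-approval item bolts neatly onto the dispatch path. A minimal sketch; the SENSITIVE_FUNCTIONS set and request_human_approval helper are illustrative names, not from any library:

SENSITIVE_FUNCTIONS = {"delete_record", "send_payment", "email_customer"}  # hypothetical names

def execute_with_approval(tool_call):
    # Route sensitive operations through a human before execution
    if tool_call.function.name in SENSITIVE_FUNCTIONS:
        approved = request_human_approval(tool_call)  # e.g. a Slack prompt or admin UI
        if not approved:
            return {"error": "Rejected by human reviewer"}
    return execute_tool_safely(tool_call)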

Frequently Asked Questions

Can the LLM call functions I didn't define?

No, not on its own. The model can only request functions; your code decides what actually runs. It may occasionally hallucinate a function name outside your tools array, which is exactly what the whitelist check shown earlier catches.

What if the LLM hallucinates function arguments?

Validate all arguments before execution. Use strict schema validation (OpenAI) or manual checks. Never trust LLM output blindly.
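One way to sketch those manual checks is with pydantic (an assumption: pydantic v2 installed; WeatherArgs is an illustrative model for the weather tool defined earlier):

from pydantic import BaseModel, ValidationError

class WeatherArgs(BaseModel):
    city: str
    units: str = "celsius"

def validate_weather_args(tool_call):
    # Parse and type-check the raw argument JSON in one step
    try:
        return WeatherArgs.model_validate_json(tool_call.function.arguments)
    except ValidationError as e:
        return {"error": f"Invalid arguments: {e}"}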

How do I prevent the LLM from calling expensive APIs repeatedly?

Implement rate limiting per function. Track calls per session, return error if limit exceeded.

import json

call_counts = {}

def execute_tool(tool_call):
    function_name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)
    call_counts[function_name] = call_counts.get(function_name, 0) + 1
    
    if call_counts[function_name] > 10:  # Max 10 calls per function
        return {"error": "Rate limit: Too many calls to this function"}
    
    return globals()[function_name](**arguments)

Should I use tool_choice="auto" or force a specific tool?

Auto: Let the model decide (recommended for most cases).
Force: Use when you know exactly which tool should run (e.g., a form submission must call submit_form).

# Force tool call
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Submit this form"}],
    tools=[submit_form_tool],
    tool_choice={"type": "function", "function": {"name": "submit_form"}}
)

Bottom line: Function calling transforms LLMs into agents that interact with real systems. OpenAI and Claude have 95%+ reliability. Always validate before execution, implement retries, and monitor costs. The 15-30% token overhead is worth the 10× capability expansion.

Next: Read our Multi-Agent Systems guide for coordinating multiple function-calling agents.