Tool Correctness
Validates whether AI agents call the expected tools with correct parameters and outputs
Overview
Tool Correctness evaluates whether your AI agent calls the expected tools with the correct parameters and outputs. It compares the tools actually called by the agent against a predefined set of expected tool calls, checking tool names, input parameters, and outputs.
Ideal for: Validating multi-tool workflows, ensuring correct tool parameters, verifying tool outputs, testing function-calling capabilities, and quality assurance for tool-using agents.
What Gets Evaluated
This evaluation compares actual tool calls against expected tool calls:
- ✅ Evaluates: "Did the agent call the correct tools?"
- ✅ Evaluates: "Are the parameters exactly as expected?"
- ✅ Evaluates: "Do the tool outputs match expectations?"
- ❌ Does NOT evaluate: Response quality, tone, or conversational aspects - only tool usage
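For concreteness, here is a hypothetical pair of expected and actual tool calls. The record shape (dictionaries with `tool_name`, `input`, and `output` keys) is an illustrative assumption, not the exact schema used by the evaluator.

```python
# Hypothetical tool-call records (field names are illustrative assumptions).
expected_call = {
    "tool_name": "get_weather",
    "input": {"city": "Paris", "unit": "celsius"},
    "output": {"temperature": 18},
}

actual_call = {
    "tool_name": "get_weather",
    "input": {"city": "Paris", "unit": "fahrenheit"},  # parameter differs from expected
    "output": {"temperature": 64},
}

# The tool name matches, but the input parameters do not, so this call
# would count as a mismatch when parameter checking is enabled.
```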
Key Features
- Direct Comparison: Uses deep equality to match actual vs expected tool calls
- Flexible Validation: Configure what to evaluate - tool names, parameters, outputs, or all
- Order Awareness: Optionally validate sequential tool execution
- Ratio-Based Scoring: Provides 0.0-1.0 score based on percentage of correctly matched tools
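As a rough sketch of how deep-equality matching with configurable checks could work, assuming the same illustrative record shape as above (the function name and flags below are hypothetical, not the library's API):

```python
from typing import Any

def calls_match(
    expected: dict[str, Any],
    actual: dict[str, Any],
    check_input: bool = True,
    check_output: bool = True,
) -> bool:
    """Deep-equality comparison of one expected vs. one actual tool call.

    Hypothetical helper: field names and flags are illustrative assumptions.
    """
    if expected.get("tool_name") != actual.get("tool_name"):
        return False
    # Python's == performs deep equality on nested dicts and lists.
    if check_input and expected.get("input") != actual.get("input"):
        return False
    if check_output and expected.get("output") != actual.get("output"):
        return False
    return True
```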
How It Works
The evaluation uses a direct comparison approach:
- Extract Tool Calls: Retrieves actual and expected tool calls from log metadata (`metadata.tools_called` and `metadata.expected_tools`)
- Match & Score: Compares tools using your configured parameters (names, inputs, outputs, ordering) and calculates:

  `score = (# of matched tools) / (# of expected tools)`
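Putting it together, a minimal self-contained sketch of the ratio-based score, assuming the metadata is a plain dictionary holding `tools_called` and `expected_tools` lists (the function name, flags, and field names are assumptions for illustration, not the library's API):

```python
from typing import Any

def tool_correctness_score(
    metadata: dict[str, Any],
    check_input: bool = True,
    check_output: bool = True,
    exact_order: bool = False,
) -> float:
    """Score = matched expected tools / total expected tools (0.0-1.0).

    Illustrative sketch only; field names and flags are assumptions.
    """
    expected = metadata.get("expected_tools", [])
    actual = list(metadata.get("tools_called", []))
    if not expected:
        return 1.0  # nothing expected, nothing to miss (assumed convention)

    def same(e: dict[str, Any], a: dict[str, Any]) -> bool:
        if e.get("tool_name") != a.get("tool_name"):
            return False
        if check_input and e.get("input") != a.get("input"):
            return False
        if check_output and e.get("output") != a.get("output"):
            return False
        return True

    if exact_order:
        # Order-aware: the i-th expected call must match the i-th actual call.
        matched = sum(1 for e, a in zip(expected, actual) if same(e, a))
    else:
        # Order-insensitive: each expected call may match any remaining
        # actual call, and each actual call is consumed at most once.
        matched = 0
        remaining = list(actual)
        for e in expected:
            for i, a in enumerate(remaining):
                if same(e, a):
                    matched += 1
                    del remaining[i]
                    break

    return matched / len(expected)
```

For example, if three tools were expected and two were matched, the score would be 2 / 3 ≈ 0.67.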