Tool Correctness
Validates whether AI agents call the expected tools with correct parameters and outputs
Overview
Tool Correctness evaluates whether your AI agent calls the expected tools with the correct parameters and outputs. It compares the tools actually called by the agent against a predefined set of expected tool calls, checking tool names, input parameters, and outputs.
Ideal for: Validating multi-tool workflows, ensuring correct tool parameters, verifying tool outputs, testing function-calling capabilities, and quality assurance for tool-using agents.
What Gets Evaluated
This evaluation compares actual tool calls against expected tool calls:
- ✅ Evaluates: "Did the agent call the correct tools?"
- ✅ Evaluates: "Are the parameters exactly as expected?"
- ✅ Evaluates: "Do the tool outputs match expectations?"
- ❌ Does NOT evaluate: Response quality, tone, or conversational aspects - only tool usage
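For concreteness, here is a hypothetical pair of expected and actual tool calls. The record shape (dictionaries with `tool_name`, `input`, and `output` keys) is an illustrative assumption, not the exact schema used by the evaluator.

```python
# Hypothetical tool-call records (field names are illustrative assumptions).
expected_call = {
    "tool_name": "get_weather",
    "input": {"city": "Paris", "unit": "celsius"},
    "output": {"temperature": 18},
}

actual_call = {
    "tool_name": "get_weather",
    "input": {"city": "Paris", "unit": "fahrenheit"},  # parameter differs from expected
    "output": {"temperature": 64},
}

# The tool name matches, but the input parameters do not, so this call
# would count as a mismatch when parameter checking is enabled.
```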
Key Features
- Direct Comparison: Uses deep equality to match actual vs expected tool calls
- Flexible Validation: Configure what to evaluate - tool names, parameters, outputs, or all
- Order Awareness: Optionally validate sequential tool execution
- Ratio-Based Scoring: Provides 0.0-1.0 score based on percentage of correctly matched tools
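As a rough sketch of how deep-equality matching with configurable checks could work, assuming the same illustrative record shape as above (the function name and flags below are hypothetical, not the library's API):

```python
from typing import Any

def calls_match(
    expected: dict[str, Any],
    actual: dict[str, Any],
    check_input: bool = True,
    check_output: bool = True,
) -> bool:
    """Deep-equality comparison of one expected vs. one actual tool call.

    Hypothetical helper: field names and flags are illustrative assumptions.
    """
    if expected.get("tool_name") != actual.get("tool_name"):
        return False
    # Python's == performs deep equality on nested dicts and lists.
    if check_input and expected.get("input") != actual.get("input"):
        return False
    if check_output and expected.get("output") != actual.get("output"):
        return False
    return True
```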
How It Works
The evaluation uses a direct comparison approach:
- Extract Tool Calls: Retrieves actual and expected tool calls from log metadata (`metadata.tools_called` and `metadata.expected_tools`)
- Match & Score: Compares tools using your configured parameters (names, inputs, outputs, ordering) and calculates:

  `score = (# of matched tools) / (# of expected tools)`
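Putting it together, a minimal self-contained sketch of the ratio-based score, assuming the metadata is a plain dictionary holding `tools_called` and `expected_tools` lists (the function name, flags, and field names are assumptions for illustration, not the library's API):

```python
from typing import Any

def tool_correctness_score(
    metadata: dict[str, Any],
    check_input: bool = True,
    check_output: bool = True,
    exact_order: bool = False,
) -> float:
    """Score = matched expected tools / total expected tools (0.0-1.0).

    Illustrative sketch only; field names and flags are assumptions.
    """
    expected = metadata.get("expected_tools", [])
    actual = list(metadata.get("tools_called", []))
    if not expected:
        return 1.0  # nothing expected, nothing to miss (assumed convention)

    def same(e: dict[str, Any], a: dict[str, Any]) -> bool:
        if e.get("tool_name") != a.get("tool_name"):
            return False
        if check_input and e.get("input") != a.get("input"):
            return False
        if check_output and e.get("output") != a.get("output"):
            return False
        return True

    if exact_order:
        # Order-aware: the i-th expected call must match the i-th actual call.
        matched = sum(1 for e, a in zip(expected, actual) if same(e, a))
    else:
        # Order-insensitive: each expected call may match any remaining
        # actual call, and each actual call is consumed at most once.
        matched = 0
        remaining = list(actual)
        for e in expected:
            for i, a in enumerate(remaining):
                if same(e, a):
                    matched += 1
                    del remaining[i]
                    break

    return matched / len(expected)
```

For example, if three tools were expected and two were matched, the score would be 2 / 3 ≈ 0.67.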