Argument Correctness
Validates whether AI agents provide correct and appropriate arguments to tool calls
Overview
Argument Correctness evaluates whether your AI agent provides correct and appropriate parameters to tool calls. This method assumes the tools were already selected and focuses specifically on validating the quality, format, and suitability of the arguments passed to each tool.
Ideal for: Validating API parameter correctness, ensuring arguments align with user intent, debugging parameter extraction issues, and quality assurance for production agents.
What Gets Evaluated
This evaluation analyzes the input parameters for each tool call, not the tool selection itself (see the sketch after this list):
- ✅ Evaluates: "You called get_weather(location='New York') - is 'New York' the right parameter?"
- ✅ Evaluates: "Are the parameters properly formatted and sufficient?"
- ✅ Evaluates: "Do the arguments align with user intent?"
- ❌ Does NOT evaluate: "Should you have called get_weather vs get_forecast?" - tool selection is out of scope; only argument quality is judged
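To make the distinction concrete, here is a minimal sketch of the kind of record the judge inspects for a single tool call. The field names (tool_name, arguments) and the overall shape are illustrative assumptions, not the evaluator's exact log schema.

```python
# Hypothetical log record for one tool call; field names are illustrative only.
user_input = "What's the weather like in New York right now?"

tool_call = {
    "tool_name": "get_weather",             # which tool ran (not judged by this metric)
    "arguments": {"location": "New York"},  # judged: do these match the user's request?
}

# The judge is asked, in effect:
# "Given user_input, are the arguments passed to get_weather correct, well-formed, and sufficient?"
```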
Key Features
- Parameter-Focused Analysis: Validates argument quality without judging tool selection
- Binary Assessment: Each tool call receives a correct/incorrect verdict for its arguments
- Ratio-Based Scoring: The overall score is the fraction of tool calls with correct arguments
- Detailed Reasoning: Provides an explanation for why each tool call's arguments passed or failed (see the example after this list)
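As a rough illustration of the binary assessment paired with reasoning, a per-tool-call verdict might look like the following. This is a hypothetical shape, not the evaluator's actual output schema.

```python
# Hypothetical per-tool-call verdict; keys and values are illustrative only.
verdict = {
    "tool_name": "get_weather",
    "arguments_correct": True,   # binary assessment of the arguments
    "reason": "Location 'New York' matches the city named in the user's request.",
}
```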
How It Works
The evaluation uses an LLM-as-a-judge approach:
- Extract Context: Retrieves user input, agent output, and tool calls from logs
- Analyze Each Tool: The LLM judge examines the arguments for each tool call
- Score Arguments: Each tool call receives a binary assessment: its arguments are either correct or incorrect
- Calculate Overall Score: Combines the per-call verdicts into a single ratio (a scoring sketch follows this list):
score = (tool calls with correct arguments) / (total tool calls)
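A minimal sketch of that scoring step, assuming each tool call has already received a binary verdict from the judge (function and variable names are illustrative):

```python
# Minimal sketch of ratio-based scoring over per-tool-call verdicts.
def argument_correctness_score(verdicts: list[bool]) -> float:
    """Return the fraction of tool calls whose arguments were judged correct."""
    if not verdicts:
        return 0.0  # assumption: no tool calls means nothing to credit
    return sum(verdicts) / len(verdicts)

# Example: 3 of 4 tool calls had correct arguments -> 0.75
print(argument_correctness_score([True, True, False, True]))
```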
💡 Pro Tip: Use Argument Correctness when you need to ensure your agent is passing the right parameters to tools, not when you're concerned about whether it's calling the right tools in the first place.