Task Completion
LLM-as-a-judge evaluation that measures how well your agent completes assigned tasks
Overview
Task Completion is an LLM-as-a-judge evaluation method that assesses whether your AI agent successfully completed a given task. It uses a two-stage process: first it extracts what the user wanted and what actually happened from the agent's logs, then it judges how well the outcome fulfills the original task.
Ideal for: Verifying task understanding, measuring execution quality, validating tool usage, and comparing agent performance across different scenarios.
What Gets Evaluated
This evaluation analyzes how well your agent's responses complete the assigned task:
- ✅ Evaluates: "Did the agent understand what the user wanted?"
- ✅ Evaluates: "Was the task completed successfully?"
- ✅ Evaluates: "How well was the task executed?"
- ❌ Does NOT evaluate: Response style, tone, or formatting - only task completion
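For example, given a trace like the hypothetical one below (all field names are illustrative assumptions, not a required schema), the judge scores whether the meeting was actually cancelled and the attendees notified, while the phrasing of the reply is ignored:

```python
# Hypothetical agent trace -- field names are illustrative, not a schema.
trace = {
    "input": "Cancel my 3pm meeting and email the attendees",
    "tool_calls": [
        {"name": "delete_event", "args": {"time": "15:00"}},
        {"name": "send_email", "args": {"to": "attendees"}},
    ],
    "output": "Done! Your 3pm meeting is cancelled and the attendees were notified.",
}
# Evaluated: was the meeting cancelled, and were the attendees emailed?
# Not evaluated: the upbeat "Done!" phrasing -- tone does not affect the score.
```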
Key Features
- Two-Stage Analysis: Extracts task intent and outcome, then judges completion quality
- Comprehensive Assessment: Considers task understanding, execution quality, and tool usage
- Flexible Scoring: Provides a 0.0-1.0 score based on completion effectiveness (see the example after this list)
- Tool Integration: Evaluates whether tools were used appropriately for task completion
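As one way to consume the continuous score, a simple pass/fail gate might look like the sketch below; the result shape and the 0.7 threshold are illustrative assumptions, not defaults of this evaluation:

```python
def passes(result: dict, threshold: float = 0.7) -> bool:
    """Gate a run on the judge's 0.0-1.0 completion score."""
    return result["score"] >= threshold

# Hypothetical judge output:
verdict = {"score": 0.85, "reason": "Task understood; booking tool used correctly."}
assert passes(verdict)
```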
How It Works
The evaluation happens in two stages:
- Task & Outcome Extraction: Analyzes log data to identify what the user wanted and what actually happened
- Verdict Generation: An LLM judge evaluates the match between task and outcome, considering understanding, execution quality, and tool usage
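A minimal sketch of this two-stage flow, assuming an OpenAI-style chat client; the prompts, model name, and JSON shapes are illustrative assumptions, not the evaluation's actual internals:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACT_PROMPT = (
    "From the agent log below, state (1) the task the user wanted done and "
    "(2) the outcome the agent actually produced. Reply as JSON with keys "
    "'task' and 'outcome'.\n\nLog:\n{log}"
)

JUDGE_PROMPT = (
    "Task: {task}\nOutcome: {outcome}\n"
    "Judge how well the outcome fulfills the task, considering task "
    "understanding, execution quality, and appropriate tool usage. Reply as "
    "JSON with keys 'score' (a float from 0.0 to 1.0) and 'reason'."
)

def _ask_json(prompt: str) -> dict:
    """Send one prompt to the judge model and parse its JSON reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def task_completion_score(log: str) -> dict:
    # Stage 1: extract what the user wanted and what actually happened.
    extracted = _ask_json(EXTRACT_PROMPT.format(log=log))
    # Stage 2: judge how well the outcome fulfills the task.
    return _ask_json(JUDGE_PROMPT.format(task=extracted["task"],
                                         outcome=extracted["outcome"]))
```

Keeping extraction and judging as separate LLM calls mirrors the two-stage design described above: the first call normalizes a raw log into a clean task/outcome pair, so the second call can judge completion without being distracted by log noise.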