Task Completion
LLM-as-a-judge evaluation that measures how well your agent completes assigned tasks
Overview
Task Completion is an LLM-as-a-judge evaluation method that assesses whether your AI agent successfully completed a given task. It uses a two-stage process: first it extracts what the user wanted and what actually happened from the agent's logs, then it judges how well the outcome fulfills the original task.
Ideal for: Verifying task understanding, measuring execution quality, validating tool usage, and comparing agent performance across different scenarios.
What Gets Evaluated
This evaluation analyzes how well your agent's responses complete the assigned task:
- ✅ Evaluates: "Did the agent understand what the user wanted?"
- ✅ Evaluates: "Was the task completed successfully?"
- ✅ Evaluates: "How well was the task executed?"
- ❌ Does NOT evaluate: Response style, tone, or formatting - only task completion
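For example, given a trace like the hypothetical one below (all field names are illustrative assumptions, not a required schema), the judge scores whether the meeting was actually cancelled and the attendees notified, while the phrasing of the reply is ignored:

```python
# Hypothetical agent trace -- field names are illustrative, not a schema.
trace = {
    "input": "Cancel my 3pm meeting and email the attendees",
    "tool_calls": [
        {"name": "delete_event", "args": {"time": "15:00"}},
        {"name": "send_email", "args": {"to": "attendees"}},
    ],
    "output": "Done! Your 3pm meeting is cancelled and the attendees were notified.",
}
# Evaluated: was the meeting cancelled, and were the attendees emailed?
# Not evaluated: the upbeat "Done!" phrasing -- tone does not affect the score.
```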
Key Features
- Two-Stage Analysis: Extracts task intent and outcome, then judges completion quality
- Comprehensive Assessment: Considers task understanding, execution quality, and tool usage
- Flexible Scoring: Provides a 0.0-1.0 score based on completion effectiveness (see the example after this list)
- Tool Integration: Evaluates whether tools were used appropriately for task completion
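As one way to consume the continuous score, a simple pass/fail gate might look like the sketch below; the result shape and the 0.7 threshold are illustrative assumptions, not defaults of this evaluation:

```python
def passes(result: dict, threshold: float = 0.7) -> bool:
    """Gate a run on the judge's 0.0-1.0 completion score."""
    return result["score"] >= threshold

# Hypothetical judge output:
verdict = {"score": 0.85, "reason": "Task understood; booking tool used correctly."}
assert passes(verdict)
```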
How It Works
The evaluation happens in two stages:
- Task & Outcome Extraction: Analyzes log data to identify what the user wanted and what actually happened
- Verdict Generation: An LLM judge evaluates the match between task and outcome, considering understanding, execution quality, and tool usage
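A minimal sketch of this two-stage flow, assuming an OpenAI-style chat client; the prompts, model name, and JSON shapes are illustrative assumptions, not the evaluation's actual internals:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACT_PROMPT = (
    "From the agent log below, state (1) the task the user wanted done and "
    "(2) the outcome the agent actually produced. Reply as JSON with keys "
    "'task' and 'outcome'.\n\nLog:\n{log}"
)

JUDGE_PROMPT = (
    "Task: {task}\nOutcome: {outcome}\n"
    "Judge how well the outcome fulfills the task, considering task "
    "understanding, execution quality, and appropriate tool usage. Reply as "
    "JSON with keys 'score' (a float from 0.0 to 1.0) and 'reason'."
)

def _ask_json(prompt: str) -> dict:
    """Send one prompt to the judge model and parse its JSON reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def task_completion_score(log: str) -> dict:
    # Stage 1: extract what the user wanted and what actually happened.
    extracted = _ask_json(EXTRACT_PROMPT.format(log=log))
    # Stage 2: judge how well the outcome fulfills the task.
    return _ask_json(JUDGE_PROMPT.format(task=extracted["task"],
                                         outcome=extracted["outcome"]))
```

Keeping extraction and judging as separate LLM calls mirrors the two-stage design described above: the first call normalizes a raw log into a clean task/outcome pair, so the second call can judge completion without being distracted by log noise.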