Task Completion

LLM-as-a-judge evaluation that measures how well your agent completes assigned tasks

Overview

Task Completion is an LLM-as-a-judge evaluation method that assesses whether your AI agent successfully completed a given task. It uses a two-stage process: first extracting what the user wanted and what actually happened, then judging how well the outcome fulfills the original task.

Ideal for: Verifying task understanding, measuring execution quality, validating tool usage, and comparing agent performance across different scenarios.

What Gets Evaluated

This evaluation analyzes the task completion quality of your agent's responses:

  • ✅ Evaluates: "Did the agent understand what the user wanted?"
  • ✅ Evaluates: "Was the task completed successfully?"
  • ✅ Evaluates: "How well was the task executed?"
  • ❌ Does NOT evaluate: Response style, tone, or formatting - only task completion
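
For concreteness, here is a minimal sketch of the kind of agent log this evaluation inspects. The field names (`user_message`, `tool_calls`, `final_response`) are illustrative assumptions, not a required schema:

```python
# A hypothetical agent trace; field names are illustrative, not a required schema.
agent_log = {
    "user_message": "Book me a table for two at an Italian restaurant tonight.",
    "tool_calls": [
        {
            "tool": "restaurant_search",
            "arguments": {"cuisine": "italian", "party_size": 2},
            "result": {"name": "Trattoria Roma", "available": True},
        }
    ],
    "final_response": "I booked a table for two at Trattoria Roma for 7pm tonight.",
}
```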

Key Features

  • Two-Stage Analysis: Extracts task intent and outcome, then judges completion quality
  • Comprehensive Assessment: Considers task understanding, execution quality, and tool usage
  • Flexible Scoring: Provides a 0.0-1.0 score based on completion effectiveness (see the result sketch after this list)
  • Tool Integration: Evaluates whether tools were used appropriately for task completion
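
As a rough illustration of how such a verdict might be represented, the dataclass below is a hypothetical shape, not the library's actual return type; the 0.0-1.0 score and the extracted task/outcome fields follow from the features above:

```python
from dataclasses import dataclass

@dataclass
class TaskCompletionVerdict:
    """Hypothetical result shape for a task-completion judgment."""
    score: float    # 0.0 (task not completed) to 1.0 (fully completed)
    reasoning: str  # the judge's explanation of the verdict
    task: str       # extracted: what the user wanted
    outcome: str    # extracted: what actually happened
```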

How It Works

The evaluation happens in two stages, sketched in code after the list:

  1. Task & Outcome Extraction: Analyzes log data to identify what the user wanted and what actually happened
  2. Verdict Generation: An LLM judge evaluates the match between task and outcome, considering understanding, execution quality, and tool usage
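
The sketch below shows how the two stages fit together, under stated assumptions: `call_llm` is a stand-in for whatever chat-completion client you use (it takes a prompt string and returns the model's text), and the prompts and JSON keys are illustrative, not the evaluator's actual internals:

```python
import json

def evaluate_task_completion(agent_log: dict, call_llm) -> dict:
    """Two-stage sketch: extract task and outcome, then judge completion.

    `call_llm` is assumed to take a prompt string and return model text.
    """
    # Stage 1: Task & Outcome Extraction from the raw log data.
    extraction = call_llm(
        "From this agent log, state (a) what the user wanted and "
        "(b) what actually happened, as JSON with keys 'task' and 'outcome'.\n"
        + json.dumps(agent_log)
    )
    extracted = json.loads(extraction)

    # Stage 2: Verdict Generation by an LLM judge, weighing task
    # understanding, execution quality, and tool usage.
    verdict = call_llm(
        "Judge how well the outcome fulfills the task, considering task "
        "understanding, execution quality, and tool usage. Return JSON with "
        "'score' (0.0-1.0) and 'reasoning'.\n"
        f"Task: {extracted['task']}\nOutcome: {extracted['outcome']}"
    )
    return json.loads(verdict)
```

Splitting extraction from judging keeps the judge's prompt focused on a single comparison (task versus outcome) rather than on parsing raw logs, which is the motivation for the two-stage design described above.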