Reactive Agents

Optimization Algorithm

Understanding the performance impact of configuration parameters

Overview

Reactive Agents uses a sophisticated optimization algorithm that combines Thompson Sampling (a Bayesian multi-armed bandit approach) with K-Means++ clustering to automatically learn and select the best hyperparameter configurations for different types of user requests.

Core Components

1. Thompson Sampling (Multi-Armed Bandit)

The system treats each hyperparameter configuration as an "arm" in a multi-armed bandit problem. Thompson Sampling is used to balance exploration (trying new configurations) with exploitation (using known good configurations).

Implementation (lib/server/middlewares/idkhub-configuration.ts:30-56):

function getOptimalArm(arms: SkillOptimizationArm[]): SkillOptimizationArm {
  // Implement Thompson Sampling algorithm for multi-armed bandit
  // Thompson Sampling uses Bayesian approach: sample from posterior Beta distribution
  // and select the arm with highest sampled value

  let optimalArm = arms[0];
  let maxSample = -Infinity;

  for (const arm of arms) {
    // Beta distribution parameters with uniform prior (Beta(1,1))
    // alpha = successes + 1, beta = failures + 1
    const successes = arm.stats.total_reward;
    const failures = arm.stats.n - arm.stats.total_reward;
    const alpha = successes + 1;
    const beta = failures + 1;

    // Sample from Beta(alpha, beta)
    const sample = sampleBeta(alpha, beta);

    if (sample > maxSample) {
      maxSample = sample;
      optimalArm = arm;
    }
  }

  return optimalArm;
}
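
The sampleBeta helper is referenced above but not shown. A minimal sketch of how such a helper could be implemented without an external statistics library, using the Marsaglia-Tsang Gamma sampler (valid here because alpha = successes + 1 and beta = failures + 1 are always >= 1), is:

// Hypothetical sketch of the sampleBeta helper used above. A Beta(alpha, beta)
// variate can be drawn as X / (X + Y) where X ~ Gamma(alpha, 1) and
// Y ~ Gamma(beta, 1).
function sampleGamma(shape: number): number {
  // Marsaglia-Tsang method, valid for shape >= 1
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);

  for (;;) {
    let x: number;
    let v: number;
    do {
      // Standard normal draw via Box-Muller
      const u1 = 1 - Math.random();
      const u2 = Math.random();
      x = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;

    const u = Math.random();
    if (u < 1 - 0.0331 * x ** 4 || Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) {
      return d * v;
    }
  }
}

function sampleBeta(alpha: number, beta: number): number {
  const x = sampleGamma(alpha);
  const y = sampleGamma(beta);
  return x / (x + y);
}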

Key Characteristics:

  • Uses Beta distribution with uniform prior Beta(1,1)
  • Samples from posterior distribution to balance exploration and exploitation
  • Selects arm with highest sampled value
  • Automatically adapts based on observed rewards

2. K-Means++ Clustering

User requests are grouped by semantic similarity using K-Means++ clustering on embeddings. This allows the system to learn different optimal configurations for different types of requests.

Implementation (lib/server/utils/math.ts:100-171):

export function kMeansClustering(
  embeddings: number[][],
  k: number,
  maxIterations = 100,
): ClusterResult {
  const n = embeddings.length;

  if (k >= n) {
    // Each point is its own cluster
    return {
      clusters: Array.from({ length: n }, (_, i) => i),
      centroids: embeddings.map((e) => [...e]),
      iterations: 0,
    };
  }

  // Initialize centroids using k-means++
  let centroids = initializeCentroidsKMeansPlusPlus(embeddings, k);
  const clusters = new Array(n).fill(0);

  for (let iteration = 0; iteration < maxIterations; iteration++) {
    let changed = false;

    // Assign each point to the nearest centroid
    for (let i = 0; i < n; i++) {
      let minDistance = Infinity;
      let nearestCluster = 0;

      for (let c = 0; c < k; c++) {
        const distance = calculateDistance(embeddings[i], centroids[c]);
        if (distance < minDistance) {
          minDistance = distance;
          nearestCluster = c;
        }
      }

      if (clusters[i] !== nearestCluster) {
        clusters[i] = nearestCluster;
        changed = true;
      }
    }

    // If no assignments changed, we've converged
    if (!changed) {
      return { clusters, centroids, iterations: iteration + 1 };
    }

    // Update centroids
    const newCentroids: number[][] = [];
    for (let c = 0; c < k; c++) {
      const clusterPoints = embeddings.filter((_, i) => clusters[i] === c);
      if (clusterPoints.length > 0) {
        newCentroids.push(calculateCentroid(clusterPoints));
      } else {
        // Keep the old centroid if no points are assigned to this cluster
        newCentroids.push([...centroids[c]]);
      }
    }

    centroids = newCentroids;
  }

  return { clusters, centroids, iterations: maxIterations };
}
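
The initializeCentroidsKMeansPlusPlus helper is referenced but not shown. A minimal sketch of k-means++ seeding, assuming squared-Euclidean-distance weighting, is:

// Hypothetical sketch of k-means++ seeding: the first centroid is a uniformly
// random point; each subsequent centroid is chosen with probability
// proportional to its squared distance from the nearest already-chosen centroid.
function initializeCentroidsKMeansPlusPlus(
  embeddings: number[][],
  k: number,
): number[][] {
  const centroids: number[][] = [];
  const firstIndex = Math.floor(Math.random() * embeddings.length);
  centroids.push([...embeddings[firstIndex]]);

  while (centroids.length < k) {
    // Squared distance from each point to its nearest chosen centroid
    const distances = embeddings.map((point) =>
      Math.min(
        ...centroids.map((centroid) =>
          point.reduce((sum, value, i) => sum + (value - centroid[i]) ** 2, 0),
        ),
      ),
    );
    const total = distances.reduce((a, b) => a + b, 0);

    // Weighted random selection proportional to squared distance
    let threshold = Math.random() * total;
    let nextIndex = 0;
    for (let i = 0; i < distances.length; i++) {
      threshold -= distances[i];
      if (threshold <= 0) {
        nextIndex = i;
        break;
      }
    }
    centroids.push([...embeddings[nextIndex]]);
  }

  return centroids;
}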

Key Characteristics:

  • K-Means++ initialization for better initial centroid selection
  • Euclidean distance-based clustering
  • Automatic convergence detection
  • Maximum 100 iterations limit

3. Cluster Selection

For each incoming request, the system finds the most relevant cluster using cosine similarity:

function getOptimalCluster(
  embedding: number[],
  clusters: SkillOptimizationCluster[],
): SkillOptimizationCluster {
  // Find the cluster with the highest cosine similarity to the embedding
  let optimalCluster = clusters[0];
  let maxSimilarity = -1;

  for (const cluster of clusters) {
    const similarity = cosineSimilarity(embedding, cluster.centroid);
    if (similarity > maxSimilarity) {
      maxSimilarity = similarity;
      optimalCluster = cluster;
    }
  }

  return optimalCluster;
}
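
The cosineSimilarity helper is the standard dot-product-over-norms computation; a minimal sketch of such a helper is:

// Hypothetical sketch of a cosine similarity helper:
// dot(a, b) / (||a|| * ||b||), returning 0 when either vector has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}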

4. Statistical Updates

After each request, the system updates arm statistics using incremental formulas:

Implementation (lib/server/middlewares/optimizer/hyperparameters.ts:19-38):

export async function updateArmStats(
  userDataStorageConnector: UserDataStorageConnector,
  arm: SkillOptimizationArm,
  reward: number,
) {
  // Update arm statistics using incremental update formulas for Thompson Sampling
  const newN = arm.stats.n + 1;
  const newTotalReward = arm.stats.total_reward + reward;
  const newMean = newTotalReward / newN;
  const newN2 = arm.stats.n2 + reward * reward;

  await userDataStorageConnector.updateSkillOptimizationArm(arm.id, {
    stats: {
      n: newN,
      mean: newMean,
      n2: newN2,
      total_reward: newTotalReward,
    },
  });
}
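
For example, an arm with stats {n: 4, total_reward: 3, mean: 0.75, n2: 2.5} (say, from rewards 1, 1, 0.5, 0.5) that receives a new reward of 0.5 is updated to {n: 5, total_reward: 3.5, mean: 0.7, n2: 2.75}.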

Configuration Space

The system optimizes across 9 base configurations (lib/server/optimization/base-arms.ts), varying:

Configuration    Temperature Range    Reasoning Level
Arm 1            0 - 0.33             0 (no reasoning)
Arm 2            0 - 0.33             0.5 (moderate)
Arm 3            0 - 0.33             1 (full reasoning)
Arm 4            0.34 - 0.66          0
Arm 5            0.34 - 0.66          0.5
Arm 6            0.34 - 0.66          1
Arm 7            0.67 - 1             0
Arm 8            0.67 - 1             0.5
Arm 9            0.67 - 1             1
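
As an illustration of this base grid, the 9 arms could be enumerated as the cross product of 3 temperature ranges and 3 reasoning levels. The names below are illustrative, not the actual contents of lib/server/optimization/base-arms.ts:

// Illustrative sketch only: enumerate the 9 base arms as
// 3 temperature ranges x 3 reasoning levels.
interface BaseArm {
  temperatureRange: [number, number];
  reasoningLevel: number;
}

const temperatureRanges: [number, number][] = [
  [0, 0.33],
  [0.34, 0.66],
  [0.67, 1],
];
const reasoningLevels = [0, 0.5, 1];

const baseArms: BaseArm[] = temperatureRanges.flatMap((range) =>
  reasoningLevels.map((level) => ({
    temperatureRange: range,
    reasoningLevel: level,
  })),
);
// baseArms.length === 9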

These base configurations are further combined with different:

  • Models (e.g., GPT-4, Claude, etc.)
  • System prompts
  • Other hyperparameters

Together, these combinations create a large configuration space to explore and optimize.

How Configuration Parameters Affect the Algorithm

The optimization system is highly configurable. Here's how each parameter influences the algorithm:

Skill Description

Impact: Guides initial system prompt generation and defines the task context

  • Used to generate seed system prompts via an LLM (lib/server/optimization/utils/system-prompt.ts:20-24)
  • Provides context for what the skill should accomplish
  • Influences the quality of initial configurations before optimization begins

Configuration Count (Number of Clusters)

Impact: Determines how finely request types are segmented

  • Directly sets k in K-Means++ clustering (lib/server/middlewares/optimizer/clusters.ts:34-52)
  • Higher values = more specialized configurations for different request types
  • Lower values = more general configurations across broader request types
  • Formula: Total Arms = configuration_count × allowed_models × system_prompt_count × 9

System Prompt Count

Impact: Expands configuration space with different instruction variants

  • Generates multiple system prompt variations per model/cluster combination
  • Each system prompt becomes part of separate arms
  • After convergence, system prompts are refined through reflection on best-performing arms (lib/server/middlewares/optimizer/system-prompt.ts:63-138)

Allowed Models

Impact: Multiplies the configuration space by testing different AI models

  • Each model gets its own set of arms with all hyperparameter combinations
  • Allows the system to learn which models perform best for different request types
  • Models are associated with skills via getSkillModels() (lib/server/optimization/skill-optimizations.ts:44)

Enabled Evaluation Methods

Impact: Defines how success is measured and rewards are calculated

  • Multiple evaluation methods can be enabled simultaneously
  • Each method produces a score (0-1) for a completed request
  • Scores are averaged into a single reward signal (lib/server/middlewares/optimizer/hyperparameters.ts:12-16)
  • This reward directly updates the Thompson Sampling statistic total_reward, which determines the alpha parameter of the Beta distribution
  • Different evaluation methods = different optimization objectives

Clustering Interval

Impact: Controls how often the system adapts to changing patterns

  • Triggers automatic re-clustering using K-Means++
  • Allows system to discover new request types over time
  • Old clusters are matched to new clusters to preserve learned statistics

Algorithm Flow with Configuration Parameters

Configuration Space Examples

The total number of arms (configurations) grows multiplicatively with each parameter:

Formula:

Total Arms = configuration_count × allowed_models × system_prompt_count × 9 base arms
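
A quick way to see the multiplication is a small helper like the sketch below. The SkillConfiguration shape is an illustrative assumption, not the actual settings type:

// Illustrative sketch: compute the total arm count from the configuration
// parameters described above. The SkillConfiguration shape is hypothetical.
interface SkillConfiguration {
  configurationCount: number;   // number of clusters
  allowedModels: string[];      // models to test
  systemPromptCount: number;    // prompt variants
}

const BASE_ARMS = 9; // 3 temperature ranges x 3 reasoning levels

function totalArms(config: SkillConfiguration): number {
  return (
    config.configurationCount *
    config.allowedModels.length *
    config.systemPromptCount *
    BASE_ARMS
  );
}

// Example: the "Medium" scenario below
// totalArms({ configurationCount: 3, allowedModels: ["a", "b", "c"], systemPromptCount: 3 }) === 243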

Example Configurations

Scenario      Clusters    Models    Prompts    Base Arms    Total Arms
Small         2           2         2          9            72
Medium        3           3         3          9            243
Large         5           4         4          9            720
Enterprise    10          5         5          9            2,250

Impact of increasing each parameter:

  • More clusters: Better segmentation of request types, but requires more data to converge
  • More models: Tests different AI providers/versions, increases cost but finds optimal model per cluster
  • More prompts: More instruction variations to explore, beneficial when task requires precise wording
  • Base arms: Fixed at 9 (3 temperature ranges × 3 reasoning levels)

Evaluation Methods and Reward Signal

The enabled evaluation methods directly determine what the system optimizes for. Here's how they create the reward signal:

Evaluation Flow

  1. Request Completes: AI response is generated using selected arm configuration
  2. Run Evaluations: Each enabled evaluation method runs independently (lib/server/middlewares/logs.ts:188-207)
  3. Score Generation: Each method produces a score between 0 and 1
  4. Reward Calculation: Scores are averaged into a single reward (see the sketch after this list)
  5. Statistical Update: Reward updates Thompson Sampling statistics
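
A minimal sketch of the averaging in steps 3-5 is shown below. The function name is illustrative; the actual averaging lives in lib/server/middlewares/optimizer/hyperparameters.ts:12-16, as noted above:

// Hypothetical sketch: average the per-method scores (each in [0, 1]) into
// the single reward that later feeds updateArmStats().
function calculateReward(evaluationScores: Record<string, number>): number {
  const scores = Object.values(evaluationScores);
  if (scores.length === 0) {
    return 0;
  }
  return scores.reduce((sum, score) => sum + score, 0) / scores.length;
}

// Matches Scenario 2 below:
// calculateReward({ accuracy: 0.9, latency: 0.75, cost: 0.6 }) === 0.75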

Example Evaluation Scenarios

Scenario 1: Single Evaluation Method (Accuracy)

Enabled Methods: [accuracy]
Evaluation Scores: {accuracy: 0.85}
Final Reward: 0.85
→ System optimizes purely for accuracy

Scenario 2: Multiple Evaluation Methods

Enabled Methods: [accuracy, latency, cost]
Evaluation Scores: {
  accuracy: 0.90,
  latency: 0.75,   (faster = higher score)
  cost: 0.60       (cheaper = higher score)
}
Final Reward: (0.90 + 0.75 + 0.60) / 3 = 0.75
→ System balances all three objectives

Scenario 3: Custom Evaluation Weights

Note: The current implementation uses simple averaging.
Future: Weighted averaging could be implemented to prioritize certain metrics.
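
If weighted averaging were added, it might look like the following sketch. This is purely illustrative; nothing like it exists in the codebase today:

// Hypothetical weighted variant (not implemented today). Weights are
// normalized so the reward stays in [0, 1].
function calculateWeightedReward(
  scores: Record<string, number>,
  weights: Record<string, number>,
): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const [method, score] of Object.entries(scores)) {
    const weight = weights[method] ?? 1;
    weightedSum += score * weight;
    totalWeight += weight;
  }
  return totalWeight > 0 ? weightedSum / totalWeight : 0;
}

// Example: weight accuracy twice as heavily as latency and cost
// calculateWeightedReward(
//   { accuracy: 0.9, latency: 0.75, cost: 0.6 },
//   { accuracy: 2, latency: 1, cost: 1 },
// ) === (1.8 + 0.75 + 0.6) / 4 === 0.7875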

Impact on Thompson Sampling

The reward directly affects Thompson Sampling's Beta distribution parameters:

// From lib/server/middlewares/idkhub-configuration.ts:30-33
const successes = arm.stats.total_reward;  // ← Sum of all rewards
const failures = arm.stats.n - arm.stats.total_reward;
const alpha = successes + 1;  // ← Drives Beta distribution
const beta = failures + 1;
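
For example, an arm with n = 10 and total_reward = 7 gets alpha = 8 and beta = 4, so its Beta(8, 4) posterior has mean 8 / 12 ≈ 0.67, close to the empirical reward rate of 0.7 but pulled slightly toward 0.5 by the uniform prior, and still wide enough that a less-tried arm can occasionally win the sample and get explored.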

Key Insight: Changing evaluation methods changes what "success" means, which changes which arms get selected over time.

Optimization Process

  1. Configuration Setup: Define skill description, clusters, models, prompts, and evaluation methods
  2. Initialization: Generate system prompts and create full arm configuration space
  3. Request Arrival: A new user request arrives
  4. Embedding Generation: Convert request to vector embedding
  5. Cluster Selection: Find the nearest cluster using cosine similarity
  6. Arm Selection: Use Thompson Sampling to select a configuration (model + prompt + hyperparameters)
  7. Execution: Execute the request with the selected configuration
  8. Evaluation: Run enabled evaluation methods and calculate reward
  9. Learning: Update arm statistics for future Thompson Sampling decisions
  10. Re-clustering: Periodically re-cluster to adapt to changing request patterns
  11. Reflection: After convergence, generate improved system prompts based on best-performing arms

Automatic Re-clustering

The system automatically re-clusters when the clustering_interval is reached (lib/server/middlewares/optimizer/clusters.ts:54-173). This allows the system to:

  • Adapt to changing request patterns
  • Discover new types of requests
  • Optimize configurations for emerging use cases

Benefits

  1. Adaptive Learning: Automatically learns which configurations work best
  2. Context-Aware: Different configurations for different types of requests
  3. Exploration/Exploitation Balance: Thompson Sampling naturally balances trying new configurations with using proven ones
  4. Scalable: Handles large configuration spaces efficiently
  5. Self-Improving: Performance improves over time as more data is collected