Reactive Agents

Optimization Algorithm

Understanding the performance impact of configuration parameters

Overview

Reactive Agents uses a sophisticated optimization algorithm that combines Thompson Sampling (a Bayesian multi-armed bandit approach) with K-Means++ clustering to automatically learn and select the best hyperparameter configurations for different types of user requests.

Core Components

1. Thompson Sampling (Multi-Armed Bandit)

The system treats each hyperparameter configuration as an "arm" in a multi-armed bandit problem. Thompson Sampling is used to balance exploration (trying new configurations) with exploitation (using known good configurations).

Implementation (lib/server/middlewares/idkhub-configuration.ts:30-56):

function getOptimalArm(arms: SkillOptimizationArm[]): SkillOptimizationArm {
  // Implement Thompson Sampling algorithm for multi-armed bandit
  // Thompson Sampling uses Bayesian approach: sample from posterior Beta distribution
  // and select the arm with highest sampled value

  let optimalArm = arms[0];
  let maxSample = -Infinity;

  for (const arm of arms) {
    // Beta distribution parameters with uniform prior (Beta(1,1))
    // alpha = successes + 1, beta = failures + 1
    const successes = arm.stats.total_reward;
    const failures = arm.stats.n - arm.stats.total_reward;
    const alpha = successes + 1;
    const beta = failures + 1;

    // Sample from Beta(alpha, beta)
    const sample = sampleBeta(alpha, beta);

    if (sample > maxSample) {
      maxSample = sample;
      optimalArm = arm;
    }
  }

  return optimalArm;
}
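
The sampleBeta helper is referenced above but not shown. A minimal sketch of how such a helper could be implemented without an external statistics library, using the Marsaglia-Tsang Gamma sampler (valid here because alpha = successes + 1 and beta = failures + 1 are always >= 1), is:

// Hypothetical sketch of the sampleBeta helper used above. A Beta(alpha, beta)
// variate can be drawn as X / (X + Y) where X ~ Gamma(alpha, 1) and
// Y ~ Gamma(beta, 1).
function sampleGamma(shape: number): number {
  // Marsaglia-Tsang method, valid for shape >= 1
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);

  for (;;) {
    let x: number;
    let v: number;
    do {
      // Standard normal draw via Box-Muller
      const u1 = 1 - Math.random();
      const u2 = Math.random();
      x = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
      v = 1 + c * x;
    } while (v <= 0);
    v = v * v * v;

    const u = Math.random();
    if (u < 1 - 0.0331 * x ** 4 || Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) {
      return d * v;
    }
  }
}

function sampleBeta(alpha: number, beta: number): number {
  const x = sampleGamma(alpha);
  const y = sampleGamma(beta);
  return x / (x + y);
}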

Key Characteristics:

  • Uses Beta distribution with uniform prior Beta(1,1)
  • Samples from posterior distribution to balance exploration and exploitation
  • Selects arm with highest sampled value
  • Automatically adapts based on observed rewards

2. K-Means++ Clustering

User requests are grouped by semantic similarity using K-Means++ clustering on embeddings. This allows the system to learn different optimal configurations for different types of requests.

Implementation (lib/server/utils/math.ts:100-171):

export function kMeansClustering(
  embeddings: number[][],
  k: number,
  maxIterations = 100,
): ClusterResult {
  const n = embeddings.length;

  if (k >= n) {
    // Each point is its own cluster
    return {
      clusters: Array.from({ length: n }, (_, i) => i),
      centroids: embeddings.map((e) => [...e]),
      iterations: 0,
    };
  }

  // Initialize centroids using k-means++
  let centroids = initializeCentroidsKMeansPlusPlus(embeddings, k);
  const clusters = new Array(n).fill(0);

  for (let iteration = 0; iteration < maxIterations; iteration++) {
    let changed = false;

    // Assign each point to the nearest centroid
    for (let i = 0; i < n; i++) {
      let minDistance = Infinity;
      let nearestCluster = 0;

      for (let c = 0; c < k; c++) {
        const distance = calculateDistance(embeddings[i], centroids[c]);
        if (distance < minDistance) {
          minDistance = distance;
          nearestCluster = c;
        }
      }

      if (clusters[i] !== nearestCluster) {
        clusters[i] = nearestCluster;
        changed = true;
      }
    }

    // If no assignments changed, we've converged
    if (!changed) {
      return { clusters, centroids, iterations: iteration + 1 };
    }

    // Update centroids
    const newCentroids: number[][] = [];
    for (let c = 0; c < k; c++) {
      const clusterPoints = embeddings.filter((_, i) => clusters[i] === c);
      if (clusterPoints.length > 0) {
        newCentroids.push(calculateCentroid(clusterPoints));
      } else {
        // Keep the old centroid if no points are assigned to this cluster
        newCentroids.push([...centroids[c]]);
      }
    }

    centroids = newCentroids;
  }

  return { clusters, centroids, iterations: maxIterations };
}
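
The initializeCentroidsKMeansPlusPlus helper is referenced but not shown. A minimal sketch of k-means++ seeding, assuming squared-Euclidean-distance weighting, is:

// Hypothetical sketch of k-means++ seeding: the first centroid is a uniformly
// random point; each subsequent centroid is chosen with probability
// proportional to its squared distance from the nearest already-chosen centroid.
function initializeCentroidsKMeansPlusPlus(
  embeddings: number[][],
  k: number,
): number[][] {
  const centroids: number[][] = [];
  const firstIndex = Math.floor(Math.random() * embeddings.length);
  centroids.push([...embeddings[firstIndex]]);

  while (centroids.length < k) {
    // Squared distance from each point to its nearest chosen centroid
    const distances = embeddings.map((point) =>
      Math.min(
        ...centroids.map((centroid) =>
          point.reduce((sum, value, i) => sum + (value - centroid[i]) ** 2, 0),
        ),
      ),
    );
    const total = distances.reduce((a, b) => a + b, 0);

    // Weighted random selection proportional to squared distance
    let threshold = Math.random() * total;
    let nextIndex = 0;
    for (let i = 0; i < distances.length; i++) {
      threshold -= distances[i];
      if (threshold <= 0) {
        nextIndex = i;
        break;
      }
    }
    centroids.push([...embeddings[nextIndex]]);
  }

  return centroids;
}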

Key Characteristics:

  • K-Means++ initialization for better initial centroid selection
  • Euclidean distance-based clustering
  • Automatic convergence detection
  • Maximum 100 iterations limit

3. Cluster Selection

For each incoming request, the system finds the most relevant cluster using cosine similarity:

function getOptimalCluster(
  embedding: number[],
  clusters: SkillOptimizationCluster[],
): SkillOptimizationCluster {
  // Find the cluster with the highest cosine similarity to the embedding
  let optimalCluster = clusters[0];
  let maxSimilarity = -1;

  for (const cluster of clusters) {
    const similarity = cosineSimilarity(embedding, cluster.centroid);
    if (similarity > maxSimilarity) {
      maxSimilarity = similarity;
      optimalCluster = cluster;
    }
  }

  return optimalCluster;
}
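
The cosineSimilarity helper is the standard dot-product-over-norms computation; a minimal sketch of such a helper is:

// Hypothetical sketch of a cosine similarity helper:
// dot(a, b) / (||a|| * ||b||), returning 0 when either vector has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}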

4. Statistical Updates

After each request, the system updates arm statistics using incremental formulas:

Implementation (lib/server/middlewares/optimizer/hyperparameters.ts:19-38):

export async function updateArmStats(
  userDataStorageConnector: UserDataStorageConnector,
  arm: SkillOptimizationArm,
  reward: number,
) {
  // Update arm statistics using incremental update formulas for Thompson Sampling
  const newN = arm.stats.n + 1;
  const newTotalReward = arm.stats.total_reward + reward;
  const newMean = newTotalReward / newN;
  const newN2 = arm.stats.n2 + reward * reward;

  await userDataStorageConnector.updateSkillOptimizationArm(arm.id, {
    stats: {
      n: newN,
      mean: newMean,
      n2: newN2,
      total_reward: newTotalReward,
    },
  });
}
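
For example, an arm with stats {n: 4, total_reward: 3, mean: 0.75, n2: 2.5} (say, from rewards 1, 1, 0.5, 0.5) that receives a new reward of 0.5 is updated to {n: 5, total_reward: 3.5, mean: 0.7, n2: 2.75}.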

Configuration Space

The system optimizes across 9 base configurations (lib/server/optimization/base-arms.ts), varying:

Configuration    Temperature Range    Reasoning Level
Arm 1            0 - 0.33             0 (no reasoning)
Arm 2            0 - 0.33             0.5 (moderate)
Arm 3            0 - 0.33             1 (full reasoning)
Arm 4            0.34 - 0.66          0
Arm 5            0.34 - 0.66          0.5
Arm 6            0.34 - 0.66          1
Arm 7            0.67 - 1             0
Arm 8            0.67 - 1             0.5
Arm 9            0.67 - 1             1
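
As an illustration of this base grid, the 9 arms could be enumerated as the cross product of 3 temperature ranges and 3 reasoning levels. The names below are illustrative, not the actual contents of lib/server/optimization/base-arms.ts:

// Illustrative sketch only: enumerate the 9 base arms as
// 3 temperature ranges x 3 reasoning levels.
interface BaseArm {
  temperatureRange: [number, number];
  reasoningLevel: number;
}

const temperatureRanges: [number, number][] = [
  [0, 0.33],
  [0.34, 0.66],
  [0.67, 1],
];
const reasoningLevels = [0, 0.5, 1];

const baseArms: BaseArm[] = temperatureRanges.flatMap((range) =>
  reasoningLevels.map((level) => ({
    temperatureRange: range,
    reasoningLevel: level,
  })),
);
// baseArms.length === 9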

These base configurations are further combined with different:

  • Models (e.g., GPT-4, Claude, etc.)
  • System prompts
  • Other hyperparameters

Together, these combinations create a large configuration space to explore and optimize.

How Configuration Parameters Affect the Algorithm

The optimization system is highly configurable. Here's how each parameter influences the algorithm:

Skill Description

Impact: Guides initial system prompt generation and defines the task context

  • Used to generate seed system prompts via an LLM (lib/server/optimization/utils/system-prompt.ts:20-24)
  • Provides context for what the skill should accomplish
  • Influences the quality of initial configurations before optimization begins

Configuration Count (Number of Clusters)

Impact: Determines how finely request types are segmented

  • Directly sets k in K-Means++ clustering (lib/server/middlewares/optimizer/clusters.ts:34-52)
  • Higher values = more specialized configurations for different request types
  • Lower values = more general configurations across broader request types
  • Formula: Total Arms = configuration_count × allowed_models × system_prompt_count × 9

System Prompt Count

Impact: Expands configuration space with different instruction variants

  • Generates multiple system prompt variations per model/cluster combination
  • Each system prompt becomes part of separate arms
  • After convergence, system prompts are refined through reflection on best-performing arms (lib/server/middlewares/optimizer/system-prompt.ts:63-138)

Allowed Models

Impact: Multiplies the configuration space by testing different AI models

  • Each model gets its own set of arms with all hyperparameter combinations
  • Allows the system to learn which models perform best for different request types
  • Models are associated with skills via getSkillModels() (lib/server/optimization/skill-optimizations.ts:44)

Enabled Evaluation Methods

Impact: Defines how success is measured and rewards are calculated

  • Multiple evaluation methods can be enabled simultaneously
  • Each method produces a score (0-1) for a completed request
  • Scores are averaged into a single reward signal (lib/server/middlewares/optimizer/hyperparameters.ts:12-16)
  • This reward directly updates the Thompson Sampling statistic total_reward, which determines the alpha parameter of the Beta distribution
  • Different evaluation methods = different optimization objectives

Clustering Interval

Impact: Controls how often the system adapts to changing patterns

  • Triggers automatic re-clustering using K-Means++
  • Allows system to discover new request types over time
  • Old clusters are matched to new clusters to preserve learned statistics

Algorithm Flow with Configuration Parameters

Configuration Space Examples

The total number of arms (configurations) grows multiplicatively with each parameter:

Formula:

Total Arms = configuration_count × allowed_models × system_prompt_count × 9 base arms
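
A quick way to see the multiplication is a small helper like the sketch below. The SkillConfiguration shape is an illustrative assumption, not the actual settings type:

// Illustrative sketch: compute the total arm count from the configuration
// parameters described above. The SkillConfiguration shape is hypothetical.
interface SkillConfiguration {
  configurationCount: number;   // number of clusters
  allowedModels: string[];      // models to test
  systemPromptCount: number;    // prompt variants
}

const BASE_ARMS = 9; // 3 temperature ranges x 3 reasoning levels

function totalArms(config: SkillConfiguration): number {
  return (
    config.configurationCount *
    config.allowedModels.length *
    config.systemPromptCount *
    BASE_ARMS
  );
}

// Example: the "Medium" scenario below
// totalArms({ configurationCount: 3, allowedModels: ["a", "b", "c"], systemPromptCount: 3 }) === 243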

Example Configurations

Scenario      Clusters    Models    Prompts    Base Arms    Total Arms
Small         2           2         2          9            72
Medium        3           3         3          9            243
Large         5           4         4          9            720
Enterprise    10          5         5          9            2,250

Impact of increasing each parameter:

  • More clusters: Better segmentation of request types, but requires more data to converge
  • More models: Tests different AI providers/versions, increases cost but finds optimal model per cluster
  • More prompts: More instruction variations to explore, beneficial when task requires precise wording
  • Base arms: Fixed at 9 (3 temperature ranges × 3 reasoning levels)

Evaluation Methods and Reward Signal

The enabled evaluation methods directly determine what the system optimizes for. Here's how they create the reward signal:

Evaluation Flow

  1. Request Completes: AI response is generated using selected arm configuration
  2. Run Evaluations: Each enabled evaluation method runs independently (lib/server/middlewares/logs.ts:188-207)
  3. Score Generation: Each method produces a score between 0 and 1
  4. Reward Calculation: Scores are averaged into a single reward (see the sketch after this list)
  5. Statistical Update: Reward updates Thompson Sampling statistics
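
A minimal sketch of the averaging in steps 3-5 is shown below. The function name is illustrative; the actual averaging lives in lib/server/middlewares/optimizer/hyperparameters.ts:12-16, as noted above:

// Hypothetical sketch: average the per-method scores (each in [0, 1]) into
// the single reward that later feeds updateArmStats().
function calculateReward(evaluationScores: Record<string, number>): number {
  const scores = Object.values(evaluationScores);
  if (scores.length === 0) {
    return 0;
  }
  return scores.reduce((sum, score) => sum + score, 0) / scores.length;
}

// Matches Scenario 2 below:
// calculateReward({ accuracy: 0.9, latency: 0.75, cost: 0.6 }) === 0.75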

Example Evaluation Scenarios

Scenario 1: Single Evaluation Method (Accuracy)

Enabled Methods: [accuracy]
Evaluation Scores: {accuracy: 0.85}
Final Reward: 0.85
→ System optimizes purely for accuracy

Scenario 2: Multiple Evaluation Methods

Enabled Methods: [accuracy, latency, cost]
Evaluation Scores: {
  accuracy: 0.90,
  latency: 0.75,   (faster = higher score)
  cost: 0.60       (cheaper = higher score)
}
Final Reward: (0.90 + 0.75 + 0.60) / 3 = 0.75
→ System balances all three objectives

Scenario 3: Custom Evaluation Weights

Note: The current implementation uses simple averaging.
Future: Weighted averaging could be implemented to prioritize certain metrics.
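
If weighted averaging were added, it might look like the following sketch. This is purely illustrative; nothing like it exists in the codebase today:

// Hypothetical weighted variant (not implemented today). Weights are
// normalized so the reward stays in [0, 1].
function calculateWeightedReward(
  scores: Record<string, number>,
  weights: Record<string, number>,
): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const [method, score] of Object.entries(scores)) {
    const weight = weights[method] ?? 1;
    weightedSum += score * weight;
    totalWeight += weight;
  }
  return totalWeight > 0 ? weightedSum / totalWeight : 0;
}

// Example: weight accuracy twice as heavily as latency and cost
// calculateWeightedReward(
//   { accuracy: 0.9, latency: 0.75, cost: 0.6 },
//   { accuracy: 2, latency: 1, cost: 1 },
// ) === (1.8 + 0.75 + 0.6) / 4 === 0.7875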

Impact on Thompson Sampling

The reward directly affects Thompson Sampling's Beta distribution parameters:

// From lib/server/middlewares/idkhub-configuration.ts:30-33
const successes = arm.stats.total_reward;  // ← Sum of all rewards
const failures = arm.stats.n - arm.stats.total_reward;
const alpha = successes + 1;  // ← Drives Beta distribution
const beta = failures + 1;
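
For example, an arm with n = 10 and total_reward = 7 gets alpha = 8 and beta = 4, so its Beta(8, 4) posterior has mean 8 / 12 ≈ 0.67, close to the empirical reward rate of 0.7 but pulled slightly toward 0.5 by the uniform prior, and still wide enough that a less-tried arm can occasionally win the sample and get explored.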

Key Insight: Changing evaluation methods changes what "success" means, which changes which arms get selected over time.

Optimization Process

  1. Configuration Setup: Define skill description, clusters, models, prompts, and evaluation methods
  2. Initialization: Generate system prompts and create full arm configuration space
  3. Request Arrival: A new user request arrives
  4. Embedding Generation: Convert request to vector embedding
  5. Cluster Selection: Find the nearest cluster using cosine similarity
  6. Arm Selection: Use Thompson Sampling to select a configuration (model + prompt + hyperparameters)
  7. Execution: Execute the request with the selected configuration
  8. Evaluation: Run enabled evaluation methods and calculate reward
  9. Learning: Update arm statistics for future Thompson Sampling decisions
  10. Re-clustering: Periodically re-cluster to adapt to changing request patterns
  11. Reflection: After convergence, generate improved system prompts based on best-performing arms

Automatic Re-clustering

The system automatically re-clusters when the clustering_interval is reached (lib/server/middlewares/optimizer/clusters.ts:54-173). This allows the system to:

  • Adapt to changing request patterns
  • Discover new types of requests
  • Optimize configurations for emerging use cases

Benefits

  1. Adaptive Learning: Automatically learns which configurations work best
  2. Context-Aware: Different configurations for different types of requests
  3. Exploration/Exploitation Balance: Thompson Sampling naturally balances trying new configurations with using proven ones
  4. Scalable: Handles large configuration spaces efficiently
  5. Self-Improving: Performance improves over time as more data is collected