Benchmarking
Compare two or more named strategies with replicated runs.
```bash
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://computalot.com/api/v1/jobs \
  -d '{
    "type": "benchmark",
    "runner_command": ["python", "evaluate.py"],
    "project": "my-project",
    "candidates": {
      "strategy_a": {"model": "gpt4", "temperature": 0.7},
      "strategy_b": {"model": "claude", "temperature": 0.5},
      "baseline": {"model": "random"}
    },
    "shared_payload": {"dataset": "test_set_v3", "n_trials": 100},
    "replicas": 3,
    "rank_by": "score",
    "timeout_s": 1800
  }'
```

Creates 9 tasks (3 candidates × 3 replicas). Results include a ranked leaderboard with per-candidate statistics.
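
Each task runs `runner_command` with that candidate's parameters merged over `shared_payload`. How the merged payload reaches the runner is not shown above; the sketch below is a minimal `evaluate.py` assuming it arrives as JSON on stdin and that the result, including the field named in `rank_by`, is reported as JSON on stdout. Both mechanisms, and the `run_trial` helper, are hypothetical illustrations rather than a documented contract.

```python
# evaluate.py -- minimal runner sketch.
# ASSUMPTION: the merged payload (shared_payload + candidate params)
# arrives as JSON on stdin, and the result is reported as JSON on
# stdout. Neither mechanism is confirmed by the reference above.
import json
import sys


def run_trial(model: str, payload: dict) -> float:
    # Hypothetical stand-in: replace with a real call to the model
    # under test that scores a single trial.
    return 0.5


def main() -> None:
    # e.g. {"dataset": "test_set_v3", "n_trials": 100,
    #       "model": "gpt4", "temperature": 0.7}
    payload = json.load(sys.stdin)

    scores = [run_trial(payload["model"], payload)
              for _ in range(payload["n_trials"])]

    # The "score" key matches the job's rank_by field.
    json.dump({"score": sum(scores) / len(scores)}, sys.stdout)


if __name__ == "__main__":
    main()
```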
Sweep vs. benchmark: use a sweep to explore a parameter grid; use a benchmark to compare a fixed set of named alternatives, with replicas supplying the statistical confidence behind the ranking.
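
As a rough illustration of what replication buys you, the sketch below computes per-candidate statistics client-side from raw replica scores: mean, sample standard deviation, and an approximate 95% confidence interval, then ranks candidates by mean. The input shape is a hypothetical example; the actual leaderboard format returned by the API is not specified in this section.

```python
# Hypothetical client-side ranking from raw replica scores. The input
# shape is illustrative; the API's actual result format is not
# documented in this section.
import statistics

replica_scores = {
    "strategy_a": [0.81, 0.79, 0.84],
    "strategy_b": [0.77, 0.80, 0.78],
    "baseline":   [0.52, 0.49, 0.50],
}


def summarize(scores: list[float]) -> dict:
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample std dev (n - 1)
    # Approximate normal 95% CI half-width; with only 3 replicas this
    # is crude (a t-based interval would be wider), but it shows the idea.
    half = 1.96 * stdev / len(scores) ** 0.5
    return {"mean": mean, "stdev": stdev, "ci95": (mean - half, mean + half)}


leaderboard = sorted(
    ((name, summarize(s)) for name, s in replica_scores.items()),
    key=lambda item: item[1]["mean"],
    reverse=True,
)

for rank, (name, stats) in enumerate(leaderboard, start=1):
    lo, hi = stats["ci95"]
    print(f"{rank}. {name}: {stats['mean']:.3f} "
          f"± {stats['stdev']:.3f} (95% CI {lo:.3f}, {hi:.3f})")
```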