Benchmarking
Compare two or more named strategies with replicated runs.
```bash
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://computalot.com/api/v1/jobs \
  -d '{
    "type": "benchmark",
    "runner_command": ["python", "evaluate.py"],
    "project": "my-project",
    "candidates": {
      "strategy_a": {"model": "gpt4", "temperature": 0.7},
      "strategy_b": {"model": "claude", "temperature": 0.5},
      "baseline": {"model": "random"}
    },
    "shared_payload": {"dataset": "test_set_v3", "n_trials": 100},
    "replicas": 3,
    "rank_by": "score",
    "timeout_s": 1800
  }'
```

Creates 9 tasks (3 candidates × 3 replicas). Results include a ranked leaderboard with per-candidate statistics.
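
Each task runs `runner_command` with that candidate's parameters merged over `shared_payload`. How the merged payload reaches the runner is not shown above; the sketch below is a minimal `evaluate.py` assuming it arrives as JSON on stdin and that the result, including the field named in `rank_by`, is reported as JSON on stdout. Both mechanisms, and the `run_trial` helper, are hypothetical illustrations rather than a documented contract.

```python
# evaluate.py -- minimal runner sketch.
# ASSUMPTION: the merged payload (shared_payload + candidate params)
# arrives as JSON on stdin, and the result is reported as JSON on
# stdout. Neither mechanism is confirmed by the reference above.
import json
import sys


def run_trial(model: str, payload: dict) -> float:
    # Hypothetical stand-in: replace with a real call to the model
    # under test that scores a single trial.
    return 0.5


def main() -> None:
    # e.g. {"dataset": "test_set_v3", "n_trials": 100,
    #       "model": "gpt4", "temperature": 0.7}
    payload = json.load(sys.stdin)

    scores = [run_trial(payload["model"], payload)
              for _ in range(payload["n_trials"])]

    # The "score" key matches the job's rank_by field.
    json.dump({"score": sum(scores) / len(scores)}, sys.stdout)


if __name__ == "__main__":
    main()
```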
Sweep vs. benchmark: use a sweep to explore a parameter grid; use a benchmark to compare a fixed set of named alternatives, with replicas supplying the statistical confidence behind the ranking.
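
As a rough illustration of what replication buys you, the sketch below computes per-candidate statistics client-side from raw replica scores: mean, sample standard deviation, and an approximate 95% confidence interval, then ranks candidates by mean. The input shape is a hypothetical example; the actual leaderboard format returned by the API is not specified in this section.

```python
# Hypothetical client-side ranking from raw replica scores. The input
# shape is illustrative; the API's actual result format is not
# documented in this section.
import statistics

replica_scores = {
    "strategy_a": [0.81, 0.79, 0.84],
    "strategy_b": [0.77, 0.80, 0.78],
    "baseline":   [0.52, 0.49, 0.50],
}


def summarize(scores: list[float]) -> dict:
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample std dev (n - 1)
    # Approximate normal 95% CI half-width; with only 3 replicas this
    # is crude (a t-based interval would be wider), but it shows the idea.
    half = 1.96 * stdev / len(scores) ** 0.5
    return {"mean": mean, "stdev": stdev, "ci95": (mean - half, mean + half)}


leaderboard = sorted(
    ((name, summarize(s)) for name, s in replica_scores.items()),
    key=lambda item: item[1]["mean"],
    reverse=True,
)

for rank, (name, stats) in enumerate(leaderboard, start=1):
    lo, hi = stats["ci95"]
    print(f"{rank}. {name}: {stats['mean']:.3f} "
          f"± {stats['stdev']:.3f} (95% CI {lo:.3f}, {hi:.3f})")
```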