
Benchmarking

Compare two or more named strategies with replicated runs.

curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://computalot.com/api/v1/jobs \
  -d '{
    "type": "benchmark",
    "runner_command": ["python", "evaluate.py"],
    "project": "my-project",
    "candidates": {
      "strategy_a": {"model": "gpt4", "temperature": 0.7},
      "strategy_b": {"model": "claude", "temperature": 0.5},
      "baseline": {"model": "random"}
    },
    "shared_payload": {"dataset": "test_set_v3", "n_trials": 100},
    "replicas": 3,
    "rank_by": "score",
    "timeout_s": 1800
  }'

Creates 9 tasks (3 candidates x 3 replicas). Results include a ranked leaderboard with per-candidate statistics.
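The leaderboard aggregation can be sketched as follows. This is a minimal illustration, not the service's implementation: the per-replica result shape (a dict with a `score` key, matching `rank_by` above) and the ranking direction (higher is better) are assumptions.

```python
from statistics import mean, stdev

def leaderboard(results, rank_by="score"):
    """Aggregate per-replica results into a ranked leaderboard.

    `results` maps candidate name -> list of per-replica metric dicts.
    Candidates are ranked by the mean of the `rank_by` metric, descending.
    """
    rows = []
    for name, replicas in results.items():
        scores = [r[rank_by] for r in replicas]
        rows.append({
            "candidate": name,
            "mean": mean(scores),
            # stdev needs at least two samples; fall back to 0.0 for a single replica
            "std": stdev(scores) if len(scores) > 1 else 0.0,
            "n": len(scores),
        })
    return sorted(rows, key=lambda row: row["mean"], reverse=True)

# Hypothetical results for the job above: 3 candidates x 3 replicas.
results = {
    "strategy_a": [{"score": 0.81}, {"score": 0.79}, {"score": 0.80}],
    "strategy_b": [{"score": 0.75}, {"score": 0.77}, {"score": 0.76}],
    "baseline":   [{"score": 0.50}, {"score": 0.52}, {"score": 0.48}],
}
board = leaderboard(results)
print(board[0]["candidate"])  # strategy_a ranks first on mean score
```

Replicating each candidate is what makes the per-candidate standard deviation meaningful: with a single run you cannot tell a real difference from run-to-run noise.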

Sweep vs Benchmark: Use sweep for exploring a parameter grid. Use benchmark for comparing named alternatives with statistical confidence.
