Jobs

A job is a unit of work you submit. Computalot breaks it into tasks scheduled across workers.

For sealed recipes, submit jobs via POST /api/v1/recipes/:name/jobs with a typed payload — no project or runner_command needed.

For projects, submit jobs via POST /api/v1/jobs with a project name, runner command, and payload.
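The project path is shown in full under "Submitting a job" below. For the sealed-recipe path, a minimal Python sketch using the requests library; the recipe name, the payload fields, and whether the typed payload nests under a "payload" key are assumptions, not documented values:

# Hedged sketch: submit a job to a sealed recipe. No project or
# runner_command is needed; the recipe name and fields are placeholders.
import os
import requests

resp = requests.post(
    "https://computalot.com/api/v1/recipes/image-classifier/jobs",  # placeholder recipe name
    headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
    json={"payload": {"dataset": "test_v3"}},  # envelope shape is an assumption
)
print(resp.json())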

Choosing a job type

I want to…                                      Use
Run one script with JSON input/output           structured_runner
Run code across a list of inputs                structured_runner + fan_out.by
Evaluate many tiny inputs per worker task       structured_runner + fan_out.by / fan_out.items + batch_size
Evaluate CMA/evolutionary candidates            structured_runner + fan_out.items
Train a model on a GPU                          structured_runner + profile: "gpu"
Search a parameter grid                         sweep
Run simulations and reduce results              map_reduce
Compare named strategies                        benchmark
Submit many jobs at once                        POST /api/v1/jobs/batch

When in doubt, use structured_runner.

Submitting a job

curl -sS https://computalot.com/api/v1/jobs \
  -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "structured_runner",
    "runner_command": ["python3", "evaluate.py"],
    "payload": {"model": "gpt-4", "dataset": "test_v3"},
    "project": "my-project",
    "timeout_s": 600
  }'

Sizing requirements

The requirements you specify are minimums, not exact machine picks. storage_gb should cover more than the dataset size: include your runtime footprint, writable caches, temp files, checkpoints, and sandbox/runtime overhead on the worker. Heavy ML runtimes often need tens of GB of free disk even before model weights are downloaded.
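As a rough worked example of that sizing advice, the sketch below adds up the contributors named above before setting storage_gb. Every number is illustrative, and placing storage_gb inside a requirements object is an assumption based on this section, not a documented request shape.

# Illustrative sizing arithmetic for storage_gb; all numbers are made up.
dataset_gb  = 12   # dataset you download or mount
runtime_gb  = 18   # heavy ML runtime and libraries on disk
scratch_gb  = 8    # writable caches, temp files, checkpoints
overhead_gb = 5    # sandbox / runtime overhead on the worker

# 43 GB total, well above the 12 GB dataset alone.
requirements = {"storage_gb": dataset_gb + runtime_gb + scratch_gb + overhead_gb}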

If one project revision runs on both lightweight CPU jobs and heavyweight GPU training jobs, split those runtimes when possible instead of sending one oversized environment everywhere. Smaller runtime footprints improve placement and reduce disk pressure on workers.

Runner protocol

  1. Computalot writes the task payload to a temp file
  2. Your script reads $COMPUTALOT_TASK_PAYLOAD (JSON input)
  3. Your script writes JSON to $COMPUTALOT_TASK_RESULT (output)
  4. Exit 0 = success, non-zero = failure

Report progress by printing to stdout:

print(f"COMPUTALOT_PROGRESS:{json.dumps({'step': 42, 'loss': 0.05})}")

Normal stdout/stderr now feeds the task's live_feedback.output_tail and the SSE job stream promptly while the task is still running, instead of waiting for the 30s heartbeat window. If your runner wraps another process, keep that child unbuffered or flush explicitly so Computalot can forward logs as they are produced.
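Putting the protocol together, here is a minimal runner sketch. It assumes $COMPUTALOT_TASK_PAYLOAD and $COMPUTALOT_TASK_RESULT hold file paths (per step 1, the payload is written to a temp file), and the payload and result field names are purely illustrative.

#!/usr/bin/env python3
# Hedged structured_runner sketch: env vars are assumed to hold file paths,
# and "inputs"/"results" are illustrative payload fields.
import json
import os
import sys

def main() -> int:
    # Step 2: read the JSON payload Computalot wrote for this task.
    with open(os.environ["COMPUTALOT_TASK_PAYLOAD"]) as f:
        payload = json.load(f)

    results = []
    for step, item in enumerate(payload.get("inputs", []), start=1):
        results.append({"input": item, "score": len(str(item))})  # placeholder work
        # Report progress; flush so logs are forwarded while the task runs.
        print(f"COMPUTALOT_PROGRESS:{json.dumps({'step': step})}", flush=True)

    # Step 3: write the JSON result where Computalot expects it.
    with open(os.environ["COMPUTALOT_TASK_RESULT"], "w") as f:
        json.dump({"results": results}, f)

    return 0  # Step 4: exit 0 signals success

if __name__ == "__main__":
    sys.exit(main())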

Job lifecycle

planning → queued → running → completed | partial | failed | cancelled

  • Poll: GET /api/v1/jobs/:id every 2-5s
  • Stream: GET /api/v1/jobs/:id/stream for SSE updates, including running-task output tails via task.live_feedback.output_tail
  • Batch watch: GET /api/v1/jobs/watch?ids=... for one SSE stream carrying client_ref, tags, meta, variant, aggregate summary fields, and persistence flags
  • Project stream: GET /api/v1/projects/:name/stream for one project-wide feed instead of per-job polling
  • Per-task detail: GET /api/v1/jobs/:id/tasks for live_feedback, latest_progress, checkpoint/resume state, and preserved last-failed-attempt diagnostics while a retry is queued or running
  • Canonical terminal results: GET /api/v1/results/:job_id for per-task results, aggregate fields, completeness, and artifact IDs
  • Output continuity: GET /api/v1/jobs/:id/output for aggregated stdout/stderr; during retries it preserves the most recent failed attempt until the current attempt emits new diagnostics
  • Artifacts: GET /api/v1/artifacts to list files from your jobs, then GET /api/v1/artifacts/:id to download them
  • Cancel: PUT /api/v1/jobs/:id/cancel
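A minimal sketch of the polling option above, using Python's requests library; the token handling mirrors the curl example earlier, and the "status" field name on the job resource is an assumption:

# Hedged sketch: poll GET /api/v1/jobs/:id every few seconds until the job
# reaches one of the terminal statuses from the lifecycle line above.
import os
import time
import requests

BASE = "https://computalot.com/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['TOKEN']}"}
TERMINAL = {"completed", "partial", "failed", "cancelled"}

def wait_for_job(job_id: str, interval_s: float = 3.0) -> dict:
    while True:
        job = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS).json()
        if job.get("status") in TERMINAL:  # "status" field name is assumed
            return job
        time.sleep(interval_s)             # stays within the 2-5s guidance

# Once terminal, fetch canonical results:
# results = requests.get(f"{BASE}/results/{job_id}", headers=HEADERS).json()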

Supported fan-out shapes

  • {"fan_out": {"by": "models"}} — split one payload array field into one task per item
  • {"fan_out": {"items": [{...}, {...}]}} — provide the exact payload object for each task
  • {"fan_out": {"chunks": 20, "range_field": "total_seeds", "total": 10000}} — split a numeric range into chunk tasks

These shapes are mutually exclusive: mixing by, items, or chunks + total in one request returns 422, so choose exactly one fan-out shape per job. Add batch_size or batch_per_task when each fan-out item is tiny and you want one dispatched task to process multiple items locally. Batched tasks receive payload._batch metadata.
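For instance, a hedged sketch of a fan_out.by submission with batching, reusing the request shape from "Submitting a job"; the model list, batch_size value, and requests-based client are illustrative choices:

# Hedged sketch: fan out one payload array field (fan_out.by) into tasks,
# letting each dispatched task process several items locally (batch_size).
import os
import requests

job = {
    "type": "structured_runner",
    "runner_command": ["python3", "evaluate.py"],
    "project": "my-project",
    "payload": {"models": ["gpt-4", "llama-3", "mistral"], "dataset": "test_v3"},
    "fan_out": {"by": "models"},  # one task per entry in payload["models"]
    "batch_size": 2,              # batched tasks receive payload._batch metadata
}

resp = requests.post(
    "https://computalot.com/api/v1/jobs",
    headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
    json=job,
)
print(resp.json())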

Public payload contract

Public job, task, watch, and result payloads keep your submitted payload, meta, variant, aggregate fields, and artifact IDs, but they redact placement-only fields such as current_node, provider IDs, runtime paths, and image refs/digests. Treat those public surfaces as the user contract; infrastructure placement stays internal to Computalot.

Tags, batch, webhooks, and dependencies

  • Tags: "tags": ["experiment_42"] then filter with GET /api/v1/jobs?tag=experiment_42
  • Batch: POST /api/v1/jobs/batch for up to 200 jobs in one request; successful entries preserve index, payload, meta, and variant
  • Webhooks: "callback_url": "https://..." for completion notifications
  • DAG: "depends_on": ["job_id"] to chain jobs
  • Dependency artifact handoff: downstream jobs can use _artifacts.download: {"dataset": {"job_id": "job_id", "artifact": "dataset"}} to fetch a named artifact from a completed dependency
  • Shared coordination: PUT /api/v1/projects/:name/kv/:key stores small project-scoped JSON values, and payload._shared.resolve injects them or dependency job result paths into payload._shared.values before dispatch
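To illustrate the dependency bullets, a hedged sketch that chains a downstream job to an upstream one and pulls its named artifact via _artifacts.download. The script names, artifact name, webhook URL, and the upstream response's "id" field are assumptions; _artifacts is assumed to live under payload like the other underscore-prefixed fields on this page.

# Hedged sketch: submit an upstream job, then a downstream job that depends
# on it and downloads its "dataset" artifact before running.
import os
import requests

BASE = "https://computalot.com/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['TOKEN']}"}

upstream = requests.post(f"{BASE}/jobs", headers=HEADERS, json={
    "type": "structured_runner",
    "runner_command": ["python3", "build_dataset.py"],  # illustrative script
    "project": "my-project",
    "payload": {"source": "raw_v3"},
    "tags": ["experiment_42"],
}).json()

downstream = requests.post(f"{BASE}/jobs", headers=HEADERS, json={
    "type": "structured_runner",
    "runner_command": ["python3", "evaluate.py"],
    "project": "my-project",
    "depends_on": [upstream["id"]],  # "id" response field is assumed
    "payload": {
        "_artifacts": {
            "download": {"dataset": {"job_id": upstream["id"], "artifact": "dataset"}}
        }
    },
    "callback_url": "https://example.com/hooks/computalot",  # completion webhook
}).json()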