
Retries & timeouts

A task that exits non-zero can be retried automatically. A run that takes too long can be killed. Both are configured per-task, with sensible defaults from [defaults] if you leave them off.

Retries apply only to tasks. Services don’t retry — they restart instead.

[tasks.publish-feed]
cron = "*/15 * * * *"
run = "/usr/local/bin/publish.sh"
retry_attempts = 3
retry_delay = "30s"
retry_backoff = "exponential"
  • Default: 0 (no retries).
  • Semantics: the number of additional attempts after the first failure.
  • Triggers a retry: the run ended with failed (non-zero exit), timeout, crashed, or log_overflow (cancelled by log_on_full = "kill_task" after exceeding log_max_size).
  • Does not trigger a retry: success, stopped (manual stop via the API/CLI/UI, or a sibling run cancelled it via on_overlap = "terminate"), or skipped (on_overlap = "skip" rejected the firing because another run was still going). Stopped is a deliberate human action; skipped means the original run is still running and another attempt would just race it.
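The trigger rules above boil down to a small predicate. The sketch below is illustrative only — `should_retry` is not the daemon's actual API, just the documented end reasons expressed as code:

```python
# End reasons that trigger a retry vs. those that end the chain,
# per the rules above. should_retry is a hypothetical helper.
RETRYABLE = {"failed", "timeout", "crashed", "log_overflow"}
NOT_RETRYABLE = {"success", "stopped", "skipped"}

def should_retry(end_reason: str, attempt: int, retry_attempts: int) -> bool:
    """attempt is 0 for the first try; retry_attempts counts extra attempts."""
    return end_reason in RETRYABLE and attempt < retry_attempts

print(should_retry("failed", 0, 3))   # first failure, retries remain -> True
print(should_retry("stopped", 0, 3))  # manual stop ends the chain -> False
```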

retry_delay is a duration string ("5s", "2m", "1h"). When retries are enabled, retry_delay defaults to "5s".
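As a rough sketch, the duration grammar shown above ("5s", "2m", "1h") can be parsed like this. `parse_duration` is an illustrative name, not the daemon's API, and the real parser may accept more forms:

```python
import re

# Unit suffixes taken from the examples in the text above.
UNITS = {"s": 1, "m": 60, "h": 3600}

def parse_duration(text: str) -> int:
    """Parse a duration string like "30s" or "2m" into seconds."""
    m = re.fullmatch(r"(\d+)([smh])", text)
    if m is None:
        raise ValueError(f"bad duration: {text!r}")
    return int(m.group(1)) * UNITS[m.group(2)]

print(parse_duration("30s"), parse_duration("2m"), parse_duration("1h"))
```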

retry_backoff chooses how the wait between retries grows. The same values are accepted by services’ restart_backoff, so the vocabulary carries over cleanly between the two contexts.

Value               | Wait before attempt N (1-indexed retries) | With retry_delay = "10s"
constant (or unset) | constant delay                            | 10s, 10s, 10s …
linear              | delay × N                                 | 10s, 20s, 30s, 40s …
exponential         | delay × 2^(N-1)                           | 10s, 20s, 40s, 80s, 160s, 300s …

All schedules are capped at 5 minutes. exponential with a short base hits the cap and stays there.
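The three schedules plus the cap can be sketched in a few lines. `backoff_delay` and `CAP_SECONDS` are illustrative names, not the daemon's internals:

```python
CAP_SECONDS = 300  # all schedules are capped at 5 minutes

def backoff_delay(base: int, attempt: int, mode: str = "constant") -> int:
    """Seconds to wait before retry attempt `attempt` (1-indexed)."""
    if mode == "linear":
        delay = base * attempt
    elif mode == "exponential":
        delay = base * 2 ** (attempt - 1)
    else:  # "constant" or unset
        delay = base
    return min(delay, CAP_SECONDS)

# retry_delay = "10s" with exponential backoff: 10, 20, 40, 80, 160,
# then pinned at the 300s cap.
print([backoff_delay(10, n, "exponential") for n in range(1, 8)])
```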

Each attempt is a separate run, with its own ID, exit code, and captured log file. The Web UI lists every attempt under the task’s run history, numbered by retry_attempt (0 for the first try, 1 for the first retry, and so on). That’s deliberate: if attempt 1 silently corrupted state and attempt 2 succeeded, you can still go back and read attempt 1’s stderr.

[tasks.heavy-job]
cron = "0 3 * * *"
run = "/usr/local/bin/heavy-job.sh"
timeout = "30m"
  • Same duration syntax as retry_delay ("30s", "5m", "1h").
  • Default: inherited from [defaults] timeout if set; otherwise no timeout — the run is allowed to take as long as it likes.
  • Scope: per attempt. A retry gets a fresh timeout window; time spent waiting in retry_delay doesn’t count against it.
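For example, the inheritance can look like this (quick-sync is a hypothetical task added for contrast; heavy-job mirrors the example above):

```toml
[defaults]
timeout = "1h"               # inherited by every task that doesn't set its own

[tasks.quick-sync]           # hypothetical task: no timeout key, inherits "1h"
cron = "*/5 * * * *"
run = "/usr/local/bin/sync.sh"

[tasks.heavy-job]
cron = "0 3 * * *"
run = "/usr/local/bin/heavy-job.sh"
timeout = "30m"              # overrides the default; measured per attempt
```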

When the deadline hits, the daemon SIGTERMs the run’s process group, waits up to the task’s graceful_stop (default "5s"), then SIGKILLs any survivors and records the run with end reason timeout. The same SIGTERM-then-wait flow applies to on_overlap = "terminate", manual stops, and daemon shutdown — graceful_stop is the single knob.

  • on_overlap = "terminate" plus retries. If a new firing terminates the running attempt, that attempt records end reason stopped — which blocks any further retries. The new run from the terminate policy is a fresh execution, not a retry.
  • Manual stop. Same story: stopping a run from the API/UI records stopped and ends the retry chain.
  • max_concurrent > 1. Retries don’t count against max_concurrent. A retry only fires after its predecessor has finished, so there’s no overlap to evaluate.

Services have a different model because they’re meant to stay up:

Tasks and services share the same backoff vocabulary — constant / linear / exponential — so one rule is easier to remember. Tasks add retry_attempts (services run forever); services add restart_delay (the supervisor owns the cadence).

Field           | Tasks                                                | Services
retry_attempts  | ✅ default 0                                          | ❌ rejected
retry_delay     | ✅ default 5s                                         | ❌ rejected
retry_backoff   | ✅ constant / linear / exponential, default constant  | ❌ rejected
restart_delay   | ❌ rejected                                           | ✅ default 1s
restart_backoff | ❌ rejected                                           | ✅ constant / linear / exponential, default exponential
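Putting the service-side fields together — the service name and values here are illustrative, not defaults:

```toml
[services.queue-worker]        # hypothetical service
run = "/usr/local/bin/worker"
restart_delay = "2s"           # base wait between restarts (default "1s")
restart_backoff = "linear"     # waits grow 2s, 4s, 6s … between crashes
```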

A service supervisor restarts a replica forever (with bounded exponential backoff) until you stop it explicitly.

A replica that stays up at least backoff_reset_after (default 60s) resets its restart counter, so a service that fails repeatedly at first doesn’t keep slow restart delays once it stabilises. Configure it in [defaults] (applies to every service that doesn’t override it) or per-service:

[defaults]
backoff_reset_after = "30s"   # global default

[services.flaky-worker]
run = "/usr/local/bin/worker"
backoff_reset_after = "2m"    # this one needs longer to call "stable"
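The reset rule can be sketched as follows. `next_restart_count` is an illustrative name and the real supervisor's bookkeeping may differ, but the decision is the one described above:

```python
def next_restart_count(uptime_s: float, reset_after_s: float, count: int) -> int:
    """Restart counter to use after a replica exits.

    A replica that stayed up at least backoff_reset_after counts as
    stable, so the backoff schedule starts over from the beginning.
    """
    if uptime_s >= reset_after_s:
        return 1          # stable run: reset the counter
    return count + 1      # crashed quickly: keep escalating

print(next_restart_count(120.0, 60.0, count=7))  # stable -> reset to 1
print(next_restart_count(3.0, 60.0, count=7))    # crashed fast -> 8
```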

If you want “finite, escalating wait” semantics on a service, model it as a task with cron/retry_attempts instead. If you want “indefinite self-healing supervision” on a workload, that’s [services.*].