
Retries & timeouts

A task that exits non-zero can be retried automatically. A run that takes too long can be killed. Both are configured per-task, with sensible defaults from [defaults] if you leave them off.

Retries apply only to tasks. Services don’t retry — they restart instead.

[tasks.publish-feed]
cron = "*/15 * * * *"
run = "/usr/local/bin/publish.sh"
retry_attempts = 3
retry_delay = "30s"
retry_backoff = "exponential"
  • Default: 0 (no retries).
  • Semantics: the number of additional attempts after the first failure.
  • Triggers a retry: the run ended with failed (non-zero exit), timeout, crashed, or log_overflow (cancelled by log_on_full = "kill_task" after exceeding log_max_size).
  • Does not trigger a retry: success, stopped (manual stop via the API/CLI/UI, or a sibling run cancelled it via on_overlap = "terminate"), or skipped (on_overlap = "skip" rejected the firing because another run was still going). Stopped is a deliberate human action; skipped means the original run is still running and another attempt would just race it.
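The trigger rules above boil down to a small predicate. The sketch below is illustrative only — `should_retry` is not the daemon's actual API, just the documented end reasons expressed as code:

```python
# End reasons that trigger a retry vs. those that end the chain,
# per the rules above. should_retry is a hypothetical helper.
RETRYABLE = {"failed", "timeout", "crashed", "log_overflow"}
NOT_RETRYABLE = {"success", "stopped", "skipped"}

def should_retry(end_reason: str, attempt: int, retry_attempts: int) -> bool:
    """attempt is 0 for the first try; retry_attempts counts extra attempts."""
    return end_reason in RETRYABLE and attempt < retry_attempts

print(should_retry("failed", 0, 3))   # first failure, retries remain -> True
print(should_retry("stopped", 0, 3))  # manual stop ends the chain -> False
```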

retry_delay is a duration string ("5s", "2m", "1h"). When retries are enabled, retry_delay defaults to "5s".
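As a rough sketch, the duration grammar shown above ("5s", "2m", "1h") can be parsed like this. `parse_duration` is an illustrative name, not the daemon's API, and the real parser may accept more forms:

```python
import re

# Unit suffixes taken from the examples in the text above.
UNITS = {"s": 1, "m": 60, "h": 3600}

def parse_duration(text: str) -> int:
    """Parse a duration string like "30s" or "2m" into seconds."""
    m = re.fullmatch(r"(\d+)([smh])", text)
    if m is None:
        raise ValueError(f"bad duration: {text!r}")
    return int(m.group(1)) * UNITS[m.group(2)]

print(parse_duration("30s"), parse_duration("2m"), parse_duration("1h"))
```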

retry_backoff chooses how the wait between retries grows. The same values are accepted by services’ restart_backoff, so the vocabulary carries over cleanly between the two contexts.

Value               | Wait before attempt N (1-indexed retries) | With retry_delay = "10s"
constant (or unset) | constant delay                            | 10s, 10s, 10s …
linear              | delay × N                                 | 10s, 20s, 30s, 40s …
exponential         | delay × 2^(N-1)                           | 10s, 20s, 40s, 80s, 160s, 300s …

All schedules are capped at 5 minutes. exponential with a short base hits the cap and stays there.
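The three schedules plus the cap can be sketched in a few lines. `backoff_delay` and `CAP_SECONDS` are illustrative names, not the daemon's internals:

```python
CAP_SECONDS = 300  # all schedules are capped at 5 minutes

def backoff_delay(base: int, attempt: int, mode: str = "constant") -> int:
    """Seconds to wait before retry attempt `attempt` (1-indexed)."""
    if mode == "linear":
        delay = base * attempt
    elif mode == "exponential":
        delay = base * 2 ** (attempt - 1)
    else:  # "constant" or unset
        delay = base
    return min(delay, CAP_SECONDS)

# retry_delay = "10s" with exponential backoff: 10, 20, 40, 80, 160,
# then pinned at the 300s cap.
print([backoff_delay(10, n, "exponential") for n in range(1, 8)])
```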

Each attempt is a separate run, with its own ID, exit code, and captured log file. The Web UI lists every attempt under the task’s run history, numbered by retry_attempt (0 for the first try, 1 for the first retry, and so on). That’s deliberate: if attempt 1 silently corrupted state and attempt 2 succeeded, you can still go back and read attempt 1’s stderr.

[tasks.heavy-job]
cron = "0 3 * * *"
run = "/usr/local/bin/heavy-job.sh"
timeout = "30m"
  • Same duration syntax as retry_delay ("30s", "5m", "1h").
  • Default: inherited from [defaults] timeout if set; otherwise no timeout — the run is allowed to take as long as it likes.
  • Scope: per attempt. A retry gets a fresh timeout window; time spent waiting in retry_delay doesn’t count against it.
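For example, the inheritance can look like this (quick-sync is a hypothetical task added for contrast; heavy-job mirrors the example above):

```toml
[defaults]
timeout = "1h"               # inherited by every task that doesn't set its own

[tasks.quick-sync]           # hypothetical task: no timeout key, inherits "1h"
cron = "*/5 * * * *"
run = "/usr/local/bin/sync.sh"

[tasks.heavy-job]
cron = "0 3 * * *"
run = "/usr/local/bin/heavy-job.sh"
timeout = "30m"              # overrides the default; measured per attempt
```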

When the deadline hits, the daemon SIGTERMs the run’s process group, waits up to the task’s graceful_stop (default "5s"), then SIGKILLs any survivors and records the run with end reason timeout. The same SIGTERM-then-wait flow applies to on_overlap = "terminate", manual stops, and daemon shutdown — graceful_stop is the single knob.

  • on_overlap = "terminate" plus retries. If a new firing terminates the running attempt, that attempt records end reason stopped — which blocks any further retries. The new run from the terminate policy is a fresh execution, not a retry.
  • Manual stop. Same story: stopping a run from the API/UI records stopped and ends the retry chain.
  • max_concurrent > 1. Retries don’t count against max_concurrent. A retry only fires after its predecessor has finished, so there’s no overlap to evaluate.

Services have a different model because they’re meant to stay up:

Tasks and services share the same backoff vocabulary — constant / linear / exponential — so one rule is easier to remember. Tasks add retry_attempts (services run forever); services add restart_delay (the supervisor owns the cadence).

Field           | Tasks                                                | Services
retry_attempts  | ✅ default 0                                          | ❌ rejected
retry_delay     | ✅ default 5s                                         | ❌ rejected
retry_backoff   | ✅ constant / linear / exponential, default constant  | ❌ rejected
restart_delay   | ❌ rejected                                           | ✅ default 1s
restart_backoff | ❌ rejected                                           | ✅ constant / linear / exponential, default exponential
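Putting the service-side fields together — the service name and values here are illustrative, not defaults:

```toml
[services.queue-worker]        # hypothetical service
run = "/usr/local/bin/worker"
restart_delay = "2s"           # base wait between restarts (default "1s")
restart_backoff = "linear"     # waits grow 2s, 4s, 6s … between crashes
```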

A service supervisor restarts a replica forever (with bounded exponential backoff) until you stop it explicitly.

A replica that stays up at least backoff_reset_after (default 60s) resets its restart counter, so a service that fails repeatedly at first doesn’t keep slow restart delays once it stabilises. Configure it in [defaults] (applies to every service that doesn’t override it) or per-service:

[defaults]
backoff_reset_after = "30s"   # global default

[services.flaky-worker]
run = "/usr/local/bin/worker"
backoff_reset_after = "2m"    # this one needs longer to call "stable"
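The reset rule can be sketched as follows. `next_restart_count` is an illustrative name and the real supervisor's bookkeeping may differ, but the decision is the one described above:

```python
def next_restart_count(uptime_s: float, reset_after_s: float, count: int) -> int:
    """Restart counter to use after a replica exits.

    A replica that stayed up at least backoff_reset_after counts as
    stable, so the backoff schedule starts over from the beginning.
    """
    if uptime_s >= reset_after_s:
        return 1          # stable run: reset the counter
    return count + 1      # crashed quickly: keep escalating

print(next_restart_count(120.0, 60.0, count=7))  # stable -> reset to 1
print(next_restart_count(3.0, 60.0, count=7))    # crashed fast -> 8
```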

If you want “finite, escalating wait” semantics on a service, model it as a task with cron/retry_attempts instead. If you want “indefinite self-healing supervision” on a workload, that’s [services.*].