Retries & timeouts
If a task exits non-zero, RunWisp can take another crack at it. If a run
drags on too long, RunWisp can pull the plug. You set both per task, and
retry settings you leave off fall back to their built-in defaults rather
than inheriting from [defaults]; an unset timeout, by contrast, does
inherit from [defaults].
Retries are a task thing only. Services don’t retry — they restart instead.
retry_attempts
```toml
[tasks.publish-feed]
cron = "*/15 * * * *"
run = "/usr/local/bin/publish.sh"
retry_attempts = 3
retry_delay = "30s"
retry_backoff = "exponential"
```

retry_attempts is the number of extra tries after the first one
fails — it defaults to 0, so out of the box nothing retries. Set it to
3 and a failing run gets up to three more goes.
Not every ending counts as a failure worth retrying, though. A retry
fires when a run ends in failed (non-zero exit), timeout, crashed,
or log_overflow (which is what log_on_full = "kill_task" does after
a run blows past log_max_size).
It does not fire on success, stopped, or skipped. stopped
means a human deliberately killed it from the API/CLI/UI, or a sibling
run cancelled it via on_overlap = "terminate" — either way, retrying
would override an intentional decision. skipped means the original run
is still going, so another attempt would just race it.
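
The eligibility rule above can be sketched as a small predicate. This is only an illustrative model of the documented behaviour, not RunWisp's own code: the function name `should_retry` and its parameters are assumptions made for the example.

```python
# End reasons that trigger a retry, per the list above.
RETRYABLE = {"failed", "timeout", "crashed", "log_overflow"}


def should_retry(end_reason: str, retry_attempt: int, retry_attempts: int = 0) -> bool:
    """Return True if a run that ended with `end_reason` should be retried.

    retry_attempt:  0 for the first try, 1 for the first retry, and so on.
    retry_attempts: configured number of extra tries (0 means never retry).
    """
    if end_reason not in RETRYABLE:
        # success, stopped, and skipped never start another attempt.
        return False
    # Retry only while there are extra tries left.
    return retry_attempt < retry_attempts
```

With `retry_attempts = 3`, a `failed` first try (`retry_attempt = 0`) is retried, a `failed` third retry (`retry_attempt = 3`) is not, and a `stopped` run is never retried regardless of the budget.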
retry_delay and retry_backoff
retry_delay is how long to wait before trying again, written as a
duration ("5s", "2m", "1h"). Turn retries on and it defaults to
"5s".
retry_backoff decides whether that wait stays put or grows with each
attempt. Services use the same vocabulary for their restart_backoff,
so once you know it here it carries over there too.
| Value | Wait before attempt N (1-indexed retries) | With retry_delay = "10s" |
|---|---|---|
| constant (or unset) | constant delay | 10s, 10s, 10s … |
| linear | delay × N | 10s, 20s, 30s, 40s … |
| exponential | delay × 2^(N-1) | 10s, 20s, 40s, 80s, 160s, 300s … |
No matter which curve you pick, every wait is capped at 5 minutes. An
exponential schedule with a short base climbs up to that ceiling and
then just sits there.
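
To make the arithmetic concrete, here is a minimal sketch of the three curves plus the 5-minute cap. It works in plain seconds rather than parsing duration strings, and the names `retry_wait` and `CAP_SECONDS` are hypothetical, chosen for the example rather than taken from RunWisp.

```python
CAP_SECONDS = 5 * 60  # every wait is capped at 5 minutes


def retry_wait(delay_seconds: float, backoff: str, n: int) -> float:
    """Seconds to wait before retry number `n` (1-indexed)."""
    if n < 1:
        raise ValueError("n is 1-indexed")
    if backoff == "constant":
        wait = delay_seconds
    elif backoff == "linear":
        wait = delay_seconds * n
    elif backoff == "exponential":
        wait = delay_seconds * 2 ** (n - 1)
    else:
        raise ValueError(f"unknown backoff: {backoff!r}")
    # The cap applies to every curve, not just the exponential one.
    return min(wait, CAP_SECONDS)


schedule = [retry_wait(10, "exponential", n) for n in range(1, 7)]
print(schedule)  # [10, 20, 40, 80, 160, 300]
```

The sixth wait would be 320s uncapped, which is why the exponential row in the table ends at 300s.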
What retries look like in history
Every attempt is its own run: its own ID, its own exit code, its own
log file. The Web UI lists them all under the task’s history, numbered
by retry_attempt (0 for the first try, 1 for the first retry, and
on up). That’s on purpose: if attempt 1 quietly corrupted something and
attempt 2 papered over it by succeeding, you can still go back and read
exactly what attempt 1 printed.
timeout
```toml
[tasks.heavy-job]
cron = "0 3 * * *"
run = "/usr/local/bin/heavy-job.sh"
timeout = "30m"
```

It takes the same duration syntax as retry_delay ("30s", "5m",
"1h"). If you don’t set one, it inherits from [defaults] timeout —
and if that isn’t set either, there’s no timeout at all and the run
can take as long as it needs. The window is per attempt, so each
retry starts the clock fresh, and time spent waiting out retry_delay
doesn’t eat into it.
When the deadline passes, the daemon SIGTERMs the run’s process group,
gives it up to graceful_stop (default "5s") to bow out, then SIGKILLs
whatever’s left and records the run with end reason timeout. That same
SIGTERM-then-wait dance is what on_overlap = "terminate", manual stops,
and daemon shutdown all use — graceful_stop is the one knob behind all
of it.
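
As a rough illustration of that SIGTERM-then-SIGKILL sequence, here is a minimal POSIX-only sketch in Python. It is not RunWisp's implementation: the function name `run_with_timeout` and its parameters are made up for the example, and it only watches the group leader when deciding whether the grace period has run out.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd: list[str], timeout: float, graceful_stop: float = 5.0) -> str:
    """Run `cmd` and return "finished" or "timeout" (POSIX only)."""
    # start_new_session=True makes the child the leader of a new process
    # group, so group signals also reach anything it spawns.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        proc.wait(timeout=timeout)
        return "finished"
    except subprocess.TimeoutExpired:
        pass  # deadline passed; fall through to the stop sequence

    try:
        # The leader's PID doubles as the process group ID.
        os.killpg(proc.pid, signal.SIGTERM)  # ask politely first
    except ProcessLookupError:
        pass  # the group vanished in the meantime

    try:
        proc.wait(timeout=graceful_stop)  # give it time to bow out
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # grace period is over
        proc.wait()
    return "timeout"
```

Signalling the group rather than the single PID is what keeps a shell script's child processes from outliving the run.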
Interactions
A few places where retries bump into other features:

- on_overlap = "terminate" plus retries. When a fresh firing terminates the running attempt, that attempt ends as stopped, which shuts the retry chain down. The run the terminate policy starts is a brand-new execution, not a retry.
- Manual stop. Same deal: stop a run from the API or UI and it ends as stopped, which ends the retry chain.
- max_concurrent > 1. Retries go through the same concurrency check, but since a retry only starts once its predecessor has finished, there’s no overlap left to evaluate.
Services don’t retry
Services play by different rules, because the whole point of a service is to stay up.
The two do share a backoff vocabulary — constant / linear /
exponential — so there’s only one set of words to remember. The
difference is in the surrounding knobs: tasks have retry_attempts
(services run forever, so there’s nothing to count), and services have
restart_delay (the supervisor sets the pace).
| Field | Tasks | Services |
|---|---|---|
| retry_attempts | ✅ default 0 | ❌ rejected |
| retry_delay | ✅ default 5s | ❌ rejected |
| retry_backoff | ✅ constant / linear / exponential, default constant | ❌ rejected |
| restart_delay | ❌ rejected | ✅ default 1s |
| restart_backoff | ❌ rejected | ✅ constant / linear / exponential, default exponential |
A service supervisor restarts a replica with bounded exponential
backoff as long as it looks like the replica can come back — but if
it exits before healthy_after too many times in a row, the supervisor
marks that slot FATAL and stops trying until you manually restart the
service.
Healthy threshold: healthy_after
Once a replica has stayed up for at least healthy_after (default 60s),
RunWisp calls it healthy — and that one threshold pulls double duty. It
resets the restart counter, so a service that flapped a bunch on startup
isn’t stuck with long restart delays after it finally settles down. And
it clears the failed-start streak behind FATAL, so a service that comes
up and stays up is never given up on. Set it in [defaults] to cover
every service that doesn’t say otherwise, or per service:
```toml
[defaults]
healthy_after = "30s"  # global default

[services.flaky-worker]
run = "/usr/local/bin/worker"
healthy_after = "2m"   # this one needs longer to call "stable"
```

Rule of thumb: if you want a finite, escalating set of retries, that’s a
task with cron and retry_attempts. If you want a workload kept alive
and self-healing indefinitely, that’s [services.*].