[services.*]
A service is an always-on process that RunWisp works to keep alive. It
exits, it gets restarted. It crashes, it gets restarted. Short of an
instance going FATAL (covered below), the only thing that stops a service
is you: the Stop Service button in the Web UI, or the s key on a service
execution in the TUI. And even that stop flag only lives in memory:
restart the daemon and the service comes right back up on its own.
The TOML key — api-worker in [services.api-worker] — is the
service name. It shares a single namespace with [tasks.*], so
names have to be unique across both kinds. Each instance’s run lands in
the same history view as your task runs, with its own ULID, log file,
and lifecycle.
Minimum example
```toml
[services.metrics-collector]
run = "/usr/local/bin/metrics-agent"
```

That’s a fully working service: one instance, restarted forever, with bounded exponential backoff.
Identity & metadata
| Key | Default | What it does |
|---|---|---|
| [services.*] | required | The service name (the TOML table key). Used in CLI, API, and log paths. |
| run | required (unless compose_file is set) | Shell command. Multi-line OK with TOML triple-quotes. |
| description | (empty) | Human-readable description shown in the UI and TUI. |
| group | "Services" | UI grouping label. |
| api_trigger | true | Allow manual trigger from CLI / API / UI. (Restart is the usual interaction for services.) |
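
As an illustration, the metadata keys above could be combined like this. It is only a sketch: the description text and the group label are made up for the example, and only the key names come from the table.

```toml
[services.metrics-collector]
description = "Ships host metrics to the central collector"  # shown in the UI and TUI
group = "Observability"        # hypothetical label; the default is "Services"
api_trigger = false            # disallow manual trigger from CLI / API / UI
run = "/usr/local/bin/metrics-agent"
```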
Instances
```toml
[services.api-worker]
instances = 3
run = "/usr/local/bin/worker"
```

| Key | Default | What it does |
|---|---|---|
| instances | 1 | Number of concurrent instances. Bounded 1 ≤ instances ≤ 64. |
Each instance shows up as its own run, tagged with its own
instance_index (0, 1, 2, and so on). When you run more than one,
the Web UI and TUI label them name#1, name#2, name#3, which is the
stored index plus one; a single-instance service stays just name.
All instances share the same config, their logs pool together per
service, and when one instance’s process exits the supervisor restarts
just that one.
Restart behaviour
| Key | Default | What it does |
|---|---|---|
| restart_delay | "1s" | Base delay between restarts. Go duration string. |
| restart_backoff | "exponential" | Curve applied to restart_delay: constant, linear, or exponential (shared with task retry_backoff). |
| healthy_after | inherited | Uptime an instance must reach to count as healthy; reaching it resets the restart counter and clears the failed-start streak. Exits faster than this count as failed starts. See [defaults] for the inherited value (default "60s"). |
| start_retries | 3 | Consecutive failed starts an instance tolerates before it’s marked FATAL and stops restarting. |
The backoff has a ceiling, so even after a long stretch of flapping the
delay between restarts won’t climb forever. And once an instance has
stayed up for healthy_after, its backoff counter resets, so a
service that flaps early but eventually settles down doesn’t stay stuck
with slow restarts.
restart = "always" is baked in and can’t be overridden — that’s
the deal with a service. If what you want is “run once and exit,” that’s
a task.
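
A sketch of how the restart keys fit together, assuming a hypothetical log-shipper binary. The values are illustrative rather than recommendations, and since the exact delay sequence and ceiling are not spelled out here, the comments describe intent only.

```toml
[services.log-shipper]                 # hypothetical service
run = "/usr/local/bin/log-shipper"
restart_delay = "500ms"                # base delay, a Go duration string
restart_backoff = "linear"             # one of: constant, linear, exponential
healthy_after = "30s"                  # 30s of uptime resets the backoff and the failed-start streak
start_retries = 5                      # consecutive failed starts tolerated before FATAL
```

Leaving healthy_after off would fall back to the inherited value instead.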
When a service can’t start: FATAL
Restarting forever is the right move for a service that occasionally falls over. It’s the wrong move for one that can never start: a typo in the command, a missing binary, a bad config. Without a backstop that service would flap silently until you happened to look. RunWisp’s first rule is nothing silently fails, so there’s a backstop.
Every time an instance exits, RunWisp checks how long it stayed up. Reach
healthy_after and it counts as a successful start: the failed-start
counter resets to zero. Exit faster than healthy_after with a failure
(see what counts as failure; a clean exit on a success code never does),
and that’s a failed start. Rack up more than start_retries failed starts
in a row and the instance goes FATAL: it stops restarting, a start_failed
run is recorded, and a service.fatal notification lights the in-app bell
and any configured notifiers. That is the loud terminal failure you want
instead of an invisible flap.
healthy_after is the one bar for both jobs: the same uptime that resets
the restart backoff also marks the instance as successfully started. So
“stable enough to reset the backoff” and “stable enough to count as
started” are one number to reason about — the way supervisord’s
startsecs does both.
FATAL lives in memory only — like every other in-memory supervisor flag,
it’s the runtime’s idea of desired state, never persisted. Restarting the
daemon, or hitting Restart on the service in the UI/API, gives it a
fresh start_retries budget and tries again; a restart is your “try
again” signal. The start_failed runs stay in history either way, so the
failures remain visible after the daemon comes back.
```toml
[services.flaky]
run = "exit 1"          # can never come up
healthy_after = "5s"    # must stay up 5s to count as healthy
start_retries = 2       # tolerate 2 failed starts in a row; the next one goes FATAL
```

This service tries to start, fails fast, retries twice more, then goes FATAL and stops, instead of burning a restart every second forever.
Startup
| Key | Default | What it does |
|---|---|---|
| priority | 0 | Boot start order across services: lower starts first. Ties break on name. Start order only. |
| autostart | true | Whether the service comes up at boot. false boots it stopped until you start it from the UI/API. |
priority decides the order the daemon spawns services at boot: the
lowest number goes first, and services that share a number start in
alphabetical order. On its own it is not a readiness gate — it just
fixes the otherwise-random spawn order so the service that wants a shared
port first reliably gets it first. If you need one service to actually be
up before another starts, that’s depends_on
below.
autostart = false is for services you want defined but not running on
every boot — a maintenance worker, a one-off migration runner you trigger
by hand. The service shows up stopped; hit Start in the Web UI (or
the REST restart endpoint) to bring it up. This desired state isn’t
persisted: it’s re-read from runwisp.toml every boot, so a manually
started autostart = false service is stopped again after a restart, and
a manually stopped autostart = true service comes back up. The TOML is
always the source of truth.
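
To make the two keys concrete, here is a minimal sketch with three hypothetical services (the names and binaries are invented for illustration). The proxy is spawned before the app because its priority is lower, and the migration runner is defined but stays stopped at boot.

```toml
[services.edge-proxy]                    # hypothetical
run = "/usr/local/bin/edge-proxy"
priority = 0                             # lowest number, spawned first

[services.app]                           # hypothetical
run = "/usr/local/bin/app"
priority = 10                            # spawned after edge-proxy; this is order only, not a readiness gate

[services.migrate-runner]                # hypothetical
run = "/usr/local/bin/migrate-runner"
autostart = false                        # boots stopped until started from the UI/API
```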
Boot order with depends_on
| Key | Default | What it does |
|---|---|---|
| depends_on | none | Other services that must be healthy before this one starts at boot. Order only. |
When a service genuinely needs another to be up first — an API that talks to a database, a worker that needs its broker — list the dependency:
```toml
[services.db]
run = "postgres -D /var/lib/pg"
healthy_after = "3s"   # call it "up" once it stays alive 3s

[services.api]
run = "./api-server"
depends_on = ["db"]    # wait for db to be healthy, then start
```

At boot, api waits until db reports healthy (that is, at least one db
instance has stayed live for its healthy_after window) and only then
starts. A service can depend on several at once
(depends_on = ["db", "cache"]); it waits for all of them.
This is deliberately small: boot ordering and a readiness gate, nothing
more. It is not a workflow engine. There are no cascade restarts (if
db crashes after api came up, api is left alone — supervision,
not orchestration), no run-to-completion edges, and no data passing.
That’s orchestration, which is a non-goal.
A few rules worth knowing:
- It never deadlocks your boot. If a dependency never becomes healthy (it keeps crashing, or it’s autostart = false), the dependent waits a bounded window (the dep’s healthy_after plus a few seconds) and then starts anyway, with a loud warning in the log. Nothing silently hangs.
- Manual start/restart is ungated. depends_on only gates the automatic boot. Hitting Start on a service from the UI/API brings it up immediately, dependencies or not.
- Shutdown reverses it. On graceful shutdown, dependents are stopped before the services they depend on, so the API drains before the database it was talking to goes away.
- Cycles are rejected at load. a → b → a fails validation with the cycle named, the same as any other config error. References must point at services (not tasks) that actually exist.
priority and depends_on cooperate: depends_on decides what’s
allowed to start, priority breaks ties between services that are
independently ready to go.
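
One way the two keys might be combined, as a sketch. The cache service and its binary are hypothetical; db and api follow the earlier example. Neither db nor cache depends on anything, so both are free to start at boot and priority decides which is spawned first, while api is held back until both report healthy.

```toml
[services.db]
run = "postgres -D /var/lib/pg"
healthy_after = "3s"
priority = 0                           # no dependencies; spawned first

[services.cache]                       # hypothetical service
run = "/usr/local/bin/cache-server"
healthy_after = "2s"
priority = 1                           # no dependencies; spawned right after db

[services.api]
run = "./api-server"
depends_on = ["db", "cache"]           # gated at boot on both being healthy
```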
Concurrency
| Key | Default | What it does |
|---|---|---|
| on_overlap | "skip" | What happens when something tries to start a new run while one is going. |
Services default to on_overlap = "skip" because the supervisor already
holds the instance count steady, so overlap almost never comes up. Try
to manually trigger a service that’s already running and it’s cleanly
turned away. There’s no max_concurrent on services either: how many
copies run is set by instances, not by some in-flight overlap count.
Graceful shutdown
| Key | Default | What it does |
|---|---|---|
| graceful_stop | "5s" | Grace period per instance after the stop signal, before SIGKILL. Applies to manual stop, Restart Service, and daemon shutdown. |
| stop_signal | "SIGTERM" | Signal that opens the stop ladder. One of SIGTERM, SIGINT, SIGQUIT, SIGHUP, SIGKILL, SIGUSR1, SIGUSR2 (the SIG prefix is optional). |
graceful_stop covers the whole process group: the stop signal goes to the
instance’s process group, every descendant gets the same grace window,
and anything still standing afterward is SIGKILL’d together. The signal is
SIGTERM unless you set stop_signal; SIGKILL skips the grace window
entirely. If your
graceful_stop is longer than
[daemon] shutdown_timeout,
the daemon warns you at boot — and during a whole-daemon shutdown, each
instance is capped by the daemon-wide limit no matter what its own
setting says.
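
A sketch of a service whose grace period stays under the daemon-wide limit. The service name and binary are hypothetical, and the shutdown_timeout value is assumed to be a duration string like the other timing keys.

```toml
[daemon]
shutdown_timeout = "30s"               # assumed format; caps every instance during a whole-daemon shutdown

[services.queue-consumer]              # hypothetical service
run = "/usr/local/bin/queue-consumer"
stop_signal = "SIGINT"                 # sent to the instance's process group instead of SIGTERM
graceful_stop = "20s"                  # shorter than shutdown_timeout, so no boot-time warning
```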
Logs & retention
Logs work exactly like they do on tasks: same fields, same defaults.
When you leave keep_runs and keep_for off here, they inherit from
the [defaults] section.
| Key | Default |
|---|---|
log_max_size | 100MB |
log_on_full | "drop_old" |
keep_runs | inherits [defaults] keep_runs |
keep_for | inherits [defaults] keep_for |
The accept/reject rules are the same as on tasks too: positive numbers
cap, leaving a field off inherits. Zero means “inherit from
[defaults]” for keep_runs; negative values are rejected for both
fields.
One thing to keep in mind — a service’s history can grow a lot faster
than a task’s, because every crash is another run row. So set keep_runs
with that in mind; 200 is a sensible starting point for a service
that’s prone to flapping.
See Logs & retention for the underlying behaviour.
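
For instance, a flap-prone service might override only the run count and leave the age limit inherited. This is a sketch: the service is hypothetical, the [defaults] value is illustrative, and log_max_size is assumed to take a quoted size string.

```toml
[defaults]
keep_runs = 50                         # illustrative value

[services.flappy-worker]               # hypothetical service
run = "/usr/local/bin/flappy-worker"
keep_runs = 200                        # overrides [defaults]; every crash adds a run row
# keep_for is left off, so it inherits from [defaults]
log_max_size = "50MB"                  # assumed to be a quoted size string
log_on_full = "drop_old"
```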
Environment & secrets
env, env_file, secrets, and secrets_file work exactly the same
way they do on tasks — see
[tasks.*] Environment & secrets
for the merge order, visibility rules, and validation.
```toml
[services.api]
run = "/srv/api/bin/server"
env = { PORT = "8080", LOG_LEVEL = "info" }
secrets_file = "/etc/runwisp/api-secrets.env"
```

Every instance of a multi-instance service receives the same merged environment.
working_dir, shell, umask, and user also work the same as on tasks — see
[tasks.*] Working directory & shell.
Every instance runs in the same directory, under the same interpreter and mask,
as the same OS user (user needs the daemon running as root).
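
A minimal sketch using two of these keys. The directory and account name are hypothetical, and shell and umask are left out because their value formats are documented on the tasks page rather than here.

```toml
[services.api]
run = "./bin/server"                   # resolved relative to working_dir
working_dir = "/srv/api"               # hypothetical path; every instance runs here
user = "api"                           # hypothetical account; needs the daemon running as root
instances = 2
```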
Compose-backed services
[services.<name>] can run a service defined in an existing
docker-compose.yml instead of a raw shell command. Set
compose_file = "./docker-compose.yml" and (optionally)
compose_service = "<name>"; when compose_service is omitted, the Compose
service name defaults to the table key. run and compose_file are mutually exclusive.
For importing the whole compose file at once, see [compose.*].
```toml
[services.api]
compose_file = "./docker-compose.yml"
compose_service = "api"
notify_on_failure = ["slack-prod"]
```

Notifications
| Key | Default | What it does |
|---|---|---|
| notify_on_failure | (none) | Notifier IDs to alert when an instance ends failed, crashed, timeout, or missed, or when a slot goes fatal. |
| notify_on_success | (none) | Notifier IDs to alert on run.succeeded (a clean instance shutdown). |
Same shape and same behaviour as on [tasks.*] — right down to the
implicit [notify] global_notifiers (default ["inapp"]) getting added
in. The shared reference lives at
Per-task notifications, with
[tasks.*] notifications as the
mirror entry. A failed instance in a [services.*] block reaches the
exact same channels a failed [tasks.*] run would.
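
As a sketch, a service that alerts on both outcomes could look like this. The notifier ID slack-prod is assumed to be defined elsewhere in the config.

```toml
[services.api]
run = "/srv/api/bin/server"
notify_on_failure = ["slack-prod"]     # failed, crashed, timeout, missed, or a slot going fatal
notify_on_success = ["slack-prod"]     # run.succeeded, i.e. a clean instance shutdown
# the implicit [notify] global_notifiers (default ["inapp"]) are added on top
```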
Cooperating with graceful_stop
The practical advice: trap SIGTERM in your run command and exit
cleanly. The starter file’s pattern is a fine place to begin:

```sh
trap 'echo "SIGTERM — shutting down"; exit 0' TERM INT
while true; do
  # do work
  sleep 1   # placeholder so the loop body is not empty
done
```

An instance that catches SIGTERM and exits cleanly is recorded with
end_reason = stopped.
What’s rejected on services
- cron, catch_up: services aren’t cron-driven.
- params: services don’t take per-execution parameters; they aren’t manually triggered.
- retry_attempts, retry_delay, retry_backoff: services restart instead of retry. Use restart_delay / restart_backoff.
- max_concurrent, queue_max: instance count is instances; services don’t queue.
- A name shared with a [tasks.*] entry.
- Empty or missing run (unless compose_file is set instead).
- instances outside [1, 64].
Worked example: 3 queue workers
```toml
[services.api-worker]
description = "Three always-on workers consuming the same job queue"
instances = 3
restart_delay = "2s"
restart_backoff = "exponential"
healthy_after = "2m"     # this one needs longer to call "stable"
graceful_stop = "20s"    # leave time to finish the in-flight job
keep_runs = 500
notify_on_failure = ["slack-ops"]
run = """
trap 'echo "SIGTERM — draining and exiting"; exit 0' TERM INT
echo "[$(date -Iseconds)] worker starting up..."
while true; do
  /usr/local/bin/consume-job
done
"""
```