
[services.*]

A service is an always-on process that RunWisp works to keep alive. It exits, it gets restarted. It crashes, it gets restarted. The only thing that stops a service for good is you — the Stop Service button in the Web UI, or s on a service execution in the TUI. And even that stop flag only lives in memory: restart the daemon and the service comes right back up on its own.

The TOML key — api-worker in [services.api-worker] — is the service name. It shares a single namespace with [tasks.*], so names have to be unique across both kinds. Each instance’s run lands in the same history view as your task runs, with its own ULID, log file, and lifecycle.

[services.metrics-collector]
run = "/usr/local/bin/metrics-agent"

That’s a fully working service: one instance, restarted forever, with bounded exponential backoff.

| Key | Default | What it does |
| --- | --- | --- |
| [services.*] | required | The service name (the TOML table key). Used in CLI, API, and log paths. |
| run | required | Shell command. Multi-line OK with TOML triple-quotes. |
| description | (empty) | Human-readable description shown in the UI and TUI. |
| group | "Services" | UI grouping label. |
| api_trigger | true | Allow manual trigger from CLI / API / UI. (Restart is the usual interaction for services.) |

[services.api-worker]
instances = 3
run = "/usr/local/bin/worker"

| Key | Default | What it does |
| --- | --- | --- |
| instances | 1 | Number of concurrent instances. Bounded 1 ≤ instances ≤ 64. |

Each instance shows up as its own run, tagged with its own instance_index (0, 1, 2, and so on). When you run more than one, the Web UI and TUI label them name#1, name#2, name#3 — a 1-based count off the stored index, so a single-instance service stays just name. They all share the same config, their logs pool together per service, and when one instance’s process exits the supervisor restarts just that one.

| Key | Default | What it does |
| --- | --- | --- |
| restart_delay | 1s | Base delay between restarts. Go duration string. |
| restart_backoff | "exponential" | Curve applied to restart_delay: constant, linear, or exponential (shared with task retry_backoff). |
| healthy_after | inherited | Uptime an instance must reach to count as healthy; reaching it resets its restart counter and clears the failed-start streak. Exits faster than this count as failed starts. See [defaults] for the inherited value (default "60s"). |
| start_retries | 3 | Consecutive failed starts an instance tolerates before it's marked FATAL and stops restarting. |

The backoff has a ceiling, so even after a long stretch of flapping the delay between restarts won’t climb forever. And once an instance has stayed up for healthy_after, its backoff counter resets — so a service that flaps early but eventually settles down doesn’t stay stuck with slow restarts.
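
As a rough sketch of how these keys combine, the hypothetical service below swaps the default exponential curve for a linear one and lowers the stability bar. The service name, binary path, and every value are illustrative assumptions; only the key names and the three curve names come from the table above.

```toml
# Hypothetical service: name, path, and values are illustrative only.
[services.queue-consumer]
run = "/usr/local/bin/queue-consumer"
restart_delay = "500ms"     # base delay, written as a Go duration string
restart_backoff = "linear"  # one of: constant, linear, exponential
healthy_after = "30s"       # 30s of uptime resets the backoff and the failed-start streak
start_retries = 5           # consecutive failed starts tolerated before FATAL
```

A shorter healthy_after resets the backoff sooner, but it also makes it easier for a slow-failing process to count as successfully started, so the two effects are worth weighing together.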

restart = "always" is baked in and can’t be overridden — that’s the deal with a service. If what you want is “run once and exit,” that’s a task.

Restarting forever is the right move for a service that occasionally falls over. It’s the wrong move for one that can never start — a typo in the command, a missing binary, a bad config. Without a backstop that service would flap silently until you happened to look. RunWisp’s first rule is nothing silently fails, so there’s a backstop.

Every time an instance exits, RunWisp checks how long it stayed up. Reach healthy_after and it counts as a successful start: the failed-start counter resets to zero. Exit sooner than healthy_after with a failure (a clean exit on a success code never counts as one), and that's a failed start. Rack up more than start_retries failed starts in a row and the instance goes FATAL: it stops restarting, a start_failed run is recorded, and a service.fatal notification lights the in-app bell and fires any configured notifiers. That is the loud terminal failure you want instead of an invisible flap.

healthy_after is the one bar for both jobs: the same uptime that resets the restart backoff also marks the instance as successfully started. So “stable enough to reset the backoff” and “stable enough to count as started” are one number to reason about — the way supervisord’s startsecs does both.

FATAL lives in memory only — like every other in-memory supervisor flag, it’s the runtime’s idea of desired state, never persisted. Restarting the daemon, or hitting Restart on the service in the UI/API, gives it a fresh start_retries budget and tries again; a restart is your “try again” signal. The start_failed runs stay in history either way, so the failures remain visible after the daemon comes back.

[services.flaky]
run = "exit 1" # can never come up
healthy_after = "5s" # must stay up 5s to count as healthy
start_retries = 2 # tolerate 2 failed starts in a row; the 3rd goes FATAL

This service tries to start, fails fast, retries twice more, then goes FATAL and stops — instead of burning a restart every second forever.

| Key | Default | What it does |
| --- | --- | --- |
| priority | 0 | Boot start order across services; lower starts first. Ties break on name. Start order only. |
| autostart | true | Whether the service comes up at boot. false boots it stopped until you start it from the UI/API. |

priority decides the order the daemon spawns services at boot: the lowest number goes first, and services that share a number start in alphabetical order. On its own it is not a readiness gate — it just fixes the otherwise-random spawn order so the service that wants a shared port first reliably gets it first. If you need one service to actually be up before another starts, that’s depends_on below.

autostart = false is for services you want defined but not running on every boot — a maintenance worker, a one-off migration runner you trigger by hand. The service shows up stopped; hit Start in the Web UI (or the REST restart endpoint) to bring it up. This desired state isn’t persisted: it’s re-read from runwisp.toml every boot, so a manually started autostart = false service is stopped again after a restart, and a manually stopped autostart = true service comes back up. The TOML is always the source of truth.
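
A minimal sketch of both keys together, assuming three hypothetical services (the names, paths, and priority values are invented for illustration):

```toml
# Hypothetical services: names, paths, and numbers are illustrative only.
[services.proxy]
run = "/usr/local/bin/proxy"
priority = 0            # the default; spawned before higher numbers

[services.app]
run = "/srv/app/bin/server"
priority = 10           # spawned after proxy, with no readiness wait

[services.migrate-runner]
run = "/srv/app/bin/migrate --watch"
autostart = false       # defined, but boots stopped until started by hand
```

With this file, proxy is spawned before app on every boot, and migrate-runner stays stopped until someone starts it from the UI or API. Because desired state is re-read from the TOML at boot, it is stopped again after the next daemon restart.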

| Key | Default | What it does |
| --- | --- | --- |
| depends_on | none | Other services that must be healthy before this one starts at boot. Order only. |

When a service genuinely needs another to be up first — an API that talks to a database, a worker that needs its broker — list the dependency:

[services.db]
run = "postgres -D /var/lib/pg"
healthy_after = "3s" # call it "up" once it stays alive 3s
[services.api]
run = "./api-server"
depends_on = ["db"] # wait for db to be healthy, then start

At boot, api waits until db reports healthy — that is, at least one db instance has stayed live for its healthy_after window — and only then starts. A service can depend on several at once (depends_on = ["db", "cache"]); it waits for all of them.

This is deliberately small: boot ordering and a readiness gate, nothing more. It is not a workflow engine. There are no cascade restarts (if db crashes after api came up, api is left alone — supervision, not orchestration), no run-to-completion edges, and no data passing. That’s orchestration, which is a non-goal. A few rules worth knowing:

  • It never deadlocks your boot. If a dependency never becomes healthy — it keeps crashing, or it’s autostart = false — the dependent waits a bounded window (the dep’s healthy_after plus a few seconds) and then starts anyway, with a loud warning in the log. Nothing silently hangs.
  • Manual start/restart is ungated. depends_on only gates the automatic boot. Hitting Start on a service from the UI/API brings it up immediately, dependencies or not.
  • Shutdown reverses it. On graceful shutdown, dependents are stopped before the services they depend on, so the API drains before the database it was talking to goes away.
  • Cycles are rejected at load. a → b → a fails validation with the cycle named, the same as any other config error. References must point at services (not tasks) that actually exist.

priority and depends_on cooperate: depends_on decides what’s allowed to start, priority breaks ties between services that are independently ready to go.
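
The sketch below shows one way the two keys might be combined. The cache service and its binary path are assumptions added for illustration; db and api mirror the earlier example.

```toml
[services.db]
run = "postgres -D /var/lib/pg"
healthy_after = "3s"
priority = 0                  # no dependencies; spawned first

[services.cache]
run = "/usr/local/bin/cache-server"   # hypothetical binary
healthy_after = "2s"
priority = 1                  # no dependencies either; spawned after db

[services.api]
run = "./api-server"
depends_on = ["db", "cache"]  # waits for both to report healthy at boot
```

Here db and cache are both free to start immediately, so priority alone orders them. api is held back by depends_on until both report healthy, or until the bounded wait described in the rules above runs out.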

| Key | Default | What it does |
| --- | --- | --- |
| on_overlap | "skip" | What happens when something tries to start a new run while one is going. |

Services default to on_overlap = "skip" because the supervisor already holds the instance count steady, so overlap almost never comes up. Try to manually trigger a service that’s already running and it’s cleanly turned away. There’s no max_concurrent on services either — how many copies run is instances, not some in-flight overlap count.

| Key | Default | What it does |
| --- | --- | --- |
| graceful_stop | "5s" | Grace period per instance after the stop signal, before SIGKILL. Applies to manual stop, Restart Service, and daemon shutdown. |
| stop_signal | "SIGTERM" | Signal that opens the stop ladder. One of SIGTERM, SIGINT, SIGQUIT, SIGHUP, SIGKILL, SIGUSR1, SIGUSR2 (the SIG prefix is optional). |

graceful_stop covers the whole process group: the stop signal goes to the instance’s process group, every descendant gets the same grace window, and anything still standing afterward is SIGKILL’d together. The signal is SIGTERM unless you set stop_signal; SIGKILL skips the grace window entirely. If your graceful_stop is longer than [daemon] shutdown_timeout, the daemon warns you at boot — and during a whole-daemon shutdown, each instance is capped by the daemon-wide limit no matter what its own setting says.
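
As an illustration, a service whose process shuts down on SIGINT and needs extra time to drain might look like the sketch below. The service name and path are hypothetical, and the string form of shutdown_timeout is an assumption modeled on graceful_stop rather than something stated here.

```toml
[daemon]
shutdown_timeout = "45s"    # assumed to take a duration string, like graceful_stop

# Hypothetical service: name and path are illustrative only.
[services.ingest]
run = "/usr/local/bin/ingest"
stop_signal = "SIGINT"      # "INT" should work too, since the SIG prefix is optional
graceful_stop = "30s"       # shorter than shutdown_timeout
```

Keeping graceful_stop below the daemon-wide limit means the per-service grace window is the one that actually applies during a whole-daemon shutdown.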

Logs work exactly like they do on tasks — same fields, same defaults. When you leave keep_runs and keep_for off here, they inherit from the [defaults] section.

| Key | Default |
| --- | --- |
| log_max_size | 100MB |
| log_on_full | "drop_old" |
| keep_runs | inherits [defaults] keep_runs |
| keep_for | inherits [defaults] keep_for |

The accept/reject rules are the same as on tasks too: positive numbers cap, leaving a field off inherits. Zero means “inherit from [defaults]” for keep_runs; negative values are rejected for both fields.

One thing to keep in mind — a service’s history can grow a lot faster than a task’s, because every crash is another run row. So set keep_runs with that in mind; 200 is a sensible starting point for a service that’s prone to flapping.
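
A small sketch of that override, assuming a hypothetical [defaults] value of 50 and a hypothetical flapping service:

```toml
[defaults]
keep_runs = 50              # hypothetical value inherited by tasks and services

# Hypothetical service: name and path are illustrative only.
[services.flappy-worker]
run = "/usr/local/bin/worker"
keep_runs = 200             # overrides the inherited cap for this service only
```

Services that leave keep_runs off would keep inheriting the [defaults] value, so only the noisy one gets the larger history.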

See Logs & retention for the underlying behaviour.

env, env_file, secrets, and secrets_file work exactly the same way they do on tasks — see [tasks.*] Environment & secrets for the merge order, visibility rules, and validation.

[services.api]
run = "/srv/api/bin/server"
env = { PORT = "8080", LOG_LEVEL = "info" }
secrets_file = "/etc/runwisp/api-secrets.env"

Every instance of a multi-instance service receives the same merged environment.

working_dir, shell, umask, and user also work the same as on tasks — see [tasks.*] Working directory & shell. Every instance runs in the same directory, under the same interpreter and mask, as the same OS user (user needs the daemon running as root).

[services.<name>] can run a service defined in an existing docker-compose.yml instead of a raw shell command. Set compose_file = "./docker-compose.yml" and, optionally, compose_service = "<name>"; when compose_service is omitted, the Compose service name defaults to the table key. run and compose_file are mutually exclusive.

For importing the whole compose file at once, see [compose.*].

[services.api]
compose_file = "./docker-compose.yml"
compose_service = "api"
notify_on_failure = ["slack-prod"]

| Key | Default | What it does |
| --- | --- | --- |
| notify_on_failure | (none) | Notifier IDs to alert when an instance ends failed, crashed, timeout, or missed, or when a slot goes fatal. |
| notify_on_success | (none) | Notifier IDs to alert on run.succeeded (a clean instance shutdown). |

Same shape and same behaviour as on [tasks.*] — right down to the implicit [notify] global_notifiers (default ["inapp"]) getting added in. The shared reference lives at Per-task notifications, with [tasks.*] notifications as the mirror entry. A failed instance in a [services.*] block reaches the exact same channels a failed [tasks.*] run would.

The practical advice: trap SIGTERM in your run command and exit cleanly. The starter file’s pattern is a fine place to begin:

trap 'echo "SIGTERM — shutting down"; exit 0' TERM INT
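while true; do
  # do work; a loop body that is only a comment is a shell syntax error,
  # so a placeholder command stands in for the real work here
  sleep 1
done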

An instance that catches SIGTERM and exits cleanly is recorded with end_reason = stopped.

  • cron, catch_up — services aren’t cron-driven.
  • params — services don’t take per-execution parameters; they aren’t manually triggered.
  • retry_attempts, retry_delay, retry_backoff — services restart instead of retry. Use restart_delay / restart_backoff.
  • max_concurrent, queue_max — instance count is instances; services don’t queue.
  • A name shared with a [tasks.*] entry.
  • Empty or missing run.
  • instances outside [1, 64].

[services.api-worker]
description = "Three always-on workers consuming the same job queue"
instances = 3
restart_delay = "2s"
restart_backoff = "exponential"
healthy_after = "2m" # this one needs longer to call "stable"
graceful_stop = "20s" # leave time to finish the in-flight job
keep_runs = 500
notify_on_failure = ["slack-ops"]
run = """
trap 'echo "SIGTERM — draining and exiting"; exit 0' TERM INT
echo "[$(date -Iseconds)] worker starting up..."
while true; do
/usr/local/bin/consume-job
done
"""