The Orchestrator Nobody Talks About
Ask how to run services in production and you get the same liturgy: containers, Kubernetes, a managed cloud, maybe systemd if someone admits to liking Linux. Nobody says launchd. launchd is the thing that runs Spotlight, the daemon plumbing Apple ships for its own OS — surely not a production orchestrator.
Except it is one. It supervises processes, restarts them when they die, starts them at boot, runs them on timers, captures their logs, and does all of it with zero additional software installed. I run three public websites, multiple AI model servers, and a fleet of monitoring jobs on a single Mac, and every one of them is a launchd job. The whole stack survives crashes, reboots, and my own mistakes, and the orchestration layer cost nothing and has never once been the thing that broke.
If you are running production from hardware you own — which I have argued for elsewhere and will keep arguing for — launchd is not the compromise option. It is the correct one.
The Primitive That Does the Work: KeepAlive
The core of the whole pattern is one key in a plist: KeepAlive. Set it to true and launchd watches the process; if it exits for any reason, launchd starts it again. That single primitive is 80% of what people deploy an orchestrator to get.
My local vision model server is the honest test case. It is a Python process serving an ML model on a port — exactly the kind of thing that dies: out-of-memory, a poisoned request, a dependency hiccup. It runs under a launchd job with KeepAlive. I have kill-tested it deliberately: kill -9 the process, and launchd respawns it, model reloaded and serving again in about fifteen seconds, no human involved. That is a self-healing service in a dozen lines of XML.
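As a rough illustration of how small such a job definition is, here is a minimal sketch that generates one with Python's standard plistlib module. The label, interpreter path, script path, port, and log directory are hypothetical placeholders, not values from the setup described above.

```python
import plistlib


def keepalive_job(label, program_args, log_dir):
    """Build a launchd job definition for a long-running, supervised service."""
    return {
        "Label": label,
        "ProgramArguments": program_args,
        "KeepAlive": True,   # restart the process whenever it exits
        "RunAtLoad": True,   # start it as soon as the job is loaded (e.g. at boot/login)
        "StandardOutPath": f"{log_dir}/{label}.out.log",
        "StandardErrorPath": f"{log_dir}/{label}.err.log",
    }


# Hypothetical values for illustration only.
job = keepalive_job(
    "com.example.vision-server",
    ["/usr/bin/python3", "/Users/me/vision/serve.py", "--port", "8001"],
    "/Users/me/Library/Logs",
)

# plistlib.dumps emits XML by default, which is the format launchd reads.
plist_xml = plistlib.dumps(job).decode("utf-8")
print(plist_xml)
```

Saved under a name such as ~/Library/LaunchAgents/com.example.vision-server.plist, a file like this would typically be loaded as a per-user agent. Generating it from a dict rather than hand-editing avoids most XML typos.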
The websites run the same way. Each Next.js server is its own launchd job on its own port. If one crashes at 3 a.m., it is back before any monitor would even have paged me. And because each job declares RunAtLoad, a reboot recovers the entire fleet in the right state without a runbook.
Deploys are one command: launchctl kickstart -k kills a running job and restarts it, addressed by its service target (the domain plus the job's label). My deploy script builds, then kickstarts the three site jobs. That is the entire CD pipeline. No registry, no rolling update strategy, no YAML, just the plists that already exist.
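A deploy script along these lines can be sketched as follows. This is an assumption-laden outline, not the actual script: the job labels, the gui/501 domain, and the npm build command are all hypothetical, and the final call uses a recording stub so the example shows the command sequence without executing anything.

```python
import subprocess

# Hypothetical job labels; real ones come from each plist's Label key.
SITE_LABELS = ["com.example.site-a", "com.example.site-b", "com.example.site-c"]


def kickstart_command(label, domain):
    """Return the launchctl invocation that kills and restarts one job."""
    return ["launchctl", "kickstart", "-k", f"{domain}/{label}"]


def deploy(build_cmd, labels, domain, run=subprocess.run):
    """Build first, then restart every job; stop at the first failing step."""
    run(build_cmd, check=True)
    for label in labels:
        run(kickstart_command(label, domain), check=True)


# Dry run: record the commands instead of executing them.
calls = []
deploy(
    ["npm", "run", "build"],   # assumed build step for a Next.js project
    SITE_LABELS,
    "gui/501",                 # assumed per-user domain; system daemons use "system"
    run=lambda cmd, check: calls.append(cmd),
)
for cmd in calls:
    print(" ".join(cmd))
```

Dropping the run= argument makes the same function execute the commands for real. Because the build runs with check=True, a failed build raises before any job is restarted.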
Timers Replace Cron — and Then Replace Your Monitoring Vendor
The second half of launchd is StartInterval — run this job every N seconds. It is cron with supervision, and it quietly replaces a surprising amount of paid tooling.
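A timer job differs from a supervised service mainly in one key. The sketch below generates a plist that runs a script every 600 seconds. The label, script path, and log path are hypothetical, and KeepAlive is deliberately left out on the assumption that the script should run once per interval and exit.

```python
import plistlib


def interval_job(label, program_args, every_seconds, log_path):
    """Build a launchd job definition that runs on a fixed interval."""
    return {
        "Label": label,
        "ProgramArguments": program_args,
        "StartInterval": every_seconds,  # seconds between runs
        "StandardOutPath": log_path,
        "StandardErrorPath": log_path,   # same file, so one tail shows everything
    }


# Hypothetical values for illustration only.
watcher = interval_job(
    "com.example.uptime-watcher",
    ["/bin/sh", "/Users/me/bin/uptime-watch.sh"],
    600,
    "/Users/me/Library/Logs/uptime-watcher.log",
)

watcher_xml = plistlib.dumps(watcher).decode("utf-8")
print(watcher_xml)
```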
Every ten minutes, a watcher job probes all my public endpoints — apex domains plus the routes that matter — and alerts on anything unreachable, any 4xx/5xx, and any SSL certificate within fourteen days of expiry. That is the core of an uptime-monitoring subscription, running as a shell script on a timer, watching from the same machine that serves the traffic.
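The watcher described here is a shell script; the same three checks can be sketched in Python using only the standard library. Function names and the alert strings are assumptions made for illustration, and how an alert gets delivered is left out entirely.

```python
import socket
import ssl
import time
import urllib.error
import urllib.request

CERT_WARN_DAYS = 14  # alert window taken from the text above


def classify_status(status):
    """Return an alert string for a 4xx/5xx status, or None if healthy."""
    return f"HTTP {status}" if status >= 400 else None


def probe_http(url, timeout=10):
    """Fetch a URL and report unreachable hosts and error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as exc:   # urlopen raises on 4xx/5xx
        return classify_status(exc.code)
    except (urllib.error.URLError, OSError) as exc:
        return f"unreachable: {exc}"


def days_until_expiry(not_after, now=None):
    """Days between `now` and a certificate's notAfter timestamp."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def probe_cert(host, port=443, timeout=10):
    """Report a certificate that expires within CERT_WARN_DAYS."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except OSError as exc:                  # includes ssl.SSLError
        return f"TLS failure: {exc}"
    days = days_until_expiry(cert["notAfter"])
    if days < CERT_WARN_DAYS:
        return f"certificate expires in {days:.0f} days"
    return None

# A watcher would call probe_http(url) for each route and probe_cert(host)
# for each domain, then alert on every non-None result.
```

Keeping the status and expiry logic in small pure functions means the alert thresholds can be tested without touching the network.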
The one I recommend most, though, is the pattern I call a drift watcher. It is a script of assertions about how the machine is supposed to look: these jobs loaded, these ports listening, these processes present, these config values set. Seventeen checks, runs in under two seconds, every ten minutes. When something drifts — a job unloaded during debugging and never reloaded, a port squatted by a zombie, a config reverted — I find out within ten minutes instead of during an outage three weeks later. Infrastructure does not usually fail loudly. It drifts quietly and fails later. A drift watcher converts silent decay into a same-day fix, and launchd is the perfect place for it because the watcher itself is supervised: if it dies, it comes back too.
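The shape of a drift watcher is simple enough to sketch: a list of named assertions and a runner that collects the ones that fail. The version below is a minimal outline, not the seventeen-check script described above. The service targets, port, and process name are hypothetical, and it assumes that launchctl print exits non-zero for a job that is not loaded.

```python
import socket
import subprocess


def port_listening(port, host="127.0.0.1", timeout=1.0):
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0


def job_loaded(service_target):
    """True if launchd knows the job (assumes non-zero exit when it does not)."""
    result = subprocess.run(["launchctl", "print", service_target],
                            capture_output=True)
    return result.returncode == 0


def process_present(name):
    """True if a process with exactly this name is running."""
    return subprocess.run(["pgrep", "-x", name],
                          capture_output=True).returncode == 0


def run_checks(checks):
    """Run (description, callable) pairs and return descriptions that failed."""
    failures = []
    for description, check in checks:
        try:
            ok = check()
        except Exception as exc:  # a check that cannot run counts as drift
            ok = False
            description = f"{description} ({exc})"
        if not ok:
            failures.append(description)
    return failures


# Hypothetical intended state; a real list would mirror the actual machine.
CHECKS = [
    ("site-a job loaded", lambda: job_loaded("gui/501/com.example.site-a")),
    ("site-a port 3000 listening", lambda: port_listening(3000)),
    ("vision server process present", lambda: process_present("python3")),
]
# A watcher script would call run_checks(CHECKS) and alert on any failures.
```

Treating an exception inside a check as a failure is a deliberate choice in this sketch: a watcher that crashes on its first broken assertion would hide every drift after it.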
The Honest Limits
launchd orchestrates one machine. There is no scheduling across nodes, no service mesh, no replica sets. If you need horizontal scale across a cluster, this is not the tool, and pretending otherwise would be selling you something.
But be honest about the workload first. A single modern Mac serves multiple production websites, runs local AI inference, and executes a monitoring fleet with capacity to spare — mine does it daily. The overwhelming majority of solo operations and small products fit on one strong machine, and for one machine, cluster tooling is pure overhead: more layers, more failure modes, more things to patch, in exchange for solving a scale problem you do not have.
There are real sharp edges — plist XML is unforgiving, GUI-session versus system-daemon contexts confuse everyone at first, and the error messages are terse. The learning curve is a weekend. The alternative learning curve is Kubernetes.
The Pattern, Compressed
The whole architecture is four moves:
- One plist per service, with KeepAlive and RunAtLoad: crash recovery and boot recovery for free.
- launchctl kickstart -k as the deploy primitive: build, kickstart, done.
- StartInterval jobs for the watchers: uptime probes and a drift watcher asserting the machine's intended state.
- Everything logs to files via StandardOutPath/StandardErrorPath, so incidents are a tail away.
No subscriptions, no cluster, no control plane to babysit. The supervisor shipped with the operating system, and it has been the most reliable component in my entire stack — which is exactly what you want from the layer whose only job is to keep everything else alive.
FAQ
Is launchd actually reliable enough for production services?
It supervises every core service on macOS itself, which is a harder reliability bar than most products ever face. In my fleet the launchd layer has never been the failure point — processes crash, and launchd restarts them. The reliability risk lives in your services, not the supervisor.
How is launchd different from just using cron?
Cron only starts things on a schedule; it never watches them. launchd supervises: a KeepAlive job that crashes gets restarted automatically, and scheduled jobs that die get rerun on the next interval. You get cron's timers plus a process babysitter in one built-in tool.
What happens to launchd services when the Mac reboots?
Jobs with RunAtLoad start automatically when the machine comes up, so a reboot recovers the whole fleet in its intended state with no manual runbook. Combined with a tunnel for public traffic, the sites are back the moment the machine is.
When do I actually need Kubernetes instead of this?
When the workload genuinely exceeds one machine — horizontal scale across nodes, replicas for zero-downtime guarantees, multi-team clusters. For a solo operation running sites, inference, and monitors that fit on one strong Mac, cluster tooling adds failure modes without adding capability.