The MLX Stack: 7 Tools That Replaced My Cloud Subscriptions

The "sovereign AI" pitch is easy to romanticize and hard to actually execute. The romantic version is I run everything locally now. The actual version is I assembled a stack of seven specific tools, each doing one job well, and most of them are imperfect in some way I had to learn to live with.

This is that stack, in 2026, on an M-series Mac, with the rough edges named.

1. MLX — The Inference Engine

What it is. Apple's machine learning framework for Apple Silicon. The closest thing to PyTorch but built from the ground up for unified memory and the M-series neural engine.

What it replaced. Cloud GPU instances and the steady drip of API calls for inference.

The rough edge. Library coverage is younger than PyTorch. Some models you want to run will not have an MLX port the day they release. You learn to wait two weeks, or to do the conversion yourself.

The MLX runtime is what makes this entire post possible. Without it, the local inference quality on Apple Silicon is worse than the same hardware running PyTorch via CPU/MPS shim. With it, an M4 Max becomes genuinely competitive for 27B–32B inference workloads.

2. LM Studio — The Model Manager

What it is. A desktop GUI for browsing, downloading, and serving local models. OpenAI-compatible HTTP server included.

What it replaced. The "is this model the right one for this task" problem. In the API world you A/B test against a frontier model and call it done. Locally, you have to physically pick a checkpoint, quantize it, load it, test it. LM Studio collapses that loop to minutes.

The rough edge. It is closed-source. If you do not want a closed-source binary in your sovereign stack, the open alternative is ollama — which I also run, for headless server work. LM Studio is the desktop happy path. Ollama is the production path.

The pattern: pick a model with LM Studio, prove the quality, then bake it into an Ollama-served instance for everything else to talk to.

3. Ollama — The Server Layer

What it is. A daemon that loads and serves local models with a small API surface. Pulls models from a centralized registry, manages them like Docker images.

What it replaced. A whole category of self-hosted LLM-server boilerplate. Pre-Ollama, you wrote a FastAPI shim around llama.cpp and prayed your error handling held up. Post-Ollama, you ollama run <model> and it just works.

The rough edge. Token throughput is slightly lower than a hand-rolled MLX server for the same model. The convenience tax is real but small. For most workloads it does not matter.

Ollama runs on a different port from LM Studio, which means I can have both up at the same time — LM Studio for exploration, Ollama for everything that needs to be always-on.

4. ComfyUI — The Image Generation Loop

What it is. A node-graph UI for diffusion model workflows. SDXL, Flux, Stable Diffusion 3 — all of it runs locally on Apple Silicon now, well enough for production work.

What it replaced. Midjourney subscription. DALL-E API calls. Most of the "I need an image for this thing" credit-card moments.

The rough edge. The node graph is intimidating for the first week. After that it is liberating — you have access to the entire diffusion pipeline at every stage and can build workflows that are not possible in any closed product.

For brand and music-cover work I use ComfyUI for everything now. The output quality at SDXL with the right LoRAs and a decent sampler is competitive with the cloud products for my use cases, and the iteration speed is dramatically faster because there is no queue.

5. Whisper.cpp — The Transcription Layer

What it is. A C++ port of OpenAI's Whisper speech recognition model, optimized for CPU and Apple Silicon.

What it replaced. The OpenAI Whisper API and the various transcription SaaS products. Otter, Rev, Descript's transcription tier — all the tools I used to lean on for "turn this audio into text."

The rough edge. The largest Whisper model takes significant memory and is not faster than real-time on a MacBook Air. On an M4 Max the largest model runs faster than real-time and the quality is API-tier. The hardware gradient matters here more than for text models.

Whisper.cpp gets used daily — voice notes to text, podcast transcripts, music-session reference recordings, the works. The API meter that used to run on this is silent now.

6. Cloudflare Tunnel — The Reach Layer

What it is. Cloudflare's named-tunnel product. Lets you expose a local port to the public internet through Cloudflare's edge, with routing rules per hostname, no NAT punching required.

What it replaced. Renting servers somewhere just so the internet could reach my services. Three of my domains today route to local processes on this Mac through one Cloudflare tunnel.

The rough edge. You are still depending on Cloudflare. Sovereign-purist would say a Cloudflare tunnel is not sovereign. Practically, the alternative is running my own edge — which is a multi-month project I am not willing to commit to right now. The trade-off is acceptable.

This is the tool that turns a local Mac into a publicly reachable platform. Without it, sovereign infrastructure is a single-machine experiment. With it, sovereign infrastructure is internet-facing production infrastructure that happens to live on a Mac in Las Vegas.

7. The Glue — A Custom FastAPI Shim

What it is. A few hundred lines of Python that put a single OpenAI-compatible HTTP surface in front of MLX, Ollama, ComfyUI, and Whisper. Routes a single request to the right downstream tool based on the model name.

What it replaced. The cognitive load of remembering which port serves what. With one shim, every internal tool talks to a single endpoint and the shim figures out where the request actually goes.

The rough edge. I wrote it. It is bespoke. It has the fragility profile of any custom code at the seam between four different upstream tools. When one of the upstreams changes its API, the shim breaks until I fix it.

The shim is not optional. Sovereign AI is a fleet of tools, not a single tool. Some piece of glue has to make the fleet feel like one product. You can use a third-party glue (LiteLLM, OpenRouter-compatible proxies) or you can write your own. I write my own because the sovereign principle says the load-bearing piece is mine.

What This Stack Does Not Do

Every honest infrastructure post needs this section.

This stack does not match a frontier API model on the hardest reasoning tasks. The frontier is still the frontier. For the fraction of my workload where a 27B local model produces a noticeably worse result, I still hit the cloud — and that is fine, because that fraction is small.

This stack does not have automatic horizontal scaling. If I needed to serve a thousand concurrent users tomorrow I would still go cloud, or build a multi-machine cluster, which is its own project. The current stack scales to me, my agents, and a small number of trusted users — which is the right scope for a sovereign personal stack.

This stack is not trivial to set up. The tooling is mature in 2026 but it still rewards an operator who is willing to read the docs and debug. If you want a one-click sovereign AI experience, the tooling is not there yet. Maybe in 2027.

What I Wish I Had Known Earlier

Start with one model and one tool. Get one workload off the cloud. Prove to yourself it works. Then add the next tool.

I tried to stand up the entire stack in a weekend the first time and fried two days debugging cross-tool issues. The right path is one tool at a time, one workload at a time, until each piece is boringly reliable.

That is how sovereign AI actually gets built — not as a single architectural moment, but as a slow migration of workloads from the meter to the metal, one rough edge at a time.

FAQ

Do I need all seven tools?

No. The minimum sovereign stack is one inference engine (MLX or Ollama) and a glue layer to point your apps at it. Everything else is additive based on the workloads you want to repatriate from the cloud.

How much storage do I need for this?

Plan on 200–500 GB just for model weights once you start running multiple models for different jobs. The 27B and 32B 4-bit checkpoints are 15–20 GB each. Add Whisper, add a few diffusion models, add LoRAs — it adds up fast.

What about Linux?

Most of this stack works on Linux too, but the trade-off shifts. On Linux you get faster GPU inference and a worse desktop tooling story. On Mac you get a better desktop tooling story and slower (but still good) inference. Choose based on whether your workflow is "operator at a desktop" or "headless server in a closet."

Is the cost savings really worth the complexity?

Honest answer: only if your daily inference load is non-trivial. If you spend $50/month on inference APIs, the breakeven on the hardware and the operator time is a long way out. If you spend $500/month, the math gets very obvious. Above that, the sovereign stack pays for itself within a quarter.