How a Smaller Quant Made My Local LLM 2.5x Faster: Fitting DeepSeek-R1 Entirely on a 4GB GPU

I run my reasoning model on a 4GB AMD GPU. Not a typo. A consumer card with less video memory than a mid-range phone has RAM, serving a 7-billion-parameter DeepSeek-R1 distill as the local brain behind a stack of automations. For months I treated that 4GB as a hard ceiling — too small for anything serious, so I leaned on bigger quants and let them spill onto the CPU.

That was the mistake. The fix was counter-intuitive enough that I want to write it down, because it cuts against the instinct every operator has: use the highest-precision quant you can fit. On a small GPU, that instinct is wrong. The thing that actually moves throughput is not precision. It is whether the entire model lives on the GPU or not.

The setup

The reasoning surface is a Windows box I call CyberPower, serving DeepSeek-R1-Distill-Qwen-7B over an OpenAI-compatible API on port 1234. Under the hood it is Docker llama.cpp (llama-server.exe) wrapped in an NSSM Windows service so it survives reboots and respawns within seconds if it dies. Single model, reboot-survivable, no babysitting.

The GPU is a 4GB AMD card driven through Vulkan. That number governs everything. A 7B model at Q4_K_M is roughly 4.4GB on disk — already larger than the card. So llama.cpp does the reasonable thing: it puts as many transformer layers as fit on the GPU (-ngl) and runs the rest on the CPU. That split is the silent killer.

The numbers nobody tells you to measure

Here is the throughput ladder I climbed, all on the same hardware, same model family, measured in tokens per second on a real multi-step reasoning prompt:

12.6 tok/s — baseline, Q4_K_M, only 7 GPU layers
15.9 tok/s — Q4_K_M, 16 GPU layers
17.8 tok/s — Q4_K_M, 20 GPU layers
31.9 tok/s — IQ3_XS, all 29 of 29 layers on the GPU

That last jump is +153% over baseline and +79% over the best Q4_K_M config — and it came from switching to a smaller, lower-precision quant. The IQ3_XS imatrix build of the same model is 3.19GB. With KV cache quantized, the whole thing fits inside 4GB with room to spare: Vulkan reports the model at 2962MB and the KV cache at 238MB, about 3.2GB total. Every layer runs on the GPU. There is no CPU half of the model dragging each token through system RAM and back across the PCIe bus.

The bottleneck was never the arithmetic precision. It was the CPU-offloaded layers. Kill the split, and the slowest part of the pipeline disappears.

The exact config

The launch flags that got me there:

-c 8192 -fa on -ctk q8_0 -ctv q8_0 --cache-ram 2048 -ngl 99

-ngl 99 is "put every layer on the GPU you can" — which, because the smaller quant fits, means all 29. Flash attention (-fa on) and an 8-bit KV cache (-ctk q8_0 -ctv q8_0) shrink the cache enough that an 8192-token context still fits the budget. The KV cache is where context length quietly eats VRAM; quantizing it to q8_0 is what let me keep 8K of context and still fit the model.

Because llama.cpp here is single-model, it ignores the model field in the request entirely. You change the served model by reconfiguring the service (nssm set LlamaServer AppParameters ... and restart), not by passing a different name in the API call. Worth knowing before you spend an hour wondering why your model: "gpt-4" request is answering as DeepSeek.

Did quality survive?

This is the part you have to actually check, not assume. IQ3_XS is a 3-bit imatrix quant — aggressive. The whole point of going smaller was speed, and speed is worthless if the model gets dumber.

I ran it against five multi-step reasoning problems with known answers: arithmetic chains, a relative-speed train problem, a cats-and-mice logic puzzle. Five for five correct. The IQ3 build held. The one caveat I'd flag: a 3-bit R1 can occasionally over-reason on adversarial trick questions — it spirals in its chain-of-thought. That is manageable, and I'll get to it. The reasoning quality on real work did not regress.

I did not delete the Q4_K_M file. It still sits on disk as a one-command revert if I ever decide a specific job needs the higher-precision build. That is the discipline: the faster config is the default, the quality-max config is one nssm set away. Reversible beats clever.

The R1 gotcha that will waste your afternoon

DeepSeek-R1 emits its chain-of-thought into a separate reasoning_content field, and the clean answer into content. If you set max_tokens too low, the model spends the entire budget thinking and content comes back empty. You get a 200 response, no error, and a blank answer.

The fix is to keep max_tokens at 2000 or higher for any consumer of this API. I had this trap in six different scripts before I tracked it down. Budget for the reasoning tokens or the model looks broken when it is working perfectly.

Vision lives somewhere else

This box is text-only now. The vision surface — age-gating, image tagging — runs separately on a Mac using MLX on Apple Silicon, serving DeepSeek-VL2-small on port 9445. Two machines, two DeepSeek models, one principle: everything local, $0 marginal cost, nothing leaves the building. No cloud API, no per-token bill, no rate limit but my own hardware.

That separation matters. The reasoning model wants every byte of GPU for the model weights; bolting a vision model onto the same card would force the split I just spent this whole post eliminating. Right tool, right machine.

The operator takeaway

If you run local models on constrained hardware, measure tokens per second before you trust your intuition about quants. The mental model "bigger quant = better" is built for machines where the model fits the GPU either way. On a small card, there is a cliff: the moment the model fits entirely in VRAM, throughput jumps in a way no amount of precision tuning on a spilled model will match.

So the real lever is not "what is the best quant my card can technically load." It is "what is the largest quant that fits entirely on the GPU with my context budget." Find that, run -ngl 99, quantize the KV cache, and verify quality on held-out problems. A 4GB card running a 7B reasoning model at 32 tokens a second, for nothing, is the kind of result that makes the cloud bill look optional.

FAQ

Why is a smaller quant faster than a larger one?

Because the larger quant did not fit entirely on the 4GB GPU, so llama.cpp ran some layers on the CPU. Every token then had to cross between GPU and system RAM, and the CPU layers became the bottleneck. The smaller IQ3_XS quant fits the whole model on the GPU, eliminating the split. The speed came from full GPU residency, not from the precision change — precision dropped and throughput still more than doubled.

Does the lower-precision quant hurt answer quality?

In my testing, no — IQ3_XS scored 5/5 on multi-step reasoning problems with known answers. The one caveat is that a 3-bit R1 distill can occasionally over-reason on adversarial trick questions. I keep the higher-precision Q4_K_M build on disk as a one-command revert in case a specific workload needs it, but the fast build is the default.

What does the KV cache quantization actually buy me?

The KV cache grows with context length and competes with model weights for VRAM. Quantizing it to 8-bit (-ctk q8_0 -ctv q8_0) shrinks it enough that an 8192-token context plus the full model still fits inside 4GB — about 3.2GB total. Without it, you would have to cut context length or drop GPU layers, both of which slow you down.

Why does my model field get ignored in API requests?

This llama.cpp setup serves a single model, so it ignores the model parameter in the request body entirely and answers with whatever model the service was launched with. To switch models you reconfigure the service launch parameters and restart it, not change the request. If you are getting DeepSeek answers from a model: "gpt-4" call, this is why.