Running LLMs locally is mostly a memory-bandwidth problem. In single-user inference the GPU (or Apple Silicon chip) is almost always waiting on memory: every token requires the model's weights to be pulled across the bus. So if you know the bandwidth of the memory the weights live in, and how big the weights are, you can get a decent estimate of tokens/sec before spending anything.

The calculator below does exactly that. Pick your hardware tiers, the model you want to run, and the quantization. It tells you roughly how fast it'll go.

[Interactive calculator. Inputs: memory tiers (primary: GPU VRAM or Apple unified memory; system RAM, where the model falls back if it doesn't fit above; SSD, for last-resort streaming from disk), the model, and the context length / KV cache per token (Llama 3 70B fp16 ~320 KB/token, 8B ~128 KB, Mistral 7B GQA ~64 KB). Outputs: estimated tokens/sec and memory footprint.]
How this is calculated

LLM inference on a single user is almost always memory-bandwidth bound. Every generated token requires the model's weights plus the full KV cache to be read from memory. So the speed ceiling is roughly bandwidth ÷ bytes read per token.
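The headline formula is a one-liner. The numbers below are illustrative, not from the calculator: ~400 GB/s is in the range Apple quotes for an M2 Max-class chip, and the byte counts assume a 7B model at int4 plus a modest KV cache read.

```python
# bandwidth ÷ bytes read per token = speed ceiling (tokens/sec).
# All figures illustrative: ~400 GB/s memory, 7B int4 weights (~4 GB)
# plus ~0.5 GB of KV cache read every decode step.
bandwidth = 400e9                   # bytes/sec
bytes_per_token = 4e9 + 0.5e9       # weights + KV cache, in bytes
print(bandwidth / bytes_per_token)  # ≈ 89 tokens/sec
```

Everything else in the calculator is bookkeeping around those two quantities.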

Weights bytes per token: full model size for a dense model, active parameter size for MoE (e.g. Mixtral 8x7B: 47B total, 13B active). Model size = params × bytes-per-param (int4 = 0.5, int8 = 1, fp16 = 2, etc.).
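As a sketch (names and structure are mine, not the calculator's actual code), the weight-bytes arithmetic looks like this:

```python
# Bytes of weights read per generated token. Dense models read every
# parameter; MoE models read only the active experts.
BYTES_PER_PARAM = {"int4": 0.5, "int8": 1.0, "fp16": 2.0}

def weight_bytes_per_token(total_params_b, quant, active_params_b=None):
    # For MoE, pass the active parameter count (e.g. Mixtral 8x7B:
    # 47B total, 13B active). Dense models omit it.
    read_b = active_params_b if active_params_b is not None else total_params_b
    return read_b * 1e9 * BYTES_PER_PARAM[quant]

# Mixtral 8x7B at int4: only the 13B active params are read per token.
print(weight_bytes_per_token(47, "int4", active_params_b=13) / 1e9)  # 6.5 (GB)
```

Note that total size (47B × 0.5 bytes ≈ 23.5 GB) still determines whether the model *fits*; the active size only determines how fast it *runs*.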

KV cache is read in full every token regardless of MoE. Total KV cache = context length × KV cache per token. Typical values for modern GQA models: Llama 3 70B fp16 ~320 KB/token, Llama 3 8B ~128 KB, Mistral 7B ~64 KB, Qwen 32B ~64 KB. Halve for fp8 quantization.
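The KV cache read scales linearly with context, so it can dominate at long contexts. A minimal sketch, using the per-token figures above:

```python
# Total KV cache read every decode step. kv_kb_per_token depends on
# the model's architecture (GQA models are much smaller per token).
def kv_cache_bytes(context_tokens, kv_kb_per_token, kv_quant_factor=1.0):
    # kv_quant_factor: 1.0 for fp16, 0.5 for fp8, 0.25 for int4.
    return context_tokens * kv_kb_per_token * 1024 * kv_quant_factor

# Llama 3 70B at fp16 with an 8192-token context: ~2.7 GB per token,
# on top of the weight read.
print(kv_cache_bytes(8192, 320) / 1e9)
```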

Allocation follows llama.cpp's default: the KV cache is pinned to primary memory, and weights fill whatever's left, spilling to system RAM then SSD. This means a 70B dense model whose weights spill to RAM will still have attention running at GPU speed, with only the FFN layers paying the RAM penalty. For MoE, only active experts are read per token, so weights in RAM hurt far less than they would for a dense model of the same total size. With Apple unified memory, there is no separate system RAM tier.
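The tiered-read model can be sketched as follows. This is my reconstruction of the allocation rule described above, with illustrative bandwidth figures, not the calculator's actual code:

```python
# Time per token = KV cache read from the primary tier, plus each
# tier's share of the weights divided by that tier's bandwidth.
def tokens_per_sec(weight_bytes, kv_bytes, tiers):
    # tiers: list of (capacity_bytes, bandwidth_bytes_per_sec),
    # fastest first. KV cache is pinned to the primary tier; weights
    # fill primary first, then spill down the list.
    seconds = kv_bytes / tiers[0][1]
    remaining = weight_bytes
    for i, (capacity, bandwidth) in enumerate(tiers):
        free = capacity - kv_bytes if i == 0 else capacity
        take = min(remaining, max(free, 0.0))
        seconds += take / bandwidth
        remaining -= take
    return 1.0 / seconds

# 8B fp16 (16 GB weights) fully resident in 24 GB VRAM at ~1 TB/s,
# ignoring KV cache: 1e12 / 16e9 = 62.5 tok/s ceiling.
print(tokens_per_sec(16e9, 0, [(24e9, 1e12)]))
```

Running the same function with a spilled 70B fp16 model (140 GB of weights against 24 GB VRAM plus system RAM at ~100 GB/s) shows the effect described above: the 22 GB that fits in VRAM is nearly free, and the RAM-resident remainder dominates the per-token time.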

Ignored: prompt processing (prefill), compute-bound behaviour on very small models, and any software-side overhead. Real-world numbers are typically 60–90% of the theoretical ceiling. Treat the output as an optimistic upper bound.

A few things worth saying about the output:

  • It's an upper bound. Real inference stacks (llama.cpp, MLX, vLLM) hit 60–90% of the theoretical ceiling. Treat the number as "best case".
  • KV cache is pinned to primary memory. This matches llama.cpp's default. The KV cache always lives on the fastest tier, and weights fill whatever capacity is left, spilling to RAM then SSD as needed.
  • MoE gets a big break when spilling. Because only a fraction of weights are read per token, a big MoE model with weights in system RAM is a lot more usable than a dense model of the same total size. Run the numbers before writing off a config.
  • KV cache estimates are rough. The default of 100 KB/token is typical for modern GQA models. Llama 3 70B fp16 is ~320 KB, Llama 3 8B ~128 KB, Mistral 7B ~64 KB, Qwen 32B ~64 KB. Halve for fp8, quarter for int4.
  • Prompt processing (prefill) is different. Prefill is compute-bound, not memory-bound, so it's roughly flat with model size and scales with context length. The calculator only models generation (decode).
  • Bandwidth numbers are nominal. Apple publishes peak bandwidth but sustained is typically 80–90% of that. I've used the published peaks; adjust mentally.