Expert-aware quantisation: near-Q4 quality at near-Q2 size?

While researching and writing my last article on the history of KV cache compression, it occurred to me that while so much research on KV cache efficiency has actually been implemented, quantisation of the model weights themselves is still pretty blunt.

This makes sense - at large scale with many tens of thousands of GPUs the weights themselves aren't a huge efficiency bottleneck for the most part, and KV cache starts dominating memory usage.

But for us lowly serfs who don't have access to a warehouse full of HBM, it is a problem. The key constraint for local models is (mostly) just fitting the weights into memory that's fast enough.

Profiling

I spend a lot of time profiling applications to improve their performance, and a couple of months ago I built a tool to do the same for MoE models.

This got me thinking. What if instead of just quantising the entire model to a certain level - the blunt hammer I mentioned - we instead profile the model first and then quantise the "cold" experts selectively, for that specific set of tasks?

For this research I profiled Qwen3.6 35B-A3B on C++ programming tasks. There's an important nuance worth flagging up front: this particular model is very well load-balanced, so when it's reading code it spreads the work across its experts almost evenly (a per-layer Gini coefficient near 0 - basically uniform). The selectivity only shows up when it's generating code.

And there it concentrates hard. Running a handful of C++ generation prompts through, the per-layer Gini coefficient jumps to 0.61 - meaning the top 32 of the 256 experts handle ~50% of the routing during code generation, versus the 12.5% you'd expect at random. That concentration is exactly what we can exploit: if only a subset of experts really matter for the task, we can keep those at high precision and crush the rest.
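To make those two metrics concrete, here is a minimal Python sketch of how a per-layer Gini coefficient and a top-k routing share can be computed from expert hit counts. The function names and the toy counts are illustrative assumptions, not the profiler's actual code or measured data.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 = perfectly even, near 1 = one expert does everything."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values with 1-based ranks.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def top_k_share(counts, k):
    """Fraction of all routing decisions handled by the k busiest experts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(sorted(counts, reverse=True)[:k]) / total


# Toy layers with 256 experts each (made-up numbers, not measured data).
uniform = [100] * 256                 # every expert hit equally often
skewed = [700] * 32 + [100] * 224     # 32 busy experts, 224 quiet ones

print(gini(uniform), top_k_share(uniform, 32))
print(gini(skewed), top_k_share(skewed, 32))
```

In the uniform toy layer the top 32 experts take exactly the 12.5% you'd expect, and the Gini coefficient is 0. The skewed toy layer is built so its top 32 take half the routing. It is only a two-level distribution, so its Gini comes out lower than the measured 0.61, which points to a longer tail of barely-used experts in the real trace.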

Once we've got these traces showing which experts are hot (used a lot for the specific domain) vs cold (rarely used), we can then move on to the next step.

Model neurosurgery

This took (Claude Code) a fair while - ironically I suspect Fable would have been perfect for this kind of task.

The core idea was to allow llama.cpp to read different levels of quantisation per expert, which had a fair few issues. Eventually though, it figured it out (running autonomously for a good 90 minutes!).

It also wrote a script to take the profiling data and do quantisation per expert.
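The actual script isn't reproduced here, but the core of that step can be sketched in a few lines of Python. Everything in this sketch is an assumption for illustration: the trace format (a list of layer/expert pairs), the choice to pick the hot set per layer, the function name `build_quant_plan`, and the "Q4"/"Q2" labels, which are placeholders rather than real llama.cpp type names.

```python
from collections import Counter


def build_quant_plan(routing_trace, num_experts, hot_count, hot_type="Q4", cold_type="Q2"):
    """Turn a routing trace into a per-layer, per-expert quantisation plan.

    routing_trace: iterable of (layer, expert_id) pairs, one per routing decision.
    Returns {layer: [quant label for expert 0, expert 1, ...]}.
    """
    per_layer = {}
    for layer, expert in routing_trace:
        per_layer.setdefault(layer, Counter())[expert] += 1

    plan = {}
    for layer, counts in per_layer.items():
        # Rank experts by hit count; ties go to the lower expert id so the plan is deterministic.
        ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
        hot = set(ranked[:hot_count])
        plan[layer] = [hot_type if e in hot else cold_type for e in range(num_experts)]
    return plan


# Tiny made-up trace: 2 layers, 8 experts, keep the 2 busiest experts per layer sharp.
trace = [(0, 3), (0, 3), (0, 5), (0, 3), (0, 5), (0, 1),
         (1, 0), (1, 0), (1, 7), (1, 7), (1, 7), (1, 2)]
plan = build_quant_plan(trace, num_experts=8, hot_count=2)
for layer, labels in sorted(plan.items()):
    print(layer, labels)
```

Picking the hot set per layer rather than globally is a design choice in this sketch: routing is decided independently in each MoE layer, so an expert index that is hot in one layer says nothing about the same index in another.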

Results

All numbers below are perplexity (lower is better) measured on CPU. "Reading code" is a held-out chunk of real C++ source; "writing code" is a set of the model's own C++ generations. The tiered models keep a "hot" set of 64 experts (out of 256) at high precision and drop the other 192 "cold" experts to 2-bit.

Model	Size	PPL - reading	PPL - writing
Q8 (baseline)	35GB	1.568	n/a *
uniform Q4	20GB	1.582	1.449
Q8-hot / Q2-cold (profiled)	18GB	1.620	1.456
Q8-hot / Q2-cold (random)	18GB	1.667	1.492
Q4-hot / Q2-cold (profiled)	14GB	1.635	1.477
Q4-hot / Q2-cold (random)	14GB	1.684	1.511
uniform Q2	13GB	2.103	1.977

* The "writing code" eval was generated by the Q8 model, so scoring Q8 against it would be circular - it's left out.

A few things jump out.

First, the baseline. Full-fat Q8 (35 GB) scores 1.568 reading C++, and a "blunt" Q2 quantisation of everything (13 GB) jumps to 2.103 - a big drop in quality for less than half the size. (Perplexity is roughly "how surprised the model is at each token" - so going from 1.57 to 2.10 is the model getting noticeably dumber, not lobotomised, but clearly worse.)

Now the actual experiment. I A/B tested the tiered approach two ways: random - pick the hot experts arbitrarily, as a control - versus profiled - keep the experts our profiling flagged as hot for C++ and crush the cold ones. The profiled version wins every single time: across two precision tiers and both eval sets, that's four out of four. With Q8 hot / Q2 cold (18 GB), random tiering scores 1.667 while the profiled version recovers nearly half of that gap back towards Q8, landing at 1.620. So the core idea works - which experts you protect matters, and the profile tells you which ones.

But here's the catch I have to be honest about: uniform Q4 is really good. On code, 4-bit is almost lossless - Q4 (20 GB) scores 1.582, basically tying Q8. So the fancy Q8-hot/Q2-cold model, despite all the cleverness, doesn't actually beat just using Q4 everywhere at a similar size.

The win shows up when you go smaller than Q4. I built a Q4-hot / Q2-cold version - 4-bit for the hot experts, 2-bit for the cold ones - which comes in at 14 GB, just 1 GB more than the blunt Q2 model. And it scores 1.635 reading and 1.477 writing - recovering ~90% of the quality gap between Q2 and Q4 for that single extra gigabyte. That's the real result: near-Q4 quality at near-Q2 size, by spending your bits on the experts that actually matter for the task.
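For anyone who wants to check the "gap recovered" arithmetic, here is a small Python sketch using the perplexities from the results table. The helper name `gap_recovered` is just an illustrative choice.

```python
def gap_recovered(worse, better, candidate):
    """Share of the perplexity gap between `worse` and `better` that `candidate` closes."""
    return (worse - candidate) / (worse - better)


# Perplexities copied from the results table: (reading, writing).
ppl = {
    "uniform Q4": (1.582, 1.449),
    "Q4-hot / Q2-cold (profiled)": (1.635, 1.477),
    "Q4-hot / Q2-cold (random)": (1.684, 1.511),
    "uniform Q2": (2.103, 1.977),
}

for name in ("Q4-hot / Q2-cold (profiled)", "Q4-hot / Q2-cold (random)"):
    reading, writing = (
        gap_recovered(ppl["uniform Q2"][i], ppl["uniform Q4"][i], ppl[name][i])
        for i in range(2)
    )
    print(f"{name}: reading {reading:.0%}, writing {writing:.0%}")
```

On the table's numbers the profiled model closes about 90% of the Q2-to-Q4 gap on reading and about 95% on writing, while the random control closes about 80% and 88%.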

Conclusion

This is absolutely nowhere near production ready and needs a lot of work from someone who knows the llama.cpp codebase far better than I do. I only ran this on CPU, which is (very) slow, my eval sets are small, and there's no doubt the vibe-coded implementation Claude came up with could be improved further.

There are loads of interesting angles to keep researching here. I tried a couple of tiers (Q8/Q2 and Q4/Q2), but there's no reason you couldn't go further - pushing the cold experts down to sub-2-bit (IQ1/IQ2) would drop the model below the uniform-Q2 size while keeping the hot experts sharp. You could imagine a whole gradient: hottest experts at high precision, then incrementally more aggressive quantisation as experts get colder.
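As a rough illustration of what such a gradient might look like, the Python sketch below maps an expert's hotness rank to a bit width and compares the average bits per expert weight for two layouts. The four-step tier sizes are hypothetical, and the bit widths are nominal: real GGUF quant types carry extra scale and metadata overhead, so actual file sizes would come out higher.

```python
def tier_for_rank(rank, tiers):
    """Bit width for an expert, given its hotness rank (0 = hottest).

    tiers: list of (size, bits) pairs ordered from hottest to coldest.
    """
    upper = 0
    for size, bits in tiers:
        upper += size
        if rank < upper:
            return bits
    raise ValueError("rank falls outside the tiers")


def mean_bits(tiers):
    """Average nominal bits per expert weight, assuming all experts are the same size."""
    total = sum(size for size, _ in tiers)
    return sum(size * bits for size, bits in tiers) / total


two_tier = [(64, 4), (192, 2)]                     # the Q4-hot / Q2-cold split from the results
gradient = [(16, 8), (48, 4), (64, 2), (128, 1)]   # hypothetical four-step gradient

print(mean_bits(two_tier), mean_bits(gradient))
print([tier_for_rank(r, gradient) for r in (0, 16, 64, 255)])
```

With these made-up tiers the gradient layout averages fewer nominal bits than the two-tier split while giving the 16 hottest experts more precision. Whether that trade actually helps perplexity is exactly what would need measuring.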

There's also the question of how many experts you keep sharp. I protected 64 of the 256 here, which turns out to be pretty generous - generous enough that even picking those 64 at random recovers around 80% of the Q2-to-Q4 gap by itself. That's less surprising than it first looks: an MoE layer's output is a weighted sum of its active experts, so keeping any quarter of them accurate anchors the result no matter which quarter you pick. Profiling buys the last ~10% by making sure the experts that actually fire are the protected ones. Where I'd expect it to really pull away is at small hot sets - keep only 16 experts sharp and random selection would mostly be protecting cold experts and fall apart, while the profile tells you exactly which handful matter. That's the experiment I'd run next: it should shrink the model and widen the profiled-vs-random gap, which is where this whole approach earns its keep.
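The intuition about small hot sets can be sketched numerically. The Python below compares how much routing traffic a profiled hot set captures against the expected capture of a random one, at a few hot-set sizes. The Zipf-like usage curve is a synthetic assumption rather than the measured trace, and routing share is only a proxy for quality, not a perplexity prediction.

```python
def profiled_coverage(weights, k):
    """Routing share captured when the k busiest experts are protected."""
    return sum(sorted(weights, reverse=True)[:k]) / sum(weights)


def random_coverage(weights, k):
    """Expected routing share captured by k experts picked uniformly at random."""
    # Each expert is protected with probability k / N, so the expected share is k / N.
    return k / len(weights)


# Synthetic Zipf-like usage over 256 experts (an assumption, not measured data).
weights = [1.0 / (rank + 1) for rank in range(256)]

for k in (16, 64, 128):
    p, r = profiled_coverage(weights, k), random_coverage(weights, k)
    print(f"hot set {k:3d}: profiled {p:.0%}, random {r:.0%}, ratio {p / r:.1f}x")
```

Under this assumed distribution the profiled-to-random ratio is largest at 16 experts and shrinks as the hot set grows, which is the shape of the argument above: the smaller the protected set, the more it matters which experts are in it.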

But in the end I think this is a really interesting approach. If we could get mass-scale profiling data from real-world llama.cpp executions, it may allow a really big jump in quality. I can see a world where the harness detects what domain the task is in, downloads a quantised model for that specific domain and then runs prompts through it. This takes advantage of the fact that storage is cheap (relatively speaking) and RAM is expensive, so having a bunch of different quantisation variants - of the same model - on disk is pretty doable.

Prior art

I should add that there is a fair bit of prior art in this space. The closest I found, DynaExq, does something very similar dynamically at serving time from router traces - but I couldn't find anyone doing it as a static, domain-profiled quantisation you ship as a single GGUF. Here are some links that I read up on while doing this:

Closest prior art (variable precision per expert):

DynaExq - a serving system with a "hotness-aware precision controller" that reads router traces to keep hot experts at higher precision and crush the cold ones, done dynamically at runtime.
Mixture-Compressor - folds expert activation frequency directly into a per-expert bit-width allocation.
MoPEQ - assigns per-expert bits by sensitivity and explicitly avoids using activation frequency to do it - a nice counterpoint to the approach here.

Foundational quantisation:

AWQ, GPTQ, SmoothQuant and SpQR - the lineage of protecting the salient weights and crushing the rest. SpQR in particular is basically the within-tensor version of what I'm doing across experts.
llama.cpp's importance matrix (imatrix) and k-quants - the imatrix already uses calibration-data importance to steer per-weight quantisation. This is really just pushing that same idea up to per-expert granularity.

On-device routing in the wild:

Apple's on-device and server foundation models - their ~3B on-device model uses a mixed 2-bit/4-bit scheme (~3.7 bits per weight) with LoRA adapters to claw back quality, then routes harder requests to the cloud. Not far off the "small quantised model for the task at hand" idea.

If you're working on this kind of optimisation I'd love to hear from you - please feel free to reach out on my contact page.