[Attention If You Run a Local LLM: Google Research Announces TurboQuant]
AI models use something called a "KV cache" during a conversation. Simply put, it is a notepad the model uses to quickly look back at content it has already read; as the conversation gets longer, this cache grows and can consume all of the GPU memory, which is why long contexts demand expensive GPUs. (The cache is related to, but distinct from, tokens: it grows along with the token count, but it is temporary data that lives only in memory and disappears when the session restarts.)
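To see why the cache balloons with context length, here is a minimal back-of-envelope sketch. The architecture numbers (32 layers, 8 KV heads of dimension 128, fp16) are assumed for illustration and do not describe any specific model:

```python
# Rough KV cache size for a transformer. All architecture figures below
# are hypothetical, chosen only to illustrate how the cache scales.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one sequence (fp16 by default).
    The leading 2 accounts for storing both keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Growth with context length for the assumed 32-layer / 8-head / dim-128 model:
for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB")
# prints 1.00, 4.00, and 16.00 GiB respectively
```

The cache grows strictly linearly with context, so a 16x longer conversation needs 16x the memory — exactly the pressure TurboQuant targets.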
TurboQuant is a compression algorithm that shrinks this notepad by more than six times with zero loss of accuracy, and it also delivers up to an eight-fold speedup. That is a tremendous gain in efficiency.
There have been many attempts at this before, but no matter how aggressively the cache was compressed, extra memory (overhead) was still required, which made real deployments difficult. TurboQuant's innovation is that it eliminates that extra memory entirely through mathematical tricks (converting vectors to polar coordinates plus 1-bit error checking).
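As a rough intuition for why a vector transform helps low-bit quantization, here is a generic toy experiment. This is NOT TurboQuant's actual math (the post mentions polar coordinates and a 1-bit correction); it only shows the well-known effect that a random orthogonal rotation spreads outlier coordinates out, so a crude uniform quantizer destroys less information:

```python
import numpy as np

# Toy illustration (not TurboQuant): rotate-then-quantize vs. quantize directly.
rng = np.random.default_rng(0)

def quantize(v, bits):
    """Crude uniform symmetric quantizer over the vector's own range."""
    scale = np.abs(v).max()
    if scale == 0:
        return v
    levels = 2 ** (bits - 1) - 1          # symmetric integer range
    return np.round(v / scale * levels) / levels * scale

d = 128
v = rng.standard_normal(d)
v[0] = 20.0                               # one outlier coordinate ruins the scale

# Random orthogonal rotation via QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

err_plain = np.linalg.norm(v - quantize(v, 3))
err_rot   = np.linalg.norm(v - Q.T @ quantize(Q @ v, 3))
print(err_plain, err_rot)   # the rotated version typically has lower error
```

The real algorithm reportedly goes further by removing the side-information overhead that schemes like this normally carry, which is the part the post calls innovative.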
The upshot is:
- Longer conversations become possible on the same GPU
- AI service operating costs are reduced
- Larger context windows can be used in local models
An engineer named Prince, who works on MLX (Apple's machine learning framework, used to run local LLMs much as Ollama does), implemented this directly on MLX and tested it, with the following results.
Test method: Needle-in-a-Haystack test with Qwen3.5-35B-A3B model (8.5K, 32.7K, 64.2K contexts)
- 6 out of 6 correct answers (at all quantization levels)
- TurboQuant 2.5-bit: KV cache reduced by 4.9x
- TurboQuant 3.5-bit: KV cache reduced by 3.8x
- 0 loss in accuracy (Unbelievable...)
I actually run Qwen 27B myself on a 64GB Mac Mini via Ollama; to be precise, I was running the Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled model. The distilled model performed better and felt snappier than running Qwen raw, but the speed was still frustratingly slow.
If the KV cache really shrinks by 4 to 5 times, the same 64GB of RAM could go from a 32k context window to well over 100k tokens, and might even run a larger model than the one currently in operation.
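That back-of-envelope estimate can be sketched directly, using the reduction ratios reported above (this ignores the memory held by the model weights themselves, so it is an upper bound on the gain):

```python
# If the KV cache shrinks by a given factor, the memory that used to hold
# the base context now holds proportionally more tokens. Illustrative only.

def scaled_context(base_ctx_tokens, compression_ratio):
    """Context length that fits in the memory the base context used."""
    return int(base_ctx_tokens * compression_ratio)

print(scaled_context(32_768, 4.9))   # ~160k tokens at the reported 2.5-bit ratio
print(scaled_context(32_768, 3.8))   # ~124k tokens at the 3.5-bit ratio
```

Both ratios comfortably clear the 100k mark from a 32k starting point, which is where the estimate in the text comes from.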
If you were planning to run local models on a Mac Mini, or are currently doing so, this is certainly news worth noting.
More details and sources