웹3 솔라나 광기 연구실
03-25
[Attention If You Run a Local LLM: Google Research Announces TurboQuant]

AI models use something called a "KV cache" during conversation. Simply put, it is a notepad the AI uses to quickly look back at content it has already read. As the conversation gets longer, this cache grows with it and eats up GPU memory, which is why long contexts demand expensive GPUs. (This is slightly different from tokens: the cache grows along with the token count, but it is temporary data that lives only in memory and disappears when the session restarts.)

TurboQuant is a compression algorithm that shrinks this notepad more than sixfold while maintaining zero loss of accuracy, and it also runs up to eight times faster. That is a tremendous efficiency gain. There have been many attempts at this before, but no matter how aggressive the compression, the schemes required additional memory (overhead), which made them hard to deploy. TurboQuant's innovation is that it eliminates that additional memory itself through mathematical tricks (converting vectors to polar coordinates plus 1-bit error checking).

The upshot:
- Longer conversations become possible on the same GPU
- AI service operating costs go down
- Larger context windows can be used in local models

A person named Prince, who works on MLX (Apple's machine-learning framework for running local LLMs, comparable in purpose to Ollama), implemented this directly on MLX and tested it. The results:

Test method: needle-in-a-haystack test with the Qwen3.5-35B-A3B model (8.5K, 32.7K, and 64.2K contexts)
- 6 out of 6 correct answers (at all quantization levels)
- TurboQuant 2.5-bit: KV cache reduced by 4.9x
- TurboQuant 3.5-bit: KV cache reduced by 3.8x
- 0 loss in accuracy (unbelievable...)

I actually run Qwen 27b on a 64GB Mac Mini via Ollama myself. To be precise, I was running the Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled model. While the distilled model performed better and felt snappier than running Qwen raw, the speed was still frustratingly slow. If the KV cache really shrinks 4 to 5 times as calculated, the same 64GB of RAM could go from a 32k context window to 100k+, and could potentially run models larger than the one currently in operation. If you were planning to run local models on a Mac Mini, or are currently doing so, this is certainly news worth noting.

More details and sources
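The memory arithmetic behind numbers like these can be sketched in a few lines. The formula below is the standard per-token KV cache cost for a decoder-only transformer; the model dimensions (layers, KV heads, head size) are illustrative assumptions, not the actual Qwen configuration.

```python
# Rough KV-cache size estimate for a decoder-only transformer.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    # 2x for the separate K and V tensors; bits -> bytes at the end.
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8

# Hypothetical mid-size model: 48 layers, 8 KV heads, 128-dim heads.
ctx = 32_000
fp16 = kv_cache_bytes(ctx, 48, 8, 128, 16)
tq25 = kv_cache_bytes(ctx, 48, 8, 128, 2.5)

print(f"fp16 KV cache at {ctx} tokens: {fp16 / 2**30:.2f} GiB")
print(f"2.5-bit KV cache:              {tq25 / 2**30:.2f} GiB")
print(f"compression ratio:             {fp16 / tq25:.1f}x")
```

Note that 16 / 2.5 = 6.4x is the ideal ratio; the 4.9x measured in the post would include real-world overheads such as per-block scales and metadata.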
03-15
A fact many people don't know: conversing with AI in Korean is 50%–70% more expensive.

In English, roughly one word corresponds to one token: "Hello" is 1 token, and "artificial intelligence" is 2 tokens. Korean is a bit different. "안녕하세요" (hello) gets broken down into 2 or 3 tokens. Because Hangul composes syllables out of smaller letter blocks, it is structurally prone to using more tokens than English. Writing the same content in Korean consumes about 1.52 times more tokens than writing it in English. Since API costs are proportional to tokens, Korean ends up a whopping 50%–70% more expensive for the same content. And if you receive the AI's response in Korean as well, the output tokens are likewise 50%–70% more expensive.

Some people have pointed to research suggesting otherwise (arxiv.org/pdf/2507.00246), but that study only tested mathematics, and it completely excludes the models we frequently use, such as GPT and Claude. Those models receive RLHF based on English, so the results may differ. The models in that study are DeepSeek R1, Qwen 2.5, and Qwen 3, so they are all LLMs originating from China. Also, the premise that "fewer tokens = efficiency" is a bit problematic: even if thinking in Korean reduces the number of tokens, one Korean token costs more than one English token (byte count, processing cost).

So, in conclusion: if you use AI frequently and keep hitting the rate limit, I recommend just conversing in English and treating it as a form of English practice for now haha.
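The byte-level reason is easy to demonstrate: most LLM tokenizers run byte-level BPE over UTF-8, and a Hangul syllable takes 3 bytes where an ASCII letter takes 1, so Korean text hands the tokenizer far more raw material to split. A minimal sketch (the 1.5x token ratio below is the figure from the post and the price is hypothetical, neither is measured here):

```python
# UTF-8 encodes each Hangul syllable block in 3 bytes, vs 1 byte per
# ASCII character, and byte-level BPE vocabularies are trained mostly
# on English text, so Korean tends to split into more tokens.

en = "Hello"
ko = "안녕하세요"  # the Korean greeting, 5 syllable blocks

print(len(en), len(en.encode("utf-8")))  # 5 chars -> 5 bytes
print(len(ko), len(ko.encode("utf-8")))  # 5 chars -> 15 bytes

# Cost sketch using the ~1.5x token ratio cited in the post
# (an assumption here, not a measurement):
tokens_en = 1000
tokens_ko = int(tokens_en * 1.5)
price_per_1k = 0.01  # hypothetical $ per 1k tokens
print(f"EN cost: ${tokens_en / 1000 * price_per_1k:.3f}")
print(f"KO cost: ${tokens_ko / 1000 * price_per_1k:.3f}")  # 50% more
```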