TurboQuant in LlamaMan - Squeezing More Context Out of the Same GPU

I know, I know - second LlamaMan post in a row but just hear me out...

If you've spent any time running large language models at home, you've eventually run into the same wall: you load up a model, throw a big prompt at it, and something dies. Sometimes it's you, waiting forever for a response from a larger model. Sometimes the software managing your LLM straight up crashes your system (Ollama is a great tool but idk why it does that to me sometimes). But most often it's the response quality that falls off a cliff because the context just doesn't fit. The weights loaded onto the GPU fine, but there's simply no room left for the conversation.

This is the KV cache problem. And TurboQuant is one of the more interesting attempts at solving it.


What Is the KV Cache, and Why Is It Eating Your VRAM

Every transformer model keeps a running record of attention keys and values as it processes your prompt. This is the KV cache. It's what allows the model to not re-read your entire prompt from scratch on every new token it generates. The longer your context, the bigger this cache grows, and on long conversations with big models this becomes the main VRAM consumer.
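To put rough numbers on that growth, here's a back-of-the-envelope sketch. The formula is the usual one (a key and a value vector per layer, per KV head, per token). The model shape is an assumption picked for illustration, roughly a Llama-2-13B-style layout, and the function name kv_cache_bytes is made up for this post, not something from llama.cpp.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value=16.0):
    """Approximate KV cache size: one K and one V vector per layer, per KV head, per token."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * n_ctx * bits_per_value / 8

# Illustrative shape only (assumed, not measured):
# 40 layers, 40 KV heads, head dimension 128, float16 cache.
GIB = 1024 ** 3
for n_ctx in (2048, 4096, 16384):
    size = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128, n_ctx=n_ctx)
    print(f"{n_ctx:>6} tokens -> {size / GIB:.2f} GiB")
```

The cache grows linearly with context, so under these assumptions quadrupling the context quadruples the cache. Models that use grouped-query attention have fewer KV heads and a smaller cache, so plug in your own model's numbers.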

The original weight quantization work - Q4, Q8, and the rest - solved the problem of fitting the model itself into your GPU. The KV cache was largely left alone at full float16 precision, or compressed only lightly with q8_0/q4_0.

TurboQuant is a research paper from Google (published at ICLR 2026) that applies more aggressive compression to the KV cache using a technique called PolarQuant combined with Walsh-Hadamard rotation. NGL I don't fully understand the math behind this but at least I know what the expected result should be. 🤷 The gist: it rotates the values into a space where they're easier to quantize accurately. The result is a much smaller cache with surprisingly little quality loss.

Think of it like packing a hiking bag. You don't just shove everything in - you roll the clothes, you compress them and you put the heavy stuff at the bottom. You end up with the same gear in half the space, and it all unrolls when you need it. That's roughly what the rotation-then-quantize approach does: it reorganizes the values before squishing them so you lose less when you squish. 🫠
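For the curious, here's a toy version of the rotate-then-quantize idea. To be clear, this is not TurboQuant's actual algorithm and it has nothing to do with PolarQuant: it's a plain Walsh-Hadamard transform followed by a naive 4-bit quantizer with one shared scale, run on a made-up vector with a single big outlier. All function names are my own.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def quantize_roundtrip(values, bits):
    """Quantize to signed integers with one shared scale, then dequantize."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 levels on each side of zero for 4 bits
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    step = peak / levels
    return [round(v / step) * step for v in values]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Made-up vector: one large outlier next to small values.
x = [8.0, 0.3, -0.2, 0.4, -0.1, 0.25, -0.35, 0.15]

direct = quantize_roundtrip(x, bits=4)
rotated = fwht(quantize_roundtrip(fwht(x), bits=4))  # rotate, quantize, rotate back

direct_err = rms_error(x, direct)
rotated_err = rms_error(x, rotated)
print(f"direct:  {direct_err:.3f}")
print(f"rotated: {rotated_err:.3f}")
```

On this input the outlier forces a coarse step size, so the small entries all round to zero in the direct version. After the rotation the outlier's energy is spread across every coordinate, the values sit in a narrower range, and the round-trip error comes out lower. It's a hand-picked example, not a proof, but it's the intuition behind rotating first.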


The Three Compression Levels

LlamaMan's TurboQuant support adds three new KV cache types on top of the standard ones llama.cpp already has:

Format   Bits/val   Compression   Quality vs q8_0
turbo4   4.25       3.8x          +0.23% PPL
turbo3   3.5        4.6x          +1.06% PPL
turbo2   2.5        6.4x          +6.48% PPL

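The "PPL" column is perplexity - a measure of how well the model predicts a reference text, where lower is better. The percentages are how much perplexity rises compared to a q8_0 cache. +0.23% is essentially nothing. +6.48% is noticeable but the model is still coherent. The compression numbers tell the other half of the story: turbo4 gets you a 3.8x smaller cache, turbo2 gets you 6.4x. Those ratios line up with 16 divided by the Bits/val column, so the baseline there looks like a float16 cache rather than q8_0.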

turbo4 is the obvious starting point for almost everyone. Within a quarter of a percent of q8_0 perplexity, with a cache nearly four times smaller. That's a genuinely good trade.

The more interesting thing the table hints at is asymmetric K/V compression. K and V have different properties - K controls attention routing, V carries the content. Compressing K too aggressively is where most quality loss comes from. So the practical move is often to keep K at q8_0 and compress V with turbo2 or turbo3. You get most of the VRAM savings with nearly none of the quality cost.
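Here's a quick sketch of what mixing cache types does to the overall ratio. The turbo bit widths come from the table; q8_0 at 8.5 bits per value is the standard llama.cpp block layout (32 int8 values plus one fp16 scale). The assumptions are mine: K and V hold the same number of values, and the baseline is float16. cache_compression is just a helper name for this post.

```python
BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": 8.5,     # 32 int8 values + one fp16 scale per block
    "turbo4": 4.25,  # turbo figures taken from the table in this post
    "turbo3": 3.5,
    "turbo2": 2.5,
}

def cache_compression(k_type, v_type):
    """Compression vs a float16 cache, assuming K and V hold the same number of values."""
    avg_bits = (BITS_PER_VALUE[k_type] + BITS_PER_VALUE[v_type]) / 2
    return 16.0 / avg_bits

combos = [
    ("turbo4", "turbo4"),
    ("q8_0", "turbo4"),
    ("q8_0", "turbo3"),
    ("q8_0", "turbo2"),
]
for k_type, v_type in combos:
    ratio = cache_compression(k_type, v_type)
    print(f"K={k_type:<7} V={v_type:<7} -> {ratio:.2f}x smaller than f16")
```

Keeping K at q8_0 caps how far the total can shrink, since K is half the cache under that assumption. So the asymmetric setups trade some compression for safer attention routing.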


Getting It to Actually Build

TurboQuant+ is not in mainline llama.cpp. It lives on an experimental fork. That means to use it, LlamaMan needs a dedicated Docker image that compiles llama-server from source with CUDA support - which is a full multi-stage build pulling a CUDA dev image, cloning the fork, compiling, then copying into a runtime image.

This is where things got annoying.

Attempt one. Clone TheTom/llama-cpp-turboquant, build with -DGGML_CUDA=ON and -DCMAKE_CUDA_ARCHITECTURES=all-major. The build ran, the image came out. The turbo2/3/4 cache types were missing. Turns out the turbo KV work lives on the feature/turboquant-kv-cache branch, not main - and all-major for CUDA architectures apparently wasn't doing what I expected either.

Then the linker started. The multi-stage Docker build compiles in a CUDA dev image but runs in a slim runtime image. The build stage has libcuda.so available as a stub from the toolkit. The build links against it. The problem: the stub at build time is libcuda.so but cmake and the dynamic linker expected libcuda.so.1 - the versioned symlink. Build fails. Fix: create the symlink manually before cmake runs.

ln -sf /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1

That got cmake to find it. But then the build itself couldn't see it because the system linker didn't know the stubs path existed. Fix: write an ldconfig config for the stubs directory and run ldconfig before the cmake step. At this point libcuda.so.1 is a stub that only exists to satisfy the linker at build time - the real libcuda.so.1 comes from the NVIDIA driver at runtime. That's fine - the stub just needs to be there for the build to complete.

Then runtime. The binary launched, failed immediately with a missing shared library. The multi-stage Dockerfile was only copying llama-server across from the build stage - one binary. But llama.cpp builds a collection of shared libraries alongside the server binary, and llama-server expects them in the same directory at runtime. Fix: copy the entire build/bin/ directory instead of just the executable.

# Before
COPY --from=builder /build/build/bin/llama-server /app/llama-server

# After
COPY --from=builder /build/build/bin/ /app/

Then the turbo types still didn't work. Briefly tried a CUDA-specific fork (signalnine/llama-cpp-turboquant-cuda) that had the turbo CUDA kernels. That got turbo2/3/4 appearing in the server but the fork was behind on other things. Ended up going back to TheTom/llama-cpp-turboquant on the feature/turboquant-kv-cache branch, which turned out to have the CUDA kernels too - just on the right branch instead of main.

The final Dockerfile is maybe a hundred lines. Getting there took half a night.

It could be that I have a problem.


What to Expect

The honest version: TurboQuant compresses the KV cache, not the model weights. If a 70B model didn't fit in your GPU before, it still won't. What changes is how much context you can hold once the model is loaded.

If you're running a 13B model on 8GB VRAM and hitting context limits at 4K tokens, turbo4 might let you push that to 15K+. If you're running a 34B partially offloaded to RAM and the KV cache was eating into the layers you could keep on GPU, you might get a few more layers on-device and noticeably faster generation. The VRAM the cache was occupying is now available for more context or for breathing room.
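The 4K-to-15K jump is simple arithmetic if you assume the 4K limit was hit with a float16 cache and the cache gets the same memory budget afterwards. A minimal sketch under those assumptions (the helper name is made up, and it ignores compute buffers and anything else that grows with context):

```python
def context_in_same_budget(baseline_ctx, baseline_bits, new_bits):
    """Tokens that fit in the memory the baseline cache used, at a new bits-per-value."""
    return int(baseline_ctx * baseline_bits / new_bits)

baseline_ctx = 4096  # "4K tokens" with a float16 cache (assumed baseline)
for name, bits in [("turbo4", 4.25), ("turbo3", 3.5), ("turbo2", 2.5)]:
    tokens = context_in_same_budget(baseline_ctx, 16.0, bits)
    print(f"{name}: about {tokens} tokens")
```

Treat the result as an upper bound. If your baseline was already q8_0 rather than float16, pass 8.5 as baseline_bits and the gain shrinks accordingly.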

It won't make a slow model fast. It won't fix quality issues that come from weight quantization. And on smaller models (7B range) you'll notice the quality difference more than on larger ones - smaller models are less robust to compression in general.

The best use case is exactly what it says on the tin: longer contexts in the same VRAM budget. If that's your bottleneck, it's worth trying. If your bottleneck is getting the model weights loaded at all, look elsewhere.

turbo4 with asymmetric K/V (-ctk q8_0 -ctv turbo4) is the safe starting point for almost everyone. Test with actual prompts before relying on it for anything important - perplexity numbers don't always tell the full story for your particular model and workload.



Live long and prosper. 🖖👽
