TurboQuantCPU¶

CPU-optimized KV cache quantization for LLM inference with mathematical guarantees

What Problem Does This Solve?¶

When LLMs generate text, they store Key-Value (KV) pairs for each token to avoid recomputing attention. This cache grows with sequence length:

KV Cache Memory = 2 × layers × heads × head_dim × seq_len × 2 bytes (FP16)

Example: For a 7B model at 8K context → ~1 GB just for KV cache!

TurboQuantCPU compresses this by 7-14× with mathematical guarantees on quality preservation.

Key Features¶

Feature	What It Means	Benefit
🎯 Provably unbiased attention	`E[estimated_score] = true_score` exactly	No quality degradation, mathematically proven
📊 7-14× compression	4-bit (7×) to 1-bit (14×) quantization	Run 8K→32K+ context on same hardware
🚀 Zero calibration	Works out of the box, no training data needed	Drop-in replacement, instant deployment
🤗 One-line HuggingFace	`patch_model(model, mode="prod", bits=4)`	Seamless integration with existing code
⚡ SIMD optimized	AVX2/AVX-512/NEON C kernels	Maximum CPU performance
🔬 Research-backed	ICLR 2026, NeurIPS 2024, AISTATS 2026	Peer-reviewed algorithms

Quick Start¶

pip install turboquantcpu

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquantcpu import patch_model
import torch

# Load any HuggingFace model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.float32,
    device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# ONE LINE: Enable KV cache compression
cache = patch_model(model, mode="prod", bits=4)

# Generate as usual
inputs = tokenizer("Explain quantum computing:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Check memory savings
print(cache.memory_report())
# {'compression_ratio': 7.3, 'original_fp16_MB': 25.2, 'compressed_MB': 3.5}

Benchmark Highlights¶

Tested on Intel i7-1255U (12th Gen, 10 cores) with 3 HF models:

Memory Compression¶

Model	FP16 Size	4-bit PROD	1-bit QJL	Savings
Qwen3.5-0.8B	100 MB	13.7 MB (7.3×)	7.5 MB (13.4×)	86-93%
Llama-3.2-1B	100 MB	13.7 MB (7.3×)	7.5 MB (13.4×)	86-93%
Gemma-2-2B	100 MB	13.7 MB (7.3×)	7.5 MB (13.4×)	86-93%

What this means: Store 4× more context tokens in the same memory, or run models on devices with limited RAM.

Inference Speed¶

Model	FP16 Baseline	4-bit PROD	Result
Qwen3.5-0.8B	100%	+15.3%	Slight overhead
Llama-3.2-1B	100%	-8.2% ⚡	Faster!
Gemma-2-2B	100%	+12.1%	Moderate overhead

What this means: Can be faster than baseline because memory bandwidth savings outweigh decompression cost.

Quality Preservation¶

Model	FP16 Perplexity	4-bit Perplexity	Quality Change
Qwen3.5-0.8B	17.46	17.46	0.00%
Llama-3.2-1B	7.05	7.05	0.00%
Gemma-2-2B	12.34	12.35	+0.08%

What this means: Zero quality degradation—mathematical guarantees hold in practice.

See Benchmarks for detailed results and visualizations.

Quantization Modes¶

Mode	Bits	Compression	Quality	Best For
`prod`	4	7.3×	⭐⭐⭐⭐⭐	Recommended—unbiased attention, mathematically proven
`mse`	4	7.3×	⭐⭐⭐⭐	Best reconstruction quality
`qjl`	1	14×	⭐⭐⭐	Extreme memory constraints, very long contexts
`polar`	4	7.3×	⭐⭐⭐⭐	Outlier-heavy models

# Recommended: Provably unbiased attention
cache = patch_model(model, mode="prod", bits=4)

# Maximum compression: 14×
cache = patch_model(model, mode="qjl")

Mathematical Guarantees¶

TurboQuant-PROD (Recommended)¶

E[estimated_attention_score] = true_attention_score  (exactly!)

This is the only KV cache quantization method with provably unbiased attention scores.

QJL (1-bit Maximum Compression)¶

E[⟨q̂, k̂⟩] = ⟨q, k⟩  (unbiased inner product)

14× compression with zero quantization overhead.

When to Use TurboQuantCPU¶

✅ Use When:¶

Memory is the bottleneck
Running 32K+ context on consumer CPUs
Serving multiple models on same hardware
Edge/on-device deployment
You need provable quality
Production systems requiring reliability guarantees
Research requiring reproducible results
CPU-only inference
No GPU available
Cost-prohibitive GPU deployment
Privacy-sensitive on-device processing
HuggingFace ecosystem
Already using Transformers library
Want minimal code changes

❌ Don't Use When:¶

Raw speed is the only priority
Use llama.cpp for maximum throughput
GPU available → use vLLM
Contexts are very short (< 1K tokens)
Compression overhead not worth it
Full model quantization needed
TurboQuantCPU only quantizes KV cache
Use llama.cpp GGUF for full model quantization

Research Papers¶

TurboQuant (ICLR 2026)
"Online Vector Quantization with Near-optimal Distortion Rate"
arXiv:2504.19874
QJL (NeurIPS 2024 / AAAI 2025)
"1-Bit Quantized JL Transform for KV Cache Quantization"
arXiv:2406.03482
PolarQuant (AISTATS 2026)
"Quantizing KV Caches with Polar Transformation"
arXiv:2502.02617

Next Steps¶

Installation - Set up TurboQuantCPU
Quick Start - Your first compressed inference
HuggingFace Integration - Detailed integration guide
API Reference - Complete API documentation
Benchmarks - Detailed benchmark results with visualizations

License¶

MIT License - See License