Complete Parameter Reference¶

This guide provides exhaustive documentation for every parameter in TurboQuantCPU.

TurboQuantizer Parameters¶

`head_dim` (int, required)¶

Description: The dimension of each attention head vector.

Common Values: - 64: Smaller models (Qwen 0.5B, some distilled models) - 128: Standard size (Llama 2/3, Mistral, Qwen 7B, most 7B-13B models) - 256: Larger models (Gemma 2 9B, some specialized architectures)

Technical Details: - Internally padded to next power of 2 for FWHT efficiency - d=128 → padded to 128 (no padding) - d=96 → padded to 128 (32 zeros added)

Performance Impact: - Higher dimensions = more computation per token - But also better compression ratios (fixed overhead amortized)

Example:

# For Llama-3-8B (head_dim=128)
quantizer = TurboQuantizer(head_dim=128, num_kv_heads=8, ...)

# For Phi-3-mini (head_dim=96, padded to 128)
quantizer = TurboQuantizer(head_dim=96, num_kv_heads=32, ...)

`num_kv_heads` (int, required)¶

Description: Number of key/value heads in the model.

Also Known As: - GQA (Grouped Query Attention) heads - KV heads - num_key_value_heads (HuggingFace config)

Common Patterns:

Model Size	Q Heads	KV Heads	GQA Ratio
0.5B-1B	14-16	2-4	4:1 to 8:1
7B	32	8	4:1
8B	32	8	4:1
70B	64	8	8:1

Why It Matters: - Fewer KV heads = more compression - GQA reduces memory without quality loss - Query heads broadcast to KV heads

Example:

# Llama-3-8B: 32 query heads, 8 KV heads
quantizer = TurboQuantizer(
    head_dim=128,
    num_kv_heads=8,      # KV heads
    num_q_heads=32,      # Query heads
    ...
)

`num_q_heads` (int, optional)¶

Description: Number of query heads. Defaults to num_kv_heads if not specified.

When to Specify: - Always specify for GQA models - Can omit for MHA (Multi-Head Attention) where Q heads = KV heads

Impact: - Used for broadcasting in attention computation - Does not affect compression (only keys are compressed)

`layer_idx` (int, default=0)¶

Description: Index of the transformer layer. Used for deterministic random rotations.

Why It Matters: - Each layer gets different random seeds - Ensures diversity across layers - Required for proper attention computation

Best Practice:

# Create one quantizer per layer
quantizers = [
    TurboQuantizer(
        head_dim=128,
        num_kv_heads=8,
        layer_idx=i,  # Unique per layer
        ...
    )
    for i in range(32)  # 32 layers
]

`mode` (TurboMode or str, default="prod")¶

Description: Quantization algorithm selection.

Options:

`"qjl"` or `TurboMode.QJL`¶

Bits: 1-bit per element
Compression: ~14×
Best for: Maximum compression, long contexts
Trade-off: Higher variance in estimates

`"mse"` or `TurboMode.MSE`¶

Bits: Configurable (2-8)
Compression: ~2-8×
Best for: Best reconstruction quality
Trade-off: Biased attention scores

`"prod"` or `TurboMode.PROD` (Recommended)¶

Bits: Configurable (2-8)
Compression: ~2-4×
Best for: Unbiased attention + good compression
Trade-off: Slower than MSE (two-stage)

Selection Guide:

Need maximum compression? → "qjl"
Need best quality? → "mse"
Need unbiased + good compression? → "prod" (default)

`bits` (int, default=4)¶

Description: Number of bits per coordinate.

Valid Range: 1-8

Impact by Mode:

Mode	bits	Actual Bits	Compression
QJL	1	1	~14×
MSE	2	2	~2×
MSE	4	4	~4×
PROD	4	3+1	~1.7-4×

Recommendation: - Start with 4 bits for most use cases - Use 2 bits for higher compression (slight quality loss) - Use 1 bit only with QJL mode

`outlier_frac` (float, default=0.0)¶

Description: Fraction of channels to allocate extra bits for outliers.

Range: 0.0 to 1.0

How It Works: - 0.0: All channels use bits bits - 0.1: Top 10% of channels use bits+1 bits - Creates "fractional-bit" quantization (e.g., 2.5-bit)

When to Use: - Models with known outlier channels - When some dimensions have much higher variance - For achieving specific average bit-rates

Example:

# 2.5-bit quantization (50% at 2 bits, 50% at 3 bits)
quantizer = TurboQuantizer(
    head_dim=128,
    num_kv_heads=8,
    bits=2,
    outlier_frac=0.5,  # Half channels get 3 bits
    ...
)

`n_threads` (int, default=0)¶

Description: Number of threads for parallel compression.

Special Values: - 0: Auto (min(num_kv_heads, 4)) - N: Use N threads explicitly

Performance Notes: - Parallelism is per-head - Diminishing returns beyond num_kv_heads - Memory bandwidth often the bottleneck, not CPU

Recommendation: - Leave at 0 for automatic tuning - Set to 1 for deterministic/debugging

QJLQuantizer Parameters¶

`seed_offset` (int, default=0)¶

Description: Additional seed offset for random rotation.

Use Case: - When you need different rotations than TurboQuant in same layer - Ensures independence between QJL and Turbo stages in PROD mode

`normalise_keys` (bool, default=True)¶

Description: Whether to L2-normalize keys before quantization.

Should Always Be True: - Required for unbiased estimator - Norm stored separately and restored during scoring

When False: - Only for special research purposes - Breaks mathematical guarantees

PolarQuantizer Parameters¶

`n_r_bits` (int, default=2)¶

Description: Bits for radius quantization.

Impact: - More bits = finer radius resolution - Radius is always positive (Rayleigh distribution)

`n_theta_bits` (int, default=2)¶

Description: Bits for angle quantization.

Impact: - More bits = finer angular resolution - Angle is in [0, 2π)

Total Bits:

total_bits = n_r_bits + n_theta_bits  # per pair (2 coordinates)
bits_per_coord = total_bits / 2

Common Configurations:

n_r_bits	n_theta_bits	Total	bits/coord	Compression
2	2	4	2	~4×
3	3	6	3	~2.7×
2	3	5	2.5	~3.2×

CacheConfig Parameters¶

`max_seq_len` (int, default=32768)¶

Description: Maximum sequence length before sliding window eviction.

Behavior: - Cache grows until max_seq_len - Then oldest tokens are dropped - Prevents unbounded memory growth

Recommendation: - Set to model's max context length - Larger = more memory but longer context

`value_mode` (str, default="int8")¶

Description: Quantization mode for value cache.

Options: - "fp16": No compression (2 bytes/element) - "int8": 8-bit symmetric (1 byte/element) - "int4": 4-bit asymmetric (0.5 bytes/element)

Impact: - Values are easier to quantize than keys - INT8 typically has <0.1% perplexity impact - INT4 gives 2× compression with minor quality loss

`value_group_size` (int, default=32)¶

Description: Group size for value quantization scaling.

Trade-off: - Smaller groups = better accuracy, more overhead - Larger groups = worse accuracy, less overhead

Recommendation: - 32 is good default - 64 for higher compression - 16 for better quality

`h2o_config` (H2OConfig, optional)¶

Description: Configuration for H2O sparse attention.

When Enabled: - Tracks attention scores across tokens - Evicts low-attention ("light") tokens - Keeps high-attention ("heavy hitter") tokens

Parameters: - heavy_budget: Number of heavy hitter tokens to keep - recent_budget: Number of recent tokens to always keep

PatchConfig Parameters¶

`skip_layers` (List[int], default=[])¶

Description: Layer indices to skip patching.

Use Cases: - Skip first layer (often has different patterns) - Skip specific layers for debugging - Gradual deployment (patch half the layers)

`verbose` (bool, default=True)¶

Description: Whether to print patching information.

Output Includes: - Model architecture detected - Number of layers patched - Compression estimates - Memory savings projection

Complete Configuration Examples¶

Example 1: Maximum Compression (Long Context)¶

from turboquantcpu import TurboQuantizer, TurboMode

# QJL for 14× compression
quantizer = TurboQuantizer(
    head_dim=128,
    num_kv_heads=8,
    num_q_heads=32,
    mode=TurboMode.QJL,
    bits=1,
    layer_idx=0,
)

# Expected: 14× compression
# Use case: 128K+ context on limited memory

Example 2: Best Quality (Accuracy Critical)¶

# Turbo-MSE with 4 bits
quantizer = TurboQuantizer(
    head_dim=128,
    num_kv_heads=8,
    mode=TurboMode.MSE,
    bits=4,
    outlier_frac=0.1,  # Extra precision for outliers
    layer_idx=0,
)

# Expected: 4× compression, minimal quality loss
# Use case: Production serving where accuracy matters

Example 3: Balanced (Recommended)¶

# Turbo-PROD with 4 bits (default)
quantizer = TurboQuantizer(
    head_dim=128,
    num_kv_heads=8,
    num_q_heads=32,
    mode=TurboMode.PROD,  # Unbiased
    bits=4,
    layer_idx=0,
)

# Expected: 3-4× compression, unbiased attention
# Use case: General purpose, best trade-off

Example 4: Model-Specific (Qwen 7B)¶

# Qwen2.5-7B configuration
quantizer = TurboQuantizer(
    head_dim=128,
    num_kv_heads=4,      # GQA: 4 KV heads
    num_q_heads=28,      # 28 query heads
    mode="prod",
    bits=4,
    layer_idx=layer_idx, # Set per layer
)

Example 5: Full Model Cache¶

from turboquantcpu import CacheConfig, CompressedKVCache

config = CacheConfig(
    num_layers=32,
    num_kv_heads=8,
    num_q_heads=32,
    head_dim=128,
    max_seq_len=32768,
    mode="prod",
    bits=4,
    value_mode="int8",
    value_group_size=32,
)

cache = CompressedKVCache.from_config(config)

Parameter Tuning Guide¶

For Maximum Compression¶

{
    "mode": "qjl",
    "bits": 1,
    "value_mode": "int4",
}
# Expected: 14× keys + 2× values = ~8× total

For Maximum Quality¶

{
    "mode": "mse",
    "bits": 4,
    "outlier_frac": 0.1,
    "value_mode": "int8",
}
# Expected: 4× compression, <0.5% quality loss

For Long Context¶

{
    "mode": "prod",
    "bits": 4,
    "max_seq_len": 131072,
    "h2o_config": H2OConfig(heavy_budget=1024, recent_budget=256),
}
# Expected: 4× compression + smart eviction

For Edge Devices¶

{
    "mode": "qjl",
    "bits": 1,
    "n_threads": 1,  # Save power
    "value_mode": "int4",
}
# Expected: Maximum memory savings, single-threaded

Performance Impact Reference¶

Parameter	Memory Impact	Speed Impact	Quality Impact
`head_dim`	Linear	Linear	None
`mode="qjl"`	14× smaller	Slower	Higher variance
`mode="mse"`	4× smaller	Medium	Biased
`mode="prod"`	3× smaller	Slowest	Unbiased
`bits`	Linear	Linear	Exponential
`outlier_frac`	Minor	Minor	Positive
`n_threads`	None	Parallel speedup	None

Complete Parameter Reference¶

TurboQuantizer Parameters¶

head_dim (int, required)¶

num_kv_heads (int, required)¶

num_q_heads (int, optional)¶

layer_idx (int, default=0)¶

mode (TurboMode or str, default="prod")¶

"qjl" or TurboMode.QJL¶

"mse" or TurboMode.MSE¶

"prod" or TurboMode.PROD (Recommended)¶

bits (int, default=4)¶

outlier_frac (float, default=0.0)¶

n_threads (int, default=0)¶

QJLQuantizer Parameters¶

seed_offset (int, default=0)¶

normalise_keys (bool, default=True)¶

PolarQuantizer Parameters¶

n_r_bits (int, default=2)¶

n_theta_bits (int, default=2)¶

CacheConfig Parameters¶

max_seq_len (int, default=32768)¶

value_mode (str, default="int8")¶

value_group_size (int, default=32)¶

h2o_config (H2OConfig, optional)¶

PatchConfig Parameters¶

skip_layers (List[int], default=[])¶

verbose (bool, default=True)¶

Complete Configuration Examples¶

Example 1: Maximum Compression (Long Context)¶

Example 2: Best Quality (Accuracy Critical)¶

Example 3: Balanced (Recommended)¶

Example 4: Model-Specific (Qwen 7B)¶

Example 5: Full Model Cache¶

Parameter Tuning Guide¶

For Maximum Compression¶

For Maximum Quality¶

For Long Context¶

For Edge Devices¶

Performance Impact Reference¶

`head_dim` (int, required)¶

`num_kv_heads` (int, required)¶

`num_q_heads` (int, optional)¶

`layer_idx` (int, default=0)¶

`mode` (TurboMode or str, default="prod")¶

`"qjl"` or `TurboMode.QJL`¶

`"mse"` or `TurboMode.MSE`¶

`"prod"` or `TurboMode.PROD` (Recommended)¶

`bits` (int, default=4)¶

`outlier_frac` (float, default=0.0)¶

`n_threads` (int, default=0)¶

`seed_offset` (int, default=0)¶

`normalise_keys` (bool, default=True)¶

`n_r_bits` (int, default=2)¶

`n_theta_bits` (int, default=2)¶

`max_seq_len` (int, default=32768)¶

`value_mode` (str, default="int8")¶

`value_group_size` (int, default=32)¶

`h2o_config` (H2OConfig, optional)¶

`skip_layers` (List[int], default=[])¶

`verbose` (bool, default=True)¶