BeeLlama.cpp on a Mac Mini M4 16GB: DFlash and TurboQuant for Local LLMs

Tags: Local LLMs · Apple Silicon · Speculative Decoding


A few days ago Anbeeld dropped BeeLlama.cpp — a performance-focused llama.cpp fork that bundles DFlash speculative decoding, TurboQuant/TCQ KV-cache compression, adaptive draft control, reasoning-loop protection, and multimodal support into one binary. (Multimodal comes with caveats: under --mmproj, non-DFlash speculative modes are disabled, and while the docs say flat DFlash only, the current source enables tree DFlash with mmproj too — treat tree + multimodal as experimental until measured.) The headline config in the r/LocalLLaMA announcement: Qwen 3.6 27B at Q5 with 200K of practically lossless KV cache and vision, on a single RTX 3090, peaking at 135 tok/s — 2–3x faster than baseline llama.cpp.

That's a 24GB VRAM target. The Mac Mini M4 has 16GB of unified memory, total. So the question I care about: which parts of Bee actually translate to a 16GB Apple Silicon machine, and which are GPU-rich-only? This post is the honest version of that answer.

What's in the box

BeeLlama.cpp keeps the familiar llama.cpp tools and server flow (llama-cli, llama-server, GGUF, --mmap) and adds the following on top:

TurboQuant (WHT-based scalar quantization) originates from TheTom/llama-cpp-turboquant. TCQ and the original DFlash port come from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits). Bee tightens, renames, and stabilizes the surface.

If you want the deep version of how the algorithms work and where the 8x speculative-decoding numbers come from, I wrote a DFlash and DDTree explainer a few weeks ago.

How the new pieces actually work

If you've only used vanilla speculative decoding in upstream llama.cpp, three of Bee's mechanisms are structurally different from what you've seen. Worth knowing before you start tuning flags.

DFlash drafter cross-attends to hidden states, not tokens. A standard speculative drafter is a small autoregressive LM that predicts the next token from previous tokens. DFlash's drafter is a block-diffusion model that conditions on the target's hidden states — pulled from a per-layer 4096-slot ring buffer the target writes into during its forward pass. That's why --spec-dflash-cross-ctx exists: it sets how many of those recent hidden-state vectors the drafter cross-attends to. The advantage: the drafter sees what the target is thinking, not just what it has emitted, which is why DFlash drafters can be tiny and still maintain high acceptance. The cost: a per-layer ring buffer of recent hidden states. Bee stores the CPU ring as float32 (see common/speculative.cpp), so on a 32-layer model with hidden dim 4096 and 4096 slots, that's 32 × 4096 × 4096 × 4 bytes ≈ 2 GiB of hidden-state memory, before drafter weights. The GPU cross ring is bounded by cross_ctx rather than the full 4096, but the CPU ring stays full-size. On 16GB unified, this is the line item that quietly eats your budget.
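To see how quickly that ring eats a 16GB budget, a quick sizing sketch (an illustrative helper, not Bee's API; assumes the f32 CPU ring and fixed 4096 slots per layer described above):

```python
def ring_bytes(n_layers: int, hidden_dim: int, slots: int = 4096, elem_size: int = 4) -> int:
    """Bytes consumed by the host-side hidden-state ring: one slot buffer per layer."""
    return n_layers * slots * hidden_dim * elem_size

# 32 layers x 4096 slots x 4096 dim x 4 bytes = exactly 2 GiB
print(ring_bytes(32, 4096) / 2**30)              # -> 2.0

# Qwen 3 8B-class (36 layers, hidden 4096): ~2.25 GiB before drafter weights
print(round(ring_bytes(36, 4096) / 2**30, 2))    # -> 2.25
```

The GPU cross-ring, when available, scales with cross_ctx instead of the full 4096 slots, which is why the CUDA path is so much cheaper.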

[Figure: DFlash speculative decoding flow. The target model (Qwen 3 8B, 36 layers) writes hidden states into an f32 ring of n_layers × 4096 slots × hidden_dim per forward pass (circular write, ~2 GiB on a 32-layer model; CUDA gets a GPU cross-ring, Metal is host-only). The DFlash drafter, a block-diffusion LM that is tiny relative to the target, cross-attends to the last cross_ctx slots of the ring and proposes k drafts per step, with the profit/fringe controller tuning k at runtime. Drafts return to the target for verification in one batched forward pass; the acceptance rate decides what's emitted.]
The drafter conditions on what the target is thinking, not just what it emitted. One target weight read amortizes verification of k draft tokens — the bandwidth win that makes speculation worthwhile on a 120 GB/s Mac.

TurboQuant rotates before quantizing. Standard KV quantization (Q8_0, Q4_0) is round-to-nearest on the raw K/V tensors. The problem is outliers: a few channels with large values force a wide quantization range, which crushes precision for everyone else. TurboQuant pre-multiplies K and V by a Walsh-Hadamard Transform (WHT) — a fast, structured ±1 orthogonal rotation — before scalar quantization, then de-rotates after dequant. The rotation spreads outlier energy across all channels, so the post-rotation tensor is much more Gaussian and quantizes far better at the same bit-width. WHT is O(n log n), allocation-free, and trivially SIMD-friendly. The result is that turbo3 / turbo4 beat naive Q3 / Q4 KV at the same byte budget. Bee reserves the "practically lossless" framing for the higher-bit modes, with turbo2 being aggressive enough that you should measure quality on your workload.
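A minimal NumPy sketch of the rotate, quantize, de-rotate round trip (not Bee's kernels, just the principle: a WHT spreads one outlier channel across all channels before round-to-nearest):

```python
import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform, O(n log n); n must be a power of 2.
    # Orthonormal scaling means fwht(fwht(v)) == v, so de-rotation is free.
    x = x.copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            a, b = x[i:i+h].copy(), x[i+h:i+2*h].copy()
            x[i:i+h], x[i+h:i+2*h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def quant_rtn(x, bits=3):
    # Symmetric round-to-nearest scalar quantization; one scale for the block.
    scale = np.abs(x).max() / (2**(bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
v = rng.normal(size=128)
v[7] = 40.0                       # one outlier channel blows up the quant range

err_plain  = np.abs(quant_rtn(v) - v).mean()
err_rotate = np.abs(fwht(quant_rtn(fwht(v))) - v).mean()  # rotate -> quant -> de-rotate
print(err_plain > err_rotate)     # rotation spreads the outlier; mean error drops
```

Without the rotation, the single outlier forces a scale so wide that every ordinary channel rounds to zero; after the rotation the tensor is near-Gaussian and the same bit budget goes much further.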

TCQ uses a trellis, not a codebook lookup. Trellis-Coded Quantization (the _tcq variants, from the spiritbuun paper) replaces independent per-element quantization with a Viterbi-style soft path through a structured trellis. Each output value is jointly optimized with its neighbors, so quantization error is decorrelated rather than accumulating. At ~3.25 bpv (the turbo3_tcq rating in Bee's storage table), TCQ closes most of the gap to fp16 KV that scalar quant can't. Catch for Mac users: at the time of writing, the TCQ kernels are CUDA-only — the CPU reference path in ggml-turbo-quant.c zeros TCQ rows rather than implementing them, so don't enable turbo*_tcq on Metal yet.
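To make "Viterbi-style soft path through a structured trellis" concrete, here is a toy trellis-coded quantizer in the spirit of Marcellin-Fischer TCQ. Everything in it (the 8-level codebook, the 4-state machine, the subset labeling) is invented for illustration and is not Bee's turbo*_tcq layout:

```python
import numpy as np

LEVELS = np.arange(-3.5, 4.0, 1.0)            # 8 uniform reconstruction levels
SUBSETS = [LEVELS[i::4] for i in range(4)]    # D0..D3, two levels each
NEXT = [[0, 2], [0, 2], [1, 3], [1, 3]]       # next state for branch bit 0/1
SUB  = [[0, 2], [2, 0], [1, 3], [3, 1]]       # which subset that branch exposes

def tcq_quantize(x):
    # Viterbi over the trellis: each output level is chosen jointly with its
    # neighbors via the accumulated path cost, not independently per element.
    cost = np.array([0.0, np.inf, np.inf, np.inf])   # start in state 0
    back = []
    for sample in x:
        new_cost = np.full(4, np.inf)
        choice = [None] * 4
        for s in range(4):
            if not np.isfinite(cost[s]):
                continue
            for b in (0, 1):
                subset = SUBSETS[SUB[s][b]]
                lvl = subset[np.abs(subset - sample).argmin()]  # best level on branch
                c = cost[s] + (sample - lvl) ** 2
                if c < new_cost[NEXT[s][b]]:
                    new_cost[NEXT[s][b]] = c
                    choice[NEXT[s][b]] = (s, lvl)
        back.append(choice)
        cost = new_cost
    s = int(np.argmin(cost))                   # trace the cheapest path back
    out = []
    for choice in reversed(back):
        s, lvl = choice[s]
        out.append(lvl)
    return np.array(out[::-1])

x = np.random.default_rng(1).normal(size=256)
print(np.mean((tcq_quantize(x) - x) ** 2))    # joint MSE, well below the subset gap
```

The point of the structure: each state only exposes some levels, so a greedy per-element pick would be bad, but the Viterbi search trades a slightly worse choice now for much better choices later, which is exactly the neighbor decorrelation the paper exploits.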

Adaptive draft control is a feedback loop, not a heuristic. Fixed --spec-draft-n-max is wasteful: when acceptance is high you under-draft, when it's low you over-draft and waste compute on rejected tokens. Bee's default profit controller continuously estimates effective tok/s under speculation and compares against an ongoing no-spec baseline. If speculation is losing it shrinks the horizon, otherwise it grows it. The fringe alternative is simpler — it maps observed acceptance-rate bands to discrete draft depths. profit handles bursty workloads better, fringe is more predictable. On 16GB this matters because aggressive drafting eats memory bandwidth that would otherwise serve the target's decode.
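A sketch of what a profit-style controller boils down to (names, thresholds, and the halve/increment policy here are mine; Bee's controller differs in detail):

```python
class ProfitController:
    """Feedback loop: grow the draft horizon while speculation beats the
    no-spec baseline, shrink it when speculation is losing."""

    def __init__(self, n_min=1, n_max=16, alpha=0.1):
        self.n = 4                    # current draft horizon (draft-n)
        self.n_min, self.n_max = n_min, n_max
        self.alpha = alpha            # EMA smoothing factor
        self.spec_tps = None          # smoothed tok/s with speculation
        self.base_tps = None          # smoothed no-spec baseline tok/s

    def _ema(self, old, new):
        return new if old is None else (1 - self.alpha) * old + self.alpha * new

    def update(self, spec_tps_sample, base_tps_sample):
        self.spec_tps = self._ema(self.spec_tps, spec_tps_sample)
        self.base_tps = self._ema(self.base_tps, base_tps_sample)
        if self.spec_tps < self.base_tps:     # speculation losing: back off fast
            self.n = max(self.n_min, self.n // 2)
        else:                                  # speculation winning: push gently
            self.n = min(self.n_max, self.n + 1)
        return self.n
```

The asymmetry (halve on loss, increment on win) is a common controller shape because over-drafting wastes bandwidth immediately while under-drafting only costs a little upside.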

Why a 16GB Mac user should care

Two of Bee's features matter disproportionately on a memory-starved machine, and one matters less than you'd think.

TurboQuant KV-cache compression is the big win. On a 16GB Mac, your bottleneck is rarely raw flops — it's that every token of context spends KV bytes you don't have. At Q4 weights, a 7B model leaves < 9GB for KV + activations + OS. At long context, KV is the line item that explodes. Bee's Mac-supported options are turbo3 (~5.12x compression) and turbo4 (~3.88x). turbo2, turbo2_tcq, and turbo3_tcq are CUDA-only at the time of writing per Bee's quickstart. Bee reserves "practically lossless" for the higher-bit modes (turbo4) — turbo3 is more aggressive and quality-cost is workload-dependent.

KV byte math, Qwen 3 8B at 32K context:

Architecture: 36 layers, GQA with 8 KV heads × 128 dim = 1024 KV channels.
Per-token KV bytes = 2 (K+V) × 36 layers × 1024 dim × bytes_per_elem

fp16 KV: 144 KiB/token → ~4.5 GiB at 32K
Q8_0 KV (~1.06 bytes/elem in ggml): 76.5 KiB/token → ~2.4 GiB at 32K
turbo4 (Bee block: 66 B / 128 vals = 4.125 bpv, ~3.88x), Mac path: ~37.1 KiB/token → ~1.16 GiB at 32K
turbo3 (Bee block: 50 B / 128 vals = 3.125 bpv, ~5.12x), Mac path: ~28.1 KiB/token → ~900 MiB at 32K
turbo3_tcq (3.25 bpv, ~4.92x, CUDA-only today): ~29.3 KiB/token → ~940 MiB at 32K (reference only on Mac)

The Mac-usable rows are the turbo3 / turbo4 lines. The TCQ entry is shown for comparison since the kernels aren't ported to Metal yet. That's the difference between "out of memory" and "fine, use the rest for weights and a drafter."
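The arithmetic behind those rows, reproduced as a sketch (block sizes taken from the bullet list above):

```python
# KV-cache arithmetic for Qwen 3 8B: 36 layers, 8 KV heads x 128 dim = 1024 KV channels.

def kv_bytes_per_token(n_layers, kv_dim, bytes_per_elem):
    return 2 * n_layers * kv_dim * bytes_per_elem   # 2 = K and V

CTX = 32 * 1024
modes = {
    "fp16":   2.0,
    "q8_0":   34 / 32,      # ggml Q8_0 block: 34 bytes per 32 values
    "turbo4": 66 / 128,     # Bee block: 66 bytes per 128 values (4.125 bpv)
    "turbo3": 50 / 128,     # Bee block: 50 bytes per 128 values (3.125 bpv)
}
for name, bpe in modes.items():
    per_tok = kv_bytes_per_token(36, 1024, bpe)
    print(f"{name:7s} {per_tok/1024:7.2f} KiB/token  {per_tok*CTX/2**20:6.0f} MiB @ 32K")
```

Swap in your own model's layer count and KV dim to see whether a given context length fits before you download anything.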

[Chart: KV cache at 32K context for Qwen 3 8B (36 layers, 1024 KV dim). fp16 baseline: 4500 MB. Q8_0: 2380 MB. turbo4: 1188 MiB (Metal-supported). turbo3: 900 MiB (Metal-supported). turbo3_tcq: 940 MiB (CUDA-only today).]
KV cache footprint shrinks 4–5× under TurboQuant. turbo3 on a 16GB Mac frees ~3.6 GiB vs fp16, room for weights, drafter, and OS.

DFlash speculative decoding is real but harder to fit. Speculative decoding requires a second model in memory (the drafter) plus a hidden-state ring buffer per target layer. On a 24GB 3090 the math is comfortable. On 16GB unified, every byte the drafter takes is a byte you can't spend on target weights or KV. On Mac you also pay an extra ~2 GiB for the f32 CPU ring (the GPU cross-ring fast path is CUDA-only). Practical answer: speculative decoding only pays off if you size for it from the start — small target (7B–14B class) plus a tiny DFlash drafter, both at aggressive quants. With a 27B target there is no room for a drafter on 16GB.

The deeper reason speculative decoding helps on Apple Silicon at all is memory bandwidth, not compute. The base Mac Mini M4 has roughly 120 GB/s of unified memory bandwidth. Decoding one token at a time reads the entire model's weights once per token — at Q4 a 5GB model means decode is bandwidth-bound around 24 tok/s before any compute cost. Speculative decoding amortizes that read: one batched forward pass over k draft tokens reads the weights once and verifies k candidates. The exact breakeven depends on target verification cost vs single-token decode, drafter cost, and KV/activation overhead — there's no clean "acceptance > 1/k" rule. Rough intuition: speculation wins when expected emitted tokens per verification step exceed the verification overhead measured in single-decode-equivalents. That's why DFlash's high acceptance rate matters more than its raw drafter speed. The drafter is cheap, the target read is the line item.
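A back-of-envelope model of that breakeven (it assumes decode is purely weight-bandwidth-bound and treats each verification step as one weight read; both are simplifications, so treat the numbers as ceilings, not predictions):

```python
# Decode ceiling: one full weight read per token on a bandwidth-bound machine.
def decode_tps(bandwidth_gbs, model_gb):
    return bandwidth_gbs / model_gb

# Speculative throughput: with i.i.d. per-token acceptance p and k drafts,
# expected emitted tokens per verification step = 1 + p + p^2 + ... + p^k
# (the +1 is the target's own token). Drafter cost modeled as a flat overhead.
def spec_tps(base_tps, accept_rate, k, drafter_overhead=0.15):
    expected = 1 + sum(accept_rate ** i for i in range(1, k + 1))
    return base_tps * expected / (1 + drafter_overhead)

base = decode_tps(120, 5)          # M4 base: ~120 GB/s, 5 GB Q4 model
print(base)                        # -> 24.0 tok/s ceiling without speculation
print(round(spec_tps(base, accept_rate=0.8, k=6), 1))   # high-acceptance DFlash
print(round(spec_tps(base, accept_rate=0.3, k=6), 1))   # weak drafter: barely wins
```

Notice how sharply the output depends on accept_rate rather than k, which is the quantitative version of "the drafter is cheap, the target read is the line item."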

DDTree branch verification is mostly off-limits. Bee disables it automatically when the target spans more than one GPU — but on Mac the relevant constraint is total memory, not GPU count. The branch budget multiplies the KV state being verified per step. On 16GB, you generally want budget = 0 and put those bytes into context length.

Honest note: The Bee announcement and quickstart are written against CUDA on a 3090/4090. I haven't seen public Mac/Metal benchmarks for the fork yet. Below is what should work given that Bee is a llama.cpp fork (Metal backend is upstream) and how the features compose against 16GB unified memory. Treat the suggested configs as starting points to measure, not as published numbers.

Building on Apple Silicon

Bee is a llama.cpp fork, so the Metal build path should be the standard llama.cpp one. Metal is enabled by default on macOS — no flags needed.

git clone https://github.com/Anbeeld/beellama.cpp
cd beellama.cpp
cmake -B build
cmake --build build --config Release -j

# Sanity check Metal is on
./build/bin/llama-cli --version

If cmake isn't installed: brew install cmake. If you've never built llama.cpp before, my Running Qwen locally on a Mac Mini M4 16GB post walks through the upstream version with model download and the --mmap trick — those instructions transfer 1:1 here.

Reading the Bee source (commit 10b2a7f at time of writing), the Metal story is materially worse than I'd guessed before checking. The fast paths for several headline features resolve CUDA proc addresses. The CPU reference paths exist but in some cases (TCQ) zero out the relevant rows rather than implementing them. Honest portability map:

| Component | Kernel shape | Metal status today |
| --- | --- | --- |
| TurboQuant (WHT rotate + scalar quant) | WHT is structured ±1 ops, dequant is ALU-bound | Partial — Metal wires some TurboQuant paths, not all |
| TCQ (trellis decode, turbo*_tcq variants) | Per-block Viterbi-style decode | Effectively CUDA-only. The CPU stub in ggml-turbo-quant.c zeros TCQ rows — no real Metal kernel. Avoid TCQ on Mac for now. |
| DFlash GPU cross-ring + KV projection cache | GPU-side ring buffer + cross-attention kernel | CUDA-only. Bee's quickstart explicitly says macOS DFlash uses the CPU ring path. |
| DFlash host hidden-state ring path | Hidden-state ring lives in host f32; drafter execution still controlled by --spec-draft-ngl | Works on Mac, but pays the f32 ring memory tax above |
| Adaptive draft control (profit/fringe) | Host-side controller | N/A — runs on CPU regardless |
| DDTree tree masks + recurrent kernels | Custom tree-attention with parent_ids indirection | CUDA-specific or fallback-heavy. Author marks it WIP. Treat as off on Mac. |
| CopySpec rolling-hash matcher | Host-side hash table over token IDs | N/A — CPU path, runs anywhere |

The practical Mac story: turbo3 / turbo4 KV + DFlash via the host hidden-state ring + CopySpec are the parts you can actually use today. turbo2, all *_tcq variants, and DDTree need Metal ports that don't exist yet. If a CUDA-only quant appears to "work" on Metal it may be silently producing zeroed rows — verify outputs before trusting them.

Picking a model that actually fits

The Reddit demo uses Qwen 3.6 27B at Q5 with a DFlash drafter. On 16GB unified, that's not the right target. Realistic options, ordered by how aggressive you want to be:

| Target | Quant | Weights | Speculative? | Notes |
| --- | --- | --- | --- | --- |
| Qwen 3.6-35B-A3B (MoE) | IQ3_XXS | ~13–14GB on disk; ~3B active params; resident RAM depends on mmap + expert paging | No | Tight/swap-bound on 16GB. IQ4_XS (~18.8GB) and Q4_K_M (~21.4GB, sized for 24GB GPUs) will both swap. Start with smaller context, grow only if measurements stay clean. |
| Qwen 3 14B | Q4_K_M | ~9GB | Marginal | Bee shines here only if you find a tiny matching DFlash drafter and run KV at turbo3 (TCQ is CUDA-only). |
| Qwen 3 8B | Q4_K_M / Q5_K_M | ~5–6GB | Yes | Best fit for actual DFlash on 16GB — leaves headroom for drafter + KV + vision. |
| Qwen 3 4B / 7B class | Q5_K_M | ~3–5GB | Yes (or CopySpec) | Where TurboQuant + long context shine. Easy 32K+ context with turbo3 on Metal. |

If you want the 35B-A3B MoE — and you should, it's punching above its active-param count — Bee buys you long context via TurboQuant, not raw speed via DFlash. The MoE weights are mmap'd and the OS pages active experts. There's no spare memory for a drafter on 16GB.

One subtlety worth being precise about: speculative decoding doesn't require the drafter to "agree with" the target's expert routing. The target performs its own routing during verification, and acceptance is decided against target logits regardless of what the drafter thought. The real failure mode is more mundane — a small dense drafter often matches a sparsely-activated MoE target's distribution worse than it would match a dense target of comparable quality, so acceptance tends to be lower per byte of drafter. That's still bad for the 16GB budget, but for distribution-mismatch reasons rather than routing-mismatch reasons.

A working 16GB starting config

Pick the lane that matches what you need.

Lane A — long context, no speculative decoding (Qwen 3.6-35B-A3B MoE):

./build/bin/llama-server \
  -m models/Qwen3.6-35B-A3B-IQ3_XXS.gguf \
  --mmap \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  -c 8192 \
  --port 8080

Use IQ3_XXS on 16GB — IQ4_XS (~18.8GB) and Q4_K_M (~21.4GB) are sized for 24GB GPUs and will swap. The MoE handles weights via mmap, but on a 16GB machine the working set is still tight. Start at 8K context and grow only if memory pressure stays clean. TurboQuant cuts KV size enough to push context past what flat f16 KV would let you do. Do not use turbo*_tcq or turbo2 on Mac — those kernels are CUDA-only today.

Lane B — speculative decoding on a smaller target (Qwen 3 8B + DFlash drafter):

./build/bin/llama-server \
  -m models/Qwen3-8B-Q4_K_M.gguf \
  --spec-type dflash \
  --model-draft models/Qwen3-DFlash-Drafter.gguf \
  --spec-dflash-cross-ctx 1024 \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  -c 16384 \
  --port 8080

This is where Bee's adaptive draft-max controller earns its keep — the profit default automatically backs off when speculation isn't paying off and pushes harder when acceptance is high. Do not add --spec-branch-budget on 16GB — budget = 0 leaves more room for context. Mind the f32 CPU ring tax: with the GPU cross-ring being CUDA-only, expect roughly n_layers × hidden × ring_slots × 4 bytes on top of model and drafter weights.

Lane C — model-free speculation (no drafter at all):

./build/bin/llama-server \
  -m models/Qwen3-14B-Q4_K_M.gguf \
  --spec-type copyspec \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  -c 16384 \
  --port 8080

CopySpec is a rolling-hash suffix match over previous tokens — no draft model in memory and entirely host-side, so it works the same on Metal as on CUDA. Best on workloads with repetition (code edits, structured output, agent traces). The bytes you'd otherwise spend on a drafter go to KV and weights. On Mac, turbo3 is the most aggressive Metal-supported KV setting. turbo2 exists but its kernels are CUDA-only at the time of writing.
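The mechanism fits in a few lines. In this sketch a plain dict of n-gram tuples stands in for the rolling hash, and all names and parameters are illustrative:

```python
# CopySpec-style model-free drafting: index n-grams of the generated context;
# when the current suffix has appeared before, propose the tokens that followed
# it last time. No draft model, no GPU kernel, just a host-side lookup.

def build_index(tokens, n=4):
    index = {}
    for i in range(len(tokens) - n + 1):
        index[tuple(tokens[i:i+n])] = i + n      # position right after the n-gram
    return index

def draft(tokens, index, n=4, k=8):
    key = tuple(tokens[-n:])
    if key in index:                             # suffix seen before: copy ahead
        j = index[key]
        return tokens[j:j+k]
    return []                                    # no match: plain decode this step

ctx = [1, 2, 3, 4, 9, 9, 7, 1, 2, 3, 4]          # "...1 2 3 4" repeats
idx = build_index(ctx[:-4])                      # index everything before the suffix
print(draft(ctx, idx))                           # -> [9, 9, 7, 1, 2, 3, 4]
```

The drafted span is still verified by the target in one batched forward pass, so a wrong copy costs one wasted verification, not a wrong output, which is why this is safe on repetitive workloads and merely useless on novel text.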

For multimodal, add --mmproj path/to/mmproj.gguf. Per the announcement, the model can be fully offloaded to CPU when VRAM is tight. On Apple Silicon the analogue is letting unified memory handle it without forcing all layers onto the GPU.

Tuning notes for 16GB

What I'd actually do

If I had a Mac Mini M4 16GB and wanted to test Bee tomorrow, the order would be:

  1. Build Bee. Run a Qwen 3 8B Q4 with no speculative decoding and turbo3 KV (skip TCQ on Metal). Measure tok/s and verify quality on real tasks.
  2. Add a DFlash drafter. Compare against (1). If Bee's profit controller settles to draft-max > 1 and tok/s actually goes up, keep it. If not, the drafter is just stealing memory.
  3. Try the 35B-A3B MoE with TurboQuant for long-context use cases (codebase Q&A, doc summarization). Speculative decoding is off the table here on 16GB.
  4. Switch to CopySpec for repetitive workloads (code edits, structured output) and compare against DFlash.

Two honest caveats. First, I have not run Bee on Metal yet — there's a non-zero chance some of the speculative paths have CUDA-only code that hasn't been ported. The author flags DDTree's tree kernels as work-in-progress, so Metal coverage there is poor at best. Second, even at the upper bound (Qwen 8B + DFlash drafter), 16GB is tight enough that the constants matter — a few hundred MB of leftover Safari tabs can be the difference between zero swap and visible stutter.

Useful links