BeeLlama.cpp on a Mac Mini M4 16GB: DFlash and TurboQuant for Local LLMs
A few days ago Anbeeld dropped BeeLlama.cpp, a performance-focused llama.cpp fork that bundles DFlash speculative decoding, TurboQuant/TCQ KV-cache compression, adaptive draft control, reasoning-loop protection, and multimodal support into one binary (with caveats: under --mmproj, non-DFlash speculative modes are disabled; the docs say flat-only, but the current source also enables tree DFlash with mmproj, so treat tree + multimodal as experimental until measured). The headline config in the r/LocalLLaMA announcement: Qwen 3.6 27B at Q5 with 200K of practically lossless KV cache and vision, on a single RTX 3090, peaking at 135 tok/s, 2-3x faster than baseline llama.cpp.
That's a 24GB VRAM target. The Mac Mini M4 has 16GB of unified memory, total. So the question I care about: which parts of Bee actually translate to a 16GB Apple Silicon machine, and which are GPU-rich-only? This post is the honest version of that answer.
What's in the box
BeeLlama.cpp keeps the familiar llama.cpp tools and server flow (llama-cli, llama-server, GGUF, --mmap) and adds the following on top:
- DFlash speculative decoding (--spec-type dflash): the target model captures hidden states into a per-layer 4096-slot ring buffer. A separate DFlash drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for verification.
- TurboQuant / TCQ KV-cache compression: five cache types — turbo4 (~3.88x), turbo3 (~5.12x), turbo2 (~7.53x), turbo3_tcq (~4.92x), turbo2_tcq (~7.11x) — set independently with --cache-type-k and --cache-type-v. The higher-bit options are reported as practically lossless on most workloads.
- Adaptive draft-max control: instead of a fixed --spec-draft-n-max, the server retunes the draft horizon at runtime. The default profit controller benchmarks speculative throughput against a no-spec baseline. The fringe alternative maps acceptance-rate bands to draft depth.
- Multimodal: --mmproj stays compatible with DFlash speculation. Non-DFlash speculative modes (e.g. CopySpec) are disabled while a vision projector is loaded. Bee docs claim flat-only, but the checked source also enables tree DFlash with mmproj — treat that path as experimental. The model can offload fully to CPU when VRAM is tight.
- Reasoning-loop protection: detects repeated hidden reasoning and force-closes the block via --reasoning-loop-window and --reasoning-loop-max-period.
- Sampled DFlash verification (--spec-draft-temp): rejection-sampling drafter behavior when both temperatures > 0.
- DDTree branch verification (--spec-branch-budget): tree-style drafting beyond the main path with GPU parent_ids, tree masks, and recurrent tree kernels. Auto-disabled if the target spans more than one GPU. Marked work-in-progress.
- CopySpec model-free speculation (--spec-type copyspec): rolling-hash suffix matching over previous tokens — no draft model required.
- Per-request overrides: dotted JSON keys speculative.n_max and speculative.branch_budget in the request body let you change horizons without restarting the server.
TurboQuant (WHT-based scalar quantization) originates from TheTom/llama-cpp-turboquant. TCQ and the original DFlash port come from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits). Bee tightens, renames, and stabilizes the surface.
If you want the deep version of how the algorithms work and where the 8x speculative-decoding numbers come from, I wrote a DFlash and DDTree explainer a few weeks ago.
How the new pieces actually work
If you've only used vanilla speculative decoding in upstream llama.cpp, three of Bee's mechanisms are structurally different from what you've seen. Worth knowing before you start tuning flags.
DFlash drafter cross-attends to hidden states, not tokens. A standard speculative drafter is a small autoregressive LM that predicts the next token from previous tokens. DFlash's drafter is a block-diffusion model that conditions on the target's hidden states — pulled from a per-layer 4096-slot ring buffer the target writes into during its forward pass. That's why --spec-dflash-cross-ctx exists: it sets how many of those recent hidden-state vectors the drafter cross-attends to. The advantage: the drafter sees what the target is thinking, not just what it has emitted, which is why DFlash drafters can be tiny and still maintain high acceptance. The cost: a per-layer ring buffer of recent hidden states. Bee stores the CPU ring as float32 (see common/speculative.cpp), so on a 32-layer model with hidden dim 4096 and 4096 slots, that's 32 × 4096 × 4096 × 4 bytes ≈ 2 GiB of hidden-state memory, before drafter weights. The GPU cross ring is bounded by cross_ctx rather than the full 4096, but the CPU ring stays full-size. On 16GB unified, this is the line item that quietly eats your budget.
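To see the size of that line item concretely, here's a back-of-envelope calculator for the host ring, using the shapes quoted above (32 layers, hidden dim 4096, 4096 slots, f32); plug in your own target's layer count and hidden size.

```python
# Back-of-envelope size of DFlash's host-side f32 hidden-state ring.
# Shapes are the ones quoted above; adjust for your actual target model.
def dflash_cpu_ring_bytes(n_layers: int, hidden_dim: int, ring_slots: int = 4096) -> int:
    bytes_per_elem = 4  # the CPU ring is stored as float32
    return n_layers * ring_slots * hidden_dim * bytes_per_elem

gib = dflash_cpu_ring_bytes(n_layers=32, hidden_dim=4096) / 2**30
print(f"~{gib:.1f} GiB of hidden-state ring before drafter weights")  # ~2.0 GiB
```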
TurboQuant rotates before quantizing. Standard KV quantization (Q8_0, Q4_0) is round-to-nearest on the raw K/V tensors. The problem is outliers: a few channels with large values force a wide quantization range, which crushes precision for everyone else. TurboQuant pre-multiplies K and V by a Walsh-Hadamard Transform (WHT) — a fast, structured ±1 orthogonal rotation — before scalar quantization, then de-rotates after dequant. The rotation spreads outlier energy across all channels, so the post-rotation tensor is much more Gaussian and quantizes far better at the same bit-width. WHT is O(n log n), allocation-free, and trivially SIMD-friendly. The result is that turbo3 / turbo4 beat naive Q3 / Q4 KV at the same byte budget. Bee reserves the "practically lossless" framing for the higher-bit modes, with turbo2 being aggressive enough that you should measure quality on your workload.
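To make the rotation argument concrete, here's a toy numpy sketch, not Bee's block layout or kernels: one outlier in a 128-value block, quantized with plain round-to-nearest versus rotate, quantize, de-rotate. The bit-width, block size, and scaling are illustrative assumptions.

```python
import numpy as np

def wht(x):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    x = x.copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal, so wht(wht(x)) == x

def quant_rtn(x, bits=3):
    """Symmetric round-to-nearest scalar quantization with one scale per block."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=128)
x[7] = 40.0  # one outlier channel blows up the quantization range

plain = quant_rtn(x)
rotated = wht(quant_rtn(wht(x)))  # rotate -> quantize -> de-rotate

# with a single large outlier, the rotated path typically lands well below plain RTN
print("RTN error      :", np.abs(x - plain).mean())
print("WHT + RTN error:", np.abs(x - rotated).mean())
```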
TCQ uses a trellis, not a codebook lookup. Trellis-Coded Quantization (the _tcq variants, from the spiritbuun paper) replaces independent per-element quantization with a Viterbi-style soft path through a structured trellis. Each output value is jointly optimized with its neighbors, so quantization error is decorrelated rather than accumulating. At ~3.25 bpv (the turbo3_tcq rating in Bee's storage table), TCQ closes most of the gap to fp16 KV that scalar quant can't. Catch for Mac users: at the time of writing, the TCQ kernels are CUDA-only — the CPU reference path in ggml-turbo-quant.c zeros TCQ rows rather than implementing them, so don't enable turbo*_tcq on Metal yet.
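Since TCQ isn't usable on Metal anyway, the following is only a toy of the trellis idea, not the paper's 2-3 bit construction or Bee's kernels: each value gets one bit, the two reachable reconstruction levels depend on a state set by the previous choice, and a Viterbi pass picks the jointly cheapest path. The guarantee the sketch demonstrates is modest but real: the joint path is never worse than greedy per-element choices on the same trellis.

```python
import numpy as np

# Toy trellis quantizer, for intuition only. Levels and transitions are
# made up; real TCQ uses a larger, carefully partitioned codebook.
LEVELS = {0: (-1.5, 0.5), 1: (-0.5, 1.5)}  # state -> reachable levels
# choosing branch j (0 or 1) from any state moves you to state j

def viterbi_quantize(x):
    INF = float("inf")
    cost = [0.0, INF]          # start in state 0
    back = []                  # back[t][s] = (prev_state, level) best way into s
    for v in x:
        new, bp = [INF, INF], [None, None]
        for prev in (0, 1):
            for j, lvl in enumerate(LEVELS[prev]):
                c = cost[prev] + (v - lvl) ** 2
                if c < new[j]:
                    new[j], bp[j] = c, (prev, lvl)
        cost = new
        back.append(bp)
    s, out = (0 if cost[0] <= cost[1] else 1), []
    for bp in reversed(back):
        prev, lvl = bp[s]
        out.append(lvl)
        s = prev
    return np.array(out[::-1])

def greedy_quantize(x):
    s, out = 0, []
    for v in x:
        j = min((0, 1), key=lambda b: (v - LEVELS[s][b]) ** 2)
        out.append(LEVELS[s][j])
        s = j
    return np.array(out)

x = np.random.default_rng(1).normal(size=4096)
print("greedy  MSE:", np.mean((x - greedy_quantize(x)) ** 2))
print("viterbi MSE:", np.mean((x - viterbi_quantize(x)) ** 2))
```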
Adaptive draft control is a feedback loop, not a heuristic. Fixed --spec-draft-n-max is wasteful: when acceptance is high you under-draft, when it's low you over-draft and waste compute on rejected tokens. Bee's default profit controller continuously estimates effective tok/s under speculation and compares against an ongoing no-spec baseline. If speculation is losing it shrinks the horizon, otherwise it grows it. The fringe alternative is simpler — it maps observed acceptance-rate bands to discrete draft depths. profit handles bursty workloads better, fringe is more predictable. On 16GB this matters because aggressive drafting eats memory bandwidth that would otherwise serve the target's decode.
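I haven't read the controller's implementation closely, so the following is a guessed minimal sketch of the feedback shape rather than Bee's actual profit logic: track a no-spec throughput baseline, grow the horizon while speculation beats it, shrink it when it doesn't.

```python
# Minimal sketch of a profit-style feedback loop (my guess at the shape,
# not Bee's controller): grow the draft horizon while speculation beats
# the no-spec baseline, shrink it when it doesn't.
class DraftHorizonController:
    def __init__(self, n_min=1, n_max=16):
        self.n_min, self.n_max = n_min, n_max
        self.draft_n = n_min
        self.baseline_tps = None  # EMA of measured no-spec decode throughput

    def record_baseline(self, tokens: int, seconds: float, alpha: float = 0.2):
        tps = tokens / seconds
        self.baseline_tps = tps if self.baseline_tps is None else \
            (1 - alpha) * self.baseline_tps + alpha * tps

    def update(self, accepted_tokens: int, step_seconds: float) -> int:
        """Call after each verification step; returns the next draft horizon."""
        if self.baseline_tps is None:
            return self.draft_n
        spec_tps = accepted_tokens / step_seconds
        if spec_tps > self.baseline_tps:      # speculation is profitable: push harder
            self.draft_n = min(self.draft_n + 1, self.n_max)
        else:                                 # losing: back off toward no-spec
            self.draft_n = max(self.draft_n - 1, self.n_min)
        return self.draft_n
```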
Why a 16GB Mac user should care
Two of Bee's features matter disproportionately on a memory-starved machine, and one matters less than you'd think.
TurboQuant KV-cache compression is the big win. On a 16GB Mac, your bottleneck is rarely raw flops — it's that every token of context spends KV bytes you don't have. At Q4 weights, a 7B model leaves < 9GB for KV + activations + OS. At long context, KV is the line item that explodes. Bee's Mac-supported options are turbo3 (~5.12x compression) and turbo4 (~3.88x). turbo2, turbo2_tcq, and turbo3_tcq are CUDA-only at the time of writing per Bee's quickstart. Bee reserves "practically lossless" for the higher-bit modes (turbo4) — turbo3 is more aggressive and quality-cost is workload-dependent.
KV byte math, Qwen 3 8B at 32K context:
Architecture: 36 layers, GQA with 8 KV heads × 128 dim = 1024 KV channels.
Per-token KV bytes = 2 (K+V) × 36 layers × 1024 dim × bytes_per_elem
- fp16 KV (2 bytes/elem): 144 KiB/token → ~4.5 GiB at 32K
- Q8_0 KV (~1.06 bytes/elem in ggml): 76.5 KiB/token → ~2.4 GiB at 32K
- turbo4 (Bee block: 66 B / 128 vals = 4.125 bpv, ~3.88x), Mac path: ~37.1 KiB/token → ~1.16 GiB at 32K
- turbo3 (Bee block: 50 B / 128 vals = 3.125 bpv, ~5.12x), Mac path: ~28.1 KiB/token → ~900 MiB at 32K
- turbo3_tcq (3.25 bpv, ~4.92x, CUDA-only today): ~29.3 KiB/token → ~940 MiB at 32K (reference only on Mac)
The Mac-usable rows are the turbo3 / turbo4 lines. The TCQ entry is shown for comparison since the kernels aren't ported to Metal yet. That's the difference between "out of memory" and "fine, use the rest for weights and a drafter."
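If you want to rerun or adapt the arithmetic, here's a small calculator; the turbo* bytes-per-value figures come from the block sizes quoted above, and Q8_0 uses ggml's 34-byte / 32-value block.

```python
# Reproduce the KV table above for Qwen 3 8B (36 layers, 1024 KV channels, 32K ctx).
LAYERS, KV_DIM, CTX = 36, 1024, 32_768

bytes_per_value = {
    "fp16": 2.0,
    "q8_0": 34 / 32,         # ggml Q8_0 block: 34 bytes per 32 values
    "turbo4": 66 / 128,      # Bee block sizes quoted above
    "turbo3": 50 / 128,
    "turbo3_tcq": 3.25 / 8,  # CUDA-only today, shown for comparison
}

for name, bpv in bytes_per_value.items():
    per_token = 2 * LAYERS * KV_DIM * bpv        # K and V
    total_gib = per_token * CTX / 2**30
    print(f"{name:10s} {per_token / 1024:6.1f} KiB/token  {total_gib:5.2f} GiB at 32K")
```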
turbo3 on a 16GB Mac frees ~3.6 GiB vs fp16: room for weights, drafter, and OS.

DFlash speculative decoding is real but harder to fit. Speculative decoding requires a second model in memory (the drafter) plus a hidden-state ring buffer per target layer. On a 24GB 3090 the math is comfortable. On 16GB unified, every byte the drafter takes is a byte you can't spend on target weights or KV. On Mac you also pay an extra ~2 GiB for the f32 CPU ring (the GPU cross-ring fast path is CUDA-only). Practical answer: speculative decoding only pays off if you size for it from the start — small target (7B–14B class) plus a tiny DFlash drafter, both at aggressive quants. With a 27B target there is no room for a drafter on 16GB.
The deeper reason speculative decoding helps on Apple Silicon at all is memory bandwidth, not compute. The base Mac Mini M4 has roughly 120 GB/s of unified memory bandwidth. Decoding one token at a time reads the entire model's weights once per token — at Q4 a 5GB model means decode is bandwidth-bound around 24 tok/s before any compute cost. Speculative decoding amortizes that read: one batched forward pass over k draft tokens reads the weights once and verifies k candidates. The exact breakeven depends on target verification cost vs single-token decode, drafter cost, and KV/activation overhead — there's no clean "acceptance > 1/k" rule. Rough intuition: speculation wins when expected emitted tokens per verification step exceed the verification overhead measured in single-decode-equivalents. That's why DFlash's high acceptance rate matters more than its raw drafter speed. The drafter is cheap, the target read is the line item.
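Here's the back-of-envelope version of that intuition as a script. It assumes every forward pass reads the full weight set once and ignores drafter cost, KV/activation traffic, and batch compute, so treat the output as a ceiling estimate, not a prediction.

```python
# Bandwidth-bound decode ceiling and a crude speculative ceiling.
BANDWIDTH_GBS = 120      # base M4 unified memory bandwidth
WEIGHTS_GB = 5           # ~7-8B model at Q4

decode_tps = BANDWIDTH_GBS / WEIGHTS_GB   # one full weight read per token
print(f"plain decode ceiling: ~{decode_tps:.0f} tok/s")

for k, p in [(4, 0.5), (4, 0.8), (8, 0.8)]:
    # expected tokens emitted per verification pass under an i.i.d.
    # per-token acceptance probability p: (1 - p^(k+1)) / (1 - p)
    emitted = (1 - p ** (k + 1)) / (1 - p)
    print(f"k={k}, p={p:.0%}: ~{decode_tps * emitted:.0f} tok/s ceiling")
```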
DDTree branch verification is mostly off-limits. Bee disables it automatically when the target spans more than one GPU — but on Mac the relevant constraint is total memory, not GPU count. The branch budget multiplies the KV state being verified per step. On 16GB, you generally want budget = 0 and put those bytes into context length.
Honest note: The Bee announcement and quickstart are written against CUDA on a 3090/4090. I haven't seen public Mac/Metal benchmarks for the fork yet. Below is what should work given that Bee is a llama.cpp fork (Metal backend is upstream) and how the features compose against 16GB unified memory. Treat the suggested configs as starting points to measure, not as published numbers.
Building on Apple Silicon
Bee is a llama.cpp fork, so the Metal build path should be the standard llama.cpp one. Metal is enabled by default on macOS — no flags needed.
git clone https://github.com/Anbeeld/beellama.cpp
cd beellama.cpp
cmake -B build
cmake --build build --config Release -j
# Sanity check Metal is on
./build/bin/llama-cli --version
If cmake isn't installed: brew install cmake. If you've never built llama.cpp before, my Running Qwen locally on a Mac Mini M4 16GB post walks through the upstream version with model download and the --mmap trick — those instructions transfer 1:1 here.
Reading the Bee source (commit 10b2a7f at time of writing), the Metal story is materially worse than I'd guessed before checking. The fast paths for several headline features resolve CUDA proc addresses. The CPU reference paths exist but in some cases (TCQ) zero out the relevant rows rather than implementing them. Honest portability map:
| Component | Kernel shape | Metal status today |
|---|---|---|
| TurboQuant (WHT rotate + scalar quant) | WHT is structured ±1 ops, dequant is ALU-bound | Partial — Metal wires some TurboQuant paths, not all |
| TCQ (trellis decode, turbo*_tcq variants) | Per-block Viterbi-style decode | CUDA-only effectively. CPU stub in ggml-turbo-quant.c zeros TCQ rows — no real Metal kernel. Avoid TCQ on Mac for now. |
| DFlash GPU cross-ring + KV projection cache | GPU-side ring buffer + cross-attention kernel | CUDA-only. Bee's quickstart explicitly says macOS DFlash uses the CPU ring path. |
| DFlash host hidden-state ring path | Hidden-state ring lives in host f32; drafter execution still controlled by --spec-draft-ngl | Works on Mac, but pays the f32 ring memory tax above |
| Adaptive draft control (profit/fringe) | Host-side controller | Not needed — runs on CPU regardless |
| DDTree tree masks + recurrent kernels | Custom tree-attention with parent_ids indirection | CUDA-specific or fallback-heavy. Author marks WIP. Treat as off on Mac. |
| CopySpec rolling-hash matcher | Host-side hash table over token IDs | Not needed — CPU path, runs anywhere |
The practical Mac story: turbo3 / turbo4 KV + DFlash via the host hidden-state ring + CopySpec are the parts you can actually use today. turbo2, all *_tcq variants, and DDTree need Metal ports that don't exist yet. If a CUDA-only quant appears to "work" on Metal it may be silently producing zeroed rows — verify outputs before trusting them.
Picking a model that actually fits
The Reddit demo uses Qwen 3.6 27B at Q5 with a DFlash drafter. On 16GB unified, that's not the right target. Realistic options, ordered by how aggressive you want to be:
| Target | Quant | Weights ~ | Speculative? | Notes |
|---|---|---|---|---|
| Qwen 3.6-35B-A3B (MoE) | IQ3_XXS | ~13–14GB on disk. ~3B active params. Resident RAM depends on mmap + expert paging | No | Tight/swap-bound on 16GB. IQ4_XS (~18.8GB) and Q4_K_M (~21.4GB, sized for 24GB GPUs) will both swap. Start with smaller context, grow only if measurements stay clean. |
| Qwen 3 14B | Q4_K_M | ~9GB | Marginal | Bee shines here only if you find a tiny matching DFlash drafter and run KV at turbo3 (TCQ is CUDA-only). |
| Qwen 3 8B | Q4_K_M / Q5_K_M | ~5–6GB | Yes | Best fit for actual DFlash on 16GB — leaves headroom for drafter + KV + vision. |
| Qwen 3 4B / 7B class | Q5_K_M | ~3–5GB | Yes (or CopySpec) | Where TurboQuant + long context shines. Easy 32K+ context with turbo3 on Metal. |
If you want the 35B-A3B MoE — and you should, it's punching above its active-param count — Bee buys you long context via TurboQuant, not raw speed via DFlash. The MoE weights are mmap'd and the OS pages active experts. There's no spare memory for a drafter on 16GB.
One subtlety worth being precise about: speculative decoding doesn't require the drafter to "agree with" the target's expert routing. The target performs its own routing during verification, and acceptance is decided against target logits regardless of what the drafter thought. The real failure mode is more mundane — a small dense drafter often matches a sparsely-activated MoE target's distribution worse than it would match a dense target of comparable quality, so acceptance tends to be lower per byte of drafter. That's still bad for the 16GB budget, but for distribution-mismatch reasons rather than routing-mismatch reasons.
A working 16GB starting config
Pick the lane that matches what you need.
Lane A — long context, no speculative decoding (Qwen 3.6-35B-A3B MoE):
./build/bin/llama-server \
-m models/Qwen3.6-35B-A3B-IQ3_XXS.gguf \
--mmap \
--cache-type-k turbo3 \
--cache-type-v turbo3 \
-c 8192 \
--port 8080
Use IQ3_XXS on 16GB — IQ4_XS (~18.8GB) and Q4_K_M (~21.4GB) are sized for 24GB GPUs and will swap. The MoE handles weights via mmap, but on a 16GB machine the working set is still tight. Start at 8K context and grow only if memory pressure stays clean. TurboQuant cuts KV size enough to push context past what flat f16 KV would let you do. Do not use turbo*_tcq or turbo2 on Mac — those kernels are CUDA-only today.
Lane B — speculative decoding on a smaller target (Qwen 3 8B + DFlash drafter):
./build/bin/llama-server \
-m models/Qwen3-8B-Q4_K_M.gguf \
--spec-type dflash \
--model-draft models/Qwen3-DFlash-Drafter.gguf \
--spec-dflash-cross-ctx 1024 \
--cache-type-k turbo3 \
--cache-type-v turbo3 \
-c 16384 \
--port 8080
This is where Bee's adaptive draft-max controller earns its keep — the profit default automatically backs off when speculation isn't paying off and pushes harder when acceptance is high. Do not add --spec-branch-budget on 16GB — budget = 0 leaves more room for context. Mind the f32 CPU ring tax: with the GPU cross-ring being CUDA-only, expect roughly n_layers × hidden × ring_slots × 4 bytes on top of model and drafter weights.
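To see why 8B is the ceiling for this lane, rough budget arithmetic (the drafter size is a placeholder assumption; actual DFlash drafter GGUFs vary):

```python
# Rough Lane B budget on 16 GB unified memory. Ring math uses Qwen 3 8B's
# 36 layers x 4096 hidden dim and the 4096-slot f32 CPU ring; the drafter
# and OS figures are placeholder assumptions.
GIB = 2**30
target_weights = 5.0 * GIB                       # Qwen 3 8B Q4_K_M, approx
drafter_weights = 0.5 * GIB                      # placeholder, depends on drafter
cpu_ring = 36 * 4096 * 4096 * 4                  # layers * slots * hidden * f32
kv_turbo3 = 2 * 36 * 1024 * (50 / 128) * 16_384  # K+V at 16K context
os_and_apps = 3.0 * GIB                          # macOS + whatever else is open

used = target_weights + drafter_weights + cpu_ring + kv_turbo3 + os_and_apps
print(f"ring     : {cpu_ring / GIB:.2f} GiB")
print(f"KV @ 16K : {kv_turbo3 / GIB:.2f} GiB")
print(f"total    : {used / GIB:.2f} GiB of 16 GiB")
```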
Lane C — model-free speculation (no drafter at all):
./build/bin/llama-server \
-m models/Qwen3-14B-Q4_K_M.gguf \
--spec-type copyspec \
--cache-type-k turbo3 \
--cache-type-v turbo3 \
-c 16384 \
--port 8080
CopySpec is a rolling-hash suffix match over previous tokens — no draft model in memory and entirely host-side, so it works the same on Metal as on CUDA. Best on workloads with repetition (code edits, structured output, agent traces). The bytes you'd otherwise spend on a drafter go to KV and weights. On Mac, turbo3 is the most aggressive Metal-supported KV setting. turbo2 exists but its kernels are CUDA-only at the time of writing.
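A minimal sketch of the idea, not Bee's implementation (which maintains a rolling hash incrementally instead of re-indexing): index the n-grams seen so far, and when the current suffix matches an earlier occurrence, propose the tokens that followed it.

```python
from collections import defaultdict

def propose_draft(tokens, n=3, max_draft=8):
    """Return up to max_draft tokens copied from a previous occurrence of the
    current n-token suffix, or [] if that suffix hasn't been seen before."""
    if len(tokens) < n + 1:
        return []
    index = defaultdict(list)                # n-gram -> positions where it ends
    for i in range(len(tokens) - n):         # exclude the current suffix itself
        index[tuple(tokens[i:i + n])].append(i + n)
    suffix = tuple(tokens[-n:])
    if not index[suffix]:
        return []
    start = index[suffix][-1]                # most recent earlier occurrence
    return tokens[start:start + max_draft]

# toy "tokens": the repeated "for i in" lets the matcher copy what followed it
history = ["for", "i", "in", "range", "(", "n", ")", ":",
           "print", "(", "i", ")", "for", "i", "in"]
print(propose_draft(history))  # ['range', '(', 'n', ')', ':', 'print', '(', 'i']
```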
For multimodal, add --mmproj path/to/mmproj.gguf. Per the announcement, the model can be fully offloaded to CPU when VRAM is tight. On Apple Silicon the analogue is letting unified memory handle it without forcing all layers onto the GPU.
Tuning notes for 16GB
- Cache type per direction: the K-vs-V sensitivity story is more complicated than the older "V is more sensitive" folklore. Recent KV-quant work (e.g. KVTuner, "More for Keys, Less for Values") argues keys are often more sensitive — softmax can amplify K errors by reshuffling logit rankings, while small V errors average out across many attention weights. Practical advice: don't assume either direction deserves the extra bits a priori. Benchmark per model. On Mac, where only turbo3 and turbo4 are usable, a defensible starting point for asymmetric configs is --cache-type-k turbo4 --cache-type-v turbo3 (more bits on K).
- Acceptance-rate watchdog for DFlash: on a 120 GB/s machine the target weight read dominates decode, but breakeven for speculation isn't a simple ratio — it depends on verification batch cost, drafter cost, and KV/activation overhead together. The cleanest signal is operational: watch the profit controller's chosen draft-max. If it converges away from 1, speculation is paying off. If it pins at 1, the drafter is just stealing memory and you should disable it.
- Reasoning-loop protection on: small models with tight context loop more often. Default force-close is the right behavior. Tighten --reasoning-loop-window if you see runaway thinking blocks.
- Context vs. drafter tradeoff: with turbo3 KV on Qwen 3 8B (28.1 KiB/token), 1 GiB buys roughly 37K tokens of context. The f32 host hidden-state ring tax (~2 GiB on a 32-layer model) is therefore paying ~74K tokens of potential context for a drafter that may or may not earn it back. Verify with the profit controller before committing. If your workload is long-context (large repos, long chats) and not latency-bound, TurboQuant alone usually wins over DFlash + smaller context.
- mmap stays on: always. Same reason as in the upstream Mac Mini playbook.
- Per-request overrides: Bee accepts dotted JSON keys speculative.n_max and speculative.branch_budget in the request body (see tools/server/server-task.cpp). Useful for routing — interactive chat with low budget, batch generation with higher. A hedged request example follows this list.
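On that last point, a hedged example: the endpoint and the prompt/n_predict/content fields are the standard llama-server /completion shape, and the dotted keys are the ones named above. I haven't verified the exact payload placement against tools/server/server-task.cpp, so check docs/beellama-args.md if the server rejects it.

```python
import requests

# Per-request speculative overrides via Bee's dotted JSON keys (assumed to sit
# at the top level of the /completion body; verify against the Bee docs).
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "Refactor this function to use pathlib:\n...",
        "n_predict": 256,
        "speculative.n_max": 4,         # shallow drafts for interactive chat
        "speculative.branch_budget": 0, # keep DDTree off on 16GB
    },
    timeout=120,
)
print(resp.json()["content"])
```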
What I'd actually do
If I had a Mac Mini M4 16GB and wanted to test Bee tomorrow, the order would be:
1. Build Bee. Run a Qwen 3 8B Q4 with no speculative decoding and turbo3 KV (skip TCQ on Metal). Measure tok/s and verify quality on real tasks.
2. Add a DFlash drafter. Compare against (1). If Bee's profit controller settles to draft-max > 1 and tok/s actually goes up, keep it. If not, the drafter is just stealing memory.
3. Try the 35B-A3B MoE with TurboQuant for long-context use cases (codebase Q&A, doc summarization). Speculative decoding is off the table here on 16GB.
4. Switch to CopySpec for repetitive workloads (code edits, structured output) and compare against DFlash.
Two honest caveats. First, I have not run Bee on Metal yet — there's a non-zero chance some of the speculative paths have CUDA-only code that hasn't been ported. The author flags DDTree's tree kernels as work-in-progress, so Metal coverage there is poor at best. Second, even at the upper bound (Qwen 8B + DFlash drafter), 16GB is tight enough that the constants matter — a few hundred MB of leftover Safari tabs can be the difference between zero swap and visible stutter.
Useful links
- Anbeeld/beellama.cpp — the fork
- Bee quickstart — Qwen 3.6 27B + DFlash (3090/4090 target, useful as a reference config)
- docs/beellama-features.md — full feature comparison vs. upstream and other forks
- docs/beellama-args.md — argument reference
- Original r/LocalLLaMA announcement
- TheTom/llama-cpp-turboquant — TurboQuant origin
- spiritbuun/buun-llama-cpp — TCQ + original DFlash port
- Closing the Gap: TCQ for KV Cache at 2-3 Bits (paper)
- DFlash and DDTree: 8x Faster LLM Inference — companion explainer
- Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM — upstream-llama.cpp baseline for the same hardware