The Best Fully-Local Voice Stack for Home Assistant (2026)

Local LLMs Home Assistant Voice AI
June 17, 2026

In December 2024, Home Assistant shipped the Voice Preview Edition and called it "the era of open voice." Eighteen months on, that pitch has actually landed. You can build a voice assistant that hears you, understands you, and controls your house — with every stage running on hardware you own and nothing leaving your network.

I went through what people are actually running right now, cut the hype, and this is the opinionated answer: the stack that works in 2026, the three forks where you have to make a real decision, and the one setting almost every guide skips.

The stack, in one sentence

A fully-local Home Assistant voice pipeline is four stages wired together by the Assist pipeline over Wyoming, a small open protocol from the Rhasspy project:

wake word  →  speech-to-text  →  conversation agent  →  text-to-speech
openWakeWord   faster-whisper      Ollama + Qwen3          Piper / Kokoro
               or Speech-to-Phrase  (LLM, optional)

Every box has an open, self-hostable option, and in 2026 they're genuinely good. Built-in intents and Speech-to-Phrase answer in well under a second; an LLM-backed pipeline with Whisper depends on your GPU and model and lands in a few seconds. Either way the TTS is natural and the Home Assistant integration is tight. The tools to run this yourself have been production-ready for a couple of years; what changed is that the models finally got small and fast enough to be pleasant.

Why bother: privacy is the obvious reason, but the underrated one is latency and reliability. A local intent never round-trips to someone's data center, never gets deprecated, and never stops working because a subscription lapsed. "I'm throwing Alexa in the trash" — NetworkChuck's popular fully-local build is the cultural anchor for why people do this.

Fork 1: the conversation agent (the LLM)

This is the optional-but-transformative stage. Without an LLM, Home Assistant's built-in intent engine handles fixed commands ("turn off the kitchen lights"). With one, you can say "it's cold in here" and have it figure out the thermostat. The catch: the model has to call tools reliably, because that's how it actually flips your switches.

The single most important rule: pick by tool-calling reliability, not parameter count. As XDA put it, the biggest model on your machine is useless if it can't call a single tool. The 2026 community default is the Qwen3 family, for exactly that reason.

Model	Fits on	Tool calling	Notes
Qwen 3.5 9B	8GB GPU / Apple Silicon	Excellent	The single-GPU sweet spot
Qwen3.6 27B / 35B-A3B	24GB+ GPU	Most reliable	Best if you have the headroom
Gemma 4 26B-A4B	24GB+ GPU	Good (with a fix)	Forces a reasoning trace — see below
GPT-OSS 20B	16GB+ GPU	Good	Solid open alternative

Community reports and early guides favor the Qwen3.6 27B and 35B-A3B variants as tool-callers, though public benchmark coverage for them is still thin (InsiderLLM). The safer, well-trodden pick is Qwen 3.5 9B: it fits comfortably on a single 8GB card and reliably handles real tool calls. If you're on Apple Silicon, that 9B is the one I'd reach for first (see my Mac Mini local LLM setup for the inference side of this).

You host whichever you pick in Ollama, which has a direct Home Assistant integration:

# On your inference box
ollama pull qwen3.5:9b
# Ollama serves an OpenAI-compatible API at http://localhost:11434
# In Home Assistant: Settings → Devices & Services → Ollama,
# point it at http://<your-box>:11434 and select the model
# as your Assist conversation agent.

The Gemma 4 gotcha: Gemma 4 is, per XDA, "not the smartest local LLM but the one I reach for most." But it forces a reasoning trace by default, so tool-call output can land in the reasoning_content field instead of content — and Home Assistant never sees the tool call. In the Home Assistant Ollama integration, turn off the "Think before responding" option and verify tool calls actually fire. If you serve the model directly with llama.cpp or a compatible server, the equivalent is disabling thinking in the chat template (--jinja --chat-template-kwargs '{"enable_thinking":false}'). If your Gemma agent "understands" you but never acts, this is why.

Fork 2: speech-to-text (speed vs flexibility)

Home Assistant gives you two real STT options, and the right pick depends entirely on whether you're running an LLM.

Engine	Speed	Vocabulary	Use when
Speech-to-Phrase	<1s on Pi4, ~150ms on Pi5	Fixed command set	Pure intent control, no LLM
faster-whisper	Sub-second on a GPU	Open-ended speech	You've added an LLM agent

Speech-to-Phrase trades flexibility for raw speed. It only knows a small built-in set — lights, media players, timers, weather — but it transcribes in under a second on a Raspberry Pi 4. If your assistant only ever needs to control the house, this is the faster, lighter choice.

Whisper (run it as faster-whisper, medium model) is the one to use the moment you bolt on an LLM and want to say arbitrary things. It needs real hardware to stay snappy, but it's the open-ended ear. Deploy it as a Wyoming add-on or a Docker container:

# faster-whisper as a Wyoming service (Docker)
docker run -d --name whisper \
  -p 10300:10300 \
  -v whisper-data:/data \
  rhasspy/wyoming-whisper \
  --model medium --language en
# Then add it in Home Assistant:
# Settings → Devices & Services → Wyoming Protocol → host:10300

Fork 3: text-to-speech (speed vs naturalness)

The voice your house talks back in. Same shape of decision: fast and fine, or slower and lovely.

Piper is the default and the speed pick. It's a fast neural TTS optimized for the Pi 4, generating audio 3-5x faster than realtime on CPU alone, per OfflineTTS. It sounds surprisingly good for something running on a $50 board, and it has a first-class Home Assistant integration.

Kokoro is where people go when Piper isn't natural enough. Its 82M-parameter StyleTTS2 model produces noticeably warmer speech with 54 voices across 9 languages. The practical reason to switch, beyond vanity: builders who use voice for readouts find Piper mangles currency amounts, phone numbers, and addresses, while Kokoro handles them cleanly. The cost is more compute and a slightly heavier setup.

On the TTS hype: your feed is full of viral "this free TTS destroys ElevenLabs" videos right now — Qwen-3 TTS, IndexTTS2, voice cloning from 6 seconds of audio. They're impressive, but they're content-creator tools, not yet wired into the Assist pipeline. For Home Assistant in 2026, the real choice is still Piper vs Kokoro. Don't let the cloning demos pull you off the boring, working path.

Wake word

The best open option in 2026 is still openWakeWord. The pre-trained words ("Hey Jarvis", "Okay Nabu") are what most people use. A custom word is trained through the project's Colab synthetic-data workflow (thousands of generated samples, not a handful of your own recordings), and a small personal verifier model on top can cut false activations in your specific room. The official Voice Preview Edition hardware ships with this built in, which is the painless route if you don't want to assemble a satellite yourself.

The setting everyone skips: prefer_local_intents

If you take one thing from this post, take this. Once you attach an LLM, the temptation is to route everything through it. Don't. Home Assistant can try its built-in intent recognition first, before touching the model:

# In your Assist pipeline / conversation agent config
prefer_local_intents: true

With this on, "turn off the lights" is matched locally and executed instantly — no LLM round-trip, no latency, no chance of the model hallucinating a different entity. The LLM only gets invoked for the genuinely open-ended requests it's actually needed for. Per Botmonster's 2026 writeup, this is the single most impactful setting most guides leave out. It's the difference between a voice assistant that feels instant and one that feels like it's thinking too hard about turning on a lamp.

My recommended stack, by hardware tier

Tier	Wake	STT	Agent	TTS
Pi 5 / HA Green (no GPU)	openWakeWord	Speech-to-Phrase	Built-in intents	Piper
8GB GPU / Apple Silicon	openWakeWord	faster-whisper	Qwen 3.5 9B (Ollama)	Piper
24GB+ GPU	openWakeWord	faster-whisper (medium)	Qwen3.6 27B (Ollama)	Kokoro

Start at the tier your hardware allows and turn on prefer_local_intents from day one. The cheapest tier is genuinely useful — it controls the house with sub-second latency and zero cloud. Add the LLM when you want conversation, not just commands. That's the whole "era of open voice" promise, finally delivered.

Free API options (no GPU, but not private)

If you don't have a GPU and don't want to babysit a model server, you can keep the wake word and the Assist pipeline local and offload just the heavy stages to a free API tier. Be honest about the trade-off: the moment audio or text leaves your network, this stops being a private assistant. It's still cheaper and lighter than buying hardware, and a reasonable middle ground. The nice part is you can mix and match — local wake word, free-API STT, free-API LLM, free-API TTS — all wired through the same Assist pipeline.

Stage	Free option	How to wire it	Catch
LLM agent	Google Gemini	Official integration, key from AI Studio	Free tier has rate limits; pick a current Flash/Flash-Lite model (e.g. Gemini 3.1 Flash-Lite or 3.5 Flash), not a deprecated preview model
LLM agent	Groq Cloud	HACS integration; serves Llama, Gemma, Qwen	Free tier, lightning fast, generous limits
LLM agent	OpenRouter (:free models)	OpenAI-compatible endpoint	Several zero-priced `:free` models available; the lineup changes and quality varies
STT	Groq Whisper	HACS integration into the Assist pipeline	Free up to ~28,800 audio seconds/day, no card
TTS	Microsoft Edge TTS	HACS integration, no API key needed	Unofficial (uses Edge's TTS endpoint); neural voices, free

The standout free pick in 2026 is Groq: it's an inference host with a genuinely usable free tier, and it's the fastest thing in production for Whisper STT, plus it serves open LLMs (Llama, Gemma, Qwen) fast enough that the round-trip barely costs you latency. Pair it with Edge TTS (free, no key, surprisingly natural neural voices) and you have a no-GPU stack that costs $0. A common, robust pattern per this homelab comparison is Edge TTS first with Google as a fallback for reliability.

For the LLM specifically, Google's Gemini integration is the easiest zero-config entry — it's an official Home Assistant integration and adds conversation, STT, and TTS entities in one shot. Watch the model name: per HowToGeek, the free limits tightened in 2026, so pick a current Flash or Flash-Lite model the integration supports and don't pin a deprecated preview model. Check the live list in Google's model docs before you commit to one.

The honest framing: "free API" and "local" solve different problems. Free APIs solve "I have no GPU and don't want to run a server." Local solves "nothing about my home leaves my house." If privacy is why you came to Home Assistant in the first place, the free APIs are a convenient on-ramp, not the destination — keep the local stack as the goal.

FAQ

Can a local voice assistant replace Alexa or Google Home? For device control, timers, and routines — yes, completely. For open-ended general knowledge it still trails the cloud, but for a smart home the local stack is fully sufficient and private.

Which LLM should I actually start with? Qwen 3.5 9B in Ollama. It fits on an 8GB card or Apple Silicon and calls tools reliably, which is the only thing that matters for an Assist agent.

Do I need a GPU? Not for basic control — a Pi 5 with Speech-to-Phrase and Piper works with no LLM at all. You only need a GPU (or Apple Silicon) once you want LLM-backed conversation.

Are there free API options if I have no GPU? Yes. You can offload the heavy stages to free tiers — Google Gemini (official integration), Groq Cloud for the LLM and Whisper STT, OpenRouter's free models, and Microsoft Edge TTS — while keeping the wake word and Assist pipeline local. The trade-off is privacy: that audio and text leaves your network, so it's a convenient on-ramp rather than a truly local assistant.

The stack, in one sentence

Fork 1: the conversation agent (the LLM)

Fork 2: speech-to-text (speed vs flexibility)

Fork 3: text-to-speech (speed vs naturalness)

Wake word

The setting everyone skips: prefer_local_intents

My recommended stack, by hardware tier

Free API options (no GPU, but not private)

FAQ

Useful links