hook your own custom logic and scripts when certain phrases are recognized, without writing any code.
Since I wrote that article, a few things have changed:
- When I wrote the article, Platypush only supported the Google Assistant as a voice back end. Since then, I’ve
worked on [supporting Alexa as well](https://github.com/BlackLight/platypush/issues/80). Feel free to use the
`assistant.echo` integration in Platypush if you’re an Alexa fan, but bear in mind that it’s more limited than the
existing Google Assistant-based options, because of limitations in the AVS (Alexa Voice Service). Since the AVS mostly
takes audio as input and produces audio as output, it exposes neither the transcript of the detected speech nor the
transcript of the rendered response, which makes it impossible to attach custom hooks to specific phrases. It may
also suffer from some minor audio glitches, at least on RaspberryPi.
- Although the library is deprecated, a new release of the Google Assistant
Library [has been made available](https://github.com/googlesamples/assistant-sdk-python/releases/tag/0.6.0) to fix the
segmentation fault issue on RaspberryPi 4. I’ve buzzed the developers often over the past year and I’m glad that it’s
been done! It’s good news because the Assistant library has the best engine for hotword detection I’ve seen. No other
SDK I’ve tried — Snowboy, DeepSpeech, or PicoVoice — comes close to the native “Ok Google” hotword detection accuracy
and performance. The news isn’t all good, however: the library is still deprecated, and no alternative is currently
on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But
at least one of the best options out there to build a voice assistant will still work for a while. Those interested in
building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.
- In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more
state-of-the-art alternatives. I’ve been a long-time fan of [Snowboy](https://snowboy.kitt.ai/), which has a
well-supported Platypush integration, and I’ve used it as a hotword engine to trigger other assistant integrations for
a long time. However, when it comes to accuracy in real-time scenarios, even its best models aren’t that satisfactory.
I’ve also experimented with
[Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech) and [PicoVoice](https://github.com/Picovoice) products
for speech detection, and I’ve built integrations for them in Platypush. In this article, I’ll try to provide a comprehensive overview
of what’s currently possible with DIY voice assistants and a comparison of the integrations I’ve built.
- **EDIT January 2021**: Unfortunately, as of Dec 31st,
2020 [Snowboy has been officially shut down](https://github.com/Kitt-AI/snowboy/). The GitHub repository is still
there; you can still clone it and either use the example models provided under `resources/models`, train a model
using the Python API, or use any of your previously trained models. However, the repo is no longer maintained, and the
website that could be used to browse and generate user models is no longer available. It’s a real shame: the user
models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained
open-source project, and its demise shows how difficult it is to keep such projects alive when nobody funds the time
the developers invest in them. Anyway, most of the Snowboy examples reported in this article will still work
if you download and install the code from the repo, as in the sketch below.
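For reference, this is roughly how you can run one of the bundled models through the Python API. The sketch is adapted from the demo scripts under `examples/Python` in the Snowboy repo; the exact module layout and paths may differ depending on how you built and installed the package:

```python
# Minimal Snowboy usage sketch, adapted from the demos under examples/Python
# in the Kitt-AI repo. It assumes the compiled `snowboydecoder` module is on
# your PYTHONPATH and that you run it from the root of the cloned repo.
import snowboydecoder

def on_hotword():
    print('Hotword detected!')

detector = snowboydecoder.HotwordDetector(
    'resources/models/snowboy.umdl',  # or any .pmdl/.umdl you trained earlier
    sensitivity=0.5,                  # higher = more sensitive, more false positives
)

# Blocks and keeps scanning the microphone input until interrupted
detector.start(detected_callback=on_hotword, sleep_time=0.03)
detector.terminate()
```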
## The Case for DIY Voice Assistants
Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:
- **Privacy**. The easiest one to guess! I’m not sure if a microphone in the house, active 24/7, connected to a private
company through the internet is a proportionate price to pay for between five and ten interactions a day to toggle the
lightbulbs, turn on the thermostat, or play a Spotify playlist. I’ve built the voice assistant integrations in
Platypush with the goal of giving people the option of voice-enabled services without sending all of their daily voice
interactions over a privately-owned channel through a privately-owned box.
- **Compatibility**. A Google Assistant device will only work with devices that support Google Assistant. The same goes
for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily,
depending on the availability of the cloud connections, or permanently, because of hardware or software deprecation or
other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to
interact with, and does not depend on business decisions.
- **Flexibility**. Even when a device works with your assistant, you’re still bound to the features that have been
agreed upon and implemented by the two parties. Implementing more complex routines on top of voice commands is usually
tricky. In most cases, it involves writing code that runs in the cloud (in the form of Actions, Lambdas, or
IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must be able
to run whichever logic I want on whichever device I want, triggered by whichever custom shortcut I want (even with
regex matching), regardless of the complexity (see the example hook after this list). I also aimed to build an
assistant that can provide multiple services (Google, Alexa, Siri, etc.) in multiple languages on the same device,
simply by using different hotwords.
- **Hardware constraints**. I’ve never understood the case for selling plastic boxes with an embedded microphone and
speaker as the entry ticket to the world of voice services. That was a good way to showcase the idea, but after a
couple of years of experiments, it’s probably time to expect the industry to provide a voice assistant experience
that can run on any device, as long as it has a microphone and a controller unit that can run code. As with
compatibility, there should be no such thing as Google-compatible or Alexa-compatible devices. Any device should be
compatible with any assistant, as long as that device has a way to communicate with the outside world, and the logic
to control that device should be able to run on the same network that the device belongs to.
- **Cloud vs. local processing**. Most of the commercial voice assistants operate by regularly capturing streams of
audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another
connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In
some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice
assistants are dumb terminals intended to communicate with cloud providers that actually do most of the job, and they
exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets
are low-power devices that operate within a fast network and you don’t need much flexibility. But if you can afford to
process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you
want to do things that you usually can’t do with off-the-shelf solutions, you may want to process as much of the load
as possible on your device. I understand the case for a cloud-oriented approach when it comes to voice assistants but,
regardless of the technology, we should always be provided with a choice between decentralized and centralized
computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or
on-cloud, depending on the use case and depending on the user’s preference.
- **Scalability**. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash a copy
of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it’s done, without having to
buy a new plastic box. If I need a voice-powered music speaker, I just take an existing speaker and plug it into a
RaspberryPi. If I need a voice-powered display, I just take an existing display and plug it into a RaspberryPi. If I
need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi,
without having to worry about whether it’s supported in my Google Home or Alexa app. Any device should be given the
possibility of becoming a smart device.
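As an example of the flexibility argument above, here’s a sketch of a custom Platypush hook that reacts to a recognized phrase. It reflects the Platypush scripting API at the time of writing (module paths may differ across versions), and `light.hue` is just an example plugin:

```python
# Sketch of a custom Platypush event hook, to be dropped in a script under
# ~/.config/platypush/scripts. `light.hue` is just an example plugin; swap
# in whichever plugin you actually use.
from platypush.context import get_plugin
from platypush.event.hook import hook
from platypush.message.event.assistant import SpeechRecognizedEvent

@hook(SpeechRecognizedEvent, phrase='turn on the ${room} lights')
def on_lights_on_command(event, room=None, **context):
    # ${room} is a template token: whatever word the user says in that
    # position is passed to the hook as the `room` keyword argument
    get_plugin('light.hue').on(groups=[room])
```

Since the hook runs inside your own network, it can trigger any logic you like: shell commands, HTTP calls to local services, or any other Platypush plugin.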
## Overview of the voice assistant integrations
A voice assistant usually consists of two components:
- An **audio recorder** that captures frames from an audio input device
- A **speech engine** that consumes those frames and keeps track of the current speech context.
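Schematically, the interplay between the two components looks something like the sketch below. This is illustrative glue code, not Platypush internals: only the PyAudio calls are real API, while `engine` and its `process_frame` method are hypothetical names for whichever speech engine you plug in:

```python
# Illustrative sketch of the recorder/engine split (not actual Platypush code).
# `engine` is a hypothetical object that consumes raw PCM frames and returns
# a result once it recognizes a hotword or phrase.
import pyaudio

SAMPLE_RATE = 16000    # most speech engines expect 16 kHz mono PCM
FRAME_LENGTH = 512     # samples per captured frame

def run_assistant(engine):
    audio = pyaudio.PyAudio()
    stream = audio.open(rate=SAMPLE_RATE, channels=1, format=pyaudio.paInt16,
                        input=True, frames_per_buffer=FRAME_LENGTH)

    try:
        while True:
            # The audio recorder: capture raw frames from the input device
            frame = stream.read(FRAME_LENGTH, exception_on_overflow=False)
            # The speech engine: keeps context across frames and emits a
            # result once a hotword or phrase has been fully recognized
            result = engine.process_frame(frame)
            if result:
                print('Recognized:', result)
    finally:
        stream.stop_stream()
        stream.close()
        audio.terminate()
```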
There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of
specific hotwords (like “Ok Google” or “Alexa”), and speech detectors, which instead do proper speech-to-text
transcription using acoustic and language models. As you can imagine, continuously running full speech detection has a
far higher overhead than running hotword detection alone, which only has to compare the captured audio against a
usually short list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice’s Rhino. Instead
of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if
you say *“Can I have a small double-shot espresso with a lot of sugar and some milk”* they may return something like `{"