[//]: # (title: Build custom voice assistants)
[//]: # (description: An overview of the current technologies and how to leverage Platypush to build your customized assistant.)
[//]: # (image: /img/voice-1.jpg)
[//]: # (author: Fabio Manganiello <fabio@platypush.tech>)
[//]: # (published: 2020-03-08)

I wrote [an article](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush) a while
ago that describes how to make your own Google-based voice assistant using just a RaspberryPi, Platypush, a speaker and
a microphone.

That article also showed how to make your own custom hotword model that triggers the assistant if you don’t want to say “Ok
Google”, or if you want distinct hotwords to trigger different assistants in different languages, and how to
hook your own custom logic and scripts when certain phrases are recognized, without writing any code.

Since I wrote that article, a few things have changed:

- When I wrote the article, Platypush only supported the Google Assistant as a voice back end. In the meantime, I’ve
worked on [supporting Alexa as well](https://github.com/BlackLight/platypush/issues/80). Feel free to use the
`assistant.echo` integration in Platypush if you’re an Alexa fan, but bear in mind that it’s more limited than the
existing Google Assistant-based options — there are limitations in the AVS (Amazon Voice Service). For example, it
won’t provide the transcript of the detected text, nor of the rendered response, which means it’s not possible to
insert custom hooks on recognized phrases, because the AVS mostly works with audio as input and provides audio as
output. It can also experience some minor audio glitches, at least on RaspberryPi.

- Although deprecated, a new release of the Google Assistant
Library [has been made available](https://github.com/googlesamples/assistant-sdk-python/releases/tag/0.6.0) to fix the
segmentation fault issue on RaspberryPi 4. I’ve buzzed the developers often over the past year and I’m glad that it’s
been done! It’s good news because the Assistant library has the best engine for hotword detection I’ve seen. No other
SDK I’ve tried — Snowboy, DeepSpeech, or PicoVoice — comes close to the native “Ok Google” hotword detection accuracy
and performance. The news isn’t all good, however: the library is still deprecated, with no alternative currently
on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But
at least one of the best options out there to build a voice assistant will still work for a while. Those interested in
building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.

- In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more
state-of-the-art alternatives. I’ve been a long-time fan of [Snowboy](https://snowboy.kitt.ai/), which has a
well-supported platypush integration, and I’ve used it as a hotword engine to trigger other assistant integrations for
a long time. However, when it comes to accuracy in real-time scenarios, even its best models aren’t that satisfactory.
I’ve also experimented with
[Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech) and [PicoVoice](https://github.com/Picovoice) products
for voice detection, and built integrations for them in Platypush. In this article, I’ll try to provide a comprehensive
overview of what’s currently possible with DIY voice assistants and a comparison of the integrations I’ve built.

- **EDIT January 2021**: Unfortunately, as of Dec 31st,
2020, [Snowboy has been officially shut down](https://github.com/Kitt-AI/snowboy/). The GitHub repository is still
there: you can still clone it and either use the example models provided under `resources/models`, train a model
using the Python API, or use any of your previously trained models. However, the repo is no longer maintained, and the
website that could be used to browse and generate user models is no longer available. It's really a shame - the user
models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained
open-source project, and it just shows how difficult it is to keep such projects alive without anybody funding the
time that the developers invest in them. Anyway, most of the Snowboy examples reported in this article will still work
if you download and install the code from the repo.

## The Case for DIY Voice Assistants

Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:

- **Privacy**. The easiest one to guess! I’m not sure if a microphone in the house, active 24/7, connected to a private
company through the internet is a proportionate price to pay for between five and ten interactions a day to toggle the
lightbulbs, turn on the thermostat, or play a Spotify playlist. I’ve built the voice assistant integrations in
platypush with the goal of giving people the option of voice-enabled services without sending all of the daily voice
interactions over a privately-owned channel through a privately-owned box.

- **Compatibility**. A Google Assistant device will only work with devices that support Google Assistant. The same goes
for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily,
depending on the availability of the cloud connections, or permanently, because of hardware or software deprecation or
other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to
interact with, and does not depend on business decisions.

- **Flexibility**. Even when a device works with your assistant, you’re still bound to the features that have been
agreed upon and implemented by the two parties. Implementing more complex routines over voice commands is usually tricky.
In most cases, it involves creating code that will run on the cloud (either in the form of Actions or Lambdas, or
IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must have the ability
to run whichever logic I want on whichever device I want, using whichever custom shortcut I want (even with regex
matching), regardless of the complexity. I also aimed to build an assistant that can provide multiple services
(Google, Alexa, Siri, etc.) in multiple languages on the same device, simply by using different hotwords.

- **Hardware constraints**. I’ve never understood the case for selling plastic boxes that embed a microphone and a speaker
in order to enter the world of voice services. That was a good way to showcase the idea. After a couple of years of
experiments, it’s probably time to expect the industry to provide a voice assistant experience that can run on any
device, as long as it has a microphone and a controller unit that can run code. As with compatibility, there should
be no such thing as Google-compatible or Alexa-compatible devices. Any device should be compatible with any assistant, as
long as that device has a way to communicate with the outside world. The logic to control that device should be able
to run on the same network that the device belongs to.

- **Cloud vs. local processing**. Most of the commercial voice assistants operate by regularly capturing streams of
audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another
connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In
some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice
assistants are dumb terminals intended to communicate with cloud providers that actually do most of the job, and they
exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets
are low-power devices that operate within a fast network and you don’t need much flexibility. But if you can afford to
process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you
want to do things that you usually can’t do with off-the-shelf solutions, you may want to process as much of the load
as possible on your device. I understand the case for a cloud-oriented approach when it comes to voice assistants but,
regardless of the technology, we should always be provided with a choice between decentralized and centralized
computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or
on-cloud, depending on the use case and depending on the user’s preference.

- **Scalability**. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash a copy
of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it’s done. Without having to
buy a new plastic box. If I need a voice-powered music speaker, I just take an existing speaker and plug it into a
RaspberryPi. If I need a voice-powered display, I just take an existing display and plug it into a RaspberryPi. If I
need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi,
without having to worry about whether it’s supported in my Google Home or Alexa app. Any device should be given the
possibility of becoming a smart device.

## Overview of the voice assistant integrations

A voice assistant usually consists of two components:

- An **audio recorder** that captures frames from an audio input device.
- A **speech engine** that keeps track of the current context.

There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of
specific hotwords (like “Ok Google” or “Alexa”), and speech detectors, which instead do proper speech-to-text
transcription using acoustic and language models. As you can imagine, continuously running full speech detection has a
far higher overhead than just running hotword detection, which only has to compare the captured speech against a
usually short list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice’s Rhino. Instead
of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if
you say *“Can I have a small double-shot espresso with a lot of sugar and some milk”*, they may return something like
`{"type": "espresso", "size": "small", "numberOfShots": 2, "sugar": "a lot", "milk": "some"}`.

In Platypush, I’ve built integrations to provide users with a wide choice when it comes to speech-to-text processors and
engines. Let’s go through some of the available integrations, and evaluate their pros and cons.

## Native Google Assistant library

### Integrations

- [`assistant.google`](https://docs.platypush.tech/en/latest/platypush/plugins/assistant.google.html) plugin (to
programmatically start/stop conversations)
and [`assistant.google`](https://docs.platypush.tech/en/latest/platypush/backend/assistant.google.html) backend
(for continuous hotword detection).

### Configuration

- Create a Google project and download the `credentials.json` file from
the [Google developers console](https://console.cloud.google.com/apis/credentials).

- Install the `google-oauthlib-tool`:

```shell
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
```

- Authenticate to use the `assistant-sdk-prototype` scope:

```shell
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json

google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
    --scope https://www.googleapis.com/auth/gcm \
    --save --headless --client-secrets $CREDENTIALS_FILE
```

- Install Platypush with the HTTP backend and Google Assistant library support:

```shell
[sudo] pip install 'platypush[http,google-assistant-legacy]'
```

- Create or add the lines to `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:

```yaml
backend.http:
    enabled: True

backend.assistant.google:
    enabled: True

assistant.google:
    enabled: True
```

- Start Platypush, say “Ok Google” and enjoy your assistant. On the web panel at `http://your-rpi:8008` you should be
able to see your voice interactions in real-time.

### Features

- *Hotword detection*: **YES** (“Ok Google” or “Hey Google”).
- *Speech detection*: **YES** (once the hotword is detected).
- *Detection runs locally*: **NO** (hotword detection [seems to] run locally, but once it's detected, a channel is opened
with Google servers for the interaction).

### Pros

- It implements most of the features that you’d find in any Google Assistant product. That includes native support for
timers, calendars, customized responses on the basis of your profile and location, native integration with the devices
configured in your Google Home, and so on. For more complex features, you’ll have to write your own custom platypush hooks
on e.g. speech-detected or conversation start/end events (see the sketch after this list).

- Both hotword detection and speech detection are rock solid, as they rely on the Google cloud capabilities.

- Good performance even on older RaspberryPi models (the library isn’t available for the Zero model or other arm6-based
devices though), because most of the processing duties actually happen in the cloud. The audio processing thread takes
around 2–3% of the CPU on a RaspberryPi 4.

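As an example of such a hook, here is a minimal sketch of a `config.yaml` event hook that reacts to a recognized phrase. It follows the same hook syntax used in the PicoVoice example later in this article; the phrase and the `light.hue.on` action are hypothetical placeholders for your own logic:

```yaml
# Minimal sketch: run a custom action when the assistant recognizes a phrase.
# The phrase and the light.hue.on action below are placeholders.
event.hook.OnLightsOnCommand:
    if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        phrase: "turn on the lights"
    then:
        - action: light.hue.on
```
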
### Cons

- The Google Assistant library used as a backend by the integration has
been [deprecated by Google](https://developers.google.com/assistant/sdk/reference/library/python). It still works on
most of the devices I’ve tried, as long as the latest version is used, but keep in mind that it’s no longer maintained
by Google and it could break in the future. Unfortunately, I’m still waiting for an official alternative.

- If your main goal is to operate voice-enabled services within a secure environment with no processing happening on
someone else’s cloud, then this is not your best option. The assistant library makes your computer behave more or less
like a full Google Assistant device, including capturing audio and sending it to Google servers for processing and,
potentially, review.

## Google Assistant Push-To-Talk Integration

### Integrations

- [`assistant.google.pushtotalk`](https://docs.platypush.tech/en/latest/platypush/plugins/assistant.google.pushtotalk.html)
plugin.

### Configuration

- Create a Google project and download the `credentials.json` file from
the [Google developers console](https://console.cloud.google.com/apis/credentials).

- Install the `google-oauthlib-tool`:

```shell
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
```

- Authenticate to use the `assistant-sdk-prototype` scope:

```shell
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json

google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
    --scope https://www.googleapis.com/auth/gcm \
    --save --headless --client-secrets $CREDENTIALS_FILE
```

- Install Platypush with the HTTP backend and Google Assistant SDK support:

```shell
[sudo] pip install 'platypush[http,google-assistant]'
```

- Create or add the lines to `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:

```yaml
backend.http:
    enabled: True

assistant.google.pushtotalk:
    language: en-US
```

- Start Platypush. Unlike the native Google library integration, the push-to-talk plugin doesn’t come with a hotword
detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks,
procedures, or through the HTTP API:

```shell
curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
  "type": "request",
  "action": "assistant.google.pushtotalk.start_conversation"
}' http://your-rpi:8008/execute
```

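If you prefer an event hook over the HTTP API, a minimal sketch could wire one of the hotword integrations covered below to this plugin. The `computer` hotword, and the assumption that the event exposes the detected hotword as a `hotword` attribute, are mine; adapt them to your setup:

```yaml
# Minimal sketch: start a push-to-talk conversation when a hotword engine
# (e.g. the DeepSpeech or PicoVoice integrations below) detects a hotword.
# The "computer" hotword is just an example.
event.hook.OnHotwordStartAssistant:
    if:
        type: platypush.message.event.stt.HotwordDetectedEvent
        hotword: computer
    then:
        - action: assistant.google.pushtotalk.start_conversation
```
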
### Features

- *Hotword detection*: **NO** (call `start_conversation` or `stop_conversation` from your logic or from the context of a
hotword integration like Snowboy, DeepSpeech or PicoVoice to trigger or stop the assistant).

- *Speech detection*: **YES**.

- *Detection runs locally*: **NO** (you can customize the hotword engine and how to trigger the assistant, but once a
conversation is started a channel is opened with Google servers).

### Pros

- It implements many of the features you’d find in any Google Assistant product out there, even though hotword detection
isn’t available and some of the features currently available in the assistant library aren’t provided (like timers or
alarms).

- Rock-solid speech detection, using the same speech model used by Google Assistant products.

- Relatively good performance even on older RaspberryPi models. It’s also available for the arm6 architecture, which makes
it suitable also for RaspberryPi Zero or other low-power devices. No hotword engine running means that it uses
resources only when you call `start_conversation`.

- It provides the benefits of the Google Assistant speech engine with no need to have a 24/7 open connection between
your mic and Google’s servers. The connection is only opened upon `start_conversation`. This makes it a good option if
privacy is a concern, or if you want to build more flexible assistants that can be triggered through different hotword
engines (or even build assistants that are triggered in different languages depending on the hotword that you use), or
assistants that aren’t triggered by a hotword at all — for example, you can call `start_conversation` upon a button press,
motion sensor event or web call, as in the sketch below.

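For the button, sensor or web-call cases, a named procedure keeps the trigger logic in one place. This is a minimal sketch, assuming the standard platypush procedure syntax and the same action-list format used by the hooks in this article; the procedure name is arbitrary:

```yaml
# Minimal sketch: a reusable procedure that starts a push-to-talk conversation.
# Call it from the /execute HTTP endpoint, an event hook or a cron.
procedure.voice_query:
    - action: assistant.google.pushtotalk.start_conversation
```

You would then call it by sending `{"type": "request", "action": "procedure.voice_query"}` to the `/execute` endpoint, just like the request shown above.
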
### Cons

- I built this integration after the Google Assistant library was deprecated with no official
alternative being provided. I’ve built it by refactoring the rough sample code provided by Google
([`pushtotalk.py`](https://github.com/googlesamples/assistant-sdk-python/blob/master/google-assistant-sdk/googlesamples/assistant/grpc/pushtotalk.py))
and making a proper plugin out of it. It works, but keep in mind that it’s based on some ugly code that’s waiting to
be replaced by Google.

- No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.

## Alexa Integration

### Integrations

- [`assistant.echo`](https://docs.platypush.tech/en/latest/platypush/plugins/assistant.echo.html) plugin.

### Configuration

- Install Platypush with the HTTP backend and Alexa support:

```shell
[sudo] pip install 'platypush[http,alexa]'
```

- Run `alexa-auth`. It will start a local web server on your machine at `http://your-rpi:3000`. Open it in your browser
and authenticate with your Amazon account. A credentials file should be generated under `~/.avs.json`.

- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the assistant
integration:

```yaml
backend.http:
    enabled: True

assistant.echo:
    enabled: True
```

- Start Platypush. The Alexa integration doesn’t come with a hotword detection engine. You can initiate or end
conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:

```shell
curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
  "type": "request",
  "action": "assistant.echo.start_conversation"
}' http://your-rpi:8008/execute
```

### Features

- *Hotword detection*: **NO** (call `start_conversation` or `stop_conversation` from your logic or from the context of a
hotword integration like Snowboy or PicoVoice to trigger or stop the assistant).

- *Speech detection*: **YES** (although limited: transcription of the processed audio won’t be provided).

- *Detection runs locally*: **NO**.

### Pros

- It implements many of the features that you’d find in any Alexa product out there, even though hotword detection isn’t
available. Also, the support for skills or media control may be limited.

- Good speech detection capabilities, although inferior to the Google Assistant when it comes to accuracy.

- Good performance even on low-power devices. No hotword engine running means it uses resources only when you call
`start_conversation`.

- It provides some of the benefits of an Alexa device but with no need for a 24/7 open connection between your mic and
Amazon’s servers. The connection is only opened upon `start_conversation`.

### Cons

- The situation is extremely fragmented when it comes to Alexa voice SDKs. Amazon eventually re-released the AVS (Alexa
Voice Service), mostly with commercial uses in mind, but its features are still quite limited compared to the Google
Assistant products. The biggest limitation is the fact that the AVS works on raw audio input and spits back raw audio
responses. This means that text transcription, either for the request or the response, won’t be available. That limits
what you can build with it. For example, you won’t be able to capture custom requests through event hooks.

- No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.

## Snowboy Integration

### Integrations

- [`assistant.snowboy`](https://docs.platypush.tech/en/latest/platypush/backend/assistant.snowboy.html) backend.

### Configuration

- Install Platypush with the HTTP backend and Snowboy support:

```shell
[sudo] pip install 'platypush[http,snowboy]'
```

- Choose your hotword model(s). Some are available under `SNOWBOY_INSTALL_DIR/resources/models`. Otherwise, you can
train or download models from the [Snowboy website](https://snowboy.kitt.ai/).

- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the assistant
integration:

```yaml
backend.http:
    enabled: True

backend.assistant.snowboy:
    audio_gain: 1.2
    models:
        # Trigger the Google assistant in Italian when I say "computer"
        computer:
            voice_model_file: ~/models/computer.umdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: it-IT
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger the Google assistant in English when I say "OK Google"
        ok_google:
            voice_model_file: ~/models/OK Google.pmdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger Alexa when I say "Alexa"
        alexa:
            voice_model_file: ~/models/Alexa.pmdl
            assistant_plugin: assistant.echo
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.5
```

- Start Platypush. Say the hotword associated with one of your models, check in the logs that the
[`HotwordDetectedEvent`](https://docs.platypush.tech/en/latest/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent)
is triggered and, if there’s an assistant plugin associated with the hotword, the corresponding assistant is correctly
started.

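You can also attach your own logic to that event, independently of any assistant plugin. The hook below is a minimal sketch that pauses an MPD/Mopidy player when a hotword is heard so the microphone isn’t drowned out; the `music.mpd.pause` action and the `hotword` filter are assumptions to adapt to your setup:

```yaml
# Minimal sketch: react to a detected hotword with custom logic.
# Replace the action and the hotword with whatever fits your setup.
event.hook.OnSnowboyHotword:
    if:
        type: platypush.message.event.assistant.HotwordDetectedEvent
        hotword: computer
    then:
        - action: music.mpd.pause
```
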
### Features

- *Hotword detection*: **YES**.
- *Speech detection*: **NO**.
- *Detection runs locally*: **YES**.

### Pros

- I've been an early fan and supporter of the Snowboy project. I really like the idea of crowd-powered machine learning.
You can download any hotword models for free from their website, provided that you record three audio samples of
yourself saying that word in order to help improve the model. You can also create your custom hotword model, and if enough
people are interested in using it then they’ll contribute their samples, and the model will become more robust
over time. I believe that more machine learning projects out there could really benefit from this “use it for free as
long as you help improve the model” paradigm.

- Platypush was an early supporter of Snowboy, so its integration is well-supported and extensively documented. You can
natively configure custom assistant plugins to be executed when a certain hotword is detected, making it easy to build
a multi-language and multi-hotword voice assistant.

- Good performance, even on low-power devices. I’ve used Snowboy in combination with the Google Assistant push-to-talk
integration for a while on single-core RaspberryPi Zero devices, and the CPU usage from hotword processing never
exceeded 20–25%.

- The hotword detection runs locally, on models that are downloaded locally. That means no need for a network connection
to run and no data exchanged with any cloud.

### Cons

- Even though the idea of crowd-powered voice models is definitely interesting and has plenty of potential to scale up,
the most popular models on their website have been trained with at most 2000 samples. And (sadly as well as
expectedly) most of those voice samples belong to white, young-adult males, which makes many of these models perform
quite poorly with speech recorded from individuals who don’t fit within that category (and also with people who
aren’t native English speakers).

## Mozilla DeepSpeech

### Integrations

- [`stt.deepspeech`](https://docs.platypush.tech/en/latest/platypush/plugins/stt.deepspeech.html) plugin
and [`stt.deepspeech`](https://docs.platypush.tech/en/latest/platypush/backend/stt.deepspeech.html) backend (for
continuous detection).

### Configuration

- Install Platypush with the HTTP backend and Mozilla DeepSpeech support. Take note of the version of DeepSpeech that
gets installed:

```shell
[sudo] pip install 'platypush[http,deepspeech]'
```

- Download the Tensorflow model files for the version of DeepSpeech that has been installed. This may take a while
depending on your connection:

```shell
export MODELS_DIR=~/models
export DEEPSPEECH_VERSION=0.6.1

wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz

tar xvf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
x deepspeech-0.6.1-models/
x deepspeech-0.6.1-models/lm.binary
x deepspeech-0.6.1-models/output_graph.pbmm
x deepspeech-0.6.1-models/output_graph.pb
x deepspeech-0.6.1-models/trie
x deepspeech-0.6.1-models/output_graph.tflite

mv deepspeech-$DEEPSPEECH_VERSION-models $MODELS_DIR
```

- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the DeepSpeech
integration:

```yaml
backend.http:
    enabled: True

stt.deepspeech:
    model_file: ~/models/output_graph.pbmm
    lm_file: ~/models/lm.binary
    trie_file: ~/models/trie

    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

    conversation_timeout: 5

backend.stt.deepspeech:
    enabled: True
```

- Start Platypush. Speech detection will start running on startup.
[`SpeechDetectedEvents`](https://docs.platypush.tech/en/latest/platypush/events/stt.html#platypush.message.event.stt.SpeechDetectedEvent)
will be triggered when you talk.
[`HotwordDetectedEvents`](https://docs.platypush.tech/en/latest/platypush/events/stt.html#platypush.message.event.stt.HotwordDetectedEvent)
will be triggered when you say one of the configured hotwords.
[`ConversationDetectedEvents`](https://docs.platypush.tech/en/latest/platypush/events/stt.html#platypush.message.event.stt.ConversationDetectedEvent)
will be triggered when you say something after a hotword, with speech provided as an argument. You can also disable the
continuous detection and only start it programmatically by calling `stt.deepspeech.start_detection` and
`stt.deepspeech.stop_detection`. You can also use it to perform offline speech transcription from audio files:

```shell
curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
  "type": "request",
  "action": "stt.deepspeech.detect",
  "args": {
    "audio_file": "~/audio.wav"
  }
}' http://your-rpi:8008/execute

# Example response
{
  "type": "response",
  "target": "http",
  "response": {
    "errors": [],
    "output": {
      "speech": "This is a test"
    }
  }
}
```

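The events above can also drive custom logic directly. Here is a minimal sketch of a hook that reacts to a detected conversation; the matched `speech` value and the `music.mpd.stop` action are placeholders for your own phrases and actions:

```yaml
# Minimal sketch: run an action when DeepSpeech detects a conversation,
# i.e. speech following one of the configured hotwords.
# The matched phrase and the action below are placeholders.
event.hook.OnStopMusicCommand:
    if:
        type: platypush.message.event.stt.ConversationDetectedEvent
        speech: "stop the music"
    then:
        - action: music.mpd.stop
```
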
### Features

- *Hotword detection*: **YES**.
- *Speech detection*: **YES**.
- *Detection runs locally*: **YES**.

### Pros

- I’ve been honestly impressed by the features of DeepSpeech and the progress they’ve made starting from version
0.6.0. Mozilla made it easy to run both hotword and speech detection on-device with no need for any third-party
services or network connection. The full codebase is open-source and the Tensorflow voice and language models are also
very good. It’s amazing that they’ve released the whole thing for free to the community. It also means that you can
easily extend the Tensorflow model by training it with your own samples.

- Speech-to-text transcription of audio files can be a very useful feature.

### Cons

- DeepSpeech is quite demanding when it comes to CPU resources. It will run OK on a laptop or on a RaspberryPi 4 (but in
my tests it took 100% of a core on a RaspberryPi 4 for speech detection). It may be too resource-intensive to run on
less powerful machines.

- DeepSpeech has a bit more delay than other solutions. The engineers at Mozilla have worked a lot to make the model as
small and performant as possible, and they claim to have achieved real-time performance on a RaspberryPi 4. In
reality, all of my tests showed between 2 and 4 seconds of delay between speech capture and detection.

- DeepSpeech is relatively good at detecting speech, but not at interpreting the semantic context (that’s something
where Google still wins hands down). If you say “this is a test,” the model may actually capture “these is a test.”
“This” and “these” do indeed sound almost the same in English, but the Google assistant has a better semantic engine
to detect the right interpretation of such ambiguous cases. DeepSpeech works quite well for speech-to-text
transcription purposes but, in such ambiguous cases, it lacks some semantic context.

- Even though it’s possible to use DeepSpeech from Platypush as a hotword detection engine, keep in mind that it’s not
how the engine is intended to be used. Hotword engines usually run against smaller and more performant models only
intended to detect one or few words, not against a full-featured language model. The best usage of DeepSpeech is
probably either for offline text transcription, or in combination with another hotword integration, leveraging
DeepSpeech for the speech detection part.

## PicoVoice

[PicoVoice](https://github.com/Picovoice/) is a very promising company that has released several products for performing
voice detection on-device. Among them:

- [*Porcupine*](https://github.com/Picovoice/porcupine), a hotword engine.
- [*Leopard*](https://github.com/Picovoice/leopard), a speech-to-text offline transcription engine.
- [*Cheetah*](https://github.com/Picovoice/cheetah), a speech-to-text engine for real-time applications.
- [*Rhino*](https://github.com/Picovoice/rhino), a speech-to-intent engine.

So far, Platypush provides integrations with Porcupine and Cheetah.

### Integrations

- *Hotword engine*:
[`stt.picovoice.hotword`](https://docs.platypush.tech/en/latest/platypush/plugins/stt.picovoice.hotword.html)
plugin and
[`stt.picovoice.hotword`](https://docs.platypush.tech/en/latest/platypush/backend/stt.picovoice.hotword.html)
backend (for continuous detection).

- *Speech engine*:
[`stt.picovoice.speech`](https://docs.platypush.tech/en/latest/platypush/plugins/stt.picovoice.speech.html)
plugin and
[`stt.picovoice.speech`](https://docs.platypush.tech/en/latest/platypush/backend/stt.picovoice.speech.html)
backend (for continuous detection).

### Configuration

- Install Platypush with the HTTP backend and the PicoVoice hotword integration and/or speech integration:

```shell
[sudo] pip install 'platypush[http,picovoice-hotword,picovoice-speech]'
```

- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the PicoVoice
integrations:

```yaml
stt.picovoice.hotword:
    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

# Enable continuous hotword detection
backend.stt.picovoice.hotword:
    enabled: True

# Enable continuous speech detection
# backend.stt.picovoice.speech:
#     enabled: True

# Or start speech detection when a hotword is detected
event.hook.OnHotwordDetected:
    if:
        type: platypush.message.event.stt.HotwordDetectedEvent
    then:
        # Start a timer that stops the detection in 10 seconds
        - action: utils.set_timeout
          args:
              seconds: 10
              name: StopSpeechDetection
              actions:
                  - action: stt.picovoice.speech.stop_detection

        - action: stt.picovoice.speech.start_detection
```

- Start Platypush and enjoy your on-device voice assistant.

### Features

- *Hotword detection*: **YES**.
- *Speech detection*: **YES**.
- *Detection runs locally*: **YES**.

### Pros

- When it comes to on-device voice engines, PicoVoice products are probably the best solution out there. Their hotword
engine is far more accurate than Snowboy and it manages to be even less CPU-intensive. Their speech engine has much
less delay than DeepSpeech and it’s also much less power-hungry — it will still run well and with low latency even on
older models of RaspberryPi.

### Cons

- While PicoVoice provides Python SDKs, their native libraries are closed source. It means that I couldn’t dig much into
how they’ve solved the problem.

- Their hotword engine (Porcupine) can be installed and run free of charge for personal use on any device, but if you
want to expand the set of keywords provided by default, or add more samples to train the existing models, then you’ll
have to go for a commercial license. Their speech engine (Cheetah) instead can only be installed and run free of
charge for personal use on Linux on the x86_64 architecture. Any other architecture or operating system, as well as any
option to extend the model or use a different one, requires a commercial license. While I understand
their point and their business model, I’d have been super-happy to just pay for a license through a more friendly
process, instead of relying on the old-fashioned “contact us for a commercial license/we’ll get back to you”
paradigm.

- Cheetah’s speech engine still suffers from some of the issues of DeepSpeech when it comes to semantic context/intent
detection. The “this/these” ambiguity also happens here. However, these problems can be partially solved by using
Rhino, PicoVoice’s speech-to-intent engine, which will provide a structured representation of the speech intent
instead of a letter-by-letter transcription. That said, I haven’t yet worked on integrating Rhino into platypush.

## Conclusions

The democratization of voice technology has long been dreamed about, and it’s finally (slowly) coming. The situation out
there is still quite fragmented, though, and some commercial SDKs may still get deprecated with short notice or no notice
at all. But at least some solutions are emerging to bring speech detection to all devices.

I’ve built integrations in Platypush for all of these services because I believe that it’s up to users, not to
businesses, to decide how people should use and benefit from voice technology. Moreover, having so many voice
integrations in the same product — and especially having voice integrations that expose all the same API and generate
the same events — makes it very easy to write assistant-agnostic logic, and really decouple the task of speech
recognition from the business logic that can be run by voice commands.

Check out
[my previous article](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush) to
learn how to write your own custom hooks in Platypush on speech detection, hotword detection and speech start/stop
events.

To summarize my findings so far:

- Use the native **Google Assistant** integration if you want to have a full Google experience, and if you’re ok with
Google servers processing your audio and the possibility that at some point in the future the deprecated Google Assistant
library won’t work anymore.

- Use the **Google push-to-talk** integration if you only want to have the assistant, without hotword detection, or you
want your assistant to be triggered by alternative hotwords.

- Use the **Alexa** integration if you already have an Amazon-powered ecosystem and you’re ok with having less
flexibility when it comes to custom hooks because of the unavailability of speech transcript features in the AVS.

- Use **Snowboy** if you want to use a flexible, open-source and crowd-powered engine for hotword detection that runs
on-device and/or use multiple assistants at the same time through different hotword models, even if the models may not
be that accurate.

- Use **Mozilla DeepSpeech** if you want a fully on-device open-source engine powered by a robust Tensorflow model, even
if it comes with a higher CPU load and a bit more latency.

- Use **PicoVoice** solutions if you want a full voice solution that runs on-device and is both accurate and
performant, even though you’ll need a commercial license to use it on some devices or to extend/change the model.

Let me know your thoughts on these solutions and your experience with these integrations!