[//]: # (title: Build custom voice assistants)
[//]: # (description: An overview of the current technologies and how to leverage Platypush to build your customized assistant.)
[//]: # (image: /img/voice-1.jpg)
[//]: # (author: Fabio Manganiello <fabio@platypush.tech>)
[//]: # (published: 2020-03-08)
I wrote [an article](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush) a while
ago that describes how to make your own Google-based voice assistant using just a RaspberryPi, Platypush, a speaker and
a microphone.
It also showed how to make your own custom hotword model that triggers the assistant if you don't want to say “Ok
Google”, or if you want distinct hotwords to trigger different assistants in different languages, and how to
hook your own custom logic and scripts when certain phrases are recognized, without writing any code.
Since I wrote that article, a few things have changed:
- When I wrote the article, Platypush only supported the Google Assistant as a voice back end. In the meantime, I've
worked on [supporting Alexa as well](https://github.com/BlackLight/platypush/issues/80). Feel free to use the
`assistant.echo` integration in Platypush if you're an Alexa fan, but bear in mind that it's more limited than the
existing Google Assistant based options — there are limitations in the AVS (Amazon Voice Service). For example, it
won't provide the transcript of the detected text, which means it's not possible to insert custom hooks, nor to get
the transcript of the rendered response, because the AVS mostly works with audio files as input and provides audio as
output. It may also experience some minor audio glitches, at least on RaspberryPi.
- Although deprecated, a new release of the Google Assistant
Library [has been made available](https://github.com/googlesamples/assistant-sdk-python/releases/tag/0.6.0) to fix the
segmentation fault issue on RaspberryPi 4. I've buzzed the developers often over the past year and I'm glad that it's
been done! It's good news because the Assistant library has the best engine for hotword detection I've seen. No other
SDK I've tried — Snowboy, DeepSpeech, or PicoVoice — comes close to the native “Ok Google” hotword detection accuracy
and performance. The news isn't all good, however: the library is still deprecated, and no alternative is currently
on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But
at least one of the best options out there to build a voice assistant will still work for a while. Those interested in
building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.
- In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more
state-of-the-art alternatives. I've been a long-time fan of [Snowboy](https://snowboy.kitt.ai/), which has a
well-supported Platypush integration, and I've used it as a hotword engine to trigger other assistant integrations for
a long time. However, when it comes to accuracy in real-time scenarios, even its best models aren't that satisfactory.
I've also experimented with
[Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech) and [PicoVoice](https://github.com/Picovoice) products
for voice detection, and built integrations for them in Platypush. In this article, I'll try to provide a comprehensive
overview of what's currently possible with DIY voice assistants and a comparison of the integrations I've built.
- **EDIT January 2021**: Unfortunately, as of Dec 31st,
2020 [Snowboy has been officially shut down](https://github.com/Kitt-AI/snowboy/). The GitHub repository is still
there: you can still clone it and either use the example models provided under `resources/models`, train a model
using the Python API, or use any of your previously trained models. However, the repo is no longer maintained, and the
website that could be used to browse and generate user models is no longer available. It's really a shame - the user
models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained
open-source project, and it just shows how difficult it is to keep such projects alive without anybody funding the
time the developers invest in them. Anyway, most of the Snowboy examples reported in this article will still work
if you download and install the code from the repo.
## The Case for DIY Voice Assistants
Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:
- **Privacy**. The easiest one to guess! I'm not sure that a microphone in the house, active 24/7 and connected to a private
company through the internet, is a proportionate price to pay for between five and ten interactions a day to toggle the
lightbulbs, turn on the thermostat, or play a Spotify playlist. I've built the voice assistant integrations in
Platypush with the goal of giving people the option of voice-enabled services without sending all of their daily voice
interactions over a privately-owned channel through a privately-owned box.
- **Compatibility**. A Google Assistant device will only work with devices that support Google Assistant. The same goes
for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily,
depending on the availability of the cloud connections, or permanently, because of hardware or software deprecation or
other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to
interact with, and does not depend on business decisions.
- **Flexibility**. Even when a device works with your assistant, you're still bound to the features that have been
agreed upon and implemented by the two parties. Implementing more complex routines over voice commands is usually tricky.
In most cases, it involves creating code that will run in the cloud (either in the form of Actions or Lambdas, or
IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must have the ability
to run whichever logic I want on whichever device I want, using whichever custom shortcut I want (even with regex
matching), regardless of the complexity. I also aimed to build an assistant that can provide multiple services
(Google, Alexa, Siri, etc.) in multiple languages on the same device, simply by using different hotwords.
- **Hardware constraints**. I've never understood the case for selling plastic boxes that embed a microphone and a speaker
in order to enter the world of voice services. That was a good way to showcase the idea. After a couple of years of
experiments, it's probably time to expect the industry to provide a voice assistant experience that can run on any
device, as long as it has a microphone and a controller unit that can process code. As with compatibility, there should
be no case for Google-compatible or Alexa-compatible devices. Any device should be compatible with any assistant, as
long as that device has a way to communicate with the outside world. The logic to control that device should be able
to run on the same network that the device belongs to.
- **Cloud vs. local processing**. Most of the commercial voice assistants operate by regularly capturing streams of
audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another
connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In
some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice
assistants are dumb terminals intended to communicate with cloud providers that actually do most of the job, and they
exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets
are low-power devices that operate within a fast network and you don't need much flexibility. But if you can afford to
process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you
want to do things that you usually can't do with off-the-shelf solutions, you may want to process as much of the load
as possible on your device. I understand the case for a cloud-oriented approach when it comes to voice assistants but,
regardless of the technology, we should always be provided with a choice between decentralized and centralized
computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or
in the cloud, depending on the use case and the user's preference.
- **Scalability**. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash the copy
of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it's done, without having to
buy a new plastic box. If I need a voice-powered music speaker, I just take an existing speaker and plug it into a
RaspberryPi. If I need a voice-powered display, I just take an existing display and plug it into a RaspberryPi. If I
need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi,
without having to worry about whether it's supported in my Google Home or Alexa app. Any device should be given the
possibility of becoming a smart device.
## Overview of the voice assistant integrations
A voice assistant usually consists of two components:
- An **audio recorder** that captures frames from an audio input device
- A **speech engine** that keeps track of the current context.
There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of
specific hotwords (like “Ok Google” or “Alexa”), and speech detectors, which instead do proper speech-to-text
transcription using acoustic and language models. As you can imagine, continuously running full speech detection has a
far higher overhead than just running hotword detection, which only has to compare the captured speech against a
usually short list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice's Rhino. Instead
of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if
you say *“Can I have a small double-shot espresso with a lot of sugar and some milk”*, they may return something like
`{"type": "espresso", "size": "small", "numberOfShots": 2, "sugar": "a lot", "milk": "some"}`.
In Platypush, I've built integrations to provide users with a wide choice when it comes to speech-to-text processors and
engines. Let's go through some of the available integrations, and evaluate their pros and cons.
## Native Google Assistant library
### Integrations
- [`assistant.google`](https://docs.platypush.tech/en/latest/platypush/plugins/assistant.google.html) plugin (to
programmatically start/stop conversations)
and [`assistant.google`](https://docs.platypush.tech/en/latest/platypush/backend/assistant.google.html) backend
(for continuous hotword detection).
### Configuration
- Create a Google project and download the `credentials.json` file from
the [Google developers console](https://console.cloud.google.com/apis/credentials).
- Install the `google-oauthlib-tool`:
```shell
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
```
- Authenticate to use the `assistant-sdk-prototype` scope:
```shell
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
    --scope https://www.googleapis.com/auth/gcm \
    --save --headless --client-secrets $CREDENTIALS_FILE
```
- Install Platypush with the HTTP backend and Google Assistant library support:
```shell
[sudo] pip install 'platypush[http,google-assistant-legacy]'
```
- Create or add the lines to `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:
```yaml
backend.http:
    enabled: True
backend.assistant.google:
    enabled: True
assistant.google:
    enabled: True
```
- Start Platypush, say “Ok Google” and enjoy your assistant. From the web panel at `http://your-rpi:8008` you should be
able to see your voice interactions in real-time.
### Features
- *Hotword detection*: **YES** (“Ok Google” or “Hey Google”).
- *Speech detection*: **YES** (once the hotword is detected).
- *Detection runs locally*: **NO** (hotword detection [seems to] run locally, but once it's detected a channel is opened
with Google servers for the interaction).
### Pros
- It implements most of the features that you'd find in any Google Assistant product. That includes native support for
timers, calendars, customized responses on the basis of your profile and location, native integration with the devices
configured in your Google Home, and so on. For more complex features, you'll have to write your custom Platypush hooks
on e.g. speech detected or conversation start/end events (see the sketch after this list).
- Both hotword detection and speech detection are rock solid, as they rely on Google's cloud capabilities.
- Good performance even on older RaspberryPi models (the library isn't available for the Zero model or other ARMv6-based
devices though), because most of the processing duties actually happen in the cloud. The audio processing thread takes
around 23% of the CPU on a RaspberryPi 4.
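Such a hook can be defined directly in `config.yaml`, following the same `event.hook` syntax shown in the PicoVoice
example later in this article. A minimal sketch, assuming that the recognized phrase is exposed through the `phrase`
argument and that you also have e.g. the `light.hue` plugin configured (both the hook name and the phrase are
arbitrary):
```yaml
event.hook.TurnOnLightsOnVoiceCommand:
    if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        # Assumption: the recognized text is reported in the `phrase` argument
        phrase: "turn on the lights"
    then:
        # Replace with whatever actions you want to run when the phrase is recognized
        - action: light.hue.on
```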
### Cons
- The Google Assistant library used as a backend by the integration has
been [deprecated by Google](https://developers.google.com/assistant/sdk/reference/library/python). It still works on
most of the devices I've tried, as long as the latest version is used, but keep in mind that it's no longer maintained
by Google and it could break in the future. Unfortunately, I'm still waiting for an official alternative.
- If your main goal is to operate voice-enabled services within a secure environment with no processing happening on
someone else's cloud, then this is not your best option. The assistant library makes your computer behave more or less
like a full Google Assistant device, including capturing audio and sending it to Google servers for processing and,
potentially, review.
## Google Assistant Push-To-Talk Integration
### Integrations
- [`assistant.google.pushtotalk`](https://docs.platypush.tech/en/latest/platypush/plugins/assistant.google.pushtotalk.html)
plugin.
### Configuration
- Create a Google project and download the `credentials.json` file from
the [Google developers console](https://console.cloud.google.com/apis/credentials).
- Install the `google-oauthlib-tool`:
```shell
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
```
- Authenticate to use the `assistant-sdk-prototype` scope:
```shell
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
    --scope https://www.googleapis.com/auth/gcm \
    --save --headless --client-secrets $CREDENTIALS_FILE
```
- Install Platypush with the HTTP backend and Google Assistant SDK support:
```shell
[sudo] pip install 'platypush[http,google-assistant]'
```
- Create or add the lines to `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:
```yaml
backend.http:
    enabled: True
assistant.google.pushtotalk:
    language: en-US
```
- Start Platypush. Unlike the native Google library integration, the push-to-talk plugin doesn't come with a hotword
detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks,
procedures, or through the HTTP API:
```shell
curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
  "type": "request",
  "action": "assistant.google.pushtotalk.start_conversation"
}' http://your-rpi:8008/execute
```
### Features
- *Hotword detection*: **NO** (call `start_conversation` or `stop_conversation` from your logic or from the context of a
hotword integration like Snowboy, DeepSpeech or PicoVoice to trigger or stop the assistant).
- *Speech detection*: **YES**.
- *Detection runs locally*: **NO** (you can customize the hotword engine and how to trigger the assistant, but once a
conversation is started a channel is opened with Google servers).
### Pros
- It implements many of the features you'd find in any Google Assistant product out there, even though hotword detection
isn't available and some of the features currently available on the assistant library aren't provided (like timers or
alarms).
- Rock-solid speech detection, using the same speech model used by Google Assistant products.
- Relatively good performance even on older RaspberryPi models. It's also available for the ARMv6 architecture, which makes
it suitable also for the RaspberryPi Zero or other low-power devices. With no hotword engine running, it uses
resources only when you call `start_conversation`.
- It provides the benefits of the Google Assistant speech engine with no need to have a 24/7 open connection between
your mic and Google's servers. The connection is only opened upon `start_conversation`. This makes it a good option if
privacy is a concern, or if you want to build more flexible assistants that can be triggered through different hotword
engines (or even build assistants that are triggered in different languages depending on the hotword that you use), or
assistants that aren't triggered by a hotword at all — for example, you can call `start_conversation` upon a button
press, a motion sensor event or a web call, as sketched right after this list.
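As a minimal sketch of the button-press case, the hook below starts a conversation whenever a button event is received.
The event type is only illustrative (it assumes a Flic button integration is configured); replace it with whatever
event your own button or sensor integration fires:
```yaml
event.hook.StartAssistantOnButtonPress:
    if:
        # Illustrative event type: use the event fired by your button/sensor integration
        type: platypush.message.event.button.flic.FlicButtonEvent
    then:
        - action: assistant.google.pushtotalk.start_conversation
```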
### Cons
- I've built this integration after the Google Assistant library was deprecated with no official
alternative being provided, by refactoring the poorly refined code provided by Google in its samples
([`pushtotalk.py`](https://github.com/googlesamples/assistant-sdk-python/blob/master/google-assistant-sdk/googlesamples/assistant/grpc/pushtotalk.py))
into a proper plugin. It works, but keep in mind that it's based on some ugly code that's waiting to
be replaced by Google.
- No hotword support. You'll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.
## Alexa Integration
### Integrations
- [`assistant.echo`](https://docs.platypush.tech/en/latest/platypush/plugins/assistant.echo.html) plugin.
### Configuration
- Install Platypush with the HTTP backend and Alexa support:
```shell
[sudo] pip install 'platypush[http,alexa]'
```
- Run `alexa-auth`. It will start a local web server on your machine on `http://your-rpi:3000`. Open it in your browser
and authenticate with your Amazon account. A credentials file should be generated under `~/.avs.json`.
- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the assistant
integration:
```yaml
backend.http:
    enabled: True
assistant.echo:
    enabled: True
```
- Start Platypush. The Alexa integration doesn't come with a hotword detection engine. You can initiate or end
conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:
```shell
curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
  "type": "request",
  "action": "assistant.echo.start_conversation"
}' http://your-rpi:8008/execute
```
### Features
- *Hotword detection*: **NO** (call `start_conversation` or `stop_conversation` from your logic or from the context of a
hotword integration like Snowboy or PicoVoice to trigger or stop the assistant).
- *Speech detection*: **YES** (although limited: a transcript of the processed audio won't be provided).
- *Detection runs locally*: **NO**.
### Pros
- It implements many of the features that you'd find in any Alexa product out there, even though hotword detection isn't
available. Also, the support for skills or media control may be limited.
- Good speech detection capabilities, although inferior to the Google Assistant when it comes to accuracy.
- Good performance even on low-power devices. With no hotword engine running, it uses resources only when you call
`start_conversation`.
- It provides some of the benefits of an Alexa device but with no need for a 24/7 open connection between your mic and
Amazon's servers. The connection is only opened upon `start_conversation`.
### Cons
- The situation is extremely fragmented when it comes to Alexa voice SDKs. Amazon eventually re-released the AVS (Alexa
Voice Service), mostly with commercial uses in mind, but its features are still quite limited compared to the Google
Assistant products. The biggest limitation is the fact that the AVS works on raw audio input and spits back raw audio
responses. It means that text transcription, either of the request or of the response, won't be available. That limits
what you can build with it. For example, you won't be able to capture custom requests through event hooks.
- No hotword support. You'll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.
## Snowboy Integration
### Integrations
- [`assistant.snowboy`](https://docs.platypush.tech/en/latest/platypush/backend/assistant.snowboy.html) backend.
### Configuration
- Install Platypush with the HTTP backend and Snowboy support:
```shell
[sudo] pip install 'platypush[http,snowboy]'
```
- Choose your hotword model(s). Some are available under `SNOWBOY_INSTALL_DIR/resources/models`. Otherwise, you can
train or download models from the [Snowboy website](https://snowboy.kitt.ai/).
- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the assistant
integration:
```yaml
backend.http:
    enabled: True
backend.assistant.snowboy:
    audio_gain: 1.2
    models:
        # Trigger the Google assistant in Italian when I say "computer"
        computer:
            voice_model_file: ~/models/computer.umdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: it-IT
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4
        # Trigger the Google assistant in English when I say "OK Google"
        ok_google:
            voice_model_file: ~/models/OK Google.pmdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4
        # Trigger Alexa when I say "Alexa"
        alexa:
            voice_model_file: ~/models/Alexa.pmdl
            assistant_plugin: assistant.echo
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.5
```
- Start Platypush. Say the hotword associated with one of your models, check on the logs that the
[`HotwordDetectedEvent`](https://docs.platypush.tech/en/latest/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent)
is triggered and, if there's an assistant plugin associated with the hotword, that the corresponding assistant is correctly
started. You can also attach your own logic to a hotword through an event hook, as sketched right after this list.
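Here is a minimal sketch of such a hook, running a custom action besides (or instead of) starting an assistant when a
specific model fires. The hook name and the action are arbitrary; it assumes that the detected model name is exposed
through the `hotword` argument and that you have e.g. the `light.hue` plugin configured:
```yaml
event.hook.OnComputerHotword:
    if:
        type: platypush.message.event.assistant.HotwordDetectedEvent
        # Assumption: the detected model name is reported in the `hotword` argument
        hotword: computer
    then:
        # Replace with whatever actions you want to run on this hotword
        - action: light.hue.toggle
```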
### Features
- *Hotword detection*: **YES**.
- *Speech detection*: **NO**.
- *Detection runs locally*: **YES**.
### Pros
- I've been an early fan and supporter of the Snowboy project. I really like the idea of crowd-powered machine learning.
You can download any hotword models for free from their website, provided that you record three audio samples of you
saying that word in order to help improve the model. You can also create your custom hotword model, and if enough
people are interested in using it then they'll contribute their samples, and the model will become more robust
over time. I believe that more machine learning projects out there could really benefit from this “use it for free as
long as you help improve the model” paradigm.
- Platypush was an early supporter of Snowboy, so its integration is well-supported and extensively documented. You can
natively configure custom assistant plugins to be executed when a certain hotword is detected, making it easy to make
a multi-language and multi-hotword voice assistant.
- Good performance, even on low-power devices. I've used Snowboy in combination with the Google Assistant push-to-talk
integration for a while on single-core RaspberryPi Zero devices, and the CPU usage from hotword processing never
exceeded 20-25%.
- The hotword detection runs locally, on models that are downloaded locally. That means no need for a network connection
to run and no data exchanged with any cloud.
### Cons
- Even though the idea of crowd-powered voice models is definitely interesting and has plenty of potential to scale up,
the most popular models on their website have been trained with at most 2000 samples. And (sadly as well as
expectedly) most of those voice samples belong to white, young-adult males, which makes many of these models perform
quite poorly with speech recorded from individuals who don't fit within that category (and also with people who
aren't native English speakers).
## Mozilla DeepSpeech
### Integrations
- [`stt.deepspeech`](https://docs.platypush.tech/en/latest/platypush/plugins/stt.deepspeech.html) plugin
and [`stt.deepspeech`](https://docs.platypush.tech/en/latest/platypush/backend/stt.deepspeech.html) backend (for
continuous detection).
### Configuration
- Install Platypush with the HTTP backend and Mozilla DeepSpeech support. Take note of the version of DeepSpeech that
gets installed:
```shell
[sudo] pip install 'platypush[http,deepspeech]'
```
- Download the Tensorflow model files for the version of DeepSpeech that has been installed. This may take a while
depending on your connection:
```shell
export MODELS_DIR=~/models
export DEEPSPEECH_VERSION=0.6.1
wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
tar xvf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
# x deepspeech-0.6.1-models/
# x deepspeech-0.6.1-models/lm.binary
# x deepspeech-0.6.1-models/output_graph.pbmm
# x deepspeech-0.6.1-models/output_graph.pb
# x deepspeech-0.6.1-models/trie
# x deepspeech-0.6.1-models/output_graph.tflite
mv deepspeech-$DEEPSPEECH_VERSION-models $MODELS_DIR
```
- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the DeepSpeech
integration:
```yaml
backend.http:
    enabled: True
stt.deepspeech:
    model_file: ~/models/output_graph.pbmm
    lm_file: ~/models/lm.binary
    trie_file: ~/models/trie
    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello
    conversation_timeout: 5
backend.stt.deepspeech:
    enabled: True
```
- Start Platypush. Speech detection will start running on startup.
[`SpeechDetectedEvents`](https://docs.platypush.tech/en/latest/platypush/events/stt.html#platypush.message.event.stt.SpeechDetectedEvent)
will be triggered when you talk.
[`HotwordDetectedEvents`](https://docs.platypush.tech/en/latest/platypush/events/stt.html#platypush.message.event.stt.HotwordDetectedEvent)
will be triggered when you say one of the configured hotwords.
[`ConversationDetectedEvents`](https://docs.platypush.tech/en/latest/platypush/events/stt.html#platypush.message.event.stt.ConversationDetectedEvent)
will be triggered when you say something after a hotword, with the recognized speech provided as an argument (a sketch
of a hook on such events is shown right after the snippet below). You can also disable the
continuous detection and only start it programmatically by calling `stt.deepspeech.start_detection` and
`stt.deepspeech.stop_detection`. You can also use it to perform offline speech transcription from audio files:
```shell
curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
  "type": "request",
  "action": "stt.deepspeech.detect",
  "args": {
    "audio_file": "~/audio.wav"
  }
}' http://your-rpi:8008/execute

# Example response
{
  "type": "response",
  "target": "http",
  "response": {
    "errors": [],
    "output": {
      "speech": "This is a test"
    }
  }
}
```
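To tie the recognized phrases to your own logic, you can hook into these events just like with the other integrations.
A minimal sketch, assuming that the recognized text is exposed through the `speech` argument and that you have e.g. the
`music.mpd` plugin configured:
```yaml
event.hook.PauseMusicOnVoiceCommand:
    if:
        type: platypush.message.event.stt.ConversationDetectedEvent
        # Assumption: the recognized text is reported in the `speech` argument
        speech: "pause the music"
    then:
        # Replace with whatever actions you want to run on this phrase
        - action: music.mpd.pause
```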
### Features
- *Hotword detection*: **YES**.
- *Speech detection*: **YES**.
- *Detection runs locally*: **YES**.
### Pros
- I've been honestly impressed by the features of DeepSpeech and the progress they've made starting from version
0.6.0. Mozilla made it easy to run both hotword and speech detection on-device with no need for any third-party
services or network connection. The full codebase is open-source and the Tensorflow voice and language models are also
very good. It's amazing that they've released the whole thing for free to the community. It also means that you can
easily extend the Tensorflow model by training it with your own samples.
- Speech-to-text transcription of audio files can be a very useful feature.
### Cons
- DeepSpeech is quite demanding when it comes to CPU resources. It will run OK on a laptop or on a RaspberryPi 4 (although in
my tests it took 100% of a core on the RaspberryPi 4 for speech detection). It may be too resource-intensive to run on
less powerful machines.
- DeepSpeech has a bit more delay than other solutions. The engineers at Mozilla have worked a lot to make the model as
small and performant as possible, and they claim to have achieved real-time performance on a RaspberryPi 4. In
reality, all of my tests showed between 2 and 4 seconds of delay between speech capture and detection.
- DeepSpeech is relatively good at detecting speech, but not at interpreting the semantic context (that's something
where Google still wins hands down). If you say “this is a test,” the model may actually capture “these is a test.”
“This” and “these” do indeed sound almost the same in English, but the Google assistant has a better semantic engine
to detect the right interpretation of such ambiguous cases. DeepSpeech works quite well for speech-to-text
transcription purposes but, in such ambiguous cases, it lacks some semantic context.
- Even though it's possible to use DeepSpeech from Platypush as a hotword detection engine, keep in mind that it's not
how the engine is intended to be used. Hotword engines usually run against smaller and more performant models only
intended to detect one or a few words, not against a full-featured language model. The best usage of DeepSpeech is
probably either for offline text transcription, or alongside another hotword integration, leveraging DeepSpeech for the
speech detection part.
## PicoVoice
[PicoVoice](https://github.com/Picovoice/) is a very promising company that has released several products for performing
voice detection on-device. Among them:
- [*Porcupine*](https://github.com/Picovoice/porcupine), a hotword engine.
- [*Leopard*](https://github.com/Picovoice/leopard), a speech-to-text offline transcription engine.
- [*Cheetah*](https://github.com/Picovoice/cheetah), a speech-to-text engine for real-time applications.
- [*Rhino*](https://github.com/Picovoice/rhino), a speech-to-intent engine.
So far, Platypush provides integrations with Porcupine and Cheetah.
### Integrations
- *Hotword engine*:
[`stt.picovoice.hotword`](https://docs.platypush.tech/en/latest/platypush/plugins/stt.picovoice.hotword.html)
plugin and
[`stt.picovoice.hotword`](https://docs.platypush.tech/en/latest/platypush/backend/stt.picovoice.hotword.html)
backend (for continuous detection).
- *Speech engine*:
[`stt.picovoice.speech`](https://docs.platypush.tech/en/latest/platypush/plugins/stt.picovoice.speech.html)
plugin and
[`stt.picovoice.speech`](https://docs.platypush.tech/en/latest/platypush/backend/stt.picovoice.speech.html)
backend (for continuous detection).
### Configuration
- Install Platypush with the HTTP backend and the PicoVoice hotword integration and/or speech integration:
```shell
[sudo] pip install 'platypush[http,picovoice-hotword,picovoice-speech]'
```
- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the PicoVoice
integration:
```yaml
stt.picovoice.hotword:
    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

# Enable continuous hotword detection
backend.stt.picovoice.hotword:
    enabled: True

# Enable continuous speech detection
# backend.stt.picovoice.speech:
#     enabled: True

# Or start speech detection when a hotword is detected
event.hook.OnHotwordDetected:
    if:
        type: platypush.message.event.stt.HotwordDetectedEvent
    then:
        # Start a timer that stops the detection in 10 seconds
        - action: utils.set_timeout
          args:
            seconds: 10
            name: StopSpeechDetection
            actions:
                - action: stt.picovoice.speech.stop_detection
        - action: stt.picovoice.speech.start_detection
```
- Start Platypush and enjoy your on-device voice assistant.
### Features
- *Hotword detection*: **YES**.
- *Speech detection*: **YES**.
- *Detection runs locally*: **YES**.
### Pros
- When it comes to on-device voice engines, PicoVoice products are probably the best solution out there. Their hotword
engine is far more accurate than Snowboy and it manages to be even less CPU-intensive. Their speech engine has much
less delay than DeepSpeech and it's also much less power-hungry — it will still run well and with low latency even on
older models of RaspberryPi.
### Cons
- While PicoVoice provides Python SDKs, their native libraries are closed source. It means that I couldn't dig much into
how they've solved the problem.
- Their hotword engine (Porcupine) can be installed and run free of charge for personal use on any device, but if you
want to expand the set of keywords provided by default, or add more samples to train the existing models, then you'll
have to go for a commercial license. Their speech engine (Cheetah), instead, can only be installed and run free of
charge for personal use on Linux on the x86_64 architecture. Any other architecture or operating system, as well as any
chance to extend the model or use a different model, is only possible through a commercial license. While I understand
their point and their business model, I'd have been super-happy to just pay for a license through a more friendly
process, instead of relying on the old-fashioned “contact us for a commercial license/we'll reach back to you”
paradigm.
- Cheetah's speech engine still suffers from some of the issues of DeepSpeech when it comes to semantic context/intent
detection. The “this/these” ambiguity also happens here. However, these problems can be partially solved by using
Rhino, PicoVoice's speech-to-intent engine, which will provide a structured representation of the speech intent
instead of a letter-by-letter transcription. However, I haven't yet worked on integrating Rhino into Platypush.
## Conclusions
The democratization of voice technology has long been dreamed about, and it's finally (slowly) coming. The situation out
there is still quite fragmented though, and some commercial SDKs may still get deprecated with short notice or no notice
at all. But at least some solutions are emerging to bring speech detection to all devices.
I've built integrations in Platypush for all of these services because I believe that it's up to users, not to
businesses, to decide how people should use and benefit from voice technology. Moreover, having so many voice
integrations in the same product — and especially having voice integrations that all expose the same API and generate
the same events — makes it very easy to write assistant-agnostic logic, and really decouple the tasks of speech
recognition from the business logic that can be run by voice commands.
Check out
[my previous article](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush) to
learn how to write your own custom hooks in Platypush on speech detection, hotword detection and speech start/stop
events.
To summarize my findings so far:
- Use the native **Google Assistant** integration if you want to have a full Google experience, and if you're OK with
Google servers processing your audio and with the possibility that at some point in the future the deprecated Google
Assistant library won't work anymore.
- Use the **Google push-to-talk** integration if you only want to have the assistant, without hotword detection, or you
want your assistant to be triggered by alternative hotwords.
- Use the **Alexa** integration if you already have an Amazon-powered ecosystem and you're OK with having less
flexibility when it comes to custom hooks because of the unavailability of speech transcript features in the AVS.
- Use **Snowboy** if you want to use a flexible, open-source and crowd-powered engine for hotword detection that runs
on-device and/or use multiple assistants at the same time through different hotword models, even if the models may not
be that accurate.
- Use **Mozilla DeepSpeech** if you want a fully on-device open-source engine powered by a robust Tensorflow model, even
if it takes more CPU load and a bit more latency.
- Use **PicoVoice** solutions if you want a full voice solution that runs on-device and is both accurate and
performant, even though you'll need a commercial license to use it on some devices or to extend/change the model.
Let me know your thoughts on these solutions and your experience with these integrations!