Migrated 9th article
This commit is contained in:
parent
b846b928c6
commit
2c7ce5e5c9
2 changed files with 742 additions and 0 deletions
BIN
static/img/voice-1.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 47 KiB |
742
static/pages/Build-custom-voice-assistants.md
Normal file
|
@ -0,0 +1,742 @@
|
|||
[//]: # (title: Build custom voice assistants)
|
||||
[//]: # (description: An overview of the current technologies and how to leverage Platypush to build your customized assistant.)
|
||||
[//]: # (image: /img/voice-1.jpg)
|
||||
[//]: # (author: Fabio Manganiello <fabio@platypush.tech>)
|
||||
[//]: # (published: 2020-03-08)
|
||||
|
||||
I wrote [an article](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush) a while
|
||||
ago that describes how to make your own Google-based voice assistant using just a RaspberryPi, Platypush, a speaker and
|
||||
a microphone.
|
||||
|
||||
It also showed how to make your own custom hotword model that triggers the assistant if you don’t want to say “Ok
|
||||
Google,” or if you want distinct hotwords to trigger different assistants in different languages. It also showed how to
|
||||
hook your own custom logic and scripts when certain phrases are recognized, without writing any code.
|
||||
|
||||
Since I wrote that article, a few things have changed:
|
||||
|
||||
- When I wrote the article, Platypush only supported the Google Assistant as a voice back end. In the meantime, I’ve
|
||||
worked on [supporting Alexa as well](https://github.com/BlackLight/platypush/issues/80). Feel free to use the
|
||||
`assistant.echo` integration in Platypush if you’re an Alexa fan, but bear in mind that it’s more limited than the
|
||||
existing Google Assistant based options because of limitations in the AVS (Alexa Voice Service). For example, it
won’t provide the transcript of the detected text or of the rendered response, which means it’s not possible to
attach custom hooks to recognized phrases: the AVS mostly works with audio files as input and provides audio as
output. It could also experience some minor audio glitches, at least on RaspberryPi.
|
||||
|
||||
- Although deprecated, a new release of the Google Assistant
|
||||
Library [has been made available](https://github.com/googlesamples/assistant-sdk-python/releases/tag/0.6.0) to fix the
|
||||
segmentation fault issue on RaspberryPi 4. I’ve buzzed the developers often over the past year and I’m glad that it’s
|
||||
been done! It’s good news because the Assistant library has the best engine for hotword detection I’ve seen. No other
|
||||
SDK I’ve tried — Snowboy, DeepSpeech, or PicoVoice — comes close to the native “Ok Google” hotword detection accuracy
|
||||
and performance. The news isn’t all good, however: the library is still deprecated, with no alternative currently
|
||||
on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But
|
||||
at least one of the best options out there to build a voice assistant will still work for a while. Those interested in
|
||||
building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.
|
||||
|
||||
- In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more
|
||||
state-of-the-art alternatives. I’ve been a long-time fan of [Snowboy](https://snowboy.kitt.ai/), which has a
|
||||
well-supported platypush integration, and I’ve used it as a hotword engine to trigger other assistant integrations for
|
||||
a long time. However, when it comes to accuracy in real-time scenarios, even its best models aren’t that satisfactory.
|
||||
I’ve also experimented with
[Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech) and [PicoVoice](https://github.com/Picovoice) products
for voice detection, and built integrations for them in Platypush. In this article, I’ll try to provide a comprehensive overview
|
||||
of what’s currently possible with DIY voice assistants and a comparison of the integrations I’ve built.
|
||||
|
||||
- **EDIT January 2021**: Unfortunately, as of Dec 31st,
|
||||
2020 [Snowboy has been officially shut down](https://github.com/Kitt-AI/snowboy/). The GitHub repository is still
|
||||
there, you can still clone it and either use the example models provided under `resources/models`, train a model
|
||||
using the Python API or use any of your previously trained models. However, the repo is no longer maintained, and the
|
||||
website that could be used to browse and generate user models is no longer available. It's really a shame: the user
models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained
|
||||
open-source project, and it just shows how difficult it is to keep such projects alive without anybody funding the
|
||||
time invested by the developers in them. Anyway, most of the Snowboy examples reported in this article will still work
|
||||
if you download and install the code from the repo.
|
||||
|
||||
## The Case for DIY Voice Assistants
|
||||
|
||||
Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:
|
||||
|
||||
- **Privacy**. The easiest one to guess! I’m not sure if a microphone in the house, active 24/7, connected to a private
|
||||
company through the internet is a proportionate price to pay for between five and ten interactions a day to toggle the
|
||||
lightbulbs, turn on the thermostat, or play a Spotify playlist. I’ve built the voice assistant integrations in
|
||||
platypush with the goal of giving people the option of voice-enabled services without sending all of the daily voice
|
||||
interactions over a privately-owned channel through a privately-owned box.
|
||||
|
||||
- **Compatibility**. A Google Assistant device will only work with devices that support Google Assistant. The same goes
|
||||
for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily,
|
||||
depending on the availability of the cloud connections, or permanently, because of hardware or software deprecation or
|
||||
other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to
|
||||
interact with, and does not depend on business decisions.
|
||||
|
||||
- **Flexibility**. Even when a device works with your assistant, you’re still bound to the features that have been
|
||||
agreed and implemented by the two parties. Implementing more complex routines over voice commands is usually tricky.
|
||||
In most cases, it involves creating code that will run on the cloud (either in the form of Actions or Lambdas, or
|
||||
IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must have the ability
|
||||
to run whichever logic I want on whichever device I want, using whichever custom shortcut I want (even with regex
|
||||
matching), regardless of the complexity. I also aimed to build an assistant that can provide multiple services (
|
||||
Google, Alexa, Siri etc.) in multiple languages on the same device, simply by using different hotwords.
|
||||
|
||||
- **Hardware constraints**. I’ve never understood the case for selling plastic boxes that embed a microphone and a speaker
|
||||
in order to enter the world of voice services. That was a good way to showcase the idea. After a couple of years of
|
||||
experiments, it’s probably time to expect the industry to provide a voice assistant experience that can run on any
|
||||
device, as long as it has a microphone and a controller unit that can process code. As for compatibility, there should
|
||||
be no case for Google-compatible or Alexa-compatible devices. Any device should be compatible with any assistant, as
|
||||
long as that device has a way to communicate with the outside world. The logic to control that device should be able
|
||||
to run on the same network that the device belongs to.
|
||||
|
||||
- **Cloud vs. local processing**. Most of the commercial voice assistants operate by regularly capturing streams of
|
||||
audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another
|
||||
connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In
|
||||
some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice
|
||||
assistants are dumb terminals intended to communicate with cloud providers that actually do most of the job, and they
|
||||
exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets
|
||||
are low-power devices that operate within a fast network and you don’t need much flexibility. But if you can afford to
|
||||
process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you
|
||||
want to do things that you usually can’t do with off-the-shelf solutions, you may want to process as much as possible
|
||||
of the load on your device. I understand the case for a cloud-oriented approach when it comes to voice assistants but,
|
||||
regardless of the technology, we should always be provided with a choice between decentralized and centralized
|
||||
computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or
|
||||
on-cloud, depending on the use case and depending on the user’s preference.
|
||||
|
||||
- **Scalability**. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash the copy
|
||||
of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it’s done. Without having to
|
||||
buy a new plastic box. If I need a voice-powered music speaker, I just take an existing speaker and plug it into a
|
||||
RaspberryPi. If I need a voice-powered display, I just take an existing display and plug it to a RaspberryPi. If I
|
||||
need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi,
|
||||
without having to worry about whether it’s supported in my Google Home or Alexa app. Any device should be given the
|
||||
possibility of becoming a smart device.
|
||||
|
||||
## Overview of the voice assistant integrations
|
||||
|
||||
A voice assistant usually consists of two components:
|
||||
|
||||
- An **audio recorder** that captures frames from an audio input device
|
||||
- A **speech engine** that keeps track of the current context.
|
||||
|
||||
There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of
|
||||
specific hotwords (like “Ok Google” or “Alexa”), and speech detectors, which instead do proper speech-to-text
|
||||
transcription using acoustic and language models. As you can imagine, continuously running a full speech detection has a
|
||||
far higher overhead than just running hotword detection, which only has to compare the captured speech against the,
|
||||
usually short, list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice’s Rhino. Instead
|
||||
of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if
|
||||
you say *“Can I have a small double-shot espresso with a lot of sugar and some milk”* they may return something like
`{"type": "espresso", "size": "small", "numberOfShots": 2, "sugar": "a lot", "milk": "some"}`.
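To make the difference between a transcript and an intent more concrete, here is a minimal Python sketch that consumes a structured intent like the one in the example. The `EspressoOrder` class and the `parse_intent` function are hypothetical names chosen for illustration; they are not part of Rhino or of any other SDK.

```python
import json
from dataclasses import dataclass


@dataclass
class EspressoOrder:
    """Hypothetical container for the example intent; not part of any SDK."""
    size: str
    shots: int
    sugar: str
    milk: str


def parse_intent(payload: str) -> EspressoOrder:
    """Turn the JSON emitted by a speech-to-intent engine into a typed object."""
    data = json.loads(payload)
    if data.get("type") != "espresso":
        raise ValueError(f"Unsupported intent type: {data.get('type')!r}")

    return EspressoOrder(
        size=data["size"],
        shots=int(data["numberOfShots"]),
        sugar=data["sugar"],
        milk=data["milk"],
    )


if __name__ == "__main__":
    raw = (
        '{"type": "espresso", "size": "small", "numberOfShots": 2, '
        '"sugar": "a lot", "milk": "some"}'
    )
    print(parse_intent(raw))
```

With a plain speech-to-text engine, the application would have to extract these fields from the transcript itself; with a speech-to-intent engine, that parsing step is already done.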
|
||||
|
||||
In Platypush, I’ve built integrations to provide users with a wide choice when it comes to speech-to-text processors and
|
||||
engines. Let’s go through some of the available integrations, and evaluate their pros and cons.
|
||||
|
||||
## Native Google Assistant library
|
||||
|
||||
### Integrations
|
||||
|
||||
- [`assistant.google`](https://platypush.readthedocs.io/en/latest/platypush/plugins/assistant.google.html) plugin (to
|
||||
programmatically start/stop conversations)
|
||||
and [`assistant.google`](https://platypush.readthedocs.io/en/latest/platypush/backend/assistant.google.html) backend
|
||||
(for continuous hotword detection).
|
||||
|
||||
### Configuration
|
||||
|
||||
- Create a Google project and download the `credentials.json` file from
|
||||
the [Google developers console](https://console.cloud.google.com/apis/credentials).
|
||||
|
||||
- Install the `google-oauthlib-tool`:
|
||||
|
||||
```shell
|
||||
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
|
||||
```
|
||||
|
||||
- Authenticate to use the `assistant-sdk-prototype` scope:
|
||||
|
||||
```shell
|
||||
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json
|
||||
|
||||
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
|
||||
--scope https://www.googleapis.com/auth/gcm \
|
||||
--save --headless --client-secrets $CREDENTIALS_FILE
|
||||
```
|
||||
|
||||
- Install Platypush with the HTTP backend and Google Assistant library support:
|
||||
|
||||
```shell
|
||||
[sudo] pip install 'platypush[http,google-assistant-legacy]'
|
||||
```
|
||||
|
||||
- Create or add the lines to `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:
|
||||
|
||||
```yaml
|
||||
backend.http:
    enabled: True

backend.assistant.google:
    enabled: True

assistant.google:
    enabled: True
|
||||
```
|
||||
|
||||
- Start Platypush, say “Ok Google” and enjoy your assistant. On the web panel on `http://your-rpi:8008` you should be
|
||||
able to see your voice interactions in real-time.
|
||||
|
||||
### Features
|
||||
|
||||
- *Hotword detection*: **YES** (“Ok Google” or “Hey Google”).
|
||||
- *Speech detection*: **YES** (once the hotword is detected).
|
||||
- *Detection runs locally*: **NO** (hotword detection seems to run locally, but once the hotword is detected a channel is opened
|
||||
with Google servers for the interaction).
|
||||
|
||||
### Pros
|
||||
|
||||
- It implements most of the features that you’d find in any Google Assistant products. That includes native support for
|
||||
timers, calendars, customized responses on the basis of your profile and location, native integration with the devices
|
||||
configured in your Google Home, and so on. For more complex features, you’ll have to write your custom platypush hooks
|
||||
on e.g. speech detected or conversation start/end events.
|
||||
|
||||
- Both hotword detection and speech detection are rock solid, as they rely on the Google cloud capabilities.
|
||||
|
||||
- Good performance even on older RaspberryPi models (the library isn’t available for the Zero model or other ARMv6-based
|
||||
devices though), because most of the processing duties actually happen in the cloud. The audio processing thread takes
|
||||
around 2–3% of the CPU on a RaspberryPi 4.
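The first point above mentions custom hooks that react to recognized speech. As a rough illustration of the idea, the sketch below matches a recognized phrase against regular expressions and calls a handler with the captured groups. This is not Platypush's actual hook syntax: `on_phrase`, `dispatch` and `toggle_lights` are hypothetical names used only for this example.

```python
import re
from typing import Callable, List, Optional, Tuple

# Registry of (compiled pattern, handler) pairs; purely illustrative.
HOOKS: List[Tuple[re.Pattern, Callable[..., str]]] = []


def on_phrase(pattern: str):
    """Register a handler for phrases matching the given regular expression."""
    def register(func: Callable[..., str]) -> Callable[..., str]:
        HOOKS.append((re.compile(pattern, re.IGNORECASE), func))
        return func
    return register


@on_phrase(r"turn (?P<state>on|off) the (?P<room>\w+) lights?")
def toggle_lights(state: str, room: str) -> str:
    # A real handler would call a light integration here.
    return f"lights in {room}: {state}"


def dispatch(phrase: str) -> Optional[str]:
    """Run the first handler whose pattern matches the whole phrase."""
    for pattern, handler in HOOKS:
        match = pattern.fullmatch(phrase.strip())
        if match:
            return handler(**match.groupdict())
    return None


if __name__ == "__main__":
    print(dispatch("Turn on the kitchen lights"))
    print(dispatch("What time is it"))
```

Named groups keep the handler signature readable, and phrases that match no pattern simply fall through to the assistant's default behaviour.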
|
||||
|
||||
### Cons
|
||||
|
||||
- The Google Assistant library used as a backend by the integration has
|
||||
been [deprecated by Google](https://developers.google.com/assistant/sdk/reference/library/python). It still works on
|
||||
most of the devices I’ve tried, as long as the latest version is used, but keep in mind that it’s no longer maintained
|
||||
by Google and it could break in the future. Unfortunately, I’m still waiting for an official alternative.
|
||||
|
||||
- If your main goal is to operate voice-enabled services within a secure environment with no processing happening on
|
||||
someone else’s cloud, then this is not your best option. The assistant library makes your computer behave more or less
|
||||
like a full Google Assistant device, including capturing audio and sending it to Google servers for processing and,
|
||||
potentially, review.
|
||||
|
||||
## Google Assistant Push-To-Talk Integration
|
||||
|
||||
### Integrations
|
||||
|
||||
- [`assistant.google.pushtotalk`](https://platypush.readthedocs.io/en/latest/platypush/plugins/assistant.google.pushtotalk.html)
|
||||
plugin.
|
||||
|
||||
### Configuration
|
||||
|
||||
- Create a Google project and download the `credentials.json` file from
|
||||
the [Google developers console](https://console.cloud.google.com/apis/credentials).
|
||||
|
||||
- Install the `google-oauthlib-tool`:
|
||||
|
||||
```shell
|
||||
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
|
||||
```
|
||||
|
||||
- Authenticate to use the `assistant-sdk-prototype` scope:
|
||||
|
||||
```shell
|
||||
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json
|
||||
|
||||
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
|
||||
--scope https://www.googleapis.com/auth/gcm \
|
||||
--save --headless --client-secrets $CREDENTIALS_FILE
|
||||
```
|
||||
|
||||
- Install Platypush with the HTTP backend and Google Assistant SDK support:
|
||||
|
||||
```shell
|
||||
[sudo] pip install 'platypush[http,google-assistant]'
|
||||
```
|
||||
|
||||
- Create or add the lines to `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:
|
||||
|
||||
```yaml
|
||||
backend.http:
    enabled: True

assistant.google.pushtotalk:
    language: en-US
|
||||
```
|
||||
|
||||
- Start Platypush. Unlike the native Google library integration, the push-to-talk plugin doesn’t come with a hotword
|
||||
detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks,
|
||||
procedures, or through the HTTP API:
|
||||
|
||||
```shell
|
||||
curl -XPOST -H 'Content-Type: application/json' -d '
|
||||
{
|
||||
"type":"request",
|
||||
"action":"assistant.google.pushtotalk.start_conversation"
|
||||
}' -u 'username:password' http://your-rpi:8008/execute
|
||||
```
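The same call can be made from a script. The sketch below builds the request with Python's standard library only. It assumes that the web server accepts HTTP basic authentication, as the credentials passed to `curl` suggest; the `build_execute_request` helper is a hypothetical name, while the `/execute` endpoint and the payload come from the example above.

```python
import base64
import json
import urllib.request


def build_execute_request(base_url, action, username, password, args=None):
    """Build (but do not send) a POST request for the /execute endpoint."""
    payload = {"type": "request", "action": action}
    if args:
        payload["args"] = args

    # Assumption: the server accepts HTTP basic authentication.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/execute",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_execute_request(
        "http://your-rpi:8008",
        "assistant.google.pushtotalk.start_conversation",
        "username",
        "password",
    )
    print(req.get_method(), req.full_url, req.data.decode())
    # Sending the request requires a reachable Platypush instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

Keeping the request construction separate from the network call makes it easy to reuse the helper for other actions, such as `stop_conversation`.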
|
||||
|
||||
### Features
|
||||
|
||||
- *Hotword detection*: **NO** (call `start_conversation` or `stop_conversation` from your logic or from the context of a
|
||||
hotword integration like Snowboy, DeepSpeech or PicoVoice to trigger or stop the assistant).
|
||||
|
||||
- *Speech detection*: **YES**.
|
||||
|
||||
- *Detection runs locally*: **NO** (you can customize the hotword engine and how to trigger the assistant, but once a
|
||||
conversation is started a channel is opened with Google servers).
|
||||
|
||||
### Pros
|
||||
|
||||
- It implements many of the features you’d find in any Google Assistant product out there, even though hotword detection
|
||||
isn’t available and some of the features currently available on the assistant library aren’t provided (like timers or
|
||||
alarms).
|
||||
|
||||
- Rock-solid speech detection, using the same speech model used by Google Assistant products.
|
||||
|
||||
- Relatively good performance even on older RaspberryPi models. It’s also available for arm6 architecture, which makes
|
||||
it suitable also for RaspberryPi Zero or other low-power devices. No hotword engine running means that it uses
|
||||
resources only when you call `start_conversation`.
|
||||
|
||||
- It provides the benefits of the Google Assistant speech engine with no need to have a 24/7 open connection between
|
||||
your mic and Google’s servers. The connection is only opened upon `start_conversation`. This makes it a good option if
|
||||
privacy is a concern, or if you want to build more flexible assistants that can be triggered through different hotword
|
||||
engines (or even build assistants that are triggered in different languages depending on the hotword that you use), or
|
||||
assistants that aren’t triggered by a hotword at all; for example, you can call `start_conversation` upon button press,
|
||||
motion sensor event or web call.
|
||||
|
||||
### Cons
|
||||
|
||||
- I’ve built this integration after the deprecation of the Google Assistant library occurred with no official
|
||||
alternatives being provided. I’ve built it by refactoring the poorly refined code provided by Google in its samples (
|
||||
[`pushtotalk.py`](https://github.com/googlesamples/assistant-sdk-python/blob/master/google-assistant-sdk/googlesamples/assistant/grpc/pushtotalk.py))
|
||||
and making a proper plugin out of it. It works, but keep in mind that it’s based on some ugly code that’s waiting to
|
||||
be replaced by Google.
|
||||
|
||||
- No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.
|
||||
|
||||
## Alexa Integration
|
||||
|
||||
### Integrations
|
||||
|
||||
- [`assistant.echo`](https://platypush.readthedocs.io/en/latest/platypush/plugins/assistant.echo.html) plugin.
|
||||
|
||||
### Configuration
|
||||
|
||||
- Install Platypush with the HTTP backend and Alexa support:
|
||||
|
||||
```shell
|
||||
[sudo] pip install 'platypush[http,alexa]'
|
||||
```
|
||||
|
||||
- Run `alexa-auth`. It will start a local web server on your machine on `http://your-rpi:3000`. Open it in your browser
|
||||
and authenticate with your Amazon account. A credentials file should be generated under `~/.avs.json`.
|
||||
|
||||
- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the assistant
|
||||
integration:
|
||||
|
||||
```yaml
|
||||
backend.http:
    enabled: True

assistant.echo:
    enabled: True
|
||||
```
|
||||
|
||||
- Start Platypush. The Alexa integration doesn’t come with a hotword detection engine. You can initiate or end
|
||||
conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:
|
||||
|
||||
```shell
|
||||
curl -XPOST -H 'Content-Type: application/json' -d '
|
||||
{
|
||||
"type":"request",
|
||||
"action":"assistant.echo.start_conversation"
|
||||
}' -u 'username:password' http://your-rpi:8008/execute
|
||||
```
|
||||
|
||||
### Features
|
||||
|
||||
- *Hotword detection*: **NO** (call `start_conversation` or `stop_conversation` from your logic or from the context of a
|
||||
hotword integration like Snowboy or PicoVoice to trigger or stop the assistant).
|
||||
|
||||
- *Speech detection*: **YES** (although limited: transcription of the processed audio won’t be provided).
|
||||
|
||||
- *Detection runs locally*: **NO**.
|
||||
|
||||
### Pros
|
||||
|
||||
- It implements many of the features that you’d find in any Alexa product out there, even though hotword detection isn’t
|
||||
available. Also, the support for skills or media control may be limited.
|
||||
|
||||
- Good speech detection capabilities, although inferior to the Google Assistant when it comes to accuracy.
|
||||
|
||||
- Good performance even on low-power devices. No hotword engine running means it uses resources only when you call
|
||||
`start_conversation`.
|
||||
|
||||
- It provides some of the benefits of an Alexa device but with no need for a 24/7 open connection between your mic and
|
||||
Amazon’s servers. The connection is only opened upon `start_conversation`.
|
||||
|
||||
### Cons
|
||||
|
||||
- The situation is extremely fragmented when it comes to Alexa voice SDKs. Amazon eventually re-released the AVS (Alexa
|
||||
Voice Service), mostly with commercial uses in mind, but its features are still quite limited compared to the Google
|
||||
assistant products. The biggest limitation is the fact that the AVS works on raw audio input and spits back raw audio
|
||||
responses. It means that text transcription, either for the request or the response, won’t be available. That limits
|
||||
what you can build with it. For example, you won’t be able to capture custom requests through event hooks.
|
||||
|
||||
- No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.
|
||||
|
||||
## Snowboy Integration
|
||||
|
||||
### Integrations
|
||||
|
||||
- [`assistant.snowboy`](https://platypush.readthedocs.io/en/latest/platypush/backend/assistant.snowboy.html) backend.
|
||||
|
||||
### Configuration
|
||||
|
||||
- Install Platypush with the HTTP backend and Snowboy support:
|
||||
|
||||
```shell
|
||||
[sudo] pip install 'platypush[http,snowboy]'
|
||||
```
|
||||
|
||||
- Choose your hotword model(s). Some are available under `SNOWBOY_INSTALL_DIR/resources/models`. Otherwise, you can
|
||||
train or download models from the [Snowboy website](https://snowboy.kitt.ai/).
|
||||
|
||||
- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the assistant
|
||||
integration:
|
||||
|
||||
```yaml
|
||||
backend.http:
    enabled: True

backend.assistant.snowboy:
    audio_gain: 1.2
    models:
        # Trigger the Google assistant in Italian when I say "computer"
        computer:
            voice_model_file: ~/models/computer.umdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: it-IT
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger the Google assistant in English when I say "OK Google"
        ok_google:
            voice_model_file: ~/models/OK Google.pmdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger Alexa when I say "Alexa"
        alexa:
            voice_model_file: ~/models/Alexa.pmdl
            assistant_plugin: assistant.echo
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.5
|
||||
```
|
||||
|
||||
- Start Platypush. Say the hotword associated with one of your models, check on the logs that the
|
||||
[`HotwordDetectedEvent`](https://platypush.readthedocs.io/en/latest/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent)
|
||||
is triggered and, if there’s an assistant plugin associated with the hotword, the corresponding assistant is correctly
|
||||
started.
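Conceptually, the configuration above is a routing table from hotwords to assistants. The Python sketch below illustrates that idea. It is not the Snowboy backend's implementation: `HotwordRoute`, `ROUTES` and `on_hotword` are hypothetical names, and only the plugin names, languages and sensitivities are taken from the configuration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class HotwordRoute:
    """Hypothetical mirror of one entry under `models` in the configuration."""
    assistant_plugin: str
    assistant_language: str
    sensitivity: float


ROUTES: Dict[str, HotwordRoute] = {
    "computer": HotwordRoute("assistant.google.pushtotalk", "it-IT", 0.4),
    "ok_google": HotwordRoute("assistant.google.pushtotalk", "en-US", 0.4),
    "alexa": HotwordRoute("assistant.echo", "en-US", 0.5),
}


def on_hotword(
    hotword: str, start: Callable[[str, str], None]
) -> Optional[HotwordRoute]:
    """Look up the detected hotword and start the matching assistant, if any."""
    route = ROUTES.get(hotword)
    if route is not None:
        start(route.assistant_plugin, route.assistant_language)
    return route


if __name__ == "__main__":
    on_hotword("computer", lambda plugin, lang: print(f"start {plugin} ({lang})"))
```

The `start` callback stands in for whatever actually opens the conversation, which keeps the routing logic independent of the assistant integration.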
|
||||
|
||||
### Features
|
||||
|
||||
- *Hotword detection*: **YES**.
|
||||
- *Speech detection*: **NO**.
|
||||
- *Detection runs locally*: **YES**.
|
||||
|
||||
### Pros
|
||||
|
||||
- I've been an early fan and supporter of the Snowboy project. I really like the idea of crowd-powered machine learning.
|
||||
You can download any hotword models for free from their website, provided that you record three audio samples of you
|
||||
saying that word in order to help improve the model. You can also create your custom hotword model, and if enough
|
||||
people are interested in using it then they’ll contribute with their samples, and the model will become more robust
|
||||
over time. I believe that more machine learning projects out there could really benefit from this “use it for free as
|
||||
long as you help improve the model” paradigm.
|
||||
|
||||
- Platypush was an early supporter of Snowboy, so its integration is well-supported and extensively documented. You can
|
||||
natively configure custom assistant plugins to be executed when a certain hotword is detected, making it easy to make
|
||||
a multi-language and multi-hotword voice assistant.
|
||||
|
||||
- Good performance, even on low-power devices. I’ve used Snowboy in combination with the Google Assistant push-to-talk
|
||||
integration for a while on single-core RaspberryPi Zero devices, and the CPU usage from hotword processing never
|
||||
exceeded 20–25%.
|
||||
|
||||
- The hotword detection runs locally, on models that are downloaded locally. That means no need for a network connection
|
||||
to run and no data exchanged with any cloud.
|
||||
|
||||
### Cons
|
||||
|
||||
- Even though the idea of crowd-powered voice models is definitely interesting and has plenty of potential to scale up,
|
||||
the most popular models on their website have been trained with at most 2000 samples. And (sadly as well as
|
||||
expectedly) most of those voice samples belong to white, young-adult males, which makes many of these models perform
|
||||
quite poorly with speech recorded from any individuals that don’t fit within that category (and also with people who
|
||||
aren’t native English speakers).
|
||||
|
||||
## Mozilla DeepSpeech
|
||||
|
||||
### Integrations
|
||||
|
||||
- [`stt.deepspeech`](https://platypush.readthedocs.io/en/latest/platypush/plugins/stt.deepspeech.html) plugin
|
||||
and [`stt.deepspeech`](https://platypush.readthedocs.io/en/latest/platypush/backend/stt.deepspeech.html) backend (for
|
||||
continuous detection).
|
||||
|
||||
### Configuration
|
||||
|
||||
- Install Platypush with the HTTP backend and Mozilla DeepSpeech support. Take note of the version of DeepSpeech that
|
||||
gets installed:
|
||||
|
||||
```shell
|
||||
[sudo] pip install 'platypush[http,deepspeech]'
|
||||
```
|
||||
|
||||
- Download the Tensorflow model files for the version of DeepSpeech that has been installed. This may take a while
|
||||
depending on your connection:
|
||||
|
||||
```shell
|
||||
export MODELS_DIR=~/models
|
||||
export DEEPSPEECH_VERSION=0.6.1
|
||||
|
||||
wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
|
||||
|
||||
tar xvf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
# Expected output:
# x deepspeech-0.6.1-models/
# x deepspeech-0.6.1-models/lm.binary
# x deepspeech-0.6.1-models/output_graph.pbmm
# x deepspeech-0.6.1-models/output_graph.pb
# x deepspeech-0.6.1-models/trie
# x deepspeech-0.6.1-models/output_graph.tflite
|
||||
|
||||
mv deepspeech-$DEEPSPEECH_VERSION-models $MODELS_DIR
|
||||
```
|
||||
|
||||
- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the DeepSpeech
|
||||
integration:
|
||||
|
||||
```yaml
|
||||
backend.http:
    enabled: True

stt.deepspeech:
    model_file: ~/models/output_graph.pbmm
    lm_file: ~/models/lm.binary
    trie_file: ~/models/trie

    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

    conversation_timeout: 5

backend.stt.deepspeech:
    enabled: True
|
||||
```
|
||||
|
||||
- Start Platypush. Speech detection will start running on startup.
|
||||
[`SpeechDetectedEvents`](https://platypush.readthedocs.io/en/latest/platypush/events/stt.html#platypush.message.event.stt.SpeechDetectedEvent)
|
||||
will be triggered when you talk.
|
||||
[`HotwordDetectedEvents`](https://platypush.readthedocs.io/en/latest/platypush/events/stt.html#platypush.message.event.stt.HotwordDetectedEvent)
|
||||
will be triggered when you say one of the configured hotwords.
|
||||
[`ConversationDetectedEvents`](https://platypush.readthedocs.io/en/latest/platypush/events/stt.html#platypush.message.event.stt.ConversationDetectedEvent)
|
||||
will be triggered when you say something after a hotword, with speech provided as an argument. You can also disable the
|
||||
continuous detection and only start it programmatically by calling `stt.deepspeech.start_detection` and
|
||||
`stt.deepspeech.stop_detection`. You can also use it to perform offline speech transcription from audio files:
|
||||
|
||||
```shell
|
||||
curl -XPOST -H 'Content-Type: application/json' -d '
|
||||
{
|
||||
"type":"request",
|
||||
"action":"stt.deepspeech.detect",
|
||||
"args": {
|
||||
"audio_file": "~/audio.wav"
|
||||
}
|
||||
}' -u 'username:password' http://your-rpi:8008/execute
|
||||
|
||||
# Example response
|
||||
{
|
||||
"type":"response",
|
||||
"target":"http",
|
||||
"response": {
|
||||
"errors":[],
|
||||
"output": {
|
||||
"speech": "This is a test"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
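The same request can also be sent from a script. The following is a minimal sketch using only the Python standard library. It assumes the host, port, credentials and response layout from the `curl` example above; the function names `build_stt_request` and `transcribe` are illustrative and not part of Platypush.

```python
import base64
import json
import urllib.request


def build_stt_request(host, audio_file, username, password, port=8008):
    """Build (but do not send) the HTTP request for stt.deepspeech.detect."""
    payload = {
        "type": "request",
        "action": "stt.deepspeech.detect",
        "args": {"audio_file": audio_file},
    }
    # Same credentials that curl would send as HTTP basic authentication
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"http://{host}:{port}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


def transcribe(host, audio_file, username, password):
    """Send the request and return the transcribed text from the response."""
    request = build_stt_request(host, audio_file, username, password)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the response layout shown in the example above
    return body["response"]["output"]["speech"]
```

Calling `transcribe("your-rpi", "~/audio.wav", "username", "password")` should then return the transcribed text, provided that the web server and the DeepSpeech integration are running on the target host.
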
### Features

- *Hotword detection*: **YES**.
- *Speech detection*: **YES**.
- *Detection runs locally*: **YES**.

### Pros

- I’ve been honestly impressed by the features of DeepSpeech and the progress they’ve made since version 0.6.0.
  Mozilla made it easy to run both hotword and speech detection on-device with no need for any third-party services or
  network connection. The full codebase is open-source and the TensorFlow voice and language models are also very
  good. It’s amazing that they’ve released the whole thing for free to the community. It also means that you can
  easily extend the TensorFlow model by training it with your own samples.

- Speech-to-text transcription of audio files can be a very useful feature.

### Cons

- DeepSpeech is quite demanding when it comes to CPU resources. It will run OK on a laptop or on a RaspberryPi 4
  (although in my tests speech detection took 100% of a core on a RaspberryPi 4). It may be too resource-intensive to
  run on less powerful machines.

- DeepSpeech has a bit more delay than other solutions. The engineers at Mozilla have worked a lot to make the model as
  small and performant as possible, and they claim to have achieved real-time performance on a RaspberryPi 4. In
  reality, all of my tests showed between 2 and 4 seconds of delay between speech capture and detection.

- DeepSpeech is relatively good at detecting speech, but not at interpreting the semantic context (that’s something
  where Google still wins hands down). If you say “this is a test,” the model may actually capture “these is a test.”
  “This” and “these” do indeed sound almost the same in English, but the Google assistant has a better semantic engine
  to detect the right interpretation of such ambiguous cases. DeepSpeech works quite well for speech-to-text
  transcription purposes but, in such ambiguous cases, it lacks some semantic context.

- Even though it’s possible to use DeepSpeech from Platypush as a hotword detection engine, keep in mind that it’s not
  how the engine is intended to be used. Hotword engines usually run against smaller and more performant models only
  intended to detect one or a few words, not against a full-featured language model. The best usage of DeepSpeech is
  probably either for offline text transcription, or alongside another hotword integration, leveraging DeepSpeech only
  for the speech detection part.

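One possible way to soften the “this/these” problem in your own hook logic is to match the transcript against the phrases you expect, tolerating small differences. The sketch below uses `difflib` from the Python standard library. It is only an illustration of the idea, not something DeepSpeech or Platypush does for you, and the `match_command` helper, the `COMMANDS` list and the 0.8 cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical list of phrases that your custom logic reacts to
COMMANDS = ["this is a test", "turn on the lights"]


def match_command(transcript, commands, cutoff=0.8):
    """Return the known command closest to the transcript, or None."""
    normalized = transcript.lower().strip()
    # get_close_matches ranks candidates by similarity ratio and drops
    # anything that scores below the cutoff
    matches = difflib.get_close_matches(normalized, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_command("these is a test", COMMANDS))  # this is a test
print(match_command("play some music", COMMANDS))  # None
```

A stricter cutoff reduces false positives but also makes the match less forgiving, so the right value depends on how similar your commands are to each other.
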
## PicoVoice

[PicoVoice](https://github.com/Picovoice/) is a very promising company that has released several products for performing
voice detection on-device. Among them:

- [*Porcupine*](https://github.com/Picovoice/porcupine), a hotword engine.
- [*Leopard*](https://github.com/Picovoice/leopard), a speech-to-text offline transcription engine.
- [*Cheetah*](https://github.com/Picovoice/cheetah), a speech-to-text engine for real-time applications.
- [*Rhino*](https://github.com/Picovoice/rhino), a speech-to-intent engine.

So far, Platypush provides integrations with Porcupine and Cheetah.

### Integrations

- *Hotword engine*:
  [`stt.picovoice.hotword`](https://platypush.readthedocs.io/en/latest/platypush/plugins/stt.picovoice.hotword.html)
  plugin and
  [`stt.picovoice.hotword`](https://platypush.readthedocs.io/en/latest/platypush/backend/stt.picovoice.hotword.html)
  backend (for continuous detection).

- *Speech engine*:
  [`stt.picovoice.speech`](https://platypush.readthedocs.io/en/latest/platypush/plugins/stt.picovoice.speech.html)
  plugin and
  [`stt.picovoice.speech`](https://platypush.readthedocs.io/en/latest/platypush/backend/stt.picovoice.speech.html)
  backend (for continuous detection).

### Configuration

- Install Platypush with the HTTP backend and the PicoVoice hotword integration and/or speech integration:

```shell
[sudo] pip install 'platypush[http,picovoice-hotword,picovoice-speech]'
```

- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the PicoVoice integrations:

```yaml
stt.picovoice.hotword:
  # Custom list of hotwords
  hotwords:
    - computer
    - alexa
    - hello

# Enable continuous hotword detection
backend.stt.picovoice.hotword:
  enabled: True

# Enable continuous speech detection
# backend.stt.picovoice.speech:
#   enabled: True

# Or start speech detection when a hotword is detected
event.hook.OnHotwordDetected:
  if:
    type: platypush.message.event.stt.HotwordDetectedEvent
  then:
    # Start a timer that stops the detection in 10 seconds
    - action: utils.set_timeout
      args:
        seconds: 10
        name: StopSpeechDetection
        actions:
          - action: stt.picovoice.speech.stop_detection

    - action: stt.picovoice.speech.start_detection
```

- Start Platypush and enjoy your on-device voice assistant.

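The hook in the configuration above follows a simple pattern: a hotword opens a listening window, and a timer closes it after a fixed delay. The following sketch shows the same pattern in plain Python with `threading.Timer`, in case it helps to reason about the behavior. It does not use any Platypush API: the `ListeningWindow` class and its `start` and `stop` callables are hypothetical placeholders for whatever starts and stops speech detection in your setup.

```python
import threading


class ListeningWindow:
    """Open a speech-detection window on hotword and close it after a timeout."""

    def __init__(self, start, stop, seconds=10.0):
        self._start = start      # placeholder: callable that starts detection
        self._stop = stop        # placeholder: callable that stops detection
        self._seconds = seconds
        self._timer = None
        self._lock = threading.Lock()

    def on_hotword(self):
        with self._lock:
            # Design choice of this sketch: a new hotword restarts the
            # countdown instead of stacking several timers
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._seconds, self._stop)
            self._timer.daemon = True
            self._timer.start()
        self._start()

    def wait(self):
        """Block until the current window has closed."""
        timer = self._timer
        if timer is not None:
            timer.join()
```

Restarting the countdown on every hotword is an assumption of this sketch, chosen so that repeated hotwords extend the window rather than leaving several pending timers around.
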
### Features

- *Hotword detection*: **YES**.
- *Speech detection*: **YES**.
- *Detection runs locally*: **YES**.

### Pros

- When it comes to on-device voice engines, PicoVoice products are probably the best solution out there. Their hotword
  engine is far more accurate than Snowboy and it manages to be even less CPU-intensive. Their speech engine has much
  less delay than DeepSpeech and it’s also much less power-hungry: it will still run well and with low latency even on
  older models of RaspberryPi.

### Cons

- While PicoVoice provides Python SDKs, their native libraries are closed source. It means that I couldn’t dig much into
  how they’ve solved the problem.

- Their hotword engine (Porcupine) can be installed and run free of charge for personal use on any device, but if you
  want to expand the set of keywords provided by default, or add more samples to train the existing models, then you’ll
  have to go for a commercial license. Their speech engine (Cheetah), instead, can only be installed and run free of
  charge for personal use on Linux on the x86_64 architecture. Any other architecture or operating system, as well as
  any chance to extend the model or use a different model, is only possible through a commercial license. While I
  understand their point and their business model, I’d have been super-happy to just pay for a license through a more
  friendly process, instead of relying on the old-fashioned “contact us for a commercial license/we’ll reach back to
  you” paradigm.

- Cheetah’s speech engine still suffers from some of the issues of DeepSpeech when it comes to semantic context/intent
  detection. The “this/these” ambiguity also happens here. These problems can be partially solved by using Rhino,
  PicoVoice’s speech-to-intent engine, which provides a structured representation of the speech intent instead of a
  letter-by-letter transcription. I haven’t yet worked on integrating Rhino into Platypush, though.

## Conclusions

The democratization of voice technology has long been dreamed about, and it’s finally (slowly) coming. The situation out
there is still quite fragmented though, and some commercial SDKs may still get deprecated on short notice or with no
notice at all. But at least some solutions are emerging to bring speech detection to all devices.

I’ve built integrations in Platypush for all of these services because I believe that it’s up to users, not to
businesses, to decide how people should use and benefit from voice technology. Moreover, having so many voice
integrations in the same product (and especially having voice integrations that all expose the same API and generate
the same events) makes it very easy to write assistant-agnostic logic, and to really decouple the task of speech
recognition from the business logic that can be run by voice commands.

Check out
[my previous article](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush) to
learn how to write your own custom hooks in Platypush on speech detection, hotword detection and speech start/stop
events.

To summarize my findings so far:

- Use the native **Google Assistant** integration if you want to have a full Google experience, and if you’re OK with
  Google servers processing your audio and with the possibility that at some point in the future the deprecated Google
  Assistant library will stop working.

- Use the **Google push-to-talk** integration if you only want to have the assistant, without hotword detection, or you
  want your assistant to be triggered by alternative hotwords.

- Use the **Alexa** integration if you already have an Amazon-powered ecosystem and you’re OK with having less
  flexibility when it comes to custom hooks because of the unavailability of speech transcript features in the AVS.

- Use **Snowboy** if you want to use a flexible, open-source and crowd-powered engine for hotword detection that runs
  on-device and/or use multiple assistants at the same time through different hotword models, even if the models may not
  be that accurate.

- Use **Mozilla DeepSpeech** if you want a fully on-device open-source engine powered by a robust TensorFlow model, even
  if it comes with a higher CPU load and a bit more latency.

- Use **PicoVoice** solutions if you want a full voice solution that runs on-device and is both accurate and
  performant, even though you’ll need a commercial license to use it on some devices or to extend or change the model.

Let me know your thoughts on these solutions and your experience with these integrations!