diff --git a/static/img/voice-1.jpg b/static/img/voice-1.jpg
new file mode 100644
index 0000000..0f10595
Binary files /dev/null and b/static/img/voice-1.jpg differ
diff --git a/static/pages/Build-custom-voice-assistants.md b/static/pages/Build-custom-voice-assistants.md
new file mode 100644
index 0000000..04c5486
--- /dev/null
+++ b/static/pages/Build-custom-voice-assistants.md
@@ -0,0 +1,742 @@
[//]: # (title: Build custom voice assistants)
[//]: # (description: An overview of the current technologies and how to leverage Platypush to build your customized assistant.)
[//]: # (image: /img/voice-1.jpg)
[//]: # (author: Fabio Manganiello)
[//]: # (published: 2020-03-08)

I wrote [an article](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush) a while ago that describes how to make your own Google-based voice assistant using just a RaspberryPi, Platypush, a speaker and a microphone.

It also showed how to make your own custom hotword model that triggers the assistant if you don't want to say "Ok Google," or if you want distinct hotwords to trigger different assistants in different languages. And it showed how to hook your own custom logic and scripts to recognized phrases, without writing any code.

Since I wrote that article, a few things have changed:

- When I wrote the article, Platypush only supported the Google Assistant as a voice back end. In the meantime, I've worked on [supporting Alexa as well](https://github.com/BlackLight/platypush/issues/80). Feel free to use the `assistant.echo` integration in Platypush if you're an Alexa fan, but bear in mind that it's more limited than the existing Google Assistant-based options — there are limitations in the AVS (Alexa Voice Service). For example, it won't provide the transcript of the detected text, which means it's not possible to attach custom hooks to recognized phrases, nor to get the transcript of the rendered response, because the AVS mostly works with audio as input and provides audio as output. It could also experience some minor audio glitches, at least on RaspberryPi.

- Although deprecated, a new release of the Google Assistant Library [has been made available](https://github.com/googlesamples/assistant-sdk-python/releases/tag/0.6.0) to fix the segmentation fault issue on RaspberryPi 4. I've buzzed the developers often over the past year and I'm glad that it's been done! It's good news because the Assistant library has the best engine for hotword detection I've seen. No other SDK I've tried — Snowboy, DeepSpeech, or PicoVoice — comes close to the native "Ok Google" hotword detection accuracy and performance. The news isn't all good, however: the library is still deprecated, and no alternative is currently on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But at least one of the best options out there to build a voice assistant will still work for a while. Those interested in building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.

- In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more state-of-the-art alternatives. I've been a long-time fan of [Snowboy](https://snowboy.kitt.ai/), which has a well-supported Platypush integration, and I've used it as a hotword engine to trigger other assistant integrations for a long time.
However, when it comes to accuracy in real-time scenarios, even its best models aren't that satisfactory. I've also experimented with [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech) and [PicoVoice](https://github.com/Picovoice) products for voice detection, and I've built integrations for them in Platypush. In this article, I'll try to provide a comprehensive overview of what's currently possible with DIY voice assistants and a comparison of the integrations I've built.

- **EDIT January 2021**: Unfortunately, as of Dec 31st, 2020 [Snowboy has been officially shut down](https://github.com/Kitt-AI/snowboy/). The GitHub repository is still there: you can still clone it and either use the example models provided under `resources/models`, train a model using the Python API, or use any of your previously trained models. However, the repo is no longer maintained, and the website that could be used to browse and generate user models is no longer available. It's really a shame - the user models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained open-source project, and it just shows how difficult it is to keep such projects alive without anybody funding the time the developers invest in them. Anyway, most of the Snowboy examples reported in this article will still work if you download and install the code from the repo.

## The Case for DIY Voice Assistants

Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:

- **Privacy**. The easiest one to guess! I'm not sure if a microphone in the house, active 24/7, connected to a private company through the internet is a proportionate price to pay for between five and ten interactions a day to toggle the lightbulbs, turn on the thermostat, or play a Spotify playlist. I've built the voice assistant integrations in Platypush with the goal of giving people the option of voice-enabled services without sending all of the daily voice interactions over a privately-owned channel through a privately-owned box.

- **Compatibility**. A Google Assistant device will only work with devices that support Google Assistant. The same goes for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily, depending on the availability of the cloud connection, or permanently, because of hardware or software deprecation or other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to interact with, and does not depend on business decisions.

- **Flexibility**. Even when a device works with your assistant, you're still bound to the features that have been agreed upon and implemented by the two parties. Implementing more complex routines over voice commands is usually tricky. In most cases, it involves creating code that will run on the cloud (either in the form of Actions or Lambdas, or IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must have the ability to run whichever logic I want on whichever device I want, using whichever custom shortcut I want (even with regex matching), regardless of the complexity. I also aimed to build an assistant that can provide multiple services (Google, Alexa, Siri, etc.)
in multiple languages on the same device, simply by using different hotwords.

- **Hardware constraints**. I've never understood the case for selling plastic boxes that embed a microphone and a speaker in order to enter the world of voice services. That was a good way to showcase the idea. After a couple of years of experiments, it's probably time to expect the industry to provide a voice assistant experience that can run on any device, as long as it has a microphone and a controller unit that can run code. As for compatibility, there should be no case for Google-compatible or Alexa-compatible devices. Any device should be compatible with any assistant, as long as that device has a way to communicate with the outside world. The logic to control that device should be able to run on the same network that the device belongs to.

- **Cloud vs. local processing**. Most of the commercial voice assistants operate by regularly capturing streams of audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice assistants are dumb terminals intended to communicate with cloud providers that actually do most of the work, and they exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets are low-power devices that operate within a fast network and you don't need much flexibility. But if you can afford to process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you want to do things that you usually can't do with off-the-shelf solutions, you may want to process as much of the load as possible on your device. I understand the case for a cloud-oriented approach when it comes to voice assistants but, regardless of the technology, we should always be provided with a choice between decentralized and centralized computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or in the cloud, depending on the use case and the user's preference.

- **Scalability**. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash a copy of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it's done, without having to buy a new plastic box. If I need a voice-powered music speaker, I just take an existing speaker and plug it into a RaspberryPi. If I need a voice-powered display, I just take an existing display and plug it into a RaspberryPi. If I need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi, without having to worry about whether it's supported in my Google Home or Alexa app. Any device should be given the possibility of becoming a smart device.

## Overview of the voice assistant integrations

A voice assistant usually consists of two components:

- An **audio recorder** that captures frames from an audio input device
- A **speech engine** that processes the captured audio and keeps track of the current context.
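
Before getting into the individual engines, here is a minimal sketch of how those two components typically map onto a Platypush configuration: a hotword backend owns the audio recorder and the hotword engine, and hands control over to an assistant plugin (the speech engine) once the hotword fires. The model path and plugin choices below are only placeholders; each integration is configured in detail in the sections that follow.

```yaml
# Rough sketch, not a full config: the hotword backend captures the audio
# and runs hotword detection...
backend.assistant.snowboy:
    audio_gain: 1.0
    models:
        ok_google:
            voice_model_file: ~/models/OK Google.pmdl       # example model path
            assistant_plugin: assistant.google.pushtotalk   # speech engine to hand control to
            assistant_language: en-US

# ...while the assistant plugin performs the actual speech-to-text once triggered.
assistant.google.pushtotalk:
    language: en-US
```
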
There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of specific hotwords (like "Ok Google" or "Alexa"), and speech detectors, which instead do proper speech-to-text transcription using acoustic and language models. As you can imagine, continuously running full speech detection has a far higher overhead than running hotword detection, which only has to compare the captured audio against a usually short list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice's Rhino. Instead of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if you say *"Can I have a small double-shot espresso with a lot of sugar and some milk"* they may return something like `{"type": "espresso", "size": "small", "numberOfShots": 2, "sugar": "a lot", "milk": "some"}`.

In Platypush, I've built integrations to provide users with a wide choice when it comes to speech-to-text processors and engines. Let's go through some of the available integrations, and evaluate their pros and cons.

## Native Google Assistant library

### Integrations

- [`assistant.google`](https://platypush.readthedocs.io/en/latest/platypush/plugins/assistant.google.html) plugin (to programmatically start/stop conversations) and [`assistant.google`](https://platypush.readthedocs.io/en/latest/platypush/backend/assistant.google.html) backend (for continuous hotword detection).

### Configuration

- Create a Google project and download the `credentials.json` file from the [Google developers console](https://console.cloud.google.com/apis/credentials).

- Install the `google-oauthlib-tool`:

```shell
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
```

- Authenticate to use the `assistant-sdk-prototype` scope:

```shell
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json

google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
    --scope https://www.googleapis.com/auth/gcm \
    --save --headless --client-secrets $CREDENTIALS_FILE
```

- Install Platypush with the HTTP backend and Google Assistant library support:

```shell
[sudo] pip install 'platypush[http,google-assistant-legacy]'
```

- Create or add the lines to `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:

```yaml
backend.http:
    enabled: True

backend.assistant.google:
    enabled: True

assistant.google:
    enabled: True
```

- Start Platypush, say "Ok Google" and enjoy your assistant. On the web panel on `http://your-rpi:8008` you should be able to see your voice interactions in real-time.

### Features

- *Hotword detection*: **YES** ("Ok Google" or "Hey Google").
- *Speech detection*: **YES** (once the hotword is detected).
- *Detection runs locally*: **NO** (hotword detection [seems to] run locally, but once it's detected a channel is opened with Google servers for the interaction).

### Pros

- It implements most of the features that you'd find in any Google Assistant product. That includes native support for timers, calendars, customized responses on the basis of your profile and location, native integration with the devices configured in your Google Home, and so on. For more complex features, you'll have to write your custom Platypush hooks on e.g. speech-detected or conversation start/end events.
- Both hotword detection and speech detection are rock solid, as they rely on the Google cloud capabilities.

- Good performance even on older RaspberryPi models (the library isn't available for the Zero model or other ARMv6-based devices though), because most of the processing duties actually happen in the cloud. The audio processing thread takes around 2–3% of the CPU on a RaspberryPi 4.

### Cons

- The Google Assistant library used as a backend by the integration has been [deprecated by Google](https://developers.google.com/assistant/sdk/reference/library/python). It still works on most of the devices I've tried, as long as the latest version is used, but keep in mind that it's no longer maintained by Google and it could break in the future. Unfortunately, I'm still waiting for an official alternative.

- If your main goal is to operate voice-enabled services within a secure environment with no processing happening on someone else's cloud, then this is not your best option. The assistant library makes your computer behave more or less like a full Google Assistant device, including capturing audio and sending it to Google servers for processing and, potentially, review.

## Google Assistant Push-To-Talk Integration

### Integrations

- [`assistant.google.pushtotalk`](https://platypush.readthedocs.io/en/latest/platypush/plugins/assistant.google.pushtotalk.html) plugin.

### Configuration

- Create a Google project and download the `credentials.json` file from the [Google developers console](https://console.cloud.google.com/apis/credentials).

- Install the `google-oauthlib-tool`:

```shell
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
```

- Authenticate to use the `assistant-sdk-prototype` scope:

```shell
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json

google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
    --scope https://www.googleapis.com/auth/gcm \
    --save --headless --client-secrets $CREDENTIALS_FILE
```

- Install Platypush with the HTTP backend and Google Assistant SDK support:

```shell
[sudo] pip install 'platypush[http,google-assistant]'
```

- Create or add the lines to `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:

```yaml
backend.http:
    enabled: True

assistant.google.pushtotalk:
    language: en-US
```

- Start Platypush. Unlike the native Google library integration, the push-to-talk plugin doesn't come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:

```shell
curl -XPOST -H 'Content-Type: application/json' -d '
{
  "type":"request",
  "action":"assistant.google.pushtotalk.start_conversation"
}' -u 'username:password' http://your-rpi:8008/execute
```

### Features

- *Hotword detection*: **NO** (call `start_conversation` or `stop_conversation` from your logic or from the context of a hotword integration like Snowboy, DeepSpeech or PicoVoice to trigger or stop the assistant).

- *Speech detection*: **YES**.

- *Detection runs locally*: **NO** (you can customize the hotword engine and how to trigger the assistant, but once a conversation is started a channel is opened with Google servers).
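
Because the push-to-talk plugin still provides the full transcript of what you said, you can attach your own logic to recognized phrases through regular Platypush event hooks. A minimal sketch, assuming you have the `light.hue` plugin configured (the phrase and the action are just examples):

```yaml
event.hook.OnLightsOnCommand:
    if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        phrase: "turn on the lights"    # example phrase
    then:
        - action: light.hue.on          # example action, assumes the light.hue plugin is configured
```

The same hook also works with the native assistant library integration, since both emit the same assistant events.
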
### Pros

- It implements many of the features you'd find in any Google Assistant product out there, even though hotword detection isn't available and some of the features currently available on the assistant library aren't provided (like timers or alarms).

- Rock-solid speech detection, using the same speech model used by Google Assistant products.

- Relatively good performance even on older RaspberryPi models. It's also available for the ARMv6 architecture, which also makes it suitable for RaspberryPi Zero or other low-power devices. With no hotword engine running, it uses resources only when you call `start_conversation`.

- It provides the benefits of the Google Assistant speech engine with no need to have a 24/7 open connection between your mic and Google's servers. The connection is only opened upon `start_conversation`. This makes it a good option if privacy is a concern, or if you want to build more flexible assistants that can be triggered through different hotword engines (or even build assistants that are triggered in different languages depending on the hotword that you use), or assistants that aren't triggered by a hotword at all — for example, you can call `start_conversation` upon a button press, a motion sensor event or a web call.

### Cons

- I built this integration after the Google Assistant library was deprecated with no official alternative provided. I've built it by refactoring the poorly refined code provided by Google in its samples ([`pushtotalk.py`](https://github.com/googlesamples/assistant-sdk-python/blob/master/google-assistant-sdk/googlesamples/assistant/grpc/pushtotalk.py)) and making a proper plugin out of it. It works, but keep in mind that it's based on some ugly code that's waiting to be replaced by Google.

- No hotword support. You'll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.

## Alexa Integration

### Integrations

- [`assistant.echo`](https://platypush.readthedocs.io/en/latest/platypush/plugins/assistant.echo.html) plugin.

### Configuration

- Install Platypush with the HTTP backend and Alexa support:

```shell
[sudo] pip install 'platypush[http,alexa]'
```

- Run `alexa-auth`. It will start a local web server on your machine on `http://your-rpi:3000`. Open it in your browser and authenticate with your Amazon account. A credentials file should be generated under `~/.avs.json`.

- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:

```yaml
backend.http:
    enabled: True

assistant.echo:
    enabled: True
```

- Start Platypush. The Alexa integration doesn't come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:

```shell
curl -XPOST -H 'Content-Type: application/json' -d '
{
  "type":"request",
  "action":"assistant.echo.start_conversation"
}' -u 'username:password' http://your-rpi:8008/execute
```

### Features

- *Hotword detection*: **NO** (call `start_conversation` or `stop_conversation` from your logic or from the context of a hotword integration like Snowboy or PicoVoice to trigger or stop the assistant).

- *Speech detection*: **YES** (although limited: transcription of the processed audio won't be provided).

- *Detection runs locally*: **NO**.
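
Since the transcript isn't available, phrase-based hooks like the one shown for the Google push-to-talk plugin aren't an option here. You can still react to the generic conversation events, though. A sketch, assuming the `music.mpd` plugin is configured and that the integration emits the shared assistant conversation events:

```yaml
event.hook.OnAssistantConversationStart:
    if:
        type: platypush.message.event.assistant.ConversationStartEvent
    then:
        - action: music.mpd.pause    # example: pause playback while you talk to the assistant
```
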
### Pros

- It implements many of the features that you'd find in any Alexa product out there, even though hotword detection isn't available. Also, the support for skills or media control may be limited.

- Good speech detection capabilities, although inferior to the Google Assistant when it comes to accuracy.

- Good performance even on low-power devices. With no hotword engine running, it uses resources only when you call `start_conversation`.

- It provides some of the benefits of an Alexa device but with no need for a 24/7 open connection between your mic and Amazon's servers. The connection is only opened upon `start_conversation`.

### Cons

- The situation is extremely fragmented when it comes to Alexa voice SDKs. Amazon eventually re-released the AVS (Alexa Voice Service), mostly with commercial uses in mind, but its features are still quite limited compared to the Google Assistant products. The biggest limitation is the fact that the AVS works on raw audio input and spits back raw audio responses. It means that text transcription, either for the request or the response, won't be available. That limits what you can build with it. For example, you won't be able to capture custom requests through event hooks.

- No hotword support. You'll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.

## Snowboy Integration

### Integrations

- [`assistant.snowboy`](https://platypush.readthedocs.io/en/latest/platypush/backend/assistant.snowboy.html) backend.

### Configuration

- Install Platypush with the HTTP backend and Snowboy support:

```shell
[sudo] pip install 'platypush[http,snowboy]'
```

- Choose your hotword model(s). Some are available under `SNOWBOY_INSTALL_DIR/resources/models`. Otherwise, you can train or download models from the [Snowboy website](https://snowboy.kitt.ai/).

- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the assistant integration:

```yaml
backend.http:
    enabled: True

backend.assistant.snowboy:
    audio_gain: 1.2
    models:
        # Trigger the Google assistant in Italian when I say "computer"
        computer:
            voice_model_file: ~/models/computer.umdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: it-IT
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger the Google assistant in English when I say "OK Google"
        ok_google:
            voice_model_file: ~/models/OK Google.pmdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger Alexa when I say "Alexa"
        alexa:
            voice_model_file: ~/models/Alexa.pmdl
            assistant_plugin: assistant.echo
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.5
```

- Start Platypush. Say the hotword associated with one of your models, check in the logs that the [`HotwordDetectedEvent`](https://platypush.readthedocs.io/en/latest/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent) is triggered and that, if there's an assistant plugin associated with the hotword, the corresponding assistant is correctly started.

### Features

- *Hotword detection*: **YES**.
- *Speech detection*: **NO**.
- *Detection runs locally*: **YES**.

### Pros

- I've been an early fan and supporter of the Snowboy project. I really like the idea of crowd-powered machine learning.
You can download any hotword models for free from their website, provided that you record three audio samples of yourself saying that word in order to help improve the model. You can also create your custom hotword model, and if enough people are interested in using it then they'll contribute with their samples, and the model will become more robust over time. I believe that more machine learning projects out there could really benefit from this "use it for free as long as you help improve the model" paradigm.

- Platypush was an early supporter of Snowboy, so its integration is well-supported and extensively documented. You can natively configure custom assistant plugins to be executed when a certain hotword is detected, making it easy to build a multi-language and multi-hotword voice assistant.

- Good performance, even on low-power devices. I've used Snowboy in combination with the Google Assistant push-to-talk integration for a while on single-core RaspberryPi Zero devices, and the CPU usage from hotword processing never exceeded 20–25%.

- The hotword detection runs locally, on models that are downloaded locally. That means no need for a network connection to run and no data exchanged with any cloud.

### Cons

- Even though the idea of crowd-powered voice models is definitely interesting and has plenty of potential to scale up, the most popular models on their website have been trained with at most 2000 samples. And (sadly as well as expectedly) most of those voice samples belong to white, young-adult males, which makes many of these models perform quite poorly with speech recorded from individuals who don't fit that category (and also with people who aren't native English speakers).

## Mozilla DeepSpeech

### Integrations

- [`stt.deepspeech`](https://platypush.readthedocs.io/en/latest/platypush/plugins/stt.deepspeech.html) plugin and [`stt.deepspeech`](https://platypush.readthedocs.io/en/latest/platypush/backend/stt.deepspeech.html) backend (for continuous detection).

### Configuration

- Install Platypush with the HTTP backend and Mozilla DeepSpeech support. Take note of the version of DeepSpeech that gets installed:

```shell
[sudo] pip install 'platypush[http,deepspeech]'
```

- Download the TensorFlow model files for the version of DeepSpeech that has been installed. This may take a while depending on your connection:

```shell
export MODELS_DIR=~/models
export DEEPSPEECH_VERSION=0.6.1

wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz

tar xvf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
# x deepspeech-0.6.1-models/
# x deepspeech-0.6.1-models/lm.binary
# x deepspeech-0.6.1-models/output_graph.pbmm
# x deepspeech-0.6.1-models/output_graph.pb
# x deepspeech-0.6.1-models/trie
# x deepspeech-0.6.1-models/output_graph.tflite

mv deepspeech-$DEEPSPEECH_VERSION-models $MODELS_DIR
```

- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the DeepSpeech integration:

```yaml
backend.http:
    enabled: True

stt.deepspeech:
    model_file: ~/models/output_graph.pbmm
    lm_file: ~/models/lm.binary
    trie_file: ~/models/trie

    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

    conversation_timeout: 5

backend.stt.deepspeech:
    enabled: True
```

- Start Platypush. Speech detection will start running on startup.
[`SpeechDetectedEvents`](https://platypush.readthedocs.io/en/latest/platypush/events/stt.html#platypush.message.event.stt.SpeechDetectedEvent) will be triggered when you talk. [`HotwordDetectedEvents`](https://platypush.readthedocs.io/en/latest/platypush/events/stt.html#platypush.message.event.stt.HotwordDetectedEvent) will be triggered when you say one of the configured hotwords. [`ConversationDetectedEvents`](https://platypush.readthedocs.io/en/latest/platypush/events/stt.html#platypush.message.event.stt.ConversationDetectedEvent) will be triggered when you say something after a hotword, with the speech provided as an argument. You can also disable the continuous detection and only start it programmatically by calling `stt.deepspeech.start_detection` and `stt.deepspeech.stop_detection`. You can also use it to perform offline speech transcription from audio files:

```shell
curl -XPOST -H 'Content-Type: application/json' -d '
{
  "type":"request",
  "action":"stt.deepspeech.detect",
  "args": {
    "audio_file": "~/audio.wav"
  }
}' -u 'username:password' http://your-rpi:8008/execute

# Example response
{
  "type":"response",
  "target":"http",
  "response": {
    "errors":[],
    "output": {
      "speech": "This is a test"
    }
  }
}
```

### Features

- *Hotword detection*: **YES**.
- *Speech detection*: **YES**.
- *Detection runs locally*: **YES**.

### Pros

- I've been honestly impressed by the features of DeepSpeech and the progress they've made starting from version 0.6.0. Mozilla made it easy to run both hotword and speech detection on-device with no need for any third-party services or network connection. The full codebase is open-source and the TensorFlow voice and language models are also very good. It's amazing that they've released the whole thing for free to the community. It also means that you can easily extend the TensorFlow model by training it with your own samples.

- Speech-to-text transcription of audio files can be a very useful feature.

### Cons

- DeepSpeech is quite demanding when it comes to CPU resources. It runs OK on a laptop or on a RaspberryPi 4 (although in my tests speech detection took 100% of a core on a RaspberryPi 4), but it may be too resource-intensive to run on less powerful machines.

- DeepSpeech has a bit more delay than other solutions. The engineers at Mozilla have worked a lot to make the model as small and performant as possible, and they claim to have achieved real-time performance on a RaspberryPi 4. In reality, all of my tests showed between 2 and 4 seconds of delay between speech capture and detection.

- DeepSpeech is relatively good at detecting speech, but not at interpreting the semantic context (that's something where Google still wins hands down). If you say "this is a test," the model may actually capture "these is a test." "This" and "these" do indeed sound almost the same in English, but the Google assistant has a better semantic engine to detect the right interpretation of such ambiguous cases. DeepSpeech works quite well for speech-to-text transcription purposes but, in such ambiguous cases, it lacks some semantic context.

- Even though it's possible to use DeepSpeech from Platypush as a hotword detection engine, keep in mind that it's not how the engine is intended to be used. Hotword engines usually run against smaller and more performant models only intended to detect one or a few words, not against a full-featured language model.
The best usage of DeepSpeech is probably either for offline text transcription, or with another hotword integration while leveraging DeepSpeech for the speech detection part.

## PicoVoice

[PicoVoice](https://github.com/Picovoice/) is a very promising company that has released several products for performing voice detection on-device. Among them:

- [*Porcupine*](https://github.com/Picovoice/porcupine), a hotword engine.
- [*Leopard*](https://github.com/Picovoice/leopard), a speech-to-text offline transcription engine.
- [*Cheetah*](https://github.com/Picovoice/cheetah), a speech-to-text engine for real-time applications.
- [*Rhino*](https://github.com/Picovoice/rhino), a speech-to-intent engine.

So far, Platypush provides integrations with Porcupine and Cheetah.

### Integrations

- *Hotword engine*: [`stt.picovoice.hotword`](https://platypush.readthedocs.io/en/latest/platypush/plugins/stt.picovoice.hotword.html) plugin and [`stt.picovoice.hotword`](https://platypush.readthedocs.io/en/latest/platypush/backend/stt.picovoice.hotword.html) backend (for continuous detection).

- *Speech engine*: [`stt.picovoice.speech`](https://platypush.readthedocs.io/en/latest/platypush/plugins/stt.picovoice.speech.html) plugin and [`stt.picovoice.speech`](https://platypush.readthedocs.io/en/latest/platypush/backend/stt.picovoice.speech.html) backend (for continuous detection).

### Configuration

- Install Platypush with the HTTP backend and the PicoVoice hotword integration and/or speech integration:

```shell
[sudo] pip install 'platypush[http,picovoice-hotword,picovoice-speech]'
```

- Create or add the lines to your `~/.config/platypush/config.yaml` to enable the webserver and the PicoVoice integration:

```yaml
stt.picovoice.hotword:
    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

# Enable continuous hotword detection
backend.stt.picovoice.hotword:
    enabled: True

# Enable continuous speech detection
# backend.stt.picovoice.speech:
#     enabled: True

# Or start speech detection when a hotword is detected
event.hook.OnHotwordDetected:
    if:
        type: platypush.message.event.stt.HotwordDetectedEvent
    then:
        # Start a timer that stops the detection in 10 seconds
        - action: utils.set_timeout
          args:
              seconds: 10
              name: StopSpeechDetection
              actions:
                  - action: stt.picovoice.speech.stop_detection

        - action: stt.picovoice.speech.start_detection
```

- Start Platypush and enjoy your on-device voice assistant.

### Features

- *Hotword detection*: **YES**.
- *Speech detection*: **YES**.
- *Detection runs locally*: **YES**.

### Pros

- When it comes to on-device voice engines, PicoVoice products are probably the best solution out there. Their hotword engine is far more accurate than Snowboy and it manages to be even less CPU-intensive. Their speech engine has much less delay than DeepSpeech and it's also much less power-hungry — it will still run well and with low latency even on older models of RaspberryPi.

### Cons

- While PicoVoice provides Python SDKs, their native libraries are closed source. It means that I couldn't dig much into how they've solved the problem.

- Their hotword engine (Porcupine) can be installed and run free of charge for personal use on any device, but if you want to expand the set of keywords provided by default, or add more samples to train the existing models, then you'll have to go for a commercial license.
Their speech engine (Cheetah) instead can only be installed and run free of charge for personal use on Linux on the x86_64 architecture. Any other architecture or operating system, as well as any option to extend the model or use a different one, is only available through a commercial license. While I understand their point and their business model, I'd have been super-happy to just pay for a license through a more friendly process, instead of relying on the old-fashioned "contact us for a commercial license / we'll get back to you" paradigm.

- Cheetah's speech engine still suffers from some of the issues of DeepSpeech when it comes to semantic context/intent detection. The "this/these" ambiguity also happens here. However, these problems can be partially solved by using Rhino, PicoVoice's speech-to-intent engine, which will provide a structured representation of the speech intent instead of a letter-by-letter transcription. That said, I haven't yet worked on integrating Rhino into Platypush.

## Conclusions

The democratization of voice technology has long been dreamed about, and it's finally (slowly) coming. The situation out there is still quite fragmented though, and some commercial SDKs may still get deprecated with short notice or no notice at all. But at least some solutions are emerging to bring speech detection to all devices.

I've built integrations in Platypush for all of these services because I believe that it's up to users, not to businesses, to decide how people should use and benefit from voice technology. Moreover, having so many voice integrations in the same product — and especially having voice integrations that all expose the same API and generate the same events — makes it very easy to write assistant-agnostic logic, and really decouple the task of speech recognition from the business logic that runs on voice commands.

Check out [my previous article](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush) to learn how to write your own custom hooks in Platypush on speech detection, hotword detection and speech start/stop events.

To summarize my findings so far:

- Use the native **Google Assistant** integration if you want to have a full Google experience, and if you're ok with Google servers processing your audio and with the possibility that somewhere in the future the deprecated Google Assistant library won't work anymore.

- Use the **Google push-to-talk** integration if you only want to have the assistant, without hotword detection, or you want your assistant to be triggered by alternative hotwords.

- Use the **Alexa** integration if you already have an Amazon-powered ecosystem and you're ok with having less flexibility when it comes to custom hooks because of the unavailability of speech transcript features in the AVS.

- Use **Snowboy** if you want a flexible, open-source and crowd-powered engine for hotword detection that runs on-device, and/or you want to use multiple assistants at the same time through different hotword models, even if the models may not be that accurate.

- Use **Mozilla DeepSpeech** if you want a fully on-device open-source engine powered by a robust TensorFlow model, even if it takes more CPU load and adds a bit more latency.

- Use **PicoVoice** solutions if you want a full voice stack that runs on-device and is both accurate and performant, even though you'll need a commercial license to use it on some devices or to extend/change the model.
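
As a closing sketch of what that decoupling can look like in practice, the snippet below keeps the business logic in a procedure and only references it from a speech hook, so the same logic can be reused no matter which integration recognized the phrase. The `light.hue` and `music.mpd` actions are just examples, assuming those plugins are configured:

```yaml
# The business logic lives in a procedure, independent of any assistant integration
procedure.good_night:
    - action: light.hue.off
    - action: music.mpd.stop

# Any integration that emits SpeechRecognizedEvent can trigger the same procedure
event.hook.OnGoodNightCommand:
    if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        phrase: "good night"
    then:
        - action: procedure.good_night
```
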
+ +Let me know your thoughts on these solutions and your experience with these integrations!