hook your own custom logic and scripts when certain phrases are recognized, without writing any code.
Since I wrote that article, a few things have changed:
- When I wrote the article, Platypush only supported the Google Assistant as a voice back end. Since then, I’ve
worked on [supporting Alexa as well](https://github.com/BlackLight/platypush/issues/80). Feel free to use the
`assistant.echo` integration in Platypush if you’re an Alexa fan, but bear in mind that it’s more limited than the
existing Google Assistant-based options, because of limitations in the AVS (Alexa Voice Service). Since the AVS mostly
takes audio as input and produces audio as output, it exposes neither the transcript of the detected speech nor the
transcript of the rendered response, which makes it impossible to attach custom hooks to specific phrases. It may
also suffer from some minor audio glitches, at least on RaspberryPi.
- Although the library is deprecated, a new release of the Google Assistant
Library [has been made available](https://github.com/googlesamples/assistant-sdk-python/releases/tag/0.6.0) to fix the
segmentation fault issue on RaspberryPi 4. I’ve buzzed the developers often over the past year and I’m glad that it’s
been done! It’s good news because the Assistant library has the best engine for hotword detection I’ve seen. No other
SDK I’ve tried — Snowboy, DeepSpeech, or PicoVoice — comes close to the native “Ok Google” hotword detection accuracy
and performance. The news isn’t all good, however: the library is still deprecated, and no alternative is currently
on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But
at least one of the best options out there to build a voice assistant will still work for a while. Those interested in
building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.
- In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more
state-of-the-art alternatives. I’ve been a long-time fan of [Snowboy](https://snowboy.kitt.ai/), which has a
well-supported Platypush integration, and I’ve used it as a hotword engine to trigger other assistant integrations for
a long time. However, when it comes to accuracy in real-time scenarios, even its best models aren’t that satisfactory.
I’ve also experimented with
[Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech) and [PicoVoice](https://github.com/Picovoice) products
for speech detection, and I’ve built integrations for them in Platypush. In this article, I’ll try to provide a comprehensive overview
of what’s currently possible with DIY voice assistants and a comparison of the integrations I’ve built.
- **EDIT January 2021**: Unfortunately, as of Dec 31st,
2020 [Snowboy has been officially shut down](https://github.com/Kitt-AI/snowboy/). The GitHub repository is still
there; you can still clone it and either use the example models provided under `resources/models`, train a model
using the Python API, or use any of your previously trained models. However, the repo is no longer maintained, and the
website that could be used to browse and generate user models is no longer available. It’s a real shame: the user
models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained
open-source project, and its demise shows how difficult it is to keep such projects alive when nobody funds the time
the developers invest in them. Anyway, most of the Snowboy examples reported in this article will still work
if you download and install the code from the repo, as in the sketch below.
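For reference, this is roughly how you can run one of the bundled models through the Python API. The sketch is adapted from the demo scripts under `examples/Python` in the Snowboy repo; the exact module layout and paths may differ depending on how you built and installed the package:

```python
# Minimal Snowboy usage sketch, adapted from the demos under examples/Python
# in the Kitt-AI repo. It assumes the compiled `snowboydecoder` module is on
# your PYTHONPATH and that you run it from the root of the cloned repo.
import snowboydecoder

def on_hotword():
    print('Hotword detected!')

detector = snowboydecoder.HotwordDetector(
    'resources/models/snowboy.umdl',  # or any .pmdl/.umdl you trained earlier
    sensitivity=0.5,                  # higher = more sensitive, more false positives
)

# Blocks and keeps scanning the microphone input until interrupted
detector.start(detected_callback=on_hotword, sleep_time=0.03)
detector.terminate()
```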
## The Case for DIY Voice Assistants
Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:
- **Privacy**. The easiest one to guess! I’m not sure if a microphone in the house, active 24/7, connected to a private
company through the internet is a proportionate price to pay for between five and ten interactions a day to toggle the
lightbulbs, turn on the thermostat, or play a Spotify playlist. I’ve built the voice assistant integrations in
Platypush with the goal of giving people the option of voice-enabled services without sending all of their daily voice
interactions over a privately-owned channel through a privately-owned box.
- **Compatibility**. A Google Assistant device will only work with devices that support Google Assistant. The same goes
for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily,
depending on the availability of the cloud connections, or permanently, because of hardware or software deprecation or
other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to
interact with, and does not depend on business decisions.
- **Flexibility**. Even when a device works with your assistant, you’re still bound to the features that have been
agreed upon and implemented by the two parties. Implementing more complex routines on top of voice commands is usually
tricky. In most cases, it involves writing code that runs in the cloud (in the form of Actions, Lambdas, or
IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must be able
to run whichever logic I want on whichever device I want, triggered by whichever custom shortcut I want (even with
regex matching), regardless of the complexity (see the example hook after this list). I also aimed to build an
assistant that can provide multiple services (Google, Alexa, Siri, etc.) in multiple languages on the same device,
simply by using different hotwords.
- **Hardware constraints**. I’ve never understood the case for selling plastic boxes with an embedded microphone and
speaker as the entry ticket to the world of voice services. That was a good way to showcase the idea, but after a
couple of years of experiments, it’s probably time to expect the industry to provide a voice assistant experience
that can run on any device, as long as it has a microphone and a controller unit that can run code. As with
compatibility, there should be no such thing as Google-compatible or Alexa-compatible devices. Any device should be
compatible with any assistant, as long as that device has a way to communicate with the outside world, and the logic
to control that device should be able to run on the same network that the device belongs to.
- **Cloud vs. local processing**. Most of the commercial voice assistants operate by regularly capturing streams of
audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another
connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In
some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice
assistants are dumb terminals intended to communicate with cloud providers that actually do most of the job, and they
exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets
are low-power devices that operate within a fast network and you don’t need much flexibility. But if you can afford to
process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you
want to do things that you usually can’t do with off-the-shelf solutions, you may want to process as much of the load
as possible on your device. I understand the case for a cloud-oriented approach when it comes to voice assistants but,
regardless of the technology, we should always be provided with a choice between decentralized and centralized
computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or
on-cloud, depending on the use case and depending on the user’s preference.
- **Scalability**. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash a copy
of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it’s done, without having to
buy a new plastic box. If I need a voice-powered music speaker, I just take an existing speaker and plug it into a
RaspberryPi. If I need a voice-powered display, I just take an existing display and plug it into a RaspberryPi. If I
need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi,
without having to worry about whether it’s supported in my Google Home or Alexa app. Any device should be given the
possibility of becoming a smart device.
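As an example of the flexibility argument above, here’s a sketch of a custom Platypush hook that reacts to a recognized phrase. It reflects the Platypush scripting API at the time of writing (module paths may differ across versions), and `light.hue` is just an example plugin:

```python
# Sketch of a custom Platypush event hook, to be dropped in a script under
# ~/.config/platypush/scripts. `light.hue` is just an example plugin; swap
# in whichever plugin you actually use.
from platypush.context import get_plugin
from platypush.event.hook import hook
from platypush.message.event.assistant import SpeechRecognizedEvent

@hook(SpeechRecognizedEvent, phrase='turn on the ${room} lights')
def on_lights_on_command(event, room=None, **context):
    # ${room} is a template token: whatever word the user says in that
    # position is passed to the hook as the `room` keyword argument
    get_plugin('light.hue').on(groups=[room])
```

Since the hook runs inside your own network, it can trigger any logic you like: shell commands, HTTP calls to local services, or any other Platypush plugin.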
## Overview of the voice assistant integrations
A voice assistant usually consists of two components:
- An **audio recorder** that captures frames from an audio input device
- A **speech engine** that consumes those frames and keeps track of the current speech context.
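Schematically, the interplay between the two components looks something like the sketch below. This is illustrative glue code, not Platypush internals: only the PyAudio calls are real API, while `engine` and its `process_frame` method are hypothetical names for whichever speech engine you plug in:

```python
# Illustrative sketch of the recorder/engine split (not actual Platypush code).
# `engine` is a hypothetical object that consumes raw PCM frames and returns
# a result once it recognizes a hotword or phrase.
import pyaudio

SAMPLE_RATE = 16000    # most speech engines expect 16 kHz mono PCM
FRAME_LENGTH = 512     # samples per captured frame

def run_assistant(engine):
    audio = pyaudio.PyAudio()
    stream = audio.open(rate=SAMPLE_RATE, channels=1, format=pyaudio.paInt16,
                        input=True, frames_per_buffer=FRAME_LENGTH)

    try:
        while True:
            # The audio recorder: capture raw frames from the input device
            frame = stream.read(FRAME_LENGTH, exception_on_overflow=False)
            # The speech engine: keeps context across frames and emits a
            # result once a hotword or phrase has been fully recognized
            result = engine.process_frame(frame)
            if result:
                print('Recognized:', result)
    finally:
        stream.stop_stream()
        stream.close()
        audio.terminate()
```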
There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of
specific hotwords (like “Ok Google” or “Alexa”), and speech detectors, which instead do proper speech-to-text
transcription using acoustic and language models. As you can imagine, continuously running full speech detection has a
far higher overhead than running hotword detection alone, which only has to compare the captured audio against a
usually short list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice’s Rhino. Instead
of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if
you say *“Can I have a small double-shot espresso with a lot of sugar and some milk”* they may return something like `{"