blog/markdown/Build-custom-voice-assistants.md at fediverse-article

Fabio Manganiello f34f1f6232 Refactored Platypush blog repo.

Removed all the Python logic + templates and styles.
Those have now been moved to a stand-alone project (madblog),
therefore this repo should only contain the static blog
pages and images.

2022-01-12 00:56:39 +01:00


I wrote an article a while ago that describes how to make your own Google-based voice assistant using just a RaspberryPi, Platypush, a speaker and a microphone.

It also showed how to make your own custom hotword model that triggers the assistant if you don’t want to say “Ok Google”, or if you want distinct hotwords to trigger different assistants in different languages. It also showed how to hook your own custom logic and scripts when certain phrases are recognized, without writing any code.

Since I wrote that article, a few things have changed:

When I wrote the article, Platypush only supported the Google Assistant as a voice back end. In the meantime, I’ve worked on supporting Alexa as well. Feel free to use the assistant.echo integration in Platypush if you’re an Alexa fan, but bear in mind that it’s more limited than the existing Google Assistant based options because of limitations in the AVS (Alexa Voice Service). For example, it provides neither the transcript of the detected text, which means it’s not possible to insert custom hooks, nor the transcript of the rendered response, because the AVS mostly works with audio files as input and provides audio as output. It could also experience some minor audio glitches, at least on RaspberryPi.
Although deprecated, a new release of the Google Assistant Library has been made available to fix the segmentation fault issue on RaspberryPi 4. I’ve buzzed the developers often over the past year and I’m glad that it’s been done! It’s good news because the Assistant library has the best engine for hotword detection I’ve seen. No other SDK I’ve tried (Snowboy, DeepSpeech, or PicoVoice) comes close to the native “Ok Google” hotword detection accuracy and performance. The news isn’t all good, however: the library is still deprecated, with no alternative currently on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But at least one of the best options out there to build a voice assistant will still work for a while. Those interested in building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.
In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more state-of-the-art alternatives. I’ve been a long-time fan of Snowboy, which has a well-supported Platypush integration, and I’ve used it as a hotword engine to trigger other assistant integrations for a long time. However, when it comes to accuracy in real-time scenarios, even its best models aren’t that satisfactory. I’ve also experimented with Mozilla DeepSpeech and PicoVoice products for voice detection, and built integrations for them in Platypush. In this article, I’ll try to provide a comprehensive overview of what’s currently possible with DIY voice assistants and a comparison of the integrations I’ve built.
EDIT January 2021: Unfortunately, as of Dec 31st, 2020 Snowboy has been officially shut down. The GitHub repository is still there: you can still clone it and either use the example models provided under resources/models, train a model using the Python API, or use any of your previously trained models. However, the repo is no longer maintained, and the website that could be used to browse and generate user models is no longer available. It's really a shame - the user models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained open-source project, and it just shows how difficult it is to keep such projects alive without anybody funding the time invested by the developers in them. Anyway, most of the Snowboy examples reported in this article will still work if you download and install the code from the repo.

The Case for DIY Voice Assistants

Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:

Privacy. The easiest one to guess! I’m not sure if a microphone in the house, active 24/7, connected to a private company through the internet is a proportionate price to pay for between five and ten interactions a day to toggle the lightbulbs, turn on the thermostat, or play a Spotify playlist. I’ve built the voice assistant integrations in platypush with the goal of giving people the option of voice-enabled services without sending all of the daily voice interactions over a privately-owned channel through a privately-owned box.
Compatibility. A Google Assistant device will only work with devices that support Google Assistant. The same goes for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily, depending on the availability of the cloud connections, or permanently, because of hardware or software deprecation or other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to interact with, and does not depend on business decisions.
Flexibility. Even when a device works with your assistant, you’re still bound to the features that have been agreed and implemented by the two parties. Implementing more complex routines over voice commands is usually tricky. In most cases, it involves creating code that will run on the cloud (either in the form of Actions or Lambdas, or IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must have the ability to run whichever logic I want on whichever device I want, using whichever custom shortcut I want (even with regex matching), regardless of the complexity. I also aimed to build an assistant that can provide multiple services (Google, Alexa, Siri etc.) in multiple languages on the same device, simply by using different hotwords.
Hardware constraints. I’ve never understood the case for selling plastic boxes that embed a microphone and a speaker in order to enter the world of voice services. That was a good way to showcase the idea. After a couple of years of experiments, it’s probably time to expect the industry to provide a voice assistant experience that can run on any device, as long as it has a microphone and a controller unit that can process code. As for compatibility, there should be no case for Google-compatible or Alexa-compatible devices. Any device should be compatible with any assistant, as long as that device has a way to communicate with the outside world. The logic to control that device should be able to run on the same network that the device belongs to.
Cloud vs. local processing. Most of the commercial voice assistants operate by regularly capturing streams of audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice assistants are dumb terminals intended to communicate with cloud providers that actually do most of the job, and they exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets are low-power devices that operate within a fast network and you don’t need much flexibility. But if you can afford to process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you want to do things that you usually can’t do with off-the-shelf solutions, you may want to process as much of the load as possible on your device. I understand the case for a cloud-oriented approach when it comes to voice assistants but, regardless of the technology, we should always be provided with a choice between decentralized and centralized computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or on-cloud, depending on the use case and depending on the user’s preference.
Scalability. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash the copy of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it’s done. Without having to buy a new plastic box. If I need a voice-powered music speaker, I just take an existing speaker and plug it into a RaspberryPi. If I need a voice-powered display, I just take an existing display and plug it into a RaspberryPi. If I need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi, without having to worry about whether it’s supported in my Google Home or Alexa app. Any device should be given the possibility of becoming a smart device.

Overview of the voice assistant integrations

A voice assistant usually consists of two components:

An audio recorder that captures frames from an audio input device
A speech engine that keeps track of the current context.

There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of specific hotwords (like “Ok Google” or “Alexa”), and speech detectors, which instead do proper speech-to-text transcription using acoustic and language models. As you can imagine, continuously running a full speech detection has a far higher overhead than just running hotword detection, which only has to compare the captured speech against the, usually short, list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice’s Rhino. Instead of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if you say “Can I have a small double-shot espresso with a lot of sugar and some milk” they may return something like {"type": "espresso", "size": "small", "numberOfShots": 2, "sugar": "a lot", "milk": "some"}.
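
To make the difference between a transcript and an intent more concrete, here is a minimal Python sketch that consumes a structure shaped like the espresso example above. The `describe_order` helper and the way missing fields are handled are illustrative assumptions; they are not part of Rhino's or Platypush's API.

```python
import json

# Hypothetical speech-to-intent output, shaped like the espresso example above
intent_json = (
    '{"type": "espresso", "size": "small", "numberOfShots": 2, '
    '"sugar": "a lot", "milk": "some"}'
)


def describe_order(intent: dict) -> str:
    """Turn a structured intent into a one-line order summary (illustrative only)."""
    shots = intent.get("numberOfShots", 1)
    parts = [f"{intent.get('size', 'regular')} {intent['type']}", f"{shots} shot(s)"]
    # Optional slots are simply skipped when the engine did not fill them
    for extra in ("sugar", "milk"):
        if extra in intent:
            parts.append(f"{extra}: {intent[extra]}")
    return ", ".join(parts)


order = describe_order(json.loads(intent_json))
print(order)
```

With a plain transcript you would have to extract the size, the number of shots and the extras yourself, whereas here the business logic only reads named fields.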

In Platypush, I’ve built integrations to provide users with a wide choice when it comes to speech-to-text processors and engines. Let’s go through some of the available integrations, and evaluate their pros and cons.

Native Google Assistant library

Integrations

assistant.google plugin (to programmatically start/stop conversations) and assistant.google backend (for continuous hotword detection).

Configuration

Create a Google project and download the credentials.json file from the Google developers console.
Install the google-oauthlib-tool:

[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'

Authenticate to use the assistant-sdk-prototype scope:

export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json

google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
      --scope https://www.googleapis.com/auth/gcm \
      --save --headless --client-secrets $CREDENTIALS_FILE

Install Platypush with the HTTP backend and Google Assistant library support:

[sudo] pip install 'platypush[http,google-assistant-legacy]'

Create or add the lines to ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True
    
backend.assistant.google:
    enabled: True
    
assistant.google:
    enabled: True

Start Platypush, say “Ok Google” and enjoy your assistant. On the web panel on http://your-rpi:8008 you should be able to see your voice interactions in real-time.

Features

Hotword detection: YES (“Ok Google” or “Hey Google”).
Speech detection: YES (once the hotword is detected).
Detection runs locally: NO (hotword detection [seems to] run locally, but once it's detected a channel is opened with Google servers for the interaction).

Pros

It implements most of the features that you’d find in any Google Assistant products. That includes native support for timers, calendars, customized responses on the basis of your profile and location, native integration with the devices configured in your Google Home, and so on. For more complex features, you’ll have to write your custom platypush hooks on e.g. speech detected or conversation start/end events.
Both hotword detection and speech detection are rock solid, as they rely on the Google cloud capabilities.
Good performance even on older RaspberryPi models (the library isn’t available for the Zero model or other arm6-based devices though), because most of the processing duties actually happen in the cloud. The audio processing thread takes around 2–3% of the CPU on a RaspberryPi 4.

Cons

The Google Assistant library used as a backend by the integration has been deprecated by Google. It still works on most of the devices I’ve tried, as long as the latest version is used, but keep in mind that it’s no longer maintained by Google and it could break in the future. Unfortunately, I’m still waiting for an official alternative.
If your main goal is to operate voice-enabled services within a secure environment with no processing happening on someone else’s cloud, then this is not your best option. The assistant library makes your computer behave more or less like a full Google Assistant device, including capturing audio and sending it to Google servers for processing and, potentially, review.

Google Assistant Push-To-Talk Integration

Integrations

assistant.google.pushtotalk plugin.

Configuration

Create a Google project and download the credentials.json file from the Google developers console.
Install the google-oauthlib-tool:

[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'

Authenticate to use the assistant-sdk-prototype scope:

export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json

google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
      --scope https://www.googleapis.com/auth/gcm \
      --save --headless --client-secrets $CREDENTIALS_FILE

Install Platypush with the HTTP backend and Google Assistant SDK support:

[sudo] pip install 'platypush[http,google-assistant]'

Create or add the lines to ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True
    
assistant.google.pushtotalk:
    language: en-US

Start Platypush. Unlike the native Google library integration, the push-to-talk plugin doesn’t come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:

curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
    "type":"request",
    "action":"assistant.google.pushtotalk.start_conversation"
}' http://your-rpi:8008/execute
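
The same call can be made from a script. Below is a minimal Python sketch that builds the equivalent request using only the standard library. The `build_execute_request` helper is a hypothetical name introduced for illustration; the URL, headers and payload mirror the curl example, and `your-rpi` and the token value remain placeholders.

```python
import json
import urllib.request
from typing import Optional


def build_execute_request(base_url: str, token: str, action: str,
                          args: Optional[dict] = None) -> urllib.request.Request:
    """Build (but don't send) a POST to the /execute endpoint used by the curl example."""
    payload = {"type": "request", "action": action}
    if args:
        payload["args"] = args
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "your-rpi" and "my-token" are placeholders, as in the curl example
req = build_execute_request(
    "http://your-rpi:8008",
    "my-token",
    "assistant.google.pushtotalk.start_conversation",
)
# To actually send it: urllib.request.urlopen(req)
```

Keeping the request construction separate from the network call makes it easy to reuse the helper for other actions, such as stop_conversation.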

Features

Hotword detection: NO (call start_conversation or stop_conversation from your logic or from the context of a hotword integration like Snowboy, DeepSpeech or PicoVoice to trigger or stop the assistant).
Speech detection: YES.
Detection runs locally: NO (you can customize the hotword engine and how to trigger the assistant, but once a conversation is started a channel is opened with Google servers).

Pros

It implements many of the features you’d find in any Google Assistant product out there, even though hotword detection isn’t available and some of the features currently available on the assistant library aren’t provided (like timers or alarms).
Rock-solid speech detection, using the same speech model used by Google Assistant products.
Relatively good performance even on older RaspberryPi models. It’s also available for arm6 architecture, which makes it suitable also for RaspberryPi Zero or other low-power devices. No hotword engine running means that it uses resources only when you call start_conversation.
It provides the benefits of the Google Assistant speech engine with no need to have a 24/7 open connection between your mic and Google’s servers. The connection is only opened upon start_conversation. This makes it a good option if privacy is a concern, or if you want to build more flexible assistants that can be triggered through different hotword engines (or even build assistants that are triggered in different languages depending on the hotword that you use), or assistants that aren’t triggered by a hotword at all — for example, you can call start_conversation upon button press, motion sensor event or web call.

Cons

I’ve built this integration after the deprecation of the Google Assistant library occurred with no official alternatives being provided. I’ve built it by refactoring the poorly refined code provided by Google in its samples (pushtotalk.py) and making a proper plugin out of it. It works, but keep in mind that it’s based on some ugly code that’s waiting to be replaced by Google.
No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.

Alexa Integration

Integrations

assistant.echo plugin.

Configuration

Install Platypush with the HTTP backend and Alexa support:

[sudo] pip install 'platypush[http,alexa]'

Run alexa-auth. It will start a local web server on your machine on http://your-rpi:3000. Open it in your browser and authenticate with your Amazon account. A credentials file should be generated under ~/.avs.json.
Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True
    
assistant.echo:
    enabled: True

Start Platypush. The Alexa integration doesn’t come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:

curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
    "type":"request",
    "action":"assistant.echo.start_conversation"
}' http://your-rpi:8008/execute

Features

Hotword detection: NO (call start_conversation or stop_conversation from your logic or from the context of a hotword integration like Snowboy or PicoVoice to trigger or stop the assistant).
Speech detection: YES (although limited: transcription of the processed audio won’t be provided).
Detection runs locally: NO.

Pros

It implements many of the features that you’d find in any Alexa product out there, even though hotword detection isn’t available. Also, the support for skills or media control may be limited.
Good speech detection capabilities, although inferior to the Google Assistant when it comes to accuracy.
Good performance even on low-power devices. No hotword engine running means it uses resources only when you call start_conversation.
It provides some of the benefits of an Alexa device but with no need for a 24/7 open connection between your mic and Amazon’s servers. The connection is only opened upon start_conversation.

Cons

The situation is extremely fragmented when it comes to Alexa voice SDKs. Amazon eventually re-released the AVS (Alexa Voice Service), mostly with commercial uses in mind, but its features are still quite limited compared to the Google assistant products. The biggest limitation is the fact that the AVS works on raw audio input and spits back raw audio responses. It means that text transcription, either for the request or the response, won’t be available. That limits what you can build with it. For example, you won’t be able to capture custom requests through event hooks.
No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.

Snowboy Integration

Integrations

assistant.snowboy backend.

Configuration

Install Platypush with the HTTP backend and Snowboy support:

[sudo] pip install 'platypush[http,snowboy]'

Choose your hotword model(s). Some are available under SNOWBOY_INSTALL_DIR/resources/models. Otherwise, you could train or download models from the Snowboy website (which, as mentioned in the January 2021 edit above, is no longer available).
Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True
    
backend.assistant.snowboy:
    audio_gain: 1.2
    models:
        # Trigger the Google assistant in Italian when I say "computer"
        computer:
            voice_model_file: ~/models/computer.umdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: it-IT
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger the Google assistant in English when I say "OK Google"
        ok_google:
            voice_model_file: ~/models/OK Google.pmdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger Alexa when I say "Alexa"
        alexa:
            voice_model_file: ~/models/Alexa.pmdl
            assistant_plugin: assistant.echo
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.5

Start Platypush. Say the hotword associated with one of your models, check on the logs that the HotwordDetectedEvent is triggered and, if there’s an assistant plugin associated with the hotword, the corresponding assistant is correctly started.
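
The routing expressed by the configuration above (one hotword per assistant and language) can be pictured as a lookup table. The following Python sketch is only a model of that idea, not the backend's actual code: `HOTWORD_ROUTES` and `route_hotword` are hypothetical names, and passing the language as an argument of start_conversation is an assumption.

```python
from typing import Optional

# Mirrors the YAML above: hotword name -> (assistant plugin, language)
HOTWORD_ROUTES = {
    "computer": ("assistant.google.pushtotalk", "it-IT"),
    "ok_google": ("assistant.google.pushtotalk", "en-US"),
    "alexa": ("assistant.echo", "en-US"),
}


def route_hotword(hotword: str) -> Optional[dict]:
    """Return the request that would start the assistant mapped to a hotword."""
    route = HOTWORD_ROUTES.get(hotword)
    if route is None:
        return None  # unknown hotword: no assistant gets triggered
    plugin, language = route
    return {
        "type": "request",
        "action": f"{plugin}.start_conversation",
        # Assumption: the language is passed as an argument of the action
        "args": {"language": language},
    }


italian = route_hotword("computer")
alexa = route_hotword("alexa")
unknown = route_hotword("jarvis")
```

Adding a new assistant or language then only means adding one entry to the table, which is what makes the multi-hotword setup cheap to extend.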

Features

Hotword detection: YES.
Speech detection: NO.
Detection runs locally: YES.

Pros

I've been an early fan and supporter of the Snowboy project. I really like the idea of crowd-powered machine learning. You can download any hotword models for free from their website, provided that you record three audio samples of you saying that word in order to help improve the model. You can also create your custom hotword model, and if enough people are interested in using it then they’ll contribute with their samples, and the model will become more robust over time. I believe that more machine learning projects out there could really benefit from this “use it for free as long as you help improve the model” paradigm.
Platypush was an early supporter of Snowboy, so its integration is well-supported and extensively documented. You can natively configure custom assistant plugins to be executed when a certain hotword is detected, making it easy to make a multi-language and multi-hotword voice assistant.
Good performance, even on low-power devices. I’ve used Snowboy in combination with the Google Assistant push-to-talk integration for a while on single-core RaspberryPi Zero devices, and the CPU usage from hotword processing never exceeded 20–25%.
The hotword detection runs locally, on models that are downloaded locally. That means no need for a network connection to run and no data exchanged with any cloud.

Cons

Even though the idea of crowd-powered voice models is definitely interesting and has plenty of potential to scale up, the most popular models on their website have been trained with at most 2000 samples. And (sadly as well as expectedly) most of those voice samples belong to white, young-adult males, which makes many of these models perform quite poorly with speech recorded from any individuals that don’t fit within that category (and also with people who aren’t native English speakers).

Mozilla DeepSpeech

Integrations

stt.deepspeech plugin and stt.deepspeech backend (for continuous detection).

Configuration

Install Platypush with the HTTP backend and Mozilla DeepSpeech support. Take note of the version of DeepSpeech that gets installed:

[sudo] pip install 'platypush[http,deepspeech]'

Download the Tensorflow model files for the version of DeepSpeech that has been installed. This may take a while depending on your connection:

export MODELS_DIR=~/models
export DEEPSPEECH_VERSION=0.6.1

wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz

tar xvf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
x deepspeech-0.6.1-models/
x deepspeech-0.6.1-models/lm.binary
x deepspeech-0.6.1-models/output_graph.pbmm
x deepspeech-0.6.1-models/output_graph.pb
x deepspeech-0.6.1-models/trie
x deepspeech-0.6.1-models/output_graph.tflite

mv deepspeech-$DEEPSPEECH_VERSION-models $MODELS_DIR

Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the DeepSpeech integration:

backend.http:
    enabled: True
    
stt.deepspeech:
    model_file: ~/models/output_graph.pbmm
    lm_file: ~/models/lm.binary
    trie_file: ~/models/trie

    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

    conversation_timeout: 5
          
backend.stt.deepspeech:
    enabled: True

Start Platypush. Speech detection will start running on startup. SpeechDetectedEvents will be triggered when you talk. HotwordDetectedEvents will be triggered when you say one of the configured hotwords. ConversationDetectedEvents will be triggered when you say something after a hotword, with speech provided as an argument. You can also disable the continuous detection and only start it programmatically by calling stt.deepspeech.start_detection and stt.deepspeech.stop_detection. You can also use it to perform offline speech transcription from audio files:

curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
    "type":"request",
    "action":"stt.deepspeech.detect",
    "args": {
        "audio_file": "~/audio.wav"
    }
}' http://your-rpi:8008/execute

# Example response
{
    "type":"response",
    "target":"http",
    "response": {
        "errors":[],
        "output": {
            "speech": "This is a test"
        }
    }
}
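
The event flow described before the transcription example (speech, hotword, then a conversation within conversation_timeout) can be sketched as a small state machine. This is an illustrative Python model under stated assumptions, not Platypush's actual implementation: the `classify` function is hypothetical, each utterance is mapped to exactly one event, and a hotword only matches when it is the whole utterance.

```python
from typing import List, Optional, Tuple

HOTWORDS = {"computer", "alexa", "hello"}
CONVERSATION_TIMEOUT = 5.0  # seconds, as in the configuration above


def classify(utterances: List[Tuple[float, str]]) -> List[Tuple[str, str]]:
    """Map (timestamp, transcript) pairs to the names of the events described above."""
    events = []
    hotword_at: Optional[float] = None  # time of the last hotword awaiting a follow-up
    for ts, text in utterances:
        phrase = text.strip().lower()
        if phrase in HOTWORDS:
            events.append(("HotwordDetectedEvent", phrase))
            hotword_at = ts
        elif hotword_at is not None and ts - hotword_at <= CONVERSATION_TIMEOUT:
            # Speech that follows a hotword within the timeout is a conversation
            events.append(("ConversationDetectedEvent", phrase))
            hotword_at = None
        else:
            events.append(("SpeechDetectedEvent", phrase))
            hotword_at = None
    return events


events = classify([
    (0.0, "what time is it"),
    (10.0, "computer"),
    (12.0, "turn on the lights"),
    (30.0, "hello"),
    (40.0, "too late"),
])
```

In this model the last utterance arrives ten seconds after the hotword, so it falls outside the five-second window and is reported as plain speech.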

Features

Hotword detection: YES.
Speech detection: YES.
Detection runs locally: YES.

Pros

I’ve been honestly impressed by the features of DeepSpeech and the progress they’ve made starting from version 0.6.0. Mozilla made it easy to run both hotword and speech detection on-device with no need for any third-party services or network connection. The full codebase is open-source and the Tensorflow voice and language models are also very good. It’s amazing that they’ve released the whole thing for free to the community. It also means that you can easily extend the Tensorflow model by training it with your own samples.
Speech-to-text transcription of audio files can be a very useful feature.

Cons

DeepSpeech is quite demanding when it comes to CPU resources. It will run OK on a laptop or on a RaspberryPi 4 (although in my tests it took 100% of a core on a RaspberryPi 4 for speech detection). It may be too resource-intensive to run on less powerful machines.
DeepSpeech has a bit more delay than other solutions. The engineers at Mozilla have worked a lot to make the model as small and performant as possible, and they claim to have achieved real-time performance on a RaspberryPi 4. In reality, all of my tests showed between 2 and 4 seconds of delay between speech capture and detection.
DeepSpeech is relatively good at detecting speech, but not at interpreting the semantic context (that’s something where Google still wins hands down). If you say “this is a test,” the model may actually capture “these is a test.” “This” and “these” do indeed sound almost the same in English, but the Google assistant has a better semantic engine to detect the right interpretation of such ambiguous cases. DeepSpeech works quite well for speech-to-text transcription purposes but, in such ambiguous cases, it lacks some semantic context.
Even though it’s possible to use DeepSpeech from Platypush as a hotword detection engine, keep in mind that it’s not how the engine is intended to be used. Hotword engines usually run against smaller and more performant models only intended to detect one or few words, not against a full-featured language model. The best usage of DeepSpeech is probably either for offline text transcription, or with another hotword integration and leveraging DeepSpeech for the speech detection part.

PicoVoice

PicoVoice is a very promising company that has released several products for performing voice detection on-device. Among them:

Porcupine, a hotword engine.
Leopard, a speech-to-text offline transcription engine.
Cheetah, a speech-to-text engine for real-time applications.
Rhino, a speech-to-intent engine.

So far, Platypush provides integrations with Porcupine and Cheetah.

Integrations

Hotword engine: stt.picovoice.hotword plugin and stt.picovoice.hotword backend (for continuous detection).
Speech engine: stt.picovoice.speech plugin and stt.picovoice.speech backend (for continuous detection).

Configuration

Install Platypush with the HTTP backend and the PicoVoice hotword integration and/or speech integration:

[sudo] pip install 'platypush[http,picovoice-hotword,picovoice-speech]'

Create or add the lines to your ~/.config/platypush/config.yaml to enable the PicoVoice integration:

stt.picovoice.hotword:
    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello
        
# Enable continuous hotword detection
backend.stt.picovoice.hotword:
    enabled: True
  
# Enable continuous speech detection
# backend.stt.picovoice.speech:
#     enabled: True

# Or start speech detection when a hotword is detected
event.hook.OnHotwordDetected:
    if:
        type: platypush.message.event.stt.HotwordDetectedEvent
    then:
        # Start a timer that stops the detection in 10 seconds
        - action: utils.set_timeout
          args:
              seconds: 10
              name: StopSpeechDetection
              actions:
                  - action: stt.picovoice.speech.stop_detection

        - action: stt.picovoice.speech.start_detection

Start Platypush and enjoy your on-device voice assistant.

Features

Hotword detection: YES.
Speech detection: YES.
Detection runs locally: YES.

Pros

When it comes to on-device voice engines, PicoVoice products are probably the best solution out there. Their hotword engine is far more accurate than Snowboy and it manages to be even less CPU-intensive. Their speech engine has much less delay than DeepSpeech and it’s also much less power-hungry — it will still run well and with low latency even on older models of RaspberryPi.

Cons

While PicoVoice provides Python SDKs, their native libraries are closed source. It means that I couldn’t dig much into how they’ve solved the problem.
Their hotword engine (Porcupine) can be installed and run free of charge for personal use on any device, but if you want to expand the set of keywords provided by default, or add more samples to train the existing models, then you’ll have to go for a commercial license. Their speech engine (Cheetah) instead can only be installed and run free of charge for personal use on Linux on x86_64 architecture. Any other architecture or operating system, as well as any chance to extend the model or use a different model, is only possible through a commercial license. While I understand their point and their business model, I’d have been super-happy to just pay for a license through a more friendly process, instead of relying on the old-fashioned “contact us for a commercial license/we’ll reach back to you” paradigm.
Cheetah’s speech engine still suffers from some of the issues of DeepSpeech when it comes to semantic context/intent detection. The “this/these” ambiguity also happens here. These problems can be partially solved by using Rhino, PicoVoice’s speech-to-intent engine, which provides a structured representation of the speech intent instead of a letter-by-letter transcription. However, I haven’t yet worked on integrating Rhino into Platypush.

Conclusions

The democratization of voice technology has long been dreamed about, and it’s finally (slowly) coming. The situation out there is still quite fragmented though and some commercial SDKs may still get deprecated with short notice or no notice at all. But at least some solutions are emerging to bring speech detection to all devices.

I’ve built integrations in Platypush for all of these services because I believe that it’s up to users, not to businesses, to decide how people should use and benefit from voice technology. Moreover, having so many voice integrations in the same product, and especially having voice integrations that all expose the same API and generate the same events, makes it very easy to write assistant-agnostic logic, and to really decouple the task of speech recognition from the business logic that can be run by voice commands.

Check out my previous article to learn how to write your own custom hooks in Platypush on speech detection, hotword detection and speech start/stop events.

To summarize my findings so far:

Use the native Google Assistant integration if you want to have a full Google experience, and if you’re ok with Google servers processing your audio and the possibility that somewhere in the future the deprecated Google Assistant library won’t work anymore.
Use the Google push-to-talk integration if you only want to have the assistant, without hotword detection, or you want your assistant to be triggered by alternative hotwords.
Use the Alexa integration if you already have an Amazon-powered ecosystem and you’re ok with having less flexibility when it comes to custom hooks because of the unavailability of speech transcript features in the AVS.
Use Snowboy if you want to use a flexible, open-source and crowd-powered engine for hotword detection that runs on-device and/or use multiple assistants at the same time through different hotword models, even if the models may not be that accurate.
Use Mozilla DeepSpeech if you want a fully on-device open-source engine powered by a robust Tensorflow model, even if it comes with a higher CPU load and a bit more latency.
Use PicoVoice solutions if you want a full voice solution that runs on-device and is both accurate and performant, even though you’ll need a commercial license to use it on some devices or to extend/change the model.

Let me know your thoughts on these solutions and your experience with these integrations!
