blog/markdown/The-state-of-voice-assistant-integrations-in-2024.md
2024-06-03 16:44:07 +02:00

52 KiB

Those who have been following my blog or used Platypush for a while probably know that I've put quite some efforts to get voice assistants rights over the past few years.

I built my first (very primitive) voice assistant that used DCT+Markov models back in 2008, when the concept was still pretty much a science fiction novelty.

Then I wrote an article in 2019 and one in 2020 on how to use several voice integrations in Platypush to create custom voice assistants.

Everyone in those pictures is now dead

Quite a few things have changed in this industry niche since I wrote my previous article. Most of the solutions that I covered back in the day, unfortunately, are gone in a way or another:

  • The assistant.snowboy integration is gone because unfortunately Snowboy is gone. For a while you could still run the Snowboy code with models that either you had previously downloaded from their website or trained yourself, but my latest experience proved to be quite unfruitful - it's been more than 4 years since the last commit on Snowboy, and it's hard to get the code to even run.

  • The assistant.alexa integration is also gone, as Amazon has stopped maintaining the AVS SDK. And I have literally no clue of what Amazon's plans with the development of Alexa skills are (if there are any plans at all).

  • The stt.deepspeech integration is also gone: the project hasn't seen a commit in 3 years and I even struggled to get the latest code to run. Given the current financial situation at Mozilla, and the fact that they're trying to cut as much as possible on what they don't consider part of their core product, it's very unlikely that DeepSpeech will be revived any time soon.

  • The assistant.google integration is still there, but I can't make promises on how long it can be maintained. It uses the google-assistant-library, which was deprecated in 2019. Google replaced it with the conversational actions, which was also deprecated last year. <rant>Put here your joke about Google building products with the shelf life of a summer hit.</rant>

  • The tts.mimic3 integration, a text model based on mimic3, part of the Mycroft initiative, is still there, but only because it's still possible to spin up a Docker image that runs mimic3. The whole Mycroft project, however, is now defunct, and the story of how it went bankrupt is a very sad story about the power that patent trolls have on startups. The Mycroft initiative however seems to have been picked up by the community, and something seems to move in the space of fully open source and on-device voice models. I'll definitely be looking with interest at what happens in that space, but the project seems to be at a stage that is still a bit immature to justify an investment into a new Platypush integration.

But not all hope is lost

assistant.google

assistant.google may be relying on a dead library, but it's not dead (yet). The code still works, but you're a bit constrained on the hardware side - the assistant library only supports x86_64 and ARMv7 (namely, only Raspberry Pi 3 and 4). No ARM64 (i.e. no Raspberry Pi 5), and even running it on other ARMv7-compatible devices has proved to be a challenge in some cases. Given the state of the library, it's safe to say that it'll never be supported on other platforms, but if you want to run your assistant on a device that is still supported then it should still work fine.

I had however to do a few dirty packaging tricks to ensure that the assistant library code doesn't break badly on newer versions of Python. That code hasn't been touched in 5 years and it's starting to rot. It depends on ancient and deprecated Python libraries like enum34 and it needs some hammering to work - without breaking the whole Python environment in the process.

For now, pip install 'platypush[assistant.google]' should do all the dirty work and get all of your assistant dependencies installed. But I can't promise I can maintain that code forever.

assistant.picovoice

Picovoice has been a nice surprise in an industry niche where all the products that were available just 4 years ago are now dead.

I described some of their products in my previous articles, and I even built a couple of stt.picovoice.* plugins for Platypush back in the day, but I didn't really put much effort in it.

Their business model seemed a bit weird - along the lines of "you can test our products on x86_64, if you need an ARM build you should contact us as a business partner". And the quality of their products was also a bit disappointing compared to other mainstream offerings.

I'm glad to see that the situation has changed quite a bit now. They still have a "sign up with a business email" model, but at least now you can just sign up on their website and start using their products rather than sending emails around. And I'm also quite impressed to see the progress on their website. You can now train hotword models, customize speech-to-text models and build your own intent rules directly from their website - a feature that was also available in the beloved Snowboy and that went missing from any major product offerings out there after Snowboy was gone. I feel like the quality of their models has also greatly improved compared to the last time I checked them - predictions are still slower than the Google Assistant, definitely less accurate with non-native accents, but the gap with the Google Assistant when it comes to native accents isn't very wide.

assistant.openai

OpenAI has filled many gaps left by all the casualties in the voice assistants market. Platypush now provides a new assistant.openai plugin that stitches together several of their APIs to provide a voice assistant experience that honestly feels much more natural than anything I've tried in all these years.

Let's explore how to use these integrations to build our on-device voice assistant with custom rules.

Feature comparison

As some of you may know, voice assistant often aren't monolithic products. Unless explicitly designed as all-in-one packages (like the google-assistant-library), voice assistant integrations in Platypush are usually built on top of four distinct APIs:

  1. Hotword detection: This is the component that continuously listens on your microphone until you speak "Ok Google", "Alexa" or any other wake-up word used to start a conversation. Since it's a continuously listening component that needs to take decisions fast, and it only has to recognize one word (or in a few cases 3-4 more at most), it usually doesn't need to run on a full language model. It needs small models, often a couple of MBs heavy at most.

  2. Speech-to-text (STT): This is the component that will capture audio from the microphone and use some API to transcribe it to text.

  3. Response engine: Once you have the transcription of what the user said, you need to feed it to some model that will generate some human-like response for the question.

  4. Text-to-speech (TTS): Once you have your AI response rendered as a text string, you need a text-to-speech model to speak it out loud on your speakers or headphones.

On top of these basic building blocks for a voice assistant, some integrations may also provide two extra features.

Speech-to-intent

In this mode, the user's prompt, instead of being transcribed directly to text, is transcribed into a structured intent that can be more easily processed by a downstream integration with no need for extra text parsing, regular expressions etc.

For instance, a voice command like "turn off the bedroom lights" could be translated into an intent such as:

{
  "intent": "lights_ctrl",
  "slots": {
    "state": "off",
    "lights": "bedroom"
  }
}

Offline speech-to-text

a.k.a. offline text transcriptions. Some assistant integrations may offer you the ability to pass some audio file and transcribe their content as text.

Features summary

This table summarizes how the assistant integrations available in Platypush compare when it comes to what I would call the foundational blocks:

Plugin Hotword STT AI responses TTS
assistant.google
assistant.openai
assistant.picovoice

And this is how they compare in terms of extra features:

Plugin Intents Offline SST
assistant.google
assistant.openai
assistant.picovoice

Let's see a few configuration examples to better understand the pros and cons of each of these integrations.

Configuration

Hardware requirements

  1. A computer, a Raspberry Pi, an old tablet, or anything in between, as long as it can run Python. At least 1GB of RAM is advised for smooth audio processing experience.

  2. A microphone.

  3. Speaker/headphones.

Installation notes

Platypush 1.0.0 has recently been released, and new installation procedures with it.

There's now official support for several package managers, a better Docker installation process, and more powerful ways to install plugins - via pip extras, Web interface, Docker and virtual environments.

The optional dependencies for any Platypush plugins can be installed via pip extras in the simplest case:

$ pip install 'platypush[plugin1,plugin2,...]'

For example, if you want to install Platypush with the dependencies for assistant.openai and assistant.picovoice:

$ pip install 'platypush[assistant.openai,assistant.picovoice]'

Some plugins however may require extra system dependencies that are not available via pip - for instance, both the OpenAI and Picovoice integrations require the ffmpeg binary to be installed, as it is used for audio conversion and exporting purposes. You can check the plugins documentation for any system dependencies required by some integrations, or install them automatically through the Web interface or the platydock command for Docker containers.

A note on the hooks

All the custom actions in this article are built through event hooks triggered by SpeechRecognizedEvent (or IntentRecognizedEvent for intents). When an intent event is triggered, or a speech event with a condition on a phrase, the assistant integrations in Platypush will prevent the default assistant response. That's to avoid cases where e.g. you say "turn off the lights", your hook takes care of running the actual action, while your voice assistant fetches a response from Google or ChatGPT along the lines of "sorry, I can't control your lights".

If you want to render a custom response from an event hook, you can do so by calling event.assistant.render_response(text), and it will be spoken using the available text-to-speech integration.

If you want to disable this behaviour, and you want the default assistant response to always be rendered, even if it matches a hook with a phrase or an intent, you can do so by setting the stop_conversation_on_speech_match parameter to false in your assistant plugin configuration.

Text-to-speech

Each of the available assistant plugins has it own default tts plugin associated:

  • assistant.google: tts, but tts.google is also available. The difference is that tts uses the (unofficial) Google Translate frontend API - it requires no extra configuration, but besides setting the input language it isn't very configurable. tts.google on the other hand uses the Google Cloud Translation API. It is much more versatile, but it requires an extra API registered to your Google project and an extra credentials file.

  • assistant.openai: tts.openai, which leverages the OpenAI text-to-speech API.

  • assistant.picovoice: tts.picovoice, which uses the (still experimental, at the time of writing) Picovoice Orca engine.

Any text rendered via assistant*.render_response will be rendered using the associated TTS plugin. You can however customize it by setting tts_plugin on your assistant plugin configuration - e.g. you can render responses from the OpenAI assistant through the Google or Picovoice engine, or the other way around.

tts plugins also expose a say action that can be called outside of an assistant context to render custom text at runtime - for example, from other event hooks, procedures, cronjobs or API calls. For example:

$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "tts.openai.say",
  "args": {
    "text": "What a wonderful day!"
  }
}
' http://localhost:8008/execute

assistant.google

This is the oldest voice integration in Platypush - and one of the use-cases that actually motivated me into forking the previous project into what is now Platypush.

As mentioned in the previous section, this integration is built on top of a deprecated library (with no available alternatives) that just so happens to still work with a bit of hammering on x86_64 and Raspberry Pi 3/4.

Personally it's the voice assistant I still use on most of my devices, but it's definitely not guaranteed that it will keep working in the future.

Once you have installed Platypush with the dependencies for this integration, you can configure it through these steps:

  1. Create a new project on the Google developers console and generate a new set of credentials for it. Download the credentials secrets as JSON.
  2. Generate scoped credentials from your secrets.json.
  3. Configure the integration in your config.yaml for Platypush (see the configuration page for more details):
assistant.google:
  # Default: ~/.config/google-oauthlib-tool/credentials.json
  # or <PLATYPUSH_WORKDIR>/credentials/google/assistant.json
  credentials_file: /path/to/credentials.json
  # Default: no sound is played when "Ok Google" is detected
  conversation_start_sound: /path/to/sound.mp3

Restart the service, say "Ok Google" or "Hey Google" while the microphone is active, and everything should work out of the box.

You can now start creating event hooks to execute your custom voice commands. For example, if you configured a lights plugin (e.g. light.hue) and a music plugin (e.g. music.mopidy), you can start building voice commands like these:

# Content of e.g. /path/to/config_yaml/scripts/assistant.py

from platypush import run, when
from platypush.events.assistant import (
  ConversationStartEvent, SpeechRecognizedEvent
)

light_plugin = "light.hue"
music_plugin = "music.mopidy"

@when(ConversationStartEvent)
def pause_music_when_conversation_starts():
  run(f"{music_plugin}.pause_if_playing")

# Note: (limited) support for regular expressions on `phrase`
# This hook will match any phrase containing either "turn on the lights"
# or "turn off the lights"
@when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
def lights_on_command():
  run(f"{light_plugin}.on")
  # Or, with arguments:
  # run(f"{light_plugin}.on", groups=["Bedroom"])

@when(SpeechRecognizedEvent, phrase="turn off (the)? lights")
def lights_off_command():
  run(f"{light_plugin}.off")

@when(SpeechRecognizedEvent, phrase="play (the)? music")
def play_music_command():
  run(f"{music_plugin}.play")

@when(SpeechRecognizedEvent, phrase="stop (the)? music")
def stop_music_command():
  run(f"{music_plugin}.stop")

Or, via YAML:

# Add to your config.yaml, or to one of the files included in it

event.hook.pause_music_when_conversation_starts:
  if:
    type: platypush.message.event.ConversationStartEvent

  then:
    - action: music.mopidy.pause_if_playing

event.hook.lights_on_command:
  if:
    type: platypush.message.event.SpeechRecognizedEvent
    phrase: "turn on (the)? lights"

  then:
    - action: light.hue.on
    # args:
    #   groups:
    #     - Bedroom

event.hook.lights_off_command:
  if:
    type: platypush.message.event.SpeechRecognizedEvent
    phrase: "turn off (the)? lights"

  then:
    - action: light.hue.off

event.hook.play_music_command:
  if:
    type: platypush.message.event.SpeechRecognizedEvent
    phrase: "play (the)? music"

  then:
    - action: music.mopidy.play

event.hook.stop_music_command:
  if:
    type: platypush.message.event.SpeechRecognizedEvent
    phrase: "stop (the)? music"

  then:
    - action: music.mopidy.stop

Parameters are also supported on the phrase event argument through the ${} template construct. For example:

from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='play ${title} by ${artist}')
def on_play_track_command(
    event: SpeechRecognizedEvent, title: str, artist: str
):
    results = run(
        "music.mopidy.search",
        filter={"title": title, "artist": artist}
    )

    if not results:
        event.assistant.render_response(f"Couldn't find {title} by {artist}")
        return

    run("music.mopidy.play", resource=results[0]["uri"])

Pros

  • 👍 Very fast and robust API.
  • 👍 Easy to install and configure.
  • 👍 It comes with almost all the features of a voice assistant installed on Google hardware - except some actions native to Android-based devices and video/display features. This means that features such as timers, alarms, weather forecast, setting the volume or controlling Chromecasts on the same network are all supported out of the box.
  • 👍 It connects to your Google account (can be configured from your Google settings), so things like location-based suggestions and calendar events are available. Support for custom actions and devices configured in your Google Home app is also available out of the box, although I haven't tested it in a while.
  • 👍 Good multi-language support. In most of the cases the assistant seems quite capable of understanding questions in multiple language and respond in the input language without any further configuration.

Cons

  • 👎 Based on a deprecated API that could break at any moment.
  • 👎 Limited hardware support (only x86_64 and RPi 3/4).
  • 👎 Not possible to configure the hotword - only "Ok/Hey Google" is available.
  • 👎 Not possible to configure the output voice - it can only use the stock Google Assistant voice.
  • 👎 No support for intents - something similar was available (albeit tricky to configure) through the Actions SDK, but that has also been abandoned by Google.
  • 👎 Not very modular. Both assistant.picovoice and assistant.openai have been built by stitching together different independent APIs. Those plugins are therefore quite modular. You can choose for instance to run only the hotword engine of assistant.picovoice, which in turn will trigger the conversation engine of assistant.openai, and maybe use tts.google to render the responses. By contrast, given the relatively monolithic nature of google-assistant-library, which runs the whole service locally, if your instance runs assistant.google then it can't run other assistant plugins.

assistant.picovoice

The assistant.picovoice integration is available from Platypush 1.0.0.

Previous versions had some outdated sst.picovoice.* plugins for the individual products, but they weren't properly tested and they weren't combined together into a single integration that implements the Platypush' assistant API.

This integration is built on top of the voice products developed by Picovoice. These include:

  • Porcupine: a fast and customizable engine for hotword/wake-word detection. It can be enabled by setting hotword_enabled to true in the assistant.picovoice plugin configuration.

  • Cheetah: a speech-to-text engine optimized for real-time transcriptions. It can be enabled by setting stt_enabled to true in the assistant.picovoice plugin configuration.

  • Leopard: a speech-to-text engine optimized for offline transcriptions of audio files.

  • Rhino: a speech-to-intent engine.

  • Orca: a text-to-speech engine.

You can get your personal access key by signing up at the Picovoice console. You may be asked to submit a reason for using the service (feel free to mention a personal Platypush integration), and you will receive your personal access key.

If prompted to select the products you want to use, make sure to select the ones from the Picovoice suite that you want to use with the assistant.picovoice plugin.

A basic plugin configuration would like this:

assistant.picovoice:
  access_key: YOUR_ACCESS_KEY

  # Keywords that the assistant should listen for
  keywords:
    - alexa
    - computer
    - ok google

  # Paths to custom keyword files
  # keyword_paths:
  #   - ~/.local/share/picovoice/keywords/linux/custom_linux.ppn

  # Enable/disable the hotword engine
  hotword_enabled: true
  # Enable the STT engine
  stt_enabled: true

  # conversation_start_sound: ...

  # Path to a custom model to be used to speech-to-text
  # speech_model_path: ~/.local/share/picovoice/models/cheetah/custom-en.pv

  # Path to an intent model. At least one custom intent model is required if
  # you want to enable intent detection.
  # intent_model_path: ~/.local/share/picovoice/models/rhino/custom-en-x86.rhn

Hotword detection

If enabled through the hotword_enabled parameter (default: True), the assistant will listen for a specific wake word before starting the speech-to-text or intent recognition engines. You can specify custom models for your hotword (e.g. on the same device you may use "Alexa" to trigger the speech-to-text engine in English, "Computer" to trigger the speech-to-text engine in Italian, and "Ok Google" to trigger the intent recognition engine).

You can also create your custom hotword models using the Porcupine console.

If hotword_enabled is set to True, you must also specify the keywords parameter with the list of keywords that you want to listen for, and optionally the keyword_paths parameter with the paths to the any custom hotword models that you want to use. If hotword_enabled is set to False, then the assistant won't start listening for speech after the plugin is started, and you will need to programmatically start the conversation by calling the assistant.picovoice.start_conversation action.

When a wake-word is detected, the assistant will emit a HotwordDetectedEvent that you can use to build your custom logic.

By default, the assistant will start listening for speech after the hotword if either stt_enabled or intent_model_path are set. If you don't want the assistant to start listening for speech after the hotword is detected (for example because you want to build your custom response flows, or trigger the speech detection using different models depending on the hotword that is used, or because you just want to detect hotwords but not speech), then you can also set the start_conversation_on_hotword parameter to false. If that is the case, then you can programmatically start the conversation by calling the assistant.picovoice.start_conversation method in your event hooks:

from platypush import when, run
from platypush.message.event.assistant import HotwordDetectedEvent

# Start a conversation using the Italian language model when the
# "Buongiorno" hotword is detected
@when(HotwordDetectedEvent, hotword='Buongiorno')
def on_it_hotword_detected(event: HotwordDetectedEvent):
    event.assistant.start_conversation(model_file='path/to/it.pv')

Speech-to-text

If you want to build your custom STT hooks, the approach is the same seen for the assistant.google plugins - create an event hook on SpeechRecognizedEvent with a given exact phrase, regex or template.

Speech-to-intent

Intents are structured actions parsed from unstructured human-readable text.

Unlike with hotword and speech-to-text detection, you need to provide a custom model for intent detection. You can create your custom model using the Rhino console.

When an intent is detected, the assistant will emit an IntentRecognizedEvent and you can build your custom hooks on it.

For example, you can build a model to control groups of smart lights by defining the following slots on the Rhino console:

  • device_state: The new state of the device (e.g. with on or off as supported values)

  • room: The name of the room associated to the group of lights to be controlled (e.g. living room, kitchen, bedroom)

You can then define a lights_ctrl intent with the following expressions:

  • "turn $device_state:state the lights"
  • "turn $device_state:state the $room:room lights"
  • "turn the lights $device_state:state"
  • "turn the $room:room lights $device_state:state"
  • "turn $room:room lights $device_state:state"

This intent will match any of the following phrases:

  • "turn on the lights"
  • "turn off the lights"
  • "turn the lights on"
  • "turn the lights off"
  • "turn on the living room lights"
  • "turn off the living room lights"
  • "turn the living room lights on"
  • "turn the living room lights off"

And it will extract any slots that are matched in the phrases in the IntentRecognizedEvent.

Train the model, download the context file, and pass the path on the intent_model_path parameter.

You can then register a hook to listen to a specific intent:

from platypush import when, run
from platypush.events.assistant import IntentRecognizedEvent

@when(IntentRecognizedEvent, intent='lights_ctrl', slots={'state': 'on'})
def on_turn_on_lights(event: IntentRecognizedEvent):
    room = event.slots.get('room')
    if room:
        run("light.hue.on", groups=[room])
    else:
        run("light.hue.on")

Note that if both stt_enabled and intent_model_path are set, then both the speech-to-text and intent recognition engines will run in parallel when a conversation is started.

The intent engine is usually faster, as it has a smaller set of intents to match and doesn't have to run a full speech-to-text transcription. This means that, if an utterance matches both a speech-to-text phrase and an intent, the IntentRecognizedEvent event is emitted (and not SpeechRecognizedEvent).

This may not be always the case though. So, if you want to use the intent detection engine together with the speech detection, it may be a good practice to also provide a fallback SpeechRecognizedEvent hook to catch the text if the speech is not recognized as an intent:

from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='turn ${state} (the)? ${room} lights?')
def on_turn_on_lights(event: SpeechRecognizedEvent, phrase, room, **context):
    if room:
        run("light.hue.on", groups=[room])
    else:
        run("light.hue.on")

Text-to-speech and response management

The text-to-speech engine, based on Orca, is provided by the tts.picovoice plugin.

However, the Picovoice integration won't provide you with automatic AI-generated responses for your queries. That's because Picovoice doesn't seem to offer (yet) any products for conversational assistants, either voice-based or text-based.

You can however leverage the render_response action to render some text as speech in response to a user command, and that in turn will leverage the Picovoice TTS plugin to render the response.

For example, the following snippet provides a hook that:

  • Listens for SpeechRecognizedEvent.

  • Matches the phrase against a list of predefined commands that shouldn't require an AI-generated response.

  • Has a fallback logic that leverages openai.get_response to generate a response through a ChatGPT model and render it as audio.

Also, note that any text rendered over the render_response action that ends with a question mark will automatically trigger a follow-up - i.e. the assistant will wait for the user to answer its question.

import re

from platypush import hook, run
from platypush.message.event.assistant import SpeechRecognizedEvent

def play_music():
    run("music.mopidy.play")

def stop_music():
    run("music.mopidy.stop")

def ai_assist(event: SpeechRecognizedEvent):
    response = run("openai.get_response", prompt=event.phrase)
    if not response:
        return

    run("assistant.picovoice.render_response", text=response)

# List of commands to match, as pairs of regex patterns and the
# corresponding actions
hooks = (
    (re.compile(r"play (the)?music", re.IGNORECASE), play_music),
    (re.compile(r"stop (the)?music", re.IGNORECASE), stop_music),
    # ...
    # Fallback to the AI assistant
    (re.compile(r".*"), ai_assist),
)

@when(SpeechRecognizedEvent)
def on_speech_recognized(event, **kwargs):
    for pattern, command in hooks:
        if pattern.search(event.phrase):
            run("logger.info", msg=f"Running voice command: {command.__name__}")
            command(event, **kwargs)
            break

Offline speech-to-text

An assistant.picovoice.transcribe action is provided for offline transcriptions of audio files, using the Leopard models.

You can easily call it from your procedures, hooks or through the API:

$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "assistant.picovoice.transcribe",
  "args": {
    "audio_file": "/path/to/some/speech.mp3"
  }
}' http://localhost:8008/execute

{
  "transcription": "This is a test",
  "words": [
    {
      "word": "this",
      "start": 0.06400000303983688,
      "end": 0.19200000166893005,
      "confidence": 0.9626294374465942
    },
    {
      "word": "is",
      "start": 0.2879999876022339,
      "end": 0.35199999809265137,
      "confidence": 0.9781675934791565
    },
    {
      "word": "a",
      "start": 0.41600000858306885,
      "end": 0.41600000858306885,
      "confidence": 0.9764975309371948
    },
    {
      "word": "test",
      "start": 0.5120000243186951,
      "end": 0.8320000171661377,
      "confidence": 0.9511580467224121
    }
  ]
}

Pros

  • 👍 The Picovoice integration is extremely configurable. assistant.picovoice stitches together five independent products developed by a small company specialized in voice products for developers. As such, Picovoice may be the best option if you have custom use-cases. You can pick which features you need (hotword, speech-to-text, speech-to-intent, text-to-speech...) and you have plenty of flexibility in building your integrations.

  • 👍 Runs (or seems to run) (mostly) on device. This is something that we can't say about the other two integrations discussed in this article. If keeping your voice interactions 100% hidden from Google's or Microsoft's eyes is a priority, then Picovoice may be your best bet.

  • 👍 Rich features. It uses different models for different purposes - for example, Cheetah models are optimized for real-time speech detection, while Leopard is optimized for offline transcription. Moreover, Picovoice is the only integration among those analyzed in this article to support speech-to-intent.

  • 👍 It's very easy to build new models or customize existing ones. Picovoice has a powerful developers console that allows you to easily create hotword models, tweak the priority of some words in voice models, and create custom intent models.

Cons

  • 👎 The business model is still a bit weird. It's better than the earlier "write us an email with your business case and we'll reach back to you", but it still requires you to sign up with a business email and write a couple of lines on what you want to build with their products. It feels like their focus is on a B2B approach rather than "open up and let the community build stuff", and that seems to create unnecessary friction.

  • 👎 No native conversational features. At the time of writing, Picovoice doesn't offer products that generate AI responses given voice or text prompts. This means that, if you want AI-generated responses to your queries, you'll have to do requests to e.g. openai.get_response(prompt) directly in your hooks for SpeechRecognizedEvent, and render the responses through assistant.picovoice.render_response. This makes the use of assistant.picovoice alone more fit to cases where you want to mostly create voice command hooks rather than have general-purpose conversations.

  • 👎 Speech-to-text, at least on my machine, is slower than the other two integrations, and the accuracy with non-native accents is also much lower.

  • 👎 Limited support for any languages other than English. At the time of writing hotword detection with Porcupine seems to be in a relative good shape with support for 16 languages. However, both speech-to-text and text-to-speech only support English at the moment.

  • 👎 Some APIs are still quite unstable. The Orca text-to-speech API, for example, doesn't even support text that includes digits or some punctuation characters - at least not at the time of writing. The Platypush integration fills the gap with workarounds that e.g. replace words to numbers and replace punctuation characters, but you definitely have a feeling that some parts of their products are still work in progress.

assistant.openai

This integration has been released in Platypush 1.0.7.

It uses the following OpenAI APIs:

  • /audio/transcriptions for speech-to-text. At the time of writing the default model is whisper-1. It can be configured through the model setting on the assistant.openai plugin configuration. See the OpenAI documentation for a list of available models.
  • /chat/completions to get AI-generated responses using a GPT model. At the time of writing the default is gpt-3.5-turbo, but it can be configurable through the model setting on the openai plugin configuration. See the OpenAI documentation for a list of supported models.
  • /audio/speech for text-to-speech. At the time of writing the default model is tts-1 and the default voice is nova. They can be configured through the model and voice settings respectively on the tts.openai plugin. See the OpenAI documentation for a list of available models and voices.

You will need an OpenAI API key associated to your account.

A basic configuration would like this:

openai:
  api_key: YOUR_OPENAI_API_KEY  # Required
  # conversation_start_sound: ...
  # model: ...
  # context: ...
  # context_expiry: ...
  # max_tokens: ...

assistant.openai:
  # model: ...
  # tts_plugin: some.other.tts.plugin

tts.openai:
  # model: ...
  # voice: ...

If you want to build your custom hooks on speech events, the approach is the same seen for the other assistant plugins - create an event hook on SpeechRecognizedEvent with a given exact phrase, regex or template.

Hotword support

OpenAI doesn't provide an API for hotword detection, nor a small model for offline detection.

This means that, if no other assistant plugins with stand-alone hotword support are configured (only assistant.picovoice for now), a conversation can only be triggered by calling the assistant.openai.start_conversation action.

If you want hotword support, then the best bet is to add assistant.picovoice to your configuration too - but make sure to only enable hotword detection and not speech detection, which will be delegated to assistant.openai via event hook:

assistant.picovoice:
  access_key: ...
  keywords:
    - computer

  hotword_enabled: true
  stt_enabled: false
  # conversation_start_sound: ...

Then create a hook that listens for HotwordDetectedEvent and calls assistant.openai.start_conversation:

from platypush import run, when
from platypush.events.assistant import HotwordDetectedEvent

@when(HotwordDetectedEvent, hotword="computer")
def on_hotword_detected():
  run("assistant.openai.start_conversation")

Conversation contexts

The most powerful feature offered by the OpenAI assistant is the fact that it leverages the conversation contexts provided by the OpenAI API.

This means two things:

  1. Your assistant can be initialized/tuned with a static context. It is possible to provide some initialization context to the assistant that can fine tune how the assistant will behave, (e.g. what kind of tone/language/approach will have when generating the responses), as well as initialize the assistant with some predefined knowledge in the form of hypothetical past conversations. Example:
openai:
   # ...

   context:
       # `system` can be used to initialize the context for the expected tone
       # and language in the assistant responses
       - role: system
         content: >
             You are a voice assistant that responds to user queries using
             references to Lovecraftian lore.             

       # `user`/`assistant` interactions can be used to initialize the
       # conversation context with previous knowledge. `user` is used to
       # emulate previous user questions, and `assistant` models the
       # expected response.
       - role: user
         content: What is a telephone?
       - role: assistant
         content: >
             A Cthulhuian device that allows you to communicate with
             otherworldly beings. It is said that the first telephone was
             created by the Great Old Ones themselves, and that it is a
             gateway to the void beyond the stars.             

If you now start Platypush and ask a question like "how does it work?", the voice assistant may give a response along the lines of:

The telephone functions by harnessing the eldritch energies of the cosmos to
transmit vibrations through the ether, allowing communication across vast
distances with entities from beyond the veil. Its operation is shrouded in
mystery, for it relies on arcane principles incomprehensible to mortal
minds.

Note that:

  1. The style of the response is consistent with that initialized in the context through system roles.

  2. Even though a question like "how does it work?" is not very specific, the assistant treats the user/assistant entries given in the context as if they were the latest conversation prompts. Thus it realizes that "it", in this context, probably means "the telephone".

  3. The assistant has a runtime context. It will remember the recent conversations for a given amount of time (configurable through the context_expiry setting on the openai plugin configuration). So, even without explicit context initialization in the openai plugin, the plugin will remember the last interactions for (by default) 10 minutes. So if you ask "who wrote the Divine Comedy?", and a few seconds later you ask "where was its writer from?", you may get a response like "Florence, Italy" - i.e. the assistant realizes that "the writer" in this context is likely to mean "the writer of the work that I was asked about in the previous interaction" and return pertinent information.

Pros

  • 👍 Speech detection quality. The OpenAI speech-to-text features are the best among the available assistant integrations. The transcribe API so far has detected my non-native English accent right nearly 100% of the times (Google comes close to 90%, while Picovoice trails quite behind). And it even detects the speech of my young kid - something that the Google Assistant library has always failed to do right.

  • 👍 Text-to-speech quality. The voice models used by OpenAI sound much more natural and human than those of both Google and Picovoice. Google's and Picovoice's TTS models are actually already quite solid, but OpenAI outclasses them when it comes to voice modulation, inflections and sentiment. The result sounds intimidatingly realistic.

  • 👍 AI responses quality. While the scope of the Google Assistant is somewhat limited by what people expected from voice assistants until a few years ago (control some devices and gadgets, find my phone, tell me the news/weather, do basic Google searches...), usually without much room for follow-ups, assistant.openai will basically render voice responses as if you were typing them directly to ChatGPT. While Google would often respond you with a "sorry, I don't understand", or "sorry, I can't help with that", the OpenAI assistant is more likely to expose its reasoning, ask follow-up questions to refine its understanding, and in general create a much more realistic conversation.

  • 👍 Contexts. They are an extremely powerful way to initialize your assistant and customize it to speak the way you want, and know the kind of things that you want it to know. Cross-conversation contexts with configurable expiry also make it more natural to ask something, get an answer, and then ask another question about the same topic a few seconds later, without having to reintroduce the assistant to the whole context.

  • 👍 Offline transcriptions available through the openai.transcribe action.

  • 👍 Multi-language support seems to work great out of the box. Ask something to the assistant in any language, and it'll give you a response in that language.

  • 👍 Configurable voices and models.

Cons

  • 👎 The full pack of features is only available if you have an API key associated to a paid OpenAI account.

  • 👎 No hotword support. It relies on assistant.picovoice for hotword detection.

  • 👎 No intents support.

  • 👎 No native support for weather forecast, alarms, timers, integrations with other services/devices nor other features available out of the box with the Google Assistant. You can always create hooks for them though.

Weather forecast example

Both the OpenAI and Picovoice integrations lack some features available out of the box on the Google Assistant - weather forecast, news playback, timers etc. - as they rely on voice-only APIs that by default don't connect to other services.

However Platypush provides many plugins to fill those gaps, and those features can be implemented with custom event hooks.

Let's see for example how to build a simple hook that delivers the weather forecast for the next 24 hours whenever the assistant gets a phrase that contains the "weather today" string.

You'll need to enable a weather plugin in Platypush - weather.openweathermap will be used in this example. Configuration:

weather.openweathermap:
  token: OPENWEATHERMAP_API_KEY
  location: London,GB

Then drop a script named e.g. weather.py in the Platypush scripts directory (default: <CONFDIR>/scripts) with the following content:

from datetime import datetime
from textwrap import dedent
from time import time

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='weather today')
def weather_forecast(event: SpeechRecognizedEvent):
    limit = time() + 24 * 60 * 60  # 24 hours from now
    forecast = [
        weather
        for weather in run("weather.openweathermap.get_forecast")
        if datetime.fromisoformat(weather["time"]).timestamp() < limit
    ]

    min_temp = round(
        min(weather["temperature"] for weather in forecast)
    )
    max_temp = round(
        max(weather["temperature"] for weather in forecast)
    )
    max_wind_gust = round(
        (max(weather["wind_gust"] for weather in forecast)) * 3.6
    )
    summaries = [weather["summary"] for weather in forecast]
    most_common_summary = max(summaries, key=summaries.count)
    avg_cloud_cover = round(
        sum(weather["cloud_cover"] for weather in forecast) / len(forecast)
    )

    event.assistant.render_response(
        dedent(
            f"""
            The forecast for today is: {most_common_summary}, with
            a minimum of {min_temp} and a maximum of {max_temp}
            degrees, wind gust of {max_wind_gust} km/h, and an
            average cloud cover of {avg_cloud_cover}%.
            """
        )
    )

This script will work with any of the available voice assistants.

You can also implement something similar for news playback, for example using the rss plugin to get the latest items in your subscribed feeds. Or to create custom alarms using the alarm plugin, or a timer using the utils.set_timeout action.

Conclusions

The past few years have seen a lot of things happen in the voice industry. Many products have gone out of market, been deprecated or sunset, but not all hope is lost. The OpenAI and Picovoice products, especially when combined together, can still provide a good out-of-the-box voice assistant experience. And the OpenAI products have also raised the bar on what to expect from an AI-based assistant.

I wish that there were still some fully open and on-device alternatives out there, now that Mycroft, Snowboy and DeepSpeech are all gone. OpenAI and Google provide the best voice experience as of now, but of course they come with trade-offs - namely the great amount of data points you feed to these cloud-based services. Picovoice is somewhat a trade-off, as it runs at least partly on-device, but their business model is still a bit fuzzy and it's not clear whether they intend to have their products used by the wider public or if it's mostly B2B.

I'll keep an eye however on what is going to come from the ashes of Mycroft under the form of the OpenConversational project, and probably keep you up-to-date when there is a new integration to share.