From 1d90d5a317abfed9103b92a84d06d82542a3befe Mon Sep 17 00:00:00 2001
From: Fabio Manganiello
Date: Mon, 3 Jun 2024 13:08:57 +0200
Subject: [PATCH] Added new article on voice assistants

---
 ...of-voice-assistant-integrations-in-2024.md | 1253 +++++++++++++++++
 1 file changed, 1253 insertions(+)
 create mode 100644 markdown/The-state-of-voice-assistant-integrations-in-2024.md

diff --git a/markdown/The-state-of-voice-assistant-integrations-in-2024.md b/markdown/The-state-of-voice-assistant-integrations-in-2024.md
new file mode 100644
index 0000000..dc6b2da
--- /dev/null
+++ b/markdown/The-state-of-voice-assistant-integrations-in-2024.md
@@ -0,0 +1,1253 @@
+[//]: # (title: The state of voice assistant integrations in 2024)
+[//]: # (description: How to use Platypush to build your voice assistants. Featuring Google, OpenAI and Picovoice.)
+[//]: # (image: https://platypush-static.s3.nl-ams.scw.cloud/images/voice-assistant-2.png)
+[//]: # (author: Fabio Manganiello )
+[//]: # (published: 2024-06-02)

Those who have been following my blog or used Platypush for a while probably
know that I've put quite some effort into getting voice assistants right over
the past few years.

I built my first (very primitive) voice assistant that used DCT+Markov models
[back in 2008](https://github.com/blacklight/Voxifera), when the concept was
still pretty much a science fiction novelty.

Then I wrote [an article in
2019](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush)
and [one in
2020](https://blog.platypush.tech/article/Build-custom-voice-assistants) on how
to use several voice integrations in [Platypush](https://platypush.tech) to
create custom voice assistants.

## Everyone in those pictures is now dead

Quite a few things have changed in this industry niche since I wrote my
previous article. Most of the solutions that I covered back in the day,
unfortunately, are gone one way or another:

- The `assistant.snowboy` integration is gone because unfortunately [Snowboy is
  gone](https://github.com/Kitt-AI/snowboy). For a while you could still run
  the Snowboy code with models that you had either previously downloaded from
  their website or trained yourself, but my latest attempt proved quite
  unfruitful - it's been more than 4 years since the last commit on Snowboy,
  and it's hard to get the code to even run.

- The `assistant.alexa` integration is also gone, as Amazon [has stopped
  maintaining the AVS SDK](https://github.com/alexa/avs-device-sdk). And I have
  literally no clue what Amazon's plans for the development of Alexa skills
  are (if there are any plans at all).

- The `stt.deepspeech` integration is also gone: [the project hasn't seen a
  commit in 3 years](https://github.com/mozilla/DeepSpeech) and I even
  struggled to get the latest code to run. Given the current financial
  situation at Mozilla, and the fact that they're trying to cut as much as
  possible of what they don't consider part of their core product, it's
  very unlikely that DeepSpeech will be revived any time soon.

- The `assistant.google` integration [is still
  there](https://docs.platypush.tech/platypush/plugins/assistant.google.html),
  but I can't make promises on how long it can be maintained. It uses the
  [`google-assistant-library`](https://pypi.org/project/google-assistant-library/),
  which was [deprecated in
  2019](https://developers.google.com/assistant/sdk/release-notes).
Google + replaced it with the [conversational + actions](https://developers.google.com/assistant/sdk/), which [was also + deprecated last year](https://developers.google.com/assistant/ca-sunset). + ``Put here your joke about Google building products with the shelf life + of a summer hit.`` + +- The `tts.mimic3` integration, a text model based on + [mimic3](https://github.com/MycroftAI/mimic3), part of the + [Mycroft](https://en.wikipedia.org/wiki/Mycroft_(software)) initiative, [is + still there](https://docs.platypush.tech/platypush/plugins/tts.mimic3.html), + but only because it's still possible to [spin up a Docker + image](https://hub.docker.com/r/mycroftai/mimic3) that runs mimic3. The whole + Mycroft project, however, [is now + defunct](https://community.openconversational.ai/t/update-from-the-ceo-part-1/13268), + and [the story of how it went + bankrupt](https://www.reuters.com/legal/transactional/appeals-court-says-judge-favored-patent-plaintiff-scorched-earth-case-2022-03-04/) + is a very sad story about the power that patent trolls have on startups. The + Mycroft initiative however seems to [have been picked up by the + community](https://community.openconversational.ai/), and something seems to + move in the space of fully open source and on-device voice models. I'll + definitely be looking with interest at what happens in that space, but the + project seems to be at a stage that is still a bit immature to justify an + investment into a new Platypush integration. + +## But not all hope is lost + +### `assistant.google` + +`assistant.google` may be relying on a dead library, but it's not dead (yet). +The code still works, but you're a bit constrained on the hardware side - the +assistant library only supports x86_64 and ARMv7 (namely, only Raspberry Pi 3 +and 4). No ARM64 (i.e. no Raspberry Pi 5), and even running it on other +ARMv7-compatible devices has proved to be a challenge in some cases. Given the +state of the library, it's safe to say that it'll never be supported on other +platforms, but if you want to run your assistant on a device that is still +supported then it should still work fine. + +I had however to do a few dirty packaging tricks to ensure that the assistant +library code doesn't break badly on newer versions of Python. That code hasn't +been touched in 5 years and it's starting to rot. It depends on ancient and +deprecated Python libraries like [`enum34`](https://pypi.org/project/enum34/) +and it needs some hammering to work - without breaking the whole Python +environment in the process. + +For now, `pip install 'platypush[assistant.google]'` should do all the dirty +work and get all of your assistant dependencies installed. But I can't promise +I can maintain that code forever. + +### `assistant.picovoice` + +Picovoice has been a nice surprise in an industry niche where all the +products that were available just 4 years ago are now dead. + +I described some of their products [in my previous +articles](https://blog.platypush.tech/article/Build-custom-voice-assistants), +and I even built a couple of `stt.picovoice.*` plugins for Platypush back in +the day, but I didn't really put much effort in it. + +Their business model seemed a bit weird - along the lines of "you can test our +products on x86_64, if you need an ARM build you should contact us as a +business partner". And the quality of their products was also a bit +disappointing compared to other mainstream offerings. + +I'm glad to see that the situation has changed quite a bit now. 
They still have
a "sign up with a business email" model, but at least now you can just sign up
on their website and start using their products rather than sending emails
around. And I'm also quite impressed by the progress on their website. You
can now train hotword models, customize speech-to-text models and build your
own intent rules directly from their website - a feature that was also
available in the beloved Snowboy and that went missing from all major product
offerings out there after Snowboy was gone. I feel like the quality of their
models has also greatly improved compared to the last time I checked them -
predictions are still slower than the Google Assistant's, and definitely less
accurate with non-native accents, but the gap with the Google Assistant when it
comes to native accents isn't very wide.

### `assistant.openai`

OpenAI has filled many gaps left by all the casualties in the voice assistant
market. Platypush now provides a new `assistant.openai` plugin that stitches
together several of their APIs to provide a voice assistant experience that
honestly feels much more natural than anything I've tried in all these years.

Let's explore how to use these integrations to build our on-device voice
assistant with custom rules.

## Feature comparison

As some of you may know, voice assistants often aren't monolithic products.
Unless explicitly designed as all-in-one packages (like the
`google-assistant-library`), voice assistant integrations in Platypush are
usually built on top of four distinct APIs:

1. **Hotword detection**: This is the component that continuously listens on
   your microphone until you speak "Ok Google", "Alexa" or any other wake-up
   word used to start a conversation. Since it's a continuously listening
   component that needs to take decisions fast, and it only has to recognize
   one word (or in a few cases 3-4 more at most), it usually doesn't need to
   run a full language model. It needs small models, often no bigger than a
   couple of MB.

2. **Speech-to-text** (*STT*): This is the component that will capture audio
   from the microphone and use some API to transcribe it to text.

3. **Response engine**: Once you have the transcription of what the user said,
   you need to feed it to some model that will generate a human-like
   response to the question.

4. **Text-to-speech** (*TTS*): Once you have your AI response rendered as a
   text string, you need a text-to-speech model to speak it out loud on your
   speakers or headphones.

On top of these basic building blocks for a voice assistant, some integrations
may also provide two extra features.

#### Speech-to-intent

In this mode, the user's prompt, instead of being transcribed directly to text,
is transcribed into a structured *intent* that can be more easily processed by
a downstream integration, with no need for extra text parsing, regular
expressions etc.

For instance, a voice command like "*turn off the bedroom lights*" could be
translated into an intent such as:

```json
{
  "intent": "lights_ctrl",
  "slots": {
    "state": "off",
    "lights": "bedroom"
  }
}
```

#### Offline speech-to-text

a.k.a. *offline text transcriptions*. Some assistant integrations may offer you
the ability to pass an audio file and transcribe its content as text.
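
For instance, with the Picovoice integration covered later in this article, an
offline transcription could be requested from a regular event hook. The snippet
below is just a minimal sketch - the file path and the trigger phrase are
made-up placeholders, and the full `transcribe` example (including its output
format) follows in the Picovoice section:

```python
from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent

# Hypothetical hook: transcribe a pre-recorded audio file on voice request
# and read the result back through the assistant's TTS engine.
@when(SpeechRecognizedEvent, phrase='transcribe my voice memo')
def transcribe_voice_memo(event: SpeechRecognizedEvent):
    result = run(
        'assistant.picovoice.transcribe',
        audio_file='/path/to/memo.mp3',  # placeholder path
    )
    event.assistant.render_response(result.get('transcription', ''))
```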

### Features summary

This table summarizes how the `assistant` integrations available in Platypush
compare when it comes to what I would call the *foundational* blocks:

| Plugin                | Hotword | STT | AI responses | TTS |
| --------------------- | ------- | --- | ------------ | --- |
| `assistant.google`    | ✅      | ✅  | ✅           | ✅  |
| `assistant.openai`    | ❌      | ✅  | ✅           | ✅  |
| `assistant.picovoice` | ✅      | ✅  | ❌           | ✅  |

And this is how they compare in terms of extra features:

| Plugin                | Intents | Offline STT |
| --------------------- | ------- | ----------- |
| `assistant.google`    | ❌      | ❌          |
| `assistant.openai`    | ❌      | ✅          |
| `assistant.picovoice` | ✅      | ✅          |

Let's see a few configuration examples to better understand the pros and cons
of each of these integrations.

## Configuration

### Hardware requirements

1. A computer, a Raspberry Pi, an old tablet, or anything in between, as long
   as it can run Python. At least 1 GB of RAM is advised for a smooth audio
   processing experience.

2. A microphone.

3. Speaker/headphones.

### Installation notes

[Platypush
1.0.0](https://git.platypush.tech/platypush/platypush/src/branch/master/CHANGELOG.md#1-0-0-2024-05-26)
has [recently been
released](https://blog.platypush.tech/article/Platypush-1.0-is-out), and [new
installation procedures](https://docs.platypush.tech/wiki/Installation.html)
have come with it.

There's now official support for [several package
managers](https://docs.platypush.tech/wiki/Installation.html#system-package-manager-installation),
a better [Docker installation
process](https://docs.platypush.tech/wiki/Installation.html#docker), and more
powerful ways to [install
plugins](https://docs.platypush.tech/wiki/Plugins-installation.html) - via
[`pip` extras](https://docs.platypush.tech/wiki/Plugins-installation.html#pip),
[Web
interface](https://docs.platypush.tech/wiki/Plugins-installation.html#web-interface),
[Docker](https://docs.platypush.tech/wiki/Plugins-installation.html#docker) and
[virtual
environments](https://docs.platypush.tech/wiki/Plugins-installation.html#virtual-environment).

The optional dependencies for any Platypush plugin can be installed via `pip`
extras in the simplest case:

```
$ pip install 'platypush[plugin1,plugin2,...]'
```

For example, if you want to install Platypush with the dependencies for
`assistant.openai` and `assistant.picovoice`:

```
$ pip install 'platypush[assistant.openai,assistant.picovoice]'
```

Some plugins however may require extra system dependencies that are not
available via `pip` - for instance, both the OpenAI and Picovoice integrations
require the `ffmpeg` binary to be installed, as it is used for audio
conversion and exporting purposes. You can check the [plugins
documentation](https://docs.platypush.tech) for any system dependencies
required by some integrations, or install them automatically through the Web
interface or the `platydock` command for Docker containers.

### A note on the hooks

All the custom actions in this article are built through event hooks triggered
by
[`SpeechRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent)
(or
[`IntentRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.IntentRecognizedEvent)
for intents). When an intent event is triggered, or a speech event with a
condition on a phrase, the `assistant` integrations in Platypush will prevent
the default assistant response. That's to avoid cases where e.g. you say "*turn
off the lights*", your hook takes care of running the actual action, while your
voice assistant fetches a response from Google or ChatGPT along the lines of
"*sorry, I can't control your lights*".

If you want to render a custom response from an event hook, you can do so by
calling `event.assistant.render_response(text)`, and it will be spoken using
the available text-to-speech integration.

If you want to disable this behaviour, and you want the default assistant
response to always be rendered, even if the event matches a hook with a phrase
or an intent condition, you can do so by setting the
`stop_conversation_on_speech_match` parameter to `false` in your assistant
plugin configuration.

### Text-to-speech

Each of the available `assistant` plugins has its own default `tts` plugin:

- `assistant.google`:
  [`tts`](https://docs.platypush.tech/platypush/plugins/tts.html), but
  [`tts.google`](https://docs.platypush.tech/platypush/plugins/tts.google.html)
  is also available. The difference is that `tts` uses the (unofficial) Google
  Translate frontend API - it requires no extra configuration, but besides
  setting the input language it isn't very configurable. `tts.google` on the
  other hand uses the [Google Cloud Translation
  API](https://cloud.google.com/translate/docs/reference/rest/). It is much
  more versatile, but it requires an extra API registered to your Google
  project and an extra credentials file.

- `assistant.openai`:
  [`tts.openai`](https://docs.platypush.tech/platypush/plugins/tts.openai.html),
  which leverages the [OpenAI
  text-to-speech API](https://platform.openai.com/docs/guides/text-to-speech).

- `assistant.picovoice`:
  [`tts.picovoice`](https://docs.platypush.tech/platypush/plugins/tts.picovoice.html),
  which uses the (still experimental, at the time of writing) [Picovoice Orca
  engine](https://github.com/Picovoice/orca).

Any text rendered via `assistant*.render_response` will be rendered using the
associated TTS plugin. You can however customize it by setting `tts_plugin` on
your assistant plugin configuration - e.g. you can render responses from the
OpenAI assistant through the Google or Picovoice engine, or the other way
around.

`tts` plugins also expose a `say` action that can be called outside of an
assistant context to render custom text at runtime - for example, from other
[event
hooks](https://docs.platypush.tech/wiki/Quickstart.html#turn-on-the-lights-when-i-say-so),
[procedures](https://docs.platypush.tech/wiki/Quickstart.html#greet-me-with-lights-and-music-when-i-come-home),
[cronjobs](https://docs.platypush.tech/wiki/Quickstart.html#turn-off-the-lights-at-1-am)
or [API calls](https://docs.platypush.tech/wiki/APIs.html). For example:

```bash
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "tts.openai.say",
  "args": {
    "text": "What a wonderful day!"
+ } +} +' http://localhost:8008/execute +``` + + +### `assistant.google` + +- [**Plugin documentation**](https://docs.platypush.tech/platypush/plugins/assistant.google.html) +- `pip` installation: `pip install 'platypush[assistant.google]'` + +This is the oldest voice integration in Platypush - and one of the use-cases +that actually motivated me into forking the [previous +project](https://github.com/blacklight/evesp) into what is now Platypush. + +As mentioned in the previous section, this integration is built on top of a +deprecated library (with no available alternatives) that just so happens to +still work with a bit of hammering on x86_64 and Raspberry Pi 3/4. + +Personally it's the voice assistant I still use on most of my devices, but it's +definitely not guaranteed that it will keep working in the future. + +Once you have installed Platypush with the dependencies for this integration, +you can configure it through these steps: + +1. Create a new project on the [Google developers + console](https://console.cloud.google.com) and [generate a new set of + credentials for it](https://console.cloud.google.com/apis/credentials). + Download the credentials secrets as JSON. +2. Generate [scoped + credentials](https://developers.google.com/assistant/sdk/guides/library/python/embed/install-sample#generate_credentials) + from your `secrets.json`. +3. Configure the integration in your `config.yaml` for Platypush (see the + [configuration + page](https://docs.platypush.tech/wiki/Configuration.html#configuration-file) + for more details): + +```yaml +assistant.google: + # Default: ~/.config/google-oauthlib-tool/credentials.json + # or /credentials/google/assistant.json + credentials_file: /path/to/credentials.json + # Default: no sound is played when "Ok Google" is detected + conversation_start_sound: /path/to/sound.mp3 +``` + +Restart the service, say "Ok Google" or "Hey Google" while the microphone is +active, and everything should work out of the box. + +You can now start creating event hooks to execute your custom voice commands. +For example, if you configured a lights plugin (e.g. +[`light.hue`](https://docs.platypush.tech/platypush/plugins/light.hue.html)) +and a music plugin (e.g. +[`music.mopidy`](https://docs.platypush.tech/platypush/plugins/music.mopidy.html)), +you can start building voice commands like these: + +```python +# Content of e.g. /path/to/config_yaml/scripts/assistant.py + +from platypush import run, when +from platypush.events.assistant import ( + ConversationStartEvent, SpeechRecognizedEvent +) + +light_plugin = "light.hue" +music_plugin = "music.mopidy" + +@when(ConversationStartEvent) +def pause_music_when_conversation_starts(): + run(f"{music_plugin}.pause_if_playing") + +# Note: (limited) support for regular expressions on `phrase` +# This hook will match any phrase containing either "turn on the lights" +# or "turn off the lights" +@when(SpeechRecognizedEvent, phrase="turn on (the?) lights") +def lights_on_command(): + run(f"{light_plugin}.on") + # Or, with arguments: + # run(f"{light_plugin}.on", groups=["Bedroom"]) + +@when(SpeechRecognizedEvent, phrase="turn off (the?) lights") +def lights_off_command(): + run(f"{light_plugin}.off") + +@when(SpeechRecognizedEvent, phrase="play (the?) music") +def play_music_command(): + run(f"{music_plugin}.play") + +@when(SpeechRecognizedEvent, phrase="stop (the?) 
music") +def stop_music_command(): + run(f"{music_plugin}.stop") +``` + +Or, via YAML: + +```yaml +# Add to your config.yaml, or to one of the files included in it + +event.hook.pause_music_when_conversation_starts: + if: + type: platypush.message.event.ConversationStartEvent + + then: + - action: music.mopidy.pause_if_playing + +event.hook.lights_on_command: + if: + type: platypush.message.event.SpeechRecognizedEvent + phrase: "turn on (the)? lights" + + then: + - action: light.hue.on + # args: + # groups: + # - Bedroom + +event.hook.lights_off_command: + if: + type: platypush.message.event.SpeechRecognizedEvent + phrase: "turn off (the)? lights" + + then: + - action: light.hue.off + +event.hook.play_music_command: + if: + type: platypush.message.event.SpeechRecognizedEvent + phrase: "play (the)? music" + + then: + - action: music.mopidy.play + +event.hook.stop_music_command: + if: + type: platypush.message.event.SpeechRecognizedEvent + phrase: "stop (the)? music" + + then: + - action: music.mopidy.stop +``` + +Parameters are also supported on the `phrase` event argument through the `${}` template construct. For example: + +```python +from platypush import when, run +from platypush.events.assistant import SpeechRecognizedEvent + +@when(SpeechRecognizedEvent, phrase='play ${title} by ${artist}') +def on_play_track_command( + event: SpeechRecognizedEvent, title: str, artist: str +): + results = run( + "music.mopidy.search", + filter={"title": title, "artist": artist} + ) + + if not results: + event.assistant.render_response(f"Couldn't find {title} by {artist}") + return + + run("music.mopidy.play", resource=results[0]["uri"]) +``` + +#### Pros + +- 👍 Very fast and robust API. +- 👍 Easy to install and configure. +- 👍 It comes with almost all the features of a voice assistant installed on + Google hardware - except some actions native to Android-based devices and + video/display features. This means that features such as timers, alarms, + weather forecast, setting the volume or controlling Chromecasts on the same + network are all supported out of the box. +- 👍 It connects to your Google account (can be configured from your Google + settings), so things like location-based suggestions and calendar events are + available. Support for custom actions and devices configured in your Google + Home app is also available out of the box, although I haven't tested it in a + while. +- 👍 Good multi-language support. In most of the cases the assistant seems + quite capable of understanding questions in multiple language and respond in + the input language without any further configuration. + +#### Cons + +- 👎 Based on a deprecated API that could break at any moment. +- 👎 Limited hardware support (only x86_64 and RPi 3/4). +- 👎 Not possible to configure the hotword - only "Ok/Hey Google" is available. +- 👎 Not possible to configure the output voice - it can only use the stock + Google Assistant voice. +- 👎 No support for intents - something similar was available (albeit tricky to + configure) through the Actions SDK, but that has also been abandoned by + Google. +- 👎 Not very modular. Both `assistant.picovoice` and `assistant.openai` have + been built by stitching together different independent APIs. Those plugins + are therefore quite *modular*. You can choose for instance to run only the + hotword engine of `assistant.picovoice`, which in turn will trigger the + conversation engine of `assistant.openai`, and maybe use `tts.google` to + render the responses. 
  By contrast, given the relatively monolithic nature of
  `google-assistant-library`, which runs the whole service locally, if your
  instance runs `assistant.google` then it can't run other assistant plugins.

### `assistant.picovoice`

- [**Plugin
  documentation**](https://docs.platypush.tech/platypush/plugins/assistant.picovoice.html)
- `pip` installation: `pip install 'platypush[assistant.picovoice]'`

The `assistant.picovoice` integration is available from [Platypush
1.0.0](https://git.platypush.tech/platypush/platypush/src/branch/master/CHANGELOG.md#1-0-0-2024-05-26).

Previous versions had some outdated `stt.picovoice.*` plugins for the
individual products, but they weren't properly tested and they weren't combined
together into a single integration that implements the Platypush `assistant`
API.

This integration is built on top of the voice products developed by
[Picovoice](https://picovoice.ai/). These include:

- [**Porcupine**](https://picovoice.ai/platform/porcupine/): a fast and
  customizable engine for hotword/wake-word detection. It can be enabled by
  setting `hotword_enabled` to `true` in the `assistant.picovoice` plugin
  configuration.

- [**Cheetah**](https://picovoice.ai/docs/cheetah/): a speech-to-text engine
  optimized for real-time transcriptions. It can be enabled by setting
  `stt_enabled` to `true` in the `assistant.picovoice` plugin configuration.

- [**Leopard**](https://picovoice.ai/docs/leopard/): a speech-to-text engine
  optimized for offline transcriptions of audio files.

- [**Rhino**](https://picovoice.ai/docs/rhino/): a speech-to-intent engine.

- [**Orca**](https://picovoice.ai/docs/orca/): a text-to-speech engine.

You can get your personal access key by signing up at the [Picovoice
console](https://console.picovoice.ai/). You may be asked to submit a reason
for using the service (feel free to mention a personal Platypush integration),
and you will then receive your access key.

If prompted to select the products you want to use, make sure to select all
the ones from the Picovoice suite that you plan to use with the
`assistant.picovoice` plugin.

A basic plugin configuration would look like this:

```yaml
assistant.picovoice:
  access_key: YOUR_ACCESS_KEY

  # Keywords that the assistant should listen for
  keywords:
    - alexa
    - computer
    - ok google

  # Paths to custom keyword files
  # keyword_paths:
  #   - ~/.local/share/picovoice/keywords/linux/custom_linux.ppn

  # Enable/disable the hotword engine
  hotword_enabled: true
  # Enable the STT engine
  stt_enabled: true

  # conversation_start_sound: ...

  # Path to a custom model to be used for speech-to-text
  # speech_model_path: ~/.local/share/picovoice/models/cheetah/custom-en.pv

  # Path to an intent model. At least one custom intent model is required if
  # you want to enable intent detection.
  # intent_model_path: ~/.local/share/picovoice/models/rhino/custom-en-x86.rhn
```

#### Hotword detection

If enabled through the `hotword_enabled` parameter (default: True), the
assistant will listen for a specific wake word before starting the
speech-to-text or intent recognition engines. You can specify custom models for
your hotword (e.g. on the same device you may use "Alexa" to trigger the
speech-to-text engine in English, "Computer" to trigger the speech-to-text
engine in Italian, and "Ok Google" to trigger the intent recognition engine).
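
For instance, a minimal sketch of such a multi-hotword routing (the hotwords
and the model path below are just illustrative placeholders, and it assumes
that `start_conversation_on_hotword` is set to `false`, as described a few
paragraphs below, so that the hooks have full control over which model is
used):

```python
from platypush import when
from platypush.events.assistant import HotwordDetectedEvent

# "Alexa" -> start a conversation with the default (English) speech model
@when(HotwordDetectedEvent, hotword='alexa')
def on_english_hotword(event: HotwordDetectedEvent):
    event.assistant.start_conversation()

# "Computer" -> start a conversation with a custom Italian speech model
@when(HotwordDetectedEvent, hotword='computer')
def on_italian_hotword(event: HotwordDetectedEvent):
    event.assistant.start_conversation(model_file='/path/to/cheetah-it.pv')
```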

You can also create your custom hotword models using the [Porcupine
console](https://console.picovoice.ai/ppn).

If `hotword_enabled` is set to True, you must also specify the `keywords`
parameter with the list of keywords that you want to listen for, and optionally
the `keyword_paths` parameter with the paths to any custom hotword models
that you want to use. If `hotword_enabled` is set to False, then the assistant
won't start listening for speech after the plugin is started, and you will need
to programmatically start the conversation by calling the
`assistant.picovoice.start_conversation` action.

When a wake-word is detected, the assistant will emit a
[`HotwordDetectedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent)
that you can use to build your custom logic.

By default, the assistant will start listening for speech after the hotword if
either `stt_enabled` or `intent_model_path` are set. If you don't want the
assistant to start listening for speech after the hotword is detected (for
example because you want to build your custom response flows, or trigger the
speech detection using different models depending on the hotword that is used,
or because you just want to detect hotwords but not speech), then you can also
set the `start_conversation_on_hotword` parameter to `false`. If that is the
case, then you can programmatically start the conversation by calling the
`assistant.picovoice.start_conversation` method in your event hooks:

```python
from platypush import when, run
from platypush.message.event.assistant import HotwordDetectedEvent

# Start a conversation using the Italian language model when the
# "Buongiorno" hotword is detected
@when(HotwordDetectedEvent, hotword='Buongiorno')
def on_it_hotword_detected(event: HotwordDetectedEvent):
    event.assistant.start_conversation(model_file='path/to/it.pv')
```

#### Speech-to-text

If you want to build your custom STT hooks, the approach is the same as the
one seen for the `assistant.google` plugin - create an event hook on
[`SpeechRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent)
with a given exact phrase, regex or template.

#### Speech-to-intent

*Intents* are structured actions parsed from unstructured human-readable text.

Unlike with hotword and speech-to-text detection, you need to provide a
custom model for intent detection. You can create your custom model using
the [Rhino console](https://console.picovoice.ai/rhn).

When an intent is detected, the assistant will emit an
[`IntentRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.IntentRecognizedEvent)
and you can build your custom hooks on it.

For example, you can build a model to control groups of smart lights by
defining the following slots on the Rhino console:

- ``device_state``: The new state of the device (e.g. with ``on`` or
  ``off`` as supported values)

- ``room``: The name of the room associated with the group of lights to
  be controlled (e.g. ``living room``, ``kitchen``, ``bedroom``)

You can then define a ``lights_ctrl`` intent with the following expressions:

- "*turn ``$device_state:state`` the lights*"
- "*turn ``$device_state:state`` the ``$room:room`` lights*"
- "*turn the lights ``$device_state:state``*"
- "*turn the ``$room:room`` lights ``$device_state:state``*"
- "*turn ``$room:room`` lights ``$device_state:state``*"

This intent will match any of the following phrases:

- "*turn on the lights*"
- "*turn off the lights*"
- "*turn the lights on*"
- "*turn the lights off*"
- "*turn on the living room lights*"
- "*turn off the living room lights*"
- "*turn the living room lights on*"
- "*turn the living room lights off*"

And it will report any slots matched in the phrase in the
[`IntentRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.IntentRecognizedEvent).

Train the model, download the context file, and pass its path via the
``intent_model_path`` parameter.

You can then register a hook to listen to a specific intent:

```python
from platypush import when, run
from platypush.events.assistant import IntentRecognizedEvent

@when(IntentRecognizedEvent, intent='lights_ctrl', slots={'state': 'on'})
def on_turn_on_lights(event: IntentRecognizedEvent):
    room = event.slots.get('room')
    if room:
        run("light.hue.on", groups=[room])
    else:
        run("light.hue.on")
```

Note that if both `stt_enabled` and `intent_model_path` are set, then
both the speech-to-text and intent recognition engines will run in parallel
when a conversation is started.

The intent engine is usually faster, as it has a smaller set of intents to
match and doesn't have to run a full speech-to-text transcription. This means that,
if an utterance matches both a speech-to-text phrase and an intent, the
`IntentRecognizedEvent` event is emitted (and not `SpeechRecognizedEvent`).

This may not always be the case though. So, if you want to use the intent
detection engine together with the speech detection, it may be a good practice
to also provide a fallback `SpeechRecognizedEvent` hook to catch the text if
the speech is not recognized as an intent:

```python
from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='turn ${state} (the)? ${room} lights?')
def on_lights_command(event: SpeechRecognizedEvent, state: str, room: str = '', **context):
    # Use the parsed `state` token ("on"/"off") to pick the right action
    action = f"light.hue.{state}"
    if room:
        run(action, groups=[room])
    else:
        run(action)
```

#### Text-to-speech and response management

The text-to-speech engine, based on Orca, is provided by the
[`tts.picovoice`](https://docs.platypush.tech/platypush/plugins/tts.picovoice.html)
plugin.

However, the Picovoice integration won't provide you with automatic
AI-generated responses for your queries. That's because Picovoice doesn't seem
to offer (yet) any products for conversational assistants, either voice-based
or text-based.

You can however leverage the `render_response` action to render some text as
speech in response to a user command, and that in turn will leverage the
Picovoice TTS plugin to render the response.

For example, the following snippet provides a hook that:

- Listens for `SpeechRecognizedEvent`.

- Matches the phrase against a list of predefined commands that shouldn't
  require an AI-generated response.

- Has fallback logic that leverages `openai.get_response` to generate a
  response through a ChatGPT model and render it as audio.

Also, note that any text rendered via the `render_response` action that ends
with a question mark will automatically trigger a follow-up - i.e. the
assistant will wait for the user to answer its question.

```python
import re

from platypush import run, when
from platypush.message.event.assistant import SpeechRecognizedEvent

# All command handlers share the same (event, **kwargs) signature, so they
# can be dispatched uniformly from the hook below
def play_music(event, **kwargs):
    run("music.mopidy.play")

def stop_music(event, **kwargs):
    run("music.mopidy.stop")

def ai_assist(event: SpeechRecognizedEvent, **kwargs):
    response = run("openai.get_response", prompt=event.phrase)
    if not response:
        return

    run("assistant.picovoice.render_response", text=response)

# List of commands to match, as pairs of regex patterns and the
# corresponding actions
hooks = (
    (re.compile(r"play (the )?music", re.IGNORECASE), play_music),
    (re.compile(r"stop (the )?music", re.IGNORECASE), stop_music),
    # Fallback to the AI assistant
    (re.compile(r".*"), ai_assist),
)

@when(SpeechRecognizedEvent)
def on_speech_recognized(event, **kwargs):
    for pattern, command in hooks:
        if pattern.search(event.phrase):
            run("logger.info", msg=f"Running voice command: {command.__name__}")
            command(event, **kwargs)
            break
```

#### Offline speech-to-text

An [`assistant.picovoice.transcribe`
action](https://docs.platypush.tech/platypush/plugins/assistant.picovoice.html#platypush.plugins.assistant.picovoice.AssistantPicovoicePlugin.transcribe)
is provided for offline transcriptions of audio files, using the Leopard
models.

You can easily call it from your procedures, hooks or through the API:

```bash
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "assistant.picovoice.transcribe",
  "args": {
    "audio_file": "/path/to/some/speech.mp3"
  }
}' http://localhost:8008/execute

{
  "transcription": "This is a test",
  "words": [
    {
      "word": "this",
      "start": 0.06400000303983688,
      "end": 0.19200000166893005,
      "confidence": 0.9626294374465942
    },
    {
      "word": "is",
      "start": 0.2879999876022339,
      "end": 0.35199999809265137,
      "confidence": 0.9781675934791565
    },
    {
      "word": "a",
      "start": 0.41600000858306885,
      "end": 0.41600000858306885,
      "confidence": 0.9764975309371948
    },
    {
      "word": "test",
      "start": 0.5120000243186951,
      "end": 0.8320000171661377,
      "confidence": 0.9511580467224121
    }
  ]
}
```

#### Pros

- 👍 The Picovoice integration is extremely configurable. `assistant.picovoice`
  stitches together five independent products developed by a small company
  specialized in voice products for developers. As such, Picovoice may be the
  best option if you have custom use-cases. You can pick which features you
  need (hotword, speech-to-text, speech-to-intent, text-to-speech...) and you
  have plenty of flexibility in building your integrations.

- 👍 Runs (or seems to run) (mostly) on device. This is something that we can't
  say about the other two integrations discussed in this article. If keeping
  your voice interactions 100% hidden from Google's or OpenAI's eyes is a
  priority, then Picovoice may be your best bet.

- 👍 Rich features. It uses different models for different purposes - for
  example, Cheetah models are optimized for real-time speech detection, while
  Leopard is optimized for offline transcription.
Moreover, Picovoice is the + only integration among those analyzed in this article to support + speech-to-intent. + +- 👍 It's very easy to build new models or customize existing ones. Picovoice + has a powerful developers console that allows you to easily create hotword + models, tweak the priority of some words in voice models, and create custom + intent models. + +#### Cons + +- 👎 The business model is still a bit weird. It's better than the earlier + "*write us an email with your business case and we'll reach back to you*", + but it still requires you to sign up with a business email and write a couple + of lines on what you want to build with their products. It feels like their + focus is on a B2B approach rather than "open up and let the community build + stuff", and that seems to create unnecessary friction. + +- 👎 No native conversational features. At the time of writing, Picovoice + doesn't offer products that generate AI responses given voice or text + prompts. This means that, if you want AI-generated responses to your queries, + you'll have to do requests to e.g. + [`openai.get_response(prompt)`](https://docs.platypush.tech/platypush/plugins/openai.html#platypush.plugins.openai.OpenaiPlugin.get_response) + directly in your hooks for `SpeechRecognizedEvent`, and render the responses + through `assistant.picovoice.render_response`. This makes the use of + `assistant.picovoice` alone more fit to cases where you want to mostly create + voice command hooks rather than have general-purpose conversations. + +- 👎 Speech-to-text, at least on my machine, is slower than the other two + integrations, and the accuracy with non-native accents is also much lower. + +- 👎 Limited support for any languages other than English. At the time of + writing hotword detection with Porcupine seems to be in a relative good shape + with [support for 16 + languages](https://github.com/Picovoice/porcupine/tree/master/lib/common). + However, both speech-to-text and text-to-speech only support English at the + moment. + +- 👎 Some APIs are still quite unstable. The Orca text-to-speech API, for + example, doesn't even support text that includes digits or some punctuation + characters - at least not at the time of writing. The Platypush integration + fills the gap with workarounds that e.g. replace words to numbers and replace + punctuation characters, but you definitely have a feeling that some parts of + their products are still work in progress. + +### `assistant.openai` + +- [**Plugin + documentation**](https://docs.platypush.tech/platypush/plugins/assistant.openai.html) +- `pip` installation: `pip install 'platypush[assistant.openai]'` + +This integration has been released in [Platypush +1.0.7](https://git.platypush.tech/platypush/platypush/src/branch/master/CHANGELOG.md#1-0-7-2024-06-02). + +It uses the following OpenAI APIs: + +- [`/audio/transcriptions`](https://platform.openai.com/docs/guides/speech-to-text) + for speech-to-text. At the time of writing the default model is `whisper-1`. + It can be configured through the `model` setting on the `assistant.openai` + plugin configuration. See the [OpenAI + documentation](https://platform.openai.com/docs/models/whisper) for a list of + available models. +- [`/chat/completions`](https://platform.openai.com/docs/api-reference/completions/create) + to get AI-generated responses using a GPT model. At the time of writing the + default is `gpt-3.5-turbo`, but it can be configurable through the `model` + setting on the `openai` plugin configuration. 
See the [OpenAI + documentation](https://platform.openai.com/docs/models) for a list of supported models. +- [`/audio/speech`](https://platform.openai.com/docs/guides/text-to-speech) for + text-to-speech. At the time of writing the default model is `tts-1` and the + default voice is `nova`. They can be configured through the `model` and + `voice` settings respectively on the `tts.openai` plugin. See the OpenAI + documentation for a list of available + [models](https://platform.openai.com/docs/models/tts) and + [voices](https://platform.openai.com/docs/guides/text-to-speech/voice-options). + +You will need an [OpenAI API key](https://platform.openai.com/api-keys) +associated to your account. + +A basic configuration would like this: + +```yaml +openai: + api_key: YOUR_OPENAI_API_KEY # Required + # conversation_start_sound: ... + # model: ... + # context: ... + # context_expiry: ... + # max_tokens: ... + +assistant.openai: + # model: ... + # tts_plugin: some.other.tts.plugin + +tts.openai: + # model: ... + # voice: ... +``` + +If you want to build your custom hooks on speech events, the approach is the +same seen for the other `assistant` plugins - create an event hook on +[`SpeechRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent) +with a given exact phrase, regex or template. + +#### Hotword support + +OpenAI doesn't provide an API for hotword detection, nor a small model for +offline detection. + +This means that, if no other `assistant` plugins with stand-alone hotword +support are configured (only `assistant.picovoice` for now), a conversation can +only be triggered by calling the `assistant.openai.start_conversation` action. + +If you want hotword support, then the best bet is to add `assistant.picovoice` +to your configuration too - but make sure to only enable hotword detection and +not speech detection, which will be delegated to `assistant.openai` via event +hook: + +```yaml +assistant.picovoice: + access_key: ... + keywords: + - computer + + hotword_enabled: true + stt_enabled: false + # conversation_start_sound: ... +``` + +Then create a hook that listens for +[`HotwordDetectedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent) +and calls `assistant.openai.start_conversation`: + +```python +from platypush import run, when +from platypush.events.assistant import HotwordDetectedEvent + +@when(HotwordDetectedEvent, hotword="computer") +def on_hotword_detected(): + run("assistant.openai.start_conversation") +``` + +#### Conversation contexts + +The most powerful feature offered by the OpenAI assistant is the fact that it +leverages the *conversation contexts* provided by the OpenAI API. + +This means two things: + +1. Your assistant can be initialized/tuned with a *static context*. It is + possible to provide some initialization context to the assistant that can + fine tune how the assistant will behave, (e.g. what kind of + tone/language/approach will have when generating the responses), as well as + initialize the assistant with some predefined knowledge in the form of + hypothetical past conversations. Example: + + ```yaml + openai: + ... + + context: + # `system` can be used to initialize the context for the expected tone + # and language in the assistant responses + - role: system + content: > + You are a voice assistant that responds to user queries using + references to Lovecraftian lore. 
+ + # `user`/`assistant` interactions can be used to initialize the + # conversation context with previous knowledge. `user` is used to + # emulate previous user questions, and `assistant` models the + # expected response. + - role: user + content: What is a telephone? + - role: assistant + content: > + A Cthulhuian device that allows you to communicate with + otherworldly beings. It is said that the first telephone was + created by the Great Old Ones themselves, and that it is a + gateway to the void beyond the stars. + ``` + + If you now start Platypush and ask a question like "*how does it work?*", + the voice assistant may give a response along the lines of: + + ``` + The telephone functions by harnessing the eldritch energies of the cosmos to + transmit vibrations through the ether, allowing communication across vast + distances with entities from beyond the veil. Its operation is shrouded in + mystery, for it relies on arcane principles incomprehensible to mortal + minds. + ``` + + Note that: + + 1. The style of the response is consistent with that initialized in the + `context` through `system` roles. + + 2. Even though a question like "*how does it work?*" is not very specific, + the assistant treats the `user`/`assistant` entries given in the context + as if they were the latest conversation prompts. Thus it realizes that + "*it*", in this context, probably means "*the telephone*". + +2. The assistant has a *runtime context*. It will remember the recent + conversations for a given amount of time (configurable through the + `context_expiry` setting on the `openai` plugin configuration). So, even + without explicit context initialization in the `openai` plugin, the plugin + will remember the last interactions for (by default) 10 minutes. So if you + ask "*who wrote the Divine Comedy?*", and a few seconds later you ask + "*where was its writer from?*", you may get a response like "*Florence, + Italy*" - i.e. the assistant realizes that "*the writer*" in this context is + likely to mean "*the writer of the work that I was asked about in the + previous interaction*" and return pertinent information. + +#### Pros + +- 👍 Speech detection quality. The OpenAI speech-to-text features are the best + among the available `assistant` integrations. The `transcribe` API so far has + detected my non-native English accent right nearly 100% of the times (Google + comes close to 90%, while Picovoice trails quite behind). And it even detects + the speech of my young kid - something that the Google Assistant library has + always failed to do right. + +- 👍 Text-to-speech quality. The voice models used by OpenAI sound much more + natural and human than those of both Google and Picovoice. Google's and + Picovoice's TTS models are actually already quite solid, but OpenAI + outclasses them when it comes to voice modulation, inflections and sentiment. + The result sounds intimidatingly realistic. + +- 👍 AI responses quality. While the scope of the Google Assistant is somewhat + limited by what people expected from voice assistants until a few years ago + (control some devices and gadgets, find my phone, tell me the news/weather, + do basic Google searches...), usually without much room for follow-ups, + `assistant.openai` will basically render voice responses as if you were + typing them directly to ChatGPT. 
While Google would often respond you with a + "*sorry, I don't understand*", or "*sorry, I can't help with that*", the + OpenAI assistant is more likely to expose its reasoning, ask follow-up + questions to refine its understanding, and in general create a much more + realistic conversation. + +- 👍 Contexts. They are an extremely powerful way to initialize your assistant + and customize it to speak the way you want, and know the kind of things that + you want it to know. Cross-conversation contexts with configurable expiry + also make it more natural to ask something, get an answer, and then ask + another question about the same topic a few seconds later, without having to + reintroduce the assistant to the whole context. + +- 👍 Offline transcriptions available through the `openai.transcribe` action. + +- 👍 Multi-language support seems to work great out of the box. Ask something + to the assistant in any language, and it'll give you a response in that + language. + +- 👍 Configurable voices and models. + +#### Cons + +- 👎 The full pack of features is only available if you have an API key + associated to a paid OpenAI account. + +- 👎 No hotword support. It relies on `assistant.picovoice` for hotword + detection. + +- 👎 No intents support. + +- 👎 No native support for weather forecast, alarms, timers, integrations with + other services/devices nor other features available out of the box with the + Google Assistant. You can always create hooks for them though. + +### Weather forecast example + +Both the OpenAI and Picovoice integrations lack some features available out of +the box on the Google Assistant - weather forecast, news playback, timers etc. - +as they rely on voice-only APIs that by default don't connect to other services. + +However Platypush provides many plugins to fill those gaps, and those features +can be implemented with custom event hooks. + +Let's see for example how to build a simple hook that delivers the weather +forecast for the next 24 hours whenever the assistant gets a phrase that +contains the "*weather today*" string. + +You'll need to enable a `weather` plugin in Platypush - +[`weather.openweathermap`](https://docs.platypush.tech/platypush/plugins/weather.openweathermap.html) +will be used in this example. Configuration: + +```yaml +weather.openweathermap: + token: OPENWEATHERMAP_API_KEY + location: London,GB +``` + +Then drop a script named e.g. 
`weather.py` in the Platypush scripts directory +(default: `/scripts`) with the following content: + +```python +from datetime import datetime +from textwrap import dedent +from time import time + +from platypush import run, when +from platypush.events.assistant import SpeechRecognizedEvent + +@when(SpeechRecognizedEvent, phrase='weather today') +def weather_forecast(event: SpeechRecognizedEvent): + limit = time() + 24 * 60 * 60 # 24 hours from now + forecast = [ + weather + for weather in run("weather.openweathermap.get_forecast") + if datetime.fromisoformat(weather["time"]).timestamp() < limit + ] + + min_temp = round( + min(weather["temperature"] for weather in forecast) + ) + max_temp = round( + max(weather["temperature"] for weather in forecast) + ) + max_wind_gust = round( + (max(weather["wind_gust"] for weather in forecast)) * 3.6 + ) + summaries = [weather["summary"] for weather in forecast] + most_common_summary = max(summaries, key=summaries.count) + avg_cloud_cover = round( + sum(weather["cloud_cover"] for weather in forecast) / len(forecast) + ) + + event.assistant.render_response( + dedent( + f""" + The forecast for today is: {most_common_summary}, with + a minimum of {min_temp} and a maximum of {max_temp} + degrees, wind gust of {max_wind_gust} km/h, and an + average cloud cover of {avg_cloud_cover}%. + """ + ) + ) +``` + +This script will work with any of the available voice assistants. + +You can also implement something similar for news playback, for example using +the [`rss` plugin](https://docs.platypush.tech/platypush/plugins/rss.html) to +get the latest items in your subscribed feeds. Or to create custom alarms using +the [`alarm` plugin](https://docs.platypush.tech/platypush/plugins/alarm.html), +or a timer using the [`utils.set_timeout` +action](https://docs.platypush.tech/platypush/plugins/utils.html#platypush.plugins.utils.UtilsPlugin.set_timeout). + +## Conclusions + +The past few years have seen a lot of things happen in the voice industry. +Many products have gone out of market, been deprecated or sunset, but not all +hope is lost. The OpenAI and Picovoice products, especially when combined +together, can still provide a good out-of-the-box voice assistant experience. +And the OpenAI products have also raised the bar on what to expect from an +AI-based assistant. + +I wish that there were still some fully open and on-device alternatives out +there, now that Mycroft, Snowboy and DeepSpeech are all gone. OpenAI and Google +provide the best voice experience as of now, but of course they come with +trade-offs - namely the great amount of data points you feed to these +cloud-based services. Picovoice is somewhat a trade-off, as it runs at least +partly on-device, but their business model is still a bit fuzzy and it's not +clear whether they intend to have their products used by the wider public or if +it's mostly B2B. + +I'll keep an eye however on what is going to come from the ashes of Mycroft +under the form of the +[OpenConversational](https://community.openconversational.ai/) project, and +probably keep you up-to-date when there is a new integration to share.