[//]: # (title: The state of voice assistant integrations in 2024)
[//]: # (description: How to use Platypush to build your voice assistants. Featuring Google, OpenAI and Picovoice.)
[//]: # (image: https://platypush-static.s3.nl-ams.scw.cloud/images/voice-assistant-2.png)
[//]: # (author: Fabio Manganiello <fabio@platypush.tech>)
[//]: # (published: 2024-06-02)
Those who have been following my blog or used Platypush for a while probably
know that I've put quite some effort into getting voice assistants right over
the past few years.
I built my first (very primitive) voice assistant that used DCT+Markov models
[back in 2008](https://github.com/blacklight/Voxifera), when the concept was
still pretty much a science fiction novelty.
Then I wrote [an article in
2019](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush)
and [one in
2020](https://blog.platypush.tech/article/Build-custom-voice-assistants) on how
to use several voice integrations in [Platypush](https://platypush.tech) to
create custom voice assistants.
## Everyone in those pictures is now dead
Quite a few things have changed in this industry niche since I wrote my
previous article. Most of the solutions that I covered back in the day,
unfortunately, are gone in one way or another:
- The `assistant.snowboy` integration is gone because unfortunately [Snowboy is
gone](https://github.com/Kitt-AI/snowboy). For a while you could still run
the Snowboy code with models that you had either previously downloaded from
their website or trained yourself, but my latest attempts proved
quite unfruitful - it's been more than 4 years since the last commit on
Snowboy, and it's hard to get the code to even run.
- The `assistant.alexa` integration is also gone, as Amazon [has stopped
maintaining the AVS SDK](https://github.com/alexa/avs-device-sdk). And I have
literally no clue what Amazon's plans for the development of Alexa skills
are (if there are any plans at all).
- The `stt.deepspeech` integration is also gone: [the project hasn't seen a
commit in 3 years](https://github.com/mozilla/DeepSpeech) and I even
struggled to get the latest code to run. Given the current financial
situation at Mozilla, and the fact that they're trying to cut as much as
possible on what they don't consider part of their core product, it's
very unlikely that DeepSpeech will be revived any time soon.
- The `assistant.google` integration [is still
there](https://docs.platypush.tech/platypush/plugins/assistant.google.html),
but I can't make promises on how long it can be maintained. It uses the
[`google-assistant-library`](https://pypi.org/project/google-assistant-library/),
which was [deprecated in
2019](https://developers.google.com/assistant/sdk/release-notes). Google
replaced it with [conversational
actions](https://developers.google.com/assistant/sdk/), which [were also
deprecated last year](https://developers.google.com/assistant/ca-sunset).
`<rant>`Put here your joke about Google building products with the shelf life
of a summer hit.`</rant>`
- The `tts.mimic3` integration, a text-to-speech plugin based on
[mimic3](https://github.com/MycroftAI/mimic3), part of the
[Mycroft](https://en.wikipedia.org/wiki/Mycroft_(software)) initiative, [is
still there](https://docs.platypush.tech/platypush/plugins/tts.mimic3.html),
but only because it's still possible to [spin up a Docker
image](https://hub.docker.com/r/mycroftai/mimic3) that runs mimic3. The whole
Mycroft project, however, [is now
defunct](https://community.openconversational.ai/t/update-from-the-ceo-part-1/13268),
and [the story of how it went
bankrupt](https://www.reuters.com/legal/transactional/appeals-court-says-judge-favored-patent-plaintiff-scorched-earth-case-2022-03-04/)
is a sad testament to the power that patent trolls hold over startups. The
Mycroft initiative, however, seems to [have been picked up by the
community](https://community.openconversational.ai/), and things seem to be
moving in the space of fully open source, on-device voice models. I'll
definitely be watching that space with interest, but the project still seems
a bit too immature to justify investing in a new Platypush integration.
## But not all hope is lost
### `assistant.google`
`assistant.google` may be relying on a dead library, but it's not dead (yet).
The code still works, but you're a bit constrained on the hardware side - the
assistant library only supports x86_64 and ARMv7 (namely, only Raspberry Pi 3
and 4). No ARM64 (i.e. no Raspberry Pi 5), and even running it on other
ARMv7-compatible devices has proved to be a challenge in some cases. Given the
state of the library, it's safe to say that it'll never be supported on other
platforms, but if you want to run your assistant on a device that is still
supported then it should still work fine.
However, I had to pull a few dirty packaging tricks to ensure that the
assistant library code doesn't break badly on newer versions of Python. That
code hasn't been touched in 5 years and it's starting to rot. It depends on ancient and
deprecated Python libraries like [`enum34`](https://pypi.org/project/enum34/)
and it needs some hammering to work - without breaking the whole Python
environment in the process.
For now, `pip install 'platypush[assistant.google]'` should do all the dirty
work and get all of your assistant dependencies installed. But I can't promise
I can maintain that code forever.
### `assistant.picovoice`
Picovoice has been a nice surprise in an industry niche where all the
products that were available just 4 years ago are now dead.
I described some of their products [in my previous
articles](https://blog.platypush.tech/article/Build-custom-voice-assistants),
and I even built a couple of `stt.picovoice.*` plugins for Platypush back in
the day, but I didn't really put much effort into them.
Their business model seemed a bit weird - along the lines of "you can test our
products on x86_64, if you need an ARM build you should contact us as a
business partner". And the quality of their products was also a bit
disappointing compared to other mainstream offerings.
I'm glad to see that the situation has changed quite a bit now. They still have
a "sign up with a business email" model, but at least now you can just sign up
on their website and start using their products rather than sending emails
around. And I'm also quite impressed by the progress on their website. You
can now train hotword models, customize speech-to-text models and build your
own intent rules directly from the console - a feature that was also
available in the beloved Snowboy and that went missing from every major
product offering out there after Snowboy was gone. I feel like the quality of
their models has also greatly improved since the last time I checked -
predictions are still slower than the Google Assistant's, and definitely less
accurate with non-native accents, but the gap with the Google Assistant when
it comes to native accents isn't very wide.
### `assistant.openai`
OpenAI has filled many gaps left by all the casualties in the voice assistants
market. Platypush now provides a new `assistant.openai` plugin that stitches
together several of their APIs to provide a voice assistant experience that
honestly feels much more natural than anything I've tried in all these years.
Let's explore how to use these integrations to build our on-device voice
assistant with custom rules.
## Feature comparison
As some of you may know, voice assistants often aren't monolithic products.
Unless explicitly designed as all-in-one packages (like the
`google-assistant-library`), voice assistant integrations in Platypush are
usually built on top of four distinct APIs:
1. **Hotword detection**: This is the component that continuously listens on
your microphone until you say "Ok Google", "Alexa" or any other wake-up
word used to start a conversation. Since it's a continuously listening
component that needs to make decisions fast, and it only has to recognize
one word (or, in a few cases, 3-4 more at most), it usually doesn't need to
run a full language model. Small models, often no more than a couple of MBs
in size, are enough.
2. **Speech-to-text** (*STT*): This is the component that will capture audio
from the microphone and use some API to transcribe it to text.
3. **Response engine**: Once you have the transcription of what the user said,
you need to feed it to some model that will generate a human-like
response to the question.
4. **Text-to-speech** (*TTS*): Once you have your AI response rendered as a
text string, you need a text-to-speech model to speak it out loud on your
speakers or headphones.
On top of these basic building blocks for a voice assistant, some integrations
may also provide two extra features.
#### Speech-to-intent
In this mode, the user's prompt, instead of being transcribed directly to text,
is transcribed into a structured *intent* that can be more easily processed by
a downstream integration with no need for extra text parsing, regular
expressions etc.
For instance, a voice command like "*turn off the bedroom lights*" could be
translated into an intent such as:
```json
{
  "intent": "lights_ctrl",
  "slots": {
    "state": "off",
    "lights": "bedroom"
  }
}
```
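A downstream hook can then dispatch on the structured payload directly, with no text parsing. As a purely illustrative sketch (the `dispatch_intent` function below is hypothetical, not a Platypush API):

```python
# Hypothetical dispatcher for an intent payload shaped like the example above.
# No regexes or text parsing: the intent name and slots are already structured.
def dispatch_intent(payload: dict) -> str:
    intent = payload.get("intent")
    slots = payload.get("slots", {})

    if intent == "lights_ctrl":
        # Default to all lights if no specific group was captured
        target = slots.get("lights", "all")
        return f"lights/{target}/{slots['state']}"

    raise ValueError(f"unhandled intent: {intent}")

print(dispatch_intent({
    "intent": "lights_ctrl",
    "slots": {"state": "off", "lights": "bedroom"},
}))  # lights/bedroom/off
```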
#### Offline speech-to-text
a.k.a. *offline text transcriptions*. Some assistant integrations may offer
the ability to pass an audio file and have its content transcribed to text.
### Features summary
This table summarizes how the `assistant` integrations available in Platypush
compare when it comes to what I would call the *foundational* blocks:
| Plugin | Hotword | STT | AI responses | TTS |
| --------------------- | ------- | --- | ------------ | --- |
| `assistant.google` | ✅ | ✅ | ✅ | ✅ |
| `assistant.openai` | ❌ | ✅ | ✅ | ✅ |
| `assistant.picovoice` | ✅ | ✅ | ❌ | ✅ |
And this is how they compare in terms of extra features:
| Plugin | Intents | Offline STT |
| --------------------- | ------- | ------------|
| `assistant.google` | ❌ | ❌ |
| `assistant.openai` | ❌ | ✅ |
| `assistant.picovoice` | ✅ | ✅ |
Let's see a few configuration examples to better understand the pros and cons
of each of these integrations.
## Configuration
### Hardware requirements
1. A computer, a Raspberry Pi, an old tablet, or anything in between, as long
as it can run Python. At least 1GB of RAM is advised for a smooth audio
processing experience.
2. A microphone.
3. Speaker/headphones.
### Installation notes
[Platypush
1.0.0](https://git.platypush.tech/platypush/platypush/src/branch/master/CHANGELOG.md#1-0-0-2024-05-26)
has [recently been
released](https://blog.platypush.tech/article/Platypush-1.0-is-out), and [new
installation procedures](https://docs.platypush.tech/wiki/Installation.html)
with it.
There's now official support for [several package
managers](https://docs.platypush.tech/wiki/Installation.html#system-package-manager-installation),
a better [Docker installation
process](https://docs.platypush.tech/wiki/Installation.html#docker), and more
powerful ways to [install
plugins](https://docs.platypush.tech/wiki/Plugins-installation.html) - via
[`pip` extras](https://docs.platypush.tech/wiki/Plugins-installation.html#pip),
[Web
interface](https://docs.platypush.tech/wiki/Plugins-installation.html#web-interface),
[Docker](https://docs.platypush.tech/wiki/Plugins-installation.html#docker) and
[virtual
environments](https://docs.platypush.tech/wiki/Plugins-installation.html#virtual-environment).
The optional dependencies for any Platypush plugin can be installed via `pip`
extras in the simplest case:
```bash
$ pip install 'platypush[plugin1,plugin2,...]'
```
For example, if you want to install Platypush with the dependencies for
`assistant.openai` and `assistant.picovoice`:
```bash
$ pip install 'platypush[assistant.openai,assistant.picovoice]'
```
Some plugins however may require extra system dependencies that are not
available via `pip` - for instance, both the OpenAI and Picovoice integrations
require the `ffmpeg` binary to be installed, as it is used for audio
conversion and exporting purposes. You can check the [plugins
documentation](https://docs.platypush.tech) for any system dependencies
required by some integrations, or install them automatically through the Web
interface or the `platydock` command for Docker containers.
### A note on the hooks
All the custom actions in this article are built through event hooks triggered
by
[`SpeechRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent)
(or
[`IntentRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.IntentRecognizedEvent)
for intents). When an intent event is triggered, or a speech event with a
condition on a phrase, the `assistant` integrations in Platypush will prevent
the default assistant response. That's to avoid cases where e.g. you say "*turn
off the lights*", your hook takes care of running the actual action, while your
voice assistant fetches a response from Google or ChatGPT along the lines of
"*sorry, I can't control your lights*".
If you want to render a custom response from an event hook, you can do so by
calling `event.assistant.render_response(text)`, and it will be spoken using
the available text-to-speech integration.
If you want to disable this behaviour, and you want the default assistant
response to always be rendered, even if it matches a hook with a phrase or an
intent, you can do so by setting the `stop_conversation_on_speech_match`
parameter to `false` in your assistant plugin configuration.
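For example (shown here for the OpenAI plugin; the same parameter applies to the other `assistant` plugins):

```yaml
assistant.openai:
  # Always render the default assistant response, even when
  # a hook with a matching phrase or intent is triggered
  stop_conversation_on_speech_match: false
```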
### Text-to-speech
Each of the available `assistant` plugins has its own default `tts` plugin:
- `assistant.google`:
[`tts`](https://docs.platypush.tech/platypush/plugins/tts.html), but
[`tts.google`](https://docs.platypush.tech/platypush/plugins/tts.google.html)
is also available. The difference is that `tts` uses the (unofficial) Google
Translate frontend API - it requires no extra configuration, but besides
setting the input language it isn't very configurable. `tts.google` on the
other hand uses the [Google Cloud Translation
API](https://cloud.google.com/translate/docs/reference/rest/). It is much
more versatile, but it requires an extra API registered to your Google
project and an extra credentials file.
- `assistant.openai`:
[`tts.openai`](https://docs.platypush.tech/platypush/plugins/tts.openai.html),
which leverages the [OpenAI
text-to-speech API](https://platform.openai.com/docs/guides/text-to-speech).
- `assistant.picovoice`:
[`tts.picovoice`](https://docs.platypush.tech/platypush/plugins/tts.picovoice.html),
which uses the (still experimental, at the time of writing) [Picovoice Orca
engine](https://github.com/Picovoice/orca).
Any text rendered via `assistant*.render_response` will be rendered using the
associated TTS plugin. You can however customize it by setting `tts_plugin` on
your assistant plugin configuration - e.g. you can render responses from the
OpenAI assistant through the Google or Picovoice engine, or the other way
around.
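For instance, a sketch of a configuration that renders the OpenAI assistant's responses through the Picovoice TTS engine via the `tts_plugin` parameter (the `access_key` here is assumed to be your Picovoice key, as for the other Picovoice plugins):

```yaml
assistant.openai:
  tts_plugin: tts.picovoice

tts.picovoice:
  access_key: YOUR_PICOVOICE_ACCESS_KEY
```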
`tts` plugins also expose a `say` action that can be called outside of an
assistant context to render custom text at runtime - for example, from other
[event
hooks](https://docs.platypush.tech/wiki/Quickstart.html#turn-on-the-lights-when-i-say-so),
[procedures](https://docs.platypush.tech/wiki/Quickstart.html#greet-me-with-lights-and-music-when-i-come-home),
[cronjobs](https://docs.platypush.tech/wiki/Quickstart.html#turn-off-the-lights-at-1-am)
or [API calls](https://docs.platypush.tech/wiki/APIs.html). For example:
```bash
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "tts.openai.say",
  "args": {
    "text": "What a wonderful day!"
  }
}
' http://localhost:8008/execute
```
### `assistant.google`
- [**Plugin documentation**](https://docs.platypush.tech/platypush/plugins/assistant.google.html)
- `pip` installation: `pip install 'platypush[assistant.google]'`
This is the oldest voice integration in Platypush - and one of the use-cases
that actually motivated me to fork the [previous
project](https://github.com/blacklight/evesp) into what is now Platypush.
As mentioned in the previous section, this integration is built on top of a
deprecated library (with no available alternatives) that just so happens to
still work with a bit of hammering on x86_64 and Raspberry Pi 3/4.
Personally it's the voice assistant I still use on most of my devices, but it's
definitely not guaranteed that it will keep working in the future.
Once you have installed Platypush with the dependencies for this integration,
you can configure it through these steps:
1. Create a new project on the [Google developers
console](https://console.cloud.google.com) and [generate a new set of
credentials for it](https://console.cloud.google.com/apis/credentials).
Download the credentials secrets as JSON.
2. Generate [scoped
credentials](https://developers.google.com/assistant/sdk/guides/library/python/embed/install-sample#generate_credentials)
from your `secrets.json`.
3. Configure the integration in your `config.yaml` for Platypush (see the
[configuration
page](https://docs.platypush.tech/wiki/Configuration.html#configuration-file)
for more details):
```yaml
assistant.google:
  # Default: ~/.config/google-oauthlib-tool/credentials.json
  # or <PLATYPUSH_WORKDIR>/credentials/google/assistant.json
  credentials_file: /path/to/credentials.json

  # Default: no sound is played when "Ok Google" is detected
  conversation_start_sound: /path/to/sound.mp3
```
Restart the service, say "Ok Google" or "Hey Google" while the microphone is
active, and everything should work out of the box.
You can now start creating event hooks to execute your custom voice commands.
For example, if you configured a lights plugin (e.g.
[`light.hue`](https://docs.platypush.tech/platypush/plugins/light.hue.html))
and a music plugin (e.g.
[`music.mopidy`](https://docs.platypush.tech/platypush/plugins/music.mopidy.html)),
you can start building voice commands like these:
```python
# Content of e.g. /path/to/config_yaml/scripts/assistant.py
from platypush import run, when
from platypush.events.assistant import (
    ConversationStartEvent, SpeechRecognizedEvent
)

light_plugin = "light.hue"
music_plugin = "music.mopidy"

@when(ConversationStartEvent)
def pause_music_when_conversation_starts():
    run(f"{music_plugin}.pause_if_playing")

# Note: (limited) support for regular expressions on `phrase`.
# This hook will match any phrase containing either "turn on the lights"
# or "turn off the lights"
@when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
def lights_on_command():
    run(f"{light_plugin}.on")
    # Or, with arguments:
    # run(f"{light_plugin}.on", groups=["Bedroom"])

@when(SpeechRecognizedEvent, phrase="turn off (the)? lights")
def lights_off_command():
    run(f"{light_plugin}.off")

@when(SpeechRecognizedEvent, phrase="play (the)? music")
def play_music_command():
    run(f"{music_plugin}.play")

@when(SpeechRecognizedEvent, phrase="stop (the)? music")
def stop_music_command():
    run(f"{music_plugin}.stop")
```
Or, via YAML:
```yaml
# Add to your config.yaml, or to one of the files included in it
event.hook.pause_music_when_conversation_starts:
  if:
    type: platypush.message.event.ConversationStartEvent
  then:
    - action: music.mopidy.pause_if_playing

event.hook.lights_on_command:
  if:
    type: platypush.message.event.SpeechRecognizedEvent
    phrase: "turn on (the)? lights"
  then:
    - action: light.hue.on
      # args:
      #   groups:
      #     - Bedroom

event.hook.lights_off_command:
  if:
    type: platypush.message.event.SpeechRecognizedEvent
    phrase: "turn off (the)? lights"
  then:
    - action: light.hue.off

event.hook.play_music_command:
  if:
    type: platypush.message.event.SpeechRecognizedEvent
    phrase: "play (the)? music"
  then:
    - action: music.mopidy.play

event.hook.stop_music_command:
  if:
    type: platypush.message.event.SpeechRecognizedEvent
    phrase: "stop (the)? music"
  then:
    - action: music.mopidy.stop
```
Parameters are also supported on the `phrase` event argument through the `${}` template construct. For example:
```python
from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='play ${title} by ${artist}')
def on_play_track_command(
    event: SpeechRecognizedEvent, title: str, artist: str
):
    results = run(
        "music.mopidy.search",
        filter={"title": title, "artist": artist}
    )

    if not results:
        event.assistant.render_response(f"Couldn't find {title} by {artist}")
        return

    run("music.mopidy.play", resource=results[0]["uri"])
```
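For intuition, a `${}` template behaves much like a regular expression with named capturing groups. A rough sketch of the idea (not Platypush's actual implementation):

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    # Replace each ${name} token with a lazy named capturing group
    pattern = re.sub(r"\$\{(\w+)\}", r"(?P<\1>.+?)", template)
    return re.compile(f"^{pattern}$", re.IGNORECASE)

rx = template_to_regex("play ${title} by ${artist}")
match = rx.match("play Hey Jude by The Beatles")
print(match.groupdict())  # {'title': 'Hey Jude', 'artist': 'The Beatles'}
```

The extracted groups are then passed to the hook as keyword arguments - `title` and `artist` in the example above.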
#### Pros
- 👍 Very fast and robust API.
- 👍 Easy to install and configure.
- 👍 It comes with almost all the features of a voice assistant installed on
Google hardware - except some actions native to Android-based devices and
video/display features. This means that features such as timers, alarms,
weather forecast, setting the volume or controlling Chromecasts on the same
network are all supported out of the box.
- 👍 It connects to your Google account (can be configured from your Google
settings), so things like location-based suggestions and calendar events are
available. Support for custom actions and devices configured in your Google
Home app is also available out of the box, although I haven't tested it in a
while.
- 👍 Good multi-language support. In most cases the assistant seems
quite capable of understanding questions in multiple languages and responding
in the input language without any further configuration.
#### Cons
- 👎 Based on a deprecated API that could break at any moment.
- 👎 Limited hardware support (only x86_64 and RPi 3/4).
- 👎 Not possible to configure the hotword - only "Ok/Hey Google" is available.
- 👎 Not possible to configure the output voice - it can only use the stock
Google Assistant voice.
- 👎 No support for intents - something similar was available (albeit tricky to
configure) through the Actions SDK, but that has also been abandoned by
Google.
- 👎 Not very modular. Both `assistant.picovoice` and `assistant.openai` have
been built by stitching together different independent APIs. Those plugins
are therefore quite *modular*. You can choose for instance to run only the
hotword engine of `assistant.picovoice`, which in turn will trigger the
conversation engine of `assistant.openai`, and maybe use `tts.google` to
render the responses. By contrast, given the relatively monolithic nature of
`google-assistant-library`, which runs the whole service locally, if your
instance runs `assistant.google` then it can't run other assistant plugins.
### `assistant.picovoice`
- [**Plugin
documentation**](https://docs.platypush.tech/platypush/plugins/assistant.picovoice.html)
- `pip` installation: `pip install 'platypush[assistant.picovoice]'`
The `assistant.picovoice` integration is available from [Platypush
1.0.0](https://git.platypush.tech/platypush/platypush/src/branch/master/CHANGELOG.md#1-0-0-2024-05-26).
Previous versions had some outdated `stt.picovoice.*` plugins for the
individual products, but they weren't properly tested, nor combined
into a single integration implementing the Platypush `assistant` API.
This integration is built on top of the voice products developed by
[Picovoice](https://picovoice.ai/). These include:
- [**Porcupine**](https://picovoice.ai/platform/porcupine/): a fast and
customizable engine for hotword/wake-word detection. It can be enabled by
setting `hotword_enabled` to `true` in the `assistant.picovoice` plugin
configuration.
- [**Cheetah**](https://picovoice.ai/docs/cheetah/): a speech-to-text engine
optimized for real-time transcriptions. It can be enabled by setting
`stt_enabled` to `true` in the `assistant.picovoice` plugin configuration.
- [**Leopard**](https://picovoice.ai/docs/leopard/): a speech-to-text engine
optimized for offline transcriptions of audio files.
- [**Rhino**](https://picovoice.ai/docs/rhino/): a speech-to-intent engine.
- [**Orca**](https://picovoice.ai/docs/orca/): a text-to-speech engine.
You can get a personal access key by signing up at the [Picovoice
console](https://console.picovoice.ai/). You may be asked to submit a reason
for using the service (feel free to mention a personal Platypush integration).
If prompted to select the products you want to use, make sure to select the
ones from the Picovoice suite that you want to use with the
`assistant.picovoice` plugin.
A basic plugin configuration would look like this:
```yaml
assistant.picovoice:
  access_key: YOUR_ACCESS_KEY

  # Keywords that the assistant should listen for
  keywords:
    - alexa
    - computer
    - ok google

  # Paths to custom keyword files
  # keyword_paths:
  #   - ~/.local/share/picovoice/keywords/linux/custom_linux.ppn

  # Enable/disable the hotword engine
  hotword_enabled: true

  # Enable the STT engine
  stt_enabled: true

  # conversation_start_sound: ...

  # Path to a custom model to be used for speech-to-text
  # speech_model_path: ~/.local/share/picovoice/models/cheetah/custom-en.pv

  # Path to an intent model. At least one custom intent model is required if
  # you want to enable intent detection.
  # intent_model_path: ~/.local/share/picovoice/models/rhino/custom-en-x86.rhn
```
#### Hotword detection
If enabled through the `hotword_enabled` parameter (default: True), the
assistant will listen for a specific wake word before starting the
speech-to-text or intent recognition engines. You can specify custom models for
your hotword (e.g. on the same device you may use "Alexa" to trigger the
speech-to-text engine in English, "Computer" to trigger the speech-to-text
engine in Italian, and "Ok Google" to trigger the intent recognition engine).
You can also create your custom hotword models using the [Porcupine
console](https://console.picovoice.ai/ppn).
If `hotword_enabled` is set to True, you must also specify the `keywords`
parameter with the list of keywords that you want to listen for, and optionally
the `keyword_paths` parameter with the paths to any custom hotword models
that you want to use. If `hotword_enabled` is set to False, the assistant
won't start listening for speech after the plugin is started, and you will need
to start conversations programmatically by calling the
`assistant.picovoice.start_conversation` action.
When a wake-word is detected, the assistant will emit a
[`HotwordDetectedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent)
that you can use to build your custom logic.
By default, the assistant will start listening for speech after the hotword if
either `stt_enabled` or `intent_model_path` are set. If you don't want the
assistant to start listening for speech after the hotword is detected (for
example because you want to build your custom response flows, or trigger the
speech detection using different models depending on the hotword that is used,
or because you just want to detect hotwords but not speech), then you can also
set the `start_conversation_on_hotword` parameter to `false`. If that is the
case, then you can programmatically start the conversation by calling the
`assistant.picovoice.start_conversation` method in your event hooks:
```python
from platypush import when, run
from platypush.message.event.assistant import HotwordDetectedEvent

# Start a conversation using the Italian language model when the
# "Buongiorno" hotword is detected
@when(HotwordDetectedEvent, hotword='Buongiorno')
def on_it_hotword_detected(event: HotwordDetectedEvent):
    event.assistant.start_conversation(model_file='path/to/it.pv')
```
#### Speech-to-text
If you want to build your custom STT hooks, the approach is the same as seen
for the `assistant.google` plugin - create an event hook on
[`SpeechRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent)
with a given exact phrase, regex or template.
#### Speech-to-intent
*Intents* are structured actions parsed from unstructured human-readable text.
Unlike with hotword and speech-to-text detection, you need to provide a
custom model for intent detection. You can create your custom model using
the [Rhino console](https://console.picovoice.ai/rhn).
When an intent is detected, the assistant will emit an
[`IntentRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.IntentRecognizedEvent)
and you can build your custom hooks on it.
For example, you can build a model to control groups of smart lights by
defining the following slots on the Rhino console:
- ``device_state``: The new state of the device (e.g. with ``on`` or
``off`` as supported values)
- ``room``: The name of the room associated to the group of lights to
be controlled (e.g. ``living room``, ``kitchen``, ``bedroom``)
You can then define a ``lights_ctrl`` intent with the following expressions:
- "*turn ``$device_state:state`` the lights*"
- "*turn ``$device_state:state`` the ``$room:room`` lights*"
- "*turn the lights ``$device_state:state``*"
- "*turn the ``$room:room`` lights ``$device_state:state``*"
- "*turn ``$room:room`` lights ``$device_state:state``*"
This intent will match any of the following phrases:
- "*turn on the lights*"
- "*turn off the lights*"
- "*turn the lights on*"
- "*turn the lights off*"
- "*turn on the living room lights*"
- "*turn off the living room lights*"
- "*turn the living room lights on*"
- "*turn the living room lights off*"
And it will extract any slots that are matched in the phrases in the
[`IntentRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.IntentRecognizedEvent).
Train the model, download the context file, and pass the path on the
``intent_model_path`` parameter.
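To get an intuition of what the trained context does, the `lights_ctrl` expressions above can be approximated with plain regexes. This is purely illustrative - Rhino uses a trained on-device model, not regexes:

```python
import re

# Rough regex approximation of the lights_ctrl expressions above
_PATTERNS = [
    re.compile(r"^turn (?P<state>on|off)( the)?( (?P<room>\w+( \w+)*))? lights$"),
    re.compile(r"^turn( the)?( (?P<room>\w+( \w+)*))? lights (?P<state>on|off)$"),
]

def parse_lights_ctrl(phrase: str):
    for rx in _PATTERNS:
        m = rx.match(phrase.lower())
        if m:
            # Drop unmatched optional slots, as Rhino also only reports
            # the slots actually present in the utterance
            slots = {k: v for k, v in m.groupdict().items() if v}
            return {"intent": "lights_ctrl", "slots": slots}

    return None  # Not recognized as a lights_ctrl intent

print(parse_lights_ctrl("turn off the living room lights"))
```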
You can then register a hook to listen to a specific intent:
```python
from platypush import when, run
from platypush.events.assistant import IntentRecognizedEvent

@when(IntentRecognizedEvent, intent='lights_ctrl', slots={'state': 'on'})
def on_turn_on_lights(event: IntentRecognizedEvent):
    room = event.slots.get('room')

    if room:
        run("light.hue.on", groups=[room])
    else:
        run("light.hue.on")
```
Note that if both `stt_enabled` and `intent_model_path` are set, then
both the speech-to-text and intent recognition engines will run in parallel
when a conversation is started.
The intent engine is usually faster, as it has a smaller set of intents to
match and doesn't have to run a full speech-to-text transcription. This means that,
if an utterance matches both a speech-to-text phrase and an intent, the
`IntentRecognizedEvent` event is emitted (and not `SpeechRecognizedEvent`).
This may not always be the case, though. So, if you want to use the intent
detection engine together with speech detection, it may be good practice
to also provide a fallback `SpeechRecognizedEvent` hook to catch the text if
the speech is not recognized as an intent:
```python
from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='turn ${state} (the)? ${room} lights?')
def on_lights_command(
    event: SpeechRecognizedEvent, state: str, room: str, **context
):
    action = "light.hue.on" if state == "on" else "light.hue.off"

    if room:
        run(action, groups=[room])
    else:
        run(action)
```
#### Text-to-speech and response management
The text-to-speech engine, based on Orca, is provided by the
[`tts.picovoice`](https://docs.platypush.tech/platypush/plugins/tts.picovoice.html)
plugin.
However, the Picovoice integration won't provide you with automatic
AI-generated responses for your queries. That's because Picovoice doesn't seem
to offer (yet) any products for conversational assistants, either voice-based
or text-based.
You can however use the `render_response` action to render some text as
speech in response to a user command; it will in turn use the
Picovoice TTS plugin to speak the response.
For example, the following snippet provides a hook that:
- Listens for `SpeechRecognizedEvent`.
- Matches the phrase against a list of predefined commands that shouldn't
require an AI-generated response.
- Has a fallback logic that leverages `openai.get_response` to generate a
response through a ChatGPT model and render it as audio.
Also note that any text rendered through the `render_response` action that
ends with a question mark will automatically trigger a follow-up - i.e. the
assistant will wait for the user to answer its question.
```python
import re
from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent
def play_music(event, **kwargs):
    run("music.mopidy.play")

def stop_music(event, **kwargs):
    run("music.mopidy.stop")

def ai_assist(event: SpeechRecognizedEvent, **kwargs):
    response = run("openai.get_response", prompt=event.phrase)
    if not response:
        return

    run("assistant.picovoice.render_response", text=response)
# List of commands to match, as pairs of regex patterns and the
# corresponding actions
hooks = (
    (re.compile(r"play (the )?music", re.IGNORECASE), play_music),
    (re.compile(r"stop (the )?music", re.IGNORECASE), stop_music),
# ...
# Fallback to the AI assistant
(re.compile(r".*"), ai_assist),
)
@when(SpeechRecognizedEvent)
def on_speech_recognized(event, **kwargs):
for pattern, command in hooks:
if pattern.search(event.phrase):
run("logger.info", msg=f"Running voice command: {command.__name__}")
command(event, **kwargs)
break
```
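Since the pattern-to-handler table is plain Python, the dispatch logic itself
can be tried outside of Platypush. Below is a stripped-down sketch of the same
first-match-wins logic, with the `run` calls replaced by stubs (all names here
are illustrative):

```python
import re

def play_music(phrase: str) -> str:
    # Stub for run("music.mopidy.play")
    return "play_music"

def stop_music(phrase: str) -> str:
    # Stub for run("music.mopidy.stop")
    return "stop_music"

def ai_assist(phrase: str) -> str:
    # Stub for the ChatGPT fallback
    return f"ai_assist:{phrase}"

# Same structure as the hooks table above: (pattern, handler) pairs,
# with a catch-all fallback as the last entry
hooks = (
    (re.compile(r"play (the )?music", re.IGNORECASE), play_music),
    (re.compile(r"stop (the )?music", re.IGNORECASE), stop_music),
    (re.compile(r".*"), ai_assist),
)

def dispatch(phrase: str) -> str:
    # First matching pattern wins, exactly like the event hook
    for pattern, command in hooks:
        if pattern.search(phrase):
            return command(phrase)
    return ""
```

Keeping the catch-all pattern as the last entry is what makes the AI assistant
a fallback rather than the default.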
#### Offline speech-to-text
An [`assistant.picovoice.transcribe`
action](https://docs.platypush.tech/platypush/plugins/assistant.picovoice.html#platypush.plugins.assistant.picovoice.AssistantPicovoicePlugin.transcribe)
is provided for offline transcriptions of audio files, using the Leopard
models.
You can easily call it from your procedures, hooks or through the API:
```bash
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
"type": "request",
"action": "assistant.picovoice.transcribe",
"args": {
"audio_file": "/path/to/some/speech.mp3"
}
}' http://localhost:8008/execute
{
"transcription": "This is a test",
"words": [
{
"word": "this",
"start": 0.06400000303983688,
"end": 0.19200000166893005,
"confidence": 0.9626294374465942
},
{
"word": "is",
"start": 0.2879999876022339,
"end": 0.35199999809265137,
"confidence": 0.9781675934791565
},
{
"word": "a",
"start": 0.41600000858306885,
"end": 0.41600000858306885,
"confidence": 0.9764975309371948
},
{
"word": "test",
"start": 0.5120000243186951,
"end": 0.8320000171661377,
"confidence": 0.9511580467224121
}
]
}
```
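The `transcription` field is usually all you need, but the per-word timestamps
and confidence scores lend themselves to simple post-processing. For example,
a sketch that flags words below an arbitrary confidence threshold (the sample
data mirrors the response above):

```python
# Abridged sample response from assistant.picovoice.transcribe
response = {
    "transcription": "This is a test",
    "words": [
        {"word": "this", "start": 0.064, "end": 0.192, "confidence": 0.9626},
        {"word": "is", "start": 0.288, "end": 0.352, "confidence": 0.9782},
        {"word": "a", "start": 0.416, "end": 0.416, "confidence": 0.9765},
        {"word": "test", "start": 0.512, "end": 0.832, "confidence": 0.9512},
    ],
}

def low_confidence_words(resp: dict, threshold: float = 0.96) -> list:
    # Return the words whose confidence falls below the given threshold
    return [w["word"] for w in resp["words"] if w["confidence"] < threshold]
```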
#### Pros
- 👍 The Picovoice integration is extremely configurable. `assistant.picovoice`
stitches together five independent products developed by a small company
specialized in voice products for developers. As such, Picovoice may be the
best option if you have custom use-cases. You can pick which features you
need (hotword, speech-to-text, speech-to-intent, text-to-speech...) and you
have plenty of flexibility in building your integrations.
- 👍 Runs (mostly) on-device, or at least appears to. This is something that we
  can't say about the other two integrations discussed in this article. If
  keeping your voice interactions 100% hidden from Google's or OpenAI's eyes is
  a priority, then Picovoice may be your best bet.
- 👍 Rich features. It uses different models for different purposes - for
example, Cheetah models are optimized for real-time speech detection, while
Leopard is optimized for offline transcription. Moreover, Picovoice is the
only integration among those analyzed in this article to support
speech-to-intent.
- 👍 It's very easy to build new models or customize existing ones. Picovoice
has a powerful developers console that allows you to easily create hotword
models, tweak the priority of some words in voice models, and create custom
intent models.
#### Cons
- 👎 The business model is still a bit weird. It's better than the earlier
"*write us an email with your business case and we'll reach back to you*",
but it still requires you to sign up with a business email and write a couple
of lines on what you want to build with their products. It feels like their
focus is on a B2B approach rather than "open up and let the community build
stuff", and that seems to create unnecessary friction.
- 👎 No native conversational features. At the time of writing, Picovoice
doesn't offer products that generate AI responses given voice or text
prompts. This means that, if you want AI-generated responses to your queries,
  you'll have to make requests to e.g.
[`openai.get_response(prompt)`](https://docs.platypush.tech/platypush/plugins/openai.html#platypush.plugins.openai.OpenaiPlugin.get_response)
directly in your hooks for `SpeechRecognizedEvent`, and render the responses
through `assistant.picovoice.render_response`. This makes the use of
`assistant.picovoice` alone more fit to cases where you want to mostly create
voice command hooks rather than have general-purpose conversations.
- 👎 Speech-to-text, at least on my machine, is slower than the other two
integrations, and the accuracy with non-native accents is also much lower.
- 👎 Limited support for languages other than English. At the time of writing,
  hotword detection with Porcupine seems to be in relatively good shape, with
  [support for 16
  languages](https://github.com/Picovoice/porcupine/tree/master/lib/common).
  However, both speech-to-text and text-to-speech only support English at the
  moment.
- 👎 Some APIs are still quite unstable. The Orca text-to-speech API, for
  example, doesn't even support text that includes digits or some punctuation
  characters - at least not at the time of writing. The Platypush integration
  fills the gap with workarounds that e.g. spell out numbers as words and strip
  unsupported punctuation, but you definitely get the feeling that some parts
  of their products are still a work in progress.
### `assistant.openai`
- [**Plugin
documentation**](https://docs.platypush.tech/platypush/plugins/assistant.openai.html)
- `pip` installation: `pip install 'platypush[assistant.openai]'`
This integration has been released in [Platypush
1.0.7](https://git.platypush.tech/platypush/platypush/src/branch/master/CHANGELOG.md#1-0-7-2024-06-02).
It uses the following OpenAI APIs:
- [`/audio/transcriptions`](https://platform.openai.com/docs/guides/speech-to-text)
for speech-to-text. At the time of writing the default model is `whisper-1`.
It can be configured through the `model` setting on the `assistant.openai`
plugin configuration. See the [OpenAI
documentation](https://platform.openai.com/docs/models/whisper) for a list of
available models.
- [`/chat/completions`](https://platform.openai.com/docs/api-reference/completions/create)
to get AI-generated responses using a GPT model. At the time of writing the
  default is `gpt-3.5-turbo`, but it can be configured through the `model`
setting on the `openai` plugin configuration. See the [OpenAI
documentation](https://platform.openai.com/docs/models) for a list of supported models.
- [`/audio/speech`](https://platform.openai.com/docs/guides/text-to-speech) for
text-to-speech. At the time of writing the default model is `tts-1` and the
default voice is `nova`. They can be configured through the `model` and
`voice` settings respectively on the `tts.openai` plugin. See the OpenAI
documentation for a list of available
[models](https://platform.openai.com/docs/models/tts) and
[voices](https://platform.openai.com/docs/guides/text-to-speech/voice-options).
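Under the hood, a `/chat/completions` call boils down to a JSON payload with a
model name and the conversation as a list of role-tagged messages. A minimal
sketch of such a payload (the plugin also prepends any configured context
messages, covered later in this article):

```python
def build_completion_request(prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    # Minimal /chat/completions payload: the model to use and the
    # conversation so far as a list of role-tagged messages
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt},
        ],
    }
```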
You will need an [OpenAI API key](https://platform.openai.com/api-keys)
associated with your account.
A basic configuration would look like this:
```yaml
openai:
api_key: YOUR_OPENAI_API_KEY # Required
# conversation_start_sound: ...
# model: ...
# context: ...
# context_expiry: ...
# max_tokens: ...
assistant.openai:
# model: ...
# tts_plugin: some.other.tts.plugin
tts.openai:
# model: ...
# voice: ...
```
If you want to build your custom hooks on speech events, the approach is the
same seen for the other `assistant` plugins - create an event hook on
[`SpeechRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent)
with a given exact phrase, regex or template.
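As a rough mental model of how the `${token}` templates work, each token can
be thought of as a named capture group in a regex (the actual matcher lives
inside Platypush and is more sophisticated; this is a simplified sketch):

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    # Replace each ${token} with a named capture group. Assumes the
    # rest of the template is plain text with no regex metacharacters.
    pattern = re.sub(r"\$\{(\w+)\}", r"(?P<\1>\\w+)", template)
    return re.compile(f"^{pattern}$", re.IGNORECASE)

def match_phrase(template: str, phrase: str) -> dict:
    # Return the extracted tokens, or an empty dict if there's no match
    m = template_to_regex(template).match(phrase)
    return m.groupdict() if m else {}
```

Matched tokens are then passed to the hook as arguments, like the `room`
parameter in the light-control hooks earlier in this article.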
#### Hotword support
OpenAI doesn't provide an API for hotword detection, nor a small model for
offline detection.
This means that, if no other `assistant` plugins with stand-alone hotword
support are configured (only `assistant.picovoice` for now), a conversation can
only be triggered by calling the `assistant.openai.start_conversation` action.
If you want hotword support, then the best bet is to add `assistant.picovoice`
to your configuration too - but make sure to only enable hotword detection and
not speech detection, which will be delegated to `assistant.openai` via event
hook:
```yaml
assistant.picovoice:
access_key: ...
keywords:
- computer
hotword_enabled: true
stt_enabled: false
# conversation_start_sound: ...
```
Then create a hook that listens for
[`HotwordDetectedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent)
and calls `assistant.openai.start_conversation`:
```python
from platypush import run, when
from platypush.events.assistant import HotwordDetectedEvent
@when(HotwordDetectedEvent, hotword="computer")
def on_hotword_detected():
run("assistant.openai.start_conversation")
```
#### Conversation contexts
The most powerful feature offered by the OpenAI assistant is the fact that it
leverages the *conversation contexts* provided by the OpenAI API.
This means two things:
1. Your assistant can be initialized/tuned with a *static context*. You can
   provide some initialization context that fine-tunes how the assistant
   behaves (e.g. the tone/language/approach it will use when generating
   responses), as well as seed it with some predefined knowledge in the form
   of hypothetical past conversations. Example:
```yaml
openai:
# ...
context:
# `system` can be used to initialize the context for the expected tone
# and language in the assistant responses
- role: system
content: >
You are a voice assistant that responds to user queries using
references to Lovecraftian lore.
# `user`/`assistant` interactions can be used to initialize the
# conversation context with previous knowledge. `user` is used to
# emulate previous user questions, and `assistant` models the
# expected response.
- role: user
content: What is a telephone?
- role: assistant
content: >
A Cthulhuian device that allows you to communicate with
otherworldly beings. It is said that the first telephone was
created by the Great Old Ones themselves, and that it is a
gateway to the void beyond the stars.
```
If you now start Platypush and ask a question like "*how does it work?*",
the voice assistant may give a response along the lines of:
```
The telephone functions by harnessing the eldritch energies of the cosmos to
transmit vibrations through the ether, allowing communication across vast
distances with entities from beyond the veil. Its operation is shrouded in
mystery, for it relies on arcane principles incomprehensible to mortal
minds.
```
Note that:
1. The style of the response is consistent with that initialized in the
`context` through `system` roles.
2. Even though a question like "*how does it work?*" is not very specific,
the assistant treats the `user`/`assistant` entries given in the context
as if they were the latest conversation prompts. Thus it realizes that
"*it*", in this context, probably means "*the telephone*".
2. The assistant has a *runtime context*. It will remember the recent
conversations for a given amount of time (configurable through the
`context_expiry` setting on the `openai` plugin configuration). So, even
without explicit context initialization in the `openai` plugin, the plugin
will remember the last interactions for (by default) 10 minutes. So if you
ask "*who wrote the Divine Comedy?*", and a few seconds later you ask
"*where was its writer from?*", you may get a response like "*Florence,
Italy*" - i.e. the assistant realizes that "*the writer*" in this context is
likely to mean "*the writer of the work that I was asked about in the
previous interaction*" and return pertinent information.
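Conceptually, the runtime context behaves like a rolling buffer of messages
where entries older than `context_expiry` are pruned before each request. A
minimal sketch of the idea (not the plugin's actual implementation):

```python
from time import time

class RollingContext:
    """Keep recent role-tagged messages, dropping expired ones."""

    def __init__(self, expiry: float = 600):  # default expiry: 10 minutes
        self.expiry = expiry
        self._messages = []  # list of (timestamp, message) tuples

    def add(self, role: str, content: str, now: float = None):
        now = time() if now is None else now
        self._messages.append((now, {"role": role, "content": content}))

    def messages(self, now: float = None) -> list:
        # Drop anything older than the expiry window, then return the rest
        now = time() if now is None else now
        self._messages = [
            (ts, msg) for ts, msg in self._messages
            if now - ts < self.expiry
        ]
        return [msg for _, msg in self._messages]
```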
#### Pros
- 👍 Speech detection quality. The OpenAI speech-to-text features are the best
  among the available `assistant` integrations. The `transcribe` API has so far
  detected my non-native English accent correctly nearly 100% of the time
  (Google comes close to 90%, while Picovoice trails quite a bit behind). And
  it even detects
the speech of my young kid - something that the Google Assistant library has
always failed to do right.
- 👍 Text-to-speech quality. The voice models used by OpenAI sound much more
natural and human than those of both Google and Picovoice. Google's and
Picovoice's TTS models are actually already quite solid, but OpenAI
outclasses them when it comes to voice modulation, inflections and sentiment.
The result sounds intimidatingly realistic.
- 👍 AI responses quality. While the scope of the Google Assistant is somewhat
limited by what people expected from voice assistants until a few years ago
(control some devices and gadgets, find my phone, tell me the news/weather,
do basic Google searches...), usually without much room for follow-ups,
`assistant.openai` will basically render voice responses as if you were
  typing them directly to ChatGPT. While Google would often respond with a
"*sorry, I don't understand*", or "*sorry, I can't help with that*", the
OpenAI assistant is more likely to expose its reasoning, ask follow-up
questions to refine its understanding, and in general create a much more
realistic conversation.
- 👍 Contexts. They are an extremely powerful way to initialize your assistant
and customize it to speak the way you want, and know the kind of things that
you want it to know. Cross-conversation contexts with configurable expiry
also make it more natural to ask something, get an answer, and then ask
another question about the same topic a few seconds later, without having to
reintroduce the assistant to the whole context.
- 👍 Offline transcriptions available through the `openai.transcribe` action.
- 👍 Multi-language support seems to work great out of the box. Ask something
to the assistant in any language, and it'll give you a response in that
language.
- 👍 Configurable voices and models.
#### Cons
- 👎 The full pack of features is only available if you have an API key
  associated with a paid OpenAI account.
- 👎 No hotword support. It relies on `assistant.picovoice` for hotword
detection.
- 👎 No intents support.
- 👎 No native support for weather forecasts, alarms, timers, integrations with
  other services/devices, or other features available out of the box with the
  Google Assistant. You can always create hooks for them, though.
### Weather forecast example
Both the OpenAI and Picovoice integrations lack some features available out of
the box on the Google Assistant - weather forecast, news playback, timers etc. -
as they rely on voice-only APIs that by default don't connect to other services.
However, Platypush provides many plugins to fill those gaps, and those
features can be implemented with custom event hooks.
Let's see for example how to build a simple hook that delivers the weather
forecast for the next 24 hours whenever the assistant gets a phrase that
contains the "*weather today*" string.
You'll need to enable a `weather` plugin in Platypush -
[`weather.openweathermap`](https://docs.platypush.tech/platypush/plugins/weather.openweathermap.html)
will be used in this example. Configuration:
```yaml
weather.openweathermap:
token: OPENWEATHERMAP_API_KEY
location: London,GB
```
Then drop a script named e.g. `weather.py` in the Platypush scripts directory
(default: `<CONFDIR>/scripts`) with the following content:
```python
from datetime import datetime
from textwrap import dedent
from time import time
from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent
@when(SpeechRecognizedEvent, phrase='weather today')
def weather_forecast(event: SpeechRecognizedEvent):
limit = time() + 24 * 60 * 60 # 24 hours from now
forecast = [
weather
for weather in run("weather.openweathermap.get_forecast")
if datetime.fromisoformat(weather["time"]).timestamp() < limit
]
min_temp = round(
min(weather["temperature"] for weather in forecast)
)
max_temp = round(
max(weather["temperature"] for weather in forecast)
)
max_wind_gust = round(
(max(weather["wind_gust"] for weather in forecast)) * 3.6
)
summaries = [weather["summary"] for weather in forecast]
most_common_summary = max(summaries, key=summaries.count)
avg_cloud_cover = round(
sum(weather["cloud_cover"] for weather in forecast) / len(forecast)
)
event.assistant.render_response(
dedent(
f"""
The forecast for today is: {most_common_summary}, with
a minimum of {min_temp} and a maximum of {max_temp}
degrees, wind gust of {max_wind_gust} km/h, and an
average cloud cover of {avg_cloud_cover}%.
"""
)
)
```
This script will work with any of the available voice assistants.
You can also implement something similar for news playback, for example using
the [`rss` plugin](https://docs.platypush.tech/platypush/plugins/rss.html) to
get the latest items in your subscribed feeds. Or to create custom alarms using
the [`alarm` plugin](https://docs.platypush.tech/platypush/plugins/alarm.html),
or a timer using the [`utils.set_timeout`
action](https://docs.platypush.tech/platypush/plugins/utils.html#platypush.plugins.utils.UtilsPlugin.set_timeout).
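A timer hook, for instance, mostly boils down to parsing a duration out of the
recognized phrase, and that parsing is plain Python. A sketch (the hook and
`utils.set_timeout` wiring are left out, and the regex only covers digits, not
spelled-out numbers):

```python
import re

# Unit multipliers, in seconds
_UNITS = {"second": 1, "minute": 60, "hour": 3600}

def parse_timer_phrase(phrase: str):
    # Extract e.g. "10 minutes" from "set a timer for 10 minutes".
    # Returns the duration in seconds, or None if nothing matches.
    m = re.search(r"(\d+)\s+(second|minute|hour)s?", phrase, re.IGNORECASE)
    if not m:
        return None

    return int(m.group(1)) * _UNITS[m.group(2).lower()]
```

Inside a `SpeechRecognizedEvent` hook, the returned duration can then be fed
to `utils.set_timeout` together with the action to run once the timer fires
(check the plugin documentation for the exact parameters it expects).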
## Conclusions
The past few years have seen a lot of things happen in the voice industry.
Many products have gone off the market, been deprecated or sunset, but not all
hope is lost. The OpenAI and Picovoice products, especially when combined
together, can still provide a good out-of-the-box voice assistant experience.
And the OpenAI products have also raised the bar on what to expect from an
AI-based assistant.
I wish that there were still some fully open and on-device alternatives out
there, now that Mycroft, Snowboy and DeepSpeech are all gone. OpenAI and Google
provide the best voice experience as of now, but of course they come with
trade-offs - namely the great amount of data points you feed to these
cloud-based services. Picovoice sits somewhere in the middle, as it runs at
least partly on-device, but its business model is still a bit fuzzy and it's
not clear whether they intend to have their products used by the wider public
or whether the focus is mostly B2B.
I'll keep an eye, however, on what rises from the ashes of Mycroft in the form
of the [OpenConversational](https://community.openconversational.ai/) project,
and I'll probably keep you up to date when there is a new integration to
share.