[//]: # (title: The state of voice assistant integrations in 2024)

[//]: # (description: How to use Platypush to build your voice assistants. Featuring Google, OpenAI and Picovoice.)

[//]: # (image: https://platypush-static.s3.nl-ams.scw.cloud/images/voice-assistant-2.png)

[//]: # (author: Fabio Manganiello <fabio@platypush.tech>)

[//]: # (published: 2024-06-02)

Those who have been following my blog or used Platypush for a while probably
know that I've put quite some effort into getting voice assistants right over
the past few years.

I built my first (very primitive) voice assistant that used DCT+Markov models
[back in 2008](https://github.com/blacklight/Voxifera), when the concept was
still pretty much a science fiction novelty.

Then I wrote [an article in
2019](https://blog.platypush.tech/article/Build-your-customizable-voice-assistant-with-Platypush)
and [one in
2020](https://blog.platypush.tech/article/Build-custom-voice-assistants) on how
to use several voice integrations in [Platypush](https://platypush.tech) to
create custom voice assistants.

## Everyone in those pictures is now dead

Quite a few things have changed in this industry niche since I wrote my
previous articles. Most of the solutions that I covered back in the day,
unfortunately, are gone in one way or another:

- The `assistant.snowboy` integration is gone because, unfortunately, [Snowboy
  is gone](https://github.com/Kitt-AI/snowboy). For a while you could still run
  the Snowboy code with models that you had either previously downloaded from
  their website or trained yourself, but my latest experience proved to be
  quite unfruitful - it's been more than 4 years since the last commit on
  Snowboy, and it's hard to get the code to even run.

- The `assistant.alexa` integration is also gone, as Amazon [has stopped
  maintaining the AVS SDK](https://github.com/alexa/avs-device-sdk). And I have
  literally no clue what Amazon's plans for the development of Alexa skills
  are (if there are any plans at all).

- The `stt.deepspeech` integration is also gone: [the project hasn't seen a
  commit in 3 years](https://github.com/mozilla/DeepSpeech), and I even
  struggled to get the latest code to run. Given the current financial
  situation at Mozilla, and the fact that they're trying to cut as much as
  possible of what they don't consider part of their core product, it's
  very unlikely that DeepSpeech will be revived any time soon.

- The `assistant.google` integration [is still
  there](https://docs.platypush.tech/platypush/plugins/assistant.google.html),
  but I can't make promises on how long it can be maintained. It uses the
  [`google-assistant-library`](https://pypi.org/project/google-assistant-library/),
  which was [deprecated in
  2019](https://developers.google.com/assistant/sdk/release-notes). Google
  replaced it with the [conversational
  actions](https://developers.google.com/assistant/sdk/), which [were also
  deprecated last year](https://developers.google.com/assistant/ca-sunset).
  `<rant>`Put here your joke about Google building products with the shelf life
  of a summer hit.`</rant>`

- The `tts.mimic3` integration, a text-to-speech plugin based on
  [mimic3](https://github.com/MycroftAI/mimic3), part of the
  [Mycroft](https://en.wikipedia.org/wiki/Mycroft_(software)) initiative, [is
  still there](https://docs.platypush.tech/platypush/plugins/tts.mimic3.html),
  but only because it's still possible to [spin up a Docker
  image](https://hub.docker.com/r/mycroftai/mimic3) that runs mimic3. The whole
  Mycroft project, however, [is now
  defunct](https://community.openconversational.ai/t/update-from-the-ceo-part-1/13268),
  and [the story of how it went
  bankrupt](https://www.reuters.com/legal/transactional/appeals-court-says-judge-favored-patent-plaintiff-scorched-earth-case-2022-03-04/)
  is a sad reminder of the power that patent trolls have over startups. The
  Mycroft initiative, however, seems to [have been picked up by the
  community](https://community.openconversational.ai/), and things seem to be
  moving in the space of fully open source and on-device voice models. I'll
  definitely be looking with interest at what happens in that space, but the
  project seems to be at a stage that is still a bit too immature to justify an
  investment into a new Platypush integration.

## But not all hope is lost

### `assistant.google`

`assistant.google` may be relying on a dead library, but it's not dead (yet).
The code still works, but you're a bit constrained on the hardware side - the
assistant library only supports x86_64 and ARMv7 (namely, only Raspberry Pi 3
and 4). No ARM64 (i.e. no Raspberry Pi 5), and even running it on other
ARMv7-compatible devices has proved to be a challenge in some cases. Given the
state of the library, it's safe to say that it'll never be supported on other
platforms, but if you want to run your assistant on a device that is still
supported, then it should still work fine.

However, I had to pull a few dirty packaging tricks to ensure that the
assistant library code doesn't break badly on newer versions of Python. That
code hasn't been touched in 5 years and it's starting to rot. It depends on
ancient and deprecated Python libraries like
[`enum34`](https://pypi.org/project/enum34/) and it needs some hammering to
work - without breaking the whole Python environment in the process.

For now, `pip install 'platypush[assistant.google]'` should do all the dirty
work and get all of your assistant dependencies installed. But I can't promise
that I can maintain that code forever.

### `assistant.picovoice`

Picovoice has been a nice surprise in an industry niche where all the
products that were available just 4 years ago are now dead.

I described some of their products [in my previous
articles](https://blog.platypush.tech/article/Build-custom-voice-assistants),
and I even built a couple of `stt.picovoice.*` plugins for Platypush back in
the day, but I didn't really put much effort into them.

Their business model seemed a bit weird - along the lines of "you can test our
products on x86_64, but if you need an ARM build you should contact us as a
business partner". And the quality of their products was also a bit
disappointing compared to other mainstream offerings.

I'm glad to see that the situation has changed quite a bit now. They still have
a "sign up with a business email" model, but at least you can now just sign up
on their website and start using their products rather than sending emails
around. And I'm also quite impressed by the progress on their website. You
can now train hotword models, customize speech-to-text models and build your
own intent rules directly from their website - a feature that was also
available in the beloved Snowboy, and that went missing from every major
product offering out there after Snowboy was gone. I feel like the quality of
their models has also greatly improved compared to the last time I checked -
predictions are still slower than the Google Assistant's, and definitely less
accurate with non-native accents, but the gap with the Google Assistant when it
comes to native accents isn't very wide.

### `assistant.openai`

OpenAI has filled many gaps left by all the casualties in the voice assistants
market. Platypush now provides a new `assistant.openai` plugin that stitches
together several of their APIs to provide a voice assistant experience that
honestly feels much more natural than anything I've tried in all these years.

Let's explore how to use these integrations to build our on-device voice
assistant with custom rules.

## Feature comparison

As some of you may know, voice assistants often aren't monolithic products.
Unless explicitly designed as all-in-one packages (like the
`google-assistant-library`), voice assistant integrations in Platypush are
usually built on top of four distinct APIs:

1. **Hotword detection**: This is the component that continuously listens on
   your microphone until you say "Ok Google", "Alexa" or any other wake-up
   word used to start a conversation. Since it's a continuously listening
   component that needs to make decisions fast, and it only has to recognize
   one word (or, in a few cases, 3-4 at most), it usually doesn't need to
   run on a full language model. It only needs small models, often just a
   couple of MB in size.

2. **Speech-to-text** (*STT*): This is the component that captures audio
   from the microphone and uses some API to transcribe it to text.

3. **Response engine**: Once you have the transcription of what the user said,
   you need to feed it to some model that will generate a human-like
   response to the question.

4. **Text-to-speech** (*TTS*): Once you have your AI response rendered as a
   text string, you need a text-to-speech model to speak it out loud on your
   speakers or headphones.

On top of these basic building blocks for a voice assistant, some integrations
may also provide two extra features.

#### Speech-to-intent

In this mode, the user's prompt, instead of being transcribed directly to text,
is transcribed into a structured *intent* that can be more easily processed by
a downstream integration with no need for extra text parsing, regular
expressions etc.

For instance, a voice command like "*turn off the bedroom lights*" could be
translated into an intent such as:

```json
{
  "intent": "lights_ctrl",
  "slots": {
    "state": "off",
    "lights": "bedroom"
  }
}
```

#### Offline speech-to-text

a.k.a. *offline text transcriptions*. Some assistant integrations may offer you
the ability to pass an audio file and transcribe its content as text.

### Features summary

This table summarizes how the `assistant` integrations available in Platypush
compare when it comes to what I would call the *foundational* blocks:

| Plugin                 | Hotword | STT | AI responses | TTS |
| ---------------------- | ------- | --- | ------------ | --- |
| `assistant.google`     | ✅      | ✅  | ✅           | ✅  |
| `assistant.openai`     | ❌      | ✅  | ✅           | ✅  |
| `assistant.picovoice`  | ✅      | ✅  | ❌           | ✅  |

And this is how they compare in terms of extra features:

| Plugin                 | Intents | Offline STT |
| ---------------------- | ------- | ----------- |
| `assistant.google`     | ❌      | ❌          |
| `assistant.openai`     | ❌      | ✅          |
| `assistant.picovoice`  | ✅      | ✅          |

Let's see a few configuration examples to better understand the pros and cons
of each of these integrations.

## Configuration

### Hardware requirements

1. A computer, a Raspberry Pi, an old tablet, or anything in between, as long
   as it can run Python. At least 1GB of RAM is advised for a smooth audio
   processing experience.

2. A microphone.

3. Speakers/headphones.

### Installation notes

[Platypush
1.0.0](https://git.platypush.tech/platypush/platypush/src/branch/master/CHANGELOG.md#1-0-0-2024-05-26)
has [recently been
released](https://blog.platypush.tech/article/Platypush-1.0-is-out), and [new
installation procedures](https://docs.platypush.tech/wiki/Installation.html)
with it.

There's now official support for [several package
managers](https://docs.platypush.tech/wiki/Installation.html#system-package-manager-installation),
a better [Docker installation
process](https://docs.platypush.tech/wiki/Installation.html#docker), and more
powerful ways to [install
plugins](https://docs.platypush.tech/wiki/Plugins-installation.html) - via
[`pip` extras](https://docs.platypush.tech/wiki/Plugins-installation.html#pip),
[Web
interface](https://docs.platypush.tech/wiki/Plugins-installation.html#web-interface),
[Docker](https://docs.platypush.tech/wiki/Plugins-installation.html#docker) and
[virtual
environments](https://docs.platypush.tech/wiki/Plugins-installation.html#virtual-environment).

The optional dependencies for any Platypush plugin can be installed via `pip`
extras in the simplest case:

```
$ pip install 'platypush[plugin1,plugin2,...]'
```

For example, if you want to install Platypush with the dependencies for
`assistant.openai` and `assistant.picovoice`:

```
$ pip install 'platypush[assistant.openai,assistant.picovoice]'
```

Some plugins, however, may require extra system dependencies that are not
available via `pip` - for instance, both the OpenAI and Picovoice integrations
require the `ffmpeg` binary to be installed, as it is used for audio
conversion and export purposes. You can check the [plugins
documentation](https://docs.platypush.tech) for the system dependencies
required by each integration, or install them automatically through the Web
interface or the `platydock` command for Docker containers.
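
For instance, on most Debian/Ubuntu-based systems `ffmpeg` is just one package
manager command away (adjust for your distribution):

```bash
$ sudo apt install ffmpeg
```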

### A note on the hooks

All the custom actions in this article are built through event hooks triggered
by
[`SpeechRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent)
(or
[`IntentRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.IntentRecognizedEvent)
for intents). When an intent event is triggered, or a speech event matches a
hook with a condition on the phrase, the `assistant` integrations in Platypush
will prevent the default assistant response. That's to avoid cases where e.g.
you say "*turn off the lights*", your hook takes care of running the actual
action, while your voice assistant fetches a response from Google or ChatGPT
along the lines of "*sorry, I can't control your lights*".

If you want to render a custom response from an event hook, you can do so by
calling `event.assistant.render_response(text)`, and the response will be
spoken using the available text-to-speech integration.
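
As a minimal sketch (the phrase and response here are made up, but the API
calls are the same ones used throughout this article):

```python
from platypush import when
from platypush.events.assistant import SpeechRecognizedEvent


@when(SpeechRecognizedEvent, phrase="how are you")
def on_how_are_you(event: SpeechRecognizedEvent):
    # The response is spoken through the TTS plugin associated
    # with the assistant that triggered the event
    event.assistant.render_response("All good, thanks for asking!")
```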

If you want to disable this behaviour, and you want the default assistant
response to always be rendered, even if the event matches a hook with a phrase
or an intent, you can do so by setting the `stop_conversation_on_speech_match`
parameter to `false` in your assistant plugin configuration.
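
For example (a minimal sketch - `assistant.openai` is used here just as a
placeholder, the parameter applies to any of the assistant plugins):

```yaml
assistant.openai:
  # Always render the default assistant response, even when a
  # phrase/intent hook matches
  stop_conversation_on_speech_match: false
```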

### Text-to-speech

Each of the available `assistant` plugins has its own default `tts` plugin
associated:

- `assistant.google`:
  [`tts`](https://docs.platypush.tech/platypush/plugins/tts.html), but
  [`tts.google`](https://docs.platypush.tech/platypush/plugins/tts.google.html)
  is also available. The difference is that `tts` uses the (unofficial) Google
  Translate frontend API - it requires no extra configuration, but besides
  setting the input language it isn't very configurable. `tts.google`, on the
  other hand, uses the [Google Cloud Translation
  API](https://cloud.google.com/translate/docs/reference/rest/). It is much
  more versatile, but it requires an extra API enabled on your Google
  project and an extra credentials file.

- `assistant.openai`:
  [`tts.openai`](https://docs.platypush.tech/platypush/plugins/tts.openai.html),
  which leverages the [OpenAI
  text-to-speech API](https://platform.openai.com/docs/guides/text-to-speech).

- `assistant.picovoice`:
  [`tts.picovoice`](https://docs.platypush.tech/platypush/plugins/tts.picovoice.html),
  which uses the (still experimental, at the time of writing) [Picovoice Orca
  engine](https://github.com/Picovoice/orca).

Any text rendered via `assistant.*.render_response` will be spoken using the
associated TTS plugin. You can however override it by setting `tts_plugin` in
your assistant plugin configuration - e.g. you can render responses from the
OpenAI assistant through the Google or Picovoice engine, or the other way
around.
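
A minimal sketch of what that override could look like (plugin names taken from
the list above):

```yaml
assistant.openai:
  # Speak OpenAI assistant responses through the Picovoice Orca engine
  # instead of the default tts.openai plugin
  tts_plugin: tts.picovoice
```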

`tts` plugins also expose a `say` action that can be called outside of an
assistant context to render custom text at runtime - for example, from other
[event
hooks](https://docs.platypush.tech/wiki/Quickstart.html#turn-on-the-lights-when-i-say-so),
[procedures](https://docs.platypush.tech/wiki/Quickstart.html#greet-me-with-lights-and-music-when-i-come-home),
[cronjobs](https://docs.platypush.tech/wiki/Quickstart.html#turn-off-the-lights-at-1-am)
or [API calls](https://docs.platypush.tech/wiki/APIs.html). For example:

```bash
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "tts.openai.say",
  "args": {
    "text": "What a wonderful day!"
  }
}
' http://localhost:8008/execute
```

### `assistant.google`

- [**Plugin documentation**](https://docs.platypush.tech/platypush/plugins/assistant.google.html)
- `pip` installation: `pip install 'platypush[assistant.google]'`

This is the oldest voice integration in Platypush - and one of the use-cases
that actually motivated me to fork the [previous
project](https://github.com/blacklight/evesp) into what is now Platypush.

As mentioned in the previous section, this integration is built on top of a
deprecated library (with no available alternatives) that just so happens to
still work, with a bit of hammering, on x86_64 and Raspberry Pi 3/4.

Personally, it's the voice assistant I still use on most of my devices, but
it's definitely not guaranteed that it will keep working in the future.

Once you have installed Platypush with the dependencies for this integration,
you can configure it through these steps:

1. Create a new project on the [Google developers
   console](https://console.cloud.google.com) and [generate a new set of
   credentials for it](https://console.cloud.google.com/apis/credentials).
   Download the credentials secrets as JSON.
2. Generate [scoped
   credentials](https://developers.google.com/assistant/sdk/guides/library/python/embed/install-sample#generate_credentials)
   from your `secrets.json`.
3. Configure the integration in your `config.yaml` for Platypush (see the
   [configuration
   page](https://docs.platypush.tech/wiki/Configuration.html#configuration-file)
   for more details):

```yaml
assistant.google:
  # Default: ~/.config/google-oauthlib-tool/credentials.json
  # or <PLATYPUSH_WORKDIR>/credentials/google/assistant.json
  credentials_file: /path/to/credentials.json
  # Default: no sound is played when "Ok Google" is detected
  conversation_start_sound: /path/to/sound.mp3
```

Restart the service, say "Ok Google" or "Hey Google" while the microphone is
active, and everything should work out of the box.

You can now start creating event hooks to execute your custom voice commands.
For example, if you configured a lights plugin (e.g.
[`light.hue`](https://docs.platypush.tech/platypush/plugins/light.hue.html))
and a music plugin (e.g.
[`music.mopidy`](https://docs.platypush.tech/platypush/plugins/music.mopidy.html)),
you can start building voice commands like these:

```python
# Content of e.g. /path/to/config_yaml/scripts/assistant.py

from platypush import run, when
from platypush.events.assistant import (
    ConversationStartEvent, SpeechRecognizedEvent
)

light_plugin = "light.hue"
music_plugin = "music.mopidy"


@when(ConversationStartEvent)
def pause_music_when_conversation_starts():
    run(f"{music_plugin}.pause_if_playing")


# Note: (limited) support for regular expressions on `phrase`.
# This hook will match any phrase containing "turn on the lights"
# (with "the" being optional)
@when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
def lights_on_command():
    run(f"{light_plugin}.on")
    # Or, with arguments:
    # run(f"{light_plugin}.on", groups=["Bedroom"])


@when(SpeechRecognizedEvent, phrase="turn off (the)? lights")
def lights_off_command():
    run(f"{light_plugin}.off")


@when(SpeechRecognizedEvent, phrase="play (the)? music")
def play_music_command():
    run(f"{music_plugin}.play")


@when(SpeechRecognizedEvent, phrase="stop (the)? music")
def stop_music_command():
    run(f"{music_plugin}.stop")
```

Or, via YAML:

```yaml
# Add to your config.yaml, or to one of the files included in it

event.hook.pause_music_when_conversation_starts:
  if:
    type: platypush.message.event.assistant.ConversationStartEvent

  then:
    - action: music.mopidy.pause_if_playing

event.hook.lights_on_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "turn on (the)? lights"

  then:
    - action: light.hue.on
      # args:
      #   groups:
      #     - Bedroom

event.hook.lights_off_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "turn off (the)? lights"

  then:
    - action: light.hue.off

event.hook.play_music_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "play (the)? music"

  then:
    - action: music.mopidy.play

event.hook.stop_music_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "stop (the)? music"

  then:
    - action: music.mopidy.stop
```

Parameters are also supported on the `phrase` event argument through the `${}`
template construct. For example:

```python
from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent


@when(SpeechRecognizedEvent, phrase='play ${title} by ${artist}')
def on_play_track_command(
    event: SpeechRecognizedEvent, title: str, artist: str
):
    results = run(
        "music.mopidy.search",
        filter={"title": title, "artist": artist}
    )

    if not results:
        event.assistant.render_response(f"Couldn't find {title} by {artist}")
        return

    run("music.mopidy.play", resource=results[0]["uri"])
```

#### Pros

- 👍 Very fast and robust API.
- 👍 Easy to install and configure.
- 👍 It comes with almost all the features of a voice assistant installed on
  Google hardware - except some actions native to Android-based devices and
  video/display features. This means that features such as timers, alarms,
  weather forecast, setting the volume or controlling Chromecasts on the same
  network are all supported out of the box.
- 👍 It connects to your Google account (this can be configured from your
  Google settings), so things like location-based suggestions and calendar
  events are available. Support for custom actions and devices configured in
  your Google Home app is also available out of the box, although I haven't
  tested it in a while.
- 👍 Good multi-language support. In most cases the assistant seems quite
  capable of understanding questions in multiple languages and responding in
  the input language without any further configuration.

#### Cons

- 👎 Based on a deprecated API that could break at any moment.
- 👎 Limited hardware support (only x86_64 and RPi 3/4).
- 👎 Not possible to configure the hotword - only "Ok/Hey Google" is available.
- 👎 Not possible to configure the output voice - it can only use the stock
  Google Assistant voice.
- 👎 No support for intents - something similar was available (albeit tricky to
  configure) through the Actions SDK, but that has also been abandoned by
  Google.
- 👎 Not very modular. Both `assistant.picovoice` and `assistant.openai` have
  been built by stitching together different independent APIs. Those plugins
  are therefore quite *modular*. You can choose for instance to run only the
  hotword engine of `assistant.picovoice`, which in turn will trigger the
  conversation engine of `assistant.openai`, and maybe use `tts.google` to
  render the responses. By contrast, given the relatively monolithic nature of
  the `google-assistant-library`, which runs the whole service locally, if your
  instance runs `assistant.google` then it can't run other assistant plugins.

### `assistant.picovoice`

- [**Plugin
  documentation**](https://docs.platypush.tech/platypush/plugins/assistant.picovoice.html)
- `pip` installation: `pip install 'platypush[assistant.picovoice]'`

The `assistant.picovoice` integration is available from [Platypush
1.0.0](https://git.platypush.tech/platypush/platypush/src/branch/master/CHANGELOG.md#1-0-0-2024-05-26).

Previous versions had some outdated `stt.picovoice.*` plugins for the
individual products, but they weren't properly tested, and they weren't
combined into a single integration that implements the Platypush `assistant`
API.

This integration is built on top of the voice products developed by
[Picovoice](https://picovoice.ai/). These include:

- [**Porcupine**](https://picovoice.ai/platform/porcupine/): a fast and
  customizable engine for hotword/wake-word detection. It can be enabled by
  setting `hotword_enabled` to `true` in the `assistant.picovoice` plugin
  configuration.

- [**Cheetah**](https://picovoice.ai/docs/cheetah/): a speech-to-text engine
  optimized for real-time transcriptions. It can be enabled by setting
  `stt_enabled` to `true` in the `assistant.picovoice` plugin configuration.

- [**Leopard**](https://picovoice.ai/docs/leopard/): a speech-to-text engine
  optimized for offline transcriptions of audio files.

- [**Rhino**](https://picovoice.ai/docs/rhino/): a speech-to-intent engine.

- [**Orca**](https://picovoice.ai/docs/orca/): a text-to-speech engine.

You can get your personal access key by signing up at the [Picovoice
console](https://console.picovoice.ai/). You may be asked to submit a reason
for using the service (feel free to mention a personal Platypush integration),
and you will receive your personal access key.

If prompted to select the products you want to use, make sure to select the
ones from the Picovoice suite that you plan to use with the
`assistant.picovoice` plugin.

A basic plugin configuration would look like this:

```yaml
assistant.picovoice:
  access_key: YOUR_ACCESS_KEY

  # Keywords that the assistant should listen for
  keywords:
    - alexa
    - computer
    - ok google

  # Paths to custom keyword files
  # keyword_paths:
  #   - ~/.local/share/picovoice/keywords/linux/custom_linux.ppn

  # Enable/disable the hotword engine
  hotword_enabled: true
  # Enable the STT engine
  stt_enabled: true

  # conversation_start_sound: ...

  # Path to a custom model to be used for speech-to-text
  # speech_model_path: ~/.local/share/picovoice/models/cheetah/custom-en.pv

  # Path to an intent model. At least one custom intent model is required if
  # you want to enable intent detection.
  # intent_model_path: ~/.local/share/picovoice/models/rhino/custom-en-x86.rhn
```

#### Hotword detection

If enabled through the `hotword_enabled` parameter (default: `true`), the
assistant will listen for a specific wake word before starting the
speech-to-text or intent recognition engines. You can specify custom models for
your hotwords (e.g. on the same device you may use "Alexa" to trigger the
speech-to-text engine in English, "Computer" to trigger the speech-to-text
engine in Italian, and "Ok Google" to trigger the intent recognition engine).

You can also create your own custom hotword models using the [Porcupine
console](https://console.picovoice.ai/ppn).

If `hotword_enabled` is set to `true`, you must also specify the `keywords`
parameter with the list of keywords that you want to listen for, and optionally
the `keyword_paths` parameter with the paths to any custom hotword models
that you want to use. If `hotword_enabled` is set to `false`, then the
assistant won't start listening for speech after the plugin is started, and you
will need to programmatically start the conversation by calling the
`assistant.picovoice.start_conversation` action.

When a wake-word is detected, the assistant will emit a
[`HotwordDetectedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent)
that you can use to build your custom logic.

By default, the assistant will start listening for speech after the hotword if
either `stt_enabled` or `intent_model_path` are set. If you don't want the
assistant to start listening for speech after the hotword is detected (for
example because you want to build your custom response flows, or trigger the
speech detection using different models depending on the hotword that is used,
or because you just want to detect hotwords but not speech), then you can also
set the `start_conversation_on_hotword` parameter to `false`. If that is the
case, then you can programmatically start the conversation by calling the
`assistant.picovoice.start_conversation` method in your event hooks:

```python
from platypush import when, run
from platypush.message.event.assistant import HotwordDetectedEvent


# Start a conversation using the Italian language model when the
# "Buongiorno" hotword is detected
@when(HotwordDetectedEvent, hotword='Buongiorno')
def on_it_hotword_detected(event: HotwordDetectedEvent):
    event.assistant.start_conversation(model_file='path/to/it.pv')
```

#### Speech-to-text

If you want to build your custom STT hooks, the approach is the same as the one
seen for the `assistant.google` plugin - create an event hook on
[`SpeechRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent)
with a given exact phrase, regex or template.

#### Speech-to-intent

*Intents* are structured actions parsed from unstructured human-readable text.

Unlike with hotword and speech-to-text detection, you need to provide a
custom model for intent detection. You can create your custom model using
the [Rhino console](https://console.picovoice.ai/rhn).

When an intent is detected, the assistant will emit an
[`IntentRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.IntentRecognizedEvent)
and you can build your custom hooks on it.

For example, you can build a model to control groups of smart lights by
defining the following slots on the Rhino console:

- ``device_state``: The new state of the device (e.g. with ``on`` or
  ``off`` as supported values).

- ``room``: The name of the room associated with the group of lights to
  be controlled (e.g. ``living room``, ``kitchen``, ``bedroom``).

You can then define a ``lights_ctrl`` intent with the following expressions:

- "*turn ``$device_state:state`` the lights*"
- "*turn ``$device_state:state`` the ``$room:room`` lights*"
- "*turn the lights ``$device_state:state``*"
- "*turn the ``$room:room`` lights ``$device_state:state``*"
- "*turn ``$room:room`` lights ``$device_state:state``*"

This intent will match any of the following phrases:

- "*turn on the lights*"
- "*turn off the lights*"
- "*turn the lights on*"
- "*turn the lights off*"
- "*turn on the living room lights*"
- "*turn off the living room lights*"
- "*turn the living room lights on*"
- "*turn the living room lights off*"

And it will extract any slots that are matched in the phrases into the
[`IntentRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.IntentRecognizedEvent).

Train the model, download the context file, and pass its path to the
``intent_model_path`` parameter.

You can then register a hook to listen to a specific intent:

```python
from platypush import when, run
from platypush.events.assistant import IntentRecognizedEvent


@when(IntentRecognizedEvent, intent='lights_ctrl', slots={'state': 'on'})
def on_turn_on_lights(event: IntentRecognizedEvent):
    room = event.slots.get('room')
    if room:
        run("light.hue.on", groups=[room])
    else:
        run("light.hue.on")
```

Note that if both `stt_enabled` and `intent_model_path` are set, then
both the speech-to-text and intent recognition engines will run in parallel
when a conversation is started.

The intent engine is usually faster, as it has a smaller set of intents to
match and doesn't have to run a full speech-to-text transcription. This means
that, if an utterance matches both a speech-to-text phrase and an intent, the
`IntentRecognizedEvent` is emitted (and not `SpeechRecognizedEvent`).

This may not always be the case though. So, if you want to use the intent
detection engine together with speech detection, it may be a good practice
to also provide a fallback `SpeechRecognizedEvent` hook to catch the text if
the speech is not recognized as an intent:

```python
from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent


@when(SpeechRecognizedEvent, phrase='turn ${state} (the)? ${room} lights?')
def on_lights_command(
    event: SpeechRecognizedEvent, state: str, room: str, **context
):
    # Map the extracted `state` token to the corresponding light action
    action = "light.hue.on" if state == "on" else "light.hue.off"
    if room:
        run(action, groups=[room])
    else:
        run(action)
```

#### Text-to-speech and response management

The text-to-speech engine, based on Orca, is provided by the
[`tts.picovoice`](https://docs.platypush.tech/platypush/plugins/tts.picovoice.html)
plugin.

However, the Picovoice integration won't provide you with automatic
AI-generated responses for your queries. That's because Picovoice doesn't seem
to offer (yet) any products for conversational assistants, either voice-based
or text-based.

You can however leverage the `render_response` action to render some text as
speech in response to a user command, and that in turn will leverage the
Picovoice TTS plugin to render the response.

For example, the following snippet provides a hook that:

- Listens for `SpeechRecognizedEvent`.

- Matches the phrase against a list of predefined commands that shouldn't
  require an AI-generated response.

- Has a fallback logic that leverages `openai.get_response` to generate a
  response through a ChatGPT model and render it as audio.

Also, note that any text rendered through the `render_response` action that
ends with a question mark will automatically trigger a follow-up - i.e. the
assistant will wait for the user to answer its question.

```python
import re

from platypush import run, when
from platypush.message.event.assistant import SpeechRecognizedEvent


def play_music(event, **kwargs):
    run("music.mopidy.play")


def stop_music(event, **kwargs):
    run("music.mopidy.stop")


def ai_assist(event: SpeechRecognizedEvent, **kwargs):
    response = run("openai.get_response", prompt=event.phrase)
    if not response:
        return

    run("assistant.picovoice.render_response", text=response)


# List of commands to match, as pairs of regex patterns and the
# corresponding actions
hooks = (
    (re.compile(r"play (the )?music", re.IGNORECASE), play_music),
    (re.compile(r"stop (the )?music", re.IGNORECASE), stop_music),
    # ...
    # Fallback to the AI assistant
    (re.compile(r".*"), ai_assist),
)


@when(SpeechRecognizedEvent)
def on_speech_recognized(event, **kwargs):
    for pattern, command in hooks:
        if pattern.search(event.phrase):
            run("logger.info", msg=f"Running voice command: {command.__name__}")
            command(event, **kwargs)
            break
```
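
As a minimal sketch of the follow-up behaviour mentioned above (the phrase and
response text here are made up), a hook can end its response with a question so
that the assistant keeps listening for the user's answer:

```python
from platypush import when
from platypush.events.assistant import SpeechRecognizedEvent


@when(SpeechRecognizedEvent, phrase="good morning")
def on_good_morning(event: SpeechRecognizedEvent):
    # The trailing question mark triggers a follow-up:
    # the assistant will wait for the user's answer
    event.assistant.render_response(
        "Good morning! Do you want me to play the news?"
    )
```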

#### Offline speech-to-text

An [`assistant.picovoice.transcribe`
action](https://docs.platypush.tech/platypush/plugins/assistant.picovoice.html#platypush.plugins.assistant.picovoice.AssistantPicovoicePlugin.transcribe)
is provided for offline transcriptions of audio files, using the Leopard
models.

You can easily call it from your procedures, hooks or through the API:

```bash
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "assistant.picovoice.transcribe",
  "args": {
    "audio_file": "/path/to/some/speech.mp3"
  }
}' http://localhost:8008/execute

{
  "transcription": "This is a test",
  "words": [
    {
      "word": "this",
      "start": 0.06400000303983688,
      "end": 0.19200000166893005,
      "confidence": 0.9626294374465942
    },
    {
      "word": "is",
      "start": 0.2879999876022339,
      "end": 0.35199999809265137,
      "confidence": 0.9781675934791565
    },
    {
      "word": "a",
      "start": 0.41600000858306885,
      "end": 0.41600000858306885,
      "confidence": 0.9764975309371948
    },
    {
      "word": "test",
      "start": 0.5120000243186951,
      "end": 0.8320000171661377,
      "confidence": 0.9511580467224121
    }
  ]
}
```

#### Pros

- 👍 The Picovoice integration is extremely configurable. `assistant.picovoice`
  stitches together five independent products developed by a small company
  specialized in voice products for developers. As such, Picovoice may be the
  best option if you have custom use-cases. You can pick which features you
  need (hotword, speech-to-text, speech-to-intent, text-to-speech...) and you
  have plenty of flexibility in building your integrations.

- 👍 Runs (or seems to run) (mostly) on device. This is something that we can't
  say about the other two integrations discussed in this article. If keeping
  your voice interactions 100% hidden from Google's or OpenAI's eyes is a
  priority, then Picovoice may be your best bet.

- 👍 Rich features. It uses different models for different purposes - for
  example, Cheetah models are optimized for real-time speech detection, while
  Leopard is optimized for offline transcription. Moreover, Picovoice is the
  only integration among those analyzed in this article to support
  speech-to-intent.

- 👍 It's very easy to build new models or customize existing ones. Picovoice
  has a powerful developers console that allows you to easily create hotword
  models, tweak the priority of some words in voice models, and create custom
  intent models.

#### Cons

- 👎 The business model is still a bit weird. It's better than the earlier
  "*write us an email with your business case and we'll reach back to you*",
  but it still requires you to sign up with a business email and write a couple
  of lines on what you want to build with their products. It feels like their
  focus is on a B2B approach rather than "open up and let the community build
  stuff", and that seems to create unnecessary friction.

- 👎 No native conversational features. At the time of writing, Picovoice
  doesn't offer products that generate AI responses given voice or text
  prompts. This means that, if you want AI-generated responses to your queries,
  you'll have to send requests to e.g.
  [`openai.get_response(prompt)`](https://docs.platypush.tech/platypush/plugins/openai.html#platypush.plugins.openai.OpenaiPlugin.get_response)
  directly in your hooks for `SpeechRecognizedEvent`, and render the responses
  through `assistant.picovoice.render_response`. This makes
  `assistant.picovoice` alone a better fit for cases where you mostly want to
  create voice command hooks rather than have general-purpose conversations.

- 👎 Speech-to-text, at least on my machine, is slower than the other two
  integrations, and the accuracy with non-native accents is also much lower.

- 👎 Limited support for languages other than English. At the time of
  writing, hotword detection with Porcupine seems to be in relatively good
  shape, with [support for 16
  languages](https://github.com/Picovoice/porcupine/tree/master/lib/common).
  However, both speech-to-text and text-to-speech only support English at the
  moment.

- 👎 Some APIs are still quite unstable. The Orca text-to-speech API, for
  example, doesn't even support text that includes digits or some punctuation
  characters - at least not at the time of writing. The Platypush integration
  fills the gap with workarounds that e.g. convert numbers to their word form
  and replace unsupported punctuation characters, but you definitely get the
  feeling that some parts of their products are still work in progress.

### `assistant.openai`

- [**Plugin
  documentation**](https://docs.platypush.tech/platypush/plugins/assistant.openai.html)
- `pip` installation: `pip install 'platypush[assistant.openai]'`

This integration has been released in [Platypush
1.0.7](https://git.platypush.tech/platypush/platypush/src/branch/master/CHANGELOG.md#1-0-7-2024-06-02).

It uses the following OpenAI APIs:

- [`/audio/transcriptions`](https://platform.openai.com/docs/guides/speech-to-text)
  for speech-to-text. At the time of writing the default model is `whisper-1`.
  It can be configured through the `model` setting in the `assistant.openai`
  plugin configuration. See the [OpenAI
  documentation](https://platform.openai.com/docs/models/whisper) for a list of
  available models.
- [`/chat/completions`](https://platform.openai.com/docs/api-reference/completions/create)
  to get AI-generated responses using a GPT model. At the time of writing the
  default is `gpt-3.5-turbo`, but it can be configured through the `model`
  setting in the `openai` plugin configuration. See the [OpenAI
  documentation](https://platform.openai.com/docs/models) for a list of
  supported models.
- [`/audio/speech`](https://platform.openai.com/docs/guides/text-to-speech) for
  text-to-speech. At the time of writing the default model is `tts-1` and the
  default voice is `nova`. They can be configured through the `model` and
  `voice` settings respectively on the `tts.openai` plugin. See the OpenAI
  documentation for a list of available
  [models](https://platform.openai.com/docs/models/tts) and
  [voices](https://platform.openai.com/docs/guides/text-to-speech/voice-options).

You will need an [OpenAI API key](https://platform.openai.com/api-keys)
associated with your account.

A basic configuration would look like this:

```yaml
openai:
  api_key: YOUR_OPENAI_API_KEY  # Required
  # conversation_start_sound: ...
  # model: ...
  # context: ...
  # context_expiry: ...
  # max_tokens: ...

assistant.openai:
  # model: ...
  # tts_plugin: some.other.tts.plugin

tts.openai:
  # model: ...
  # voice: ...
```

If you want to build your custom hooks on speech events, the approach is the
same as seen for the other `assistant` plugins - create an event hook on
[`SpeechRecognizedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent)
with a given exact phrase, regex or template.

#### Hotword support

OpenAI doesn't provide an API for hotword detection, nor a small model for
offline detection.

This means that, if no other `assistant` plugin with stand-alone hotword
support is configured (only `assistant.picovoice` for now), a conversation can
only be triggered by calling the `assistant.openai.start_conversation` action.

If you want hotword support, then your best bet is to add `assistant.picovoice`
to your configuration too - but make sure to only enable hotword detection and
not speech detection, which will be delegated to `assistant.openai` via an
event hook:

```yaml
assistant.picovoice:
  access_key: ...
  keywords:
    - computer

  hotword_enabled: true
  stt_enabled: false
  # conversation_start_sound: ...
```

Then create a hook that listens for
[`HotwordDetectedEvent`](https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.HotwordDetectedEvent)
and calls `assistant.openai.start_conversation`:

```python
from platypush import run, when
from platypush.events.assistant import HotwordDetectedEvent


@when(HotwordDetectedEvent, hotword="computer")
def on_hotword_detected():
    run("assistant.openai.start_conversation")
```

#### Conversation contexts

The most powerful feature offered by the OpenAI assistant is the fact that it
leverages the *conversation contexts* provided by the OpenAI API.

This means two things:

1. Your assistant can be initialized/tuned with a *static context*. It is
   possible to provide some initialization context to the assistant that can
   fine-tune how the assistant will behave (e.g. what kind of
   tone/language/approach it will have when generating the responses), as well
   as initialize the assistant with some predefined knowledge in the form of
   hypothetical past conversations. Example:

   ```yaml
   openai:
     # ...

     context:
       # `system` can be used to initialize the context for the expected tone
       # and language in the assistant responses
       - role: system
         content: >
           You are a voice assistant that responds to user queries using
           references to Lovecraftian lore.

       # `user`/`assistant` interactions can be used to initialize the
       # conversation context with previous knowledge. `user` is used to
       # emulate previous user questions, and `assistant` models the
       # expected response.
       - role: user
         content: What is a telephone?
       - role: assistant
         content: >
           A Cthulhuian device that allows you to communicate with
           otherworldly beings. It is said that the first telephone was
           created by the Great Old Ones themselves, and that it is a
           gateway to the void beyond the stars.
   ```

   If you now start Platypush and ask a question like "*how does it work?*",
   the voice assistant may give a response along the lines of:

   ```
   The telephone functions by harnessing the eldritch energies of the cosmos to
   transmit vibrations through the ether, allowing communication across vast
   distances with entities from beyond the veil. Its operation is shrouded in
   mystery, for it relies on arcane principles incomprehensible to mortal
   minds.
   ```

   Note that:

   1. The style of the response is consistent with the one initialized in the
      `context` through the `system` role.

   2. Even though a question like "*how does it work?*" is not very specific,
      the assistant treats the `user`/`assistant` entries given in the context
      as if they were the latest conversation prompts. Thus it realizes that
      "*it*", in this context, probably means "*the telephone*".

2. The assistant has a *runtime context*. It will remember the recent
   conversations for a given amount of time (configurable through the
   `context_expiry` setting in the `openai` plugin configuration). So, even
   without explicit context initialization in the `openai` plugin, the plugin
   will remember the last interactions for (by default) 10 minutes. So if you
   ask "*who wrote the Divine Comedy?*", and a few seconds later you ask
   "*where was its writer from?*", you may get a response like "*Florence,
   Italy*" - i.e. the assistant realizes that "*the writer*" in this context is
   likely to mean "*the writer of the work that I was asked about in the
   previous interaction*" and returns pertinent information.
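
As a minimal sketch of where that setting lives (the value shown assumes it is
expressed in seconds - check the plugin documentation for the exact semantics):

```yaml
openai:
  api_key: YOUR_OPENAI_API_KEY
  # Runtime conversation context expiry (assuming seconds here,
  # i.e. 15 minutes - see the plugin documentation)
  context_expiry: 900
```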

#### Pros

- 👍 Speech detection quality. The OpenAI speech-to-text features are the best
  among the available `assistant` integrations. The `transcribe` API has so far
  detected my non-native English accent right nearly 100% of the time (Google
  comes close to 90%, while Picovoice trails quite a bit behind). And it even
  detects the speech of my young kid - something that the Google Assistant
  library has always failed to do right.

- 👍 Text-to-speech quality. The voice models used by OpenAI sound much more
  natural and human than those of both Google and Picovoice. Google's and
  Picovoice's TTS models are actually already quite solid, but OpenAI
  outclasses them when it comes to voice modulation, inflections and sentiment.
  The result sounds intimidatingly realistic.

- 👍 AI response quality. While the scope of the Google Assistant is somewhat
  limited by what people expected from voice assistants until a few years ago
  (control some devices and gadgets, find my phone, tell me the news/weather,
  do basic Google searches...), usually without much room for follow-ups,
  `assistant.openai` will basically render voice responses as if you were
  typing them directly to ChatGPT. While Google would often respond with a
  "*sorry, I don't understand*" or "*sorry, I can't help with that*", the
  OpenAI assistant is more likely to expose its reasoning, ask follow-up
  questions to refine its understanding, and in general create a much more
  realistic conversation.

- 👍 Contexts. They are an extremely powerful way to initialize your assistant
  and customize it to speak the way you want, and to know the kind of things
  that you want it to know. Cross-conversation contexts with configurable
  expiry also make it more natural to ask something, get an answer, and then
  ask another question about the same topic a few seconds later, without
  having to reintroduce the assistant to the whole context.

- 👍 Offline transcriptions available through the `openai.transcribe` action.

- 👍 Multi-language support seems to work great out of the box. Ask something
  to the assistant in any language, and it'll give you a response in that
  language.

- 👍 Configurable voices and models.

#### Cons

- 👎 The full pack of features is only available if you have an API key
  associated with a paid OpenAI account.

- 👎 No hotword support. It relies on `assistant.picovoice` for hotword
  detection.

- 👎 No intents support.

- 👎 No native support for weather forecasts, alarms, timers, integrations with
  other services/devices, nor other features available out of the box with the
  Google Assistant. You can always create hooks for them though.

### Weather forecast example

Both the OpenAI and Picovoice integrations lack some features available out of
the box on the Google Assistant - weather forecast, news playback, timers etc. -
as they rely on voice-only APIs that by default don't connect to other services.

However, Platypush provides many plugins to fill those gaps, and those features
can be implemented with custom event hooks.

Let's see, for example, how to build a simple hook that delivers the weather
forecast for the next 24 hours whenever the assistant gets a phrase that
contains the "*weather today*" string.

You'll need to enable a `weather` plugin in Platypush -
[`weather.openweathermap`](https://docs.platypush.tech/platypush/plugins/weather.openweathermap.html)
will be used in this example. Configuration:

```yaml
weather.openweathermap:
  token: OPENWEATHERMAP_API_KEY
  location: London,GB
```

Then drop a script named e.g. `weather.py` in the Platypush scripts directory
(default: `<CONFDIR>/scripts`) with the following content:

```python
from datetime import datetime
from textwrap import dedent
from time import time

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent


@when(SpeechRecognizedEvent, phrase='weather today')
def weather_forecast(event: SpeechRecognizedEvent):
    limit = time() + 24 * 60 * 60  # 24 hours from now
    forecast = [
        weather
        for weather in run("weather.openweathermap.get_forecast")
        if datetime.fromisoformat(weather["time"]).timestamp() < limit
    ]

    min_temp = round(
        min(weather["temperature"] for weather in forecast)
    )
    max_temp = round(
        max(weather["temperature"] for weather in forecast)
    )
    max_wind_gust = round(
        (max(weather["wind_gust"] for weather in forecast)) * 3.6
    )
    summaries = [weather["summary"] for weather in forecast]
    most_common_summary = max(summaries, key=summaries.count)
    avg_cloud_cover = round(
        sum(weather["cloud_cover"] for weather in forecast) / len(forecast)
    )

    event.assistant.render_response(
        dedent(
            f"""
            The forecast for today is: {most_common_summary}, with
            a minimum of {min_temp} and a maximum of {max_temp}
            degrees, wind gust of {max_wind_gust} km/h, and an
            average cloud cover of {avg_cloud_cover}%.
            """
        )
    )
```

This script will work with any of the available voice assistants.

You can also implement something similar for news playback, for example using
the [`rss` plugin](https://docs.platypush.tech/platypush/plugins/rss.html) to
get the latest items in your subscribed feeds. Or to create custom alarms using
the [`alarm` plugin](https://docs.platypush.tech/platypush/plugins/alarm.html),
or a timer using the [`utils.set_timeout`
action](https://docs.platypush.tech/platypush/plugins/utils.html#platypush.plugins.utils.UtilsPlugin.set_timeout).

## Conclusions

The past few years have seen a lot of things happen in the voice industry.
Many products have gone out of the market, been deprecated or sunset, but not
all hope is lost. The OpenAI and Picovoice products, especially when combined
together, can still provide a good out-of-the-box voice assistant experience.
And the OpenAI products have also raised the bar on what to expect from an
AI-based assistant.

I wish that there were still some fully open and on-device alternatives out
there, now that Mycroft, Snowboy and DeepSpeech are all gone. OpenAI and Google
provide the best voice experience as of now, but of course they come with
trade-offs - namely the great amount of data points you feed to these
cloud-based services. Picovoice is somewhat of a trade-off, as it runs at least
partly on-device, but their business model is still a bit fuzzy, and it's not
clear whether they intend to have their products used by the wider public or if
it's mostly B2B.

I'll keep an eye, however, on what is going to come out of the ashes of Mycroft
in the form of the
[OpenConversational](https://community.openconversational.ai/) project, and
I'll probably keep you up-to-date when there is a new integration to share.