52 KiB
Those who have been following my blog or used Platypush for a while probably know that I've put quite some efforts to get voice assistants rights over the past few years.
I built my first (very primitive) voice assistant that used DCT+Markov models back in 2008, when the concept was still pretty much a science fiction novelty.
Then I wrote an article in 2019 and one in 2020 on how to use several voice integrations in Platypush to create custom voice assistants.
Everyone in those pictures is now dead
Quite a few things have changed in this industry niche since I wrote my previous article. Most of the solutions that I covered back in the day, unfortunately, are gone in a way or another:
-
The
assistant.snowboy
integration is gone because unfortunately Snowboy is gone. For a while you could still run the Snowboy code with models that either you had previously downloaded from their website or trained yourself, but my latest experience proved to be quite unfruitful - it's been more than 4 years since the last commit on Snowboy, and it's hard to get the code to even run. -
The
assistant.alexa
integration is also gone, as Amazon has stopped maintaining the AVS SDK. And I have literally no clue of what Amazon's plans with the development of Alexa skills are (if there are any plans at all). -
The
stt.deepspeech
integration is also gone: the project hasn't seen a commit in 3 years and I even struggled to get the latest code to run. Given the current financial situation at Mozilla, and the fact that they're trying to cut as much as possible on what they don't consider part of their core product, it's very unlikely that DeepSpeech will be revived any time soon. -
The
assistant.google
integration is still there, but I can't make promises on how long it can be maintained. It uses thegoogle-assistant-library
, which was deprecated in 2019. Google replaced it with the conversational actions, which was also deprecated last year.<rant>
Put here your joke about Google building products with the shelf life of a summer hit.</rant>
-
The
tts.mimic3
integration, a text model based on mimic3, part of the Mycroft initiative, is still there, but only because it's still possible to spin up a Docker image that runs mimic3. The whole Mycroft project, however, is now defunct, and the story of how it went bankrupt is a very sad story about the power that patent trolls have on startups. The Mycroft initiative however seems to have been picked up by the community, and something seems to move in the space of fully open source and on-device voice models. I'll definitely be looking with interest at what happens in that space, but the project seems to be at a stage that is still a bit immature to justify an investment into a new Platypush integration.
But not all hope is lost
assistant.google
assistant.google
may be relying on a dead library, but it's not dead (yet).
The code still works, but you're a bit constrained on the hardware side - the
assistant library only supports x86_64 and ARMv7 (namely, only Raspberry Pi 3
and 4). No ARM64 (i.e. no Raspberry Pi 5), and even running it on other
ARMv7-compatible devices has proved to be a challenge in some cases. Given the
state of the library, it's safe to say that it'll never be supported on other
platforms, but if you want to run your assistant on a device that is still
supported then it should still work fine.
I had however to do a few dirty packaging tricks to ensure that the assistant
library code doesn't break badly on newer versions of Python. That code hasn't
been touched in 5 years and it's starting to rot. It depends on ancient and
deprecated Python libraries like enum34
and it needs some hammering to work - without breaking the whole Python
environment in the process.
For now, pip install 'platypush[assistant.google]'
should do all the dirty
work and get all of your assistant dependencies installed. But I can't promise
I can maintain that code forever.
assistant.picovoice
Picovoice has been a nice surprise in an industry niche where all the products that were available just 4 years ago are now dead.
I described some of their products in my previous
articles,
and I even built a couple of stt.picovoice.*
plugins for Platypush back in
the day, but I didn't really put much effort in it.
Their business model seemed a bit weird - along the lines of "you can test our products on x86_64, if you need an ARM build you should contact us as a business partner". And the quality of their products was also a bit disappointing compared to other mainstream offerings.
I'm glad to see that the situation has changed quite a bit now. They still have a "sign up with a business email" model, but at least now you can just sign up on their website and start using their products rather than sending emails around. And I'm also quite impressed to see the progress on their website. You can now train hotword models, customize speech-to-text models and build your own intent rules directly from their website - a feature that was also available in the beloved Snowboy and that went missing from any major product offerings out there after Snowboy was gone. I feel like the quality of their models has also greatly improved compared to the last time I checked them - predictions are still slower than the Google Assistant, definitely less accurate with non-native accents, but the gap with the Google Assistant when it comes to native accents isn't very wide.
assistant.openai
OpenAI has filled many gaps left by all the casualties in the voice assistants
market. Platypush now provides a new assistant.openai
plugin that stitches
together several of their APIs to provide a voice assistant experience that
honestly feels much more natural than anything I've tried in all these years.
Let's explore how to use these integrations to build our on-device voice assistant with custom rules.
Feature comparison
As some of you may know, voice assistant often aren't monolithic products.
Unless explicitly designed as all-in-one packages (like the
google-assistant-library
), voice assistant integrations in Platypush are
usually built on top of four distinct APIs:
-
Hotword detection: This is the component that continuously listens on your microphone until you speak "Ok Google", "Alexa" or any other wake-up word used to start a conversation. Since it's a continuously listening component that needs to take decisions fast, and it only has to recognize one word (or in a few cases 3-4 more at most), it usually doesn't need to run on a full language model. It needs small models, often a couple of MBs heavy at most.
-
Speech-to-text (STT): This is the component that will capture audio from the microphone and use some API to transcribe it to text.
-
Response engine: Once you have the transcription of what the user said, you need to feed it to some model that will generate some human-like response for the question.
-
Text-to-speech (TTS): Once you have your AI response rendered as a text string, you need a text-to-speech model to speak it out loud on your speakers or headphones.
On top of these basic building blocks for a voice assistant, some integrations may also provide two extra features.
Speech-to-intent
In this mode, the user's prompt, instead of being transcribed directly to text, is transcribed into a structured intent that can be more easily processed by a downstream integration with no need for extra text parsing, regular expressions etc.
For instance, a voice command like "turn off the bedroom lights" could be translated into an intent such as:
{
"intent": "lights_ctrl",
"slots": {
"state": "off",
"lights": "bedroom"
}
}
Offline speech-to-text
a.k.a. offline text transcriptions. Some assistant integrations may offer you the ability to pass some audio file and transcribe their content as text.
Features summary
This table summarizes how the assistant
integrations available in Platypush
compare when it comes to what I would call the foundational blocks:
Plugin | Hotword | STT | AI responses | TTS |
---|---|---|---|---|
assistant.google |
✅ | ✅ | ✅ | ✅ |
assistant.openai |
❌ | ✅ | ✅ | ✅ |
assistant.picovoice |
✅ | ✅ | ❌ | ✅ |
And this is how they compare in terms of extra features:
Plugin | Intents | Offline SST |
---|---|---|
assistant.google |
❌ | ❌ |
assistant.openai |
❌ | ✅ |
assistant.picovoice |
✅ | ✅ |
Let's see a few configuration examples to better understand the pros and cons of each of these integrations.
Configuration
Hardware requirements
-
A computer, a Raspberry Pi, an old tablet, or anything in between, as long as it can run Python. At least 1GB of RAM is advised for smooth audio processing experience.
-
A microphone.
-
Speaker/headphones.
Installation notes
Platypush 1.0.0 has recently been released, and new installation procedures with it.
There's now official support for several package
managers,
a better Docker installation
process, and more
powerful ways to install
plugins - via
pip
extras,
Web
interface,
Docker and
virtual
environments.
The optional dependencies for any Platypush plugins can be installed via pip
extras in the simplest case:
$ pip install 'platypush[plugin1,plugin2,...]'
For example, if you want to install Platypush with the dependencies for
assistant.openai
and assistant.picovoice
:
$ pip install 'platypush[assistant.openai,assistant.picovoice]'
Some plugins however may require extra system dependencies that are not
available via pip
- for instance, both the OpenAI and Picovoice integrations
require the ffmpeg
binary to be installed, as it is used for audio
conversion and exporting purposes. You can check the plugins
documentation for any system dependencies
required by some integrations, or install them automatically through the Web
interface or the platydock
command for Docker containers.
A note on the hooks
All the custom actions in this article are built through event hooks triggered
by
SpeechRecognizedEvent
(or
IntentRecognizedEvent
for intents). When an intent event is triggered, or a speech event with a
condition on a phrase, the assistant
integrations in Platypush will prevent
the default assistant response. That's to avoid cases where e.g. you say "turn
off the lights", your hook takes care of running the actual action, while your
voice assistant fetches a response from Google or ChatGPT along the lines of
"sorry, I can't control your lights".
If you want to render a custom response from an event hook, you can do so by
calling event.assistant.render_response(text)
, and it will be spoken using
the available text-to-speech integration.
If you want to disable this behaviour, and you want the default assistant
response to always be rendered, even if it matches a hook with a phrase or an
intent, you can do so by setting the stop_conversation_on_speech_match
parameter to false
in your assistant plugin configuration.
Text-to-speech
Each of the available assistant
plugins has it own default tts
plugin associated:
-
assistant.google
:tts
, buttts.google
is also available. The difference is thattts
uses the (unofficial) Google Translate frontend API - it requires no extra configuration, but besides setting the input language it isn't very configurable.tts.google
on the other hand uses the Google Cloud Translation API. It is much more versatile, but it requires an extra API registered to your Google project and an extra credentials file. -
assistant.openai
:tts.openai
, which leverages the OpenAI text-to-speech API. -
assistant.picovoice
:tts.picovoice
, which uses the (still experimental, at the time of writing) Picovoice Orca engine.
Any text rendered via assistant*.render_response
will be rendered using the
associated TTS plugin. You can however customize it by setting tts_plugin
on
your assistant plugin configuration - e.g. you can render responses from the
OpenAI assistant through the Google or Picovoice engine, or the other way
around.
tts
plugins also expose a say
action that can be called outside of an
assistant context to render custom text at runtime - for example, from other
event
hooks,
procedures,
cronjobs
or API calls. For example:
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
"type": "request",
"action": "tts.openai.say",
"args": {
"text": "What a wonderful day!"
}
}
' http://localhost:8008/execute
assistant.google
- Plugin documentation
pip
installation:pip install 'platypush[assistant.google]'
This is the oldest voice integration in Platypush - and one of the use-cases that actually motivated me into forking the previous project into what is now Platypush.
As mentioned in the previous section, this integration is built on top of a deprecated library (with no available alternatives) that just so happens to still work with a bit of hammering on x86_64 and Raspberry Pi 3/4.
Personally it's the voice assistant I still use on most of my devices, but it's definitely not guaranteed that it will keep working in the future.
Once you have installed Platypush with the dependencies for this integration, you can configure it through these steps:
- Create a new project on the Google developers console and generate a new set of credentials for it. Download the credentials secrets as JSON.
- Generate scoped
credentials
from your
secrets.json
. - Configure the integration in your
config.yaml
for Platypush (see the configuration page for more details):
assistant.google:
# Default: ~/.config/google-oauthlib-tool/credentials.json
# or <PLATYPUSH_WORKDIR>/credentials/google/assistant.json
credentials_file: /path/to/credentials.json
# Default: no sound is played when "Ok Google" is detected
conversation_start_sound: /path/to/sound.mp3
Restart the service, say "Ok Google" or "Hey Google" while the microphone is active, and everything should work out of the box.
You can now start creating event hooks to execute your custom voice commands.
For example, if you configured a lights plugin (e.g.
light.hue
)
and a music plugin (e.g.
music.mopidy
),
you can start building voice commands like these:
# Content of e.g. /path/to/config_yaml/scripts/assistant.py
from platypush import run, when
from platypush.events.assistant import (
ConversationStartEvent, SpeechRecognizedEvent
)
light_plugin = "light.hue"
music_plugin = "music.mopidy"
@when(ConversationStartEvent)
def pause_music_when_conversation_starts():
run(f"{music_plugin}.pause_if_playing")
# Note: (limited) support for regular expressions on `phrase`
# This hook will match any phrase containing either "turn on the lights"
# or "turn off the lights"
@when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
def lights_on_command():
run(f"{light_plugin}.on")
# Or, with arguments:
# run(f"{light_plugin}.on", groups=["Bedroom"])
@when(SpeechRecognizedEvent, phrase="turn off (the)? lights")
def lights_off_command():
run(f"{light_plugin}.off")
@when(SpeechRecognizedEvent, phrase="play (the)? music")
def play_music_command():
run(f"{music_plugin}.play")
@when(SpeechRecognizedEvent, phrase="stop (the)? music")
def stop_music_command():
run(f"{music_plugin}.stop")
Or, via YAML:
# Add to your config.yaml, or to one of the files included in it
event.hook.pause_music_when_conversation_starts:
if:
type: platypush.message.event.ConversationStartEvent
then:
- action: music.mopidy.pause_if_playing
event.hook.lights_on_command:
if:
type: platypush.message.event.SpeechRecognizedEvent
phrase: "turn on (the)? lights"
then:
- action: light.hue.on
# args:
# groups:
# - Bedroom
event.hook.lights_off_command:
if:
type: platypush.message.event.SpeechRecognizedEvent
phrase: "turn off (the)? lights"
then:
- action: light.hue.off
event.hook.play_music_command:
if:
type: platypush.message.event.SpeechRecognizedEvent
phrase: "play (the)? music"
then:
- action: music.mopidy.play
event.hook.stop_music_command:
if:
type: platypush.message.event.SpeechRecognizedEvent
phrase: "stop (the)? music"
then:
- action: music.mopidy.stop
Parameters are also supported on the phrase
event argument through the ${}
template construct. For example:
from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent
@when(SpeechRecognizedEvent, phrase='play ${title} by ${artist}')
def on_play_track_command(
event: SpeechRecognizedEvent, title: str, artist: str
):
results = run(
"music.mopidy.search",
filter={"title": title, "artist": artist}
)
if not results:
event.assistant.render_response(f"Couldn't find {title} by {artist}")
return
run("music.mopidy.play", resource=results[0]["uri"])
Pros
- 👍 Very fast and robust API.
- 👍 Easy to install and configure.
- 👍 It comes with almost all the features of a voice assistant installed on Google hardware - except some actions native to Android-based devices and video/display features. This means that features such as timers, alarms, weather forecast, setting the volume or controlling Chromecasts on the same network are all supported out of the box.
- 👍 It connects to your Google account (can be configured from your Google settings), so things like location-based suggestions and calendar events are available. Support for custom actions and devices configured in your Google Home app is also available out of the box, although I haven't tested it in a while.
- 👍 Good multi-language support. In most of the cases the assistant seems quite capable of understanding questions in multiple language and respond in the input language without any further configuration.
Cons
- 👎 Based on a deprecated API that could break at any moment.
- 👎 Limited hardware support (only x86_64 and RPi 3/4).
- 👎 Not possible to configure the hotword - only "Ok/Hey Google" is available.
- 👎 Not possible to configure the output voice - it can only use the stock Google Assistant voice.
- 👎 No support for intents - something similar was available (albeit tricky to configure) through the Actions SDK, but that has also been abandoned by Google.
- 👎 Not very modular. Both
assistant.picovoice
andassistant.openai
have been built by stitching together different independent APIs. Those plugins are therefore quite modular. You can choose for instance to run only the hotword engine ofassistant.picovoice
, which in turn will trigger the conversation engine ofassistant.openai
, and maybe usetts.google
to render the responses. By contrast, given the relatively monolithic nature ofgoogle-assistant-library
, which runs the whole service locally, if your instance runsassistant.google
then it can't run other assistant plugins.
assistant.picovoice
- Plugin documentation
pip
installation:pip install 'platypush[assistant.picovoice]'
The assistant.picovoice
integration is available from Platypush
1.0.0.
Previous versions had some outdated sst.picovoice.*
plugins for the
individual products, but they weren't properly tested and they weren't combined
together into a single integration that implements the Platypush' assistant
API.
This integration is built on top of the voice products developed by Picovoice. These include:
-
Porcupine: a fast and customizable engine for hotword/wake-word detection. It can be enabled by setting
hotword_enabled
totrue
in theassistant.picovoice
plugin configuration. -
Cheetah: a speech-to-text engine optimized for real-time transcriptions. It can be enabled by setting
stt_enabled
totrue
in theassistant.picovoice
plugin configuration. -
Leopard: a speech-to-text engine optimized for offline transcriptions of audio files.
-
Rhino: a speech-to-intent engine.
-
Orca: a text-to-speech engine.
You can get your personal access key by signing up at the Picovoice console. You may be asked to submit a reason for using the service (feel free to mention a personal Platypush integration), and you will receive your personal access key.
If prompted to select the products you want to use, make sure to select
the ones from the Picovoice suite that you want to use with the
assistant.picovoice
plugin.
A basic plugin configuration would like this:
assistant.picovoice:
access_key: YOUR_ACCESS_KEY
# Keywords that the assistant should listen for
keywords:
- alexa
- computer
- ok google
# Paths to custom keyword files
# keyword_paths:
# - ~/.local/share/picovoice/keywords/linux/custom_linux.ppn
# Enable/disable the hotword engine
hotword_enabled: true
# Enable the STT engine
stt_enabled: true
# conversation_start_sound: ...
# Path to a custom model to be used to speech-to-text
# speech_model_path: ~/.local/share/picovoice/models/cheetah/custom-en.pv
# Path to an intent model. At least one custom intent model is required if
# you want to enable intent detection.
# intent_model_path: ~/.local/share/picovoice/models/rhino/custom-en-x86.rhn
Hotword detection
If enabled through the hotword_enabled
parameter (default: True), the
assistant will listen for a specific wake word before starting the
speech-to-text or intent recognition engines. You can specify custom models for
your hotword (e.g. on the same device you may use "Alexa" to trigger the
speech-to-text engine in English, "Computer" to trigger the speech-to-text
engine in Italian, and "Ok Google" to trigger the intent recognition engine).
You can also create your custom hotword models using the Porcupine console.
If hotword_enabled
is set to True, you must also specify the keywords
parameter with the list of keywords that you want to listen for, and optionally
the keyword_paths
parameter with the paths to the any custom hotword models
that you want to use. If hotword_enabled
is set to False, then the assistant
won't start listening for speech after the plugin is started, and you will need
to programmatically start the conversation by calling the
assistant.picovoice.start_conversation
action.
When a wake-word is detected, the assistant will emit a
HotwordDetectedEvent
that you can use to build your custom logic.
By default, the assistant will start listening for speech after the hotword if
either stt_enabled
or intent_model_path
are set. If you don't want the
assistant to start listening for speech after the hotword is detected (for
example because you want to build your custom response flows, or trigger the
speech detection using different models depending on the hotword that is used,
or because you just want to detect hotwords but not speech), then you can also
set the start_conversation_on_hotword
parameter to false
. If that is the
case, then you can programmatically start the conversation by calling the
assistant.picovoice.start_conversation
method in your event hooks:
from platypush import when, run
from platypush.message.event.assistant import HotwordDetectedEvent
# Start a conversation using the Italian language model when the
# "Buongiorno" hotword is detected
@when(HotwordDetectedEvent, hotword='Buongiorno')
def on_it_hotword_detected(event: HotwordDetectedEvent):
event.assistant.start_conversation(model_file='path/to/it.pv')
Speech-to-text
If you want to build your custom STT hooks, the approach is the same seen for
the assistant.google
plugins - create an event hook on
SpeechRecognizedEvent
with a given exact phrase, regex or template.
Speech-to-intent
Intents are structured actions parsed from unstructured human-readable text.
Unlike with hotword and speech-to-text detection, you need to provide a custom model for intent detection. You can create your custom model using the Rhino console.
When an intent is detected, the assistant will emit an
IntentRecognizedEvent
and you can build your custom hooks on it.
For example, you can build a model to control groups of smart lights by defining the following slots on the Rhino console:
-
device_state
: The new state of the device (e.g. withon
oroff
as supported values) -
room
: The name of the room associated to the group of lights to be controlled (e.g.living room
,kitchen
,bedroom
)
You can then define a lights_ctrl
intent with the following expressions:
- "turn
$device_state:state
the lights" - "turn
$device_state:state
the$room:room
lights" - "turn the lights
$device_state:state
" - "turn the
$room:room
lights$device_state:state
" - "turn
$room:room
lights$device_state:state
"
This intent will match any of the following phrases:
- "turn on the lights"
- "turn off the lights"
- "turn the lights on"
- "turn the lights off"
- "turn on the living room lights"
- "turn off the living room lights"
- "turn the living room lights on"
- "turn the living room lights off"
And it will extract any slots that are matched in the phrases in the
IntentRecognizedEvent
.
Train the model, download the context file, and pass the path on the
intent_model_path
parameter.
You can then register a hook to listen to a specific intent:
from platypush import when, run
from platypush.events.assistant import IntentRecognizedEvent
@when(IntentRecognizedEvent, intent='lights_ctrl', slots={'state': 'on'})
def on_turn_on_lights(event: IntentRecognizedEvent):
room = event.slots.get('room')
if room:
run("light.hue.on", groups=[room])
else:
run("light.hue.on")
Note that if both stt_enabled
and intent_model_path
are set, then
both the speech-to-text and intent recognition engines will run in parallel
when a conversation is started.
The intent engine is usually faster, as it has a smaller set of intents to
match and doesn't have to run a full speech-to-text transcription. This means that,
if an utterance matches both a speech-to-text phrase and an intent, the
IntentRecognizedEvent
event is emitted (and not SpeechRecognizedEvent
).
This may not be always the case though. So, if you want to use the intent
detection engine together with the speech detection, it may be a good practice
to also provide a fallback SpeechRecognizedEvent
hook to catch the text if
the speech is not recognized as an intent:
from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent
@when(SpeechRecognizedEvent, phrase='turn ${state} (the)? ${room} lights?')
def on_turn_on_lights(event: SpeechRecognizedEvent, phrase, room, **context):
if room:
run("light.hue.on", groups=[room])
else:
run("light.hue.on")
Text-to-speech and response management
The text-to-speech engine, based on Orca, is provided by the
tts.picovoice
plugin.
However, the Picovoice integration won't provide you with automatic AI-generated responses for your queries. That's because Picovoice doesn't seem to offer (yet) any products for conversational assistants, either voice-based or text-based.
You can however leverage the render_response
action to render some text as
speech in response to a user command, and that in turn will leverage the
Picovoice TTS plugin to render the response.
For example, the following snippet provides a hook that:
-
Listens for
SpeechRecognizedEvent
. -
Matches the phrase against a list of predefined commands that shouldn't require an AI-generated response.
-
Has a fallback logic that leverages
openai.get_response
to generate a response through a ChatGPT model and render it as audio.
Also, note that any text rendered over the render_response
action that ends
with a question mark will automatically trigger a follow-up - i.e. the
assistant will wait for the user to answer its question.
import re
from platypush import hook, run
from platypush.message.event.assistant import SpeechRecognizedEvent
def play_music():
run("music.mopidy.play")
def stop_music():
run("music.mopidy.stop")
def ai_assist(event: SpeechRecognizedEvent):
response = run("openai.get_response", prompt=event.phrase)
if not response:
return
run("assistant.picovoice.render_response", text=response)
# List of commands to match, as pairs of regex patterns and the
# corresponding actions
hooks = (
(re.compile(r"play (the)?music", re.IGNORECASE), play_music),
(re.compile(r"stop (the)?music", re.IGNORECASE), stop_music),
# ...
# Fallback to the AI assistant
(re.compile(r".*"), ai_assist),
)
@when(SpeechRecognizedEvent)
def on_speech_recognized(event, **kwargs):
for pattern, command in hooks:
if pattern.search(event.phrase):
run("logger.info", msg=f"Running voice command: {command.__name__}")
command(event, **kwargs)
break
Offline speech-to-text
An assistant.picovoice.transcribe
action
is provided for offline transcriptions of audio files, using the Leopard
models.
You can easily call it from your procedures, hooks or through the API:
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
"type": "request",
"action": "assistant.picovoice.transcribe",
"args": {
"audio_file": "/path/to/some/speech.mp3"
}
}' http://localhost:8008/execute
{
"transcription": "This is a test",
"words": [
{
"word": "this",
"start": 0.06400000303983688,
"end": 0.19200000166893005,
"confidence": 0.9626294374465942
},
{
"word": "is",
"start": 0.2879999876022339,
"end": 0.35199999809265137,
"confidence": 0.9781675934791565
},
{
"word": "a",
"start": 0.41600000858306885,
"end": 0.41600000858306885,
"confidence": 0.9764975309371948
},
{
"word": "test",
"start": 0.5120000243186951,
"end": 0.8320000171661377,
"confidence": 0.9511580467224121
}
]
}
Pros
-
👍 The Picovoice integration is extremely configurable.
assistant.picovoice
stitches together five independent products developed by a small company specialized in voice products for developers. As such, Picovoice may be the best option if you have custom use-cases. You can pick which features you need (hotword, speech-to-text, speech-to-intent, text-to-speech...) and you have plenty of flexibility in building your integrations. -
👍 Runs (or seems to run) (mostly) on device. This is something that we can't say about the other two integrations discussed in this article. If keeping your voice interactions 100% hidden from Google's or Microsoft's eyes is a priority, then Picovoice may be your best bet.
-
👍 Rich features. It uses different models for different purposes - for example, Cheetah models are optimized for real-time speech detection, while Leopard is optimized for offline transcription. Moreover, Picovoice is the only integration among those analyzed in this article to support speech-to-intent.
-
👍 It's very easy to build new models or customize existing ones. Picovoice has a powerful developers console that allows you to easily create hotword models, tweak the priority of some words in voice models, and create custom intent models.
Cons
-
👎 The business model is still a bit weird. It's better than the earlier "write us an email with your business case and we'll reach back to you", but it still requires you to sign up with a business email and write a couple of lines on what you want to build with their products. It feels like their focus is on a B2B approach rather than "open up and let the community build stuff", and that seems to create unnecessary friction.
-
👎 No native conversational features. At the time of writing, Picovoice doesn't offer products that generate AI responses given voice or text prompts. This means that, if you want AI-generated responses to your queries, you'll have to do requests to e.g.
openai.get_response(prompt)
directly in your hooks forSpeechRecognizedEvent
, and render the responses throughassistant.picovoice.render_response
. This makes the use ofassistant.picovoice
alone more fit to cases where you want to mostly create voice command hooks rather than have general-purpose conversations. -
👎 Speech-to-text, at least on my machine, is slower than the other two integrations, and the accuracy with non-native accents is also much lower.
-
👎 Limited support for any languages other than English. At the time of writing hotword detection with Porcupine seems to be in a relative good shape with support for 16 languages. However, both speech-to-text and text-to-speech only support English at the moment.
-
👎 Some APIs are still quite unstable. The Orca text-to-speech API, for example, doesn't even support text that includes digits or some punctuation characters - at least not at the time of writing. The Platypush integration fills the gap with workarounds that e.g. replace words to numbers and replace punctuation characters, but you definitely have a feeling that some parts of their products are still work in progress.
assistant.openai
- Plugin documentation
pip
installation:pip install 'platypush[assistant.openai]'
This integration has been released in Platypush 1.0.7.
It uses the following OpenAI APIs:
/audio/transcriptions
for speech-to-text. At the time of writing the default model iswhisper-1
. It can be configured through themodel
setting on theassistant.openai
plugin configuration. See the OpenAI documentation for a list of available models./chat/completions
to get AI-generated responses using a GPT model. At the time of writing the default isgpt-3.5-turbo
, but it can be configurable through themodel
setting on theopenai
plugin configuration. See the OpenAI documentation for a list of supported models./audio/speech
for text-to-speech. At the time of writing the default model istts-1
and the default voice isnova
. They can be configured through themodel
andvoice
settings respectively on thetts.openai
plugin. See the OpenAI documentation for a list of available models and voices.
You will need an OpenAI API key associated to your account.
A basic configuration would like this:
openai:
api_key: YOUR_OPENAI_API_KEY # Required
# conversation_start_sound: ...
# model: ...
# context: ...
# context_expiry: ...
# max_tokens: ...
assistant.openai:
# model: ...
# tts_plugin: some.other.tts.plugin
tts.openai:
# model: ...
# voice: ...
If you want to build your custom hooks on speech events, the approach is the
same seen for the other assistant
plugins - create an event hook on
SpeechRecognizedEvent
with a given exact phrase, regex or template.
Hotword support
OpenAI doesn't provide an API for hotword detection, nor a small model for offline detection.
This means that, if no other assistant
plugins with stand-alone hotword
support are configured (only assistant.picovoice
for now), a conversation can
only be triggered by calling the assistant.openai.start_conversation
action.
If you want hotword support, then the best bet is to add assistant.picovoice
to your configuration too - but make sure to only enable hotword detection and
not speech detection, which will be delegated to assistant.openai
via event
hook:
assistant.picovoice:
access_key: ...
keywords:
- computer
hotword_enabled: true
stt_enabled: false
# conversation_start_sound: ...
Then create a hook that listens for
HotwordDetectedEvent
and calls assistant.openai.start_conversation
:
from platypush import run, when
from platypush.events.assistant import HotwordDetectedEvent
@when(HotwordDetectedEvent, hotword="computer")
def on_hotword_detected():
run("assistant.openai.start_conversation")
Conversation contexts
The most powerful feature offered by the OpenAI assistant is the fact that it leverages the conversation contexts provided by the OpenAI API.
This means two things:
- Your assistant can be initialized/tuned with a static context. It is possible to provide some initialization context to the assistant that can fine tune how the assistant will behave, (e.g. what kind of tone/language/approach will have when generating the responses), as well as initialize the assistant with some predefined knowledge in the form of hypothetical past conversations. Example:
openai:
# ...
context:
# `system` can be used to initialize the context for the expected tone
# and language in the assistant responses
- role: system
content: >
You are a voice assistant that responds to user queries using
references to Lovecraftian lore.
# `user`/`assistant` interactions can be used to initialize the
# conversation context with previous knowledge. `user` is used to
# emulate previous user questions, and `assistant` models the
# expected response.
- role: user
content: What is a telephone?
- role: assistant
content: >
A Cthulhuian device that allows you to communicate with
otherworldly beings. It is said that the first telephone was
created by the Great Old Ones themselves, and that it is a
gateway to the void beyond the stars.
If you now start Platypush and ask a question like "how does it work?", the voice assistant may give a response along the lines of:
The telephone functions by harnessing the eldritch energies of the cosmos to
transmit vibrations through the ether, allowing communication across vast
distances with entities from beyond the veil. Its operation is shrouded in
mystery, for it relies on arcane principles incomprehensible to mortal
minds.
Note that:
-
The style of the response is consistent with that initialized in the
context
throughsystem
roles. -
Even though a question like "how does it work?" is not very specific, the assistant treats the
user
/assistant
entries given in the context as if they were the latest conversation prompts. Thus it realizes that "it", in this context, probably means "the telephone". -
The assistant has a runtime context. It will remember the recent conversations for a given amount of time (configurable through the
context_expiry
setting on theopenai
plugin configuration). So, even without explicit context initialization in theopenai
plugin, the plugin will remember the last interactions for (by default) 10 minutes. So if you ask "who wrote the Divine Comedy?", and a few seconds later you ask "where was its writer from?", you may get a response like "Florence, Italy" - i.e. the assistant realizes that "the writer" in this context is likely to mean "the writer of the work that I was asked about in the previous interaction" and return pertinent information.
Pros
-
👍 Speech detection quality. The OpenAI speech-to-text features are the best among the available
assistant
integrations. Thetranscribe
API so far has detected my non-native English accent right nearly 100% of the times (Google comes close to 90%, while Picovoice trails quite behind). And it even detects the speech of my young kid - something that the Google Assistant library has always failed to do right. -
👍 Text-to-speech quality. The voice models used by OpenAI sound much more natural and human than those of both Google and Picovoice. Google's and Picovoice's TTS models are actually already quite solid, but OpenAI outclasses them when it comes to voice modulation, inflections and sentiment. The result sounds intimidatingly realistic.
-
👍 AI responses quality. While the scope of the Google Assistant is somewhat limited by what people expected from voice assistants until a few years ago (control some devices and gadgets, find my phone, tell me the news/weather, do basic Google searches...), usually without much room for follow-ups,
assistant.openai
will basically render voice responses as if you were typing them directly to ChatGPT. While Google would often respond you with a "sorry, I don't understand", or "sorry, I can't help with that", the OpenAI assistant is more likely to expose its reasoning, ask follow-up questions to refine its understanding, and in general create a much more realistic conversation. -
👍 Contexts. They are an extremely powerful way to initialize your assistant and customize it to speak the way you want, and know the kind of things that you want it to know. Cross-conversation contexts with configurable expiry also make it more natural to ask something, get an answer, and then ask another question about the same topic a few seconds later, without having to reintroduce the assistant to the whole context.
-
👍 Offline transcriptions available through the
openai.transcribe
action. -
👍 Multi-language support seems to work great out of the box. Ask something to the assistant in any language, and it'll give you a response in that language.
-
👍 Configurable voices and models.
Cons
-
👎 The full pack of features is only available if you have an API key associated to a paid OpenAI account.
-
👎 No hotword support. It relies on
assistant.picovoice
for hotword detection. -
👎 No intents support.
-
👎 No native support for weather forecast, alarms, timers, integrations with other services/devices nor other features available out of the box with the Google Assistant. You can always create hooks for them though.
Weather forecast example
Both the OpenAI and Picovoice integrations lack some features available out of the box on the Google Assistant - weather forecast, news playback, timers etc. - as they rely on voice-only APIs that by default don't connect to other services.
However Platypush provides many plugins to fill those gaps, and those features can be implemented with custom event hooks.
Let's see for example how to build a simple hook that delivers the weather forecast for the next 24 hours whenever the assistant gets a phrase that contains the "weather today" string.
You'll need to enable a weather
plugin in Platypush -
weather.openweathermap
will be used in this example. Configuration:
weather.openweathermap:
token: OPENWEATHERMAP_API_KEY
location: London,GB
Then drop a script named e.g. weather.py
in the Platypush scripts directory
(default: <CONFDIR>/scripts
) with the following content:
from datetime import datetime
from textwrap import dedent
from time import time
from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent
@when(SpeechRecognizedEvent, phrase='weather today')
def weather_forecast(event: SpeechRecognizedEvent):
limit = time() + 24 * 60 * 60 # 24 hours from now
forecast = [
weather
for weather in run("weather.openweathermap.get_forecast")
if datetime.fromisoformat(weather["time"]).timestamp() < limit
]
min_temp = round(
min(weather["temperature"] for weather in forecast)
)
max_temp = round(
max(weather["temperature"] for weather in forecast)
)
max_wind_gust = round(
(max(weather["wind_gust"] for weather in forecast)) * 3.6
)
summaries = [weather["summary"] for weather in forecast]
most_common_summary = max(summaries, key=summaries.count)
avg_cloud_cover = round(
sum(weather["cloud_cover"] for weather in forecast) / len(forecast)
)
event.assistant.render_response(
dedent(
f"""
The forecast for today is: {most_common_summary}, with
a minimum of {min_temp} and a maximum of {max_temp}
degrees, wind gust of {max_wind_gust} km/h, and an
average cloud cover of {avg_cloud_cover}%.
"""
)
)
This script will work with any of the available voice assistants.
You can also implement something similar for news playback, for example using
the rss
plugin to
get the latest items in your subscribed feeds. Or to create custom alarms using
the alarm
plugin,
or a timer using the utils.set_timeout
action.
Conclusions
The past few years have seen a lot of things happen in the voice industry. Many products have gone out of market, been deprecated or sunset, but not all hope is lost. The OpenAI and Picovoice products, especially when combined together, can still provide a good out-of-the-box voice assistant experience. And the OpenAI products have also raised the bar on what to expect from an AI-based assistant.
I wish that there were still some fully open and on-device alternatives out there, now that Mycroft, Snowboy and DeepSpeech are all gone. OpenAI and Google provide the best voice experience as of now, but of course they come with trade-offs - namely the great amount of data points you feed to these cloud-based services. Picovoice is somewhat a trade-off, as it runs at least partly on-device, but their business model is still a bit fuzzy and it's not clear whether they intend to have their products used by the wider public or if it's mostly B2B.
I'll keep an eye however on what is going to come from the ashes of Mycroft under the form of the OpenConversational project, and probably keep you up-to-date when there is a new integration to share.