My dream of a piece of software that you could simply talk to and get things done started more than 10 years ago, when I was still a young M.Sc student who imagined getting common tasks done on my computer through the same kind of natural interaction you see between Dave and HAL 9000 in 2001: A Space Odyssey. Together with a friend I developed Voxifera way back in 2008. The software worked well enough for basic tasks, as long as I was the one speaking the commands and the list of custom voice commands stayed below 10 items; but in recent years Google and Amazon have gone way beyond what an M.Sc student alone could do with fast Fourier transforms and Markov models.
When, years later, I started building Platypush, I still dreamed of the same voice interface: one that leveraged the new technologies without being caged by the interactions natively provided by the commercial assistants. My goal was still to talk to my assistant and have it do whatever I wanted, regardless of the skills/integrations supported by the product, and regardless of whichever answer its AI was intended to provide for that phrase. Most of all, my goal was to have all the business logic of the actions run on my own device(s), not on someone else’s cloud.

I feel that by now that goal has been mostly accomplished (assistant technology with 100% flexibility when it comes to phrase patterns and custom actions), and today I’d like to show you how to set up your own Google Assistant on steroids with a Raspberry Pi, a microphone and platypush. I’ll also show how to run your own hotword detection models through the Snowboy integration, for those who want greater flexibility in how to summon their digital butler beyond the boring “Ok Google” formula, or who aren’t happy with the idea of Google constantly listening to everything that is said in the room. For those who are unfamiliar with platypush, I suggest reading my previous article on what it is, what it can do, why I built it and how to get started with it.
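Just to anticipate where we’re headed, the whole setup boils down to a few lines of YAML in the platypush configuration file. Take the snippet below as an indicative sketch rather than a copy-paste recipe: the backend names (`backend.assistant.google`, `backend.assistant.snowboy`) match the integrations covered in this article, but option names can change between releases, so check the documentation of your installed version.

```yaml
# Indicative sketch of ~/.config/platypush/config.yaml.
# Option names may vary across versions - check the docs.

# Google Assistant integration (default "Ok Google" hotword)
backend.assistant.google:

# Optional: custom hotword detection through Snowboy
backend.assistant.snowboy:
    models:
        # "computer" is a hypothetical custom model trained through
        # the Snowboy website; each model can map to its own language.
        computer:
            voice_model_file: ~/models/computer.umdl
            assistant_language: en-US
```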
Context and expectations
First, a bit of context around the current state of the assistant integration (and the state of the available assistant APIs/SDKs in general).
My initial goal was to have a voice assistant that could:
- Continuously listen through an audio device for a specific audio pattern or phrase, and process the subsequent voice requests.
- Support multiple hotword models, so that multiple phrases can trigger a request, and optionally each hotword can be associated with a different assistant language.
- Support conversation start/end actions even without hotword detection (something like “start listening when I press a button or when I get close to a distance sensor”).
- Provide a way to configure a list of custom phrases or patterns (ideally through regular expressions) that, when matched, run a custom pre-configured task, or list of tasks, either on the executing device or on any device connected through it (see the sketch after this list).
- Fall back to the assistant’s default processing when a phrase matches none of the configured patterns (e.g. rely on Google’s standard responses to “how’s the weather?” or “what’s on my calendar?”).
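To make the last two points concrete: in platypush terms, each pattern becomes an event hook on the speech-recognized event. The sketch below shows the general shape; the hook name is arbitrary and `music.mpd.play` is just an example action (any action exposed by any platypush plugin would work here):

```yaml
# Sketch: run a custom action whenever the recognized phrase matches.
# ${} tokens can be used in the phrase to extract parameters and pass
# them down to the action. If no hook matches, the request falls
# through to the assistant's default response logic.
event.hook.PlayMusicOnAssistantCommand:
    if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        phrase: "play the music"
    then:
        action: music.mpd.play
```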
Basically, I needed an assistant SDK or API that could easily be wrapped into a small library or module: one that could listen for hotwords, start and stop conversations programmatically, and hand any recognized speech straight back to my business logic.
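As for starting conversations programmatically: any event can act as a trigger, not just a hotword. As a sketch, assuming the Flic button integration is enabled (the button address below is a placeholder), a hook like this summons the assistant on a button press:

```yaml
# Sketch: start a conversation on a button press instead of a hotword.
# Event fields and the button address are illustrative placeholders.
event.hook.StartConversationOnButtonPress:
    if:
        type: platypush.message.event.button.flic.FlicButtonEvent
        btn_addr: "00:11:22:33:44:55"
        sequence: ["ShortPressEvent"]
    then:
        action: assistant.google.start_conversation
```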
I eventually decided to develop the integration with the Google Assistant and ignore Alexa because:
- Alexa’s original sample app for developers was a relatively heavy piece of software that relied on a Java backend and a Node.js web service.
- Amazon has since pulled the plug on that original project.
- The sample app has been replaced by the Amazon AVS (Alexa Voice Service), a C++ service aimed mostly at commercial applications that doesn’t provide a decent quickstart for custom Python integrations.
- There are a few Python examples for the Alexa SDK, but they focus on how to develop a skill. I’m not interested in building a skill that runs on Amazon’s servers: I’m interested in detecting hotwords and raw speech on any device, and the SDK should let me do whatever I want with that.