[//]: # (title: Build your customizable voice assistant with Platypush)
[//]: # (description: Use the available integrations to build a voice assistant with a simple microphone)
[//]: # (image: /img/voice-assistant-1.png)
[//]: # (published: 2019-09-05)
My dream of a piece of software that you could simply talk to and get things done started more than 10 years ago, when I
was still a young M.Sc. student who imagined getting common tasks done on my computer through the same kind of natural
interaction you see between Dave and [HAL 9000](https://en.wikipedia.org/wiki/HAL_9000) in
[2001: A Space Odyssey](https://en.wikipedia.org/wiki/2001:_A_Space_Odyssey_(film)). Together with a friend I developed
[Voxifera](https://github.com/BlackLight/Voxifera) way back in 2008. The software worked well enough for basic tasks,
as long as I was the one providing the voice commands and the list of custom commands stayed below 10 items, but in
recent years Google and Amazon have gone way beyond what an M.Sc. student alone could do with fast Fourier transforms
and Markov models.

When years later I started building [Platypush](https://git.platypush.tech/platypush/platypush), I still dreamed of the
same voice interface, leveraging the new technologies while not being caged by the interactions natively provided by
those commercial assistants. My goal was still to talk to my assistant and get it to do whatever I wanted, regardless
of the skills/integrations supported by the product and regardless of whichever answer its AI was intended to provide
for that phrase. Most of all, my goal was to have all the business logic of the actions run on my own device(s), not on
someone else's cloud. I feel like that goal has by now been mostly accomplished (assistant technology with 100%
flexibility when it comes to phrase patterns and custom actions), and today I'd like to show you how to set up your own
Google Assistant on steroids as well, with a Raspberry Pi, a microphone and platypush. I'll also show how to run your
custom hotword detection models through the [Snowboy](https://snowboy.kitt.ai/) integration, for those who want greater
flexibility in how to summon their digital butler besides the boring “Ok Google” formula, or who aren't that happy with
the idea of having Google constantly listen to everything that is said in the room. For those who are unfamiliar with
platypush, I suggest reading
[my previous article](https://blog.platypush.tech/article/Ultimate-self-hosted-automation-with-Platypush) on what it is,
what it can do, why I built it and how to get started with it.

## Context and expectations

First, a bit of context around the current state of the assistant integration (and the state of the available assistant APIs/SDKs in general).

My initial goal was to have a voice assistant that could:

1. Continuously listen through an audio device for a specific audio pattern or phrase and process the subsequent voice
requests.
2. Support multiple models for the hotword, so that multiple phrases could be used to trigger a request process, and
optionally one could even associate a different assistant language to each hotword.
3. Support conversation start/end actions even without hotword detection — something like “start listening when I press
a button or when I get close to a distance sensor”.
4. Provide the possibility to configure a list of custom phrases or patterns (ideally
through [regular expressions](https://en.wikipedia.org/wiki/Regular_expression)) that, when matched, would run a
custom pre-configured task or list of tasks on the executing device, or on any device connected through it (see the
configuration sketch right after this list).
5. If a phrase doesn't match any of those pre-configured patterns, then the assistant would go on and process the
request in the default way (e.g. rely on Google's “how's the weather?” or “what's on my calendar?” standard response).
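
To make point 4 a bit more concrete, here is a minimal sketch of the kind of configuration I had in mind, in
platypush's YAML event hook format. The hook name, the trigger phrase and the `light.hue.on` action are purely
illustrative placeholders; any other plugin action (or list of actions) could be wired up the same way:

```yaml
# Snippet of ~/.config/platypush/config.yaml
event.hook.TurnOnLightsOnCommand:
    if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        phrase: "turn on the lights"
    then:
        action: light.hue.on
        args:
            groups:
                - Living Room
```
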
Basically, I needed an assistant SDK or API that could easily be wrapped into a library or tiny module: something that
could listen for hotwords, start and stop conversations programmatically, and return the detected phrase directly to my
business logic whenever speech was recognized.
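
As a rough sketch of what point 3 above looks like once the SDK is wrapped that way: a hook can start a conversation
whenever any other event fires, with the assistant's `start_conversation` action taking the place of the hotword. The
trigger event type below is just a placeholder for whatever button or sensor integration you actually use:

```yaml
event.hook.StartConversationOnButtonPress:
    if:
        # Placeholder: replace with the event emitted by your button,
        # GPIO or distance sensor integration of choice
        type: platypush.message.event.sensor.SensorDataChangeEvent
    then:
        action: assistant.google.start_conversation
```
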
I eventually decided to develop the integration with the Google Assistant and ignore Alexa because:
- Alexa's original [sample app](https://github.com/alexa/alexa-avs-sample-app.git) for developers was a relatively heavy
piece of software that relied on a Java backend and a Node.js web service.
- In the meantime, Amazon has pulled the plug on that original project.
- The sample app has been replaced by the [Amazon AVS (Alexa Voice Service)](https://github.com/alexa/avs-device-sdk),
which is a C++ service mostly aimed at commercial applications and doesn't provide a decent quickstart for custom
Python integrations.
- There are [a few Python examples for the Alexa SDK](https://developer.amazon.com/en-US/alexa/alexa-skills-kit/alexa-skill-python-tutorial#sample-python-projects),
but they focus on how to develop a skill. I'm not interested in building a skill that runs on Amazon's servers — I'm
interested in detecting hotwords and raw speech on any device, and the SDK should let me do whatever I want with that.