[//]: # (title: How to build your personal infrastructure for data collection and visualization)
[//]: # (description: Use Grafana, PostgreSQL, Mosquitto and Platypush to collect data points on your infrastructure and be the real owner of your own data.)
[//]: # (image: /img/data-visualization-1.png)
[//]: # (published: 2019-10-16)
A smart home can generate and collect data. Lots of data. And there are currently a few outstanding issues with home-generated data:
- **Fragmentation**. You probably have your home weather station, your own motion detectors, security cameras, gas and smoke detectors, body sensors, GPS and fitness trackers and smart plugs around. It’s quite likely that most of these devices generate data, that such data will in most cases only be accessible through a proprietary app or web service, and that any integration with other services, or any room for tinkering and automation, will mostly depend on the benevolence of the developer or business in building third-party interfaces for such data. In this article, we’ll explore how, thanks to open source solutions like Platypush, Grafana and Mosquitto, it’s possible to overcome the fragmentation issue and “glue” together data sources that wouldn’t otherwise be able to communicate or share data.
- **Ability to query**. Most of the hardware and data geeks out there won’t settle for accessing their data through a gauge in an app or a timeline graph. Many of us want to explore our own generated data in a structured way, preferably through SQL or any other query language, and we demand tailor-made dashboards, not dumb mobile apps. The ability to generate custom monthly reports of our fitness activities, query the countries we’ve visited in a certain time range, how much time we spent indoors in the past three months, or how many times the smoke detectors in our guest room went above threshold in the past week is priceless for many of us, and often neglected by hardware and software makers. In this article we’ll explore how to leverage an open-source relational database (PostgreSQL in this example) and some elementary data pipelines to dispatch and store your data on your private computers, ready to be queried or visualized however you like.
- **Privacy**. Many of the solutions or services I’ve mentioned in the previous examples come with their own cloud-based infrastructure to store user data. While storing your data on somebody else’s computers saves you the time and disk space required to invest in your local solution, it also comes with all the concerns related to, ehm, storing your data on somebody else’s computer. That somebody else can decide if and how you can access your data, can decide to sell your data for profit, or can be hacked in one way or another. This can be especially worrisome if we’re talking about data on your own body, location or house environment. A house-hosted data infrastructure bypasses the issue of third-party ownership of your data.
This article will analyze the building blocks to set up your data infrastructure and build automation on it. We’ll see how to set up data collection and monitoring for a few use cases (temperature, humidity, gas, phone location and fit data) and how to build automation triggers based on such data.
## Dependencies setup
First, you’ll need a RaspberryPi (or any similar clone) with Platypush. I assume that you’ve already got Platypush installed and configured. If not, please head to my previous article on [getting started with Platypush](https://blog.platypush.tech/article/Ultimate-self-hosted-automation-with-Platypush).
You’ll also need a relational database installed on your device. The example in this article will rely on PostgreSQL, but any relational database will do the job. Install and configure PostgreSQL on Raspbian and create a database named `sensors` before moving on.
We’ll use the database to store the following information:
- System metrics
- Sensors data
- Smartphone and location data
- Fit data
You’ll also need a message queue broker running on your RaspberryPi to dispatch messages with new data reads — check [this Instructables tutorial](https://www.instructables.com/id/Installing-MQTT-BrokerMosquitto-on-Raspberry-Pi/) on how to get Mosquitto up and running on your RaspberryPi.
For some of the data measurements, we’ll also need an MQTT client to test messages over the configured queue — for example, to send measurements from a shell script. I like to use mqttcli for these purposes — it’s fast, lightweight and written in Go:
```shell
go get github.com/shirou/mqttcli
```
Finally, install Grafana as a web-based gateway to visualize your data:
```shell
[sudo] apt-get install grafana
[sudo] systemctl restart grafana
```
After starting the service, head to `http://your-pi:3000` and make sure that you see the Grafana splash screen. Create a new admin user and you’re good to go for now.
Now that you’ve got all the fundamental pieces in place, it’s time to set up your data collection pipeline and dashboard. Let’s start by setting up the tables and data storage logic on your database.
## Database configuration
If you followed the instructions above then you’ll have a PostgreSQL instance running on your RaspberryPi, accessible through the user `pi`, and a `sensors` database created for the purpose. In this section, I’ll explain how to create the basic tables and the triggers to normalize the data. Keep in mind that your measurement tables might become quite large, depending on how much data you process and how often you process it. To keep the database size under control and the queries efficient, it’s important to use normalized table structures enforced by triggers. I’ve prepared the following provisioning script for my purposes:
```sql
-- Temporary sensors table where we store the raw
-- measurements as received on the message queue
drop table if exists tmp_sensors cascade;
create table tmp_sensors(
    id serial not null,
    host varchar(64) not null,
    metric varchar(255) not null,
    data double precision,
    created_at timestamp with time zone default CURRENT_TIMESTAMP,
    primary key(id)
);

-- Table to store the hosts associated to the data points
drop table if exists sensor_host cascade;
create table sensor_host(
    id serial not null,
    host varchar(64) unique not null,
    primary key(id)
);

-- Table to store the metrics
drop table if exists sensor_metric cascade;
create table sensor_metric(
    id serial not null,
    metric varchar(255) unique not null,
    primary key(id)
);

-- Table to store the normalized data points
drop table if exists sensor_data cascade;
create table sensor_data(
    id serial not null,
    host_id integer not null,
    metric_id integer not null,
    data double precision,
    created_at timestamp with time zone default CURRENT_TIMESTAMP,
    primary key(id),
    foreign key(host_id) references sensor_host(id),
    foreign key(metric_id) references sensor_metric(id)
);

-- Define a stored procedure that normalizes new rows on tmp_sensors
-- by either creating or returning the associated host_id and metric_id,
-- creating a normalized representation of the row on sensor_data and
-- deleting the original raw entry on tmp_sensors.
create or replace function sync_sensors_data()
    returns trigger as
$$
begin
    insert into sensor_host(host) values(new.host)
        on conflict do nothing;

    insert into sensor_metric(metric) values(new.metric)
        on conflict do nothing;

    insert into sensor_data(host_id, metric_id, data) values(
        (select id from sensor_host where host = new.host),
        (select id from sensor_metric where metric = new.metric),
        new.data
    );

    delete from tmp_sensors where id = new.id;
    return new;
end;
$$
language 'plpgsql';

-- Create a trigger that invokes the stored procedure defined above
-- after a row is inserted on tmp_sensors
drop trigger if exists on_sensor_data_insert on tmp_sensors;
create trigger on_sensor_data_insert
    after insert on tmp_sensors
    for each row
    execute procedure sync_sensors_data();

create view public.vsensors AS
    select d.id AS data_id,
           h.host,
           m.metric,
           d.data,
           d.created_at
    from ((public.sensor_data d
        join public.sensor_host h ON ((d.host_id = h.id)))
        join public.sensor_metric m ON ((d.metric_id = m.id)));
```
The script above will keep the data on your database normalized and query-friendly even if the messages pushed on the message queue don’t know the right numeric `host_id` or `metric_id`. Run it against your PostgreSQL instance:
```shell
psql -U pi -d sensors < database_provisioning.sql
```
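If you want to see the normalization flow end to end before touching your PostgreSQL instance, here is a minimal sketch that reproduces the same pattern on an in-memory SQLite database through Python’s standard `sqlite3` module. It is only an illustration: SQLite’s trigger syntax differs from PL/pgSQL (`insert or ignore` stands in for `on conflict do nothing`, and the trigger body is written inline), the column types are simplified, and the sample hosts and values are made up.

```python
import sqlite3

# Simplified SQLite port of the provisioning script (illustration only)
SCHEMA = """
create table tmp_sensors(
    id integer primary key autoincrement,
    host text not null,
    metric text not null,
    data real
);
create table sensor_host(
    id integer primary key autoincrement,
    host text unique not null
);
create table sensor_metric(
    id integer primary key autoincrement,
    metric text unique not null
);
create table sensor_data(
    id integer primary key autoincrement,
    host_id integer not null references sensor_host(id),
    metric_id integer not null references sensor_metric(id),
    data real,
    created_at text default CURRENT_TIMESTAMP
);
create trigger on_sensor_data_insert
after insert on tmp_sensors
for each row
begin
    insert or ignore into sensor_host(host) values (new.host);
    insert or ignore into sensor_metric(metric) values (new.metric);
    insert into sensor_data(host_id, metric_id, data) values (
        (select id from sensor_host where host = new.host),
        (select id from sensor_metric where metric = new.metric),
        new.data
    );
    delete from tmp_sensors where id = new.id;
end;
"""


def demo():
    """Insert a few raw rows and return the row count of each table."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)

    # Made-up raw measurements, shaped like the messages on the queue
    rows = [
        ("host1", "memory", 42.5),
        ("host1", "disk_root", 71.0),
        ("host2", "memory", 30.1),
    ]
    db.executemany(
        "insert into tmp_sensors(host, metric, data) values (?, ?, ?)", rows
    )
    db.commit()

    tables = ("tmp_sensors", "sensor_host", "sensor_metric", "sensor_data")
    counts = {
        table: db.execute(f"select count(*) from {table}").fetchone()[0]
        for table in tables
    }
    db.close()
    return counts
```

Calling `demo()` returns the number of rows left in each table: the raw table should end up empty, while hosts and metrics are stored once each, however many measurements reference them.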
Now that you’ve got the tables ready it’s time to fill them with data. We’ll see a few examples of metrics collection, starting with system metrics.
## System metrics
You may want to monitor the CPU, RAM or disk usage of your own RaspberryPi or any other host or virtual server you’ve got around, set up a dashboard to keep an eye on those metrics, or set up alerts in case something goes out of control.
First, create a script that checks the memory available on your system and sends the percentage of used memory on a message queue channel. We’ll store this script under `~/bin/send_mem_stats.sh` for the purposes of this tutorial.
You can extend this pattern to any sensor data you want to send over the queue.
Once scheduled (for example through cron), these jobs will start pushing data to your message queue at regular intervals, on the configured topics (for example `sensors/<hostname>/memory` for memory usage and `sensors/<hostname>/disk_root` for the usage of the root partition).
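As a rough illustration of what such a job computes, here is a Python sketch of the memory check, assuming a Linux host that exposes `MemTotal` and `MemAvailable` in `/proc/meminfo`. The function names are hypothetical, and the actual publishing step is left to whichever MQTT client you use.

```python
import socket


def memory_used_percent(meminfo_text):
    """Return the percentage of used memory, given the content of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # Values are reported in kB; the unit cancels out in the ratio
            fields[key.strip()] = int(parts[0])

    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return round(100.0 * (total - available) / total, 2)


def build_message(metric, value, hostname=None):
    """Return the (topic, payload) pair to publish for a measurement."""
    hostname = hostname or socket.gethostname()
    return f"sensors/{hostname}/{metric}", str(value)
```

Feeding `memory_used_percent` the content of `/proc/meminfo` and passing the result to `build_message("memory", ...)` gives you the topic and payload to hand over to your MQTT client. The same two steps apply to disk usage or any other metric.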
It’s now time to set up Platypush to listen on those channels and, whenever a new message comes in, store it in the database you have provisioned. Add the following configuration to your `~/.config/platypush/config.yaml` file:
```yaml
# Enable the MQTT backend
backend.mqtt:
    host: your-mqtt-server
    port: 1883

    # Configure platypush to listen for new messages on these topics
    listeners:
        - host: your-mqtt-server
          topics:
              - sensors/host1/disk_root
              - sensors/host2/disk_root
              - sensors/host1/memory
              - sensors/host2/memory
```
And create an event hook (e.g. under `~/.config/platypush/scripts/mqtt.py`) that stores the messages
received on some specified channels to your database:
```python
from platypush.event.hook import hook
from platypush.utils import run
from platypush.message.event.mqtt import MQTTMessageEvent
```

Start Platypush, and if everything went smoothly you’ll soon see your `sensor_data` table getting populated with memory and
disk usage stats.
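The core of such a hook is turning a topic like `sensors/host1/memory` and its payload into a row for `tmp_sensors`. Here is a minimal sketch of that step, under a few assumptions: topics follow the `sensors/<host>/<metric>` convention used above, payloads are plain numbers, the connection string is a placeholder, and the `db.insert` action with its `engine`, `table` and `records` arguments reflects my reading of the Platypush `db` plugin, so check it against the documentation for your version. The helper names are hypothetical, and the action runner is passed in as a parameter so the logic can be tried without a running Platypush instance.

```python
# Placeholder connection string: adjust user, password and driver to your setup
DB_ENGINE = "postgresql+pg8000://pi:your-password@localhost/sensors"


def parse_sensor_message(topic, payload):
    """Map a sensors/<host>/<metric> message to a tmp_sensors row, or None."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "sensors":
        return None

    try:
        value = float(payload)
    except (TypeError, ValueError):
        # Ignore payloads that are not plain numbers
        return None

    _, host, metric = parts
    return {"host": host, "metric": metric, "data": value}


def store_sensor_message(topic, payload, run_action, engine=DB_ENGINE):
    """Insert the message in tmp_sensors through the given action runner.

    Returns True if a record was stored, False if the message was skipped.
    """
    record = parse_sensor_message(topic, payload)
    if record is None:
        return False

    # Assumed action name and arguments of the Platypush db plugin
    run_action("db.insert", engine=engine, table="tmp_sensors", records=[record])
    return True
```

Inside the hook script, a function decorated with `@hook(MQTTMessageEvent)` could then call `store_sensor_message(event.topic, event.msg, run)`, assuming the event exposes the topic and the payload through those two attributes.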
## Sensors data
Commercial weather stations, air quality solutions and presence detectors can be quite expensive, and quite
limited when it comes to opening up their data, but by using the ingredients we’ve talked about so far it’s relatively
easy to set up your network of sensors around the house and get them to collect data on your existing data
infrastructure. Let’s consider for the purposes of this post an example that collects temperature and humidity
measurements from some sensors around the house. You’ve got three main options when it comes to setting up sensors on
a RaspberryPi:
- *Option 1*: Use a microcontroller with analog inputs (like Arduino or ESP8266) connected to your RaspberryPi over USB and
configure Platypush to read analog measurements over the serial port. The RaspberryPi is an amazing piece of technology
but it doesn’t come with a native ADC. That means that many simple analog sensors available on the market,
which map different environment values to different voltage values, won’t work on a RaspberryPi unless you put a device
in between that can actually read the analog measurements and push them to the RaspberryPi over a serial interface. For
my purposes I often use Arduino Nano clones, as they’re usually quite cheap, but any device that can communicate over
a USB/serial port should do the job. You can find cheap but accurate temperature and humidity sensors on the internet,
like the [TMP36](https://shop.pimoroni.com/products/temperature-sensor-tmp36), [DHT11](https://learn.adafruit.com/dht)
and [AM2320](https://shop.pimoroni.com/products/digital-temperature-and-humidity-sensor), that can easily be set up to
communicate with your Arduino/ESP* device. All you need is to make sure that your Arduino/ESP* device writes a valid
JSON message to the serial port whenever it performs a new measurement
(e.g. `{"temperature": 21.0, "humidity": 45.0}`), so Platypush can easily tell when the value of a certain measurement changes.
- *Option 2*: Devices like the ESP8266 already come with a Wi-Fi module and can directly send messages over MQTT through
small MicroPython libraries
like [`umqttsimple`](https://raw.githubusercontent.com/RuiSantosdotme/ESP-MicroPython/master/code/MQTT/umqttsimple.py)
(check out [this tutorial](https://randomnerdtutorials.com/micropython-mqtt-esp32-esp8266/) for ESP8266+MQTT setup).
In this case you won’t need a serial connection: the device sends the data from your sensor straight to your MQTT
server.
- *Option 3*: Use a breakout sensor (like
the [BMP280](https://shop.pimoroni.com/products/bmp280-breakout-temperature-pressure-altitude-sensor),
[SHT31](https://shop.pimoroni.com/products/adafruit-sensiron-sht31-d-temperature-humidity-sensor-breakout) or
[HTU21D-F](https://shop.pimoroni.com/products/adafruit-htu21d-f-temperature-humidity-sensor-breakout-board)) that
communicates over I2C/SPI and that you can plug directly into the RaspberryPi. If you go for this solution then you won’t
need another microcontroller to deal with the ADC conversion, but you’ll have to make sure that these devices come
with a Python library and that they’re [supported in Platypush](https://platypush.readthedocs.io/en/latest/) (feel free to
open an issue or send a pull request if that’s not the case).
Let’s briefly analyze an example implementation of option 1. Let’s suppose that you have an Arduino with a
DHT11 temperature and humidity sensor connected on pin 7. You can prepare a sketch like the following to send new
measurements over USB to the RaspberryPi in JSON format:
```c
#include <Arduino.h>
#include <dht.h>

#define DHT11_PIN 7

dht DHT;

void setup() {
    Serial.begin(9600);
}

void loop() {
    int ret = DHT.read11(DHT11_PIN);

    // Skip this cycle if the read failed with an error code below -1
    if (ret < -1) {
        delay(1000);
        return;
    }

    // Print the measurement as a single-line JSON object
    Serial.print("{\"temperature\":");
    Serial.print(DHT.temperature);
    Serial.print(", \"humidity\":");
    Serial.print(DHT.humidity);
    Serial.println("}");
    delay(1000);
}
```
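On the RaspberryPi side, each line printed by the sketch is a standalone JSON document. Platypush takes care of reading it through the serial plugin, so the snippet below is only a hedged illustration of the decoding step, handy if you want to check the output of your sketch by hand. `parse_reading` is a hypothetical helper, not part of Platypush.

```python
import json


def parse_reading(line, enabled=("temperature", "humidity")):
    """Decode one line read from the serial port.

    Returns a dict with the enabled numeric metrics found in the line,
    or None if the line is not a valid JSON object or has no usable values.
    """
    try:
        message = json.loads(line)
    except (TypeError, ValueError):
        # Truncated or garbled lines are common right after opening the port
        return None

    if not isinstance(message, dict):
        return None

    reading = {}
    for key in enabled:
        value = message.get(key)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            reading[key] = float(value)

    return reading or None
```

Discarding invalid lines instead of raising keeps a reader loop alive when the first bytes after a reset arrive incomplete.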
Install the Platypush serial plugin dependencies:
```shell
[sudo] pip install 'platypush[serial]'
```
Then you can add the following lines to the `~/.config/platypush/config.yaml` file of the RaspberryPi that has the
sensors connected, in order to forward new measurements to the message queue and store them on your local database. The
example also shows how to tweak the polling period, tolerance and thresholds:
```yaml
# Enable the serial plugin and specify
# the path to your Arduino/Esp* device
serial:
    device: /dev/ttyUSB0

# Enable the serial sensor backend to
# listen for changes in the metrics
backend.sensor.serial:
    # How often we should poll for new data
    poll_seconds: 5.0

    # Which sensors should be enabled. These are
    # the keys in the JSON you'll be sending over serial
    enabled_sensors:
        - temperature
        - humidity

    # Specify the tolerance for the metrics. A new
    # measurement event will be triggered only if
    # the absolute value difference between the value in
    # the latest event and the value in the current
    # measurement is higher than these thresholds.
    # If no tolerance value is set for a specific metric
    # then new events will be triggered whenever we've
    # got new values, as long as they're different from
    # the previous, no matter the difference.
    tolerance:
        temperature: 0.25
        humidity: 0.5

    # Specify optional thresholds for the metrics. A new
    # sensor above/below threshold event will be triggered
    # when the value of that metric goes above/below the
    # configured threshold.
    thresholds:
        humidity: 70.0

        # You can also specify multiple thresholds values for a metric
        temperature:
            - 20.0
            - 25.0
            - 30.0
```
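The tolerance and threshold semantics described in the comments above can be sketched in a few lines of plain Python. This is not Platypush’s actual implementation, only my reading of the documented behaviour: the function names are hypothetical, and the handling of edge cases (such as a value sitting exactly on a threshold) is an assumption.

```python
def changed_metrics(previous, current, tolerance=None):
    """Return the metrics whose value moved by more than their tolerance.

    Metrics without a configured tolerance are reported whenever their
    value differs from the previous one.
    """
    tolerance = tolerance or {}
    changes = {}
    for metric, value in current.items():
        last = previous.get(metric)
        if last is None or abs(value - last) > tolerance.get(metric, 0.0):
            changes[metric] = value
    return changes


def threshold_crossings(previous, current, thresholds):
    """Return ('above' | 'below', threshold) pairs crossed between two reads."""
    if isinstance(thresholds, (int, float)):
        # A single threshold can be given as a bare number
        thresholds = [thresholds]

    events = []
    for threshold in thresholds:
        if previous <= threshold < current:
            events.append(("above", threshold))
        elif current < threshold <= previous:
            events.append(("below", threshold))
    return events
```

With the configuration above, a temperature moving from 21.0 to 21.2 stays within the 0.25 tolerance and produces no change, while a jump from 19.5 to 26.0 crosses both the 20.0 and the 25.0 thresholds in a single read.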
[`backend.sensor.serial`](https://platypush.readthedocs.io/en/latest/platypush/backend/sensor.serial.html) (and, in
general, any sensor backend) will trigger
a [`SensorDataChangeEvent`](https://platypush.readthedocs.io/en/latest/platypush/events/sensor.html#platypush.message.event.sensor.SensorDataChangeEvent)