diff --git a/markdown/Building-a-better-digital-reading-experience.md b/markdown/Building-a-better-digital-reading-experience.md new file mode 100644 index 0000000..ca5a2a7 --- /dev/null +++ b/markdown/Building-a-better-digital-reading-experience.md @@ -0,0 +1,759 @@ +[//]: # (title: Building a better digital reading experience) +[//]: # (description: Bypass client-side restrictions on news and blog articles, archive them and read them on any offline reader) +[//]: # (image: https://s3.platypush.tech/static/images/reading-experience.jpg) +[//]: # (author: Fabio Manganiello <fabio@manganiello.tech>) +[//]: # (published: 2025-06-05) + +I've always been an avid book reader as a kid. + +I liked the smell of the paper, the feeling of turning the pages, and the +ability to read them anywhere I wanted, as well as lend them to friends and +later share our reading experiences. + +As I grew and chose a career in tech and a digital-savvy lifestyle, I started +to shift my consumption from the paper to the screen. But I *still* wanted the +same feeling of a paper book, the same freedom of reading wherever I wanted +without distractions, and without being constantly watched by someone who will +recommend me other products based on what I read or how I read. + +I was an early support of the Amazon Kindle idea, I quickly moved most of my +physical books to the Kindle, I became a vocal supported of online magazines +that also provided Kindle subscriptions, and I started to read more and more on +e-ink devices. + +Then I noticed that, after an initial spike, not many magazines and blogs +provided Kindle subscriptions or EPub versions of their articles. + +So nevermind - I started tinkering my way out of it and [wrote an article in +2019](https://blog.platypush.tech/article/Deliver-articles-to-your-favourite-e-reader-using-Platypush) +on how to use [Platypush](https://platypush.tech) with its +[`rss`](https://docs.platypush.tech/platypush/plugins/rss.html), +[`instapaper`](https://docs.platypush.tech/platypush/plugins/instapaper.html) and +[`gmail`](https://docs.platypush.tech/platypush/plugins/google.mail.html) +plugins to subscribe to RSS feeds, parse new articles, convert them to PDF and +deliver them to my Kindle. + +Later I moved from Kindle to the first version of the +[Mobiscribe](https://www.mobiscribe.com), as Amazon started to be more and more +restrictive in its option to import and export stuff out of the Kindle. Using +Calibre and some DRM removal tools to export articles or books I had regularly +purchased was gradually getting more cumbersome and error-prone, and the +Mobiscribe at that time was an interesting option because it offered a decent +e-ink device, for a decent price, and it ran Android (an ancient version, but +at least one that was sufficient to run [Instapaper](https://instapaper.com) +and [KOReader](https://koreader.rocks)). + +That simplified things a bit because I didn't need intermediary delivery via +email to get stuff on my Kindle or Calibre to try and pull things out of it. I +was using Instapaper on all of my devices, included the Mobiscribe, I could +easily scrape and push articles to it through Platypush, and I could easily +keep track of my reading state across multiple devices. + +Good things aren't supposed to last though. + +Instapaper started to feel quite limited in its capabilities, and I didn't like +the idea of a centralized server holding all of my saved articles. So I've +moved to a self-hosted [Wallabag](https://wallabag.org) instance in the +meantime - which isn't perfect, but provides a lot more customization and +control. + +Moreover, more and more sites started implementing client-side restrictions for +my scrapers - Instapaper was initially more affected, as it was much easier for +publisher's websites to detect scraping requests coming from the same subnet, +but slowly Wallabag too started bumping into Cloudflare screens, CAPTCHAs and +paywalls. + +So the Internet Archive provided some temporary relief - I could still archive +articles there, and then instruct my Wallabag instance to read them from the +archived link. + +Except that, in the past few months, the Internet Archive has also started +implementing anti-scraping features, and you'll most likely get a Cloudflare +screen if you try and access an article from an external scraper. + +## A little ethical note before continuing + +_Feel free to skip this part and go to the technical setup section if you +already agree that, if buying isn't owning, then piracy isn't stealing._ + +I _do not_ condone nor support piracy. + +I mean, sometimes I do, but being a creator myself I always try to make sure +that, if piracy is the only way to freely access content wherever I want, then +creators are not being harmed. + +I don't mind however harming any intermediaries that add friction to the +process just to have a piece of the pie, stubbornly rely on unsustainable +business models that sacrifices both the revenue of the authors and the privacy +and freedom of the readers, and prevent me from having a raw file that I can +download and read wherever I want though. It's because of those folks that the +digital reading experience, despite all the initial promises, has become much +worse than reading physical books and magazines. So I don't see a big moral +conundrum in pirating to harm those folks and get back my basic freedoms as a +reader. + +But I do support creators via Patreon. I pay for subscriptions to digital +magazines that I will anyway never read through their official mobile app. +Every now and then I buy physical books and magazines that I've already read +and that I've really enjoyed, to support the authors, just like I still buy +some vinyls of albums I really love even though I could just stream them. And I +send one-off donations when I find that some content was particularly useful to +me. And I'd probably support content creators even more if only they allowed me +to pay only for the digital content I want to read, if only there was a viable +digital business model also for the occasional reader, instead of everybody +trying to lock me into a Hotel California subscription ("_you can check out any +time you like, but you can never leave_") just because their business managers +are those folks who have learned how to use the hammer of the recurring +revenue, and think that every problem in the world is a subscription nail to be +hit on its head. + +I also think that the current business model that runs most of the high-quality +content available online (locking people into apps and subscriptions in order +to view the content) is detrimental for the distribution of knowledge in what's +supposed to be the age of information. If I want to be exposed to diverse +opinions on what's going on in different industries or different parts of the +world, I'd probably need at least a dozen subscriptions, while in earlier +generations folks could just walk into their local library or buy a single +issue of a newspaper every now and then. + +If we have no digital alternatives for such simple and established ways to +access and spread knowledge, then piracy is almost a civic duty. It can't be +that high quality reports or insightful blog articles are locked behind +paywalls, subscriptions and apps, and all that's left for free is cheap +disinformation on social media. Future historians will have a very hard time +deciphering what was going on in the world in the 2020s, because most of the +content that was available online is now locked behind paywalls, the companies +that ran those sites and built the apps may be long gone, and if publishers +keep waging war against folks like the Internet Archive, then they may start +looking at our age like some kind of strange digital dark age. + +I also think that it's my right, as a reader, to be able to consume content on +a medium without distractions - like social media buttons, ads, comments, or +other stuff that distracts me from the main content, and if the publisher +doesn't provide me with a solution for that, and I have already paid for the +content, then I should be able to build a solution myself. Even in an age where +attention is the new currency, at least we should not try to grab people's +attention when they're trying to read some dense content - that's just common +sense. + +And I also demand the right to access the content I've paid for however I want. + +Do I want to export everything to Markdown or read it in ASCII art in a +terminal? Do I want to export it to EPub so I can read it on my e-ink device? +Do I want to export it to PDF and email it to one of my students for a research +project, or to myself for later reference? Do I want to access it without +having to use their tracker-ridden mobile app, or without being forced to see +ads despite having paid for a subscription? Well, that's my business. I firmly +believe that it's not an author's or publisher's right to dictate how I access +the content after paying for it. Just like in earlier days nobody minded if, +after purchasing a book, I would share it with my kids, or lend it to a friend, +or scan it and read it on my computer, or make the copies of a few pages to +bring to my students or my colleagues for a project, or leave it on a bench at +the park or in a public bookshelf after reading it. + +If some freedoms were legally granted to me before, and now they've been taken +away, then it's not piracy if I keep demanding those freedoms. + +And content ownership is another problem. I'll no longer be able to access +content I've read during my subscription period once my subscription expires. +I'll not be able to pass on the books or magazines I've read in my lifetime to +my kid. I'll never be able to lend it to someone else, just like I would leave +a book I had read on a public bookshelf or a bench at the park for someone +else to read it. + +In other words, buying now grants you a temporary license to access the content +on someone else's device - you don't really own anything. + +So, if buying isn't owning, piracy isn't stealing. + +And again, to make it very clear, I'll be referring to *personal use* in this +article. The case where you support creators through other means, but the +distribution channel and the business models are the problem, and you just +want your basic freedoms as a content consumer back. + +If however you want to share scraped articles on the Web, or even worse profit +from access to it, then you're *really* doing the kind of piracy I can't +condone. + +With this out of the way, let's get our hands dirty. + +## The setup + +A high-level overview of the setup is as follows: + +<img alt="High-level overview of the scraper setup" src="https://s3.platypush.tech/static/images/wallabag-scraper-architecture.png" width="650px"> + +Let's break down the building blocks of this setup: + +- **[Redirector](https://addons.mozilla.org/en-US/firefox/addon/redirector/)** + is a browser extension that allows you to redirect URLs based on custom + rules as soon as the page is loaded. This is useful to redirect paywalled + resources to the Internet Archive, which usually stores full copies of the + content. Even if you regularly paid for a subscription to a magazine, and you + can read the article on the publisher's site or from their app, your Wallabag + scraper will still be blocked if the site implements client-side restrictions + or is protected by Cloudflare. So you need to redirect the URL to the Internet + Archive, which will then return a copy of the article that you can scrape. + +- **[Platypush](https://platypush.tech)** is a Python-based general-purpose + platform for automation that I've devoted a good chunk of the past decade + to develop. It allows you to run actions, react to events and control devices + and services through a unified API and Web interface, and it comes with + [hundreds of supported integrations](https://docs.platypush.tech). We'll use + the [`wallabag`](https://docs.platypush.tech/platypush/plugins/wallabag.html) + plugin to push articles to your Wallabag instance, and optionally the + [`rss`](https://docs.platypush.tech/platypush/plugins/rss.html) plugin if you + want to programmatically subscribe to RSS feeds, scrape articles and archive + them to Wallabag, and the + [`ntfy`](https://docs.platypush.tech/platypush/plugins/ntfy.html) plugin to + optionally send notifications to your mobile device when new articles are + available. + +- **[Platypush Web extension](https://addons.mozilla.org/en-US/firefox/addon/platypush/)** + is a browser extension that allows you to interact with Platypush from your + browser, and it also provides a powerful JavaScript API that you can leverage + to manipulate the DOM and automate tasks in the browser. It's like a + [Greasemonkey](https://addons.mozilla.org/en-US/firefox/addon/greasemonkey/) + or [Tampermonkey](https://addons.mozilla.org/en-US/firefox/addon/tampermonkey/) + extension that allows you to write scripts to customize your browser + experience, but it also allows you to interact with Platypush and leverage + its backend capabilities. On top of that, I've also added built-in support + for the [Mercury Parser API](https://github.com/usr42/mercury-parser) in it, + so you can easily distill articles - similar to what Firefox does with its + [Reader + Mode](https://support.mozilla.org/en-US/kb/firefox-reader-view-clutter-free-web-pages), + but in this case you can customize the layout and modify the original DOM + directly, and the distilled content can easily be dispatched to any other + service or application. We'll use it to: + + - Distill the article content from the page, removing all the + unnecessary elements (ads, comments, etc.) and leaving only the main text + and images. + + - Archive the distilled article to Wallabag, so you can read it later + from any device that has access to your Wallabag instance. + +- **[Wallabag](https://wallabag.org)** is a self-hosted read-it-later + service that allows you to save articles from the Web and read them later, + even offline. It resembles the features of the ([recently + defunct](https://support.mozilla.org/en-US/kb/future-of-pocket)) + [Pocket](https://getpocket.com/home). It provides a Web interface, mobile + apps and browser extensions to access your saved articles, and it can also be + used as a backend for scraping articles from the Web. + +- (_Optional_) **[KOReader](https://koreader.rocks)** is an + open-source e-book reader that runs on a variety of devices, including any + e-ink readers that run Android (and even the + [Remarkable](https://github.com/koreader/koreader/wiki/Installation-on-Remarkable)). + It has a quite minimal interface and it may take a while to get used to, but + it's extremely powerful and customizable. I personally prefer it over the + official Wallabag app - it has a native Wallabag integration, as well as OPDS + integration to synchronize with my + [Ubooquity](https://docs.linuxserver.io/images/docker-ubooquity/) server, + synchronization of highlights and notes to Nextcloud Notes, WebDAV support + (so you can access anything hosted on e.g. your Nextcloud instance), progress + sync across devices through their [sync + server](https://github.com/koreader/koreader-sync-server), and much more. It + basically gives you a single app to access your saved articles, your books, + your notes, your highlights, and your documents. + +- (_Optional_) An Android-based e-book reader to run KOReader on. I have + recently switched from my old Mobiscribe to an [Onyx BOOX Note Air + 4](https://www.onyxbooxusa.com/onyx-boox-note-air4-c) and I love it. It's + powerful, the display is great, it runs basically any Android app out there + (and I've had no issues with running any apps installed through + [F-Droid](https://f-droid.org)), and it also has a good set of stock apps, + and most of them support WebDAV synchronization - ideal if you have a + [Nextcloud](https://nextcloud.com) instance to store your documents and + archived links. + +**NOTE**: The Platypush extension only works with Firefox, on any Firefox-based +browser, or on any browser out there that still supports the [Manifest +V2](https://blog.mozilla.org/addons/2024/03/13/manifest-v3-manifest-v2-march-2024-update/). +The Manifest V3 has been a disgrace that Google has forced all browser +extension developers to swallow. I won't go in detail here, but the Platypush +extension needs to be able to perform actions (such as calls to custom remote +endpoints and runtime interception of HTTP headers) that are either no longer +supported on Manifest V3, or that are only supported through laborious +workarounds (such as using the declarative Net Request API to explicitly +define what you want to intercept and what remote endpoints you want to call). + +**NOTE 2**: As of June 2025, the Platypush extension is only supported on +Firefox for desktop. A Firefox for Android version [is +work in progress](https://git.platypush.tech/platypush/platypush-webext/issues/1). + +Let's dig deeper into the individual components of this setup. + +## Redirector + + + +This is a nice addition if you want to automatically view some links through +the Internet Archive rather than the original site. + +You can install it from the [Firefox Add-ons site](https://addons.mozilla.org/en-US/firefox/addon/redirector/). +Once installed, you can create a bunch of rules (regular expressions are supported) +to redirect URLs from paywalled domains that you visit often to the Internet Archive. + +For example, this regular expression: + +``` +^(https://([\w-]+).substack.com/p/.*) +``` + +will match any Substack article URL, and you can redirect it to the Internet Archive +through this URL: + +``` +https://archive.is/$1 +``` + +Next time you open a Substack article, it will be automatically redirected to its +most recent archived version - or it will prompt you to archive the URL if it's not +been archived yet. + +## Wallabag + + + +Wallabag can easily be installed on any server [through Docker](https://doc.wallabag.org/developer/docker/). + +Follow the documentation for the set up of your user and create an API token +from the Web interface. + +It's also advised to [set up a reverse +proxy](https://doc.wallabag.org/admin/installation/virtualhosts/#configuration-on-nginx) +in front of Wallabag, so you can easily access it over HTTPS. + +Once configured the reverse proxy, you can generate a certificate for it - for +example, if you use [`certbot`](https://certbot.eff.org/) and `nginx`: + +```bash +❯ certbot --nginx -d your-domain.com +``` + +Then you can access your Wallabag instance at `https://your-domain.com` and log +in with the user you created. + +Bonus: I personally find the Web interface of Wallabag quite ugly - the +fluorescent light blue headers are distracting and the default font and column +width isn't ideal for my taste. So I made a [Greasemonkey/Tampermonkey +script](https://gist.manganiello.tech/fabio/ec9e28170988441d9a091b3fa6535038) +to make it better if you want (see screenshot above). + +## [_Optional_] ntfy + +[ntfy](https://ntfy.sh) is a simple HTTP-based pub/sub notification service +that you can use to send notifications to your devices or your browser. It +provides both an [Android app](https://f-droid.org/en/packages/io.heckel.ntfy/) +and a [browser +addon](https://addons.mozilla.org/en-US/firefox/addon/send-to-ntfy/) to send +and receive notifications, allowing you to open saved links directly on your +phone or any other device subscribed to the same topic. + +Running it via docker-compose [is quite +straightforward](https://github.com/binwiederhier/ntfy/blob/main/docker-compose.yml). + +It's also advised to serve it behind a reverse proxy with HTTPS support, +keeping in mind to set the right header for the Websocket paths - example nginx +configuration: + +```nginx +map $http_upgrade $connection_upgrade { + default upgrade; + '' close; +} + +server { + server_name notify.example.com; + + location / { + proxy_pass http://your-internal-ntfy-host:port; + + client_max_body_size 5M; + + proxy_read_timeout 60; + proxy_connect_timeout 60; + proxy_redirect off; + + proxy_set_header Host $http_host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-Ssl on; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + } + + location ~ .*/ws/?$ { + proxy_http_version 1.1; + proxy_set_header Upgrade $http_upgrade; + proxy_set_header Connection $connection_upgrade; + proxy_set_header Host $http_host; + proxy_pass http://your-internal-ntfy-host:port; + } +} +``` + +Once the server is running, you can check the connectivity by opening your +server's main page in your browser. + +**NOTE**: Be _careful_ when choosing your ntfy topic name, especially if you +are using a public instance. ntfy by default doesn't require any authentication +for publishing or subscribing to a topic. So choose a random name (or at least +a random prefix/suffix) for your topics and treat them like a password. + +## Platypush + +Create a new virtual environment and install Platypush through `pip` (the +plugins we'll use in the first part don't require any additional dependencies): + +```bash +❯ python3 -m venv venv +❯ source venv/bin/activate +❯ pip install platypush +``` + +Then create a new configuration file `~/.config/platypush/config.yaml` with the +following configuration: + +```yaml +# Web server configuration +backend.http: + # port: 8008 + +# Wallabag configuration +wallabag: + server_url: https://your-domain.com + client_id: your_client_id + client_secret: your_client_secret + # Your Wallabag user credentials are required for the first login. + # It's also advised to keep them here afterwards so the refresh + # token can be automatically updated. + username: your_username + password: your_password +``` + +Then you can start the service with: + +```bash +❯ platypush +``` + +You can also create a systemd service to run Platypush in the background: + +```bash +❯ mkdir -p ~/.config/systemd/user +❯ cat <<EOF > ~/.config/systemd/user/platypush.service +[Unit] +Description=Platypush service +After=network.target + +[Service] +ExecStart=/path/to/venv/bin/platypush +Restart=always +RestartSec=5 +EOF +❯ systemctl --user daemon-reload +❯ systemctl --user enable --now platypush.service +``` + +After starting the service, head over to `http://your_platypush_host:8008` (or +the port you configured in the `backend.http` section) and create a new user +account. + +It's also advised to serve the Platypush Web server behind a reverse proxy with +HTTPS support if you want it to easily be accessible from the browser extension - +a basic `nginx` configuration [is available on the +repo](https://git.platypush.tech/platypush/platypush/src/branch/master/examples/nginx/nginx.sample.conf). + +## Platypush Web extension + +You can install the Platypush Web extension from the [Firefox Add-ons +site](https://addons.mozilla.org/en-US/firefox/addon/platypush/). + +After installing it, click on the extension popup and add the URL of your +Platypush Web server. + + + +When successfully connected, you should see the device in the main menu, you +can run commands on it and save actions. + +A good place to start familiarizing with the Platypush API is the _Run Action_ +dialog, which allows you to run commands on your server and provides +autocomplete for the available actions, as well as documentation about their +arguments. + + + +The default action mode is _Request_ (i.e. single requests against the API). +You can also pack together more actions on the backend [into +_procedures_](https://docs.platypush.tech/wiki/Quickstart.html#greet-me-with-lights-and-music-when-i-come-home), +which can be written either in the YAML config or as Python scripts (by default +loaded from `~/.config/platypush/scripts`). If correctly configured, procedures +will be available in the _Run Action_ dialog. + +The other mode, which we'll use in this article, is _Script_. In this mode you +can write custom JavaScript code that can interact with your browser. + + + +[Here](https://gist.github.com/BlackLight/d80c571705215924abc06a80994fd5f4) is +a sample script that you can use as a reference for the API exposed by the +extension. Some examples include: + +- `app.run`, to run an action on the Platypush backend + +- `app.getURL`, `app.setURL` and `app.openTab` to get and set the current URL, + or open a new tab with a given URL + +- `app.axios.get`, `app.axios.post` etc. to perform HTTP requests to other + external services through the Axios library + +- `app.getDOM` and `app.setDOM` to get and set the current page DOM + +- `app.mercury.parse` to distill the current page content using the Mercury + Parser API + +### Reader Mode script + +We can put together the building blocks above to create our first script, which +will distill the current page content and swap the current page DOM with the +simplified content - with no ads, comments, or other distracting visual +elements. The full content of the script is available +[here](https://gist.manganiello.tech/fabio/c731b57ff6b24d21a8f43fbedde3dc30). + +This is akin to what Firefox' [Reader +Mode](https://support.mozilla.org/en-US/kb/firefox-reader-view-clutter-free-web-pages) +does, but with much more room for customization. + +Note that for this specific script we don't need any interactions with the +Platypush backend. Everything happens on the client, as the Mercury API is +built into the Platypush Web extension. + +Switch to _Script_ mode in the _Run Action_ dialog, paste the script content +and click on _Save Script_. You can also choose a custom name, icon +([FontAwesome](https://fontawesome.com/icons) icon classes are supported), +color and group for the script. Quite importantly, you can also associate a +keyboard shortcut to it, so you can quickly distill a page without having to +search for the command either in the extension popup or in the context menu. + +### Save to Wallabag script + +Now that we have a script to distill the current page content, we can create +another script to save the distilled content (if available) to Wallabag. +Otherwise, it will just save the original page content. + +The full content of the script is available +[here](https://gist.manganiello.tech/fabio/8f5b08d8fbaa404bafc6fdeaf9b154b4). +The structure is quite straightforward: + +- First, it checks if the page content has already been "distilled" by the + Reader Mode script. If so, it uses the distilled content to save it to + Wallabag. Otherwise, it will use the full page body. + +- It saves the URL to Wallabag. + +- Optionally, it sends a notification over ntfy. + +Again, feel free to assign a keybinding to this action so you can quickly call +it from any page. + +Personally I've picked `Ctrl+Alt+1` for the Reader Mode script and `Ctrl+Alt+2` +for the Save to Wallabag script, so I can quickly distill a page and, if takes +me more time to read it, send the already simplified content to Wallabag. + +If you don't want to create a keybinding, you can always call these actions +either from the extension popup or from the (right click) context menu. + +## [_Optional_] RSS subscriptions and automated delivery + +You now have a way to manually scrape and archive articles from the Web. + +If you are also a regular reader of a publication or a blog that provides RSS +or Atom feeds, you can also automate the process of subscribing to those feeds +and delivering new articles to Wallabag. + +Just keep in mind two things if you want to go down this way: + +1. It's not advised to subscribe to feeds that provide a lot of articles + every day, as this will quickly fill up your Wallabag instance and make it + hard to find the articles you want to read. So stick to feeds that provide + one or a few articles per day, or at least don't provide more than a dozen + articles per day. Or augment the RSS event hook with custom filters to only + include links that match some criteria. + +2. Unlike the manual actions we saw before, the logic to handle automated + subscriptions and content delivery is implemented on the Platypush service + (on the backend). So it may not be as optimal in scraping and distilling + articles as some logic that operates on the client side and can more easily + bypass client-side restrictions. So you may want to pick feeds that don't + implement aggressive paywalls, are behind Cloudflare, or implement other + client-side restrictions. + +If you have some good candidates for automated delivery, follow these steps: + +- Install the [`rss`](https://docs.platypush.tech/platypush/plugins/rss.html) + plugin in your Platypush instance: + +```bash +(venv)> pip install 'platypush[rss]' +``` + +- If you want to use the Mercury Parser API to distill articles (_optional_), + install the dependencies for the + [`http.webpage`](https://docs.platypush.tech/platypush/plugins/http.webpage.html). + The Mercury API is only available in JavaScript, so you'll need to have + `nodejs` and `npm` installed on your system. The Mercury Parser API is optional, + but it's usually more successful than the default Wallabag scraper in distilling + content. And, on top of that, it also makes it easier to customize your + requests. So if you want to scrape content from paywalled websites that + you're subscribed to you can easily pass your credentials or cookies to the + Mercury API (Wallabag doesn't support customizing the scraping requests). + Moreover, the Mercury integration also allows you to export the distilled + content to other formats, such as plain text, HTML, Markdown, or PDF - this + is useful if you want to save content to other services or applications. For + example, I find it quite useful to scrape content from some articles in + Markdown, and then save it to my [Nextcloud + Notes](https://apps.nextcloud.com/apps/notes) or + [Obsidian](https://obsidian.md). + +```bash +# Example for Debian/Ubuntu +❯ [sudo] apt install nodejs npm +# Install Mercury Parser globally +❯ [sudo] npm install -g @postlight/parser +``` + +- Add your subscriptions to the `~/.config/platypush/config.yaml` file: + +```yaml +rss: + subscriptions: + - https://example.com/feed.xml + - https://example.com/atom.xml + +# Optional +# http.webpage +# headers: +# # These headers will be used in all the requests made by the Mercury Parser. +# # You can still override the headers when you call the `http.webpage.simplify` +# # action though. +# User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3" +``` + +- Create an event hook to handle new articles from the RSS feed and + distill them using the Mercury Parser API. You can e.g. create a + `~/.config/platypush/scripts/subscriptions.py` file with the following + content: + +```python +import logging +import urllib.parse + +from platypush import run, when +from platypush.events.rss import NewFeedEntryEvent + +logger = logging.getLogger(__name__) + +# Optional, set the False if you don't want to use the Mercury Parser API +USE_MERCURY_PARSER = True + +# If there are any websites that require specific headers to be passed, +# for example paywalled news sites that you're subscribed to and require +# authentication, you can specify them here. +headers_by_domain = { + 'example.com': { + 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8', + 'Accept-Language': 'en-US,en;q=0.5', + 'Cookie': 'sessionid=your_cookie_value; other_cookie=other_value', + }, +} + + +def get_headers(url: str) -> dict: + """ + Get the headers to use for the request based on the URL. + """ + domain = re.sub(r'^www\.', '', urllib.parse.urlparse(url).netloc) + return headers_by_domain.get(domain, {}) + + +@when(NewFeedEntryEvent) +def scrape_and_save(event: NewFeedEntryEvent, **_): + """ + Scrape and save the new article to Wallabag. + """ + content = None + logger.info( + 'New article available on %s - title: %s, url: %s', + event.feed_url, + event.title, + event.url + ) + + if USE_MERCURY_PARSER: + # Distill the article content using the Mercury Parser API + response = run( + 'http.webpage.simplify', + url=url, + format='html', + headers=get_headers(event.url), + ) + + if not (response and response.get('content')): + logger.warning(f'Failed to distill {url} through Mercury Parser') + else: + content = response['content'] + + # Save the distilled content to Wallabag + run( + 'wallabag.save', + title=event.entry.title, + content=content, + url=url, + ) + + logger.info(f'Saved {url} to Wallabag') +``` + +It is advised to run the Platypush script once _without_ the `@when` hook +above, but with the `rss` plugin configured. + +The reason is that, on the first run, the `rss` plugin will fetch all the +entries in the subscribed feeds and trigger the `NewFeedEntryEvent` for each +of them. That in turn could end up with hundreds of articles pushed +simultaneously to your Wallabag instance, you may not want that. + +The recommended flow instead (which should probably apply also any time you add +new feeds to your subscriptions) is: + +1. Add the feeds to your `rss` plugin configuration. + +2. Restart the Platypush service and let it process all the `NewFeedEntryEvent` + events for the existing articles. + +3. Add the event hook logic to any file under `~/.config/platypush/scripts`. + +4. Restart the service - now only new entries will trigger the events. + +## Conclusions + +In this article we have seen how to set up a self-hosted solution to scrape and +archive articles from the Web, and also how to automate the process through +feed subscriptions. + +This is a powerful way to regain control over your reading experience, hopefully +bringing it one step closer to the one you had with paper books or walks to the +local library. + +Just remember to do so responsibly, only for personal use, and respecting the +rights of content creators and publishers. + +It's fine to get creative and build your own reading experience by bypassing +all the needless friction that has been added as media has moved to the digital +space. + +But always remember to fund authors and creators in other ways, subscribe to +those who produce high-quality content (even if you don't read content from +their mobile app), and try to limit your scraping experience to personal use. diff --git a/markdown/Read-and-archive-everything.md b/markdown/Read-and-archive-everything.md deleted file mode 100644 index bf1c918..0000000 --- a/markdown/Read-and-archive-everything.md +++ /dev/null @@ -1,505 +0,0 @@ -[//]: # (title: Read and archive everything) -[//]: # (description: Bypass client-side restrictions on news and blog articles, archive them and read them wherever you want) -[//]: # (image: /img/twitter2mastodon.png) -[//]: # (author: Fabio Manganiello <fabio@manganiello.tech>) -[//]: # (published: 2025-06-04) - -I've always been an avid book reader as a kid. - -I liked the smell of the paper, the feeling of turning the pages, and the -ability to read anywhere I wanted. - -As I grew and chose a career in tech and a digital-savvy lifestyle, I started -to shift my consumption from the paper to the screen. But I *still* wanted the -same feeling of a paper book, the same freedom of reading wherever I wanted. - -I was an early support of the Amazon Kindle idea, I quickly moved most of my -physical books to the Kindle, I became a vocal supported of online magazines -that also provided Kindle subscriptions, and I started to read more and more on -e-ink devices. - -Then I noticed that, after an initial spike, not many magazines and blogs -provided Kindle subscriptions or EPub versions of their articles. - -So nevermind - I started tinkering my way out of it and [wrote an article in -2019](https://blog.platypush.tech/article/Deliver-articles-to-your-favourite-e-reader-using-Platypush) -on how to use [Platypush](https://platypush.tech) with its [`rss`](https://docs.platypush.tech/platypush/plugins/rss.html), -[`instapaper`](https://docs.platypush.tech/platypush/plugins/instapaper.html) and -[`gmail`](https://docs.platypush.tech/platypush/plugins/google.mail.html) -plugins to subscribe to RSS feeds, parse new articles, convert them to PDF and -deliver them to my Kindle. - -Later I moved from Kindle to the first version of the -[Mobiscribe](https://www.mobiscribe.com), as Amazon started to be more and more -restrictive in its option to import and export stuff out of the Kindle, using -Calibre and some DRM removal tools to export articles or books I had regularly -purchased was becoming more cumbersome, and the Mobiscribe at that time was an -interesting option because it offered a decent e-ink device, for a decent -price, and it ran Android (an ancient version, but at least one that was -sufficient to run [Instapaper](https://instapaper.com) and -[KOReader](https://koreader.rocks)). - -That simplified things a bit because I didn't need intermediary delivery via -email to get stuff on my Kindle or Calibre to try and pull things out of it. I -was using Instapaper on all of my devices, included the Mobiscribe, I could -easily scrape and push articles to it through Platypush, and I could easily -keep track of my reading state across multiple devices. - -Good things aren't supposed to last though. - -Instapaper started to feel quite limited in its capabilities, and I didn't like -the idea of a centralized server holding all of my saved articles. So I've -moved to a self-hosted [Wallabag](https://wallabag.org) instance in the -meantime - which isn't perfect, but provides a lot more customization and -control. - -Moreover, more and more sites started implementing client-side restrictions for -my scrapers - Instapaper was initially more affected, but slowly Wallabag too -started bumping into Cloudflare screens, CAPTCHAs and paywalls. - -So the Internet Archive provided some temporary relief - I could still archive -articles there, and then instruct my Wallabag instance to read them from the -archive link. - -Except that, in the past few months, the Internet Archive has also started -implementing anti-scraping features, and you'll most likely get a Cloudflare -screen if you try and access an article from an external scraper. - -## A little ethical note before continuing - -I _do not_ condone nor support piracy. - -I mean, sometimes I do, but being a creator myself I always try to make sure -that, if piracy is the only way to freely access content wherever I want, then -creators are not being harmed (I don't mind harming any intermediaries that add -friction to the process and prevent me from having a raw file that I can -download and read wherever I want though). - -So I support creators via Patreon. I pay for subscriptions to digital magazines -that I will anyway never read through their official mobile app. I send one-off -donations when I find that some content was particularly useful to me. I buy -physical books and magazines every now and then from authors or publishers that -I want to support. And I'd probably support content creators even more if only -they allowed me to pay only for the content I want to read, and not lock me -into a Hotel California subscription ("_you can check out any time you like, -but you can never leave_") because their PMs only care about recurring revenue. - -I also think that the current business model that runs most of the high-quality -content available online (locking people into apps and subscriptions in order -to view the content) is detrimental for the distribution of knowledge in what's -supposed to be the age of information. If I want to be exposed to diverse -opinions on what's going on in different industries or different parts of the -world, I probably need at least a dozen subscriptions. And probably pay -something off to download special reports. In the earlier days we didn't have -to give away so much money if we wanted to access content for our personal -research - we could just buy a book or a single issue of a magazine, or even -just walk into a library and read content for free. If we have no digital -alternatives for such simple and established ways to access knowledge, then -piracy becomes almost a civic duty. It can't be that high quality reports or -insightful blog articles are locked behind paywalls, subscriptions and apps and -all that's left for free is cheap disinformation on social media. Future -historians will have a very hard time deciphering what was going on in the -world in the 2020s, because most of the content that was available online is -now locked behind paywalls, the companies that ran those sites and built the -apps may be long gone, and if publishers keep waging war against folks like the -Internet Archive, then they may start looking at our age like some kind of -strange digital dark age. - -I also think that it's my right, as a reader, to be able to consume content on a medium without distractions - like -social media buttons, ads, comments, or other stuff that distracts me from the main content, and if the publisher -doesn't provide me with a solution for that, and I have already paid for the content, then I should be able to build a -solution myself. - -And I also demand the right to access the content I've paid for however I want. - -Do I want to export everything to Markdown or read it in ASCII art in a -terminal? Do I want to export it to EPub so I can read it on my e-ink device? -Do I want to access it without having to use their tracker-ridden mobile app, -or without being forced to see ads despite having paid for a subscription? -Well, that's my business. I firmly believe that it's not an author's or -publisher's right to dictate how I access the content after paying for it. Just -like in earlier times nobody minded if, after purchasing a book, I would share -it with my kids, or lend it to a friend, or scan it and read it on my computer, -or make the copies of a few pages to bring to my students or my colleagues. - -If some freedoms were legally granted to me before, and now they've been taken -away, then it's not piracy if I keep demanding those freedoms. - -And content ownership is another problem. I'll no longer be able to access -content I've read during my subscription period once my subscription expires. -I'll not be able to pass on the books or magazine I've read in my lifetime to -my kid. I'll never be able to lend it to someone else, just like I would leave -a book I had read on a public bookshelf or a bench at the park for someone -else to read it. - -In other words, buying now grants you a temporary license to access the content -on someone else's devices - you don't really own anything. - -So, if buying isn't owning, then piracy isn't stealing. - -And again, to make it very clear, I'll be referring to *personal usage* in this -article. The case where you support creators through other means, but the -distribution channel is the problem, and you just want your basic freedoms -as a content consumer back. - -If however you start to share scraped articles on the Web, or even worse profit -from access to it, then you're *really* doing the kind of piracy I can't -condone. - -With this out of the way, let's get our hands dirty. - -## The setup - -My current set up is quite complex. At some point I may package all the moving -parts into a single stand-alone application, including both the browser -extension and the backend, but at the moment it should be sufficient to get -things to work. - -A high-level overview of the setup is as follows: - -<img alt="High-level overview of the scraper setup" src="http://s3.platypush.tech/static/images/wallabag-scraper-architecture.png" width="650px"> - -Let's break down the building blocks of this setup: - -- **[Redirector](https://addons.mozilla.org/en-US/firefox/addon/redirector/)** - is a browser extension that allows you to redirect URLs based on custom - rules as soon as the page is loaded. This is useful to redirect paywalled - resources to the Internet Archive, which usually stores full copies of the - content. Even if you regularly paid for a subscription to a magazine, and you - can read the article on the publisher's site or from their app, your Wallabag - scraper will still be blocked if the site implements client-side restrictions - or is protected by Cloudflare. So you need to redirect the URL to the Internet - Archive, which will then return a copy of the article that you can scrape. - -- **[Platypush](https://platypush.tech)** is a Python-based general-purpose - platform for automation that I've devoted a good chunk of the past decade - to develop. It allows you to run actions, react to events and control devices - and services through a unified API and Web interface, and it comes with - [hundreds of supported integrations](https://docs.platypush.tech). We'll use - the [`wallabag`](https://docs.platypush.tech/platypush/plugins/wallabag.html) - plugin to push articles to your Wallabag instance, and optionally the - [`rss`](https://docs.platypush.tech/platypush/plugins/rss.html) plugin if you - want to programmatically subscribe to RSS feeds, scrape articles and archive - them to Wallabag, and the - [`ntfy`](https://docs.platypush.tech/platypush/plugins/ntfy.html) plugin to - optionally send notifications to your mobile device when new articles are - available. - -- **[Platypush Web extension](https://addons.mozilla.org/en-US/firefox/addon/platypush/)** - is a browser extension that allows you to interact with Platypush from your - browser, and it also provides a powerful JavaScript API that you can leverage - to manipulate the DOM and automate tasks in the browser. It's like a - [Greasemonkey](https://addons.mozilla.org/en-US/firefox/addon/greasemonkey/) - or [Tampermonkey](https://addons.mozilla.org/en-US/firefox/addon/tampermonkey/) - extension that allows you to write custom scripts to customize your browser - experience, but it also allows you to interact with Platypush and leverage - its backend capabilities. On top of that, I've also added built-in support - for the [Mercury Parser API](https://github.com/usr42/mercury-parser) in it, - so you can easily distill articles - similar to what Firefox does with its - [Reader - Mode](https://support.mozilla.org/en-US/kb/firefox-reader-view-clutter-free-web-pages), - but in this case you can customize the layout and modify the original DOM - directly, and the distilled content can easily be dispatched to any other - service or application. We'll use it to: - - - Distill the article content from the page, removing all the - unnecessary elements (ads, comments, etc.) and leaving only the main text - and images. - - - Temporarily archive the distilled article to a Web server capable of - serving static files, so Wallabag can get the full content and bypass any - client-side restrictions. - - - Archive the distilled article to Wallabag, so you can read it later - from any device that has access to your Wallabag instance. - -- **[Wallabag](https://wallabag.org)** is a self-hosted read-it-later - service that allows you to save articles from the Web and read them later, - even offline. It resembles the features of the ([recently - defunct](https://support.mozilla.org/en-US/kb/future-of-pocket)) - [Pocket](https://getpocket.com/home). It provides a Web interface, mobile - apps and browser extensions to access your saved articles, and it can also be - used as a backend for scraping articles from the Web. - -- (_Optional_) **[KOReader](https://koreader.rocks)** is an - open-source e-book reader that runs on a variety of devices, including any - e-ink readers that run Android (and even the - [Remarkable](https://github.com/koreader/koreader/wiki/Installation-on-Remarkable)). - It has a quite minimal interface and it may take a while to get used to, but - it's extremely powerful and customizable. I personally prefer it over the - official Wallabag app - it has a native Wallabag integration, as well as OPDS - integration to synchronize with my - [Ubooquity](https://docs.linuxserver.io/images/docker-ubooquity/) server, - synchronization of highlights and notes to Nextcloud Notes, WebDAV support - (so you can access anything hosted on e.g. your Nextcloud instance), progress - sync across devices through their [sync - server](https://github.com/koreader/koreader-sync-server), and much more. It - basically gives you a single app to access your saved articles, your books, - your notes, your highlights, and your documents. - -- (_Optional_) An Android-based e-book reader to run KOReader on. I have - recently switched from my old Mobiscribe to an [Onyx BOOX Note Air - 4](https://www.onyxbooxusa.com/onyx-boox-note-air4-c) and I love it. It's - powerful, the display is great, it runs basically any Android app out there - (and I've had no issues with running any apps installed through - [F-Droid](https://f-droid.org)), and it also has a good set of stock apps, - and most of them support WebDAV synchronization - ideal if you have a - [Nextcloud](https://nextcloud.com) instance to store your documents and - archived links. - -**NOTE**: The Platypush extension only works with Firefox, on any Firefox-based -browser, or on any browser out there that still supports the [Manifest -V2](https://blog.mozilla.org/addons/2024/03/13/manifest-v3-manifest-v2-march-2024-update/). -The Manifest V3 has been a disgrace that Google has forced all browser -extension developers to swallow. I won't go in detail here, but the Platypush -extension needs to be able to perform actions (such as calls to custom remote -endpoints and runtime interception of HTTP headers) that are either no longer -supported on Manifest V3, or that are only supported through laborious -workarounds (such as using the declarative Net Request API to explicitly -define what you want to intercept and what remote endpoints you want to call). - -**NOTE 2**: As of June 2025, the Platypush extension is only supported on -Firefox for desktop. A Firefox for Android version [is -work in progress](https://git.platypush.tech/platypush/platypush-webext/issues/1). - -Let's dig deeper into the individual components of this setup. - -## Redirector - - - -This is a nice addition if you want to automatically view some links through -the Internet Archive rather than the original site. - -You can install it from the [Firefox Add-ons site](https://addons.mozilla.org/en-US/firefox/addon/redirector/). -Once installed, you can create a bunch of rules (regular expressions are supported) -to redirect URLs from paywalled domains that you visit often to the Internet Archive. - -For example, this regular expression: - -``` -^(https://([\w-]+).substack.com/p/.*) -``` - -will match any Substack article URL, and you can redirect it to the Internet Archive -through this URL: - -``` -https://archive.is/$1 -``` - -Next time you open a Substack article, it will be automatically redirected to its -most recent archived version - or it will prompt you to archive the URL if it's not -been archived yet. - -## Wallabag - - - -Wallabag can easily be installed on any server [through Docker](https://doc.wallabag.org/developer/docker/). - -Follow the documentation for the set up of your user and create an API token from the Web interface. - -It's also advised to [set up a reverse -proxy](https://doc.wallabag.org/admin/installation/virtualhosts/#configuration-on-nginx) in front of Wallabag, so you -can easily access it over HTTPS. - -Once configured the reverse proxy, you can generate a certificate for it - for example, if you use -[`certbot`](https://certbot.eff.org/) and `nginx`: - -```bash -certbot --nginx -d your-domain.com -``` - -Then you can access your Wallabag instance at `https://your-domain.com` and log in with the user you created. - -Bonus: I personally find the Web interface of Wallabag quite ugly - the fluorescent light blue headers are distracting -and the default font and column width isn't ideal for my taste. So I made a [Greasemonkey/Tampermonkey -script](https://gist.manganiello.tech/fabio/ec9e28170988441d9a091b3fa6535038) to make it better if you want (see -screenshot above). - -## [_Optional_] ntfy - -[ntfy](https://ntfy.sh) is a simple HTTP-based pub/sub notification service that you can use to send notifications to -your devices or your browser. It provides both an [Android app](https://f-droid.org/en/packages/io.heckel.ntfy/) and a -[browser addon](https://addons.mozilla.org/en-US/firefox/addon/send-to-ntfy/) to send and receive notifications, -allowing you to open saved links directly on your phone or any other device subscribed to the same topic. - -Running it via docker-compose [is quite -straightforward](https://github.com/binwiederhier/ntfy/blob/main/docker-compose.yml). - -It's also advised to serve it behind a reverse proxy with HTTPS support, keeping in mind to set the right header for the -Websocket paths - example nginx configuration: - -```nginx -map $http_upgrade $connection_upgrade { - default upgrade; - '' close; -} - -server { - server_name notify.example.com; - - location / { - proxy_pass http://your-internal-ntfy-host:port; - - client_max_body_size 5M; - - proxy_read_timeout 60; - proxy_connect_timeout 60; - proxy_redirect off; - - proxy_set_header Host $http_host; - proxy_set_header X-Real-IP $remote_addr; - proxy_set_header X-Forwarded-Ssl on; - proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; - } - - location ~ .*/ws/?$ { - proxy_http_version 1.1; - proxy_set_header Upgrade $http_upgrade; - proxy_set_header Connection $connection_upgrade; - proxy_set_header Host $http_host; - proxy_pass http://your-internal-ntfy-host:port; - } -} -``` - -Once the server is running, you can check the connectivity by opening your server's main page in your browser. - -## Local Web server - -This approach uses an intermediary Web server to temporarily archive the distilled article content, if available, and -instructing Wallabag to parse it from there. - -## Platypush - -Create a new virtual environment and install Platypush with the `wallabag` and `rss` -plugin dependencies through `pip`: - -```bash -python3 -m venv venv -source venv/bin/activate -pip install 'platypush[wallabag,rss]' -``` - -Then create a new configuration file `~/.config/platypush/config.yaml` with the following configuration: - -```yaml -# Web server configuration -backend.http: - # - port: 8008 - -# Wallabag configuration -wallabag: - server_url: https://your-domain.com - client_id: your_client_id - client_secret: your_client_secret - # Your Wallabag user credentials are required for the first login. - # It's also advised to keep them here afterwards so the refresh - # token can be automatically updated. - username: your_username - password: your_password -``` - -Then you can start the service with: - -```bash -platypush -``` - -You can also create a systemd service to run Platypush in the background: - -```bash -mkdir -p ~/.config/systemd/user - -cat <<EOF > ~/.config/systemd/user/platypush.service -[Unit] -Description=Platypush service -After=network.target - -[Service] -ExecStart=/path/to/venv/bin/platypush -Restart=always -RestartSec=5 -EOF - -systemctl --user daemon-reload -systemctl --user enable --now platypush.service -``` - -After starting the service, head over to `http://your_platypush_host:8008` (or the port you configured in the -`backend.http` section) and create a new user account. - -It's also advised to serve the Platypush Web server behind a reverse proxy with HTTPS support if you want it to easily -be accessible from the browser extension - a basic `nginx` configuration [is available on the -repo](https://git.platypush.tech/platypush/platypush/src/branch/master/examples/nginx/nginx.sample.conf). - -## Platypush Web extension - -You can install the Platypush Web extension from the [Firefox Add-ons -site](https://addons.mozilla.org/en-US/firefox/addon/platypush/). - -After installing it, click on the extension popup and add the URL of your Platypush Web server. - - - -When successfully connected, you should see the device in the main menu, you can run commands on it and save actions. - -A good place to start familiarizing with the Platypush API is the _Run Action_ dialog, which allows you to run commands -on your server and provides autocomplete for the available actions, as well as documentation about their arguments. - - - -The default action mode is _Request_ (i.e. single requests against the API). You can also pack together more actions on -the backend [into -_procedures_](https://docs.platypush.tech/wiki/Quickstart.html#greet-me-with-lights-and-music-when-i-come-home), which -can be written either in the YAML config or as Python scripts (by default loaded from `~/.config/platypush/scripts`). -If correctly configured, procedures will be available in the _Run Action_ dialog. - -The other mode, which we'll use in this article, is _Script_. In this mode you can write custom JavaScript code that -can interact with your browser. - - - -[Here](https://gist.github.com/BlackLight/d80c571705215924abc06a80994fd5f4) is a sample script that you can use as a -reference for the API exposed by the extension. Some examples include: - -- `app.run`, to run an action on the Platypush backend - -- `app.getURL`, `app.setURL` and `app.openTab` to get and set the current URL, or open a new tab with a given URL - -- `app.axios.get`, `app.axios.post` etc. to perform HTTP requests to other external services through the Axios - library - -- `app.getDOM` and `app.setDOM` to get and set the current page DOM - -- `app.mercury.parse` to distill the current page content using the Mercury Parser API - -### Reader mode script - -We can put together the building blocks above to create our first script, which will distill the current page content -and replace the swap the current page DOM with the simplified content - with no ads, comments, or other distracting -visual elements. The full content of the script is available -[here](https://gist.manganiello.tech/fabio/c731b57ff6b24d21a8f43fbedde3dc30). - -This is akin to what Firefox' [Reader -Mode](https://support.mozilla.org/en-US/kb/firefox-reader-view-clutter-free-web-pages) does, but with much more room for -customization. - -Note that for this specific script we don't need any interactions with the Platypush backend. Everything happens on the -client, as the Mercury API is built into the Platypush Web extension. - -Switch to _Script_ mode in the _Run Action_ dialog, paste the script content and click on _Save Script_. You can also -choose a custom name, icon ([FontAwesome](https://fontawesome.com/icons) icon classes are supported), color and group -for the script. Quite importantly, you can also associate a keyboard shortcut to it, so you can quickly distill a page -without having to search for the command either in the extension popup or in the context menu. - -### Save to Wallabag script - -Now that we have a script to distill the current page content, we can create another script to save the distilled -content (if available) to Wallabag. Otherwise, it will just save the original page content. - -The full content of the script is available -[here](https://gist.manganiello.tech/fabio/8f5b08d8fbaa404bafc6fdeaf9b154b4).