Synchronize updates and deletions of posts and media attachments #6

Open
opened 2026-04-19 04:24:23 +02:00 by blacklight · 0 comments

Current issue: the archive scrapers operate on a last-seen-id basis.

They pull any new content of a user since a given last-seen post ID or attachment ID.

This means that updates or deletions applied to content submitted before that point will not be reprocessed, so deleted or changed content will never be synchronized.
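As a minimal sketch of the mechanism described above (function and field names are illustrative assumptions, mirroring the `since_id` parameter of Mastodon's `/api/v1/accounts/:id/statuses` endpoint):

```python
# Illustrative sketch of the last-seen-id strategy. The cursor corresponds
# to the `since_id` parameter of /api/v1/accounts/:id/statuses; the
# function and data shapes here are hypothetical.

def new_statuses(all_statuses, last_seen_id):
    """Return only statuses with an ID greater than the last archived one.

    Anything at or below `last_seen_id` is never fetched again, so an edit
    or a deletion of an older status goes unnoticed.
    """
    return [s for s in all_statuses if int(s["id"]) > int(last_seen_id)]


remote = [
    {"id": "101", "content": "old post, since edited"},
    {"id": "205", "content": "brand new post"},
]
# Only the new post is picked up; the edited old one is silently skipped.
print(new_statuses(remote, "150"))
```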

Unfortunately, this problem is not easy to solve because of the limitations of Mastodon's API and the nature of the archive. Both of the approaches below are very hard to implement and come with many trade-offs.

## Full reconciliation

On every round of post fetches, the archive needs to pull _all_ the posts for a given user (since the first one) and synchronize anything that has been either updated or deleted.
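The reconciliation step itself is easy to express; what makes the approach unfeasible is fetching the full remote timeline. A sketch of the diff (all names hypothetical):

```python
def reconcile(remote, archived):
    """Diff a user's full remote timeline against the archived copy.

    Both arguments map status IDs to their content. Returns the set of IDs
    edited remotely and the set of IDs deleted remotely.
    """
    updated = {i for i in remote if i in archived and remote[i] != archived[i]}
    deleted = set(archived) - set(remote)
    return updated, deleted
```

Note that `remote` here is the account's entire post history, which is exactly the part that cannot be fetched cheaply.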

### Why it's unfeasible

This approach would require an enormous number of queries on every round of polling. This means:

  1. Each round of reconciliations may take several hours (in the best-case scenario)
  2. The delays would be further compounded by the very high likelihood of hitting rate limits on the instances' APIs. This is already a problem at times when pulling just the latest posts from mastodon.social; it would make synchronization simply unfeasible if we tried to pull potentially thousands of messages for hundreds of accounts
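For context on point 2: Mastodon advertises its quota via the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers, so once the quota is exhausted a poller can only sleep until the reset time. A sketch of that back-off decision (the function itself is hypothetical):

```python
from datetime import datetime, timezone

def seconds_until_reset(headers, now=None):
    """Seconds to wait before the next request, given rate-limit headers.

    Mastodon sends X-RateLimit-Reset as an ISO 8601 timestamp; once
    X-RateLimit-Remaining reaches 0, the only option is to wait it out,
    which is what compounds the delays across hundreds of accounts.
    """
    now = now or datetime.now(timezone.utc)
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0
    reset = datetime.fromisoformat(
        headers["X-RateLimit-Reset"].replace("Z", "+00:00")
    )
    return max(0.0, (reset - now).total_seconds())
```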

## Streaming API

Hook into the events WebSocket API (used e.g. to power notifications on the frontend).
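To illustrate what such a hook would consume: the streaming API delivers JSON frames carrying an `event` name and a string `payload` (`update`/`status.update` with the full status, `delete` with the bare status ID). A sketch of applying those frames, with the `archive` dict as a stand-in for the real storage layer:

```python
import json

def apply_stream_event(frame, archive):
    """Apply one streaming-API frame to a hypothetical in-memory archive.

    `update` / `status.update` carry the full status as a JSON-encoded
    payload; `delete` carries just the status ID as a plain string.
    """
    msg = json.loads(frame)
    event, payload = msg["event"], msg.get("payload", "")
    if event in ("update", "status.update"):
        status = json.loads(payload)
        archive[status["id"]] = status
    elif event == "delete":
        archive.pop(payload, None)
    return archive
```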

### Why it's unfeasible

  1. The streaming API requires user authentication, and gaza-verified accounts are registered on dozens of instances: it is unfeasible to create "follow bots" on all of them just to listen to their notifications. Moreover, some instances have closed registrations or nobot policies.

_Potentially_ feasible approach (but very high implementation cost):

## Mock a Mastodon server

Use e.g. [Pubby](https://git.fabiomanganiello.com/pubby) to implement a minimal ActivityPub server on the archive. The "instance" can expose a single user that follows all the verified accounts. This will enable notifications to be delivered to that user when an account posts or modifies an activity. That event can be intercepted by the rest of the archive machinery and appropriately parsed and stored.
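A sketch of what that interception could look like once activities reach the bot's inbox. The dispatch follows the ActivityPub vocabulary (`Create`, `Update`, `Delete`); the `store` dict is a hypothetical stand-in for the archive's storage:

```python
def handle_inbox_activity(activity, store):
    """Route an incoming ActivityPub activity into the archive store.

    `store` maps object IRIs to their last-known JSON representation.
    Create/Update replace the stored object; Delete removes it (the
    object may arrive inline, as a Tombstone, or as a bare IRI string).
    """
    typ = activity.get("type")
    obj = activity.get("object")
    if typ in ("Create", "Update") and isinstance(obj, dict):
        store[obj["id"]] = obj
    elif typ == "Delete":
        store.pop(obj["id"] if isinstance(obj, dict) else obj, None)
    return store
```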

Reference
blacklight/gaza-archive#6