Commit graph

83 commits

Author SHA1 Message Date
Edward Loveall
1f517f9031
Link to full Medium URL on error page
Previously the link on the error page was only linking to the path
component of the url, e.g. `/search` but ignoring any query params e.g.
`/search?q=hello`. This uses the HTTP::Request `resource` method which
appears to capture both.
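
A minimal Python sketch of the difference between linking to the path and linking to the full resource. The helper names `resource` and `medium_link` are hypothetical and only mirror the idea; the actual change uses Crystal's HTTP::Request.

```python
from urllib.parse import urlsplit


def resource(request_target: str) -> str:
    """Return the path plus the query string, if there is one."""
    parts = urlsplit(request_target)
    return parts.path + ("?" + parts.query if parts.query else "")


def medium_link(request_target: str) -> str:
    # Link back to the same page on medium.com, query params included.
    return "https://medium.com" + resource(request_target)


print(medium_link("/search?q=hello"))
```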
2022-02-13 10:13:24 -05:00
Edward Loveall
24d3ab9ab3
Better article ID parsing
A new ArticleIdParser class takes in an HTTP::Request object and parses
the article ID from it. It intentionally fails on tag, user, and search
pages and attempts to only catch articles.
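
The idea can be sketched in Python as follows. This is not the ArticleIdParser itself: the function name, the rejection rules, and the assumption that a post ID is a run of 8 to 12 hex digits ending the last path segment are all illustrative guesses.

```python
import re
from typing import Optional

# Assumption: a post ID is 8-12 hex digits that end the last path segment,
# either alone or after a hyphen.
POST_ID = re.compile(r"(?:^|-)([0-9a-f]{8,12})$")


def parse_article_id(path: str) -> Optional[str]:
    segments = [s for s in path.split("/") if s]
    if not segments:
        return None
    # Tag and search pages are never articles.
    if segments[0] in ("tag", "search"):
        return None
    # A lone "@user" segment is a profile page, not an article.
    if len(segments) == 1 and segments[0].startswith("@"):
        return None
    match = POST_ID.search(segments[-1])
    return match.group(1) if match else None
```

Returning `None` for anything that doesn't look like an article lets the caller show the error page instead of guessing.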
2022-02-13 10:10:46 -05:00
Edward Loveall
f056a0b68a
Better error pages
Instead of showing the default Lucky error page, the styles now match
Scribe. In addition, if a URL can't be parsed, Scribe gives some
information as to why this might be (namely, that Scribe can only handle
article pages).
2022-02-12 17:56:36 -05:00
Edward Loveall
7d0bc37efd
Fix markup errors caused by UTF-16/8 differences
Medium uses UTF-16 character offsets (likely to make it easier to parse
in JavaScript) but Crystal uses UTF-8. Converting strings to UTF-16 to
do offset calculation then back to UTF-8 fixes some markup bugs.

---

Medium calculates markup offsets using UTF-16 encoding. Some characters,
like emoji, count as multiple UTF-16 code units, which affects those
offsets. For example, in UTF-16 💸 takes two code units, but Crystal
strings count it as one character. This is a problem for markup
generation because it can shift the markup and even cause out-of-range
errors.

Take the following example:

💸💸!

Imagine that `!` is bold but the emoji aren't. For Crystal, the bold
range starts at char index 2 and ends at char index 3. Medium's markup
will say the markup goes from offset 4 to 5. In a 3-character string
like this, trying to access character range 4...5 is an error because
the whole range is out of bounds.

My theory is that this is meant to be compatible with JavaScript's
string length calculations, as Medium is primarily a platform built for
the web:

```js
"a".length // 1
"💸".length // 2
"👩‍❤️‍💋‍👩".length // 11
```

To get these same numbers in Crystal, strings must be converted to
UTF-16:

```crystal
"a".to_utf16.size # 1
"💸".to_utf16.size # 2
"👩‍❤️‍💋‍👩".to_utf16.size # 11
```

The MarkupConverter now converts text into UTF-16 code unit arrays on
initialization. Once it's figured out the range of code units needed for
each piece of markup, it converts it back into UTF-8 strings.
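
The same round trip can be sketched in Python, whose strings also count code points rather than UTF-16 code units. The helper name `utf16_slice` is hypothetical; it only illustrates the technique and is not the MarkupConverter code.

```python
def utf16_slice(text: str, start: int, end: int) -> str:
    """Return the part of text between two UTF-16 code unit offsets."""
    # utf-16-le uses exactly 2 bytes per code unit and adds no BOM.
    units = text.encode("utf-16-le")
    return units[start * 2 : end * 2].decode("utf-16-le")


text = "💸💸!"
print(len(text))                            # 3 code points
print(len(text.encode("utf-16-le")) // 2)   # 5 UTF-16 code units
print(utf16_slice(text, 4, 5))              # !
```

Slicing in the UTF-16 domain keeps Medium's offsets valid, and decoding afterwards returns an ordinary string for rendering.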
2022-01-30 11:53:22 -05:00
Edward Loveall
648a933b24
Provide a list of instances as JSON
This is for extensions or other tools that wish to have a list of
instances. It can be accessed by visiting the raw file on sourcehut:

https://git.sr.ht/~edwardloveall/scribe/blob/main/docs/instances.json
2022-01-29 12:58:08 -05:00
Edward Loveall
08f38a4d25
Add GitHub Gist authentication instructions 2022-01-23 16:08:23 -05:00
Edward Loveall
3a8ad82252
Add CHANGELOG 2022-01-23 15:06:01 -05:00
Edward Loveall
7518a035b1
Proxy GitHub gists with rate limiting
Previously, GitHub gists were embedded. The gist url would be detected
in a paragraph and the page would render a script like:

```html
<script src="https://gist.github.com/user/gist_id.js"></script>
```

The script would then embed the gist on the page. However, gists contain
multiple files. It's technically possible to embed a single file in the
same way by appending a `file` query param:

```html
<script
src="https://gist.github.com/user/gist_id.js?file=foo.txt"></script>
```

I wanted to try and tackle proxying gists instead.

Overview
--------

At a high level the PageConverter kicks off the work of fetching and
storing the gist content, then sends that content down to the
`ParagraphConverter`. When a paragraph comes up that contains a gist
embed, it retrieves the previously fetched content. This allows all the
necessary content to be fetched up front so the minimum number of
requests need to be made.

Fetching Gists
--------------

There is now a `GithubClient` class that gets gist content from GitHub's
REST API. The gist API response looks something like this (non-relevant
keys removed):

```json
{
  "files": {
    "file-one.txt": {
      "filename": "file-one.txt",
      "raw_url": "https://gist.githubusercontent.com/<username>/<id>/raw/<file_id>/file-one.txt",
      "content": "..."
    },
    "file-two.txt": {
      "filename": "file-two.txt",
      "raw_url": "https://gist.githubusercontent.com/<username>/<id>/raw/<file_id>/file-two.txt",
      "content": "..."
    }
  }
}
```

That response gets turned into a bunch of `GistFile` objects that are
then stored in a request-level `GistStore`. Crystal's JSON parsing does
not make it easy to parse JSON with arbitrary keys into objects. This is
because each key corresponds to an object property, like `property name
: String`. If Crystal doesn't know the keys ahead of time, there's no
way to know what methods to create.

That's a problem here because the key for each gist file is the unique
filename. Fortunately, the keys for each _file_ follow the same pattern
and are easy to parse into a `GistFile` object. To turn gist file JSON
into Crystal objects, the `GithubClient` turns the whole response into a
`JSON::Any` which is like a Hash. Then it extracts just the file data
objects and parses those into `GistFile` objects.

Those `GistFile` objects are then cached in a `GistStore` that is shared
for the page, which means one gist cache per request/article. `GistFile`
objects can be fetched out of the store by file, or if no file is
specified, it returns all files in the gist.
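
As a rough Python sketch of the parsing and storing described above: the class and method names follow the commit message, but the method signatures and the choice to key the store by gist ID are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GistFile:
    filename: str
    raw_url: str
    content: str


def parse_gist_files(body: str) -> List[GistFile]:
    # The keys under "files" are arbitrary filenames, so walk the values
    # instead of mapping each key to a fixed attribute.
    files = json.loads(body)["files"]
    return [
        GistFile(f["filename"], f["raw_url"], f["content"])
        for f in files.values()
    ]


class GistStore:
    """Per-request cache of gist files, keyed by gist ID (an assumption)."""

    def __init__(self) -> None:
        self._files: Dict[str, List[GistFile]] = {}

    def store(self, gist_id: str, files: List[GistFile]) -> None:
        self._files[gist_id] = files

    def get(self, gist_id: str, filename: Optional[str] = None) -> List[GistFile]:
        files = self._files.get(gist_id, [])
        if filename is None:
            return files  # no file requested: return the whole gist
        return [f for f in files if f.filename == filename]
```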

The GistFile is rendered as a link of the file's name to the file in
the gist on GitHub, and then a code block of the contents of the file.

In summary, the `PageConverter`:

* Scans the paragraphs for GitHub gists using `GistScanner`
* Requests their data from GitHub using the `GithubClient`
* Parses the response into `GistFile`s and populates the `GistStore`
* Passes that `GistStore` to the `ParagraphConverter` to use when
  constructing the page nodes

Caching
-------

GitHub limits API requests to 5000/hour with a valid API token and
60/hour without. 60 is pretty tight for the usage that scribe.rip gets,
but 5000 is reasonable most of the time. Not every article has an
embedded gist, but some articles have multiple gists. A viral article
(of which Scribe has seen two at the time of this commit) might receive
a little over 127k hits/day, which is an average of over 5300/hour. If
that article had a gist, Scribe would reach the API limit during parts
of the day with high traffic. If it had multiple gists, it would hit it
even more. However, average traffic is around 30k visits/day which would
be well under the limit, assuming average load.

To help not hit that limit, a `GistStore` holds all the `GistFile`
objects per gist. The logic in `GistScanner` is smart enough to only
return unique gist URLs so each gist is only requested once even if
multiple files from one gist exist in an article. This limits the number
of times Scribe hits the GitHub API.
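
A small Python sketch of the deduplication idea. The regular expression and the function name are assumptions for illustration, not the actual `GistScanner` logic; it assumes gist URLs look like `https://gist.github.com/<user>/<hex id>`.

```python
import re
from typing import Iterable, List

# Assumption: the gist ID is a run of hex digits; any "?file=..." suffix
# is left out of the match, so two files from one gist share one URL.
GIST_URL = re.compile(r"https://gist\.github\.com/[\w-]+/[0-9a-f]+")


def unique_gist_urls(paragraphs: Iterable[str]) -> List[str]:
    seen = {}  # dict keys keep insertion order and drop duplicates
    for text in paragraphs:
        for url in GIST_URL.findall(text):
            seen.setdefault(url, None)
    return list(seen)
```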

If Scribe is rate-limited, instead of populating a `GistStore` the
`PageConverter` will create a `RateLimitedGistStore`. This is an object
that acts like the `GistStore` but returns `RateLimitedGistFile` objects
instead of `GistFile` objects. This allows Scribe to gracefully degrade
in the event of reaching the rate limit.
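
The fallback can be sketched in Python like this. The class names come from the commit message; the `RateLimited` exception, the `build_store` helper, and the `render` text are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class RateLimited(Exception):
    """Hypothetical error raised when GitHub reports the limit was hit."""


@dataclass
class GistFile:
    filename: str
    content: str

    def render(self) -> str:
        return f"{self.filename}:\n{self.content}"


@dataclass
class RateLimitedGistFile:
    gist_url: str

    def render(self) -> str:
        return f"{self.gist_url} (Scribe has been rate-limited)"


class GistStore:
    def __init__(self, files: Dict[str, List[GistFile]]) -> None:
        self._files = files

    def get(self, gist_url: str) -> list:
        return self._files.get(gist_url, [])


class RateLimitedGistStore:
    # Same interface as GistStore, but every lookup yields a placeholder.
    def get(self, gist_url: str) -> list:
        return [RateLimitedGistFile(gist_url)]


def build_store(fetch: Callable[[], Dict[str, List[GistFile]]]):
    try:
        return GistStore(fetch())
    except RateLimited:
        return RateLimitedGistStore()
```

Because both stores answer `get` and both file types answer `render`, the rendering code does not need to know whether the limit was reached.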

If rate-limiting becomes a regular problem, Scribe could also be
reworked to fall back to the embedded gists again.

API Credentials
---------------

API credentials are in the form of a GitHub username and a personal
access token attached to that username. To get a token, visit
https://github.com/settings/tokens and create a new token. The only
permission it needs is `gist`.

This token is set via the `GITHUB_PERSONAL_ACCESS_TOKEN` environment
variable. The username also needs to be set via `GITHUB_USERNAME`. When
developing locally, these can both be set in the .env file.
Authentication is probably not necessary locally, but it's there if you
want to test. If either value is missing, unauthenticated requests are
made.
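
A minimal Python sketch of that rule. The environment variable names are the ones given above; the function name and return shape are assumptions.

```python
import os
from typing import Mapping, Optional, Tuple


def github_auth(env: Mapping[str, str] = os.environ) -> Optional[Tuple[str, str]]:
    """Return (username, token), or None to make unauthenticated requests."""
    username = env.get("GITHUB_USERNAME")
    token = env.get("GITHUB_PERSONAL_ACCESS_TOKEN")
    if username and token:
        return (username, token)
    return None  # either value missing: fall back to unauthenticated
```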

Rendering
---------

The node tree itself holds a `GithubGist` object. It has a reference to
the `GistStore` and the original gist URL. When it renders, the page
requests the gist's `files`. The gist ID and optional file are detected,
and then used to request the file(s) from the `GistStore`. Gists render
as a list of each file's contents and a link to the file on GitHub.

If the requests were rate limited, the store is a
`RateLimitedGistStore` and the files are `RateLimitedGistFile`s. These
rate-limited objects are rendered with a link to the gist on GitHub and text
saying that Scribe has been rate-limited.

If somehow the file requested doesn't exist in the store, it displays
similarly to the rate-limited file but with "file missing" text instead
of "rate limited" text.

GitHub API docs: https://docs.github.com/en/rest/reference/gists
Rate Limiting docs:
https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting
2022-01-23 15:05:46 -05:00
Edward Loveall
8737ca7897
Add scribe.bus-hit.me instance 2022-01-16 22:05:31 -05:00
Edward Loveall
27234bd32a
Ensure that src/version is up-to-date when building
This is an experiment to see if it forces me to actually have updated
the version before I build. The idea is that I need to actually commit
the version which will make it more likely that all instances can pull
down the code and display the correct version if I've done it myself.

It uses `git show` to grab the committed contents of src/version then
checks to see if it matches today's date.
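
A rough Python sketch of such a check. The commit does not show the format of src/version, so the ISO date format (`YYYY-MM-DD`) and both function names here are assumptions.

```python
import subprocess
from datetime import date


def committed_version(path: str = "src/version") -> str:
    # `git show HEAD:<path>` prints the file as of the last commit,
    # ignoring any uncommitted edits in the working tree.
    result = subprocess.run(
        ["git", "show", f"HEAD:{path}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def is_current(version: str, today: date) -> bool:
    # Assumption: the version is an ISO date such as 2022-01-15.
    return version == today.isoformat()
```

A build script could call `is_current(committed_version(), date.today())` and abort when it returns `False`.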
2022-01-15 16:31:02 -05:00
Edward Loveall
c775072b3d
Add instructions for Lucky config variables
The most common is "How do I set my custom domain" (answer: APP_DOMAIN)
but this also requires setting LUCKY_ENV=production which requires
SECRET_KEY_BASE, DATABASE_URL, and PORT
2022-01-15 16:29:46 -05:00
Edward Loveall
46d87930b8
Use FAQ entry to explain custom domains 2022-01-08 20:15:46 -05:00
Edward Loveall
037bc7cd0f
Add visible version
This is to be able to track which instances (including the main one)
have which fixes
2022-01-04 21:26:53 -05:00
Edward Loveall
f7e82ffd03
Home page instructions for custom domains 2022-01-04 21:16:56 -05:00
Edward Loveall
6ea0586423
Improve Redirector extension instructions
This specifies advanced options for configuring the Redirector
extension. If everything is left on (like images) things will break
(like images). It also improves the regular expression a bit to account
for the image CDN

Co-authored-by: Austin Huang <im@austinhuang.me>
2022-01-04 20:58:30 -05:00
miklobit
d8d4913913
update crystal version in Dockerfile 2021-12-15 21:29:41 -05:00
Edward Loveall
1449acc500
Upgrade Crystal to 1.2.1 and Lucky to 0.29.0 2021-12-12 12:01:55 -05:00
Edward Loveall
e365ee8be5
Add FAQ on how to use Scribe with custom domains
This is generic so as to not call out any specific website.
2021-12-04 14:05:39 -05:00
Edward Loveall
9f2b2a6096
Add citizen4.eu instance 2021-11-20 11:00:34 -05:00
Edward Loveall
66acd562ae
Update readme 2021-11-20 11:00:34 -05:00
Edward Loveall
25464acabe
Add instance docs 2021-11-11 11:33:22 -05:00
Edward Loveall
3f56fac408
Add project goals to README 2021-11-07 12:22:11 -05:00
Edward Loveall
027e59645d
Support null image widths and heights 2021-11-06 13:22:03 -04:00
Edward Loveall
4b354c659f
Add FAQ 2021-10-23 15:34:13 -04:00
Edward Loveall
5df9c44a5c
Support null text on paragraphs
I think this was an old feature on medium, but you can see examples of
null text on this post:

https://medium.com/message/the-joy-of-typing-fd8d091ab8ef
2021-10-20 20:40:43 -04:00
Edward Loveall
7166b7d834
Add SECTION_CAPTION paragraph type
This doesn't seem to be rendered on medium.com. Here's a post that has
one, but the text is nowhere on the page:
https://medium.com/message/the-joy-of-typing-fd8d091ab8ef

This help article hints that it might have been a feature at one point
that they don't allow anymore:
https://medium.com/@Medium/images-652ee60abea6
2021-10-20 20:39:58 -04:00
Edward Loveall
513d590ce3
Point source link at sr.ht project page
Instead of the git page. That way it's easier to find the mailing lists
and whatnot.
2021-10-16 16:23:15 -04:00
Edward Loveall
f7ad92f4bf
Parsing Fix: Add H2 Paragraph type
The post id 34dead42a28 contained a new paragraph type: H2. Previously
the only known header types were H3 and H4. In this case, the paragraph
doesn't actually get rendered because it's the page title which is
removed from the page nodes (see commits 6baba803 and then fba87c10).
However, if somehow an author is able to get an H2 paragraph into the
page, it will display as an <h1> just as H3 displays as <h2> and H4
displays as <h3>.
2021-10-16 16:23:15 -04:00
tnp
eaf25ef23a
Add Dockerfile 2021-10-16 10:56:15 -04:00
Martin Puppe
0c018a898a
Add support for development with Nix
This patch adds support for development with the Nix package manager. In
order to support the traditional nix-shell tool as well as the (still
experimental) Nix Flakes feature of the upcoming version of Nix, this
patch adds shell.nix *and* flake.nix/flake.lock.  Usage instructions
have been added to the README.
2021-10-15 08:56:15 -04:00
Martin Puppe
56b6d546db
Further improve proposed pattern for Redirector
This patch further improves the proposed pattern for the Redirector
extension. In contrast to the old pattern, …

* … it will redirect the URL https://medium.com.
* … it will *not* redirect URLs with top-level domains like mediumXcom.
  (This point is purely theoretical, but it makes the regular expression
  more correct and consistent.)
* … it will *not* redirect URLs like https://link.medium.com/AXEtCilplkb
  which Scribe currently cannot handle. These are shortened URLs that
  users get when they use the Twitter button on Medium to share a post.

In order to implement the last point (not matching link.medium.com), the
pattern uses negative lookbehind. This feature of regular expressions is
supported by all recent browsers for which Redirector is available
(Firefox, Chrome, Edge, Opera)[^1], including the current version of
Firefox ESR (Extended Support Release).

[^1]: https://caniuse.com/js-regexp-lookbehind
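
To illustrate the three points, here is a Python sketch with a pattern that behaves the same way. It is not the pattern proposed in the README, which this message does not quote; the regular expression below is an assumption written to match the described behavior.

```python
import re

# Hypothetical pattern: medium.com and its sub-domains, except link.medium.com.
# (?<!//link\.) is a negative lookbehind: "medium.com" must not be directly
# preceded by "//link.".
MEDIUM = re.compile(
    r"^https?://(?:[a-z0-9-]+\.)*(?<!//link\.)medium\.com(?:/.*)?$"
)


def should_redirect(url: str) -> bool:
    return MEDIUM.match(url) is not None
```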
2021-10-15 08:55:26 -04:00
Edward Loveall
472b0092c8
Add mailing list for patches to README 2021-10-14 21:15:19 -04:00
Amolith
9fcf37f416
Use app_domain in Redirector example
In the current redirector example, "scribe.rip" is hardcoded as the
destination. This patch simply changes that to use the app_domain
environment variable, so people wanting to use a community instance
aren't mistakenly redirected to the main scribe.rip instance.
2021-10-14 18:10:46 -04:00
Martin Puppe
0d9170b8d6
Improve proposed pattern for Redirector extension
The old pattern matches all host names that end with medium.com. The new
pattern matches only medium.com and its sub-domains. For example, the
old pattern would have matched
https://foomedium.com/@user/post-123456abcdef.
2021-10-13 21:51:55 -04:00
Edward Loveall
e127a67c6b
Ensure gists display well at all device widths 2021-10-11 20:09:58 -04:00
Edward Loveall
91f4aae0bc
Add an example and tagline to homepage 2021-10-11 20:03:31 -04:00
Edward Loveall
bb94fb41b1
Support medium's redirectUrl query param
When a post has a gi= query param, Medium makes a global_identifier
"query". This redirects via a 307 temporary redirect to a url that
looks like this:

https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fexample.com%2Fmy-post-000000000000

Previously, scribe looked for the Medium post id in the url's path, not
its query params since query params can include other garbage like
medium_utm (not related to medium.com). Now it looks first for the post
id in the path, then looks to the redirectUrl as a fallback.
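
The lookup order can be sketched in Python as below. The helper names and the assumption that a post ID is 8 to 12 trailing hex digits are illustrative, not taken from Scribe's code.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumption: a post ID is 8-12 hex digits at the very end of the path.
POST_ID = re.compile(r"(?:^|[-/])([0-9a-f]{8,12})$")


def id_from_path(path: str) -> Optional[str]:
    match = POST_ID.search(path.rstrip("/"))
    return match.group(1) if match else None


def post_id(url: str) -> Optional[str]:
    parts = urlsplit(url)
    # First choice: the ID in the path. Other query params are ignored.
    found = id_from_path(parts.path)
    if found:
        return found
    # Fallback: the path of the URL carried in redirectUrl.
    redirect = parse_qs(parts.query).get("redirectUrl", [])
    if redirect:
        return id_from_path(urlsplit(redirect[0]).path)
    return None
```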
2021-10-11 12:04:17 -04:00
Edward Loveall
91687bb689
Add automatic redirect instructions to homepage 2021-10-10 15:05:56 -04:00
Edward Loveall
dbddfc9cb4
Update README 2021-10-10 14:52:37 -04:00
Edward Loveall
0998a87622
Remove Postgres stuff from script/setup
This app doesn't use a database so there's no point.
2021-10-10 14:52:14 -04:00
Edward Loveall
fba87c1076
Improve title parsing
The subtitle has been removed because it's difficult to find and error
prone to guess at. It is somewhat accessible from the post's
previewContent field in GraphQL but that can be truncated.
2021-10-03 18:14:46 -04:00
Edward Loveall
2808505b4e
Add instructions on how to view a post 2021-10-03 17:21:17 -04:00
Edward Loveall
aacef34a14
Accept all known medium post path types
Including:

* https://example.com/my-cool-post-123456abcdef
* https://example.com/123456abcdef
* https://medium.com/@user/my-cool-post-123456abcdef
* https://medium.com/user/my-cool-post-123456abcdef
* https://medium.com/p/my-cool-post-123456abcdef
* https://medium.com/posts/my-cool-post-123456abcdef
* https://medium.com/p/123456abcdef

Replace the domain in any of those URLs with the Scribe domain and it
should resolve.
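
A short Python sketch of that domain swap, using scribe.rip as the default instance. The function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_scribe_url(url: str, scribe_domain: str = "scribe.rip") -> str:
    # Keep the scheme, path, and query; only the host changes.
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=scribe_domain))


print(to_scribe_url("https://medium.com/@user/my-cool-post-123456abcdef"))
```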
2021-10-03 16:45:20 -04:00
Edward Loveall
0f6a2a3e1e
Fix GitHub Gist width 2021-09-25 13:26:24 -04:00
Edward Loveall
bd56bfdd9f
Embed widths are now the same width as all content 2021-09-25 13:26:10 -04:00
Edward Loveall
561483cf9f
Link to the author's page
Right now this links to the user's medium page. It may link to an
internal page in the future.

Instead of the Page taking the author as a string, it now takes a
PostResponse::Creator object. The Articles::ShowPage then converts the
Creator (a name and user_id) to an author link.

Finally, I did some refactoring of UserAnchor (which I thought I was
going to use for this) to change its userId attribute to user_id as is
Crystal convention.
2021-09-15 16:03:36 -04:00
Edward Loveall
1c20c81d06
Fix Blockquotes
In tufte.css blockquotes should contain a <p> that holds the content
and an optional <footer> for the source of the quote. Otherwise the
block quote text is unbounded and is way too wide. This wraps the
content in a paragraph
2021-09-15 15:25:34 -04:00
Edward Loveall
a6cafaa1fc
Render embedded content
PostResponse::Paragraphs that are of type IFRAME have extra data in the
iframe attribute to specify what's in the iframe. Not all data is the
same, however. I've identified three types and am using the new
EmbeddedConverter class to convert them:

* EmbeddedContent, the full iframe experience
* GithubGist, because medium or github treat embeds differently for
  whatever reason
* EmbeddedLink, the old style, just a link to the content. Effectively
  a fallback

The size of the original iframe is also specified as an attribute. This
code resizes it. The resizing is determined by figuring out the
width/height ratio and setting the width to 800.
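
The arithmetic can be sketched in Python as follows. The function name and the rounding choice are assumptions; only the 800 width and the use of the original ratio come from the description above.

```python
from typing import Tuple


def resize(width: int, height: int, target_width: int = 800) -> Tuple[int, int]:
    # Preserve the original aspect ratio while forcing the new width.
    ratio = height / width
    return target_width, round(target_width * ratio)


print(resize(400, 300))   # (800, 600)
print(resize(1600, 900))  # (800, 450)
```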

EmbeddedContent can be displayed if we have an embed.ly url, which most
iframe response data has. GitHub gists are a notable exception. Gists
instead can be embedded simply by taking the gist URL and attaching .js
to the end. That becomes the iframe's src attribute.

The PostResponse::Paragraph's iframe attribute is nillable. Previous
code used lots of if-statements with variable bindings to work with the
possible nil values:

```crystal
if foo = obj.nillable_value
  # obj.nillable_value was not nil and foo contains the value
else
  # obj.nillable_value was nil so do something else
end
```

See https://crystal-lang.org/reference/syntax_and_semantics/if_var.html
for more info

In the EmbeddedConverter the monads library has been introduced to get
rid of at least one level of nillability. This wraps values in Maybe
which allows for a cleaner interface:

```crystal
Monads::Try(Value).new(->{ obj.nillable_value })
  .to_maybe
  .fmap(->(value : Value) {
    # do something with value
  })
  .value_or(fallback) # value was nil; `fallback` is a placeholder for something else
```

This worked to get the iframe attribute from a Paragraph:

```crystal
Monads::Try(PostResponse::IFrame).new(->{ paragraph.iframe })
  .to_maybe
  .fmap(->(iframe : PostResponse::IFrame) {
    # iframe is not nil!
  }) # further .fmap calls, and so on
  .value_or(Empty.new)
```

iframe only has one attribute: mediaResource which contains the iframe
data. That was used to determine one of the three types above.

Finally, Tufte.css has options for iframes. They mostly look good except
for tweets which are too small and weirdly in the center of the page
which actually looks off-center. That's for another day though.
2021-09-15 15:18:08 -04:00
Edward Loveall
903f3f4b38
Add License 2021-09-12 17:34:48 -04:00
Edward Loveall
7851434952
Add script to build object file (.o) for Ubuntu
This ubuntu_server.o file then needs to be copied to the server and
linked.
2021-09-07 22:00:20 -04:00