scribe/src/models/nodes.cr
Edward Loveall 7518a035b1
Proxy GitHub gists with rate limiting
Previously, GitHub gists were embedded. The gist URL would be detected
in a paragraph and the page would render a script like:

```html
<script src="https://gist.github.com/user/gist_id.js"></script>
```

The script would then embed the gist on the page. However, gists can
contain multiple files. It's technically possible to embed a single file
in the same way by appending a `file` query param:

```html
<script
src="https://gist.github.com/user/gist_id.js?file=foo.txt"></script>
```

I wanted to try and tackle proxying gists instead.

Overview
--------

At a high level, the `PageConverter` kicks off the work of fetching and
storing the gist content, then sends that content down to the
`ParagraphConverter`. When a paragraph comes up that contains a gist
embed, it retrieves the previously fetched content. This allows all the
necessary content to be fetched up front so that as few requests as
possible are made.

Fetching Gists
--------------

There is now a `GithubClient` class that gets gist content from GitHub's
REST API. The gist API response looks something like this (irrelevant
keys removed):

```json
{
  "files": {
    "file-one.txt": {
      "filename": "file-one.txt",
      "raw_url": "https://gist.githubusercontent.com/<username>/<id>/raw/<file_id>/file-one.txt",
      "content": "..."
    },
    "file-two.txt": {
      "filename": "file-two.txt",
      "raw_url": "https://gist.githubusercontent.com/<username>/<id>/raw/<file_id>/file-two.txt",
      "content": "..."
    }
  }
}
```

That response gets turned into a bunch of `GistFile` objects that are
then stored in a request-level `GistStore`. Crystal's JSON parsing does
not make it easy to parse JSON with arbitrary keys into objects. This is
because each key corresponds to an object property, like
`property name : String`. If Crystal doesn't know the keys ahead of
time, there's no way to know what methods to create.

That's a problem here because the key for each gist file is the unique
filename. Fortunately, the keys for each _file_ follow the same pattern
and are easy to parse into a `GistFile` object. To turn gist file JSON
into Crystal objects, the `GithubClient` turns the whole response into a
`JSON::Any`, which is like a Hash. Then it extracts just the file data
objects and parses those into `GistFile` objects.
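
As a rough illustration of that two-step parse (not the actual
`GithubClient` code), here is a minimal sketch. The `parse_gist_files`
helper and the exact `GistFile` properties are assumptions based on the
response shape above:

```crystal
require "json"

# Assumed shape of a gist file, mirroring the keys in the API response.
class GistFile
  include JSON::Serializable

  getter filename : String
  getter raw_url : String
  getter content : String
end

# Hypothetical helper: parse the whole body loosely as JSON::Any, then
# parse each file object strictly into a GistFile.
def parse_gist_files(body : String) : Array(GistFile)
  JSON.parse(body)["files"].as_h.values.map do |file_data|
    GistFile.from_json(file_data.to_json)
  end
end
```

The arbitrary filename keys are only ever touched through `as_h`, so
nothing needs to know them at compile time.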

Those `GistFile` objects are then cached in a `GistStore` that is shared
for the page, which means one gist cache per request/article. `GistFile`
objects can be fetched out of the store by file, or if no file is
specified, it returns all files in the gist.
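
A minimal sketch of how such a store could behave, assuming a hash
keyed by gist ID. Only `get_gist_files` appears in this commit's
`Nodes::GithubGist`; `store_gist_file`, the simplified `GistFile` and
`MissingGistFile` records, and the internals are assumptions:

```crystal
# Simplified stand-ins for illustration only.
record GistFile, filename : String, content : String
record MissingGistFile, id : String, filename : String?

class GistStore
  def initialize
    @store = Hash(String, Array(GistFile)).new
  end

  # Hypothetical writer used while populating the store.
  def store_gist_file(id : String, file : GistFile)
    (@store[id] ||= [] of GistFile) << file
  end

  # Returns every file in the gist, or only the named file if given.
  def get_gist_files(id : String, filename : String?) : Array(GistFile) | Array(MissingGistFile)
    files = @store[id]?
    return [MissingGistFile.new(id, filename)] if files.nil?
    return files if filename.nil?

    matches = files.select { |file| file.filename == filename }
    matches.empty? ? [MissingGistFile.new(id, filename)] : matches
  end
end
```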

The `GistFile` is rendered as a link, labeled with the file's name, to
the file in the gist on GitHub, followed by a code block of the file's
contents.

In summary, the `PageConverter`:

* Scans the paragraphs for GitHub gists using `GistScanner`
* Requests their data from GitHub using the `GithubClient`
* Parses the response into `GistFile`s and populates the `GistStore`
* Passes that `GistStore` to the `ParagraphConverter` to use when
  constructing the page nodes

Caching
-------

GitHub limits API requests to 5000/hour with a valid API token and
60/hour without. 60 is pretty tight for the usage that scribe.rip gets,
but 5000 is reasonable most of the time. Not every article has an
embedded gist, but some articles have multiple gists. A viral article
(of which Scribe has seen two at the time of this commit) might receive
a little over 127k hits/day, which is an average of about 5300/hour. If
that article had a gist, Scribe would reach the API limit during parts
of the day with high traffic. If it had multiple gists, it would hit it
even sooner. However, average traffic is around 30k visits/day (about
1250/hour), which would be well under the limit, assuming average load.
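
The arithmetic behind those numbers, as a small sketch. It assumes
every hit triggers one API request per gist (the store is per request,
so nothing is shared between hits); the helper name is made up for
illustration:

```crystal
GITHUB_AUTHENTICATED_LIMIT = 5000 # requests/hour with a valid API token

# Hypothetical helper: average API requests per hour for an article.
def average_requests_per_hour(hits_per_day : Int32, gists_per_article : Int32 = 1) : Float64
  hits_per_day * gists_per_article / 24
end

viral = average_requests_per_hour(127_000)  # about 5292
typical = average_requests_per_hour(30_000) # 1250.0

puts "viral over limit? #{viral > GITHUB_AUTHENTICATED_LIMIT}"
puts "typical over limit? #{typical > GITHUB_AUTHENTICATED_LIMIT}"
```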

To help avoid hitting that limit, a `GistStore` holds all the `GistFile`
objects per gist. The logic in `GistScanner` is smart enough to only
return unique gist URLs so each gist is only requested once even if
multiple files from one gist exist in an article. This limits the number
of times Scribe hits the GitHub API.
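
A simplified sketch of that de-duplication. The real `GistScanner`
works on paragraph data, not plain strings, and the regex here is an
assumption about what a gist URL looks like:

```crystal
# Matches the gist itself and ignores any `?file=` query, so two embeds
# of different files from one gist collapse to the same URL.
GIST_URL = %r{https://gist\.github\.com/[\w-]+/\w+}

# Hypothetical helper: collect each gist URL once across all paragraphs.
def scan_gist_urls(paragraphs : Array(String)) : Array(String)
  paragraphs
    .flat_map { |text| text.scan(GIST_URL).map(&.[0]) }
    .uniq
end
```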

If Scribe is rate-limited, instead of populating a `GistStore` the
`PageConverter` will create a `RateLimitedGistStore`. This is an object
that acts like the `GistStore` but returns `RateLimitedGistFile` objects
instead of `GistFile` objects. This allows Scribe to gracefully degrade
in the event of reaching the rate limit.
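
A minimal sketch of that stand-in object, assuming it only needs to
answer `get_gist_files` the same way the real store does. The
`content` method and its message text are assumptions:

```crystal
class RateLimitedGistFile
  # Assumed placeholder text; the real wording may differ.
  def content : String
    "Scribe has been rate-limited by GitHub."
  end
end

class RateLimitedGistStore
  # Same signature as the regular store, but it never holds any files:
  # whatever is asked for, the answer is a single rate-limited file.
  def get_gist_files(id : String, filename : String?) : Array(RateLimitedGistFile)
    [RateLimitedGistFile.new]
  end
end
```

Because both stores respond to the same method, the code that renders
a gist does not need to know whether the rate limit was hit.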

If rate-limiting becomes a regular problem, Scribe could also be
reworked to fall back to the embedded gists again.

API Credentials
---------------

API credentials are in the form of a GitHub username and a personal
access token attached to that username. To get a token, visit
https://github.com/settings/tokens and create a new token. The only
permission it needs is `gist`.

This token is set via the `GITHUB_PERSONAL_ACCESS_TOKEN` environment
variable. The username also needs to be set via `GITHUB_USERNAME`. When
developing locally, these can both be set in the `.env` file.
Authentication is probably not necessary locally, but it's there if you
want to test. If either value is missing, unauthenticated requests are
made.
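
A sketch of that fallback, assuming the credentials are sent as HTTP
basic auth. The `github_headers` helper and the `Accept` header value
are assumptions for illustration, not the actual `GithubClient` code:

```crystal
require "base64"
require "http/headers"

# Hypothetical helper: only add an Authorization header when both the
# username and the token are present; otherwise stay unauthenticated.
def github_headers(username : String?, token : String?) : HTTP::Headers
  headers = HTTP::Headers{"Accept" => "application/vnd.github.v3+json"}
  if username && token
    credentials = Base64.strict_encode("#{username}:#{token}")
    headers["Authorization"] = "Basic #{credentials}"
  end
  headers
end

headers = github_headers(
  ENV["GITHUB_USERNAME"]?,
  ENV["GITHUB_PERSONAL_ACCESS_TOKEN"]?
)
```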

Rendering
---------

The node tree itself holds a `GithubGist` object. It has a reference to
the `GistStore` and the original gist URL. When it renders, the page
requests the gist's `files`. The gist ID and optional file are detected,
and then used to request the file(s) from the `GistStore`. Gists render
as a list of each file's contents and a link to the file on GitHub.
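
A minimal sketch of how the gist ID and optional file could be pulled
out of the URL. `GistParams.extract_from_url`, `id` and `filename` are
the names `Nodes::GithubGist` uses; the body here is an assumption
about gist URLs of the form `https://gist.github.com/user/gist_id`:

```crystal
require "uri"

record GistParams, id : String, filename : String? do
  # Assumes the gist ID is the last path segment and the optional file
  # is given as a `file` query param.
  def self.extract_from_url(href : String) : GistParams
    uri = URI.parse(href)
    id = uri.path.split("/").last
    filename = uri.query_params["file"]?
    new(id, filename)
  end
end
```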

If the requests were rate-limited, the store is a
`RateLimitedGistStore` and the files are `RateLimitedGistFile`s. These
rate-limited objects are rendered with a link to the gist on GitHub and
text saying that Scribe has been rate-limited.

If somehow the file requested doesn't exist in the store, it displays
similarly to the rate-limited file but with "file missing" text instead
of "rate limited" text.

GitHub API docs: https://docs.github.com/en/rest/reference/gists
Rate Limiting docs:
https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting
2022-01-23 15:05:46 -05:00

242 lines
4 KiB
Crystal

module Nodes
  alias Embedded = EmbeddedLink | EmbeddedContent | GithubGist
  alias Leaf = Text | Image | Embedded
  alias Child = Container | Leaf | Empty
  alias Children = Array(Child)
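  class Container
    getter children : Children

    def initialize(@children : Children)
    end

    def ==(other : Container)
      other.children == children
    end

    # A container is empty when it has no children or every child is
    # empty. `all?` is used because `each` returns nil, not a Bool.
    def empty?
      children.empty? || children.all?(&.empty?)
    end
  end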
  class Empty
    def empty?
      true
    end
  end

  class BlockQuote < Container
  end

  class Code < Container
  end

  class Emphasis < Container
  end

  class Figure < Container
  end

  class FigureCaption < Container
  end

  class Heading1 < Container
  end

  class Heading2 < Container
  end

  class Heading3 < Container
  end

  class ListItem < Container
  end

  class MixtapeEmbed < Container
  end

  class OrderedList < Container
  end

  class Paragraph < Container
  end

  class Preformatted < Container
  end

  class Strong < Container
  end

  class UnorderedList < Container
  end
  class Text
    getter content : String

    def initialize(@content : String)
    end

    def ==(other : Text)
      other.content == content
    end

    def empty?
      content.empty?
    end
  end
  class Image
    IMAGE_HOST      = "https://cdn-images-1.medium.com/fit/c"
    MAX_WIDTH       = 800
    FALLBACK_HEIGHT = 600

    getter originalHeight : Int32
    getter originalWidth : Int32

    def initialize(
      @src : String,
      originalWidth : Int32?,
      originalHeight : Int32?
    )
      @originalWidth = originalWidth || MAX_WIDTH
      @originalHeight = originalHeight || FALLBACK_HEIGHT
    end

    def ==(other : Image)
      other.src == src
    end

    def src
      [IMAGE_HOST, width, height, @src].join("/")
    end

    def width
      [originalWidth, MAX_WIDTH].min.to_s
    end

    def height
      if originalWidth > MAX_WIDTH
        (originalHeight * ratio).round.to_i.to_s
      else
        originalHeight.to_s
      end
    end

    private def ratio
      MAX_WIDTH / originalWidth
    end

    def empty?
      false
    end
  end
  class EmbeddedContent
    MAX_WIDTH = 800

    getter src : String

    def initialize(@src : String, @originalWidth : Int32, @originalHeight : Int32)
    end

    def width
      [@originalWidth, MAX_WIDTH].min.to_s
    end

    def height
      if @originalWidth > MAX_WIDTH
        (@originalHeight * ratio).round.to_i.to_s
      else
        @originalHeight.to_s
      end
    end

    private def ratio
      MAX_WIDTH / @originalWidth
    end

    def ==(other : EmbeddedContent)
      other.src == src && other.width == width && other.height == height
    end

    def empty?
      false
    end
  end
  class EmbeddedLink
    getter href : String

    def initialize(@href : String)
    end

    def domain
      URI.parse(href).host
    end

    def ==(other : EmbeddedLink)
      other.href == href
    end

    def empty?
      false
    end
  end
  class Anchor < Container
    getter href : String

    def initialize(@children : Children, @href : String)
    end

    def ==(other : Anchor)
      other.children == children && other.href == href
    end

    def empty?
      false
    end
  end

  class UserAnchor < Container
    USER_BASE_URL = "https://medium.com/u/"

    getter href : String

    def initialize(@children : Children, user_id : String)
      @href = USER_BASE_URL + user_id
    end

    def ==(other : UserAnchor)
      other.children == children && other.href == href
    end

    def empty?
      false
    end
  end
  class GithubGist
    getter gist_store : GistStore | RateLimitedGistStore

    def initialize(@href : String, @gist_store : GistStore | RateLimitedGistStore)
    end

    def files : Array(GistFile) | Array(MissingGistFile) | Array(RateLimitedGistFile)
      gist_store.get_gist_files(params.id, params.filename)
    end

    private def params
      GistParams.extract_from_url(@href)
    end

    def ==(other : GithubGist)
      other.gist_store == gist_store
    end

    def empty?
      false
    end
  end
end