A new ArticleIdParser class takes in an HTTP::Request object and parses
the article ID from it. It intentinoally fails on tag, user, and search
pages and attempts to only catch articles.
Medium uses UTF-16 character offsets (likely to make it easier to parse
in JavaScript) but Crystal uses UTF-8. Converting strings to UTF-16 to
do offset calculation then back to UFT-8 fixes some markup bugs.
---
Medium calculates markup offsets using UTF-16 encoding. Some characters
like Emoji are count as multiple bytes which affects those offsets. For
example in UTF-16 💸 is worth two bytes, but Crystal strings only count
it as one. This is a problem for markup generation because it can
offset the markup and even cause out-of-range errors.
Take the following example:
💸💸!
Imagine that `!` was bold but the emoji isn't. For Crystal, this starts
at char index 2, end at char index 3. Medium's markup will say markup
goes from character 4 to 5. In a 3 character string like this, trying
to access character range 4...5 is an error because 5 is already out of
bounds.
My theory is that this is meant to be compatible with JavaScript's
string length calculations, as Medium is primarily a platform built for
the web:
```js
"a".length // 1
"💸".length // 2
"👩❤️💋👩".length // 11
```
To get these same numbers in Crystal strings must be converted to
UTF-16:
```crystal
"a".to_utf16.size # 1
"💸".to_utf16.size # 2
"👩❤️💋👩".to_utf16.size # 11
```
The MarkupConverter now converts text into UFT-16 byte arrays on
initialization. Once it's figured out the range of bytes needed for
each piece of markup, it converts it back into UTF-8 strings.
Previously, GitHub gists were embedded. The gist url would be detected
in a paragraph and the page would render a script like:
```html
<script src="https://gist.github.com/user/gist_id.js"></script>
```
The script would then embed the gist on the page. However, gists contain
multiple files. It's technically possible to embed a single file in the
same way by appending a `file` query param:
```html
<script
src="https://gist.github.com/user/gist_id.js?file=foo.txt"></script>
```
I wanted to try and tackle proxying gists instead.
Overview
--------
At a high level the PageConverter kicks off the work of fetching and
storing the gist content, then sends that content down to the
`ParagraphConverter`. When a paragraph comes up that contains a gist
embed, it retrieves the previously fetched content. This allows all the
necessary content to be fetched up front so the minimum number of
requests need to be made.
Fetching Gists
--------------
There is now a `GithubClient` class that gets gist content from GitHub's
ReST API. The gist API response looks something like this (non-relevant
keys removed):
```json
{
"files": {
"file-one.txt": {
"filename": "file-one.txt",
"raw_url":
"https://gist.githubusercontent.com/<username>/<id>/raw/<file_id>/file-o
ne.txt",
"content": "..."
},
"file-two.txt": {
"filename": "file-two.txt",
"raw_url":
"https://gist.githubusercontent.com/<username>/<id>/raw/<file_id>/file-t
wo.txt",
"content": "..."
}
}
}
```
That response gets turned into a bunch of `GistFile` objects that are
then stored in a request-level `GistStore`. Crystal's JSON parsing does
not make it easy to parse json with arbitrary keys into objects. This is
because each key corresponds to an object property, like `property name
: String`. If Crystal doesn't know the keys ahead of time, there's no
way to know what methods to create.
That's a problem here because the key for each gist file is the unique
filename. Fortunately, the keys for each _file_ follows the same pattern
and are easy to parse into a `GistFile` object. To turn gist file JSON
into Crystal objects, the `GithubClient` turns the whole response into a
`JSON::Any` which is like a Hash. Then it extracts just the file data
objects and parses those into `GistFile` objects.
Those `GistFile` objects are then cached in a `GistStore` that is shared
for the page, which means one gist cache per request/article. `GistFile`
objects can be fetched out of the store by file, or if no file is
specified, it returns all files in the gist.
The GistFile is rendered as a link of the file's name to the file in
the gist on GitHub, and then a code block of the contents of the file.
In summary, the `PageConverter`:
* Scans the paragraphs for GitHub gists using `GistScanner`
* Requests their data from GitHub using the `GithubClient`
* Parses the response into `GistFile`s and populates the `GistStore`
* Passes that `GistStore` to the `ParagraphConverter` to use when
constructing the page nodes
Caching
-------
GitHub limits API requests to 5000/hour with a valid api token and
60/hour without. 60 is pretty tight for the usage that scribe.rip gets,
but 5000 is reasonable most of the time. Not every article has an
embedded gist, but some articles have multiple gists. A viral article
(of which Scribe has seen two at the time of this commit) might receive
a little over 127k hits/day, which is an average of over 5300/hour. If
that article had a gist, Scribe would reach the API limit during parts
of the day with high traffic. If it had multiple gists, it would hit it
even more. However, average traffic is around 30k visits/day which would
be well under the limit, assuming average load.
To help not hit that limit, a `GistStore` holds all the `GistFile`
objects per gist. The logic in `GistScanner` is smart enough to only
return unique gist URLs so each gist is only requested once even if
multiple files from one gist exist in an article. This limits the number
of times Scribe hits the GitHub API.
If Scribe is rate-limited, instead of populating a `GistCache` the
`PageConverter` will create a `RateLimitedGistStore`. This is an object
that acts like the `GistStore` but returns `RateLimitedGistFile` objects
instead of `GistFile` objects. This allows Scribe to gracefully degrade
in the event of reaching the rate limit.
If rate-limiting becomes a regular problem, Scribe could also be
reworked to fallback to the embedded gists again.
API Credentials
---------------
API credentials are in the form of a GitHub username and a personal
access token attached to that username. To get a token, visit
https://github.com/settings/tokens and create a new token. The only
permission it needs is `gist`.
This token is set via the `GITHUB_PERSONAL_ACCESS_TOKEN` environment
variable. The username also needs to be set via `GITHUB_USERNAME`. When
developing locally, these can both be set in the .env file.
Authentication is probably not necessary locally, but it's there if you
want to test. If either token is missing, unauthenticated requests are
made.
Rendering
---------
The node tree itself holds a `GithubGist` object. It has a reference to
the `GistStore` and the original gist URL. When it renders the page
requests the gist's `files`. The gist ID and optional file are detected,
and then used to request the file(s) from the `GistStore`. Gists render
as a list of each files contents and a link to the file on GitHub.
If the requests were rate limited, the store is a
`RateLimitedGistStore` and the files are `RateLimitedGistFile`s. These
rate-limited objects rendered with a link to the gist on GitHub and text
saying that Scribe has been rate-limited.
If somehow the file requested doesn't exist in the store, it displays
similarly to the rate-limited file but with "file missing" text instead
of "rate limited" text.
GitHub API docs: https://docs.github.com/en/rest/reference/gists
Rate Limiting docs:
https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-
limiting
The post id 34dead42a28 contained a new paragraph type: H2. Previously
the only known header types were H3 and H4. In this case, the paragraph
doesn't actually get rendered because it's the page title which is
removed from the page nodes (see commits 6baba803 and then fba87c10).
However, it somehow an author is able to get an H2 paragraph into the
page, it will display as an <h1> just as H3 displays as <h2> and H4
displays as <h3>.
The subtitle has been removed because it's difficult to find and error
prone to guess at. It is somewhat accessible from the post's
previewContent field in GraphQL but that can be truncated.
Right now this links to the user's medium page. It may link to an
internal page in the future.
Instead of the Page taking the author as a string, it now takes a
PostResponse::Creator object. The Articles::ShowPage then converts the
Creator (a name and user_id) to an author link.
Finally, I did some refactoring of UserAnchor (which I thought I was
going to use for this) to change it's userId attribute to user_id as is
Crystal convention.
PostResponse::Paragraph's that are of type IFRAME have extra data in the
iframe attribute to specify what's in the iframe. Not all data is the
same, however. I've identified three types and am using the new
EmbeddedConverter class to convert them:
* EmbeddedContent, the full iframe experience
* GithubGist, because medium or github treat embeds differently for
whatever reason
* EmbeddedLink, the old style, just a link to the content. Effectively
a fallback
The size of the original iframe is also specified as an attribute. This
code resizes it. The resizing is determined by figuring out the
width/height ratio and setting the width to 800.
EmbeddedContent can be displayed if we have an embed.ly url, which most
iframe response data has. GitHub gists are a notable exception. Gists
instead can be embedded simply by taking the gist URL and attaching .js
to the end. That becomes the iframe's src attribute.
The PostResponse::Paragraph's iframe attribute is nillable. Previous
code used lots of if-statements with variable bindings to work with the
possible nil values:
```crystal
if foo = obj.nillable_value
# obj.nillable_value was not nil and foo contains the value
else
# obj.nillable_value was nil so do something else
end
```
See https://crystal-lang.org/reference/syntax_and_semantics/if_var.html
for more info
In the EmbeddedConverter the monads library has been introduced to get
rid of at least one level of nillability. This wraps values in Maybe
which allows for a cleaner interface:
```crystal
Monads::Try(Value).new(->{ obj.nillable_value })
.to_maybe
.fmap(->(value: Value) { # do something with value })
.value_or(# value was nil, do something else)
```
This worked to get the iframe attribute from a Paragraph:
```crystal
Monads::Try(PostResponse::IFrame).new(->{ paragraph.iframe })
.to_maybe
.fmap(->(iframe : PostResponse::IFrame) { # iframe is not nil! })
.fmap(#and so on)
.value_or(Empty.new)
```
iframe only has one attribute: mediaResource which contains the iframe
data. That was used to determine one of the three types above.
Finally, Tufte.css has options for iframes. They mostly look good except
for tweets which are too small and weirdly in the center of the page
which actually looks off-center. That's for another day though.
Medium guides each post to have a Title and Subtitle. They are rendered
as the first two paragraphs: H3 and H4 respectively. If they exist, a
new PageConverter class extracts them and sets them on the page.
However, they aren't required. If the first two paragraphs aren't H3
and H4, the PageConverter falls back to using the first paragraph as
the title, and setting the subtitle to blank.
The remaining paragraphs are passed into the ParagraphConverter as
normal.
General CSS hygiene dictates that you shouldn't go beyond an H3 tag. H1
for the document title, H2 for section headings, and H3 for low-level
headings.
Example:
* Text: "strong and emphasized only"
* Markups:
* Strong: 0..10
* Emphasis: 7..21
First, get all the borders of the markups, including the start (0) and
end (text.size) indexes of the text in order:
```
[0, 7, 10, 21, 26]
```
Then attach markups to each range. Note that the ranges are exclusive;
they don't include the final number:
* 0...7: Strong
* 7...10: Strong, Emphasized
* 10...21: Emphasized
* 21...26: N/A
Bundle each range and it's related markups into a value object
RangeWithMarkup and return the list.
Loop through that list and recursively apply each markup to each
segment of text:
* Apply a `Strong` markup to the text "strong "
* Apply a `Strong` markup to the text "and"
* Wrap that in an `Emphasis` markup
* Apply an `Emphasis` markup to the text " emphasized"
* Leave the text " only" as is
---
This has the side effect of breaking up the nodes more than they need
to be broken up. For example right now the algorithm creates this HTML:
```
<strong>strong </strong><em><strong>and</strong></em>
```
instead of:
```
<strong>strong <em>and</em></strong>
```
But that's a task for another day.
The impetus for this change was to help make the MarkupConverter code
more robust. However, it's also possible that an Anchor can contain
styled text. For example, in markdown someone might write a link that
contains some <strong> text:
```markdown
[this link is so **good**](https://example.com)
```
This setup will now allow that. Unknown if UserAnchor can ever contain
any text that isn't just the user's name, but it's easy to deal with
and makes the typing much easier.
Instead of getting the full size image, the image can be fetched with a
width and height parameter so that only the resized data is
transferred. The url looks like this:
https://cdn-images-1.medium.com/fit/c/<width>/<height>/<media-id>
I picked a max image width of 800px. If the image width is more than
that, it scales the width down to 800, then applies that ratio to the
height. If it's smaller than that, the image is displayed as the
original.
Most images have a caption (or at least have the option of being
captioned). Instead of displaying the raw image, it's not rendered
inside a <figure> tag with a <figcaption> (possibly blank) as a
sibling. The <figcaption> can be marked up with links.
The API responds with a bunch of paragraphs which the client converts
into Paragraph objects.
This turns the paragraphs in a PostResponse's Paragraph objects into the
form needed to render them on a page. This includes converting flat list
elements into list elements nested by a UL. And adding a limited markups
along the way.
The array of paragraphs is passed to a recursive function. The function
takes the first paragraph and either wraps the (marked up) contents in a
container tag (like Paragraph or Heading3), and then moves onto the next
tag. If it finds a list, it starts parsing the next paragraphs as a list
instead.
Originally, this was implemented like so:
```crystal
paragraph = paragraphs.shift
if list?
convert_list([paragraph] + paragraphs)
end
```
However, passing the `paragraphs` after adding it to the already shifted
`paragraph` creates a new object. This means `paragraphs` won't be
mutated and once the list is parsed, it starts with the next element of
the list. Instead, the element is `shift`ed inside each converter.
```crystal
if paragraphs.first == list?
convert_list(paragraphs)
end
def convert_list(paragraphs)
paragraph = paragraphs.shift
# ...
end
```
When rendering, there is an Empty and Container object. These represent
a kind of "null object" for both leafs and parent objects respectively.
They should never actually render. Emptys are filtered out, and
Containers are never created explicitly but this will make the types
pass.
IFrames are a bit of a special case. Each IFrame has custom data on it
that this system would need to be aware of. For now, instead of trying
to parse the seemingly large number of iframe variations and dealing
with embedded iframe problems, this will just keep track of the source
page URL and send the user there with a link.