Commit Graph

7 Commits

Author SHA1 Message Date
Edward Loveall fb51270f87
Fix article ID parsing bug
Since the article ID regular expression wasn't anchored to the end of
the URL, it would grab characters after a / or - that were hex
characters. For example /@user/bacon-123abc would just grab `bac`. Not
great.

This anchors the ID at the end of the string so that it will be more
likely to catch IDs.
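
As a minimal sketch of the difference, the two patterns below are hypothetical stand-ins (they are not Scribe's actual regular expressions) that behave the way this message describes: the unanchored one stops at the first run of hex characters, while the anchored one only accepts a run that reaches the end of the string.

```crystal
# Hypothetical patterns for illustration only; not Scribe's real expressions.
UNANCHORED = /[\/-]([0-9a-f]+)/
ANCHORED   = /[\/-]([0-9a-f]+)$/

path = "/@user/bacon-123abc"

# The first hex run after a "/" or "-" is "bac" (the "o" in "bacon" ends it).
unanchored_id = path.match(UNANCHORED).try(&.[1])

# Requiring the run to end the string skips "bac" and finds the real ID.
anchored_id = path.match(ANCHORED).try(&.[1])

puts unanchored_id # bac
puts anchored_id   # 123abc
```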
2022-02-13 21:07:50 -05:00
Edward Loveall 1f517f9031
Link to full Medium URL on error page
Previously the link on the error page was only linking to the path
component of the URL, e.g. `/search`, but ignoring any query params, e.g.
`/search?q=hello`. This uses the HTTP::Request `resource` method which
appears to capture both.
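
A small sketch of the difference between `path` and `resource` on an `HTTP::Request`. The request is constructed by hand here, and the `https://medium.com` base and the `medium_url` variable are assumptions for illustration, not necessarily how Scribe builds the link.

```crystal
require "http"

# A hand-built request standing in for the one the error page receives.
request = HTTP::Request.new("GET", "/search?q=hello")

puts request.path     # /search
puts request.resource # /search?q=hello

# Assumed base URL, for illustration only.
medium_url = "https://medium.com" + request.resource
puts medium_url # https://medium.com/search?q=hello
```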
2022-02-13 10:13:24 -05:00
Edward Loveall 24d3ab9ab3
Better article ID parsing
A new ArticleIdParser class takes in an HTTP::Request object and parses
the article ID from it. It intentionally fails on tag, user, and search
pages and attempts to only catch articles.
2022-02-13 10:10:46 -05:00
Edward Loveall f056a0b68a
Better error pages
Instead of showing the default Lucky error page, the styles now match
Scribe. In addition, if a URL can't be parsed, Scribe gives some
information as to why this might be (that Scribe can only deal with
article pages).
2022-02-12 17:56:36 -05:00
Edward Loveall 7d0bc37efd
Fix markup errors caused by UTF-16/8 differences
Medium uses UTF-16 character offsets (likely to make it easier to parse
in JavaScript) but Crystal uses UTF-8. Converting strings to UTF-16 to
do offset calculation, then back to UTF-8, fixes some markup bugs.

---

Medium calculates markup offsets using UTF-16 encoding. Some characters,
like emoji, count as multiple UTF-16 code units, which affects those
offsets. For example, in UTF-16 💸 takes two code units, but Crystal
strings count it as one character. This is a problem for markup
generation because it can offset the markup and even cause out-of-range
errors.

Take the following example:

💸💸!

Imagine that `!` is bold but the emoji aren't. For Crystal, the bold text
starts at char index 2 and ends at char index 3. Medium's markup will say
it goes from character 4 to 5. In a 3 character string like this, trying
to access character range 4...5 is an error because both ends of the
range are already out of bounds.

My theory is that this is meant to be compatible with JavaScript's
string length calculations, as Medium is primarily a platform built for
the web:

```js
"a".length // 1
"💸".length // 2
"👩‍❤️‍💋‍👩".length // 11
```

To get these same numbers in Crystal, strings must be converted to
UTF-16:

```crystal
"a".to_utf16.size # 1
"💸".to_utf16.size # 2
"👩‍❤️‍💋‍👩".to_utf16.size # 11
```

The MarkupConverter now converts text into UTF-16 code unit arrays on
initialization. Once it's figured out the range of code units needed for
each piece of markup, it converts that range back into a UTF-8 string.
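
A minimal sketch of that round trip using the 💸💸! example from above. The variable names and the hard-coded range are assumptions for illustration; this is not the MarkupConverter's actual code.

```crystal
text = "💸💸!"

# Hypothetical markup range in UTF-16 code units, as Medium would report it.
markup_start = 4
markup_end = 5

# Convert once to UTF-16 so Medium's offsets line up with the indexes.
utf16 = text.to_utf16 # Slice(UInt16)

# Slice by code units, then convert that piece back to a Crystal string.
bold = String.from_utf16(utf16[markup_start...markup_end])

puts text.size  # 3
puts utf16.size # 5
puts bold       # !
```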
2022-01-30 11:53:22 -05:00
Edward Loveall 648a933b24
Provide a list of instances as JSON
This is for extensions or other tools that wish to have a list of
instances. It can be accessed by visiting the raw file on sourcehut:

https://git.sr.ht/~edwardloveall/scribe/blob/main/docs/instances.json
2022-01-29 12:58:08 -05:00
Edward Loveall 3a8ad82252
Add CHANGELOG
2022-01-23 15:06:01 -05:00