Since the article ID regular expression wasn't anchored to the end of
the URL, it would grab characters after a / or - that were hex
characters. For example /@user/bacon-123abc would just grab `bac`. Not
great.
This anchors the ID at the end of the string so that it will be more
likely to catch IDs.
Previously the link on the error page was only linking to the path
component of the url, e.g. `/search` but ignoring any query params e.g.
`/search?q=hello`. This uses the HTTP::Request `resource` method which
appears to capture both.
A new ArticleIdParser class takes in an HTTP::Request object and parses
the article ID from it. It intentinoally fails on tag, user, and search
pages and attempts to only catch articles.
Instead of showing the default Lucky error page, the styles now match
Scribe. In addition, if a URL can't be parsed, Scribe gives some
information as to why this might be (that Scribe can only deal with an
article pages)
Medium uses UTF-16 character offsets (likely to make it easier to parse
in JavaScript) but Crystal uses UTF-8. Converting strings to UTF-16 to
do offset calculation then back to UFT-8 fixes some markup bugs.
---
Medium calculates markup offsets using UTF-16 encoding. Some characters
like Emoji are count as multiple bytes which affects those offsets. For
example in UTF-16 💸 is worth two bytes, but Crystal strings only count
it as one. This is a problem for markup generation because it can
offset the markup and even cause out-of-range errors.
Take the following example:
💸💸!
Imagine that `!` was bold but the emoji isn't. For Crystal, this starts
at char index 2, end at char index 3. Medium's markup will say markup
goes from character 4 to 5. In a 3 character string like this, trying
to access character range 4...5 is an error because 5 is already out of
bounds.
My theory is that this is meant to be compatible with JavaScript's
string length calculations, as Medium is primarily a platform built for
the web:
```js
"a".length // 1
"💸".length // 2
"👩❤️💋👩".length // 11
```
To get these same numbers in Crystal strings must be converted to
UTF-16:
```crystal
"a".to_utf16.size # 1
"💸".to_utf16.size # 2
"👩❤️💋👩".to_utf16.size # 11
```
The MarkupConverter now converts text into UFT-16 byte arrays on
initialization. Once it's figured out the range of bytes needed for
each piece of markup, it converts it back into UTF-8 strings.