scribe

Author	SHA1	Message	Date
Edward Loveall	fb51270f87	Fix article ID parsing bug Since the article ID regular expression wasn't anchored to the end of the URL, it would grab characters after a / or - that were hex characters. For example /@user/bacon-123abc would just grab `bac`. Not great. This anchors the ID at the end of the string so that it will be more likely to catch IDs.	2022-02-13 21:07:50 -05:00
Edward Loveall	1f517f9031	Link to full Medium URL on error page Previously the link on the error page only linked to the path component of the URL, e.g. `/search`, and ignored any query params, e.g. `/search?q=hello`. This uses the HTTP::Request `resource` method, which appears to capture both.	2022-02-13 10:13:24 -05:00
Edward Loveall	24d3ab9ab3	Better article ID parsing A new ArticleIdParser class takes in an HTTP::Request object and parses the article ID from it. It intentionally fails on tag, user, and search pages and attempts to only catch articles.	2022-02-13 10:10:46 -05:00
Edward Loveall	f056a0b68a	Better error pages Instead of showing the default Lucky error page, the styles now match Scribe. In addition, if a URL can't be parsed, Scribe gives some information as to why this might be (that Scribe can only deal with article pages).	2022-02-12 17:56:36 -05:00
Edward Loveall	7d0bc37efd	Fix markup errors caused by UTF-16/8 differences Medium uses UTF-16 character offsets (likely to make it easier to parse in JavaScript) but Crystal uses UTF-8. Converting strings to UTF-16 to do offset calculation then back to UTF-8 fixes some markup bugs. --- Medium calculates markup offsets using UTF-16 encoding. Some characters, like emoji, count as multiple UTF-16 code units, which affects those offsets. For example, in UTF-16 💸 is worth two code units, but Crystal strings only count it as one character. This is a problem for markup generation because it can offset the markup and even cause out-of-range errors. Take the following example: 💸💸! Imagine that `!` is bold but the emoji aren't. For Crystal, this starts at char index 2 and ends at char index 3. Medium's markup will say the markup goes from character 4 to 5. In a 3 character string like this, trying to access character range 4...5 is an error because 4 is already out of bounds. My theory is that this is meant to be compatible with JavaScript's string length calculations, as Medium is primarily a platform built for the web: ```js "a".length // 1 "💸".length // 2 "👩‍❤️‍💋‍👩".length // 11 ``` To get these same numbers in Crystal, strings must be converted to UTF-16: ```crystal "a".to_utf16.size # 1 "💸".to_utf16.size # 2 "👩‍❤️‍💋‍👩".to_utf16.size # 11 ``` The MarkupConverter now converts text into UTF-16 code unit arrays on initialization. Once it's figured out the range of code units needed for each piece of markup, it converts it back into UTF-8 strings.	2022-01-30 11:53:22 -05:00
Edward Loveall	648a933b24	Provide a list of instances as JSON This is for extensions or other tools that wish to have a list of instances. It can be accessed by visiting the raw file on sourcehut: https://git.sr.ht/~edwardloveall/scribe/blob/main/docs/instances.json	2022-01-29 12:58:08 -05:00
Edward Loveall	3a8ad82252	Add CHANGELOG	2022-01-23 15:06:01 -05:00
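
The anchoring fix described in fb51270f87 can be illustrated with a minimal sketch. The two patterns and the `extractId` helper below are hypothetical, written only to reproduce the behavior the commit message describes; they are not Scribe's actual regular expressions, and the sketch is in JavaScript rather than Crystal.

```javascript
// Hypothetical patterns: a run of hex characters following a "/" or "-".
// The second one is anchored to the end of the string with "$".
const UNANCHORED_ID = /[\/-]([0-9a-f]+)/;
const ANCHORED_ID = /[\/-]([0-9a-f]+)$/;

// Return the captured ID, or null when the pattern does not match.
function extractId(path, pattern) {
  const match = path.match(pattern);
  return match ? match[1] : null;
}

const path = "/@user/bacon-123abc";

// Without the anchor, the first hex run after a "/" wins: the "bac" in "bacon".
console.log(extractId(path, UNANCHORED_ID)); // "bac"

// With the anchor, only the hex run at the very end of the path can match.
console.log(extractId(path, ANCHORED_ID)); // "123abc"
```

The anchor works because the regex engine can no longer stop at the first partial match; the captured run has to reach the end of the string.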

7 commits
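
The offset mismatch described in 7d0bc37efd can be reproduced with a short sketch. JavaScript strings are indexed by UTF-16 code units, so the sketch shows both counts side by side. The `utf16OffsetToCharIndex` helper and the `markup` object shape are assumptions made for illustration; they are not taken from Scribe's MarkupConverter or from Medium's API.

```javascript
// Hypothetical helper: translate a UTF-16 code unit offset into a
// character (code point) index, which is how Crystal indexes strings.
function utf16OffsetToCharIndex(text, utf16Offset) {
  // slice() works on UTF-16 code units; Array.from() iterates code points.
  return Array.from(text.slice(0, utf16Offset)).length;
}

const text = "💸💸!";

console.log(text.length);             // 5 (UTF-16 code units)
console.log(Array.from(text).length); // 3 (code points)

// Assumed shape for a piece of markup that makes "!" bold,
// using UTF-16 offsets as the commit message describes.
const markup = { start: 4, end: 5 };

console.log(utf16OffsetToCharIndex(text, markup.start)); // 2
console.log(utf16OffsetToCharIndex(text, markup.end));   // 3

// Slicing by code units recovers the intended text.
console.log(text.slice(markup.start, markup.end)); // "!"
```

Applying the range 4...5 directly to the three-character view of the string would fall outside its bounds, which is the failure the commit fixes by doing the offset arithmetic in UTF-16.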