scribe

Author	SHA1	Message	Date
Edward Loveall	80b6b51804	Fix redirection pattern Commit `6ea0586423` improved redirection instructions, but regressed in one way. The "Redirect to" pattern specified a slash which was accounted for in the main pattern, which resulted in a double slash: https://medium.com/@user/post-123456abcdef would redirect to https://scribe.rip//@user/post-123456abcdef This removes the extra slash	2022-03-12 12:03:23 -05:00
Edward Loveall	fb51270f87	Fix article ID parsing bug Since the article ID regular expression wasn't anchored to the end of the URL, it would grab characters after a / or - that were hex characters. For example /@user/bacon-123abc would just grab `bac`. Not great. This anchors the ID at the end of the string so that it will be more likely to catch IDs.	2022-02-13 21:07:50 -05:00
Edward Loveall	1f517f9031	Link to full Medium URL on error page Previously the link on the error page was only linking to the path component of the url, e.g. `/search` but ignoring any query params e.g. `/search?q=hello`. This uses the HTTP::Request `resource` method which appears to capture both.	2022-02-13 10:13:24 -05:00
Edward Loveall	24d3ab9ab3	Better article ID parsing A new ArticleIdParser class takes in an HTTP::Request object and parses the article ID from it. It intentionally fails on tag, user, and search pages and attempts to only catch articles.	2022-02-13 10:10:46 -05:00
Edward Loveall	f056a0b68a	Better error pages Instead of showing the default Lucky error page, the styles now match Scribe. In addition, if a URL can't be parsed, Scribe gives some information as to why this might be (namely, that Scribe can only handle article pages).	2022-02-12 17:56:36 -05:00
Edward Loveall	7d0bc37efd	Fix markup errors caused by UTF-16/8 differences Medium uses UTF-16 character offsets (likely to make it easier to parse in JavaScript) but Crystal uses UTF-8. Converting strings to UTF-16 to do offset calculation then back to UTF-8 fixes some markup bugs. --- Medium calculates markup offsets using UTF-16 encoding. Some characters, such as emoji, count as multiple UTF-16 code units, which affects those offsets. For example, in UTF-16 💸 takes two code units, but Crystal strings count it as one character. This is a problem for markup generation because it can shift the markup and even cause out-of-range errors. Take the following example: 💸💸! Imagine that `!` is bold but the emoji aren't. For Crystal, the bold span starts at char index 2 and ends at char index 3. Medium's markup will say the span goes from offset 4 to 5. In a 3 character string like this, trying to access character range 4...5 is an error because the range lies past the end of the string. My theory is that this is meant to be compatible with JavaScript's string length calculations, as Medium is primarily a platform built for the web: ```js "a".length // 1 "💸".length // 2 "👩‍❤️‍💋‍👩".length // 11 ``` To get these same numbers in Crystal, strings must be converted to UTF-16: ```crystal "a".to_utf16.size # 1 "💸".to_utf16.size # 2 "👩‍❤️‍💋‍👩".to_utf16.size # 11 ``` The MarkupConverter now converts text into UTF-16 code unit arrays on initialization. Once it has figured out the range of code units needed for each piece of markup, it converts that range back into a UTF-8 string.	2022-01-30 11:53:22 -05:00
Edward Loveall	648a933b24	Provide a list of instances as JSON This is for extensions or other tools that wish to have a list of instances. It can be accessed by visiting the raw file on sourcehut: https://git.sr.ht/~edwardloveall/scribe/blob/main/docs/instances.json	2022-01-29 12:58:08 -05:00
Edward Loveall	3a8ad82252	Add CHANGELOG	2022-01-23 15:06:01 -05:00
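
The anchoring fix described in commit `fb51270f87` can be illustrated with a small sketch. It is written in JavaScript rather than Crystal, and both regular expressions are assumptions made up for illustration; they are not Scribe's actual patterns. The only point is how an end-of-string anchor changes which hex run is captured.

```javascript
// Hypothetical patterns, for illustration only (not taken from Scribe).
// Both capture a run of hex characters that follows a "/" or "-".
const unanchored = /[\/-]([0-9a-f]+)/;
const anchored = /[\/-]([0-9a-f]+)$/; // same, but must reach the end of the path

// Example path from the commit message.
const path = "/@user/bacon-123abc";

// Without the anchor, the first hex run after a "/" wins: the "bac" in "bacon".
const wrongId = path.match(unanchored)[1]; // "bac"

// With the anchor, only the hex run that ends the path can match.
const rightId = path.match(anchored)[1]; // "123abc"

console.log(wrongId, rightId);
```

The anchored pattern skips `bacon` because the `o` stops the hex run before the end of the string, so the match can only succeed at the final `-123abc`.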

8 commits
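
The offset mismatch described in commit `7d0bc37efd` can be sketched as follows. This is a minimal JavaScript illustration, not the MarkupConverter itself: the helper name `utf16OffsetToCodePointIndex` and the `markup` object shape are hypothetical. JavaScript strings are indexed by UTF-16 code units, so the sketch goes the other way from the Crystal fix and maps Medium-style UTF-16 offsets onto per-character (code point) indices, which is the indexing Crystal strings use.

```javascript
// Hypothetical helper: map a UTF-16 code-unit offset to a code-point index.
function utf16OffsetToCodePointIndex(text, utf16Offset) {
  let units = 0; // UTF-16 code units consumed so far
  let index = 0; // code points consumed so far
  for (const ch of text) { // for...of iterates a string by code point
    if (units >= utf16Offset) break;
    units += ch.length; // 1 for BMP characters, 2 for a surrogate pair
    index += 1;
  }
  return index;
}

const text = "💸💸!";

// Assumed markup shape: UTF-16 offsets covering the bold "!".
const markup = { start: 4, end: 5 };

const codePoints = Array.from(text); // ["💸", "💸", "!"]
const start = utf16OffsetToCodePointIndex(text, markup.start); // 2
const end = utf16OffsetToCodePointIndex(text, markup.end); // 3

// Slicing by code point with the converted indices recovers the bold text.
const bold = codePoints.slice(start, end).join(""); // "!"

console.log(text.length, codePoints.length, start, end, bold);
```

Using the raw offsets 4 and 5 against the three-element `codePoints` array would select nothing, which mirrors the out-of-range problem the commit message describes.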