scribe

Author	SHA1	Message	Date
Edward Loveall	24d3ab9ab3	Better article ID parsing A new ArticleIdParser class takes in an HTTP::Request object and parses the article ID from it. It intentionally fails on tag, user, and search pages and attempts to only catch articles.	2022-02-13 10:10:46 -05:00
Edward Loveall	f056a0b68a	Better error pages Instead of showing the default Lucky error page, the styles now match Scribe. In addition, if a URL can't be parsed, Scribe gives some information as to why this might be (namely, that Scribe can only handle article pages).	2022-02-12 17:56:36 -05:00
Edward Loveall	7d0bc37efd	Fix markup errors caused by UTF-16/8 differences
Medium uses UTF-16 character offsets (likely to make it easier to parse in JavaScript) but Crystal uses UTF-8. Converting strings to UTF-16 to do offset calculation then back to UTF-8 fixes some markup bugs.
---
Medium calculates markup offsets using UTF-16 encoding. Some characters, like emoji, count as multiple UTF-16 code units, which affects those offsets. For example, in UTF-16 💸 is two code units, but Crystal strings count it as one character. This is a problem for markup generation because it can offset the markup and even cause out-of-range errors.
Take the following example: 💸💸! Imagine that `!` is bold but the emoji aren't. For Crystal, the bold markup starts at char index 2 and ends at char index 3. Medium's markup will say the markup goes from character 4 to 5. In a 3 character string like this, trying to access character range 4...5 is an error because 5 is already out of bounds.
My theory is that this is meant to be compatible with JavaScript's string length calculations, as Medium is primarily a platform built for the web:
```js
"a".length // 1
"💸".length // 2
"👩‍❤️‍💋‍👩".length // 11
```
To get these same numbers in Crystal, strings must be converted to UTF-16:
```crystal
"a".to_utf16.size # 1
"💸".to_utf16.size # 2
"👩‍❤️‍💋‍👩".to_utf16.size # 11
```
The MarkupConverter now converts text into arrays of UTF-16 code units on initialization. Once it's figured out the range of code units needed for each piece of markup, it converts that range back into a UTF-8 string.	2022-01-30 11:53:22 -05:00
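As a rough illustration of the UTF-16 offset fix described in 7d0bc37efd, the sketch below slices a string by UTF-16 code unit offsets and converts the result back to a regular Crystal string. The helper name `utf16_slice` is hypothetical and is not taken from Scribe's `MarkupConverter`; only `String#to_utf16` and `String.from_utf16` come from the standard library.

```crystal
# Hypothetical helper (not Scribe's actual code): returns the part of `text`
# between two UTF-16 code unit offsets. `finish` is exclusive, matching the
# 4...5 range from the example in the commit message.
def utf16_slice(text : String, start : Int32, finish : Int32) : String
  units = text.to_utf16 # Slice(UInt16), one entry per UTF-16 code unit
  String.from_utf16(units[start, finish - start])
end

puts "💸💸!".size                # 3 (characters)
puts "💸💸!".to_utf16.size       # 5 (UTF-16 code units)
puts utf16_slice("💸💸!", 4, 5)  # !
```

Indexing the string directly with 4...5 would be out of range, while the same offsets are valid against the UTF-16 code units.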
Edward Loveall	648a933b24	Provide a list of instances as JSON This is for extensions or other tools that wish to have a list of instances. It can be accessed by visiting the raw file on sourcehut: https://git.sr.ht/~edwardloveall/scribe/blob/main/docs/instances.json	2022-01-29 12:58:08 -05:00
Edward Loveall	08f38a4d25	Add GitHub Gist authentication instructions	2022-01-23 16:08:23 -05:00
Edward Loveall	3a8ad82252	Add CHANGELOG	2022-01-23 15:06:01 -05:00
Edward Loveall	7518a035b1	Proxy GitHub gists with rate limiting
Previously, GitHub gists were embedded. The gist URL would be detected in a paragraph and the page would render a script like:
```html
<script src="https://gist.github.com/user/gist_id.js"></script>
```
The script would then embed the gist on the page. However, gists contain multiple files. It's technically possible to embed a single file in the same way by appending a `file` query param:
```html
<script src="https://gist.github.com/user/gist_id.js?file=foo.txt"></script>
```
I wanted to try and tackle proxying gists instead.
Overview
--------
At a high level the `PageConverter` kicks off the work of fetching and storing the gist content, then sends that content down to the `ParagraphConverter`. When a paragraph comes up that contains a gist embed, it retrieves the previously fetched content. This allows all the necessary content to be fetched up front so the minimum number of requests need to be made.
Fetching Gists
--------------
There is now a `GithubClient` class that gets gist content from GitHub's REST API. The gist API response looks something like this (non-relevant keys removed):
```json
{
  "files": {
    "file-one.txt": {
      "filename": "file-one.txt",
      "raw_url": "https://gist.githubusercontent.com/<username>/<id>/raw/<file_id>/file-one.txt",
      "content": "..."
    },
    "file-two.txt": {
      "filename": "file-two.txt",
      "raw_url": "https://gist.githubusercontent.com/<username>/<id>/raw/<file_id>/file-two.txt",
      "content": "..."
    }
  }
}
```
That response gets turned into a bunch of `GistFile` objects that are then stored in a request-level `GistStore`. Crystal's JSON parsing does not make it easy to parse JSON with arbitrary keys into objects. This is because each key corresponds to an object property, like `property name : String`. If Crystal doesn't know the keys ahead of time, there's no way to know what methods to create. That's a problem here because the key for each gist file is the unique filename.
Fortunately, the keys for each _file_ follow the same pattern and are easy to parse into a `GistFile` object. To turn gist file JSON into Crystal objects, the `GithubClient` turns the whole response into a `JSON::Any`, which is like a Hash. Then it extracts just the file data objects and parses those into `GistFile` objects. Those `GistFile` objects are then cached in a `GistStore` that is shared for the page, which means one gist cache per request/article. `GistFile` objects can be fetched out of the store by file, or if no file is specified, it returns all files in the gist. The `GistFile` is rendered as a link of the file's name to the file in the gist on GitHub, and then a code block of the contents of the file.
In summary, the `PageConverter`:
* Scans the paragraphs for GitHub gists using `GistScanner`
* Requests their data from GitHub using the `GithubClient`
* Parses the response into `GistFile`s and populates the `GistStore`
* Passes that `GistStore` to the `ParagraphConverter` to use when constructing the page nodes
Caching
-------
GitHub limits API requests to 5000/hour with a valid API token and 60/hour without. 60 is pretty tight for the usage that scribe.rip gets, but 5000 is reasonable most of the time. Not every article has an embedded gist, but some articles have multiple gists. A viral article (of which Scribe has seen two at the time of this commit) might receive a little over 127k hits/day, which is an average of over 5300/hour. If that article had a gist, Scribe would reach the API limit during parts of the day with high traffic. If it had multiple gists, it would hit it even more. However, average traffic is around 30k visits/day, which would be well under the limit, assuming average load. To help avoid hitting that limit, a `GistStore` holds all the `GistFile` objects per gist. The logic in `GistScanner` is smart enough to only return unique gist URLs so each gist is only requested once even if multiple files from one gist exist in an article.
This limits the number of times Scribe hits the GitHub API. If Scribe is rate-limited, instead of populating a `GistStore` the `PageConverter` will create a `RateLimitedGistStore`. This is an object that acts like the `GistStore` but returns `RateLimitedGistFile` objects instead of `GistFile` objects. This allows Scribe to gracefully degrade in the event of reaching the rate limit. If rate-limiting becomes a regular problem, Scribe could also be reworked to fall back to the embedded gists again.
API Credentials
---------------
API credentials are in the form of a GitHub username and a personal access token attached to that username. To get a token, visit https://github.com/settings/tokens and create a new token. The only permission it needs is `gist`. This token is set via the `GITHUB_PERSONAL_ACCESS_TOKEN` environment variable. The username also needs to be set via `GITHUB_USERNAME`. When developing locally, these can both be set in the .env file. Authentication is probably not necessary locally, but it's there if you want to test. If either credential is missing, unauthenticated requests are made.
Rendering
---------
The node tree itself holds a `GithubGist` object. It has a reference to the `GistStore` and the original gist URL. When it renders, the page requests the gist's `files`. The gist ID and optional file are detected, and then used to request the file(s) from the `GistStore`. Gists render as a list of each file's contents and a link to the file on GitHub. If the requests were rate limited, the store is a `RateLimitedGistStore` and the files are `RateLimitedGistFile`s. These rate-limited objects are rendered with a link to the gist on GitHub and text saying that Scribe has been rate-limited. If somehow the file requested doesn't exist in the store, it displays similarly to the rate-limited file but with "file missing" text instead of "rate limited" text.
GitHub API docs: https://docs.github.com/en/rest/reference/gists
Rate Limiting docs: https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting	2022-01-23 15:05:46 -05:00
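The arbitrary-key parsing approach described in 7518a035b1 might look roughly like the sketch below: the response is parsed into a `JSON::Any`, the values under `files` are pulled out, and each one is parsed into a typed object. The `GistFile` struct and `parse_gist_files` method here are illustrative stand-ins, not Scribe's actual classes, and only the three keys shown in the sample response are mapped.

```crystal
require "json"

# Illustrative stand-in for the GistFile described in the commit message.
# Keys in the JSON that are not listed here are ignored.
struct GistFile
  include JSON::Serializable

  getter filename : String
  getter raw_url : String
  getter content : String
end

# Hypothetical helper: the filenames used as keys are unknown ahead of time,
# so the response is read as JSON::Any and only the values are parsed.
def parse_gist_files(body : String) : Array(GistFile)
  JSON.parse(body)["files"].as_h.values.map do |file_json|
    GistFile.from_json(file_json.to_json)
  end
end

body = %({"files": {
  "file-one.txt": {"filename": "file-one.txt", "raw_url": "https://example.com/raw/file-one.txt", "content": "one"},
  "file-two.txt": {"filename": "file-two.txt", "raw_url": "https://example.com/raw/file-two.txt", "content": "two"}
}})

files = parse_gist_files(body)
puts files.map(&.filename).join(", ") # file-one.txt, file-two.txt
```

Round-tripping each value through `to_json` is simple rather than fast. It keeps the typed parsing in `JSON::Serializable` while sidestepping the unknown keys.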
Edward Loveall	8737ca7897	Add scribe.bus-hit.me instance	2022-01-16 22:05:31 -05:00
Edward Loveall	27234bd32a	Ensure that src/version is up-to-date when building This is an experiment to see if it forces me to actually have updated the version before I build. The idea is that I need to actually commit the version which will make it more likely that all instances can pull down the code and display the correct version if I've done it myself. It uses `git show` to grab the committed contents of src/version then checks to see if it matches today's date.	2022-01-15 16:31:02 -05:00
Edward Loveall	c775072b3d	Add instructions for Lucky config variables The most common is "How do I set my custom domain" (answer: APP_DOMAIN) but this also requires setting LUCKY_ENV=production which requires SECRET_KEY_BASE, DATABASE_URL, and PORT	2022-01-15 16:29:46 -05:00
Edward Loveall	46d87930b8	Use FAQ entry to explain custom domains	2022-01-08 20:15:46 -05:00
Edward Loveall	037bc7cd0f	Add visible version This is to be able to track which instances (including the main one) have which fixes	2022-01-04 21:26:53 -05:00
Edward Loveall	f7e82ffd03	Home page instructions for custom domains	2022-01-04 21:16:56 -05:00
Edward Loveall	6ea0586423	Improve Redirector extension instructions This specifies advanced options for configuring the Redirector extension. If everything is left on (like images) things will break (like images). It also improves the regular expression a bit to account for the image CDN. Co-authored-by: Austin Huang <im@austinhuang.me>	2022-01-04 20:58:30 -05:00
miklobit	d8d4913913	update crystal version in Dockerfile	2021-12-15 21:29:41 -05:00
Edward Loveall	1449acc500	Upgrade Crystal to 1.2.1 and Lucky to 0.29.0	2021-12-12 12:01:55 -05:00
Edward Loveall	e365ee8be5	Add FAQ on how to use Scribe with custom domains This is generic so as to not call out any specific website.	2021-12-04 14:05:39 -05:00
Edward Loveall	9f2b2a6096	Add citizen4.eu instance	2021-11-20 11:00:34 -05:00
Edward Loveall	66acd562ae	Update readme	2021-11-20 11:00:34 -05:00
Edward Loveall	25464acabe	Add instance docs	2021-11-11 11:33:22 -05:00
Edward Loveall	3f56fac408	Add project goals to README	2021-11-07 12:22:11 -05:00
Edward Loveall	027e59645d	Support null image widths and heights	2021-11-06 13:22:03 -04:00
Edward Loveall	4b354c659f	Add FAQ	2021-10-23 15:34:13 -04:00
Edward Loveall	5df9c44a5c	Support null text on paragraphs I think this was an old feature on medium, but you can see examples of null text on this post: https://medium.com/message/the-joy-of-typing-fd8d091ab8ef	2021-10-20 20:40:43 -04:00
Edward Loveall	7166b7d834	Add SECTION_CAPTION paragraph type This doesn't seem to be rendered on medium.com. Here's a post that has one, but the text is nowhere on the page: https://medium.com/message/the-joy-of-typing-fd8d091ab8ef This help article hints that it might have been a feature at one point that they don't allow anymore: https://medium.com/@Medium/images-652ee60abea6	2021-10-20 20:39:58 -04:00
Edward Loveall	513d590ce3	Point source link at sr.ht project page Instead of the git page. That way it's easier to find the mailing lists and whatnot.	2021-10-16 16:23:15 -04:00
Edward Loveall	f7ad92f4bf	Parsing Fix: Add H2 Paragraph type The post id 34dead42a28 contained a new paragraph type: H2. Previously the only known header types were H3 and H4. In this case, the paragraph doesn't actually get rendered because it's the page title which is removed from the page nodes (see commits `6baba803` and then `fba87c10`). However, if an author is somehow able to get an H2 paragraph into the page, it will display as an <h1> just as H3 displays as <h2> and H4 displays as <h3>.	2021-10-16 16:23:15 -04:00
tnp	eaf25ef23a	Add Dockerfile	2021-10-16 10:56:15 -04:00
Martin Puppe	0c018a898a	Add support for development with Nix This patch adds support for development with the Nix package manager. In order to support the traditional nix-shell tool as well as the (still experimental) Nix Flakes feature of the upcoming version of Nix, this patch adds shell.nix and flake.nix/flake.lock. Usage instructions have been added to the README.	2021-10-15 08:56:15 -04:00
Martin Puppe	56b6d546db	Further improve proposed pattern for Redirector This patch further improves the proposed pattern for the Redirector extension. In contrast to the old pattern, …
* … it will redirect the URL https://medium.com.
* … it will not redirect URLs with top-level domains like mediumXcom. (This point is purely theoretical, but it makes the regular expression more correct and consistent.)
* … it will not redirect URLs like https://link.medium.com/AXEtCilplkb which Scribe currently cannot handle. These are shortened URLs that users get when they use the Twitter button on Medium to share a post.
In order to implement the last point (not matching link.medium.com), the pattern uses negative lookbehind. This feature of regular expressions is supported by all recent browsers for which Redirector is available (Firefox, Chrome, Edge, Opera)[^1], including the current version of Firefox ESR (Extended Support Release).
[^1]: https://caniuse.com/js-regexp-lookbehind	2021-10-15 08:55:26 -04:00
Edward Loveall	472b0092c8	Add mailing list for patches to README	2021-10-14 21:15:19 -04:00
Amolith	9fcf37f416	Use app_domain in Redirector example In the current redirector example, "scribe.rip" is hardcoded as the destination. This patch simply changes that to use the app_domain environment variable, so people wanting to use a community instance aren't mistakenly redirected to the main scribe.rip instance.	2021-10-14 18:10:46 -04:00
Martin Puppe	0d9170b8d6	Improve proposed pattern for Redirector extension The old pattern matches all host names that end with medium.com. The new pattern matches only medium.com and its sub-domains. For example, the old pattern would have matched https://foomedium.com/@user/post-123456abcdef.	2021-10-13 21:51:55 -04:00
Edward Loveall	e127a67c6b	Ensure gists display well at all device widths	2021-10-11 20:09:58 -04:00
Edward Loveall	91f4aae0bc	Add an example and tagline to homepage	2021-10-11 20:03:31 -04:00
Edward Loveall	bb94fb41b1	Support medium's redirectUrl query param When a post has a gi= query param, Medium makes a global_identifier "query". This redirects via a 307 temporary redirect to a URL that looks like this: https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fexample.com%2Fmy-post-000000000000 Previously, Scribe looked for the Medium post id in the URL's path, not its query params, since query params can include other garbage like medium_utm (not related to medium.com). Now it looks first for the post id in the path, then looks to the redirectUrl as a fallback.	2021-10-11 12:04:17 -04:00
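The `redirectUrl` fallback from bb94fb41b1 can be sketched with the standard library's `URI` class, which decodes query param values when they are read. The method name `redirect_target` is hypothetical; how Scribe actually combines this with path parsing is not shown in the commit message.

```crystal
require "uri"

# Hypothetical helper: returns the decoded redirectUrl query param,
# or nil when the URL has no such param.
def redirect_target(url : String) : String?
  URI.parse(url).query_params["redirectUrl"]?
end

url = "https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fexample.com%2Fmy-post-000000000000"
puts redirect_target(url) # https://example.com/my-post-000000000000
```

The decoded URL could then go through the same post id lookup that is applied to a normal path.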
Edward Loveall	91687bb689	Add automatic redirect instructions to homepage	2021-10-10 15:05:56 -04:00
Edward Loveall	dbddfc9cb4	Update README	2021-10-10 14:52:37 -04:00
Edward Loveall	0998a87622	Remove Postgres stuff from script/setup This app doesn't use a database so there's no point.	2021-10-10 14:52:14 -04:00
Edward Loveall	fba87c1076	Improve title parsing The subtitle has been removed because it's difficult to find and error prone to guess at. It is somewhat accessible from the post's previewContent field in GraphQL but that can be truncated.	2021-10-03 18:14:46 -04:00
Edward Loveall	2808505b4e	Add instructions on how to view a post	2021-10-03 17:21:17 -04:00
Edward Loveall	aacef34a14	Accept all known medium post path types Including:
* https://example.com/my-cool-post-123456abcdef
* https://example.com/123456abcdef
* https://medium.com/@user/my-cool-post-123456abcdef
* https://medium.com/user/my-cool-post-123456abcdef
* https://medium.com/p/my-cool-post-123456abcdef
* https://medium.com/posts/my-cool-post-123456abcdef
* https://medium.com/p/123456abcdef
Replace the domain of any of those URLs with the Scribe domain and it should resolve.	2021-10-03 16:45:20 -04:00
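One way to cover all the path shapes listed in aacef34a14 is to look only at the last path segment and take the trailing hexadecimal token. The sketch below assumes a post id is 8 to 12 lowercase hex characters; that length range, the `POST_ID` constant, and the `post_id_from` method are assumptions for illustration and not Scribe's actual parser.

```crystal
require "uri"

# Assumption: a post id is a run of 8 to 12 lowercase hex characters that
# either makes up the whole last path segment or follows its final hyphen.
POST_ID = /(?:^|-)([0-9a-f]{8,12})$/

# Hypothetical helper: returns the post id, or nil if the path has none.
def post_id_from(url : String) : String?
  last_segment = URI.parse(url).path.split("/").last? || ""
  if match = POST_ID.match(last_segment)
    match[1]
  end
end

puts post_id_from("https://medium.com/@user/my-cool-post-123456abcdef") # 123456abcdef
puts post_id_from("https://medium.com/p/123456abcdef")                  # 123456abcdef
```

A URL such as https://medium.com/@user has no matching segment, so the method returns nil in that case.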
Edward Loveall	0f6a2a3e1e	Fix GitHub Gist width	2021-09-25 13:26:24 -04:00
Edward Loveall	bd56bfdd9f	Embed widths are now the same width as all content	2021-09-25 13:26:10 -04:00
Edward Loveall	561483cf9f	Link to the author's page Right now this links to the user's medium page. It may link to an internal page in the future. Instead of the Page taking the author as a string, it now takes a PostResponse::Creator object. The Articles::ShowPage then converts the Creator (a name and user_id) to an author link. Finally, I did some refactoring of UserAnchor (which I thought I was going to use for this) to change its userId attribute to user_id as is Crystal convention.	2021-09-15 16:03:36 -04:00
Edward Loveall	1c20c81d06	Fix Blockquotes In tufte.css blockquotes should contain a <p> that holds the content and an optional <footer> for the source of the quote. Otherwise the block quote text is unbounded and is way too wide. This wraps the content in a paragraph	2021-09-15 15:25:34 -04:00
Edward Loveall	a6cafaa1fc	Render embedded content
PostResponse::Paragraphs that are of type IFRAME have extra data in the iframe attribute to specify what's in the iframe. Not all data is the same, however. I've identified three types and am using the new EmbeddedConverter class to convert them:
* EmbeddedContent, the full iframe experience
* GithubGist, because medium or github treat embeds differently for whatever reason
* EmbeddedLink, the old style, just a link to the content. Effectively a fallback
The size of the original iframe is also specified as an attribute. This code resizes it. The resizing is determined by figuring out the width/height ratio and setting the width to 800.
EmbeddedContent can be displayed if we have an embed.ly URL, which most iframe response data has. GitHub gists are a notable exception. Gists instead can be embedded simply by taking the gist URL and attaching .js to the end. That becomes the iframe's src attribute.
The PostResponse::Paragraph's iframe attribute is nillable. Previous code used lots of if-statements with variable bindings to work with the possible nil values:
```crystal
if foo = obj.nillable_value
  # obj.nillable_value was not nil and foo contains the value
else
  # obj.nillable_value was nil so do something else
end
```
See https://crystal-lang.org/reference/syntax_and_semantics/if_var.html for more info.
In the EmbeddedConverter the monads library has been introduced to get rid of at least one level of nillability. This wraps values in Maybe which allows for a cleaner interface:
```crystal
Monads::Try(Value).new(->{ obj.nillable_value })
  .to_maybe
  .fmap(->(value : Value) {
    # do something with value
  })
  .value_or(default) # value was nil, so use some default instead
```
This worked to get the iframe attribute from a Paragraph:
```crystal
Monads::Try(PostResponse::IFrame).new(->{ paragraph.iframe })
  .to_maybe
  .fmap(->(iframe : PostResponse::IFrame) {
    # iframe is not nil!
  }) # further .fmap calls can follow, and so on
  .value_or(Empty.new)
```
iframe only has one attribute: mediaResource, which contains the iframe data. That was used to determine one of the three types above.
Finally, Tufte.css has options for iframes. They mostly look good except for tweets, which are too small and weirdly in the center of the page, which actually looks off-center. That's for another day though.	2021-09-15 15:18:08 -04:00
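The iframe resizing mentioned in a6cafaa1fc (keep the width/height ratio, set the width to 800) amounts to a small calculation. The sketch below is one possible reading of it; the `MAX_WIDTH` constant and `resize` method are hypothetical names, and it assumes the original width is greater than zero.

```crystal
# Target width taken from the commit message.
MAX_WIDTH = 800

# Hypothetical helper: scales an iframe to MAX_WIDTH while keeping its aspect
# ratio. Assumes width > 0. Integer division with `/` yields a Float64.
def resize(width : Int32, height : Int32) : Tuple(Int32, Int32)
  ratio = height / width
  {MAX_WIDTH, (MAX_WIDTH * ratio).round.to_i}
end

puts resize(400, 300) # {800, 600}
```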
Edward Loveall	903f3f4b38	Add License	2021-09-12 17:34:48 -04:00
Edward Loveall	7851434952	Add script to build object file (.o) for Ubuntu This ubuntu_server.o file then needs to be copied to the server and linked.	2021-09-07 22:00:20 -04:00
Edward Loveall	9770ff5c7a	Add MIXTAPE_EMBED paragraph type	2021-09-07 21:13:28 -04:00


82 commits