Commit Graph

8 Commits

Author SHA1 Message Date
Edward Loveall 7d0bc37efd
Fix markup errors caused by UTF-16/8 differences
Medium uses UTF-16 character offsets (likely to make it easier to parse
in JavaScript) but Crystal uses UTF-8. Converting strings to UTF-16 to
do offset calculation then back to UTF-8 fixes some markup bugs.

---

Medium calculates markup offsets using UTF-16 encoding. Some characters
like emoji count as multiple UTF-16 code units, which affects those
offsets. For example, in UTF-16 💸 takes two code units, but Crystal
strings count it as one character. This is a problem for markup
generation because it can shift the markup and even cause out-of-range
errors.

Take the following example:

💸💸!

Imagine that `!` is bold but the emoji aren't. For Crystal, the bold
text starts at char index 2 and ends at char index 3. Medium's markup
will say it goes from character 4 to 5. In a 3-character string like
this, trying to access character range 4...5 is an error because 5 is
already out of bounds.

My theory is that this is meant to be compatible with JavaScript's
string length calculations, as Medium is primarily a platform built for
the web:

```js
"a".length // 1
"💸".length // 2
"👩‍❤️‍💋‍👩".length // 11
```

To get these same numbers in Crystal, strings must be converted to
UTF-16:

```crystal
"a".to_utf16.size # 1
"💸".to_utf16.size # 2
"👩‍❤️‍💋‍👩".to_utf16.size # 11
```

The MarkupConverter now converts text into UTF-16 code unit arrays on
initialization. Once it has figured out the range of code units needed
for each piece of markup, it converts that range back into a UTF-8
string.
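
As a rough sketch of that round trip (this is not the actual
MarkupConverter code, and the variable names are only illustrative), the
bold `!` from the example above can be sliced out with Medium's UTF-16
offsets:

```crystal
text = "💸💸!"

# Medium's offsets count UTF-16 code units, so index the UTF-16 form.
utf16 = text.to_utf16 # Slice(UInt16)

# Medium says the bold markup covers code units 4...5.
bold = String.from_utf16(utf16[4...5])

puts text.size  # 3 (Crystal counts characters)
puts utf16.size # 5 (matches JavaScript's length)
puts bold       # !
```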
2022-01-30 11:53:22 -05:00
Edward Loveall 561483cf9f
Link to the author's page
Right now this links to the user's Medium page. It may link to an
internal page in the future.

Instead of the Page taking the author as a string, it now takes a
PostResponse::Creator object. The Articles::ShowPage then converts the
Creator (a name and user_id) to an author link.

Finally, I did some refactoring of UserAnchor (which I thought I was
going to use for this) to change its userId attribute to user_id, as is
Crystal convention.
2021-09-15 16:03:36 -04:00
Edward Loveall 09995cde5c
Overlapping refactor
Example:

* Text: "strong and emphasized only"
* Markups:
  * Strong: 0..10
  * Emphasis: 7..21

First, get all the borders of the markups, including the start (0) and
end (text.size) indexes of the text in order:

```
[0, 7, 10, 21, 26]
```

Then attach markups to each range. Note that the ranges are exclusive;
they don't include the final number:

* 0...7: Strong
* 7...10: Strong, Emphasized
* 10...21: Emphasized
* 21...26: N/A

Bundle each range and its related markups into a value object
RangeWithMarkup and return the list.

Loop through that list and recursively apply each markup to each
segment of text:

* Apply a `Strong` markup to the text "strong "
* Apply a `Strong` markup to the text "and"
  * Wrap that in an `Emphasis` markup
* Apply an `Emphasis` markup to the text " emphasized"
* Leave the text " only" as is
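
The border and range steps above can be sketched like this. It's a
minimal illustration, not the code from this commit: `Markup`, its
fields, and `ranges_with_markups` are hypothetical names, and only
`RangeWithMarkup` comes from the description above (its fields are
assumed too).

```crystal
# Hypothetical stand-ins for the real types.
record Markup, name : String, start : Int32, finish : Int32
record RangeWithMarkup, range : Range(Int32, Int32), markups : Array(Markup)

def ranges_with_markups(text : String, markups : Array(Markup)) : Array(RangeWithMarkup)
  # Collect every border, including the start and end of the text.
  borders = [0, text.size]
  markups.each { |m| borders << m.start << m.finish }
  borders = borders.uniq.sort

  # Pair neighbouring borders into exclusive ranges and attach the
  # markups that cover each one.
  (0...borders.size - 1).map do |i|
    from = borders[i]
    to = borders[i + 1]
    covering = markups.select { |m| m.start <= from && to <= m.finish }
    RangeWithMarkup.new(from...to, covering)
  end
end

text = "strong and emphasized only"
markups = [Markup.new("Strong", 0, 10), Markup.new("Emphasis", 7, 21)]

ranges_with_markups(text, markups).each do |r|
  names = r.markups.map(&.name).join(", ")
  label = names.empty? ? "N/A" : names
  puts "#{r.range}: #{label}"
end
# 0...7: Strong
# 7...10: Strong, Emphasis
# 10...21: Emphasis
# 21...26: N/A
```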

---

This has the side effect of breaking up the nodes more than they need
to be broken up. For example right now the algorithm creates this HTML:

```
<strong>strong </strong><em><strong>and</strong></em>
```

instead of:

```
<strong>strong <em>and</em></strong>
```

But that's a task for another day.
2021-08-08 15:08:43 -04:00
Edward Loveall 31f7d6956c
Anchor and UserAnchor nodes can contain children
The impetus for this change was to help make the MarkupConverter code
more robust. However, it's also possible that an Anchor can contain
styled text. For example, in markdown someone might write a link that
contains some <strong> text:

```markdown
[this link is so **good**](https://example.com)
```

This setup will now allow that. It's unknown whether UserAnchor can
ever contain any text that isn't just the user's name, but it's easy to
deal with and makes the typing much easier.
2021-08-08 14:34:40 -04:00
Edward Loveall 130b235a6c
crystal tool format
2021-08-08 14:23:38 -04:00
Edward Loveall 743d9e5fa9
Render a User Anchor
2021-07-04 17:37:45 -04:00
Edward Loveall bc356baa45
Render a Link Anchor
As opposed to a user anchor
2021-07-04 17:28:19 -04:00
Edward Loveall 5a5f68bcf8
First step rendering a page
The API responds with a bunch of paragraphs which the client converts
into Paragraph objects.

This turns a PostResponse's Paragraph objects into the form needed to
render them on a page. This includes converting flat list elements into
list elements nested in a UL, and adding a limited set of markups along
the way.

The array of paragraphs is passed to a recursive function. The function
takes the first paragraph, wraps the (marked up) contents in a
container tag (like Paragraph or Heading3), and then moves on to the
next paragraph. If it finds a list, it starts parsing the next
paragraphs as a list instead.

Originally, this was implemented like so:

```crystal
paragraph = paragraphs.shift
if list?
  convert_list([paragraph] + paragraphs)
end
```

However, `[paragraph] + paragraphs` creates a new array. This means
`paragraphs` itself isn't mutated while the list is parsed, so once the
list is done, parsing resumes at the next element of the list rather
than after it. Instead, the element is now `shift`ed inside each
converter.

```crystal
if paragraphs.first == list?
  convert_list(paragraphs)
end

def convert_list(paragraphs)
  paragraph = paragraphs.shift
  # ...
end
```
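
To make the difference concrete, here is a small standalone
illustration. It uses plain strings instead of Paragraph objects, and
`consume_list` with its `"item"` prefix check is a made-up stand-in for
the real list detection:

```crystal
paragraphs = ["item 1", "item 2", "after the list"]

# Original approach: `+` builds a new array, so shifting from the copy
# leaves `paragraphs` untouched.
first = paragraphs.shift
copy = [first] + paragraphs
copy.shift # "item 1"
copy.shift # "item 2"
p paragraphs # ["item 2", "after the list"], so "item 2" gets parsed twice

# Current approach: the converter shifts from the array it was given.
def consume_list(paragraphs : Array(String)) : Array(String)
  items = [] of String
  while !paragraphs.empty? && paragraphs.first.starts_with?("item")
    items << paragraphs.shift
  end
  items
end

paragraphs = ["item 1", "item 2", "after the list"]
items = consume_list(paragraphs)
p items      # ["item 1", "item 2"]
p paragraphs # ["after the list"]
```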

When rendering, there are Empty and Container objects. These represent
a kind of "null object" for leaves and parent objects respectively.
They should never actually render. Emptys are filtered out, and
Containers are never created explicitly, but this will make the types
pass.

IFrames are a bit of a special case. Each IFrame has custom data on it
that this system would need to be aware of. For now, instead of trying
to parse the seemingly large number of iframe variations and dealing
with embedded iframe problems, this will just keep track of the source
page URL and send the user there with a link.
2021-07-04 16:28:03 -04:00