Tokenizers lex their own child tokens #2124

calculuschild · 2021-06-30T20:03:01Z

Based on the conversation in #2112 (comment), this is the start of an attempt to have the Tokenizers handle the lexing of their own child tokens, rather than making Lexer.js do it.

For block tokens this was relatively simple. For inline tokens it's also not a huge issue, except for the ugliness that comes with passing in inRawBlock and inLink to a bunch of the Tokenizers since it kind of muddies up the legibility in what the Tokenizers are actually doing. Passing those values around seems like a code smell we could avoid but I don't know how, or how much those variables actually need to be passed around. Any thoughts? I wanted to get some early feedback before going through the whole thing.

Edit: What about refactoring the inLink and inRawBlock flags to instead be properties of the Lexer? I.e. in the constructor:

lexer.state = { //or lexer.flags, etc...
  inLink : false,
  inRawBlock : false
}
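A minimal sketch of that idea, assuming a stripped-down lexer. `SketchLexer` and `tagTokenizer` are hypothetical names used only for illustration and are not marked's actual code; the point is only that the flags are read from and written to one shared object instead of being passed through every call.

```javascript
// Hypothetical, stripped-down lexer: only the shared-state idea is shown.
class SketchLexer {
  constructor() {
    // Flags live on the lexer instead of being threaded through every call.
    this.state = { inLink: false, inRawBlock: false };
  }
}

// Hypothetical tokenizer that reads and updates the shared flags.
function tagTokenizer(lexer, src) {
  const match = /^<\/?(a|pre)\b[^>]*>/i.exec(src);
  if (!match) return undefined;
  const raw = match[0];
  const closing = raw.startsWith('</');
  const name = match[1].toLowerCase();
  if (name === 'a') lexer.state.inLink = !closing;
  if (name === 'pre') lexer.state.inRawBlock = !closing;
  // The token records the flags as they stand after this tag.
  return {
    type: 'html',
    raw,
    inLink: lexer.state.inLink,
    inRawBlock: lexer.state.inRawBlock
  };
}
```

With this shape, a tokenizer signature only needs the source string, and an extension could inspect `lexer.state` in the same way.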

And a second question: do we want the logic of inline() from Lexer.js to also be handled by the Tokenizers themselves?

Contributor

Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
no tests required for this PR.
If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

CI is green (no forced merge required).
Squash and Merge PR following conventional commit guidelines.

vercel · 2021-06-30T20:03:06Z

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/wA4Xa4rc9Kz4J8GUSfwU33a5zJYR
✅ Preview: https://markedjs-git-fork-calculuschild-tokenizershandl-72751a-markedjs.vercel.app

UziTech · 2021-07-02T05:33:50Z

What about refactoring the inLink and inRawBlock flags to instead be properties of the Lexer?

I like that idea. Then extensions can use those properties as well if they need to.

do we want the logic of inline() from Lexer.js to also be handled by the Tokenizers themselves?

Ya, the way I see it, the Lexer handles the overall state (including the order in which to call the tokenizers) and the tokenizers handle everything about creating the token (including children).

Currently the tokenizers handle everything about creating the token except for the children, but as we see in #2112 (comment), sometimes the tokens need to know the children to create the token.

calculuschild · 2021-07-02T17:27:07Z

What about refactoring the inLink and inRawBlock flags to instead be properties of the Lexer?

Alright. I've added this now. Just need to finish moving inline() over.

calculuschild · 2021-07-02T18:04:55Z

@UziTech New issue I'm running into in moving inline() to the Tokenizer. In the case of adjacent tokens that we merge together (such as paragraph and text), generating child tokens before the merge leads to errors. If you try to do:

        if (lastToken && lastToken.type === 'text') {
          lastToken.raw += '\n' + token.raw;
          lastToken.text += '\n' + token.text;
          lastToken.tokens = lastToken.tokens.concat(token.tokens); //<== Trying to merge tokens
        }

You end up with trailing/starting spaces between the two text tokens being trimmed:

Expected: <ol><li><p>A paragraph with two lines.</p><
  Actual: <ol><li><p>A paragraphwith two lines.</p><p

This can also lead to some child tokens not being detected because the ending half is in the other token before merging.

Any ideas?
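The second failure mode (a child token whose two halves sit in different text tokens) can be reproduced with a toy inline lexer. This is a sketch under stated assumptions: `toyInline` is a hypothetical stand-in that only recognises `*emphasis*`, not marked's inline lexer.

```javascript
// Hypothetical toy inline lexer: splits text into `em` and `text` tokens.
function toyInline(text) {
  const tokens = [];
  const re = /\*([^*]+)\*/g;
  let last = 0;
  let m;
  while ((m = re.exec(text)) !== null) {
    if (m.index > last) {
      tokens.push({ type: 'text', raw: text.slice(last, m.index) });
    }
    tokens.push({ type: 'em', raw: m[0], text: m[1] });
    last = re.lastIndex;
  }
  if (last < text.length) {
    tokens.push({ type: 'text', raw: text.slice(last) });
  }
  return tokens;
}

// Two adjacent text tokens; the emphasis starts in one and ends in the other.
const first = { type: 'text', text: 'A *paragraph' };
const second = { type: 'text', text: 'with* two lines.' };

// Lexing children before the merge: neither half contains a full *...* pair.
const early = toyInline(first.text).concat(toyInline(second.text));

// Merging the text first, then lexing once: the emphasis is found.
const late = toyInline(first.text + '\n' + second.text);
```

Here `early` contains only `text` tokens, while `late` contains an `em` token spanning the line break, which is why inline lexing for mergeable tokens has to wait until the merge is done.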

nptable is 95% identical to table, except for handling a weird special case. Special case now handled in splitCells()

UziTech · 2021-07-03T04:28:41Z

Would it be feasible to only do inline tokens for paragraph and text in the lexer after all block tokens? Do you feel that would be too inconsistent?

calculuschild · 2021-07-03T04:32:08Z

I might be able to manage the other ones. I'm pretty sure I can get Tables at least.

It would be kind of inconsistent but it might be something we just need to leave until a deeper revision down the road.

calculuschild · 2021-07-05T04:43:27Z

@UziTech Sigh... new problem.

Reflinks that are defined after they are referenced don't get handled, since they aren't registered in this.tokens.links yet when we try to parse child inlineTokens.

# [Foo]  //<-- parsing child tokens of this Heading before `[foo]` is defined.
[foo]: /url

I'm stumped on this one. How do we approach this? It seems like we would have to make a first pass just to get the link definitions before parsing anything else.
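A rough sketch of the "first pass for definitions" idea, assuming simplified one-line definitions. `collectDefinitions` and `resolveReference` are hypothetical helpers for illustration, not functions in marked.

```javascript
// Hypothetical helper: first pass that only collects reference definitions.
function collectDefinitions(src) {
  const links = {};
  for (const line of src.split('\n')) {
    const m = /^\[([^\]]+)\]:\s*(\S+)$/.exec(line);
    if (m) links[m[1].toLowerCase()] = m[2];
  }
  return links;
}

// Hypothetical resolver standing in for inline lexing of a heading's text.
function resolveReference(text, links) {
  const m = /^\[([^\]]+)\]$/.exec(text);
  return m ? links[m[1].toLowerCase()] : undefined;
}

const src = '# [Foo]\n[foo]: /url';

// Single pass: the heading is reached while the definition table is still empty.
const eagerHref = resolveReference('[Foo]', {});

// Two passes: definitions first, then everything else.
const twoPassHref = resolveReference('[Foo]', collectDefinitions(src));
```

The cost of this approach is scanning the whole source twice, which is the overhead the rest of the thread tries to avoid.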

UziTech · 2021-07-05T05:28:06Z

Hmm. Maybe we need to have a property on the token that is a function that the lexer will call to parse the inline tokens after all block tokens are handled? Not sure what that would do with the benchmarks, but it might slow it down too much.

Right now the extensions parse their inline tokens right away but we might want them to wait for all block tokens as well in case they need reflinks.

UziTech · 2021-07-05T05:40:17Z

Or, instead of tokenizers calling token.tokens = this.inlineTokens(token.text), we could have an inline queue so they could call this.inline(token.text, token.tokens).

// in lexer
  inline(src, tokens) {
    this.inlineQueue.push({src, tokens});
  }

then after the block tokens are complete we would run

// a declaration is not allowed inside a while condition, so declare first
let next;
while ((next = this.inlineQueue.shift())) {
  this.inlineTokens(next.src, next.tokens);
}
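Put together, the queue idea might look like the following self-contained sketch. `QueueLexer` is a hypothetical miniature lexer that only understands `# heading` lines and one-line `[label]: url` definitions; the method names mirror the ones proposed in this thread, but this is not marked's implementation.

```javascript
// Hypothetical miniature lexer; names mirror the thread, not marked's code.
class QueueLexer {
  constructor() {
    this.links = {};        // reference definitions found during the block pass
    this.inlineQueue = [];  // deferred inline work
  }

  // Called by block tokenizers: remember the work instead of doing it now.
  inline(src, tokens) {
    this.inlineQueue.push({ src, tokens });
  }

  // Toy inline pass: resolves a lone [label] against the collected definitions.
  inlineTokens(src, tokens) {
    const m = /^\[([^\]]+)\]$/.exec(src.trim());
    const href = m ? this.links[m[1].toLowerCase()] : undefined;
    tokens.push(href
      ? { type: 'link', href, text: m[1] }
      : { type: 'text', text: src });
    return tokens;
  }

  lex(src) {
    const blocks = [];
    for (const line of src.split('\n')) {
      const def = /^\[([^\]]+)\]:\s*(\S+)$/.exec(line);
      if (def) {
        this.links[def[1].toLowerCase()] = def[2];
        continue;
      }
      const heading = /^#\s+(.*)$/.exec(line);
      if (heading) {
        const token = { type: 'heading', text: heading[1], tokens: [] };
        this.inline(token.text, token.tokens); // queued, not lexed yet
        blocks.push(token);
      }
    }
    // Every definition is known now, so drain the queue.
    let next;
    while ((next = this.inlineQueue.shift())) {
      this.inlineTokens(next.src, next.tokens);
    }
    return blocks;
  }
}

const blocks = new QueueLexer().lex('# [Foo]\n[foo]: /url');
```

Because the heading's `tokens` array is filled in place after the block pass, the forward reference to `[foo]` resolves even though its definition comes later in the source. The trade-off is that a block tokenizer cannot inspect its children while it is still building the parent token.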

calculuschild · 2021-07-05T05:46:26Z

we could have an inline queue

We could do something like that. That might also better work with paragraphs and text that can't inline until they are merged.

The only problem I see with that, though, is the original problem of #2112 (comment) is broken again, since the parent token won't be able to see its child tokens until it's too late.

UziTech · 2021-07-05T15:47:24Z

I guess it would have to be a function to accomplish setting token properties based on children. Or should that be what walkTokens is for?

calculuschild · 2021-07-05T16:11:23Z

Maybe we could use walkTokens. I'm a little hesitant to break token function apart further but that might make sense.

Hm... What about handling it in the Renderer? The token can organize all the text info but the Renderer decides how to render it based on what the children look like? Meh...

calculuschild · 2021-07-19T03:28:35Z

I updated the documentation. Double-check it for typos if you like. I think everything else up to this point is resolved now?

UziTech

Nice work! 💯

UziTech · 2021-07-19T16:40:36Z

Do we want to close #2126 since this PR also removes the nptable tokenizer? Otherwise this one will have to be rebased after that one is merged.

calculuschild · 2021-07-19T16:42:07Z

Yep, did it just now. 👍

calculuschild · 2021-07-19T16:42:57Z

#2112 might need tweaking now, but it should go out in the same major version bump after this is merged.

calculuschild · 2021-07-27T22:10:28Z

@davisjam @joshbruce @styfle Is anyone able to take a look at this one? I'm eager to get this one merged!

calculuschild · 2021-08-02T04:58:25Z

Hm, do we need to add to the documentation mention of the lexer.state properties? Since they have been removed from the function signatures?

UziTech · 2021-08-02T17:37:08Z

There are the following breaking changes that I can think of in this PR:

Tokenizers will create their own tokens with this.lexer.inline(text, tokens). The inline function will queue the token creation until after all block tokens are rendered.
nptable tokenizer is removed and merged with table tokenizer.
Extensions tokenizer this object will include the lexer as a property. this.inlineTokens becomes this.lexer.inline.
Extensions parser this object will include the parser as a property. this.parseInline becomes this.parser.parseInline.
tag and inlineText tokenizer function signatures have changed.
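For extension authors, the third item might translate into something like the sketch below. The `underline` extension is hypothetical, and `fakeThis` is a stand-in for the `this` object the library would supply; only the `this.lexer.inline(text, tokens)` call shape is taken from the list above.

```javascript
// Hypothetical extension tokenizer written against the changes listed above.
const underline = {
  name: 'underline',
  level: 'block',
  tokenizer(src) {
    const match = /^__([^\n]+)__(?:\n|$)/.exec(src);
    if (!match) return undefined;
    const token = {
      type: 'underline',
      raw: match[0],
      text: match[1],
      tokens: []
    };
    // Before this PR: token.tokens = this.inlineTokens(token.text);
    // After: queue the inline work; the array is filled in later.
    this.lexer.inline(token.text, token.tokens);
    return token;
  }
};

// Stand-in for the `this` object (an assumption made for this sketch).
const queued = [];
const fakeThis = {
  lexer: { inline: (src, tokens) => queued.push({ src, tokens }) }
};
const token = underline.tokenizer.call(fakeThis, '__hello *world*__\n');
```

Note that `token.tokens` is still empty when the tokenizer returns; anything that depends on the children has to run after the queue has been processed.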

Am I missing any?

UziTech · 2021-08-02T17:37:41Z

do we need to add to the documentation mention of the lexer.state properties?

That wouldn't hurt but it could be done in a separate PR.

calculuschild · 2021-08-02T19:05:44Z

Am I missing any?

Just that some function signatures have changed. If people were overwriting any of those functions they might not work anymore. I guess that's kind of covered by the "tokenizers lex their own child tokens" though.

Edit: never mind, you got those listed already.

calculuschild · 2021-08-02T19:14:02Z

One minor thing is the naming of inline() and inlineTokens(). I have to keep reminding myself what the difference is. Is there another way we could name things? Maybe something like inline => inlineTokens and inlineTokens => nestedInlineTokens ?

Might make it easier for users to understand when to use each one.

UziTech · 2021-08-02T19:24:08Z

Might make it easier for users to understand when to use each one.

I don't think inlineTokens is something extension creators should call. Ideally inlineTokens should be a private method and inline should be a public method that should always be used to queue the creation of inline tokens.

We could change them to queueInlineTokens and createInlineTokens. That seems to be the most intuitive.

github-actions · 2021-08-16T03:11:26Z

🎉 This PR is included in version 3.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Initial commit

d5c4850

vercel bot deployed to Preview June 30, 2021 20:03 View deployment

calculuschild requested a review from UziTech June 30, 2021 20:03

Move inLink and inRawBlock to Lexer.state

acaa742

vercel bot deployed to Preview July 2, 2021 17:22 View deployment

Move Def child tokens to tokenizer

595834b

vercel bot deployed to Preview July 2, 2021 17:28 View deployment

Remove nptable tokenizer

a7a6051

nptable is 95% identical to table, except for handling a weird special case. Special case now handled in splitCells()

calculuschild mentioned this pull request Jul 2, 2021

Remove nptable tokenizer #2126

Closed

5 tasks

Remove rules.js for nptable

79cde34

calculuschild added 2 commits July 4, 2021 22:38

Merge branch 'removeNPTable' into TokenizersHandleChildTokens

c6d8ff6

Move inline -> table to tokenizer

136baeb

vercel bot deployed to Preview July 5, 2021 03:22 View deployment

Lint

a5e17df

vercel bot deployed to Preview July 5, 2021 03:22 View deployment

Move inline -> blockquote to tokenizer

6681659

vercel bot deployed to Preview July 5, 2021 03:29 View deployment

calculuschild requested a review from UziTech July 19, 2021 15:05

UziTech approved these changes Jul 19, 2021

UziTech requested review from davisjam, joshbruce and styfle July 19, 2021 17:00

styfle approved these changes Aug 1, 2021

UziTech mentioned this pull request Aug 2, 2021

drop node 10 support #2157

Merged

5 tasks

calculuschild mentioned this pull request Aug 2, 2021

fix: Full Commonmark compliance for Lists #2112

Merged

5 tasks

correct function signatures for tokenizers

6edcfce

vercel bot deployed to Preview August 2, 2021 04:52 View deployment

UziTech merged commit 288f1cb into markedjs:master Aug 2, 2021

calculuschild added a commit to calculuschild/marked that referenced this pull request Aug 5, 2021

Redo of Full List compliance based on markedjs#2124

de5011c

calculuschild added a commit to calculuschild/marked that referenced this pull request Aug 6, 2021

Rebase onto markedjs#2124

7140744

github-actions bot added the released label Aug 16, 2021

stevenjoezhang added a commit to hexojs/hexo-renderer-marked that referenced this pull request Sep 19, 2021

fix: markedjs/marked#2124

ab0e0a1

stevenjoezhang mentioned this pull request Sep 19, 2021

chore(deps): bump marked from 2.1.3 to 3.0.4 hexojs/hexo-renderer-marked#208

Merged

ashharrison90 mentioned this pull request Jan 19, 2022

Reference-style links inside tables are broken since version 3 #2217

Closed
