Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizers lex their own child tokens #2124

Merged
merged 18 commits into from
Aug 2, 2021

Conversation

calculuschild
Copy link
Contributor

@calculuschild calculuschild commented Jun 30, 2021

Based on the conversation in #2112 (comment), this is the start of an attempt to have the Tokenizers handle the lexing of their own children tokens, rather than making the Lexer.js do it.

For block tokens this was relatively simple. For inline tokens it's also not a huge issue, except for the ugliness that comes with passing in inRawBlock and inLink to a bunch of the Tokenizers since it kind of muddies up the legibility in what the Tokenizers are actually doing. Passing those values around seems like a code smell we could avoid but I don't know how, or how much those variables actually need to be passed around. Any thoughts? I wanted to get some early feedback before going through the whole thing.

Edit : What about refactoring the inLink and inRawBlock flags to instead be properties of the Lexer? I.e. in the constructor:

lexer.state = { //or lexer.flags, etc...
  inLink : false,
  inRawBlock : false
}

And a second question: do we want the logic of inline() from Lexer.js to also be handled by the Tokenizers themselves?

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

@vercel
Copy link

vercel bot commented Jun 30, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/wA4Xa4rc9Kz4J8GUSfwU33a5zJYR
✅ Preview: https://markedjs-git-fork-calculuschild-tokenizershandl-72751a-markedjs.vercel.app

@UziTech
Copy link
Member

UziTech commented Jul 2, 2021

What about refactoring the inLink and inRawBlock flags to instead be properties of the Lexer?

I like that idea. Then extensions can use those properties as well if they need to.

do we want the logic of inline() from Lexer.js to also be handled by the Tokenizers themselves?

Ya, The way I see it is the Lexer handles the overall state (including which order to call the tokenizers) and the tokenizers handle everything about creating the token (including children).

Currently the tokenizers handle everything about creating the token except for the children but as we see in #2112 (comment) sometimes the tokens need to know the children to create the token.

@calculuschild
Copy link
Contributor Author

What about refactoring the inLink and inRawBlock flags to instead be properties of the Lexer?

Alright. I've added this now. Just need to finish moving inline() over.

@calculuschild
Copy link
Contributor Author

calculuschild commented Jul 2, 2021

@UziTech New issue I'm running into in moving inline() to the Tokenizer. In the case of adjacent tokens that we merge together (such as paragraph and text), generating child tokens before the merge leads to errors. If you try to do:

        if (lastToken && lastToken.type === 'text') {
          lastToken.raw += '\n' + token.raw;
          lastToken.text += '\n' + token.text;
          lastToken.tokens = lastToken.tokens.concat(token.tokens); //<== Trying to merge tokens
        }

You end up with trailing/starting spaces between the two text tokens being trimmed:

Expected: <ol><li><p>A paragraph with two lines.</p><
  Actual: <ol><li><p>A paragraphwith two lines.</p><p

This can also lead to some child tokens not being detected because the ending half is in the other token before merging.

Any ideas?

nptable is 95% identical to table, except for handling a weird special case.  Special case now handled in splitCells()
@calculuschild calculuschild mentioned this pull request Jul 2, 2021
5 tasks
@UziTech
Copy link
Member

UziTech commented Jul 3, 2021

Would it be feasible to only do inline tokens for paragraph and text in the lexer after all block tokens? Do you feel that would be too inconsistent?

@calculuschild
Copy link
Contributor Author

I might be able to manage the other ones. I'm pretty sure I can get Tables at least.

It would be kind of inconsistent but it might be something we just need to leave until a deeper revision down the road.

@calculuschild
Copy link
Contributor Author

calculuschild commented Jul 5, 2021

@UziTech Sigh... new problem.

Reflinks that are defined after they are referenced don't get handled since they aren't registered to the this.tokens.links yet when we try to parse child inlineTokens.

# [Foo]  //<-- parsing child tokens of this Heading before `[foo]` is defined.
[foo]: /url

I'm stumped on this one. How do we approach this? It seems like we would have to make a first pass just to get the link definitions before parsing anything else.

@UziTech
Copy link
Member

UziTech commented Jul 5, 2021

Hmm. Maybe we need to have a property on the token that is a function that the lexer will call to parse the inline tokens after all block tokens are handled? Not sure what that would do with the benchmarks but It might slow it down too much.

Right now the extensions parse their inline tokens right away but we might want them to wait for all block tokens as well in case they need reflinks.

@UziTech
Copy link
Member

UziTech commented Jul 5, 2021

or instead of tokenizers calling token.tokens = this.inlineTokens(token.text) we could have an inline queue so they could call this.inline(token.text, token.tokens).

// in lexer
  inline(src, tokens) {
    this.inlineQueue.push({src, tokens});
  }

then after the block tokens are complete we would run

while (const next = this.inlineQueue.shift()) {
  this.inlineTokens(next.src, next.tokens);
}

@calculuschild
Copy link
Contributor Author

calculuschild commented Jul 5, 2021

we could have an inline queue

We could do something like that. That might also better work with paragraphs and text that can't inline until they are merged.

The only problem I see with that, though, is the original problem of #2112 (comment) is broken again, since the parent token won't be able to see its child tokens until its too late.

@UziTech
Copy link
Member

UziTech commented Jul 5, 2021

I guess it would have to be a function to accomplish setting token properties based on children. Or should that be what walkTokens is for?

@calculuschild
Copy link
Contributor Author

Maybe we could use walktokens. I'm a little hesitant to break token function apart further but that might make sense.

Hm... What about handling it in the Renderer? The token can organize all the text info but the Renderer decides how to render it based on what the children look like? Meh...

@calculuschild
Copy link
Contributor Author

I updated the documentation. Double-check it for typos if you like. I think everything else up to this point is resolved now?

Copy link
Member

@UziTech UziTech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! 💯

@UziTech
Copy link
Member

UziTech commented Jul 19, 2021

Do we want to close #2126 since this PR also removes the nptable tokenizer? or this one will have to be rebased after that one is merged.

@calculuschild
Copy link
Contributor Author

Yep, did it just now. 👍

@calculuschild
Copy link
Contributor Author

calculuschild commented Jul 19, 2021

#2112 might need tweaking now, but it should go out in the same major version bump after this is merged.

@calculuschild
Copy link
Contributor Author

@davisjam @joshbruce @styfle Is anyone able to take a look at this one? I'm eager to get this one merged!

@UziTech UziTech mentioned this pull request Aug 2, 2021
5 tasks
@calculuschild
Copy link
Contributor Author

calculuschild commented Aug 2, 2021

Hm, do we need to add to the documentation mention of the lexer.state properties? Since they have been removed from the function signatures?

@UziTech
Copy link
Member

UziTech commented Aug 2, 2021

There are the following breaking changes that I can think of in this PR:

  • Tokenizers will create their own tokens with this.lexer.inline(text, tokens). The inline function will queue the token creation until after all block tokens are rendered.
  • nptable tokenizer is removed and merged with table tokenizer.
  • Extensions tokenizer this object will include the lexer as a property. this.inlineTokens becomes this.lexer.inline.
  • Extensions parser this object will include the parser as a property. this.parseInline becomes this.parser.parseInline.
  • tag and inlineText tokenizer function signatures have changed.

Am I missing any?

@UziTech
Copy link
Member

UziTech commented Aug 2, 2021

do we need to add to the documentation mention of the lexer.state properties?

That wouldn't hurt but it could be done in a separate PR.

@calculuschild
Copy link
Contributor Author

calculuschild commented Aug 2, 2021

Am I missing any?

Just that some function signatures have changed. If people were overwriting any of those functions they might not work anymore. I guess that's kind of covered by the "tokenizers left their own child tokens" though.

Edit: nevermind you got those listed already.

@UziTech UziTech merged commit 288f1cb into markedjs:master Aug 2, 2021
@calculuschild
Copy link
Contributor Author

One minor thing is the naming of inline() and inlineTokens(). I have to keep reminding myself what the difference is. Is there another way we could name things? Maybe something like inline => inlineTokens and inlineTokens => nestedInlineTokens ?

Might make it easier for users to understand when to use each one.

@UziTech
Copy link
Member

UziTech commented Aug 2, 2021

Might make it easier for users to understand when to use each one.

I don't think inlineTokens is something extension creators should call. Ideally inlineTokens should be a private method and inline should be a public method that should always be used to queue the creation of inline tokens.

We could change them to queueInlineTokens and createInlineTokens. That seems to be the most intuitive.

calculuschild added a commit to calculuschild/marked that referenced this pull request Aug 5, 2021
calculuschild added a commit to calculuschild/marked that referenced this pull request Aug 6, 2021
@github-actions
Copy link

🎉 This PR is included in version 3.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants