Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh(markdown) Disallow _ from triggering emphasis/bold inside words #3152

Closed

Conversation

polygitapp
Copy link

Followup from #3098

Changes

Disallow underscores from triggering emphasis/bold inside of words. Note that asterisks still work inside of words.

@joshgoebel You mentioned either using beforeMatch or match all words to achieve this. BeforeMatch appears to have been removed. It's replacement, match with an array of regexes, afaict doesn't handle end blocks. And adding a rule to match all words, I don't think works with words like _WORD_WITH_UNDERSCORE_. So I ended up using a nested rule where the outer rule matches a space and the underscore. And the inner rule matches the underscore and also begins the class.

Checklist

  • Added markup tests, or they don't apply here because...
  • Updated the changelog at CHANGES.md

@@ -6,13 +6,19 @@ __Bold__
__Bold *then italic*__
**Bold _then italic_**
**Bold and <br/>**
**_[this](https://google.com)_**
__*[this](https://google.com)*__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? Generally we should add new tests, not change existing ones. The former still seems like a valid test to me, no?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rule to match underscores checks if there is a space or newline before it. This is how it prevents matching underscores inside of words. Since the containing bold has already matched the asterisks. The underscore rule can't match the space/newline. A workarounds might be to add a rule specifically for **_, ___ and *__ but its a tradeoff in complexity. Can add rules for that if you'd like.

Copy link
Member

@joshgoebel joshgoebel Apr 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow. Are you saying you removed these tests because they now fail? That's not what we need to be doing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I make changes so that the parser is more correct. It's too complex and you'd rather have something that is simpler and "good enough". If I go the simpler route, certain edge cases are going to regress. But other, more common use cases will now be improved. Its a tradeoff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it's certainly possible we may need tradeoffs. Yet please don't hide these decisions by silently removing tests... If a change is deciding to make a tradeoff then it should be mentioned explicitly in the PR notes (with examples) what that comprise is, what the thinking is behind the change, etc... ie if you're breaking an edge case and improving a common case that thinking should be outlined in the PR notes... so that it can be easily reviewed to decide if the tradeoff is an acceptable one or not.


> quoted **Bold _then italic_**
> _[this](https://google.com)_
> *[this](https://google.com)*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, the quoted rule has matched the > immediately before the underscore, preventing the underscore rule from matching the space.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could I think be resolved by simple changing block quote to use a look-ahead for the spaces:

  const BLOCKQUOTE = {
    className: 'quote',
    begin: /^>(?=\s+)/,
    contains: inline(),
    end: '$'
  };

@joshgoebel
Copy link
Member

joshgoebel commented Apr 19, 2021

Yeah, this is way too much complexity. What about:

--- a/src/languages/markdown.js
+++ b/src/languages/markdown.js
@@ -161,7 +161,7 @@ export default function(hljs) {
       },
       {
         begin: /_(?!_)/,
-        end: /_/,
+        end: /_(?!\w)/,
         relevance: 0
       }
     ]
@@ -221,6 +221,9 @@ export default function(hljs) {
       HEADER,
       INLINE_HTML,
       LIST,
+      {
+        match: /(?!_)[\w_]+/
+      },
       BOLD,
       ITALIC,
       BLOCKQUOTE,

Seems to pass existing tests plus your new tests, please confirm?

@joshgoebel
Copy link
Member

joshgoebel commented Apr 19, 2021

 _italic_word_with_underscore_
 *italic*wordwith*asterisks*

Screen Shot 2021-04-19 at 3 11 23 PM

Why is the behavior of these not consistent?

Why should *italic* be processing n its own but not _italic_?

@polygitapp
Copy link
Author

polygitapp commented Apr 19, 2021

That's just markdown... ¯_(ツ)_/¯.

Text Markdown
_italic_word_with_underscore_ italic_word_with_underscore
*italic*wordwith*asterisks* italicwordwithasterisks

@joshgoebel
Copy link
Member

.That's just markdown... ¯_(ツ)_/¯.

Ok, that's a fine answer for that. :-) markdown be weird sometimes. Not sure I was aware of that quirk.

@joshgoebel
Copy link
Member

Did you see my proposed solution above? Can you confirm it resolves all test cases (including the existing ones)?

@polygitapp
Copy link
Author

Yeah, this is way too much complexity.

I don't know how to answer this. What seems like complexity to one person, appears like the correctness to another. And what seems like simplicity to one person, feels ugly to another. I wanted to integrate the changes from my other PR after this but does not appear like thats going to happen.

Seems to pass existing tests plus your new tests, please confirm?

It does appear to work.

@polygitapp polygitapp closed this Apr 19, 2021
@joshgoebel
Copy link
Member

joshgoebel commented Apr 19, 2021

@polygitapp I appreciate all your effort on this and I'm truly sorry you're frustrated. From my perspective we were making progress - this PR was much simpler and more focused and therefor easier to review. Often new contributors PRs are far more complex than necessary. That's truly not a knock on you or your PR or skills. It's rather common and expected of new contributors. This is a complex project; language grammars are complex. That's why we're here to review PRs in the first place and find simpler solutions when possible.

Right now what we have is:

  • a simple, ~2 line patch to Markdown
  • it passes all existing tests and your new tests
  • thereby solving the stated problem completely "Disallow underscores from triggering emphasis/bold inside of words"

This result seems more correct and a win for all Markdown users - with zero regressions.

Your effort and involvement has driven these changes even if the final solution was proposed by the core team. Hopefully you would next update this PR the the simpler approach proposed above. Then we could do another review and hopefully get this merged.

I wanted to integrate the changes from my other PR after this

I'm happy to work with you on whatever item is might be next on the list after we get this merged. There will hopefully be simpler approaches to those issues as well. It might save you some time (and be easier) to first open an issue clearly explaining a particular problem to get some feedback before making a PR - particularly if your finding the solution to be non-trivial.

@joshgoebel
Copy link
Member

joshgoebel commented Apr 19, 2021

I don't know how to answer this.

It didn't really require an answer - that's why I immediately followed it with a proposed alternative.

What seems like complexity to one person, appears like the correctness to another.

The markup tests are our ultimate arbiter of "correctness". You provided two new tests - and the solution I proposed produces "correct" results for both your new tests as well as correct results for two tests you removed. How is this not a "correct" solution? It solves the stated problem while introducing zero regressions.

And what seems like simplicity to one person, feels ugly to another.

I don't know how to answer this. :-)

@joshgoebel joshgoebel reopened this Apr 19, 2021
@joshgoebel joshgoebel changed the title Disallow underscores from triggering emphasis/bold inside of words enh(markdown) Disallow underscores from triggering emphasis/bold inside words Apr 19, 2021
@joshgoebel joshgoebel changed the title enh(markdown) Disallow underscores from triggering emphasis/bold inside words enh(markdown) Disallow _ from triggering emphasis/bold inside words Apr 19, 2021
@polygitapp
Copy link
Author

With your changes highlighter.js, still isn't going to handle things like the following correctly.

# 1*2=3
abc

Which maybe fine for general purpose syntax highlighting. But for my use case I need more accuracy. And if a 40+ line change, is too complex, there is no way I'm getting the 200+ line changes from #3098 merged in, which builds on top of that. I appreciate the work you've done on highlighter, it has been very helpful for my project. But I think I just need to maintain my own fork of the markdown syntax.

@joshgoebel
Copy link
Member

joshgoebel commented Apr 19, 2021

With your changes highlighter.js, still isn't going to handle things like the following correctly.

There would of course be no easy way for me to know that. Your test coverage did not include that test. :-)

# 1*2=3

What is the problem there? That * would trigger bold incorrectly? Could that not be solved with a simple look-ahead to make sure there is a closing * to be found "up ahead"? If so that would be another one line patch... if not, it could certainly be done with a on:begin callback (and far fewer LOC), though we don't encourage callbacks in core since there are real performance implications to using them. A short callback might be far simpler than the approach seen here.

But for my use case I need more accuracy.

I don't think we are never going to be 100% accurate here. Markdown is crazy complex to parse. There is a reason a correct Markdown parser is ~20-40kb gzipped (I think that's what I found when I looked the other day) and our grammar is 1.5kb gzipped. We are not looking to be a fully correct Markdown parser (or a fully correct parser for any complex language).

And if a 40+ line change, is too complex, there is no way I'm getting the 200+ line changes from #3098 merged in,

Probably not if your goal is getting your solution merged as-is rather than working on a "good enough" solution that might improve the library for the other 99% of our target audience. It's also of course STILL possible some/many of the things you want to do can be done much simpler, but if we don't end up exploring them, I understand. And again 100% accuracy is probably not realistic in core.

I just need to maintain my own fork of the markdown syntax

Here are some thoughts on going down your own road:

  • Take a look at on:begin to see if it might help you
  • Find a real markdown parser that generates an AST then plug that back into Highlight.js

I'd love to see a 3rd party grammar that does the latter. I believe it's fully possible to do with our existing plugin support. I've encouraged a few people to hook us up to a full parser, but I haven't gotten any bites yet. If you wanted our highlighted but with 100% fidelity the simplest way to get it would be to just use an existing Markdown parser (if you can stomach the download size). AND it's trivially easy - IF you can get an AST generated easily from another source.

The question is how easily can you get an AST from a markdown parser vs say HTML.

@joshgoebel
Copy link
Member

https://github.com/remarkjs/remark/tree/main/packages/remark-parse

Looks like exactly what you'd want...

@polygitapp
Copy link
Author

What is the problem there? That * would trigger bold incorrectly? Could that not be solved with a simple look-ahead to make sure there is a closing * to be found "up ahead"? If so that would be another one line patch... if not, it could certainly be done with a on:begin callback (and far fewer LOC), though we don't encourage callbacks in core since there are real performance implications to using them. A short callback might be far simpler than the approach seen here.

This is the frustrating part. You dismissing it as too much complexity without realizing how badly the existing markdown parser behaves with real world content. And immediately pushing your own "simpler" solutions without trying to engage with the actual issues that I'm trying to solve. To expand on the example.

  1. # 1*2=3\nabc should NOT italicize, because there is no closing asterisk
  2. # 1*2=3\nabc* should NOT italicize, because the header and the paragraph are different blocks.
  3. 1*2=3\nabc* should italicize, because it is a single paragraph
  4. 1*2=3\n> abc* should NOT italicize, because the paragraph and the block quote are different blocks.
  5. > 1*2=3\n> abc* should italicize, because it is a single block quote.

Would using a lookahead or on:begin be able to match those? Maybe? You seem like you could figure it out. But we never got close to discussing these nuances.

Maybe I should have opened my PR with a list of all the cases that it would fix. I definitely could have been more elaborate and descriptive in my description. And I will note that my original PR does not exactly highlight the above examples as described. But when it fails, it does so in a very reasonable manner and doesn't mess up the rest of the document.

I don't think we are never going to be 100% accurate here. Markdown is crazy complex to parse. There is a reason a correct Markdown parser is ~20-40kb gzipped (I think that's what I found when I looked the other day) and our grammar is 1.5kb gzipped. We are not looking to be a fully correct Markdown parser (or a fully correct parser for any complex language).

And neither was I trying to make it 100% accurate. My PR isn't 20kb gzipped.

Find a real markdown parser that generates an AST then plug that back into Highlight.js

That is a really great idea, I'll look into that. Anyways please leave this closed. I don't want to continue the contribution process.

@polygitapp polygitapp closed this Apr 21, 2021
@joshgoebel
Copy link
Member

You dismissing it as too much complexity without realizing how badly the existing markdown parser behaves with real world content.

Since we are NOT interested in implementing the "full spec" (which one?) we can only look at isolated examples... pointing to some "real world content" might help... but if someone wishes to improve our Markdown support we need to pick very particular examples and then decide if those are fixable with a reasonable amount of complexity and effort. We purposely are NOT a full parser for any language. You can do a LOT of things with our grammars that we do NOT encourage or desire in the core library (because it's too complex and not easily maintainable). Our goal is NOT to fully understand a language.

That philosophy is kind of at odds with something [super complex] like Markdown. The Github spec alone has 12 rules for the interactions between emphasis and strong emphasis and then 5 additional rules to attempt to clarify the ambiguities of the first 12 rules. And 130 different examples showing how all these rules intermingle.

without trying to engage with the actual issues that I'm trying to solve.

Again, the problem is your tests were insufficient and the problem wasn't fully declared. You provided only 2 tests and also broke 2 tests. My solution passed your tests and our prior tests.

If you had included additional tests with edge cases I might have been able to respond better.

I don't want to continue the contribution process.

It's for me, not you... I may still apply the fix I suggested when I have time... though perhaps I can re-open this as a new issue.

That is a really great idea, I'll look into that.

Would love to see what you come up with.


Maybe I should have opened my PR with a list of all the cases that it would fix.

That might have helped, but it's a balance. You make it sound like our Markdown support is terrible yet you're the first person in years to raise this as an issue. That of course doesn't mean your mistaken, only that it doesn't' seem to be bothering that many people. Perhaps we highlight a lot of Markdown better than you think? All I have to go on are the issues people file.

And I will note that my original PR does not exactly highlight the above examples as described. But when it fails, it does so in a very reasonable manner

"very reasonable manner" is of course subjective.

and doesn't mess up the rest of the document.

I'd wager some of those concerns could also be solved (more or less) with very small fixes also. For example, adding end: /\n\n/ to emphasis modes to terminate them at paragraph breaks. Now we may incorrectly trigger emphasis but it wouldn't linger on thru the WHOLE document... it would stop when it hits a paragraph break.

"Doesn't mess up the rest of the document" is a slightly higher priority for us than "correctly highlight all cases".

@joshgoebel
Copy link
Member

Doesn't mess up the rest of the document

If you'd be willing to point to a few actual real life cases of this happening I'd love to review them.

@joshgoebel
Copy link
Member

# 1*2=3\nabc should NOT italicize, because there is no closing asterisk

This is hard one since I'm not sure you can say with regex alone "does not include \n\n"... it's easy enough to look forward to find a match, but hard to stop when you hit a paragraph break. That's what on:begin could do easily, but triggering it on every * would be impossibly expensive.

# 1*2=3\nabc* should NOT italicize, because the header and the paragraph are different blocks.

This is getting into parsing (vs pattern matching), which is something we try and stay out of. And complicated the above case since we don't really seem to care about "paragraph breaks" after all we care about "blocks"... which is seems their are many different ways to create a new block scope.

A separate rule for "emphasis inside header" would be needed.

1*2=3\n> abc* should NOT italicize, because the paragraph and the block quote are different blocks.

Here it sounds like we're getting into full parsing again to know that > starts a new block. But really the problem here is the same as the first one... is it possible to use a look-ahead or not?

> 1*2=3\n> abc* should italicize, because it is a single block quote.

We don't really have any facilities for handling anything like this. I guess depending on the end rule this might "just work" if there was a clear regex to detect the end of a sequence of quotes... like \n\n or a line not starting with >... but this is going to start getting convoluted since you need to rematch sub-lines within the block to nicely highlight the >.


Lets take just the # 1*2=3\nabc* should NOT italicize example...

Knowing we're not interested in solving the full spec it's hard to talk about this without seeing real-life examples (and not just artificial ones). Obviously I can write a heading and throw a * in the middle of it, but how often does that happen in practice? How much complexity is worth adding to chase these edge cases?

Since a stray * here might indeed "break the whole document" (at least until the next *) it might be worth considering, but should it be solved with a whole new variant of emphasis or just with a helper:

{
  match: HEADER_RE,
  contains: {
    endsWithParent(EMPHASIS_MODE)
  }
}

Not 100% correct, but prevents entire document breakage... the problem then becomes if this is necessary for every single block item, etc... and slowly I start to see your rational for making blocks of everything, but very fast you're crossing over into "parse markdown" vs "pattern match" it...

@joshgoebel
Copy link
Member

joshgoebel commented Apr 21, 2021

You dismissing it as too much complexity

To be clear I was dismissing it as too much complexity given the stated problem and test cases provided. Things like this are always evaluated in the larger context.

I just realized that even TextMate doesn't highlight * spread across multiple lines... so obviously it's using a look-ahead trying to find the closing * on the same line.

*this is emphasis* for sure
but *is this emphasis
as well*

GitHub highlighting makes the same tradeoff. We'd be in good company if we adopted a similar solution... and it would add very little complexity. Right now though I wonder how many Markdown document's we're truly performing poorly on vs how many we're actually doing a better job highlighting than GitHub. :-) Though I do agree a false-positive here does seem to be worse than a false-negative...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants