enh(markdown) Disallow _ from triggering emphasis/bold inside words #3152
Conversation
@@ -6,13 +6,19 @@ __Bold__
__Bold *then italic*__
**Bold _then italic_**
**Bold and <br/>**
**_[this](https://google.com)_**
__*[this](https://google.com)*__
Why this change? Generally we should add new tests, not change existing ones. The former still seems like a valid test to me, no?
The rule to match underscores checks whether there is a space or newline before it; this is how it prevents matching underscores inside of words. Since the containing bold has already matched the asterisks, the underscore rule can't match the space/newline. A workaround might be to add a rule specifically for `**_`, `___`, and `*__`, but it's a tradeoff in complexity. I can add rules for that if you'd like.
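For reference, the workaround mentioned here might look roughly like the following. This is a hypothetical sketch (not code from this PR or from highlight.js core): dedicated begin/end pairs, in highlight.js mode-object shape, for each bold-plus-underscore combination.

```javascript
// Hypothetical sketch: explicit rules for the three combinations
// where an underscore opens emphasis right next to a bold delimiter.
// The { className, begin, end } shape follows highlight.js grammar
// conventions; the rule list itself is illustrative.
const BOLD_UNDERSCORE_COMBOS = [
  { className: 'strong', begin: /\*\*_/, end: /_\*\*/ },
  { className: 'strong', begin: /___/,   end: /___/ },
  { className: 'strong', begin: /\*__/,  end: /__\*/ }
];

// The first begin pattern fires on input like the test case above:
console.log(BOLD_UNDERSCORE_COMBOS[0].begin.test("**_[this](https://google.com)_**")); // true
// ...but not on plain bold, so it only adds the missing combination:
console.log(BOLD_UNDERSCORE_COMBOS[0].begin.test("**bold**")); // false
```

Each extra delimiter combination is one more rule to maintain, which is the complexity tradeoff being discussed.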
I'm not sure I follow. Are you saying you removed these tests because they now fail? That's not what we need to be doing.
If I make changes so that the parser is more correct, it's too complex and you'd rather have something simpler and "good enough". If I go the simpler route, certain edge cases are going to regress, but other, more common use cases will improve. It's a tradeoff.
Ok, it's certainly possible we may need tradeoffs. Yet please don't hide these decisions by silently removing tests... If a change is making a tradeoff, then the PR notes should explicitly state (with examples) what that compromise is, what the thinking is behind the change, etc... i.e. if you're breaking an edge case to improve a common case, that reasoning should be outlined in the PR notes so that it can be easily reviewed to decide whether the tradeoff is an acceptable one or not.
> quoted **Bold _then italic_**
> _[this](https://google.com)_
> *[this](https://google.com)*
Ditto.
As above, the quoted rule has matched the `>` immediately before the underscore, preventing the underscore rule from matching the space.
This could, I think, be resolved by simply changing blockquote to use a look-ahead for the spaces:

    const BLOCKQUOTE = {
      className: 'quote',
      begin: /^>(?=\s+)/,
      contains: inline(),
      end: '$'
    };
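To illustrate why the look-ahead helps, here is a small standalone sketch (plain regexes, outside of any highlight.js grammar): the lookahead form consumes only the `>`, so the whitespace after it is still available for rules that require a space before `_`.

```javascript
// Lookahead form: asserts whitespace follows ">" without consuming it.
const WITH_LOOKAHEAD = /^>(?=\s+)/;
// Consuming form: swallows the whitespace along with ">".
const CONSUMING = /^>\s+/;

const line = "> _emphasis_";

// The lookahead match is just ">", so " _emphasis_" (leading space
// included) remains visible to inner rules.
console.log(line.match(WITH_LOOKAHEAD)[0]); // ">"
console.log(line.match(CONSUMING)[0]);      // "> "
```

With the consuming form, the space has already been eaten by the blockquote rule by the time the underscore rule runs, which is exactly the failure described above.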
Yeah, this is way too much complexity. What about:
Seems to pass existing tests plus your new tests, please confirm?
That's just markdown... ¯\\_(ツ)_/¯
Ok, that's a fine answer for that. :-) Markdown be weird sometimes. Not sure I was aware of that quirk.
Did you see my proposed solution above? Can you confirm it resolves all test cases (including the existing ones)?
I don't know how to answer this. What seems like complexity to one person appears like correctness to another. And what seems like simplicity to one person feels ugly to another. I wanted to integrate the changes from my other PR after this, but it does not appear that's going to happen.
It does appear to work.
@polygitapp I appreciate all your effort on this and I'm truly sorry you're frustrated. From my perspective we were making progress - this PR was much simpler and more focused and therefore easier to review. Often new contributors' PRs are far more complex than necessary. That's truly not a knock on you or your PR or skills. It's rather common and expected of new contributors. This is a complex project; language grammars are complex. That's why we're here to review PRs in the first place and find simpler solutions when possible. Right now what we have is:
This result seems more correct and a win for all Markdown users - with zero regressions. Your effort and involvement have driven these changes even if the final solution was proposed by the core team. Hopefully you would next update this PR to the simpler approach proposed above. Then we could do another review and hopefully get this merged.
I'm happy to work with you on whatever item might be next on the list after we get this merged. There will hopefully be simpler approaches to those issues as well. It might save you some time (and be easier) to first open an issue clearly explaining a particular problem to get some feedback before making a PR - particularly if you're finding the solution to be non-trivial.
It didn't really require an answer - that's why I immediately followed it with a proposed alternative.
The
I don't know how to answer this. :-)
With your changes, highlight.js still isn't going to handle things like the following correctly.

Which may be fine for general-purpose syntax highlighting, but for my use case I need more accuracy. And if a 40+ line change is too complex, there is no way I'm getting the 200+ line changes from #3098 merged in, which build on top of it. I appreciate the work you've done on highlight.js; it has been very helpful for my project. But I think I just need to maintain my own fork of the markdown syntax.
There would of course be no easy way for me to know that. Your test coverage did not include that test. :-)
What is the problem there? That
I don't think we are ever going to be 100% accurate here. Markdown is crazy complex to parse. There is a reason a correct Markdown parser is ~20-40kb gzipped (I think that's what I found when I looked the other day) and our grammar is 1.5kb gzipped. We are not looking to be a fully correct Markdown parser (or a fully correct parser for any complex language).
Probably not if your goal is getting your solution merged as-is rather than working on a "good enough" solution that might improve the library for the other 99% of our target audience. It's also of course STILL possible some/many of the things you want to do can be done much simpler, but if we don't end up exploring them, I understand. And again 100% accuracy is probably not realistic in core.
Here are some thoughts on going down your own road:
I'd love to see a 3rd party grammar that does the latter. I believe it's fully possible to do with our existing plugin support. I've encouraged a few people to hook us up to a full parser, but I haven't gotten any bites yet. If you wanted our highlighting but with 100% fidelity, the simplest way to get it would be to just use an existing Markdown parser (if you can stomach the download size). AND it's trivially easy - IF you can get an AST generated easily from another source. The question is how easily can you get an AST from a markdown parser vs say HTML.
https://github.com/remarkjs/remark/tree/main/packages/remark-parse Looks like exactly what you'd want...
This is the frustrating part. You dismissed it as too much complexity without realizing how badly the existing markdown parser behaves with real-world content, and immediately pushed your own "simpler" solutions without trying to engage with the actual issues I'm trying to solve. To expand on the example:
Would using a lookahead or

Maybe I should have opened my PR with a list of all the cases it would fix. I definitely could have been more elaborate and descriptive in my description. And I will note that my original PR does not exactly highlight the above examples as described; but when it fails, it does so in a very reasonable manner and doesn't mess up the rest of the document.
Nor was I trying to make it 100% accurate. My PR isn't 20kb gzipped.
That is a really great idea, I'll look into that. Anyways, please leave this closed. I don't want to continue the contribution process.
Since we are NOT interested in implementing the "full spec" (which one?) we can only look at isolated examples... pointing to some "real world content" might help... but if someone wishes to improve our Markdown support we need to pick very particular examples and then decide if those are fixable with a reasonable amount of complexity and effort. We purposely are NOT a full parser for any language. You can do a LOT of things with our grammars that we do NOT encourage or desire in the core library (because it's too complex and not easily maintainable). Our goal is NOT to fully understand a language. That philosophy is kind of at odds with something [super complex] like Markdown. The Github spec alone has 12 rules for the interactions between emphasis and strong emphasis and then 5 additional rules to attempt to clarify the ambiguities of the first 12 rules. And 130 different examples showing how all these rules intermingle.
Again, the problem is your tests were insufficient and the problem wasn't fully declared. You provided only 2 tests and also broke 2 tests. My solution passed your tests and our prior tests. If you had included additional tests with edge cases I might have been able to respond better.
It's for me, not you... I may still apply the fix I suggested when I have time... though perhaps I can re-open this as a new issue.
Would love to see what you come up with.
That might have helped, but it's a balance. You make it sound like our Markdown support is terrible, yet you're the first person in years to raise this as an issue. That of course doesn't mean you're mistaken, only that it doesn't seem to be bothering that many people. Perhaps we highlight a lot of Markdown better than you think? All I have to go on are the issues people file.
"very reasonable manner" is of course subjective.
I'd wager some of those concerns could also be solved (more or less) with very small fixes. For example, adding

"Doesn't mess up the rest of the document" is a slightly higher priority for us than "correctly highlight all cases".
If you'd be willing to point to a few actual real-life cases of this happening, I'd love to review them.
This is a hard one, since I'm not sure you can say with regex alone "does not include \n\n"... it's easy enough to look forward to find a match, but hard to stop when you hit a paragraph break. That's what
This is getting into parsing (vs pattern matching), which is something we try to stay out of. And it complicates the above case, since we don't really seem to care about "paragraph breaks" after all; we care about "blocks"... and it seems there are many different ways to create a new block scope. A separate rule for "emphasis inside header" would be needed.
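The "does not include \n\n" idea mentioned above can actually be approximated in a single regex with a negative lookahead that allows single newlines but rejects a blank line. A rough standalone sketch (my own example, not from the grammar):

```javascript
// The emphasis body may contain any non-"*" character, including a
// single newline, but a newline immediately followed by another
// newline (a paragraph break) is rejected by the negative lookahead.
const EMPHASIS_NO_PARA_BREAK = /\*(?:[^*\n]|\n(?!\n))+\*/;

console.log(EMPHASIS_NO_PARA_BREAK.test("*spans one\nline break*"));       // true
console.log(EMPHASIS_NO_PARA_BREAK.test("*broken\n\nacross paragraphs*")); // false
```

This only covers literal blank lines, though; as noted above, Markdown has many other ways to open a new block, which is where pure pattern matching starts to fall short.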
Here it sounds like we're getting into full parsing again to know that `>` starts a new block. But really the problem here is the same as the first one... is it possible to use a look-ahead or not?
We don't really have any facilities for handling anything like this. I guess depending on the

Let's take just the

Knowing we're not interested in solving the full spec, it's hard to talk about this without seeing real-life examples (and not just artificial ones). Obviously I can write a heading and throw a

Since a stray

    {
      match: HEADER_RE,
      contains: [
        endsWithParent(EMPHASIS_MODE)
      ]
    }

Not 100% correct, but it prevents entire-document breakage... the problem then becomes whether this is necessary for every single block item, etc... and slowly I start to see your rationale for making blocks of everything, but very fast you're crossing over into "parse markdown" vs "pattern match" it...
To be clear, I was dismissing it as too much complexity given the stated problem and the test cases provided. Things like this are always evaluated in the larger context.

I just realized that even TextMate doesn't highlight:

    *this is emphasis* for sure
    but *is this emphasis
    as well*

GitHub highlighting makes the same tradeoff. We'd be in good company if we adopted a similar solution... and it would add very little complexity. Right now, though, I wonder how many Markdown documents we're truly performing poorly on vs how many we're actually doing a better job highlighting than GitHub. :-) Though I do agree a false positive here does seem to be worse than a false negative...
Followup from #3098
Changes
Disallow underscores from triggering emphasis/bold inside of words. Note that asterisks still work inside of words.
@joshgoebel You mentioned either using `beforeMatch` or matching all words to achieve this. `beforeMatch` appears to have been removed. Its replacement, `match` with an array of regexes, afaict doesn't handle end blocks. And adding a rule to match all words doesn't, I think, work with words like `_WORD_WITH_UNDERSCORE_`. So I ended up using a nested rule where the outer rule matches a space and the underscore, and the inner rule matches the underscore and also begins the class.

Checklist
CHANGES.md
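The outer-rule idea in the description above can be illustrated with the begin pattern alone (a sketch with illustrative names and regexes, not the exact code from this PR): requiring start-of-input or whitespace before the underscore is what stops in-word underscores from opening emphasis.

```javascript
// Illustrative outer begin: "_" only counts when preceded by the
// start of the string or whitespace, and not immediately followed
// by whitespace.
const OUTER_BEGIN = /(?:^|\s)_(?!\s)/;

console.log(OUTER_BEGIN.test("some _emphasis_ here")); // true
console.log(OUTER_BEGIN.test("WORD_WITH_UNDERSCORE")); // false
```

In the actual grammar the outer rule would then contain an inner rule that re-matches the `_` and opens the emphasis scope, so the highlighted span starts at the underscore rather than at the consumed space.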