properly support lookahead expressions in begin matches #2135

joshgoebel · 2019-09-23T18:00:29Z

Passes the FULL remaining buffer to processLexeme so
that when it attempts to determine the next mode it can
run the ruleset against the FULL buffer, not just the
terminator that produced the match. The terminator alone
is not sufficient, since lookaheads aren't included
in the actual match.

joshgoebel · 2019-09-23T18:04:19Z

I think we may need to do this same thing for endOfMode to support end matches, can anyone confirm? The use case I cared about was begin matches, and that's what I've tested.

If this looks good, a few syntaxes can be updated now that we have PROPER lookahead support, since a few currently carry HACKS to work around the previously broken support, such as:

XML.js

        /*
        The lookahead pattern (?=...) ensures that 'begin' only matches
        '<style' as a single word, followed by a whitespace or an
        ending bracket. The '$' is needed for the lexeme to be recognized
        by hljs.subMode() that tests lexemes outside the stream.
        */
        begin: '<style(?=\\s|>|$)', end: '>',
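To make the reason for the `|$` hack concrete, here is a minimal sketch in plain JavaScript. It does not call highlight.js; it only compares the XML.js pattern with and without `|$`, first against a full buffer and then against the bare lexeme, which is roughly what the old rematch step saw. The variable names and the sample buffer are made up for illustration.

```javascript
// The begin pattern from XML.js, with and without the `|$` workaround.
const withHack = /<style(?=\s|>|$)/;
const withoutHack = /<style(?=\s|>)/;

// Hypothetical input buffer.
const buffer = "<style type='text/css'>";

// Against the full buffer the lookahead can see the following space,
// so the plain pattern finds the lexeme.
const lexeme = withoutHack.exec(buffer)[0];

// Re-testing only the lexeme: nothing follows "<style", so the plain
// lookahead fails, while `$` lets the hacked version succeed.
const rematchWithoutHack = withoutHack.test(lexeme);
const rematchWithHack = withHack.test(lexeme);

console.log(lexeme, rematchWithoutHack, rematchWithHack);
```

Once the rules always run against the full buffer, the `|$` alternative should no longer be needed.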

joshgoebel · 2019-09-23T18:08:45Z

Can someone tell me the correct way to run tests from the console? npm test seems to be confused about where ../../build is supposed to be.

Ok got it now, will get the tests passing. :-)

joshgoebel · 2019-09-23T19:09:48Z

Ha, now the issue is we have "broken"? rulesets that will actually match over and over if they can see the full data, such as Fortran:

/(?=\b|\+|\-|\.)(?=\.\d|\d)(?:\d+)?(?:\.?\d*)(?:[de][+-]?\d+)?\b\.?/im

This can actually return a 0 length match, which won't push the position forward, then it'll find AGAIN the same 0 length match in an infinite loop. :-)

I'm thinking perhaps we should ignore 0 length matches.
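A rough sketch of what "ignore 0 length matches" could look like, using the Fortran rule quoted above. The `scan` function is a hypothetical stand-in, not the highlight.js parser loop, and stepping forward one character on an empty match is just one possible policy.

```javascript
// The Fortran number rule quoted above, unchanged.
const fortranNumber = /(?=\b|\+|\-|\.)(?=\.\d|\d)(?:\d+)?(?:\.?\d*)(?:[de][+-]?\d+)?\b\.?/im;

// On "1x" every non-empty attempt fails the trailing \b, so the regex
// backtracks to an empty match at index 0.
const emptyMatch = fortranNumber.exec("1x");

// Hypothetical scanner: collect non-empty matches, and step past empty
// ones so the position always moves forward.
function scan(rule, text) {
  const re = new RegExp(rule.source, "g" + rule.flags);
  const found = [];
  let match;
  while ((match = re.exec(text)) !== null) {
    if (match[0].length === 0) {
      // Without this guard lastIndex would stay put and the same
      // empty match would be returned forever.
      re.lastIndex = match.index + 1;
      continue;
    }
    found.push(match[0]);
  }
  return found;
}

const numbers = scan(fortranNumber, "1x 2.5 y");
console.log(emptyMatch[0].length, numbers);
```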

egor-rogov · 2019-10-04T08:41:11Z

Hey @yyyc514, how is it going with this PR? Haven't heard from you in a while.

marcoscaceres · 2019-10-07T00:52:09Z

Heh, so this is how it all started 😂

joshgoebel · 2019-10-07T00:54:17Z

I got a little distracted, it's true. But I haven't forgotten this. :-)

joshgoebel · 2019-10-10T15:28:52Z

Diff gets really confused around line 490 or so... probably easier to read lines 539 to 644 outside the diff. They should be fairly easy to follow.

I shrunk processLexeme and moved a lot of the function into doBeginMatch and doEndMatch.

The end of compileMode moves into buildModeRegex which builds a 'super matcher' or whatever you'd like to call it. :-) It pretends to be a regex (supporting exec) but also returns metadata reporting WHICH mode matched, etc... so we don't have to recheck.
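For readers who want the gist without opening the diff, the 'super matcher' idea can be sketched roughly as follows. This is not the PR's buildModeRegex: `buildMatcher`, `countGroups`, and the three sample rules are hypothetical, and the sketch assumes the rules contain no backreferences (those would need renumbering once the sources are joined).

```javascript
// Count capture groups in a regex source. Appending "|" lets the
// pattern match the empty string, so exec always succeeds.
function countGroups(source) {
  return new RegExp(source + "|").exec("").length - 1;
}

// Hypothetical helper: wrap each rule in its own capture group, join
// them with "|", and remember which group number belongs to which rule.
function buildMatcher(rules) {
  const ruleByGroup = new Map();
  let groupIndex = 1;
  const parts = rules.map((rule) => {
    ruleByGroup.set(groupIndex, rule);
    // Skip over the wrapper group plus any groups inside the rule.
    groupIndex += 1 + countGroups(rule.begin);
    return "(" + rule.begin + ")";
  });
  const combined = new RegExp(parts.join("|"), "gm");

  return {
    // Looks like a regex exec, but also reports which rule matched.
    exec(text, from = 0) {
      combined.lastIndex = from;
      const match = combined.exec(text);
      if (!match) return null;
      for (const [index, rule] of ruleByGroup) {
        if (match[index] !== undefined) {
          return { text: match[0], index: match.index, rule };
        }
      }
      return null;
    }
  };
}

const matcher = buildMatcher([
  { name: "tag", begin: "<style(?=\\s|>)" },
  { name: "number", begin: "\\b(\\d+)(\\.\\d+)?" },
  { name: "keyword", begin: "\\b(?:if|else)\\b" }
]);

const text = "if <style> 42";
const seen = [];
let pos = 0;
let hit;
while ((hit = matcher.exec(text, pos)) !== null) {
  seen.push(hit.rule.name + ":" + hit.text);
  pos = hit.index + Math.max(hit.text.length, 1);
}
console.log(seen);
```

Because the combined regex always runs against the full text, the lookahead in the `tag` rule sees the `>` that follows the lexeme, and no second pass over the rules is needed to find out which one matched.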

joshgoebel · 2019-10-10T15:40:25Z

Broke the browser build somehow, no idea how.

egor-rogov · 2019-10-10T18:34:23Z

I'm gonna look into this, but do not expect a quick feedback (:

joshgoebel · 2019-10-10T19:34:19Z

No problem. Nothing too advanced here I don’t think. It’s really just a purer implementation of the goals of the original system. Maybe it didn’t occur to them at the time you could just use capture groups to track which rule matched without the need to re-run the rule sets.

joshgoebel · 2019-10-10T23:25:41Z

Not saying it's worth the time but so many places where we use returnBegin right now could be replaced with look-aheads with no negative issues, and perhaps even resulting in simplifying the grammar, like in the abnf case. (since most of the time you're replacing nesting with just a single, simple regex)
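As an illustration of the kind of rewrite meant here, consider a made-up rule that highlights the key in `key = value`. Neither fragment is taken from the abnf grammar or any other shipped grammar, and the `attr` class is only an example; the objects are plain data, so the sketch only exercises the look-ahead regex itself.

```javascript
// Style 1 (hypothetical): match the key together with "=", hand the
// text back with returnBegin, and let a nested mode pick out the key.
const withReturnBegin = {
  begin: /[a-z_]+\s*=/,
  returnBegin: true,
  end: /=/,
  contains: [{ className: "attr", begin: /[a-z_]+/ }]
};

// Style 2 (hypothetical): a single rule whose look-ahead checks for
// the "=" without consuming it.
const withLookahead = {
  className: "attr",
  begin: /[a-z_]+(?=\s*=)/
};

const sample = "timeout = 30";
const key = withLookahead.begin.exec(sample)[0];
console.log(key);
```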

egor-rogov · 2019-10-12T13:47:24Z

I wish I had more time to grok this...
Anyway, what I see:

- all tests are passing
- speed remains roughly the same (around 9 sec on my laptop for npm test)
- lookaheads work!

So I can see no reason to delay this change. However, I suggest splitting it into 2 (maybe more) PRs: one for highlight.js itself, and the other(s) for improving the grammars of languages.

@marcoscaceres You may want to have a look too, because it changes the core of the system.

egor-rogov · 2019-10-12T13:56:26Z

> I think we may need to do this same thing for endOfMode to support end matches, can anyone confirm? The use case I cared about was begin matches, and that's what I've tested.

I think it makes perfect sense.
@yyyc514 it would be great if you can add a section to the developer's docs on what regexp features are supported (like lookaheads (thanks to this PR) and backreferences (thanks to #1897)) and what are not (lookbehinds?).

joshgoebel · 2019-10-12T14:10:05Z

Well, look behinds technically aren't in JS yet, are they?... but when they are added this PR should allow them to work just fine with begin matchers, and we should keep that in mind for end matchers also.

When you say "developer docs" which docs are you referring to exactly?

joshgoebel · 2019-10-12T14:12:51Z

> However, I suggest splitting it into 2 (maybe more) PRs: one for highlight.js itself, and the other(s) for improving the grammars of languages.

I can definitely do that. If we didn't "Always squash", this could stand alone. I'm not sure I'm a fan of always squashing. I see how it makes things easier for people new to git, but when you have something like this that's already packaged in nice small commits, it's kind of annoying to have to split it out AGAIN.

I see a dropdown, perhaps it's possible to change it to non-squash... in which case I'd just clean up the one weird commit and then leave the history as is, rebase onto master, and then do a clean fast-forward merge.

egor-rogov · 2019-10-12T15:01:46Z

> When you say "developer docs" which docs are you referring to exactly?

The one in docs/ folder (which goes to https://highlightjs.readthedocs.io/en/latest/)

joshgoebel · 2019-10-12T15:05:03Z

Would you suggest a new document to talk about regex features?

egor-rogov · 2019-10-12T15:06:31Z

language-guide looks like a proper place for it.

joshgoebel · 2019-10-12T15:07:17Z

I think I might keep it small and focus on what we DON'T support... I think the idea is that it's regex, so it should just work. If someone wants to learn about regex there are all sorts of resources.

So right now the caveats (after this PR is merged):

- "look-ahead" doesn't work properly in end expressions.
- "look-behind" (when JS supports it) also doesn't work properly in end expressions.

egor-rogov · 2019-10-12T15:09:26Z

I thought that what we do support (and since what version) may be also important for those who previously stumbled on something unsupported.

joshgoebel · 2019-10-15T13:22:00Z

Does that addition to docs help?

joshgoebel · 2019-10-17T14:23:59Z

@egor-rogov We good here yet?

docs/language-guide.rst

egor-rogov · 2019-10-17T14:58:32Z

Ok, here we go!
I think we shouldn’t squash to keep separate commits.

joshgoebel · 2019-10-17T14:59:46Z

Agree. I'll rebase, fix up the last commit, and then do a plain Jane merge when ready.

- begin matches are matched a single time (they no longer need to be rematched after being found)
- look-ahead should now work properly for begin matches because of this change
- should be a tiny bit faster

Before

The old parser would build a list of regexes per mode and then combine that into a large regex. This is what was used to scan your code for matches. But after a match was found it had no way of knowing WHICH rule had matched, so it would then have to re-run all the rules sequentially on the bit of matched text to figure out which rule had matched. The problem is that while the original matcher was running against the full code, this "rematch" was only running against the matched text. So look-ahead matches would naturally fail because the content they were trying to look ahead to was no longer present.

After

We take the list of regexes per mode and combine them into a larger regex, but with match groups. We keep track of which match group position corresponds to which rule. Now when we hit a match we can check which match group was matched and find the associated rule/mode without having to double check. Look-ahead begin matches now "just work" because the rules are always running against the full body of text and not just a subset.

Caveats

This doesn't solve look-ahead for end matching, so naturally it also does nothing for endSameAsBegin. I.e., don't expect look-aheads to work properly in those situations yet.

joshgoebel mentioned this pull request Sep 23, 2019

Request: Support regex look-ahead for begin and end matchers #1349

Closed

egor-rogov mentioned this pull request Sep 25, 2019

Some existing rules are unrepeatable and should be corrected #2140

Closed

joshgoebel added the enhancement and on hold labels Oct 7, 2019

joshgoebel force-pushed the proper_lookahead_support branch from 2313427 to 2981f8e Compare October 10, 2019 15:24

joshgoebel changed the title ~~properly support lookahead expresions in matches~~ [WIP] properly support lookahead expresions in matches Oct 10, 2019

joshgoebel force-pushed the proper_lookahead_support branch from fd9b786 to d7bdc7f Compare October 10, 2019 16:09

joshgoebel removed the on hold label Oct 11, 2019

joshgoebel changed the title ~~[WIP] properly support lookahead expresions in matches~~ properly support lookahead expresions in matches Oct 11, 2019

joshgoebel self-assigned this Oct 11, 2019

joshgoebel changed the title ~~properly support lookahead expresions in matches~~ properly support lookahead expresions in begin matches Oct 11, 2019

joshgoebel force-pushed the proper_lookahead_support branch from 2d8e2de to 4240c13 Compare October 12, 2019 14:17

joshgoebel added this to the 9.15.11 milestone Oct 12, 2019

joshgoebel added the parser label Oct 13, 2019

egor-rogov requested changes Oct 17, 2019

docs/language-guide.rst (outdated, resolved)

egor-rogov approved these changes Oct 17, 2019


joshgoebel added 10 commits October 19, 2019 17:05

remove |$ hack from xml

4d9c06d

remove |$ hack from stata

b46e6a3

remove |$ hack from livescript

2466174

remove incorrect comment

6987a21

remove |$ hack from coffeescript

80413f5

(abnf) use simpler look-ahead rule

1902fd2

(brainfuck) use look-ahead vs returnBegin

2bc4024

(stylus) simplify with look-aheads

33601b4

first pass as docs for what regex features we support

bec856c

joshgoebel force-pushed the proper_lookahead_support branch from 9187e19 to bec856c Compare October 19, 2019 21:06

joshgoebel merged commit cf37996 into highlightjs:master Oct 19, 2019

joshgoebel mentioned this pull request Oct 20, 2019

Provide access to a parse tree as an alternative to HTML output #1086

Closed

joshgoebel mentioned this pull request Oct 30, 2019

Highlightjs not respecting parent regex of a sub mode #2238

Closed

joshgoebel mentioned this pull request Jan 31, 2020

Freezing issue with coffeescript regex under some circumstances scrivo/highlight.php#66

Closed

gregives mentioned this pull request Feb 8, 2020

[plugin] Idea: Support YAML front matter in Markdown files #2391

Closed

joshgoebel deleted the proper_lookahead_support branch February 15, 2020 13:08

joshgoebel mentioned this pull request Oct 5, 2021

enh(fsharp) Global overhaul #3348

Merged

2 tasks
