-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
properly support lookahead expresions in begin matches #2135
properly support lookahead expresions in begin matches #2135
Conversation
I think we may need to do this same thing for If this looks good there are a few syntaxes that can be updated now that we have PROPER lookahead support - since a few current have HACKS to deal with the prior broken support such as: XML.js
|
Can someone tel me the correct way to run tests from console? Ok got it now, will get the tests passing. :-) |
Ha, now the issue is we have "broken"? rulesets that will actually match over and over if they can see the full data, such as Fortran: /(?=\b|\+|\-|\.)(?=\.\d|\d)(?:\d+)?(?:\.?\d*)(?:[de][+-]?\d+)?\b\.?/im This can actually return a 0 length match, which won't push the position forward, then it'll find AGAIN the same 0 length match in an infinite loop. :-) I'm thinking perhaps we should ignore 0 length matches. |
Hey @yyyc514, how is it going with this PR? Haven't heard from you in a while. |
Heh, so this is how it all started 😂 |
I got a little distracted, it's true. But I haven't forgotten this. :-) |
2313427
to
2981f8e
Compare
Diff gets really confused around line 490 or so... probably easier to read lines 539 to 644 outside the diff. They should be fairly easy to follow. I shrunk The end of |
Broke the browser build somehow, no idea how. |
fd9b786
to
d7bdc7f
Compare
I'm gonna look into this, but do not expect a quick feedback (: |
No problem. Nothing too advanced here I don’t think. It’s really just a purer implementation of the goals of the original system. Maybe it didn’t occur to them at the time you could just use capture groups to track which rule matched without the need to re-run the rule sets. |
Not saying it's worth the time but so many places where we use returnBegin right now could be replaced with look-aheads with no negative issues, and perhaps even resulting in simplifying the grammar, like in the abnf case. (since most of the time you're replacing nesting with just a single, simple regex) |
I wish I had more time to grok this...
So I can see no reason to delay this change. However I suggest to split it in 2 (maybe more) PRs: one for @marcoscaceres You may want to have a look either, 'cause it changes the core of the system. |
I think it makes perfect sense. |
Well, look behinds technically aren't in JS yet, are they?... but when they are added this PR should allow them to work just fine with begin matchers, and we should keep that in mind for end matchers also. When you say "developer docs" which docs are you referring to exactly? |
I can definitely do that. If we didn't "Always squash" this could stand-alone. I'm not sure I'm a fan of this always squashing. I see how it makes it easier for people new to git, but when you have something like this that's already packaged in nice small commits it's kind of annoying to have to split it out AGAIN. I see a dropdown, perhaps it's possible to change it to non-squash... in which case I'd just clean up the one weird commit and then leave the history as is, rebase onto master, and then do a clean fast-forward merge. |
2d8e2de
to
4240c13
Compare
The one in |
Would you suggest a new document to talk about regex features? |
|
I think I might keep it small and focus on what we DON'T support... I think the idea is its regex... it should just work... If someone wants to learn about regex there are all sorts of resources. So right now the caveats (after this PR is merged):
|
I thought that what we do support (and since what version) may be also important for those who previously stumbled on something unsupported. |
Does that addition to docs help? |
@egor-rogov We good here yet? |
Ok, here we go! |
Agree. I'll rebase and fixup the last commit and the plain jain merge when ready. |
- begin matches are matched a single time (they no longer need to be rematched after found) - look-ahead should now work properly for begin matches because of this change - should be a tiny bit faster Before The old parser would build a list of regexes per mode and then combine that into a large regex. This is what was used to scan your code for matches. But after a match was found it had no way on known WHICH match - so it would then have to re-run all the rules sequentially on the bit of match text trying to figure out which rule had matched. The problem is while the original matcher was running agianst the full code this "rematch" was only running aginst the matched text. So look-ahead matches would naturally fail becasue the content they were tryign to look-ahead to was no longer present. After We take the list of regexes per mode and combine then into a larger regex, but with match groups. We keep track of which match group position correspond to which rule. Now when we hit a match we can check which match group was matched and find the associated rule/mode that was matched withotu having to double check. Look-ahead begin matches now "just work" because the rules are always running against the full body of text and not just a subset. Caveats This doesn't solve look-ahead for end matching so naturally it also does nothing for endSameAsBegin. IE, don't expect look-aheads to work properly in those situations yet.
9187e19
to
bec856c
Compare
processLexeme
sothat when it attemps to detemine the next mode it can
run the ruleset against the FULL buffer, not just the
terminator which resulted in a match (which will not
be sufficient enough since lookaheads aren't included
in the actual match)