Add support for backreferences #132
Conversation
👎 This falls afoul of our commandment not to parse RegExps. The reason we don't is that there are lots of edge cases and compatibility oddities we don't want to test or deal with; i.e., it's extremely difficult to do correctly. Off the top of my head, your code treats the following incorrectly:
That's fair; I was not comfortable with having to even avoid the … Do you have a better way of handling this particular issue? At first I was going to solve it by using states; the plan was to match …

In case you're not familiar with dollar quoting, here are some examples:

```sql
-- Empty string
select $$$$;

-- String with a dollar in it
select $$Blah $ blah$$;

-- Dollar on its own
select $_$$$_$;
-- or
select $anythinghere$$$anythinghere$;

-- String with `$$` in it
select $_$Blah $$ blah$_$;

-- Using a different tag so we can include both $$ and $_$
select $CustomTag$Blah $$ $_$ blah$CustomTag$;
```

(I'm very new to parsers/lexers/etc.) Any advice you can give would be very helpful.
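To make the connection to backreferences concrete: a single RegExp can only insist that the closing tag repeats the opening tag by using `\1`. This is my own standalone illustration (not moo code), with a tag pattern simplified from the examples above:

```javascript
// Standalone illustration (not moo's API): matching PostgreSQL
// dollar-quoted strings needs \1 so the closing tag repeats the opener.
// The tag pattern here is deliberately simplified.
const dollarQuoted = /\$([A-Za-z_][A-Za-z0-9_]*)?\$[\s\S]*?\$\1\$/;

console.log(dollarQuoted.test('$$Blah $ blah$$'));                        // true
console.log(dollarQuoted.test('$_$Blah $$ blah$_$'));                     // true
console.log(dollarQuoted.test('$CustomTag$Blah $$ $_$ blah$CustomTag$')); // true
console.log(dollarQuoted.test('$a$never closed'));                        // false
```

Note how the `$_$Blah $$ blah$_$` case works only because the lazy body keeps scanning past `$$` until the lookahead for `$` + group 1 + `$` succeeds.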
This doesn't solve the "don't parse regexps" commandment, but I've pushed another commit to address your examples:

- NO LONGER MATCHED: we no longer look for backreferences if there aren't any capture groups; this should bring this back to being a non-breaking change.
- NO LONGER MATCHED: I've made it so we only look for backreferences when there is a capture group, so we're not matching this any more.
- ALREADY INVALID: this isn't valid in moo currently anyway, because it references the first capture group, which moo makes via …
- VALID: this will still cause issues, but the "no octal escapes" rule should address this.
- ALREADY INVALID: this won't match the number 9 if you have more than 8 rules, in which case it matches the moo-generated 9th capture group.
- NO LONGER MATCHED: we no longer match references that start with a zero (i.e. …).
- FIXED: oops, yeah, I wrote …

I'm not expecting you to merge this; your commandment definitely makes sense. I do think it would be wise to advise users not to use octal escapes in the README, independent of this PR.
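The two hazards behind these cases (group renumbering after merging, and octal/backreference ambiguity) can be demonstrated directly in plain JavaScript. This is my own illustration; the "merged" pattern below is a hypothetical stand-in for what moo's rule consolidation produces, not moo's actual output:

```javascript
// Hazard 1: merging rules into one RegExp renumbers capture groups,
// so a rule's \1 no longer points where its author intended.
const ruleAlone = /(\w)\1/;        // doubled character: "aa", "bb", ...
const merged = /(x)|((\w)\1)/;     // hypothetical merge with another rule (x)
console.log(ruleAlone.test('ab')); // false: 'b' does not repeat 'a'
console.log(merged.exec('ab')[0]); // 'a': \1 now refers to (x), which did not
                                   // participate, so it matches the empty string

// Hazard 2: with no capture groups at all, \1 is a legacy octal escape,
// not a backreference -- hence the "no octal escapes" advice.
const octal = /\1/;
console.log(octal.test('\x01'));   // true: \1 here means the control char U+0001
```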
Hi @benjie. This is a cosmic coincidence, because I too am writing a parser for PostgreSQL using moo and nearley, and I have run afoul of the exact same issue: dollar-quoted strings.

I've been thinking about this problem, and I propose the following type of approach, done in a way that does not break the "do not parse RegExps" rule: omit the RegExp merging optimisation for such cases.
Presently, you can match tokens using either literal strings or RegExps. What if we allow the … To capture values during lexing, and make them available to a closure for subsequent matching, we could introduce a new class to the moo API. E.g.:

```js
class capture {
  constructor() {
    this.val = '';
  }
  set value(val) {
    this.val = val;
  }
  get value() {
    return this.val;
  }
  // Escape the captured text so it can be embedded literally in a RegExp
  get escaped() {
    return this.val.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
  }
  toString() {
    return this.value;
  }
}
```

moo could then be used somewhat like the following (though it may not be strictly syntactically correct):

```js
const moo = require('moo');

let dolq = '$';
let dolqlabel = /[A-Za-z\200-\377_]+?/;
let c = new moo.capture();

moo.states({
  start: {
    dolqstart: {match: dolq, push: 'state_dolq'},
  },
  state_dolq: {
    // 'capture' passes the moo.capture instance so moo knows where to store
    // what is matched by 'match', so it can be recalled later
    dolqlabel: {match: dolqlabel, capture: c},
    dolqstop: {match: dolq, pop: true},
  },
  state_dolq_literal: {
    dolq_stop: {match: () => `$${c}$`, pop: true},
    // RE match any string up to, but not including, our captured delimiter
    // (using the escaped form so it is safe inside a RE)
    dolq_literal: {match: () => new RegExp(`.*?(?=\\$${c.escaped}\\$)`), lineBreaks: true},
    // If there is no closing dollar quote, match everything we have in the buffer
    dolq_eof: {match: () => new RegExp(`.*?(?!\\$${c.escaped}\\$)`), lineBreaks: true},
  },
});
```

Obviously with this implementation there will be a performance impact, because the RE merging optimisation won't be possible. From the source:

```js
var pat = reUnion(match.map(regexpOrLiteral))
```

and:

```js
parts.push(reCapture(pat))
```

These lines appear to implement the consolidation of all the … What I propose is, for the case of dynamic token matching, not to optimise the lexer into a single RE, but to leave it as an array of REs and execute them in turn through a loop until a match is found. If the object type of a …

This unoptimised algorithm would only be in play when dynamic lexing is to be performed, and only for the scope of the state in which it occurs. In all other cases, the existing algorithm prevails. The impact is probably quite small where it is used, because the dynamic parsing would happen within a higher state level in the stack than the main parser, so the optimisation would only be lost while in that state of the lexer. And a slower implementation of dynamic lexing is better than no implementation of dynamic lexing.

Thoughts, @benjie and @nathan?

Damo.
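The proposed try-each-RE-in-turn fallback can be sketched standalone. This is my own illustration of the idea above, not moo's internals; the `matchDynamic` name and rule shape are hypothetical:

```javascript
// Sketch (hypothetical, not moo internals) of the proposed fallback:
// in a state containing dynamic rules, skip the single-RegExp merge and
// try each rule's pattern in turn at the current offset.
function matchDynamic(rules, input, offset) {
  for (const rule of rules) {
    // A dynamic rule supplies a function that builds its RegExp lazily,
    // e.g. from a previously captured dollar-quote tag.
    const pattern = typeof rule.match === 'function' ? rule.match() : rule.match;
    const re = new RegExp(pattern.source, 'y'); // sticky: anchor at `offset`
    re.lastIndex = offset;
    const m = re.exec(input);
    if (m) return { type: rule.type, text: m[0] };
  }
  return null;
}

const tag = '_'; // pretend an earlier `dolqlabel` rule captured this
const rules = [
  { type: 'dolq_stop', match: () => new RegExp(`\\$${tag}\\$`) },
  { type: 'dolq_literal', match: () => new RegExp(`[\\s\\S]*?(?=\\$${tag}\\$)`) },
];

const input = '$_$Blah $$ blah$_$';
console.log(matchDynamic(rules, input, 3));  // { type: 'dolq_literal', text: 'Blah $$ blah' }
console.log(matchDynamic(rules, input, 15)); // { type: 'dolq_stop', text: '$_$' }
```

Rule order matters here, as it does in moo generally: the close-delimiter rule is tried before the catch-all literal rule.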
I'm far too inexperienced at lexers to pass comment, sorry.
@tjvr thoughts? I really don't think we should use multiple RegExps at runtime, but parsing RegExps to find and alter backreferences still seems error-prone. However, it might be worthwhile here, because this type of literal is quite common in programming languages. All the same, if I remember correctly, we dropped support for capture groups because the use case didn't justify the additional complexity, and this adds even more complexity.
All good points, @nathan. You're right that we dropped support for capture groups because they added complexity, but also because we added value transforms, which solve the same problem. I can't think of an obvious workaround for backreferences. Warning against using … I think you're right that we won't merge this, since it requires parsing RegExps. I'm afraid I can't think of a good workaround right now, but I might think about it. :)
Are there other reasons apart from the performance impact, nathan? The lexer could only fall back to multiple REs for states that require it, and it could determine this at compile time. Or have I underestimated the performance impact of this approach? Like @benjie, I'm no expert on lexing.

Glad to be challenging great minds. :)

D.
Hey @tjvr; interested to hear if you've managed to think of a better solution to this. I'm keen to use it for tokenising PostgreSQL queries, but cannot see an easy way to achieve it without this feature. |
I'm trying to use moo to parse PostgreSQL syntax and everything's amazing except I can't parse the dollar-quoted strings because they need backreferences.
https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING
This PR adds backreference support.
This should be a completely non-breaking change; I've even preserved the old behaviour of throwing an error for capture groups if there are no backreferences.
I'm used to my code being auto-formatted by prettier and/or ESLint so apologies if I've made any formatting errors.