Update tokenizer to use Moo instead #224
Conversation
inferrinizzard commented on Jun 5, 2022
- added moo as tokenizer library
- temporary converter from moo tokens into custom tokens
I know it's a draft PR and not meant for merging, but I had a quick glance and a few things caught my eye.
@@ -36,7 +37,7 @@
     "@typescript-eslint/indent": "off",
     "@typescript-eslint/lines-between-class-members": "off",
     "@typescript-eslint/naming-convention": "error",
-    "@typescript-eslint/no-unused-vars": "error",
+    "@typescript-eslint/no-unused-vars": "warn",
Why warnings instead of errors?
I have found that it's better not to have the middle ground that "warn" provides. It raises the question: if it's only a warning, should it be fixed or not?
this rule in particular blocks tests from being run when lines are partially commented out. It should eventually be addressed before this is merged in, but while work is in progress it's mostly a hindrance.
-    ...Object.values(reservedFunctions).reduce((acc, arr) => [...acc, ...arr], []),
-    ...Object.values(reservedKeywords).reduce((acc, arr) => [...acc, ...arr], []),
+    ...Object.values(reservedFunctions).reduce((acc, arr) => acc.concat(arr), []),
+    ...Object.values(reservedKeywords).reduce((acc, arr) => acc.concat(arr), []),
Why this whole "replace spread with concat"?
- faster for large lists
Whether there's any speed difference really comes down to how a browser or Node.js implements the spread operator. Ideally these run at exactly the same speed; for example, Babel will compile the spread to concat when targeting platforms without spread support.
As always with performance, we should actually measure to see whether there is any practical difference in our case.
IMHO a larger problem is that we're repeating this `.reduce((acc, arr) => acc.concat(arr), [])` snippet all over. Essentially it's a `flatten()` function. Extracting a separate function for this would also provide better optimization opportunities if needed - for example, one could use a for loop instead of reduce.
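A minimal sketch of what such a helper could look like (the flatten name and the loop-based implementation are illustrative assumptions, not code from the PR):

```ts
// Illustrative flatten() helper that replaces the repeated
// .reduce((acc, arr) => acc.concat(arr), []) snippet.
// A plain loop avoids allocating an intermediate array on every step.
function flatten<T>(arrays: T[][]): T[] {
  const result: T[] = [];
  for (const arr of arrays) {
    for (const item of arr) {
      result.push(item);
    }
  }
  return result;
}

// Usage, mirroring the reserved-word lists in the diff above:
// const reservedWords = [
//   ...flatten(Object.values(reservedFunctions)),
//   ...flatten(Object.values(reservedKeywords)),
// ];
```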
'[]': '(?:\\[[^\\]]*(?:$|\\]))(\\][^\\]]*(?:$|\\]))*',
'""': `(?:${stringPrefixes}"[^"\\\\]*(?:\\\\.[^"\\\\]*)*(?:"|$))+`,
"''": `(?:${stringPrefixes}'[^'\\\\]*(?:\\\\.[^'\\\\]*)*(?:'|$))+`,
// '$$': '(?<tag>\\$\\w*\\$)[\\s\\S]*?(?:\\k<tag>|$)', // does not work with moo
Why doesn't this work with Moo and what could be done about this?
no-context/moo#132
moo doesn't allow capturing groups, since it tries to avoid parsing regexes in general
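For reference, here is the commented-out pattern from the diff above written as a plain RegExp; the named capture group and back-reference are exactly the parts Moo refuses to accept (the snippet just illustrates the pattern itself, not Moo's API):

```ts
// The tagged dollar-quote pattern relies on a capture group plus a
// back-reference so that the closing tag must equal the opening tag.
const taggedDollarString = /(?<tag>\$\w*\$)[\s\S]*?(?:\k<tag>|$)/;

'$xx$string_content$xx$ rest'.match(taggedDollarString)?.[0];
// => '$xx$string_content$xx$'
// Moo rejects this pattern because it contains a capture group.
```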
Seems to be a pretty serious limitation. This issue has been open since 2019 and no real progress has happened on it in Moo since then :(
I see four possibilities:
- Drop support for `$$` strings from sql-formatter. Not really feasible.
- Attempt fixing this issue in Moo. Will take quite some time at the very minimum. Might never happen.
- Use a patched version of Moo (applying this back-reference support PR). Not future-proof. Better to be avoided.
- Don't use Moo. Find some other lexer or enhance our home-grown one.
`$$` strings do work, the issue is only for tagged `$$` strings such as `$xx$string_content$xx$`.
moo can match a token of the form `$xx$string_content$yy$`, it just can't ensure that xx == yy without the backreference.
we could either:
- parse `$xx$`, string_content, `$yy$` as 3 separate tokens + validate and join in a post processor (a rough sketch of this follows below), or
- just parse `$xx$string_content$yy$` as is, and leave it up to the user to ensure xx == yy
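A rough sketch of the post-processing step from the first option; the token shape and type names (dollarTagOpen, stringContent, dollarTagClose) are assumptions for illustration, not the PR's actual token types:

```ts
// Hypothetical token shape, for illustration only.
interface Token {
  type: string;
  value: string;
}

// Join an (open tag, content, close tag) triple into one string token,
// validating that the opening and closing tags are identical.
function joinDollarStrings(tokens: Token[]): Token[] {
  const result: Token[] = [];
  for (let i = 0; i < tokens.length; i++) {
    const open = tokens[i];
    const content = tokens[i + 1];
    const close = tokens[i + 2];
    if (
      open.type === 'dollarTagOpen' &&
      content?.type === 'stringContent' &&
      close?.type === 'dollarTagClose'
    ) {
      if (open.value !== close.value) {
        throw new Error(`Mismatched dollar-quote tags: ${open.value} vs ${close.value}`);
      }
      result.push({ type: 'string', value: open.value + content.value + close.value });
      i += 2; // skip the content and closing-tag tokens we just consumed
    } else {
      result.push(open);
    }
  }
  return result;
}
```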
Well, the problem is that these two approaches don't work.
- In the first case the content between `$xx$..$xx$` is not necessarily valid SQL. For example one could have SQL like this: `SELECT $xx$some "content$xx$;`. This would get tokenized to `SELECT`, `$xx`, `some`, `"content$xx$;`.
- The second case will fail with a string like `$xx$ some price: $18$xx$` (see the sketch below).
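To make the second failure concrete, here is roughly what a capture-free, non-greedy pattern (my assumption of what the second approach would have to use) does with that input:

```ts
// A capture-free, non-greedy pattern, as the second approach would require.
const untaggedDollarString = /\$\w*\$[\s\S]*?\$\w*\$/;

'$xx$ some price: $18$xx$'.match(untaggedDollarString)?.[0];
// => '$xx$ some price: $18$'
// The lazy match stops at "$18$" instead of the real closing "$xx$",
// so the remainder "xx$" leaks out of the string token.
```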
Closing as the prettier-sql:moo/lexer branch is now merged into moo/lexer; will open a new PR when ready.