Update tokenizer to use Moo instead #224
Conversation
inferrinizzard commented on Jun 5, 2022
- added moo as tokenizer library
- temporary converter from moo tokens into custom tokens
I know it's a draft PR and not meant for merging, but I had a quick glance and a few things caught my eye.
@@ -36,7 +37,7 @@
     "@typescript-eslint/indent": "off",
     "@typescript-eslint/lines-between-class-members": "off",
     "@typescript-eslint/naming-convention": "error",
-    "@typescript-eslint/no-unused-vars": "error",
+    "@typescript-eslint/no-unused-vars": "warn",
Why warnings instead of errors?
I have found that it's better not to have the middle ground that "warn" provides. It raises the question: if it's only a warning, should it be fixed or not?
this rule in particular blocks tests from being run when lines are partially commented out. It should eventually be addressed before this is merged in, but while work is in progress it's mostly a hindrance.
-    ...Object.values(reservedFunctions).reduce((acc, arr) => [...acc, ...arr], []),
-    ...Object.values(reservedKeywords).reduce((acc, arr) => [...acc, ...arr], []),
+    ...Object.values(reservedFunctions).reduce((acc, arr) => acc.concat(arr), []),
+    ...Object.values(reservedKeywords).reduce((acc, arr) => acc.concat(arr), []),
Why this whole "replace spread with concat"?
- faster for large lists
Whether there's any speed difference really comes down to how a browser or Node.js implements the spread operator. Ideally these run at exactly the same speed; for example, Babel will compile the spread to concat when targeting platforms without spread support.
As always with performance, we should actually measure to see whether there is any practical difference in our case.
IMHO a larger problem is that we're repeating this `.reduce((acc, arr) => acc.concat(arr), [])` snippet all over. Essentially it's a `flatten()` function. Extracting a separate function for this would also provide better optimization opportunities if needed - for example, one could use a for loop instead of reduce.
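A minimal sketch of what such a helper could look like (the flatten name and the loop-based implementation are illustrative assumptions, not code from the PR):

```ts
// Illustrative flatten() helper that replaces the repeated
// .reduce((acc, arr) => acc.concat(arr), []) snippet.
// A plain loop avoids allocating an intermediate array on every step.
function flatten<T>(arrays: T[][]): T[] {
  const result: T[] = [];
  for (const arr of arrays) {
    for (const item of arr) {
      result.push(item);
    }
  }
  return result;
}

// Usage, mirroring the reserved-word lists in the diff above:
// const reservedWords = [
//   ...flatten(Object.values(reservedFunctions)),
//   ...flatten(Object.values(reservedKeywords)),
// ];
```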
'[]': '(?:\\[[^\\]]*(?:$|\\]))(\\][^\\]]*(?:$|\\]))*',
'""': `(?:${stringPrefixes}"[^"\\\\]*(?:\\\\.[^"\\\\]*)*(?:"|$))+`,
"''": `(?:${stringPrefixes}'[^'\\\\]*(?:\\\\.[^'\\\\]*)*(?:'|$))+`,
// '$$': '(?<tag>\\$\\w*\\$)[\\s\\S]*?(?:\\k<tag>|$)', // does not work with moo
Why doesn't this work with Moo and what could be done about this?
no-context/moo#132
moo doesn't allow capturing groups, since it tries to avoid parsing regexes in general
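For reference, here is the commented-out pattern from the diff above written as a plain RegExp; the named capture group and back-reference are exactly the parts Moo refuses to accept (the snippet just illustrates the pattern itself, not Moo's API):

```ts
// The tagged dollar-quote pattern relies on a capture group plus a
// back-reference so that the closing tag must equal the opening tag.
const taggedDollarString = /(?<tag>\$\w*\$)[\s\S]*?(?:\k<tag>|$)/;

'$xx$string_content$xx$ rest'.match(taggedDollarString)?.[0];
// => '$xx$string_content$xx$'
// Moo rejects this pattern because it contains a capture group.
```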
Seems to be a pretty serious limitation. This issue has been open since 2019 and no real progress has happened on it in Moo since then :(
I see four possibilities:
- Drop support for `$$` strings from sql-formatter. Not really feasible.
- Attempt fixing this issue in Moo. Will take quite some time at the very minimum. Might never happen.
- Use a patched version of Moo (applying this back-reference support PR). Not future-proof. Better to be avoided.
- Don't use Moo. Find some other lexer or enhance our home-grown one.
`$$` strings do work, the issue is only for tagged `$$` strings such as `$xx$string_content$xx$`.
moo can match a token of the form `$xx$string_content$yy$`, it just can't ensure that xx == yy without the backreference.
we could either:
- parse `$xx$`, string_content, `$yy$` as 3 separate tokens + validate and join in a post processor (a rough sketch of this follows below), or
- just parse `$xx$string_content$yy$` as is, and leave it up to the user to ensure xx == yy
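A rough sketch of the post-processing step from the first option; the token shape and type names (dollarTagOpen, stringContent, dollarTagClose) are assumptions for illustration, not the PR's actual token types:

```ts
// Hypothetical token shape, for illustration only.
interface Token {
  type: string;
  value: string;
}

// Join an (open tag, content, close tag) triple into one string token,
// validating that the opening and closing tags are identical.
function joinDollarStrings(tokens: Token[]): Token[] {
  const result: Token[] = [];
  for (let i = 0; i < tokens.length; i++) {
    const open = tokens[i];
    const content = tokens[i + 1];
    const close = tokens[i + 2];
    if (
      open.type === 'dollarTagOpen' &&
      content?.type === 'stringContent' &&
      close?.type === 'dollarTagClose'
    ) {
      if (open.value !== close.value) {
        throw new Error(`Mismatched dollar-quote tags: ${open.value} vs ${close.value}`);
      }
      result.push({ type: 'string', value: open.value + content.value + close.value });
      i += 2; // skip the content and closing-tag tokens we just consumed
    } else {
      result.push(open);
    }
  }
  return result;
}
```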
Well, the problem is that these two approaches don't work.
- In the first case the content between `$xx$..$xx$` is not necessarily valid SQL. For example one could have SQL like this: `SELECT $xx$some "content$xx$;`. This would get tokenized to `SELECT`, `$xx`, `some`, `"content$xx$;`.
- The second case will fail with a string like `$xx$ some price: $18$xx$` (see the sketch below).
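To make the second failure concrete, here is roughly what a capture-free, non-greedy pattern (my assumption of what the second approach would have to use) does with that input:

```ts
// A capture-free, non-greedy pattern, as the second approach would require.
const untaggedDollarString = /\$\w*\$[\s\S]*?\$\w*\$/;

'$xx$ some price: $18$xx$'.match(untaggedDollarString)?.[0];
// => '$xx$ some price: $18$'
// The lazy match stops at "$18$" instead of the real closing "$xx$",
// so the remainder "xx$" leaks out of the string token.
```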
Closing as the prettier-sql:moo/lexer branch is now merged into moo/lexer; will open a new PR when ready.