-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ignoreCase flag #122
base: main
Are you sure you want to change the base?
Add ignoreCase flag #122
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/u
is definitely a good thing to have. /i
is also really useful, but it worries me that I can completely change the behavior of these rules:
moo.compile({
cat: 'cat',
bat: 'bat',
loudCat: 'CAT',
loudBat: 'BAT',
sep: ';',
})
by adding a rule at the end:
moo.compile({
cat: 'cat',
bat: 'bat',
loudCat: 'CAT',
loudBat: 'BAT',
sep: ';',
anyHat: /hat/i,
})
That being said, I'm still not really a fan of requiring something verbose {ignoreCase: true}
on every string. Does requiring an options object only when there are string literals make sense? E.g., both of these would work:
moo.compile({
anyCat: 'cat',
anyBat: 'bat',
sep: ';',
anyHat: /(hat|chapeau)/i,
}, {ignoreCase: true})
moo.compile({
anyCat: /cat/i,
anyBat: /bat/i,
sep: /;/i,
anyHat: /hat/i,
})
test/test.js
Outdated
@@ -29,16 +29,42 @@ describe('compiler', () => { | |||
expect(lex4.next()).toMatchObject({type: 'err', text: 'nope!'}) | |||
}) | |||
|
|||
test("warns for /g, /y, /i, /m, /u", () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also remove /u
from the test name.
test/test.js
Outdated
expect(lexer.next()).toMatchObject({value: "FoO"}) | ||
expect(lexer.next()).toMatchObject({value: "bAr"}) | ||
expect(lexer.next()).toMatchObject({value: "QuXx"}) | ||
}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should include a supports unicode
test (e.g., check that an astral plane character in a character set doesn't match a lone surrogate or that /\u{1D306}/u
matches '\u{1D306}'
and not '\\u{1D306}'
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind giving an example? <3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of both:
test("supports unicode", () => {
const lexer = compile({
a: /[𝌆]/u,
})
lexer.reset("𝌆")
expect(lexer.next()).toMatchObject({value: "𝌆"})
lexer.reset("𝌆".charCodeAt(0))
expect(() => lexer.next()).toThrow()
const lexer2 = compile({
a: /\u{1D356}/u,
})
lexer2.reset("𝍖")
expect(lexer2.next()).toMatchObject({value: "𝍖"})
lexer2.reset("\\u{1D356}")
expect(() => lexer2.next()).toThrow()
})
I agree; as I implied in the PR description, it feels weird leaving implicit how strings are handled. I slightly prefer I could be persuaded to add an options dict, though. Sent with GitHawk |
We can't do this properly for Unicode unless we include the entire case-folding map (which admittedly isn't terribly large). But I can see how having case-insensitive tokens in a case-sensitive language could be useful. I think I keep imagining a user writing a long list of string literals like this: moo.compile({
if: 'if',
else: 'else',
then: 'then',
...
}) to make keywords (and having to write moo.compile({
cat: /cat/i,
bat: /bat/i,
comma: ',',
semi: ';',
lparen: '(',
rparen: ')',
lbrace: '{',
rbrace: '}',
lbracket: '[',
rbracket: ']',
and: '&&',
or: '||',
bitand: '&',
bitor: '|',
...
}) instead of this: moo.compile({
cat: /cat/i,
bat: /bat/i,
comma: {match: ',', ignoreCase: true},
semi: {match: ';', ignoreCase: true},
lparen: {match: '(', ignoreCase: true},
rparen: {match: ')', ignoreCase: true},
lbrace: {match: '{', ignoreCase: true},
rbrace: {match: '}', ignoreCase: true},
lbracket: {match: '[', ignoreCase: true},
rbracket: {match: ']', ignoreCase: true},
and: {match: '&&', ignoreCase: true},
or: {match: '||', ignoreCase: true},
bitand: {match: '&', ignoreCase: true},
bitor: {match: '|', ignoreCase: true},
...
}) (Arguably, this would be better than either: moo.compile({
cat: /cat/i,
bat: /bat/i,
op: {match: /[,;(){}[\]]|\|\|?|&&?/i, type: v => v},
}) but there are stylistic/interoperability reasons to prefer alphanumeric token types.) |
Do we need the case-folding map, or can we use the Sounds like we're agreed about making ignoreCase per-string. I agree it shouldn't be required when it's irrelevant. Sent with GitHawk |
Strictly speaking, EDIT: a less esoteric example would be Greek: |
Is that for If the latter, then I think it would be reasonable to only support Sent with GitHawk |
Only If you're worried about the size of the map, it's large but not horribly so. Here are the simple and common mappings in CaseFolding.txt (all that we need to implement
That comes out to <6KB gzipped UTF-8. EDIT: That can actually probably be made much smaller, because |
I added an ignoreCase option for literals. We don't yet allow it to be used in isolation; that can come in a future PR, so you can write:
...although I imagine the next feature request will relate to case-insensitive literals, so perhaps this needs more thought. Note that the check I'm using for whether case is relevant for a literal is probably insufficient, for the same case-folding-related reason as you've explained above. Sent with GitHawk |
Here are 809 bytes (gzipped) that generate the full map: function d(r){for(var a=Array.from(r),o=[],i=0;i<a.length;){var t=a[i++],e=-1,f=a[i]&&a[i].charCodeAt(0);if(f<64){e=31&f;var n=a[++i]&&a[i].charCodeAt(0);n<64&&(e|=(31&n)<<5,++i)}if(o.push(t),-1<e)for(var d=0,A=t.codePointAt(0);d<=e>>1;++d)A+=1+(1&e),o.push(String.fromCodePoint(A))}return o}for(var CASE_FOLD={},i=(a=d('A0!À*!Ø*Ā-!IJ#Ĺ-Ŋ-!Ź#Ɓ Ƅ!Ƈ!Ɗ Ǝ$Ɠ Ɩ"Ɯ Ɵ Ƣ#Ƨ!Ƭ!Ư!Ʋ Ƶ!ƸƼDŽ LJ NJ Ǎ-Ǟ/DZ Ǵ!Ƿ Ǻ7!Ⱥ Ƚ Ɂ!Ʉ"Ɉ%Ͱ!ͶͿΆ!Ή Ό!Ώ!Β<Σ.ϏϘ5ϴϷ!ϺϽ"#Ѡ?Ҋ5!Ӂ+Ӑ="Ա("Ⴀ("ჇჍᲐ2"Ჽ"Ḁ3$ẞ?"Ἀ,Ἐ(Ἠ,Ἰ,Ὀ(Ὑ%Ὠ,ᾈ,ᾘ,ᾨ,Ᾰ&Ὲ&Ῐ$Ῠ&Ὸ&ΩK ℲⅠ<ↃⒶ0!Ⰰ:"Ⱡ!Ᵽ Ⱨ%Ɱ"ⱲⱵⱾ"Ⲃ?"Ⳬ!ⳲꙀ+!Ꚁ9Ꜣ+Ꜳ;!Ꝺ#Ꝿ\'Ꞌ!Ꞑ!Ꞗ3Ɜ$Ʞ&Ꞷ!A0!𐐀,"𐒰$"𐲀"#𑢠<!𖹀<!𞤀 "')).length;i--;)CASE_FOLD[a[i]]=a[i].toLowerCase();var a=d("µſͅςϐ ϕ ϰ ϵᏸ(ᲀ.ẛιꭰ<$"),b=d("μsισβθφπκρεᏰ(в!ос тъѣꙋṡιᎠ<$");for(i=a.length;i--;)CASE_FOLD[a[i]]=b[i]; And a gist with the code I used to generate them. EDIT: This relies on Array.from(String), String.fromCodePoint, and String.prototype.codePointAt, all of which we will likely need for Unicode support in other places too, and all of which have fairly concise shims.
Not sure what you mean by this.
I believe it actually is both necessary and sufficient. Since every character not in CaseFolding.txt maps to itself, we can just exhaustively check the ones in CaseFolding.txt: > itt.entries(CASE_FOLD).flatten().every(c => c.toLowerCase() !== c.toUpperCase())
true EDIT: even more exhaustive: // CASE_FOLD_CPS is a set of every code point in CaseFolding.txt (including T and F mappings)
> itt.range(0x10FFFF).every(c =>
... CASE_FOLD_CPS.has(c) ||
... String.fromCodePoint(c).toUpperCase() === String.fromCodePoint(c).toLowerCase())
true |
Nice one! Would you like to PR that? (Perhaps unminified?)
Nice -- thanks for comprehensively confirming that. Regarding my comment about keywords: I have two remaining concerns with this approach. One is that it feels a little bit "magic"; I'm not convinced it's easy to explain when And in particular, now that const lexer = moo.compile({
ws: /[ \t]/i,
word: {
match: /[a-z]+/i,
ignoreCase: true,
type: moo.keywords({
if: "if",
else: "else",
}),
},
})
lexer.reset("foo IF")
// word foo
// ws
// word IF |
Those are good points. Perhaps we should separate the
I can work on a PR (definitely unminified) for unicode ignoreCase after we sort out the design we want for |
I was thinking the same thing. I opened #123, which adds only the unicode flag. Once that's merged, I'll rebase this PR, to keep the conversation about ignoreCase in one place... although this is getting quite long 😬 In conclusion, do you prefer the options dict approach, or the |
I think they both make the keywords scenario pretty confusing and unintuitive. Neither this: moo.compile({
ws: /[ \t]/i,
word: {
match: /[a-z]+/i,
ignoreCase: true,
type: moo.keywords({
if: "if",
else: "else",
}),
},
}) nor this: moo.compile({
ws: /[ \t]/i,
word: {
match: /[a-z]+/i,
type: moo.keywords({
if: "if",
else: "else",
}),
},
}, {ignoreCase: true}) actually does what you'd expect (i.e., treats "If" and "iF" and "IF" as |
Any status on this? I'm trying to write a SQL parser and therefor need case independent keyword parsing. |
I REALLY need this too. I can't find any reasonable way to implement the following matcher that I need to use:
any word on this? |
If you don't care about Unicode, you can use something like this to transform the RegExp: function insensitive(r) {
const esc = (s, a = '', b = '') => s.replace(/[a-z]/gi, c =>
`${a}${c.toUpperCase()}${c.toLowerCase()}${b}`)
const PART = /(\\u[\da-fA-F]{4}|\\x[\da-fA-F]{2}|\\c[a-zA-Z]|\\.)|(\[(?:\\.|[^\]])*\])/
const ESCAPE = /(\\u[\da-fA-F]{4}|\\x[\da-fA-F]{2}|\\c[a-zA-Z]|\\.)/
return new RegExp(r.source.split(PART).map((s, i) =>
i % 3 === 1 ? s :
i % 3 ? s && s.split(ESCAPE).map((t, j) =>
j % 2 ? t : esc(t)).join('') :
esc(s, '[', ']')).join(''), r.flags.replace('i', ''))
}
insensitive(/was\s+not\s+in|is\s+not|not\s+in|was\s+not|was\s+in|is|in|was|changed/i)
// => /[Ww][Aa][Ss]\s+[Nn][Oo][Tt]\s+[Ii][Nn]|[Ii][Ss]\s+[Nn][Oo][Tt]|[Nn][Oo][Tt]\s+[Ii][Nn]|[Ww][Aa][Ss]\s+[Nn][Oo][Tt]|[Ww][Aa][Ss]\s+[Ii][Nn]|[Ii][Ss]|[Ii][Nn]|[Ww][Aa][Ss]|[Cc][Hh][Aa][Nn][Gg][Ee][Dd]/ |
/i
, if every RegExp sets the/i
flag, to solve Case sensitivity handling and note about in the docs #117 (comment). Note that this implicitly handles string literals; we might want to add a separate option for those./u
, if every RegExp sets the/u
flag, to solve Support for Unicode property escapes (and/u
flag) #116.