Add ignoreCase flag #122

tjvr · 2019-02-23T14:53:46Z

Mark the final regex as /i if every RegExp sets the /i flag, to solve "Case sensitivity handling and note about in the docs" #117 (comment). Note that this implicitly handles string literals; we might want to add a separate option for those.
Mark the final regex as /u if every RegExp sets the /u flag, to solve "Support for Unicode property escapes (and /u flag)" #116.
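
A minimal sketch of the flag check described above, assuming a hypothetical helper `sharedFlags` (this is not moo's actual code). A flag is kept for the combined RegExp only when every RegExp rule sets it. String literals are skipped, which is why they are handled implicitly. What should happen when the RegExps disagree is left open here; the sketch simply drops the flag.

```javascript
// Hypothetical helper, not part of moo: decide which flags the combined
// RegExp may carry. A flag is kept only if every RegExp pattern sets it.
function sharedFlags(patterns) {
  // String literals carry no flags, so only RegExp patterns are inspected.
  const regexps = patterns.filter(p => p instanceof RegExp)
  if (regexps.length === 0) return ''
  return ['i', 'u']
    .filter(flag => regexps.every(re => re.flags.includes(flag)))
    .join('')
}

console.log(sharedFlags([/cat/i, /bat/i, ';']))  // the string literal is ignored
console.log(sharedFlags([/cat/i, /bat/]))        // the flags disagree, so none is kept
```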

nathan

/u is definitely a good thing to have. /i is also really useful, but it worries me that I can completely change the behavior of these rules:

moo.compile({
  cat: 'cat',
  bat: 'bat',
  loudCat: 'CAT',
  loudBat: 'BAT',
  sep: ';',
})

by adding a rule at the end:

moo.compile({
  cat: 'cat',
  bat: 'bat',
  loudCat: 'CAT',
  loudBat: 'BAT',
  sep: ';',
  anyHat: /hat/i,
})
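
The concern comes from the fact that a flag applies to the whole combined expression, not to one alternative. A minimal sketch with plain RegExps, assuming the rules are joined as capture groups in declaration order (the `firstGroup` helper and the `names` array are hypothetical, and moo's real pattern construction may differ in detail):

```javascript
// Hypothetical helper: zero-based index of the first capture group that matched.
function firstGroup(re, input) {
  const m = re.exec(input)
  return m ? m.findIndex((g, i) => i > 0 && g !== undefined) - 1 : -1
}

const names = ['cat', 'bat', 'loudCat', 'loudBat', 'sep', 'anyHat']
const source = '(cat)|(bat)|(CAT)|(BAT)|(;)'

const sensitive = new RegExp(source)
// Adding one /i rule forces the flag onto every alternative.
const insensitiveAll = new RegExp(source + '|(hat)', 'i')

console.log(names[firstGroup(sensitive, 'CAT')])       // loudCat
console.log(names[firstGroup(insensitiveAll, 'CAT')])  // cat
```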

That being said, I'm still not really a fan of requiring something verbose like {ignoreCase: true} on every string. Does requiring an options object only when there are string literals make sense? E.g., both of these would work:

moo.compile({
  anyCat: 'cat',
  anyBat: 'bat',
  sep: ';',
  anyHat: /(hat|chapeau)/i,
}, {ignoreCase: true})

moo.compile({
  anyCat: /cat/i,
  anyBat: /bat/i,
  sep: /;/i,
  anyHat: /hat/i,
})

nathan · 2019-02-23T15:41:18Z

test/test.js

@@ -29,16 +29,42 @@ describe('compiler', () => {
    expect(lex4.next()).toMatchObject({type: 'err', text: 'nope!'})
  })

-  test("warns for /g, /y, /i, /m, /u", () => {


Should also remove /u from the test name.

nathan · 2019-02-23T15:45:53Z

test/test.js

+    expect(lexer.next()).toMatchObject({value: "FoO"})
+    expect(lexer.next()).toMatchObject({value: "bAr"})
+    expect(lexer.next()).toMatchObject({value: "QuXx"})
+  })



Should include a supports unicode test (e.g., check that an astral plane character in a character set doesn't match a lone surrogate or that /\u{1D306}/u matches '\u{1D306}' and not '\\u{1D306}').

Would you mind giving an example? <3

Here's an example of both:

test("supports unicode", () => {
  const lexer = compile({
    a: /[𝌆]/u,
  })
  lexer.reset("𝌆")
  expect(lexer.next()).toMatchObject({value: "𝌆"})
  // A lone high surrogate (the first UTF-16 code unit) must not match.
  lexer.reset("𝌆".charAt(0))
  expect(() => lexer.next()).toThrow()

  const lexer2 = compile({
    a: /\u{1D356}/u,
  })
  lexer2.reset("𝍖")
  expect(lexer2.next()).toMatchObject({value: "𝍖"})
  // The literal text \u{1D356} must not match.
  lexer2.reset("\\u{1D356}")
  expect(() => lexer2.next()).toThrow()
})

moo.js

tjvr · 2019-02-23T17:40:43Z

I agree; as I implied in the PR description, it feels weird leaving implicit how strings are handled.

I slightly prefer ignoreCase: true, because we can compile string literals into RegExps like /[Bb][Aa][Rr]/.

I could be persuaded to add an options dict, though.
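
A rough sketch of the literal-to-RegExp idea, assuming a hypothetical helper `literalToInsensitiveSource` and ASCII letters only (non-ASCII case folding is not handled here):

```javascript
// Hypothetical helper: build a case-insensitive RegExp source for a literal.
// ASCII letters become two-character classes; everything else is escaped
// and matched exactly.
function literalToInsensitiveSource(literal) {
  return Array.from(literal, ch => {
    if (/[a-z]/i.test(ch)) return `[${ch.toUpperCase()}${ch.toLowerCase()}]`
    return ch.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&')
  }).join('')
}

console.log(literalToInsensitiveSource('bar'))  // [Bb][Aa][Rr]
```

Because the case handling lives in the pattern itself, the resulting source can sit next to case-sensitive rules without needing the /i flag on the combined RegExp.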

nathan · 2019-02-23T18:09:07Z

we can compile string literals into RegExps like /[Bb][Aa][Rr]/

We can't do this properly for Unicode unless we include the entire case-folding map (which admittedly isn't terribly large). But I can see how having case-insensitive tokens in a case-sensitive language could be useful.

I think I keep imagining a user writing a long list of string literals like this:

moo.compile({
  if: 'if',
  else: 'else',
  then: 'then',
  ...
})

to make keywords (and having to write {ignoreCase: true, match: …} for every single one to make them case-insensitive), when really she'd just use metaprogramming / a type transform. So I think just a per-string ignoreCase is fine, as long as we don't require ignoreCase on string literals for which case is irrelevant. I want a user to be able to write this:

moo.compile({
  cat: /cat/i,
  bat: /bat/i,
  comma: ',',
  semi: ';',
  lparen: '(',
  rparen: ')',
  lbrace: '{',
  rbrace: '}',
  lbracket: '[',
  rbracket: ']',
  and: '&&',
  or: '||',
  bitand: '&',
  bitor: '|',
  ...
})

instead of this:

moo.compile({
  cat: /cat/i,
  bat: /bat/i,
  comma: {match: ',', ignoreCase: true},
  semi: {match: ';', ignoreCase: true},
  lparen: {match: '(', ignoreCase: true},
  rparen: {match: ')', ignoreCase: true},
  lbrace: {match: '{', ignoreCase: true},
  rbrace: {match: '}', ignoreCase: true},
  lbracket: {match: '[', ignoreCase: true},
  rbracket: {match: ']', ignoreCase: true},
  and: {match: '&&', ignoreCase: true},
  or: {match: '||', ignoreCase: true},
  bitand: {match: '&', ignoreCase: true},
  bitor: {match: '|', ignoreCase: true},
  ...
})

(Arguably, this would be better than either:

moo.compile({
  cat: /cat/i,
  bat: /bat/i,
  op: {match: /[,;(){}[\]]|\|\|?|&&?/i, type: v => v},
})

but there are stylistic/interoperability reasons to prefer alphanumeric token types.)
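
The "metaprogramming" route mentioned above could look roughly like this. It is only a sketch: `ignoreCaseAll` is a hypothetical helper, and `{match, ignoreCase: true}` is the per-rule shape being proposed in this thread, not an established API.

```javascript
// Hypothetical transform: wrap every string literal in a rule object marked
// ignoreCase, and pass RegExps (or anything else) through untouched.
function ignoreCaseAll(rules) {
  return Object.fromEntries(
    Object.entries(rules).map(([name, match]) =>
      [name, typeof match === 'string' ? {match, ignoreCase: true} : match]))
}

const keywordRules = ignoreCaseAll({
  if: 'if',
  else: 'else',
  then: 'then',
  number: /[0-9]+/,
})
console.log(keywordRules)
```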

tjvr · 2019-02-23T18:27:14Z

Do we need the case-folding map, or can we use the .toUpperCase() and .toLowerCase() built-ins?

Sounds like we're agreed about making ignoreCase per-string. I agree it shouldn't be required when it's irrelevant.

nathan · 2019-02-23T19:00:23Z

can we use the .toUpperCase() and .toLowerCase() built-ins?

Strictly speaking, toUpperCase and toLowerCase are insufficient; e.g., s should map to [Ssſ] (including U+017F LATIN SMALL LETTER LONG S, which the case-folding built-ins won't produce). See, e.g., http://unicode.org/faq/casemap_charprop.html#2

EDIT: a less esoteric example would be Greek: σ must match [σςΣ] (U+03A3 GREEK CAPITAL LETTER SIGMA, U+03C3 GREEK SMALL LETTER SIGMA, and U+03C2 GREEK SMALL LETTER FINAL SIGMA).
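
A quick check of that claim, using only the built-ins (`builtinVariants` is a hypothetical helper for illustration):

```javascript
// Variants reachable from one character via the built-in case conversions.
const builtinVariants = ch =>
  [...new Set([ch, ch.toLowerCase(), ch.toUpperCase()])]

console.log(builtinVariants('s'))  // contains 's' and 'S', but not U+017F
console.log(builtinVariants('σ'))  // contains 'σ' and 'Σ', but not U+03C2

// The mapping only works in the other direction:
console.log('\u017F'.toUpperCase())  // S
console.log('\u03C2'.toUpperCase())  // Σ
```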

tjvr · 2019-02-23T22:43:23Z

Is that for /s/i, or only /s/ui?

If the latter, then I think it would be reasonable to only support {match: "s", ignoreCase: true} when the unicode flag is not used.

nathan · 2019-02-24T14:48:37Z

Is that for /s/i, or only /s/ui?

Only /s/ui matches ſ (/s/i does not). However, /σ/i and /σ/ui must both match σ, ς, and Σ. See the definition of Canonicalize in the spec (and Note 4 below it): /.../i uses Unicode case folding, but refuses to map characters outside of the Basic Latin range (U+0000 through U+007f) into it.
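
The difference can be checked directly in any engine that supports the /u flag (escapes are used so the code points are unambiguous):

```javascript
const LONG_S = '\u017F'       // ſ
const SIGMA = '\u03C3'        // σ
const FINAL_SIGMA = '\u03C2'  // ς
const CAPITAL_SIGMA = '\u03A3' // Σ

// Without /u, a non-ASCII character is never mapped into the ASCII range.
console.log(/s/i.test(LONG_S))   // false
console.log(/s/ui.test(LONG_S))  // true

// All three sigmas are outside ASCII, so both modes treat them as equivalent.
const sigmaI = new RegExp(SIGMA, 'i')
const sigmaUI = new RegExp(SIGMA, 'ui')
console.log([FINAL_SIGMA, CAPITAL_SIGMA].every(c => sigmaI.test(c)))   // true
console.log([FINAL_SIGMA, CAPITAL_SIGMA].every(c => sigmaUI.test(c)))  // true
```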

If you're worried about the size of the map, it's large but not horribly so. Here are the simple and common mappings in CaseFolding.txt (all that we need to implement /i properly) in a fairly compact notation:

itt("AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZzµμÀàÁáÂâÃãÄäÅåÆæÇçÈèÉéÊêËëÌìÍíÎîÏïÐðÑñÒòÓóÔôÕõÖöØøÙùÚúÛûÜüÝýÞþĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįĲĳĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸÿŹźŻżŽžſsƁɓƂƃƄƅƆɔƇƈƉɖƊɗƋƌƎǝƏəƐɛƑƒƓɠƔɣƖɩƗɨƘƙƜɯƝɲƟɵƠơƢƣƤƥƦʀƧƨƩʃƬƭƮʈƯưƱʊƲʋƳƴƵƶƷʒƸƹƼƽǄǆǅǆǇǉǈǉǊǌǋǌǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǱǳǲǳǴǵǶƕǷƿǸǹǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠƞȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȺⱥȻȼȽƚȾⱦɁɂɃƀɄʉɅʌɆɇɈɉɊɋɌɍɎɏͅιͰͱͲͳͶͷͿϳΆάΈέΉήΊίΌόΎύΏώΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσΤτΥυΦφΧχΨψΩωΪϊΫϋςσϏϗϐβϑθϕφϖπϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰκϱρϴθϵεϷϸϹϲϺϻϽͻϾͼϿͽЀѐЁёЂђЃѓЄєЅѕІіЇїЈјЉљЊњЋћЌќЍѝЎўЏџАаБбВвГгДдЕеЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯяѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӏӁӂӃӄӅӆӇӈӉӊӋӌӍӎӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧԨԩԪԫԬԭԮԯԱաԲբԳգԴդԵեԶզԷէԸըԹթԺժԻիԼլԽխԾծԿկՀհՁձՂղՃճՄմՅյՆնՇշՈոՉչՊպՋջՌռՍսՎվՏտՐրՑցՒւՓփՔքՕօՖֆႠⴀႡⴁႢⴂႣⴃႤⴄႥⴅႦⴆႧⴇႨⴈႩⴉႪⴊႫⴋႬⴌႭⴍႮⴎႯⴏႰⴐႱⴑႲⴒႳⴓႴⴔႵⴕႶⴖႷⴗႸⴘႹⴙႺⴚႻⴛႼⴜႽⴝႾⴞႿⴟჀⴠჁⴡჂⴢჃⴣჄⴤჅⴥჇⴧჍⴭᏸᏰᏹᏱᏺᏲᏻᏳᏼᏴᏽᏵᲀвᲁдᲂоᲃсᲄтᲅтᲆъᲇѣᲈꙋᲐაᲑბᲒგᲓდᲔეᲕვᲖზᲗთᲘიᲙკᲚლᲛმᲜნᲝოᲞპᲟჟᲠრᲡსᲢტᲣუᲤფᲥქᲦღᲧყᲨშᲩჩᲪცᲫძᲬწᲭჭᲮხᲯჯᲰჰᲱჱᲲჲᲳჳᲴჴᲵჵᲶჶᲷჷᲸჸᲹჹᲺჺᲽჽᲾჾᲿჿḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿẀẁẂẃẄẅẆẇẈẉẊẋẌẍẎẏẐẑẒẓẔẕẛṡẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹỺỻỼỽỾỿἈἀἉἁἊἂἋἃἌἄἍἅἎἆἏἇἘἐἙἑἚἒἛἓἜἔἝἕἨἠἩἡἪἢἫἣἬἤἭἥἮἦἯἧἸἰἹἱἺἲἻἳἼἴἽἵἾἶἿἷὈὀὉὁὊὂὋὃὌὄὍὅὙὑὛὓὝὕὟὗὨὠὩὡὪὢὫὣὬὤὭὥὮὦὯὧᾸᾰᾹᾱᾺὰΆάιιῈὲΈέῊὴΉήῘῐῙῑῚὶΊίῨῠῩῡῪὺΎύῬῥῸὸΌόῺὼΏώΩωKkÅåℲⅎⅠⅰⅡⅱⅢⅲⅣⅳⅤⅴⅥⅵⅦⅶⅧⅷⅨⅸⅩⅹⅪⅺⅫⅻⅬⅼⅭⅽⅮⅾⅯⅿↃↄⒶⓐⒷⓑⒸⓒⒹⓓⒺⓔⒻⓕⒼⓖⒽⓗⒾⓘⒿⓙⓀⓚⓁⓛⓂⓜⓃⓝⓄⓞⓅⓟⓆⓠⓇⓡⓈⓢⓉⓣⓊⓤⓋⓥⓌⓦⓍⓧⓎⓨⓏⓩⰀⰰⰁⰱⰂⰲⰃⰳⰄⰴⰅⰵⰆⰶⰇⰷⰈⰸⰉⰹⰊⰺⰋⰻⰌⰼⰍⰽⰎⰾⰏⰿⰐⱀⰑⱁⰒⱂⰓⱃⰔⱄⰕⱅⰖⱆⰗⱇⰘⱈⰙⱉⰚⱊⰛⱋⰜⱌⰝⱍⰞⱎⰟⱏⰠⱐⰡⱑⰢⱒⰣⱓⰤⱔⰥⱕⰦⱖⰧⱗⰨⱘⰩⱙⰪⱚⰫⱛⰬⱜⰭⱝⰮⱞⱠⱡⱢɫⱣᵽⱤɽⱧⱨⱩⱪⱫⱬⱭɑⱮɱⱯɐⱰɒⱲⱳⱵⱶⱾȿⱿɀⲀⲁⲂⲃⲄⲅⲆⲇⲈⲉⲊⲋⲌⲍⲎⲏⲐⲑⲒⲓⲔⲕⲖⲗⲘⲙⲚⲛⲜⲝⲞⲟⲠⲡⲢⲣⲤⲥⲦⲧⲨⲩⲪⲫⲬⲭⲮⲯⲰⲱⲲⲳⲴⲵⲶⲷⲸⲹⲺⲻⲼⲽⲾⲿⳀⳁⳂⳃⳄⳅⳆⳇⳈⳉⳊⳋⳌⳍⳎⳏⳐⳑⳒⳓⳔⳕⳖⳗⳘⳙⳚⳛⳜⳝⳞⳟⳠⳡⳢⳣⳫⳬⳭⳮⳲⳳꙀꙁꙂꙃꙄꙅꙆꙇꙈꙉꙊꙋꙌꙍꙎꙏꙐꙑꙒꙓꙔꙕꙖꙗꙘꙙꙚꙛꙜꙝꙞꙟꙠꙡꙢꙣꙤꙥꙦꙧꙨꙩꙪꙫꙬꙭꚀꚁꚂꚃꚄꚅꚆꚇꚈꚉꚊꚋꚌꚍꚎꚏꚐꚑꚒꚓꚔꚕꚖꚗꚘꚙꚚꚛꜢꜣꜤꜥꜦꜧꜨꜩꜪꜫꜬꜭꜮꜯꜲꜳꜴꜵꜶꜷꜸꜹꜺꜻꜼꜽꜾꜿꝀꝁꝂꝃꝄꝅꝆꝇꝈꝉꝊꝋꝌꝍꝎꝏꝐꝑꝒꝓꝔ
ꝕꝖꝗꝘꝙꝚꝛꝜꝝꝞꝟꝠꝡꝢꝣꝤꝥꝦꝧꝨꝩꝪꝫꝬꝭꝮꝯꝹꝺꝻꝼꝽᵹꝾꝿꞀꞁꞂꞃꞄꞅꞆꞇꞋꞌꞍɥꞐꞑꞒꞓꞖꞗꞘꞙꞚꞛꞜꞝꞞꞟꞠꞡꞢꞣꞤꞥꞦꞧꞨꞩꞪɦꞫɜꞬɡꞭɬꞮɪꞰʞꞱʇꞲʝꞳꭓꞴꞵꞶꞷꞸꞹꭰᎠꭱᎡꭲᎢꭳᎣꭴᎤꭵᎥꭶᎦꭷᎧꭸᎨꭹᎩꭺᎪꭻᎫꭼᎬꭽᎭꭾᎮꭿᎯꮀᎰꮁᎱꮂᎲꮃᎳꮄᎴꮅᎵꮆᎶꮇᎷꮈᎸꮉᎹꮊᎺꮋᎻꮌᎼꮍᎽꮎᎾꮏᎿꮐᏀꮑᏁꮒᏂꮓᏃꮔᏄꮕᏅꮖᏆꮗᏇꮘᏈꮙᏉꮚᏊꮛᏋꮜᏌꮝᏍꮞᏎꮟᏏꮠᏐꮡᏑꮢᏒꮣᏓꮤᏔꮥᏕꮦᏖꮧᏗꮨᏘꮩᏙꮪᏚꮫᏛꮬᏜꮭᏝꮮᏞꮯᏟꮰᏠꮱᏡꮲᏢꮳᏣꮴᏤꮵᏥꮶᏦꮷᏧꮸᏨꮹᏩꮺᏪꮻᏫꮼᏬꮽᏭꮾᏮꮿᏯＡａＢｂＣｃＤｄＥｅＦｆＧｇＨｈＩｉＪｊＫｋＬｌＭｍＮｎＯｏＰｐＱｑＲｒＳｓＴｔＵｕＶｖＷｗＸｘＹｙＺｚ𐐀𐐨𐐁𐐩𐐂𐐪𐐃𐐫𐐄𐐬𐐅𐐭𐐆𐐮𐐇𐐯𐐈𐐰𐐉𐐱𐐊𐐲𐐋𐐳𐐌𐐴𐐍𐐵𐐎𐐶𐐏𐐷𐐐𐐸𐐑𐐹𐐒𐐺𐐓𐐻𐐔𐐼𐐕𐐽𐐖𐐾𐐗𐐿𐐘𐑀𐐙𐑁𐐚𐑂𐐛𐑃𐐜𐑄𐐝𐑅𐐞𐑆𐐟𐑇𐐠𐑈𐐡𐑉𐐢𐑊𐐣𐑋𐐤𐑌𐐥𐑍𐐦𐑎𐐧𐑏𐒰𐓘𐒱𐓙𐒲𐓚𐒳𐓛𐒴𐓜𐒵𐓝𐒶𐓞𐒷𐓟𐒸𐓠𐒹𐓡𐒺𐓢𐒻𐓣𐒼𐓤𐒽𐓥𐒾𐓦𐒿𐓧𐓀𐓨𐓁𐓩𐓂𐓪𐓃𐓫𐓄𐓬𐓅𐓭𐓆𐓮𐓇𐓯𐓈𐓰𐓉𐓱𐓊𐓲𐓋𐓳𐓌𐓴𐓍𐓵𐓎𐓶𐓏𐓷𐓐𐓸𐓑𐓹𐓒𐓺𐓓𐓻𐲀𐳀𐲁𐳁𐲂𐳂𐲃𐳃𐲄𐳄𐲅𐳅𐲆𐳆𐲇𐳇𐲈𐳈𐲉𐳉𐲊𐳊𐲋𐳋𐲌𐳌𐲍𐳍𐲎𐳎𐲏𐳏𐲐𐳐𐲑𐳑𐲒𐳒𐲓𐳓𐲔𐳔𐲕𐳕𐲖𐳖𐲗𐳗𐲘𐳘𐲙𐳙𐲚𐳚𐲛𐳛𐲜𐳜𐲝𐳝𐲞𐳞𐲟𐳟𐲠𐳠𐲡𐳡𐲢𐳢𐲣𐳣𐲤𐳤𐲥𐳥𐲦𐳦𐲧𐳧𐲨𐳨𐲩𐳩𐲪𐳪𐲫𐳫𐲬𐳬𐲭𐳭𐲮𐳮𐲯𐳯𐲰𐳰𐲱𐳱𐲲𐳲𑢠𑣀𑢡𑣁𑢢𑣂𑢣𑣃𑢤𑣄𑢥𑣅𑢦𑣆𑢧𑣇𑢨𑣈𑢩𑣉𑢪𑣊𑢫𑣋𑢬𑣌𑢭𑣍𑢮𑣎𑢯𑣏𑢰𑣐𑢱𑣑𑢲𑣒𑢳𑣓𑢴𑣔𑢵𑣕𑢶𑣖𑢷𑣗𑢸𑣘𑢹𑣙𑢺𑣚𑢻𑣛𑢼𑣜𑢽𑣝𑢾𑣞𑢿𑣟𖹀𖹠𖹁𖹡𖹂𖹢𖹃𖹣𖹄𖹤𖹅𖹥𖹆𖹦𖹇𖹧𖹈𖹨𖹉𖹩𖹊𖹪𖹋𖹫𖹌𖹬𖹍𖹭𖹎𖹮𖹏𖹯𖹐𖹰𖹑𖹱𖹒𖹲𖹓𖹳𖹔𖹴𖹕𖹵𖹖𖹶𖹗𖹷𖹘𖹸𖹙𖹹𖹚𖹺𖹛𖹻𖹜𖹼𖹝𖹽𖹞𖹾𖹟𖹿𞤀𞤢𞤁𞤣𞤂𞤤𞤃𞤥𞤄𞤦𞤅𞤧𞤆𞤨𞤇𞤩𞤈𞤪𞤉𞤫𞤊𞤬𞤋𞤭𞤌𞤮𞤍𞤯𞤎𞤰𞤏𞤱𞤐𞤲𞤑𞤳𞤒𞤴𞤓𞤵𞤔𞤶𞤕𞤷𞤖𞤸𞤗𞤹𞤘𞤺𞤙𞤻𞤚𞤼𞤛𞤽𞤜𞤾𞤝𞤿𞤞𞥀𞤟𞥁𞤠𞥂𞤡𞥃ẞßᾈᾀᾉᾁᾊᾂᾋᾃᾌᾄᾍᾅᾎᾆᾏᾇᾘᾐᾙᾑᾚᾒᾛᾓᾜᾔᾝᾕᾞᾖᾟᾗᾨᾠᾩᾡᾪᾢᾫᾣᾬᾤᾭᾥᾮᾦᾯᾧᾼᾳῌῃῼῳ").chunksOf(2).toObject()

That comes out to <6KB gzipped UTF-8.
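
For readers without itt at hand, the pair notation can be decoded with a few lines of plain JavaScript. This is a sketch that assumes `.chunksOf(2).toObject()` pairs each character with the one that follows it; `pairsToMap` is a hypothetical name.

```javascript
// Hypothetical decoder for the pair notation: characters alternate between
// a code point and the code point it folds to.
function pairsToMap(pairs) {
  // Array.from iterates by code point, so astral characters stay intact.
  const chars = Array.from(pairs)
  const map = {}
  for (let i = 0; i + 1 < chars.length; i += 2) map[chars[i]] = chars[i + 1]
  return map
}

// A tiny sample: A→a, ſ→s, ς→σ, Σ→σ, U+10400→U+10428.
const sample = pairsToMap('Aa\u017Fs\u03C2\u03C3\u03A3\u03C3\u{10400}\u{10428}')
console.log(sample)
```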

EDIT: That can actually probably be made much smaller, because toUpperCase can recover the majority of these. I will see what I can do.

tjvr · 2019-02-24T17:03:51Z

I added an ignoreCase option for literals. We don't yet allow it to be used in isolation; that can come in a future PR, so you can write:

moo.compile({
  digits: /[0-9]+/,
  cow: {match: "cow", ignoreCase: true},
})

...although I imagine the next feature request will relate to case-insensitive literals, so perhaps this needs more thought.

Note that the check I'm using for whether case is relevant for a literal is probably insufficient, for the same case-folding-related reason as you've explained above.

nathan · 2019-02-24T17:32:28Z

Here are 809 bytes (gzipped) that generate the full map:

function d(r){for(var a=Array.from(r),o=[],i=0;i<a.length;){var t=a[i++],e=-1,f=a[i]&&a[i].charCodeAt(0);if(f<64){e=31&f;var n=a[++i]&&a[i].charCodeAt(0);n<64&&(e|=(31&n)<<5,++i)}if(o.push(t),-1<e)for(var d=0,A=t.codePointAt(0);d<=e>>1;++d)A+=1+(1&e),o.push(String.fromCodePoint(A))}return o}for(var CASE_FOLD={},i=(a=d('A0!À*!Ø*Ā-!Ĳ#Ĺ-Ŋ-!Ź#Ɓ Ƅ!Ƈ!Ɗ Ǝ$Ɠ Ɩ"Ɯ Ɵ Ƣ#Ƨ!Ƭ!Ư!Ʋ Ƶ!ƸƼǄ Ǉ Ǌ Ǎ-Ǟ/Ǳ Ǵ!Ƿ Ǻ7!Ⱥ Ƚ Ɂ!Ʉ"Ɉ%Ͱ!ͶͿΆ!Ή Ό!Ώ!Β<Σ.ϏϘ5ϴϷ!ϺϽ"#Ѡ?Ҋ5!Ӂ+Ӑ="Ա("Ⴀ("ჇჍᲐ2"Ჽ"Ḁ3$ẞ?"Ἀ,Ἐ(Ἠ,Ἰ,Ὀ(Ὑ%Ὠ,ᾈ,ᾘ,ᾨ,Ᾰ&Ὲ&Ῐ$Ῠ&Ὸ&ΩK ℲⅠ<ↃⒶ0!Ⰰ:"Ⱡ!Ᵽ Ⱨ%Ɱ"ⱲⱵⱾ"Ⲃ?"Ⳬ!ⳲꙀ+!Ꚁ9Ꜣ+Ꜳ;!Ꝺ#Ꝿ\'Ꞌ!Ꞑ!Ꞗ3Ɜ$Ʞ&Ꞷ!Ａ0!𐐀,"𐒰$"𐲀"#𑢠<!𖹀<!𞤀 "')).length;i--;)CASE_FOLD[a[i]]=a[i].toLowerCase();var a=d("µſͅςϐ ϕ ϰ ϵᏸ(ᲀ.ẛιꭰ<$"),b=d("μsισβθφπκρεᏰ(в!ос тъѣꙋṡιᎠ<$");for(i=a.length;i--;)CASE_FOLD[a[i]]=b[i];

And a gist with the code I used to generate them.

EDIT: This relies on Array.from(String), String.fromCodePoint, and String.prototype.codePointAt, all of which we will likely need for Unicode support in other places too, and all of which have fairly concise shims.

@tjvr

...although I imagine the next feature request will relate to case-insensitive literals, so perhaps this needs more thought.

Not sure what you mean by this.

Note that the check I'm using for whether case is relevant for a literal is probably insufficient, for the same case-folding-related reason as you've explained above.

I believe it actually is both necessary and sufficient. Since every character not in CaseFolding.txt maps to itself, we can just exhaustively check the ones in CaseFolding.txt:

> itt.entries(CASE_FOLD).flatten().every(c => c.toLowerCase() !== c.toUpperCase())
true

EDIT: even more exhaustive:

// CASE_FOLD_CPS is a set of every code point in CaseFolding.txt (including T and F mappings)
> itt.range(0x10FFFF).every(c =>
... CASE_FOLD_CPS.has(c) ||
... String.fromCodePoint(c).toUpperCase() === String.fromCodePoint(c).toLowerCase())
true
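
Applied to a whole literal, the per-character check verified above might be wrapped like this (a sketch; `caseMatters` is a hypothetical name, and the actual check in the PR may be written differently):

```javascript
// Hypothetical check: case is relevant for a literal if any of its code
// points changes under toLowerCase or toUpperCase.
const caseMatters = literal =>
  Array.from(literal).some(c => c.toLowerCase() !== c.toUpperCase())

console.log(caseMatters('cow'))  // true
console.log(caseMatters('&&'))   // false
console.log(caseMatters(';'))    // false
```

With a check like this, punctuation-only literals would never need to be marked ignoreCase.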

tjvr · 2019-02-24T21:13:59Z

Nice one! Would you like to PR that? (Perhaps unminified?)

I believe it actually is both necessary and sufficient.

Nice -- thanks for comprehensively confirming that.

Regarding my comment about keywords: I have two remaining concerns with this approach. One is that it feels a little bit "magic"; I'm not convinced it's easy to explain when ignoreCase should be used, given the behaviour in this PR. Perhaps it would be better to make things explicit, and go with the options dictionary you suggested originally.

And in particular, now that keywords() is a function by itself, combining it with ignoreCase produces potentially counter-intuitive behaviour:

const lexer = moo.compile({
  ws: /[ \t]/i,
  word: {
    match: /[a-z]+/i,
    ignoreCase: true,
    type: moo.keywords({
      if: "if",
      else: "else",
    }),
  },
})

lexer.reset("foo IF")
// word foo
// ws
// word IF

nathan · 2019-02-25T00:07:27Z

Those are good points. Perhaps we should separate the /i changes into their own PR, since /u doesn't have these issues?

Nice one! Would you like to PR that? (Perhaps unminified?)

I can work on a PR (definitely unminified) for unicode ignoreCase after we sort out the design we want for /i / ignoreCase.

tjvr · 2019-02-25T22:52:00Z

I was thinking the same thing. I opened #123, which adds only the unicode flag.

Once that's merged, I'll rebase this PR, to keep the conversation about ignoreCase in one place... although this is getting quite long 😬 In conclusion, do you prefer the options dict approach, or the ignoreCase option for strings? Or something else entirely?

nathan · 2019-02-27T16:09:23Z

do you prefer the options dict approach, or the ignoreCase option for strings? Or something else entirely?

I think they both make the keywords scenario pretty confusing and unintuitive. Neither this:

moo.compile({
  ws: /[ \t]/i,
  word: {
    match: /[a-z]+/i,
    ignoreCase: true,
    type: moo.keywords({
      if: "if",
      else: "else",
    }),
  },
})

nor this:

moo.compile({
  ws: /[ \t]/i,
  word: {
    match: /[a-z]+/i,
    type: moo.keywords({
      if: "if",
      else: "else",
    }),
  },
}, {ignoreCase: true})

actually does what you'd expect (i.e., treats "If" and "iF" and "IF" as if tokens).
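
The underlying issue is that the keyword lookup compares the matched text exactly, whatever flags the RegExp carries. One possible way around it is to normalise the text before the lookup. The sketch below is self-contained: `keywordTable` is a hypothetical stand-in for a keyword-to-type lookup (it is not moo's `keywords()`), and lower-case ASCII keywords are assumed.

```javascript
// Hypothetical stand-in: build a lookup from keyword text to token type.
function keywordTable(map) {
  const types = new Map()
  for (const [type, words] of Object.entries(map)) {
    for (const word of [].concat(words)) types.set(word, type)
  }
  return text => types.get(text)
}

// Wrap a lookup so that the text is lower-cased before being compared.
const foldCase = lookup => text => lookup(text.toLowerCase())

const exact = keywordTable({if: 'if', else: 'else'})
const folded = foldCase(exact)

console.log(exact('IF'))    // undefined
console.log(folded('IF'))   // if
console.log(folded('foo'))  // undefined
```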

TheKnarf · 2019-07-23T09:05:50Z

Any status on this? I'm trying to write a SQL parser and therefore need case-insensitive keyword parsing.

jdoklovic · 2020-01-30T22:01:46Z

I REALLY need this too. I can't find any reasonable way to implement the following matcher that I need to use:

any word on this?

nathan · 2020-02-02T19:41:29Z

@jdoklovic

I REALLY need this too. I can't find any reasonable way to implement the following matcher that I need to use:

If you don't care about Unicode, you can use something like this to transform the RegExp:

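// Rewrites a RegExp so that it matches case-insensitively without the /i flag,
// by expanding each ASCII letter c into [Cc].
// Limitation: letter ranges inside character classes (e.g. [a-z]) are not
// handled; they expand to an out-of-order range and throw a SyntaxError.
function insensitive(r) {
  // Expand every ASCII letter into both cases, optionally wrapped in a and b.
  const esc = (s, a = '', b = '') => s.replace(/[a-z]/gi, c =>
    `${a}${c.toUpperCase()}${c.toLowerCase()}${b}`)
  // Group 1: an escape sequence. Group 2: a whole character class.
  const PART = /(\\u[\da-fA-F]{4}|\\x[\da-fA-F]{2}|\\c[a-zA-Z]|\\.)|(\[(?:\\.|[^\]])*\])/
  const ESCAPE = /(\\u[\da-fA-F]{4}|\\x[\da-fA-F]{2}|\\c[a-zA-Z]|\\.)/
  // split(PART) yields triples: plain text, escape sequence, character class.
  // Escapes are kept as-is; letters in classes gain both cases in place;
  // letters in plain text are wrapped in their own [Cc] class.
  return new RegExp(r.source.split(PART).map((s, i) =>
    i % 3 === 1 ? s :
    i % 3 ? s && s.split(ESCAPE).map((t, j) =>
      j % 2 ? t : esc(t)).join('') :
    esc(s, '[', ']')).join(''), r.flags.replace('i', ''))
}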
insensitive(/was\s+not\s+in|is\s+not|not\s+in|was\s+not|was\s+in|is|in|was|changed/i)
// => /[Ww][Aa][Ss]\s+[Nn][Oo][Tt]\s+[Ii][Nn]|[Ii][Ss]\s+[Nn][Oo][Tt]|[Nn][Oo][Tt]\s+[Ii][Nn]|[Ww][Aa][Ss]\s+[Nn][Oo][Tt]|[Ww][Aa][Ss]\s+[Ii][Nn]|[Ii][Ss]|[Ii][Nn]|[Ww][Aa][Ss]|[Cc][Hh][Aa][Nn][Gg][Ee][Dd]/

tjvr requested a review from nathan February 23, 2019 14:53

tjvr changed the title ~~Regexp flags~~ Add ignoreCase and unicode flags Feb 23, 2019

nathan requested changes Feb 23, 2019

tjvr mentioned this pull request Feb 24, 2019

Enable unicode property escapes #119

Closed

tjvr added 3 commits February 26, 2019 09:42

Allow ignoreCase flag if all RegExps use it

90471f3

Add test for /ui RegExps

47c216d

Require literals to be marked ignoreCase if RegExps are

f8a5814

tjvr force-pushed the regexp-flags branch from e7b4015 to f8a5814 Compare February 26, 2019 23:23

tjvr changed the title ~~Add ignoreCase and unicode flags~~ Add ignoreCase flag Feb 27, 2019
