Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ignoreCase flag #122

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Add ignoreCase flag #122

wants to merge 3 commits into from

Conversation

tjvr
Copy link
Collaborator

@tjvr tjvr commented Feb 23, 2019

@tjvr tjvr requested a review from nathan February 23, 2019 14:53
@tjvr tjvr changed the title Regexp flags Add ignoreCase and unicode flags Feb 23, 2019
Copy link
Collaborator

@nathan nathan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/u is definitely a good thing to have. /i is also really useful, but it worries me that I can completely change the behavior of these rules:

moo.compile({
  cat: 'cat',
  bat: 'bat',
  loudCat: 'CAT',
  loudBat: 'BAT',
  sep: ';',
})

by adding a rule at the end:

moo.compile({
  cat: 'cat',
  bat: 'bat',
  loudCat: 'CAT',
  loudBat: 'BAT',
  sep: ';',
  anyHat: /hat/i,
})

That being said, I'm still not really a fan of requiring something verbose {ignoreCase: true} on every string. Does requiring an options object only when there are string literals make sense? E.g., both of these would work:

moo.compile({
  anyCat: 'cat',
  anyBat: 'bat',
  sep: ';',
  anyHat: /(hat|chapeau)/i,
}, {ignoreCase: true})
moo.compile({
  anyCat: /cat/i,
  anyBat: /bat/i,
  sep: /;/i,
  anyHat: /hat/i,
})

test/test.js Outdated
@@ -29,16 +29,42 @@ describe('compiler', () => {
expect(lex4.next()).toMatchObject({type: 'err', text: 'nope!'})
})

test("warns for /g, /y, /i, /m, /u", () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also remove /u from the test name.

test/test.js Outdated
expect(lexer.next()).toMatchObject({value: "FoO"})
expect(lexer.next()).toMatchObject({value: "bAr"})
expect(lexer.next()).toMatchObject({value: "QuXx"})
})

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should include a supports unicode test (e.g., check that an astral plane character in a character set doesn't match a lone surrogate or that /\u{1D306}/u matches '\u{1D306}' and not '\\u{1D306}').

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind giving an example? <3

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of both:

test("supports unicode", () => {
  const lexer = compile({
    a: /[𝌆]/u,
  })
  lexer.reset("𝌆")
  expect(lexer.next()).toMatchObject({value: "𝌆"})
  lexer.reset("𝌆".charCodeAt(0))
  expect(() => lexer.next()).toThrow()

  const lexer2 = compile({
    a: /\u{1D356}/u,
  })
  lexer2.reset("𝍖")
  expect(lexer2.next()).toMatchObject({value: "𝍖"})
  lexer2.reset("\\u{1D356}")
  expect(() => lexer2.next()).toThrow()
})

moo.js Show resolved Hide resolved
@tjvr
Copy link
Collaborator Author

tjvr commented Feb 23, 2019

I agree; as I implied in the PR description, it feels weird leaving implicit how strings are handled.

I slightly prefer ignoreCase: true, because we can compile string literals into RegExps like /[Bb][Aa][Rr]/.

I could be persuaded to add an options dict, though.

Sent with GitHawk

@nathan
Copy link
Collaborator

nathan commented Feb 23, 2019

we can compile string literals into RegExps like /[Bb][Aa][Rr]/

We can't do this properly for Unicode unless we include the entire case-folding map (which admittedly isn't terribly large). But I can see how having case-insensitive tokens in a case-sensitive language could be useful.

I think I keep imagining a user writing a long list of string literals like this:

moo.compile({
  if: 'if',
  else: 'else',
  then: 'then',
  ...
})

to make keywords (and having to write {ignoreCase: true, match: …} for every single one to make them case-insensitive), when really she'd just use metaprogramming / a type transform. So I think just a per-string ignoreCase is fine, as long as we don't require ignoreCase on string literals for which case is irrelevant. I want a user to be able to write this:

moo.compile({
  cat: /cat/i,
  bat: /bat/i,
  comma: ',',
  semi: ';',
  lparen: '(',
  rparen: ')',
  lbrace: '{',
  rbrace: '}',
  lbracket: '[',
  rbracket: ']',
  and: '&&',
  or: '||',
  bitand: '&',
  bitor: '|',
  ...
})

instead of this:

moo.compile({
  cat: /cat/i,
  bat: /bat/i,
  comma: {match: ',', ignoreCase: true},
  semi: {match: ';', ignoreCase: true},
  lparen: {match: '(', ignoreCase: true},
  rparen: {match: ')', ignoreCase: true},
  lbrace: {match: '{', ignoreCase: true},
  rbrace: {match: '}', ignoreCase: true},
  lbracket: {match: '[', ignoreCase: true},
  rbracket: {match: ']', ignoreCase: true},
  and: {match: '&&', ignoreCase: true},
  or: {match: '||', ignoreCase: true},
  bitand: {match: '&', ignoreCase: true},
  bitor: {match: '|', ignoreCase: true},
  ...
})

(Arguably, this would be better than either:

moo.compile({
  cat: /cat/i,
  bat: /bat/i,
  op: {match: /[,;(){}[\]]|\|\|?|&&?/i, type: v => v},
})

but there are stylistic/interoperability reasons to prefer alphanumeric token types.)

@tjvr
Copy link
Collaborator Author

tjvr commented Feb 23, 2019

Do we need the case-folding map, or can we use the .toUpperCase() and .toLowerCase() built-ins?

Sounds like we're agreed about making ignoreCase per-string. I agree it shouldn't be required when it's irrelevant.

Sent with GitHawk

@nathan
Copy link
Collaborator

nathan commented Feb 23, 2019

can we use the .toUpperCase() and .toLowerCase() built-ins?

Strictly speaking, toUpperCase and toLowerCase are insufficient; e.g., s should map to [Ssſ] (including U+017F LATIN SMALL LETTER LONG S, which the case-folding built-ins won't produce). See, e.g., http://unicode.org/faq/casemap_charprop.html#2

EDIT: a less esoteric example would be Greek: σ must match [σςΣ] (U+03A3 GREEK CAPITAL LETTER SIGMA, U+03C3 GREEK SMALL LETTER SIGMA, and U+03C2 GREEK SMALL LETTER FINAL SIGMA).

@tjvr
Copy link
Collaborator Author

tjvr commented Feb 23, 2019

Is that for /s/i, or only /s/ui?

If the latter, then I think it would be reasonable to only support {match: "s", ignoreCase: true} when the unicode flag is not used.

Sent with GitHawk

@nathan
Copy link
Collaborator

nathan commented Feb 24, 2019

Is that for /s/i, or only /s/ui?

Only /s/ui matches ſ (/s/i does not). However, /σ/i and /σ/ui must both match σ, ς, and Σ. See the definition of Canonicalize in the spec (and Note 4 below it): /.../i uses Unicode case folding, but refuses to map characters outside of the Basic Latin range (U+0000 through U+007f) into it.

If you're worried about the size of the map, it's large but not horribly so. Here are the simple and common mappings in CaseFolding.txt (all that we need to implement /i properly) in a fairly compact notation:

itt("AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZzµμÀàÁáÂâÃãÄäÅåÆæÇçÈèÉéÊêËëÌìÍíÎîÏïÐðÑñÒòÓóÔôÕõÖöØøÙùÚúÛûÜüÝýÞþĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸÿŹźŻżŽžſsƁɓƂƃƄƅƆɔƇƈƉɖƊɗƋƌƎǝƏəƐɛƑƒƓɠƔɣƖɩƗɨƘƙƜɯƝɲƟɵƠơƢƣƤƥƦʀƧƨƩʃƬƭƮʈƯưƱʊƲʋƳƴƵƶƷʒƸƹƼƽDŽdžDždžLJljLjljNJnjNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯDZdzDzdzǴǵǶƕǷƿǸǹǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠƞȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȺⱥȻȼȽƚȾⱦɁɂɃƀɄʉɅʌɆɇɈɉɊɋɌɍɎɏͅιͰͱͲͳͶͷͿϳΆάΈέΉήΊίΌόΎύΏώΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσΤτΥυΦφΧχΨψΩωΪϊΫϋςσϏϗϐβϑθϕφϖπϘϙϚϛϜϝϞϟϠϡϢϣϤϥϦϧϨϩϪϫϬϭϮϯϰκϱρϴθϵεϷϸϹϲϺϻϽͻϾͼϿͽЀѐЁёЂђЃѓЄєЅѕІіЇїЈјЉљЊњЋћЌќЍѝЎўЏџАаБбВвГгДдЕеЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯяѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӏӁӂӃӄӅӆӇӈӉӊӋӌӍӎӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧԨԩԪԫԬԭԮԯԱաԲբԳգԴդԵեԶզԷէԸըԹթԺժԻիԼլԽխԾծԿկՀհՁձՂղՃճՄմՅյՆնՇշՈոՉչՊպՋջՌռՍսՎվՏտՐրՑցՒւՓփՔքՕօՖֆႠⴀႡⴁႢⴂႣⴃႤⴄႥⴅႦⴆႧⴇႨⴈႩⴉႪⴊႫⴋႬⴌႭⴍႮⴎႯⴏႰⴐႱⴑႲⴒႳⴓႴⴔႵⴕႶⴖႷⴗႸⴘႹⴙႺⴚႻⴛႼⴜႽⴝႾⴞႿⴟჀⴠჁⴡჂⴢჃⴣჄⴤჅⴥჇⴧჍⴭᏸᏰᏹᏱᏺᏲᏻᏳᏼᏴᏽᏵᲀвᲁдᲂоᲃсᲄтᲅтᲆъᲇѣᲈꙋᲐაᲑბᲒგᲓდᲔეᲕვᲖზᲗთᲘიᲙკᲚლᲛმᲜნᲝოᲞპᲟჟᲠრᲡსᲢტᲣუᲤფᲥქᲦღᲧყᲨშᲩჩᲪცᲫძᲬწᲭჭᲮხᲯჯᲰჰᲱჱᲲჲᲳჳᲴჴᲵჵᲶჶᲷჷᲸჸᲹჹᲺჺᲽჽᲾჾᲿჿḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿẀẁẂẃẄẅẆẇẈẉẊẋẌẍẎẏẐẑẒẓẔẕẛṡẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹỺỻỼỽỾỿἈἀἉἁἊἂἋἃἌἄἍἅἎἆἏἇἘἐἙἑἚἒἛἓἜἔἝἕἨἠἩἡἪἢἫἣἬἤἭἥἮἦἯἧἸἰἹἱἺἲἻἳἼἴἽἵἾἶἿἷὈὀὉὁὊὂὋὃὌὄὍὅὙὑὛὓὝὕὟὗὨὠὩὡὪὢὫὣὬὤὭὥὮὦὯὧᾸᾰᾹᾱᾺὰΆάιιῈὲΈέῊὴΉήῘῐῙῑῚὶΊίῨῠῩῡῪὺΎύῬῥῸὸΌόῺὼΏώΩωKkÅåℲⅎⅠⅰⅡⅱⅢⅲⅣⅳⅤⅴⅥⅵⅦⅶⅧⅷⅨⅸⅩⅹⅪⅺⅫⅻⅬⅼⅭⅽⅮⅾⅯⅿↃↄⒶⓐⒷⓑⒸⓒⒹⓓⒺⓔⒻⓕⒼⓖⒽⓗⒾⓘⒿⓙⓀⓚⓁⓛⓂⓜⓃⓝⓄⓞⓅⓟⓆⓠⓇⓡⓈⓢⓉⓣⓊⓤⓋⓥⓌⓦⓍⓧⓎⓨⓏⓩⰀⰰⰁⰱⰂⰲⰃⰳⰄⰴⰅⰵⰆⰶⰇⰷⰈⰸⰉⰹⰊⰺⰋⰻⰌⰼⰍⰽⰎⰾⰏⰿⰐⱀⰑⱁⰒⱂⰓⱃⰔⱄⰕⱅⰖⱆⰗⱇⰘⱈⰙⱉⰚⱊⰛⱋⰜⱌⰝⱍⰞⱎⰟⱏⰠⱐⰡⱑⰢⱒⰣⱓⰤⱔⰥⱕⰦⱖⰧⱗⰨⱘⰩⱙⰪⱚⰫⱛⰬⱜⰭⱝⰮⱞⱠⱡⱢɫⱣᵽⱤɽⱧⱨⱩⱪⱫⱬⱭɑⱮɱⱯɐⱰɒⱲⱳⱵⱶⱾȿⱿɀⲀⲁⲂⲃⲄⲅⲆⲇⲈⲉⲊⲋⲌⲍⲎⲏⲐⲑⲒⲓⲔⲕⲖⲗⲘⲙⲚⲛⲜⲝⲞⲟⲠⲡⲢⲣⲤⲥⲦⲧⲨⲩⲪⲫⲬⲭⲮⲯⲰⲱⲲⲳⲴⲵⲶⲷⲸⲹⲺⲻⲼⲽⲾⲿⳀⳁⳂⳃⳄⳅⳆⳇⳈⳉⳊⳋⳌⳍⳎⳏⳐⳑⳒⳓⳔⳕⳖⳗⳘⳙⳚⳛⳜⳝⳞⳟⳠⳡⳢⳣⳫⳬⳭⳮⳲⳳꙀꙁꙂꙃꙄꙅꙆꙇꙈꙉꙊꙋꙌꙍꙎꙏꙐꙑꙒꙓꙔꙕꙖꙗꙘꙙꙚꙛꙜꙝꙞꙟꙠꙡꙢꙣꙤꙥꙦꙧꙨꙩꙪꙫꙬꙭꚀꚁꚂꚃꚄꚅꚆꚇꚈꚉꚊꚋꚌꚍꚎꚏꚐꚑꚒꚓꚔꚕꚖꚗꚘꚙꚚꚛꜢꜣꜤꜥꜦꜧꜨꜩꜪꜫꜬꜭꜮꜯꜲꜳꜴꜵꜶꜷꜸꜹꜺꜻꜼꜽꜾꜿꝀꝁꝂꝃꝄꝅꝆꝇꝈꝉꝊꝋꝌꝍꝎꝏꝐꝑꝒꝓꝔꝕꝖꝗꝘꝙꝚꝛꝜꝝꝞꝟꝠꝡꝢꝣꝤꝥꝦꝧꝨꝩꝪꝫꝬꝭꝮꝯꝹꝺꝻꝼꝽᵹꝾꝿꞀꞁꞂꞃꞄꞅꞆꞇꞋꞌꞍɥꞐꞑꞒꞓꞖꞗꞘꞙꞚꞛꞜꞝꞞꞟꞠꞡꞢꞣꞤꞥꞦꞧꞨꞩꞪɦꞫɜꞬɡꞭɬꞮɪꞰʞꞱʇꞲʝꞳꭓꞴꞵꞶꞷꞸꞹꭰᎠꭱᎡꭲᎢꭳᎣꭴᎤꭵᎥꭶᎦꭷᎧꭸᎨꭹᎩꭺᎪꭻᎫꭼᎬꭽᎭꭾᎮꭿᎯꮀᎰꮁᎱꮂᎲꮃᎳꮄᎴꮅᎵꮆᎶꮇᎷꮈᎸꮉᎹꮊᎺꮋᎻꮌᎼꮍᎽꮎᎾꮏᎿꮐᏀꮑᏁꮒᏂꮓᏃꮔᏄꮕᏅꮖᏆꮗᏇꮘᏈꮙᏉꮚᏊꮛᏋꮜᏌꮝᏍꮞᏎꮟᏏꮠᏐꮡᏑꮢᏒꮣᏓꮤᏔꮥᏕꮦᏖꮧᏗꮨᏘꮩᏙꮪᏚꮫᏛꮬᏜꮭᏝꮮᏞꮯᏟꮰᏠꮱᏡꮲᏢꮳᏣꮴᏤꮵᏥꮶᏦꮷᏧꮸᏨꮹᏩꮺᏪꮻᏫꮼᏬꮽᏭꮾᏮꮿᏯAaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz𐐀𐐨𐐁𐐩𐐂𐐪𐐃𐐫𐐄𐐬𐐅𐐭𐐆𐐮𐐇𐐯𐐈𐐰𐐉𐐱𐐊𐐲𐐋𐐳𐐌𐐴𐐍𐐵𐐎𐐶𐐏𐐷𐐐𐐸𐐑𐐹𐐒𐐺𐐓𐐻𐐔𐐼𐐕𐐽𐐖𐐾𐐗𐐿𐐘𐑀𐐙𐑁𐐚𐑂𐐛𐑃𐐜𐑄𐐝𐑅𐐞𐑆𐐟𐑇𐐠𐑈𐐡𐑉𐐢𐑊𐐣𐑋𐐤𐑌𐐥𐑍𐐦𐑎𐐧𐑏𐒰𐓘𐒱𐓙𐒲𐓚𐒳𐓛𐒴𐓜𐒵𐓝𐒶𐓞𐒷𐓟𐒸𐓠𐒹𐓡𐒺𐓢𐒻𐓣𐒼𐓤𐒽𐓥𐒾𐓦𐒿𐓧𐓀𐓨𐓁𐓩𐓂𐓪𐓃𐓫𐓄𐓬𐓅𐓭𐓆𐓮𐓇𐓯𐓈𐓰𐓉𐓱𐓊𐓲𐓋𐓳𐓌𐓴𐓍𐓵𐓎𐓶𐓏𐓷𐓐𐓸𐓑𐓹𐓒𐓺𐓓𐓻𐲀𐳀𐲁𐳁𐲂𐳂𐲃𐳃𐲄𐳄𐲅𐳅𐲆𐳆𐲇𐳇𐲈𐳈𐲉𐳉𐲊𐳊𐲋𐳋𐲌𐳌𐲍𐳍𐲎𐳎𐲏𐳏𐲐𐳐𐲑𐳑𐲒𐳒𐲓𐳓𐲔𐳔𐲕𐳕𐲖𐳖𐲗𐳗𐲘𐳘𐲙𐳙𐲚𐳚𐲛𐳛𐲜𐳜𐲝𐳝𐲞𐳞𐲟𐳟𐲠𐳠𐲡𐳡𐲢𐳢𐲣𐳣𐲤𐳤𐲥𐳥𐲦𐳦𐲧𐳧𐲨𐳨𐲩𐳩𐲪𐳪𐲫𐳫𐲬𐳬𐲭𐳭𐲮𐳮𐲯𐳯𐲰𐳰𐲱𐳱𐲲𐳲𑢠𑣀𑢡𑣁𑢢𑣂𑢣𑣃𑢤𑣄𑢥𑣅𑢦𑣆𑢧𑣇𑢨𑣈𑢩𑣉𑢪𑣊𑢫𑣋𑢬𑣌𑢭𑣍𑢮𑣎𑢯𑣏𑢰𑣐𑢱𑣑𑢲𑣒𑢳𑣓𑢴𑣔𑢵𑣕𑢶𑣖𑢷𑣗𑢸𑣘𑢹𑣙𑢺𑣚𑢻𑣛𑢼𑣜𑢽𑣝𑢾𑣞𑢿𑣟𖹀𖹠𖹁𖹡𖹂𖹢𖹃𖹣𖹄𖹤𖹅𖹥𖹆𖹦𖹇𖹧𖹈𖹨𖹉𖹩𖹊𖹪𖹋𖹫𖹌𖹬𖹍𖹭𖹎𖹮𖹏𖹯𖹐𖹰𖹑𖹱𖹒𖹲𖹓𖹳𖹔𖹴𖹕𖹵𖹖𖹶𖹗𖹷𖹘𖹸𖹙𖹹𖹚𖹺𖹛𖹻𖹜𖹼𖹝𖹽𖹞𖹾𖹟𖹿𞤀𞤢𞤁𞤣𞤂𞤤𞤃𞤥𞤄𞤦𞤅𞤧𞤆𞤨𞤇𞤩𞤈𞤪𞤉𞤫𞤊𞤬𞤋𞤭𞤌𞤮𞤍𞤯𞤎𞤰𞤏𞤱𞤐𞤲𞤑𞤳𞤒𞤴𞤓𞤵𞤔𞤶𞤕𞤷𞤖𞤸𞤗𞤹𞤘𞤺𞤙𞤻𞤚𞤼𞤛𞤽𞤜𞤾𞤝𞤿𞤞𞥀𞤟𞥁𞤠𞥂𞤡𞥃ẞßᾈᾀᾉᾁᾊᾂᾋᾃᾌᾄᾍᾅᾎᾆᾏᾇᾘᾐᾙᾑᾚᾒᾛᾓᾜᾔᾝᾕᾞᾖᾟᾗᾨᾠᾩᾡᾪᾢᾫᾣᾬᾤᾭᾥᾮᾦᾯᾧᾼᾳῌῃῼῳ").chunksOf(2).toObject()

That comes out to <6KB gzipped UTF-8.

EDIT: That can actually probably be made much smaller, because toUpperCase can recover the majority of these. I will see what I can do.

@tjvr
Copy link
Collaborator Author

tjvr commented Feb 24, 2019

I added an ignoreCase option for literals. We don't yet allow it to be used in isolation; that can come in a future PR, so you can write:

moo.compile({
  digits: /[0-9]+/,
  cow: {match: "cow", ignoreCase: true},
})

...although I imagine the next feature request will relate to case-insensitive literals, so perhaps this needs more thought.

Note that the check I'm using for whether case is relevant for a literal is probably insufficient, for the same case-folding-related reason as you've explained above.

Sent with GitHawk

@nathan
Copy link
Collaborator

nathan commented Feb 24, 2019

Here are 809 bytes (gzipped) that generate the full map:

function d(r){for(var a=Array.from(r),o=[],i=0;i<a.length;){var t=a[i++],e=-1,f=a[i]&&a[i].charCodeAt(0);if(f<64){e=31&f;var n=a[++i]&&a[i].charCodeAt(0);n<64&&(e|=(31&n)<<5,++i)}if(o.push(t),-1<e)for(var d=0,A=t.codePointAt(0);d<=e>>1;++d)A+=1+(1&e),o.push(String.fromCodePoint(A))}return o}for(var CASE_FOLD={},i=(a=d('A0!À*!Ø*Ā-!IJ#Ĺ-Ŋ-!Ź#Ɓ Ƅ!Ƈ!Ɗ Ǝ$Ɠ Ɩ"Ɯ Ɵ Ƣ#Ƨ!Ƭ!Ư!Ʋ Ƶ!ƸƼDŽ LJ NJ Ǎ-Ǟ/DZ Ǵ!Ƿ Ǻ7!Ⱥ Ƚ Ɂ!Ʉ"Ɉ%Ͱ!ͶͿΆ!Ή Ό!Ώ!Β<Σ.ϏϘ5ϴϷ!ϺϽ"#Ѡ?Ҋ5!Ӂ+Ӑ="Ա("Ⴀ("ჇჍᲐ2"Ჽ"Ḁ3$ẞ?"Ἀ,Ἐ(Ἠ,Ἰ,Ὀ(Ὑ%Ὠ,ᾈ,ᾘ,ᾨ,Ᾰ&Ὲ&Ῐ$Ῠ&Ὸ&ΩK ℲⅠ<ↃⒶ0!Ⰰ:"Ⱡ!Ᵽ Ⱨ%Ɱ"ⱲⱵⱾ"Ⲃ?"Ⳬ!ⳲꙀ+!Ꚁ9Ꜣ+Ꜳ;!Ꝺ#Ꝿ\'Ꞌ!Ꞑ!Ꞗ3Ɜ$Ʞ&Ꞷ!A0!𐐀,"𐒰$"𐲀"#𑢠<!𖹀<!𞤀 "')).length;i--;)CASE_FOLD[a[i]]=a[i].toLowerCase();var a=d("µſͅςϐ ϕ ϰ ϵᏸ(ᲀ.ẛιꭰ<$"),b=d("μsισβθφπκρεᏰ(в!ос тъѣꙋṡιᎠ<$");for(i=a.length;i--;)CASE_FOLD[a[i]]=b[i];

And a gist with the code I used to generate them.

EDIT: This relies on Array.from(String), String.fromCodePoint, and String.prototype.codePointAt, all of which we will likely need for Unicode support in other places too, and all of which have fairly concise shims.

@tjvr

...although I imagine the next feature request will relate to case-insensitive literals, so perhaps this needs more thought.

Not sure what you mean by this.

Note that the check I'm using for whether case is relevant for a literal is probably insufficient, for the same case-folding-related reason as you've explained above.

I believe it actually is both necessary and sufficient. Since every character not in CaseFolding.txt maps to itself, we can just exhaustively check the ones in CaseFolding.txt:

> itt.entries(CASE_FOLD).flatten().every(c => c.toLowerCase() !== c.toUpperCase())
true

EDIT: even more exhaustive:

// CASE_FOLD_CPS is a set of every code point in CaseFolding.txt (including T and F mappings)
> itt.range(0x10FFFF).every(c =>
... CASE_FOLD_CPS.has(c) ||
... String.fromCodePoint(c).toUpperCase() === String.fromCodePoint(c).toLowerCase())
true

@tjvr
Copy link
Collaborator Author

tjvr commented Feb 24, 2019

Nice one! Would you like to PR that? (Perhaps unminified?)

I believe it actually is both necessary and sufficient.

Nice -- thanks for comprehensively confirming that.

Regarding my comment about keywords: I have two remaining concerns with this approach. One is that it feels a little bit "magic"; I'm not convinced it's easy to explain when ignoreCase should be used, given the behaviour in this PR. Perhaps it would be better to make things explicit, and go with the options dictionary you suggested originally.

And in particular, now that keywords() is a function by itself, combining it with ignoreCase produces potentially counter-intuitive behaviour:

const lexer = moo.compile({
  ws: /[ \t]/i,
  word: {
    match: /[a-z]+/i,
    ignoreCase: true,
    type: moo.keywords({
      if: "if",
      else: "else",
    }),
  },
})

lexer.reset("foo IF")
// word foo
// ws
// word IF

@nathan
Copy link
Collaborator

nathan commented Feb 25, 2019

Those are good points. Perhaps we should separate the /i changes into their own PR, since /u doesn't have these issues?

Nice one! Would you like to PR that? (Perhaps unminified?)

I can work on a PR (definitely unminified) for unicode ignoreCase after we sort out the design we want for /i / ignoreCase.

@tjvr
Copy link
Collaborator Author

tjvr commented Feb 25, 2019

I was thinking the same thing. I opened #123, which adds only the unicode flag.

Once that's merged, I'll rebase this PR, to keep the conversation about ignoreCase in one place... although this is getting quite long 😬 In conclusion, do you prefer the options dict approach, or the ignoreCase option for strings? Or something else entirely?

@tjvr tjvr changed the title Add ignoreCase and unicode flags Add ignoreCase flag Feb 27, 2019
@nathan
Copy link
Collaborator

nathan commented Feb 27, 2019

do you prefer the options dict approach, or the ignoreCase option for strings? Or something else entirely?

I think they both make the keywords scenario pretty confusing and unintuitive. Neither this:

moo.compile({
  ws: /[ \t]/i,
  word: {
    match: /[a-z]+/i,
    ignoreCase: true,
    type: moo.keywords({
      if: "if",
      else: "else",
    }),
  },
})

nor this:

moo.compile({
  ws: /[ \t]/i,
  word: {
    match: /[a-z]+/i,
    type: moo.keywords({
      if: "if",
      else: "else",
    }),
  },
}, {ignoreCase: true})

actually does what you'd expect (i.e., treats "If" and "iF" and "IF" as if tokens).

@TheKnarf
Copy link

Any status on this? I'm trying to write a SQL parser and therefor need case independent keyword parsing.

@jdoklovic
Copy link

I REALLY need this too. I can't find any reasonable way to implement the following matcher that I need to use:

/was\s+not\s+in|is\s+not|not\s+in|was\s+not|was\s+in|is|in|was|changed/i

any word on this?

@nathan
Copy link
Collaborator

nathan commented Feb 2, 2020

@jdoklovic

I REALLY need this too. I can't find any reasonable way to implement the following matcher that I need to use:

If you don't care about Unicode, you can use something like this to transform the RegExp:

function insensitive(r) {
  const esc = (s, a = '', b = '') => s.replace(/[a-z]/gi, c =>
    `${a}${c.toUpperCase()}${c.toLowerCase()}${b}`)
  const PART = /(\\u[\da-fA-F]{4}|\\x[\da-fA-F]{2}|\\c[a-zA-Z]|\\.)|(\[(?:\\.|[^\]])*\])/
  const ESCAPE = /(\\u[\da-fA-F]{4}|\\x[\da-fA-F]{2}|\\c[a-zA-Z]|\\.)/
  return new RegExp(r.source.split(PART).map((s, i) => 
    i % 3 === 1 ? s : 
    i % 3 ? s && s.split(ESCAPE).map((t, j) =>
      j % 2 ? t : esc(t)).join('') :
    esc(s, '[', ']')).join(''), r.flags.replace('i', ''))
}
insensitive(/was\s+not\s+in|is\s+not|not\s+in|was\s+not|was\s+in|is|in|was|changed/i)
// => /[Ww][Aa][Ss]\s+[Nn][Oo][Tt]\s+[Ii][Nn]|[Ii][Ss]\s+[Nn][Oo][Tt]|[Nn][Oo][Tt]\s+[Ii][Nn]|[Ww][Aa][Ss]\s+[Nn][Oo][Tt]|[Ww][Aa][Ss]\s+[Ii][Nn]|[Ii][Ss]|[Ii][Nn]|[Ww][Aa][Ss]|[Cc][Hh][Aa][Nn][Gg][Ee][Dd]/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants