Skip to content

Latest commit

 

History

History
131 lines (70 loc) · 6.83 KB

04-08.md

File metadata and controls

131 lines (70 loc) · 6.83 KB

April 8, 2021 Incubator Call Notes (RegExp set notation)

Attendees: (alphasorted)

  • Mark Davis (MED)
  • Markus Scherer (MWS)
  • Mathias Bynens (MB)
  • Michael Ficarra (MF)
  • Richard Gibson (RGN)
  • Shane Carr (SFC)
  • Yulia Startsev (YSV)
  • Waldemar Horwat (WH)

RegExp set notation

YSV: MWS, do you have any issues?

MWS: Nothing in particular. More or less complete proposal. Need to confirm unambiguous grammar. Want to check if direction change is needed.

MWS: Combined new proposal for RegExp set notation + properties of strings (fka “sequence properties”).

MWS: Set notation: support subtraction. Like with sets, but for character classes. Only union is currently supported. Want to add the other ones, particularly important for Unicode properties.

MWS: Small group has worked together and come up with a BNF description.

MWS: Most important thing: don't break people (the web).

MB: Initially looked into in-pattern syntax. No way to be compatible with non-u regexps. It would look like it worked but then would do something else. Convinced ourselves that adding a new flag would solve the problems.

MB: v flag implies u flag. No longer need a syntax marker. More readable, less repetitive pattern.

MB: how do string properties behave in character classes? properties of strings are particularly useful with set notation. new flag adds support for the two new features together.

MWS: Properties of strings really want to be in the character class because you want to add or subtract things.

WH: not sure if this solves the sequence in a negated character class issue.

MWS: we will get to that later

MB: what these proposals have in common is that they are changing the meaning of square brackets (character class semantics)

WH: I was convinced of the flag solely based on the syntax issue, not sure if it solves sequence class issues

MWS: in a survey of other regexp engines, they are all over the place with syntax for set operations. settled on double punctuation. mixing ranges and subtraction/intersection requires range to be enclosed in square brackets because otherwise it would look confusing.

MWS: want to reserve other ascii double punctuation for future use. also single open/close curly braces.

MF: What would {…} be reserved for?

MWS: properties of strings but no syntax for literal strings in character class. since we have an idea of what we might want to do with them, it seems reasonable to reserve.

WH: it may be ambiguous what it means to reserve these things. Is [&&&&&&&] a character class with a few redundant &s, or is it & followed by the && operator followed by & followed by the && operator followed by another &?

MED: we looked over different regexp syntaxes in other engines and the recommendations we made are made in light of what others are doing. The double syntax is common to help reduce ambiguity.

WH: I agree with the goal, but the way it was stated led to some interesting questions. I believe it is doable, we just need to figure out the right way to do it.

MWS: Dashes have 3 uses now: range, subtraction, and literal character. We were previously only suggesting it be allowed at beginning/end, but now we require it to be escaped. No literal - is allowed.

MWS: precedence solution: brackets whenever you mix kinds of operators. associativity solution: no issue, always associate left.

WH: does caret require brackets?

MWS: no, it applies to everything in the character class

MED: another option could be that union (via juxtaposition) be tighter than intersection/difference. group concluded it would be safer to require brackets.

MF: In this mode with the v flag, can I do [^ab]?

MWS: Yes.

MF: [[a-z]--q] So the syntax error issue only comes up when mixing operators.

MWS: you can put properties of strings into a character class. it's also possible to use difference to remove sequences from string properties.

MWS: Within [^…] we can still support properties of strings in those cases where it’s easily provable that the end result is code points only, and no strings.

MF: Including {} in the proposal to allow literal sequences would make it stronger. To enable things like: match all RGI_Emoji but not the French flag.

WH: I also support literal sequences being included with this proposal

MWS: We agree that literal strings are useful, and want to reserve syntax for them. We kept that out of this proposal to keep it small.

MF: We do like to keep proposals small, but not arbitrarily so. This feature seems to be required for the proposal to be minimally useful.

MED: Per CLDR, characters that are used in some languages can consist of multiple code points.

WH: Flag emoji are a hornet's nest because they're not self-synchronizing.

MED: That's one of the warts of Unicode that we deeply regret.

YSV: Choice of v as the flag name.

MWS: not important that it is v, just nice that it comes after u

YSV: I think it'd be useful if it stood for something. maybe we could work backwards to a meaning

SFC: flag could be seen as a progression from u, so u -> v makes sense. the other way is that we are adding sets of strings, so maybe s would make sense.

WH: s is already used for single-line mode

MB: it would be nice if the flag could mean something or if it was clear what it stood for. we added the d flag for "inDices".

MED: no strong reasoning for a particular choice.

WH: grapheme matching (if it ever happens, because it's hard/problematic) might use w flag. v fits as being part way to grapheme matching.

YSV: an issue with the u -> v "upgrade", letter grades work the other way around. maybe go u -> t instead.

MED: Lowercase t is used somewhere, would want to be uppercase T.

MF: would grapheme matching actually be a realistic direction for the regexp part of this language? because of the locale sensitivity, wouldn't it at least be in 402 if we did it at all?

MED: There's two kinds of grapheme matching: one that only affects parsing. The other is locale-sensitive (language-specific graphemes, which change over time!). It's important to distinguish between them.

MWS: let's stay away from the grapheme matching conversation.

MB: would anybody who has previously voiced concerns speak about their current thoughts?

MF: the changes presented here have alleviated my concerns. I like how the combination of these proposals works much better now.

WH: there are some outstanding issues with ambiguity with doubled operators. what does && mean? is that legal? it depends on interpretation of prose currently.

YSV: anything else we would like as an outcome from this meeting?

MED: is it time for us to write spec text for this? it seems like it

MWS: it appears there's rough consensus from this group.

MB: I've added an agenda item to present a shorter version of this proposal at the next TC39 meeting. If the committee agrees with this direction, then we can start working on the spec text afterwards.

YSV: sounds like a great conclusion.