Supported Modifier Flags #1

Closed
rbuckton opened this issue Nov 22, 2021 · 9 comments

Comments

@rbuckton
Collaborator

In the October 2021 plenary, @michaelficarra asked that we outline and provide motivating examples for each flag we are considering as a supported modifier.

The flags currently under consideration are:

  • i — ignore-case
    • Rationale — Toggling ignore-case is useful when parts of a pattern differ in case sensitivity, or when patterns are supplied via JSON configuration. It is especially useful when working with complex Unicode character ranges.
    • Example — Match an uppercase ASCII letter followed by one or more uppercase or lowercase ASCII letters or apostrophes (')
      const re = /^[A-Z](?i)[a-z']+$/;
      re.test("O'Neill"); // true
      re.test("o'neill"); // false
      
      // alternatively (defaulting to ignore-case):
      const re2 = /^(?-i:[A-Z])[a-z']+$/i;
    • Example — Match word starting with D followed by word starting with D or d (from .NET documentation, see 1)
      const re = /\b(D\w+)(?ix)\s(d\w+)\b/g;
      const input = "double dare double Double a Drooling dog The Dreaded Deep";
      re.exec(input); // ["Drooling dog", "Drooling", "dog"]
      re.exec(input); // ["Dreaded Deep", "Dreaded", "Deep"]
  • m — multiline
    • Rationale — Flexibility in matching beginning-of-buffer vs. beginning-of-line or end-of-buffer vs. end-of-line in a complex pattern.
    • Example — Match a frontmatter block at the start of a file
      const re = /^---(?m)$((?:^(?!---$).*$)*)^---$/;
      re.test("---a"); // false
      re.test("---\n---"); // true
      re.test("---\na: b\n---"); // true
  • s — dot-all (i.e., "single line")
    • Rationale — Control over . matching semantics within a pattern.
    • Example
      const re = /a.c(?s:.)*x.z/;
      re.test("a\ncx\nz"); // false
      re.test("abcdxyz"); // true
      re.test("aBc\nxYz"); // true
  • x — Extended Mode. This flag is proposed by https://github.com/tc39/proposal-regexp-x-mode
    • Rationale — Would allow control over significant whitespace handling in a pattern.
    • Example — Disabling x mode when composing a complex pattern:
      const idPattern = String.raw`[a-z]{2} \d{4}`; // space required; String.raw keeps \d intact
      const re = new RegExp(String.raw`
        # match the id
        (?<id>(?-x:${idPattern}))
        
        # match a separator
        :\s
        
        # match the value
        (?<value>\w+)
      `, "x");
      
      re.exec("aa0123: foo")?.groups; // undefined
      re.exec("aa 0123: foo")?.groups; // { id: "aa 0123", value: "foo" }

Flags likely too complex to support:

  • u — Unicode. This flag affects how a pattern is parsed, not how it is matched. Supporting it would likely require a cover grammar and additional static semantics.
  • v — Extended Unicode. This flag is proposed by https://github.com/tc39/proposal-regexp-set-notation as an extension of the u flag and would have the same difficulties.
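
To illustrate why u changes how a pattern is parsed rather than how it is matched, here is a minimal sketch using existing engine behavior only. The helper name compiles is hypothetical and not part of the proposal.

```javascript
// The same source text, parsed two different ways.
const legacy = /\u{61}/;   // without `u`: the letter "u" repeated 61 times
const unicode = /\u{61}/u; // with `u`: the code point U+0061, i.e. "a"

// Hypothetical helper: reports whether a source/flags pair parses at all.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(legacy.test("a"));            // false
console.log(legacy.test("u".repeat(61))); // true
console.log(unicode.test("a"));           // true
console.log(compiles("\\-", ""));         // true: legacy identity escape
console.log(compiles("\\-", "u"));        // false: rejected in `u` mode
```

Because the grammar itself differs, switching u on or off partway through a pattern would mean re-parsing the rest of the source under different rules, which is the difficulty described above.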

Flags that will never be supported:

  • g — Global. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid pattern would have no effect.
  • y — Sticky. This flag affects the index at which matching starts and not the matching behavior itself. Changing it mid pattern would have no effect.
  • d — Indices. This flag affects the match result. Changing it mid pattern would have no effect.
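
As a rough illustration of the three points above, the sketch below uses only existing flag behavior: g and y change where exec() begins (through lastIndex), and d only adds data to the result. The variable names are illustrative.

```javascript
const input = "ab ab";

// Without `g` or `y`, exec() ignores lastIndex and searches from index 0.
const plainRe = /ab/;
plainRe.lastIndex = 3;
const plainIndex = plainRe.exec(input).index;

// With `g`, exec() searches forward starting at lastIndex.
const globalRe = /ab/g;
globalRe.lastIndex = 1;
const globalIndex = globalRe.exec(input).index;

// With `y`, exec() must match exactly at lastIndex.
const stickyRe = /ab/y;
stickyRe.lastIndex = 1;
const stickyResult = stickyRe.exec(input);

// With `d`, the result gains an `indices` array; matching is unchanged.
const withIndices = /a(b)/d.exec(input);

console.log(plainIndex, globalIndex, stickyResult); // 0 3 null
console.log(withIndices.indices[1]);                // [ 1, 2 ]
```

None of these effects depends on anything inside the pattern, so there is nothing for an inline modifier to toggle.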

Footnotes

  1. https://docs.microsoft.com/en-us/dotnet/standard/base-types/miscellaneous-constructs-in-regular-expressions#inline-options

@ljharb
Member

ljharb commented Nov 22, 2021

For the examples, can you share how you'd do it without the relevant proposal?

@rbuckton
Collaborator Author

rbuckton commented Nov 23, 2021

i

Simple cases like /[A-Z][A-Za-z]/ are trivial:

// match an uppercase ASCII letter followed by a mixed-case ASCII letter

// with 'i' modifier:
/[A-Z](?i)[A-Z]/

// without 'i' modifier:
/[A-Z][A-Za-z]/

However, more complex cases are far from trivial:

// match a mixed case "hello" followed by the exact characters "World"

// with 'i' modifier:
/(?i:hello) World/

// without 'i' modifier:
/[Hh][Ee][Ll][Ll][Oo] World/
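
For patterns built at runtime, the manual expansion above can be generated. The following is a minimal sketch, assuming ASCII-only input; caseInsensitiveSource is a hypothetical helper, not an existing API.

```javascript
// Hypothetical helper: expands each ASCII letter into a two-case character
// class and escapes regex syntax characters; other characters pass through.
function caseInsensitiveSource(text) {
  return Array.from(text, (ch) => {
    if (/[A-Za-z]/.test(ch)) {
      return `[${ch.toUpperCase()}${ch.toLowerCase()}]`;
    }
    return ch.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
  }).join("");
}

const source = caseInsensitiveSource("hello");
const re = new RegExp(`${source} World`);

console.log(source);                 // [Hh][Ee][Ll][Ll][Oo]
console.log(re.test("HeLLo World")); // true
console.log(re.test("hello world")); // false
```

This only handles literal text. Applying it to an arbitrary sub-pattern (character ranges, escapes, Unicode case folding) would require parsing the pattern, which is part of why the non-modifier version is far from trivial.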

m

If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode:

// with 'm' modifier:
/^---(?m)$((?:^(?!---$).*$)*)^---$/

// without the 'm' modifier, in 'u' mode:
/\A---$((?:^(?!---$).*$)*)^---$/mu

// without the 'm' modifier, not in 'u' mode: not possible to invert when in 'm' mode

s

It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:

// match /a.b/ outside of 's' mode, then /.+/ in 's' mode, then /c.d/ outside of 's' mode
// with 's' modifier
/a.b(?s:.)+c.d/

// without 's' modifier
/a.b(?:.|[\r\n\u2028\u2029])+c.d/

// match /a.b/ inside of 's' mode, then /.+/ outside of 's' mode, then /c.d/ inside of 's' mode
// with 's' modifier
/a.b(?-s:.+)c.d/s

// without 's' modifier
/a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s
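
A quick sanity check of the two non-modifier forms above, as a sketch with arbitrary sample inputs:

```javascript
// Outside `s` mode, with an `s`-style wildcard in the middle.
const dotAllMiddle = /a.b(?:.|[\r\n\u2028\u2029])+c.d/;
// Inside `s` mode, with a non-`s`-style wildcard in the middle.
const plainMiddle = /a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s;

console.log(dotAllMiddle.test("a-b\n\nc-d")); // true: newlines allowed in the middle
console.log(dotAllMiddle.test("a\nb--c-d"));  // false: "a.b" is not in `s` mode
console.log(plainMiddle.test("a\nb--c\nd"));  // true: "a.b" and "c.d" are in `s` mode
console.log(plainMiddle.test("a-b\n\nc-d"));  // false: newlines rejected in the middle
```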

@ljharb
Member

ljharb commented Nov 23, 2021

There's nothing with [^\s\S] for the dotAll case?

@rbuckton
Collaborator Author

I'm not sure I understand what you mean. Can you clarify?

@rbuckton
Collaborator Author

If you mean using [\s\S] to match everything, that's feasible for the first s example, sure. I don't see how it helps with the second example though.

@RunDevelopment

I just want to share a little trick to emulate m and non-m mode without using ^ and $. This might be relevant for transpilers.

- /^ $/ == /(?<![\s\S]) (?![\s\S])/
- /^ $/m == /(?<!.) (?!.)/ // no `s` flag!

This works for both u and non-u mode.
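
One way to check this equivalence is to compare the positions at which each assertion matches. The sketch below assumes an engine with lookbehind support; positions is a hypothetical helper and the input string is arbitrary.

```javascript
// Hypothetical helper: every index at which a zero-width pattern matches.
function positions(source, flags, input) {
  const re = new RegExp(source, flags + "g");
  return Array.from(input.matchAll(re), (m) => m.index);
}

const input = "ab\ncd\r\nef";

// Non-`m` mode: only the buffer boundaries.
console.log(positions("^", "", input), positions("(?<![\\s\\S])", "", input));
console.log(positions("$", "", input), positions("(?![\\s\\S])", "", input));

// `m` mode: every line boundary (the emulation must not use the `s` flag).
console.log(positions("^", "m", input), positions("(?<!.)", "", input));
console.log(positions("$", "m", input), positions("(?!.)", "", input));
```

Each line prints the native result followed by the emulated one, and the two arrays on each line should be identical.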

@rbuckton
Collaborator Author

rbuckton commented Jun 7, 2022

The modifiers supported by this proposal will be limited to i, m, and s. Future proposals (such as the x-mode proposal) may extend this set, but doing so is out of scope here.

@rbuckton rbuckton closed this as completed Jun 7, 2022
@slevithan

slevithan commented May 30, 2024

@rbuckton, I know this is already closed (and implemented in V8, yay!), but for interest's sake, note that it's very possible to emulate presence or lack of s and m. I do it in regex-make to locally apply the presence or absence of local flags for RegExp instances interpolated into a template.

m

> If you are in u mode, you could emulate non-m mode when in m mode using the proposed \A and \z buffer boundaries. However, if you are not in u mode, there's no way to match the buffer boundaries when in m mode

Emulating is possible without u mode or buffer boundaries.

  • Emulate an m mode ^: (?<=^|[\n\r\u2028\u2029])
  • Emulate a non-m mode ^: (?<![^])
  • Emulate an m mode $: (?=$|[\n\r\u2028\u2029])
  • Emulate a non-m mode $: (?![^])
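
The four emulations above can be compared against the native anchors by listing the positions where each one matches. This is only a sketch: positions and same are hypothetical helpers, the sample input is arbitrary, and it does not reproduce how regex-make applies them.

```javascript
const LT = "[\\n\\r\\u2028\\u2029]";
const emulated = {
  mStart: `(?<=^|${LT})`, // `^` in `m` mode
  start: "(?<![^])",      // `^` outside `m` mode
  mEnd: `(?=$|${LT})`,    // `$` in `m` mode
  end: "(?![^])",         // `$` outside `m` mode
};

// Hypothetical helper: every index at which a zero-width pattern matches.
function positions(source, flags, input) {
  const re = new RegExp(source, flags + "g");
  return Array.from(input.matchAll(re), (m) => m.index);
}

const input = "ab\ncd\r\nef";
const same = (a, b) => JSON.stringify(a) === JSON.stringify(b);

// `m`-mode anchors emulated in a regex that lacks the `m` flag.
console.log(same(positions(emulated.mStart, "", input), positions("^", "m", input))); // true
console.log(same(positions(emulated.mEnd, "", input), positions("$", "m", input)));   // true

// Buffer-boundary anchors emulated in a regex that has the `m` flag.
console.log(same(positions(emulated.start, "m", input), positions("^", "", input)));  // true
console.log(same(positions(emulated.end, "m", input), positions("$", "", input)));    // true
```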

s

> It's fairly complicated to invert the s flag in a RegExp without modifiers, and easy to get wrong:
>
> [...]
> // without 's' modifier
> /a.b(?:.|[\r\n\u2028\u2029])+c.d/
>
> [...]
> // without 's' modifier
> /a.b(?:(?![\r\n\u2028\u2029]).)+c.d/s

It's easier than that:

  • Emulate an s mode .: [^]
  • Emulate a non-s mode .: [[^]--[\n\r\u2028\u2029]] (with v) or (?:(?![\n\r\u2028\u2029]).) for a less efficient version without v (same as you showed in the quote).

Note that, like your (?:.|[\r\n\u2028\u2029]) example, [^] either matches full code points or doesn't based on the presence of flag u/v.

@RunDevelopment

> • Emulate a non-s mode .: [[^]--[\n\r\u2028\u2029]] (with v) or (?:(?![\n\r\u2028\u2029]).) for a less efficient version without v (same as you showed in the quote).

Or [^\n\r\u2028\u2029].
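
As a sketch of how the non-v emulations of . compare with the native behavior, the check below tests a handful of single characters. The helper matchesWhole and the sample list are illustrative assumptions, and the v-flag variant is left out.

```javascript
const anyChar = "[^]";                               // `.` in `s` mode
const nonLineTerminator = "[^\\n\\r\\u2028\\u2029]"; // `.` outside `s` mode

// Hypothetical helper: does the pattern match the whole one-character string?
function matchesWhole(source, flags, ch) {
  return new RegExp(`^(?:${source})$`, flags).test(ch);
}

const samples = ["a", "\t", "\n", "\r", "\u2028", "\u2029", "é"];

// `[^]` without the `s` flag should behave like `.` with it.
const sModeAgrees = samples.every(
  (ch) => matchesWhole(anyChar, "", ch) === matchesWhole(".", "s", ch)
);

// The negated class with the `s` flag should behave like `.` without it.
const nonSModeAgrees = samples.every(
  (ch) => matchesWhole(nonLineTerminator, "s", ch) === matchesWhole(".", "", ch)
);

console.log(sModeAgrees, nonSModeAgrees); // true true
```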
