Feature request: allow duplicate names in alternation branches #492

RReverser · 2018-06-25T12:34:31Z

I've looked at some of the previous discussions around returning multiple matches of the same named capture in repetitions, and I agree that allowing that raises many questions around the right way.

However, a case I'm looking at is allowing duplicate named captures when they're clearly exclusive. Do you think it would make sense to allow duplicates in case like below?

Regex::new(r"r#(?P<name>.*?)#|(?P<name>\w+)")

Currently retrieving such alternate matches requires multiple name bindings and then checking which one succeeded, or separating regexp into two matches, but it seems that there is no ambiguity in case with alternation so duplicate names could be allowed as-is?

BurntSushi · 2018-06-25T12:54:13Z

I prefer the simplicity in the mental model of the current behavior rather than adding special cases where duplicate names are permitted.

Moreover, that there is no ambiguity in the match semantics does not imply that the implementation will necessarily match your expectation (although it might, I can't immediately think of any counter examples). Additionally, even if this were to be permitted, you still, at some point, need to map a name to a specific capture group index, which means the regex engine would need to perform roughly similar logic as what you're doing now anyway. Finally, it's not immediately clear to me how the exclusivity analysis would proceed. It certainly seems non-trivial to me.

If you need this, my recommendation is to just use distinct name bindings.

If you need this frequently, my recommendation would be to build an abstraction that automates some or all of this for you. Said abstraction might include the generation of capture group names and the ability to figure out which one matched. If you're ambitious, you might even be able to automate this at the syntactic level with regex-syntax. Of course, the problem there is that regex-syntax is what does the duplicate detection. However, if you're interested, I'm much more flexible there if that would be useful to you (i.e., an option to disable duplicate detection in regex-syntax only).

RReverser · 2018-06-25T13:50:14Z

which means the regex engine would need to perform roughly similar logic as what you're doing now anyway

Yes, but it would allow to both avoid unnecessary allocations and simplify API for the end consumer.

Finally, it's not immediately clear to me how the exclusivity analysis would proceed. It certainly seems non-trivial to me.

In my mind it seemed fairly straightforward - forking a set of already bound names when visiting an alternation node, and recursing into each branch with own copy of this set, so that duplicate checker would never see names in different branches as conflicts.

BurntSushi · 2018-06-25T13:58:31Z

Seems insufficient to me. e.g., in ((?P<foo>re)|(?P<foo>re))+ your checker would consider the names to be exclusive but they actually aren't. e.g.,

extern crate regex;

use regex::Regex;

fn main() {
    let re = Regex::new(r"((?P<foo>\w)|(?P<bar>\W))+").unwrap();
    let caps = re.captures("a*").unwrap();
    println!("{:?}", caps);
}

outputs

Captures({0: Some("a*"), 1: Some("*"), "foo": Some("a"), "bar": Some("*")})

Playground link: http://play.rust-lang.org/?gist=9e511cec90e7c19664bac1051a9f6087&version=stable&mode=debug

Also, I don't think I know which allocations you're talking about. Even if they exist, they seem to be able to be amortized.

RReverser · 2018-06-25T14:06:46Z

The repetition cases are still tricky, yeah. I guess this wouldn't be an issue for most common cases, so it could still be allowed under an option, or maybe checker should take whether it's in a repetition into account... idk.

Also, I don't think I know which allocations you're talking about.

I meant the HashMap, especially when combining many regexps that all eventually lead to the same resulting data that needs to be extracted in different ways from each branch. But then, again, perf is just a nice-to-have improvement here, I'm mostly interesting in API ergonomics benefit.

Perhaps what I'm thinking of could be better achieved by allowing .captures on a RegexSet instance, but that seems even harder to implement correctly and efficiently.

BurntSushi · 2018-06-25T14:24:42Z

You should be able to build that hashmap once when the regex is built. The cost of doing that will likely be dwarfed by regex compilation.

Here's my stance:

Philosophically speaking, I'm not strongly opposed to a feature like this, but I do appreciate the simplicity of the current behavior.
There are interesting corner cases that need tonbe considered, and there is likely some non trivial implementation work for this to happen.

My advice would be to prototype something in an external crate that provides the semantics you want so that folks can actually use it. After that point, if there is an obvious way to integrate that functionality back into thia crate, then we can revisit it.

I would be happy to brainstorm what an external crate API might look like.

flying-sheep · 2024-12-04T15:41:34Z

In Python, group names and indices are assigned during the match.

e.g. (a)|(b) returns a re.Match with one capture group (excluding the “full match” group 0), since one of them is always skipped and one of them is always not skipped. The same is possible for named groups: If two capture groups are always mutually exclusive, nothing stops you from using the same name.

I think that’s much more elegant.

BurntSushi · 2024-12-04T16:05:38Z

What?

>>> re.compile(r"(?P<cap>\w)|(?P<cap>\d)")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    re.compile(r"(?P<cap>\w)|(?P<cap>\d)")
  File "/usr/lib/python3.12/re/__init__.py", line 228, in compile
    return _compile(pattern, flags)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/__init__.py", line 307, in _compile
    p = _compiler.compile(pattern, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/_compiler.py", line 745, in compile
    p = _parser.parse(p, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/_parser.py", line 979, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/_parser.py", line 460, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/_parser.py", line 857, in _parse
    raise source.error(err.msg, len(name) + 1) from None
re.error: redefinition of group name 'cap' as group 2; was group 1 at position 16

It does seem that group indices are about as you say though, although I find the behavior somewhat unintuitive:

>>> m = re.search(r"(\w)(\w)|(\d)(\d)", "AB")
>>> m.groups()
('A', 'B', None, None)
>>> m.group(1)
'A'
>>> m.group(2)
'B'
>>> m = re.search(r"(\w)(\w)|(\d)(\d)", "12")
>>> m.groups()
('1', '2', None, None)
>>> m.group(1)
'1'
>>> m.group(2)
'2'
>>> m.group(3)
>>> m.group(4)

I also couldn't find anything about any of this in the docs.

flying-sheep · 2024-12-05T16:13:55Z

Ah sorry, I meant the regex package:

>>> import regex
>>> regex.compile(r"(?P<cap>\w)|(?P<cap>\d)")
regex.Regex('(?P<cap>\\w)|(?P<cap>\\d)', flags=regex.V0)
>>> regex.compile(r"(?P<cap>\w)|(?P<cap>\d)").fullmatch("A")["cap"]
'A'
>>> regex.compile(r"(?P<cap>\w)|(?P<cap>\d)").fullmatch("1")["cap"]
'1'

BurntSushi · 2024-12-05T17:02:19Z

Interesting. While not nearly as convenient, you can achieve something similar with the lower level regex-automata API via a multi-regex:

use regex_automata::{meta::Regex, PatternID, Span};

fn main() -> anyhow::Result<()> {
    let re = Regex::new_many(&[r"(?P<cap>\pL)", r"(?P<cap>\d)"]).unwrap();
    let mut caps = re.create_captures();

    re.captures("A", &mut caps);
    assert_eq!(
        caps.get_match().map(|m| m.pattern()),
        Some(PatternID::must(0))
    );
    assert_eq!(caps.get_group_by_name("cap"), Some(Span::from(0..1)));

    re.captures("1", &mut caps);
    assert_eq!(
        caps.get_match().map(|m| m.pattern()),
        Some(PatternID::must(1))
    );
    assert_eq!(caps.get_group_by_name("cap"), Some(Span::from(0..1)));

    Ok(())
}

This actually gives you more information. It tells you which branch matched. But is somewhat more restricted in its usage, but probably covers most use cases.

flying-sheep · 2024-12-05T18:38:47Z

Thanks! Yeah, I think having a convenient way to do this would be nice for quite some use cases. I don’t think anything speaks against making this possible.

BurntSushi closed this as completed Jun 25, 2018

BurntSushi mentioned this issue May 27, 2021

Add "replace" text field in --json output that contains replace string for respective matches BurntSushi/ripgrep#1872

Open

BurntSushi mentioned this issue Nov 16, 2023

Use the same name of a capture group in different alternatives of disjunction (| operator) #1128

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: allow duplicate names in alternation branches #492

Feature request: allow duplicate names in alternation branches #492

RReverser commented Jun 25, 2018 •

edited

Loading

BurntSushi commented Jun 25, 2018 •

edited

Loading

RReverser commented Jun 25, 2018

BurntSushi commented Jun 25, 2018 •

edited

Loading

RReverser commented Jun 25, 2018

BurntSushi commented Jun 25, 2018

flying-sheep commented Dec 4, 2024

BurntSushi commented Dec 4, 2024

flying-sheep commented Dec 5, 2024

BurntSushi commented Dec 5, 2024 •

edited

Loading

flying-sheep commented Dec 5, 2024

Feature request: allow duplicate names in alternation branches #492

Feature request: allow duplicate names in alternation branches #492

Comments

RReverser commented Jun 25, 2018 • edited Loading

BurntSushi commented Jun 25, 2018 • edited Loading

RReverser commented Jun 25, 2018

BurntSushi commented Jun 25, 2018 • edited Loading

RReverser commented Jun 25, 2018

BurntSushi commented Jun 25, 2018

flying-sheep commented Dec 4, 2024

BurntSushi commented Dec 4, 2024

flying-sheep commented Dec 5, 2024

BurntSushi commented Dec 5, 2024 • edited Loading

flying-sheep commented Dec 5, 2024

RReverser commented Jun 25, 2018 •

edited

Loading

BurntSushi commented Jun 25, 2018 •

edited

Loading

BurntSushi commented Jun 25, 2018 •

edited

Loading

BurntSushi commented Dec 5, 2024 •

edited

Loading