Skip to content

Feature request: allow duplicate names in alternation branches #492

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
RReverser opened this issue Jun 25, 2018 · 10 comments
Closed

Feature request: allow duplicate names in alternation branches #492

RReverser opened this issue Jun 25, 2018 · 10 comments

Comments

@RReverser
Copy link
Contributor

RReverser commented Jun 25, 2018

I've looked at some of the previous discussions around returning multiple matches of the same named capture in repetitions, and I agree that allowing that raises many questions around the right way.

However, a case I'm looking at is allowing duplicate named captures when they're clearly exclusive. Do you think it would make sense to allow duplicates in case like below?

Regex::new(r"r#(?P<name>.*?)#|(?P<name>\w+)")

Currently retrieving such alternate matches requires multiple name bindings and then checking which one succeeded, or separating regexp into two matches, but it seems that there is no ambiguity in case with alternation so duplicate names could be allowed as-is?

@BurntSushi
Copy link
Member

BurntSushi commented Jun 25, 2018

I prefer the simplicity in the mental model of the current behavior rather than adding special cases where duplicate names are permitted.

Moreover, that there is no ambiguity in the match semantics does not imply that the implementation will necessarily match your expectation (although it might, I can't immediately think of any counter examples). Additionally, even if this were to be permitted, you still, at some point, need to map a name to a specific capture group index, which means the regex engine would need to perform roughly similar logic as what you're doing now anyway. Finally, it's not immediately clear to me how the exclusivity analysis would proceed. It certainly seems non-trivial to me.

If you need this, my recommendation is to just use distinct name bindings.

If you need this frequently, my recommendation would be to build an abstraction that automates some or all of this for you. Said abstraction might include the generation of capture group names and the ability to figure out which one matched. If you're ambitious, you might even be able to automate this at the syntactic level with regex-syntax. Of course, the problem there is that regex-syntax is what does the duplicate detection. However, if you're interested, I'm much more flexible there if that would be useful to you (i.e., an option to disable duplicate detection in regex-syntax only).

@RReverser
Copy link
Contributor Author

which means the regex engine would need to perform roughly similar logic as what you're doing now anyway

Yes, but it would allow to both avoid unnecessary allocations and simplify API for the end consumer.

Finally, it's not immediately clear to me how the exclusivity analysis would proceed. It certainly seems non-trivial to me.

In my mind it seemed fairly straightforward - forking a set of already bound names when visiting an alternation node, and recursing into each branch with own copy of this set, so that duplicate checker would never see names in different branches as conflicts.

@BurntSushi
Copy link
Member

BurntSushi commented Jun 25, 2018

Seems insufficient to me. e.g., in ((?P<foo>re)|(?P<foo>re))+ your checker would consider the names to be exclusive but they actually aren't. e.g.,

extern crate regex;

use regex::Regex;

fn main() {
    let re = Regex::new(r"((?P<foo>\w)|(?P<bar>\W))+").unwrap();
    let caps = re.captures("a*").unwrap();
    println!("{:?}", caps);
}

outputs

Captures({0: Some("a*"), 1: Some("*"), "foo": Some("a"), "bar": Some("*")})

Playground link: http://play.rust-lang.org/?gist=9e511cec90e7c19664bac1051a9f6087&version=stable&mode=debug

Also, I don't think I know which allocations you're talking about. Even if they exist, they seem to be able to be amortized.

@RReverser
Copy link
Contributor Author

The repetition cases are still tricky, yeah. I guess this wouldn't be an issue for most common cases, so it could still be allowed under an option, or maybe checker should take whether it's in a repetition into account... idk.

Also, I don't think I know which allocations you're talking about.

I meant the HashMap, especially when combining many regexps that all eventually lead to the same resulting data that needs to be extracted in different ways from each branch. But then, again, perf is just a nice-to-have improvement here, I'm mostly interesting in API ergonomics benefit.

Perhaps what I'm thinking of could be better achieved by allowing .captures on a RegexSet instance, but that seems even harder to implement correctly and efficiently.

@BurntSushi
Copy link
Member

You should be able to build that hashmap once when the regex is built. The cost of doing that will likely be dwarfed by regex compilation.

Here's my stance:

  • Philosophically speaking, I'm not strongly opposed to a feature like this, but I do appreciate the simplicity of the current behavior.
  • There are interesting corner cases that need tonbe considered, and there is likely some non trivial implementation work for this to happen.

My advice would be to prototype something in an external crate that provides the semantics you want so that folks can actually use it. After that point, if there is an obvious way to integrate that functionality back into thia crate, then we can revisit it.

I would be happy to brainstorm what an external crate API might look like.

@flying-sheep
Copy link
Contributor

In Python, group names and indices are assigned during the match.

e.g. (a)|(b) returns a re.Match with one capture group (excluding the “full match” group 0), since one of them is always skipped and one of them is always not skipped. The same is possible for named groups: If two capture groups are always mutually exclusive, nothing stops you from using the same name.

I think that’s much more elegant.

@BurntSushi
Copy link
Member

What?

>>> re.compile(r"(?P<cap>\w)|(?P<cap>\d)")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    re.compile(r"(?P<cap>\w)|(?P<cap>\d)")
  File "/usr/lib/python3.12/re/__init__.py", line 228, in compile
    return _compile(pattern, flags)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/__init__.py", line 307, in _compile
    p = _compiler.compile(pattern, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/_compiler.py", line 745, in compile
    p = _parser.parse(p, flags)
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/_parser.py", line 979, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/_parser.py", line 460, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/re/_parser.py", line 857, in _parse
    raise source.error(err.msg, len(name) + 1) from None
re.error: redefinition of group name 'cap' as group 2; was group 1 at position 16

It does seem that group indices are about as you say though, although I find the behavior somewhat unintuitive:

>>> m = re.search(r"(\w)(\w)|(\d)(\d)", "AB")
>>> m.groups()
('A', 'B', None, None)
>>> m.group(1)
'A'
>>> m.group(2)
'B'
>>> m = re.search(r"(\w)(\w)|(\d)(\d)", "12")
>>> m.groups()
('1', '2', None, None)
>>> m.group(1)
'1'
>>> m.group(2)
'2'
>>> m.group(3)
>>> m.group(4)

I also couldn't find anything about any of this in the docs.

@flying-sheep
Copy link
Contributor

Ah sorry, I meant the regex package:

>>> import regex
>>> regex.compile(r"(?P<cap>\w)|(?P<cap>\d)")
regex.Regex('(?P<cap>\\w)|(?P<cap>\\d)', flags=regex.V0)
>>> regex.compile(r"(?P<cap>\w)|(?P<cap>\d)").fullmatch("A")["cap"]
'A'
>>> regex.compile(r"(?P<cap>\w)|(?P<cap>\d)").fullmatch("1")["cap"]
'1'

@BurntSushi
Copy link
Member

BurntSushi commented Dec 5, 2024

Interesting. While not nearly as convenient, you can achieve something similar with the lower level regex-automata API via a multi-regex:

use regex_automata::{meta::Regex, PatternID, Span};

fn main() -> anyhow::Result<()> {
    let re = Regex::new_many(&[r"(?P<cap>\pL)", r"(?P<cap>\d)"]).unwrap();
    let mut caps = re.create_captures();

    re.captures("A", &mut caps);
    assert_eq!(
        caps.get_match().map(|m| m.pattern()),
        Some(PatternID::must(0))
    );
    assert_eq!(caps.get_group_by_name("cap"), Some(Span::from(0..1)));

    re.captures("1", &mut caps);
    assert_eq!(
        caps.get_match().map(|m| m.pattern()),
        Some(PatternID::must(1))
    );
    assert_eq!(caps.get_group_by_name("cap"), Some(Span::from(0..1)));

    Ok(())
}

This actually gives you more information. It tells you which branch matched. But is somewhat more restricted in its usage, but probably covers most use cases.

@flying-sheep
Copy link
Contributor

Thanks! Yeah, I think having a convenient way to do this would be nice for quite some use cases. I don’t think anything speaks against making this possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants