Feature request: allow duplicate names in alternation branches #492
I prefer the simplicity in the mental model of the current behavior rather than adding special cases where duplicate names are permitted. Moreover, that there is no ambiguity in the match semantics does not imply that the implementation will necessarily match your expectation (although it might, I can't immediately think of any counter examples). Additionally, even if this were to be permitted, you still, at some point, need to map a name to a specific capture group index, which means the regex engine would need to perform roughly similar logic as what you're doing now anyway. Finally, it's not immediately clear to me how the exclusivity analysis would proceed. It certainly seems non-trivial to me.

If you need this, my recommendation is to just use distinct name bindings. If you need this frequently, my recommendation would be to build an abstraction that automates some or all of this for you. Said abstraction might include the generation of capture group names and the ability to figure out which one matched. If you're ambitious, you might even be able to automate this at the syntactic level with
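To make the "generate capture group names and figure out which one matched" idea concrete, here is a rough sketch using only the standard library. The names `uniquify_names` and `first_matched` and the `name__N` suffix scheme are made up for illustration and are not part of the `regex` crate. The scan is deliberately naive: it does not understand escapes or character classes, so a literal `\(?P<` would be rewritten too.

```rust
use std::collections::HashMap;

/// Rewrites each `(?P<name>` to `(?P<name__N>` so every group name is unique,
/// and returns the rewritten pattern plus a map from the logical name to the
/// generated physical names, in order of appearance.
/// Hypothetical helper; naive scan that ignores escapes and character classes.
fn uniquify_names(pattern: &str) -> (String, HashMap<String, Vec<String>>) {
    const OPEN: &str = "(?P<";
    let mut out = String::with_capacity(pattern.len());
    let mut names: HashMap<String, Vec<String>> = HashMap::new();
    let mut rest = pattern;
    while let Some(start) = rest.find(OPEN) {
        let after_open = &rest[start + OPEN.len()..];
        // An unterminated group name is left untouched.
        let Some(end) = after_open.find('>') else { break };
        let logical = &after_open[..end];
        let slots = names.entry(logical.to_string()).or_default();
        let physical = format!("{}__{}", logical, slots.len());
        out.push_str(&rest[..start]);
        out.push_str(OPEN);
        out.push_str(&physical);
        out.push('>');
        slots.push(physical);
        rest = &after_open[end + 1..];
    }
    out.push_str(rest);
    (out, names)
}

/// Returns the index and text of the first physical group that participated
/// in the match. `lookup` stands in for whatever the regex engine offers for
/// fetching a named group.
fn first_matched<'h>(
    physical: &[String],
    lookup: impl Fn(&str) -> Option<&'h str>,
) -> Option<(usize, &'h str)> {
    physical
        .iter()
        .enumerate()
        .find_map(|(i, name)| lookup(name.as_str()).map(|text| (i, text)))
}

fn main() {
    let (rewritten, names) = uniquify_names(r"(?P<cap>\pL)|(?P<cap>\d)");
    println!("{rewritten}");
    // Stand-in for a real capture lookup: pretend only the second branch matched.
    let lookup = |name: &str| if name == "cap__1" { Some("1") } else { None };
    println!("{:?}", first_matched(&names["cap"], lookup));
}
```

With the `regex` crate, the rewritten pattern would be compiled once and the closure could be something like `|n| caps.name(n).map(|m| m.as_str())`, so the name map is built a single time alongside the regex.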
Yes, but it would both avoid unnecessary allocations and simplify the API for the end consumer.
In my mind it seemed fairly straightforward: fork the set of already-bound names when visiting an alternation node and recurse into each branch with its own copy of that set, so that the duplicate checker never sees names in different branches as conflicts.
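A minimal sketch of that forking idea, assuming a tiny made-up AST (the `Ast` enum and `check` function below are hypothetical and are not the `regex-syntax` types or the crate's actual checker):

```rust
use std::collections::HashSet;

/// Hypothetical, heavily simplified regex AST, for illustration only.
enum Ast {
    Leaf,
    Group { name: Option<String>, inner: Box<Ast> },
    Concat(Vec<Ast>),
    Alternation(Vec<Ast>),
    Repetition(Box<Ast>),
}

/// Returns `Err(name)` for the first duplicate name that is not confined to
/// sibling alternation branches.
fn check(ast: &Ast, bound: &mut HashSet<String>) -> Result<(), String> {
    match ast {
        Ast::Leaf => Ok(()),
        Ast::Group { name, inner } => {
            if let Some(n) = name {
                if !bound.insert(n.clone()) {
                    return Err(n.clone());
                }
            }
            check(inner, bound)
        }
        Ast::Concat(items) => {
            for item in items {
                check(item, bound)?;
            }
            Ok(())
        }
        Ast::Alternation(branches) => {
            // Each branch starts from the names bound before the alternation,
            // so siblings never see each other's names.
            let before = bound.clone();
            for branch in branches {
                let mut forked = before.clone();
                check(branch, &mut forked)?;
                // Merge afterwards so later siblings in a concatenation still conflict.
                bound.extend(forked);
            }
            Ok(())
        }
        Ast::Repetition(inner) => check(inner, bound),
    }
}

/// Builds `(?P<name>...)` around a leaf.
fn named(name: &str) -> Ast {
    Ast::Group { name: Some(name.to_string()), inner: Box::new(Ast::Leaf) }
}

fn main() {
    // (?P<cap>a)|(?P<cap>b): accepted, the names live in different branches.
    let alt = Ast::Alternation(vec![named("cap"), named("cap")]);
    println!("{:?}", check(&alt, &mut HashSet::new()));
    // (?P<cap>a)(?P<cap>b): rejected, both groups can participate in one match.
    let cat = Ast::Concat(vec![named("cap"), named("cap")]);
    println!("{:?}", check(&cat, &mut HashSet::new()));
    // ((?P<cap>a)|(?P<cap>b))+: also accepted by this naive version.
    let rep = Ast::Repetition(Box::new(Ast::Alternation(vec![named("cap"), named("cap")])));
    println!("{:?}", check(&rep, &mut HashSet::new()));
}
```

Note that this version treats an alternation under a repetition like any other alternation, so it says nothing about whether the branches stay exclusive across iterations.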
Seems insufficient to me. e.g., in

extern crate regex;

use regex::Regex;

fn main() {
    // Both alternation branches sit under a repetition, so each can match
    // on a different iteration.
    let re = Regex::new(r"((?P<foo>\w)|(?P<bar>\W))+").unwrap();
    let caps = re.captures("a*").unwrap();
    println!("{:?}", caps);
}

outputs

Playground link: http://play.rust-lang.org/?gist=9e511cec90e7c19664bac1051a9f6087&version=stable&mode=debug

Also, I don't think I know which allocations you're talking about. Even if they exist, it seems they could be amortized.
The repetition cases are still tricky, yeah. I guess this wouldn't be an issue for the most common cases, so it could still be allowed under an option, or maybe the checker should take into account whether it's inside a repetition... idk.
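One possible reading of "take repetition into account", sketched with a made-up AST (`Node` and `check` are hypothetical, not the crate's real types): fork the name set per alternation branch only outside repetitions, and fall back to the current strict rule inside them.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified AST for illustration only.
enum Node {
    Leaf,
    Named(&'static str, Box<Node>),
    Concat(Vec<Node>),
    Alt(Vec<Node>),
    Repeat(Box<Node>),
}

/// Duplicate-name check that allows duplicates across alternation branches
/// unless the alternation sits inside a repetition.
fn check(
    node: &Node,
    bound: &mut HashSet<&'static str>,
    in_repeat: bool,
) -> Result<(), &'static str> {
    match node {
        Node::Leaf => Ok(()),
        Node::Named(name, inner) => {
            if !bound.insert(*name) {
                return Err(*name);
            }
            check(inner, bound, in_repeat)
        }
        Node::Concat(items) => {
            for item in items {
                check(item, bound, in_repeat)?;
            }
            Ok(())
        }
        Node::Alt(branches) => {
            if in_repeat {
                // Different iterations may take different branches, so the
                // branches share one set and duplicates are rejected.
                for branch in branches {
                    check(branch, bound, true)?;
                }
                return Ok(());
            }
            let before = bound.clone();
            for branch in branches {
                let mut forked = before.clone();
                check(branch, &mut forked, false)?;
                bound.extend(forked);
            }
            Ok(())
        }
        Node::Repeat(inner) => check(inner, bound, true),
    }
}

fn named(name: &'static str) -> Node {
    Node::Named(name, Box::new(Node::Leaf))
}

fn main() {
    // (?P<foo>a)|(?P<foo>b): accepted.
    let alt = Node::Alt(vec![named("foo"), named("foo")]);
    println!("{:?}", check(&alt, &mut HashSet::new(), false));
    // ((?P<foo>a)|(?P<foo>b))+: rejected under this rule.
    let rep = Node::Repeat(Box::new(Node::Alt(vec![named("foo"), named("foo")])));
    println!("{:?}", check(&rep, &mut HashSet::new(), false));
}
```

This is conservative: it also rejects patterns where the repetition could never make both branches participate, which is part of why the exclusivity analysis is not obviously simple.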
I meant the HashMap, especially when combining many regexps that all eventually lead to the same resulting data that needs to be extracted in different ways from each branch. But then, again, perf is just a nice-to-have improvement here; I'm mostly interested in the API ergonomics benefit. Perhaps what I'm thinking of could be better achieved by allowing
You should be able to build that hashmap once when the regex is built. The cost of doing that will likely be dwarfed by regex compilation. Here's my stance:
My advice would be to prototype something in an external crate that provides the semantics you want so that folks can actually use it. After that point, if there is an obvious way to integrate that functionality back into this crate, then we can revisit it. I would be happy to brainstorm what an external crate API might look like.
In Python, group names and indices are assigned during the match. e.g. I think that’s much more elegant.
What?
It does seem that group indices are about as you say though, although I find the behavior somewhat unintuitive:
I also couldn't find anything about any of this in the docs.
Ah sorry, I meant the

>>> import regex
>>> regex.compile(r"(?P<cap>\w)|(?P<cap>\d)")
regex.Regex('(?P<cap>\\w)|(?P<cap>\\d)', flags=regex.V0)
>>> regex.compile(r"(?P<cap>\w)|(?P<cap>\d)").fullmatch("A")["cap"]
'A'
>>> regex.compile(r"(?P<cap>\w)|(?P<cap>\d)").fullmatch("1")["cap"]
'1'
Interesting. While not nearly as convenient, you can achieve something similar with the lower level

use regex_automata::{meta::Regex, PatternID, Span};

fn main() -> anyhow::Result<()> {
    // Each alternative is its own pattern, so both may use the name `cap`.
    let re = Regex::new_many(&[r"(?P<cap>\pL)", r"(?P<cap>\d)"]).unwrap();
    let mut caps = re.create_captures();

    // A letter matches the first pattern.
    re.captures("A", &mut caps);
    assert_eq!(
        caps.get_match().map(|m| m.pattern()),
        Some(PatternID::must(0))
    );
    assert_eq!(caps.get_group_by_name("cap"), Some(Span::from(0..1)));

    // A digit matches the second pattern.
    re.captures("1", &mut caps);
    assert_eq!(
        caps.get_match().map(|m| m.pattern()),
        Some(PatternID::must(1))
    );
    assert_eq!(caps.get_group_by_name("cap"), Some(Span::from(0..1)));
    Ok(())
}

This actually gives you more information: it tells you which branch matched. It is somewhat more restricted in its usage, but probably covers most use cases.
Thanks! Yeah, I think having a convenient way to do this would be nice for quite a few use cases. I don’t think anything speaks against making this possible.
I've looked at some of the previous discussions around returning multiple matches of the same named capture in repetitions, and I agree that allowing that raises many questions about the right way to do it.
However, the case I'm looking at is allowing duplicate named captures when they're clearly exclusive. Do you think it would make sense to allow duplicates in a case like the one below?
Currently, retrieving such alternate matches requires multiple name bindings and then checking which one succeeded, or splitting the regexp into two separate matches. But there seems to be no ambiguity in the alternation case, so could duplicate names be allowed there as-is?