-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add *(::Union{Regex, AbstractString, AbstractChar}...) #23422
Conversation
I assume you mean "regex noobie", because |
Yes you are right, I edited to "regex-noobie friendly" ! For the option you find more readble, it would require more work, and as |
Related: regular languages with In other words, extending |
Someone else had a similar observation:
In this PR I clearly don't intend to go beyond multiplication: is it still too early to decide to do that, or shall I add tests and documentation? |
Go for it! |
b3a8a81
to
87e9ca2
Compare
Freshly rebased and documented, ready for reviews :) |
87e9ca2
to
4306a4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My immediate instinct is to worry about composing expressions via string operations. And indeed, I think there are some quite broken cases here. E.g. r"this|that" * r"cat|hat"
would result in r"this|thatcat|hat"
which is not the correct juxtaposition at all. Similarly, for repetition r"this|that"^2
results in r"this|thatthis|that"
which is also wrong. I think that if you unconditionally wrap regex patterns in non-capturing subpatterns that this might be correct. So you'd produce r"(?:this|that)(?:cat|hat)"
for the first example. For repetition, I think it's wasteful to actually textually repeat the regex pattern since regular expressions already support repetition as a language feature. So for the second case, I would instead generate r"(?:this|that)^{2}"
.
Other features that we may want to consider while thinking about such an API expansion are:
- alternation of regexes and even strings, producing regexes
- repetition by a range of indices to get between min and max number of repetitions
- an operator/function for Kleene star?
For example, we could use +
for alternation since that makes *
and +
a semiring on regexes with strings embedded in them as "simple" patterns. We could also use something like p^(a:b)
to generate r"(?:$p){$a,$b}"
if you follow my pseudo-syntax here.
base/regex.jl
Outdated
function *(r1::Union{Regex,AbstractString,AbstractChar}, rs::Union{Regex,AbstractString,AbstractChar}...) | ||
opts = unique((r.compile_options, r.match_options) for r in (r1, rs...) if r isa Regex) | ||
length(opts) == 1 || | ||
throw(ArgumentError("cannot multiply regexes with incompatible options")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be supported with internal option setting, probably best combined with a non-capturing subpattern. So something like this:
"(?sm:$(r1.pattern))(?i:$(r2.pattern))"
This could work uniformly but it would be nice to avoid the extra line noise in the presumably common case where the options match. You could get even fancier and keep options that are shared by all of the regexes in the result regex's global options while only putting the options that are specific to each subpattern in it's option set.
So overall todos here are:
|
Thanks so much for taking the time to set out the logic! and for linking to documentation. I hadn't even started to think about this sophistication, and didn't know the PCRE features you indicated to implement the logic. So I implemented this, but with the limitation that only the four options corresponding to "imsx" can be different between the multiplied regexes (IIUC, these are the only ones which can be indicated "internally"). Concerning the extension of the API, I unfortunately have nothing to contribute to your ideas, but it make sense to me and I would be happy to implement them. For a Kleene star operator, it is unfortunate that |
base/regex.jl
Outdated
# Examples | ||
```jldoctest | ||
julia> r"Hello " * "world" | ||
r"Hello world" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oups, forgot to update the docstrings, will do next time I push. Does it need more explanations? ("concatenate regexes" is quite minimal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex composition stuff is looking good now but I think the string embedding into regular expressions is still wrong: strings and characters need to be embedded as patterns which match only that literal string or character.
test/regex.jl
Outdated
end | ||
|
||
@test r"a"i * r"b"i == r"(?:a)(?:b)"i | ||
@test r"a"i * "b" == r"(?:a)(?:b)"i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this really work like this? I would expect the string "b"
to only match itself?
base/regex.jl
Outdated
end | ||
|
||
unwrap_string(r::Regex, unshared::UInt32) = string("(?", regex_opts_str(r.compile_options & unshared), ':', r.pattern, ')') | ||
unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = string("(?:", s, ')') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think this is not how we want this to work: I think that a string or character should be converted to a regex that only matches that exact string or character, not to a regex with that string or character as its content. Also, I think there's a case to be made for making this a method of convert. For a clearer example where it matters, should r"a|b" * "c|d"
produce r"(?:a|b)(?:c|d)"
which would match ac
, ad
, bc
and bd
or should it produce something like r"(?:a|b)(?:c\|d)"
which would match ac|d
and bc|d
? I think it has to be the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, actually while you were reviewing I pushed a new version unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = s
, but it has the same problem. I didn't think about this problem, but I don't see how to address generally. It's easy enough to quote |
, but I don't know if this would be enough to quote a list a special characters? ('{', '+', etc). Plus, it would be time consuming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, there seems to be a solution after all: enclose the string between "\Q" and "\E". Remains to decide which behavior is the best, I don't have a strong opinion. But it's true that if one wants the regex interpretation of characters, it's easy enough to wrap the string in a call to Regex
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meant to submit that review as "request changes" but picked the wrong radio button.
In fully generality, this requires being able to correctly regex-quote strings, which is definitely doable, but not entirely straightforward, I suspect. It may be easier to just skip that part of this change and keep it focused on the pure regex bit. |
So allow only multiplication of regexes, not regexes + strings + chars? |
You can certainly include that too, but does it makes sense to you that it should treat strings as a regex that only matches that string? Of you’re cool with implementing that, I have no objection. |
I don't know if you saw my comment above, but there seems to be a trivial solution, using quoting between "\Q" and "\E". So I updated the PR with this.
It sure makes sense, but as I said I don't have a strong opinion and anyway it (my opinion) shouldn't matter as I didn't do much regexing in my life. Your call! |
01b8253
to
04340e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting there. I think if you just handle \E
in strings we should be good.
base/regex.jl
Outdated
@@ -559,7 +571,7 @@ end | |||
*(r::Regex) = r # avoids wrapping r in a useless subpattern | |||
|
|||
unwrap_string(r::Regex, unshared::UInt32) = string("(?", regex_opts_str(r.compile_options & unshared), ':', r.pattern, ')') | |||
unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = s # no need to wrap in subpattern | |||
unwrap_string(s::Union{AbstractString,AbstractChar}, ::UInt32) = string("\\Q", s, "\\E") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, you're going to want to kill me but what if s
contains \E
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hahaha! actually, I should have known better with my background in cybersecurity (cf. injection attacks) ;-)
EDIT: I'm very curious whether you'll find a flaw in my solution!
This looks good to me now. We’re halfway to having a fully algebra on regular expressions! |
dba1b10
to
8c5af35
Compare
8c5af35
to
8382116
Compare
So the intersection of the failing tests between this CI run and the previous one is empty (previous failures: travis, tester_macos64, tester_linux32), so I will assume they are not related. I will merge soon if no objection. |
Go for it! Squash or not as appropriate. Edit: I decided to just go for it 😁 |
@@ -24,7 +24,7 @@ New library functions | |||
Standard library changes | |||
------------------------ | |||
|
|||
* Cmd interpolation (``` `$(x::Cmd) a b c` ``` where) now propagates `x`'s process flags (environment, flags, working directory, etc) if `x` is the first interpolant and errors otherwise ([#24353]). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad rebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed in #31874.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooops, sorry about that!
I needed it recently and got frustrated that it doesn't work:
So yes there is a way to achieve what is needed with the
Regex
constructor, but allowing*
is more regex-noobie friendly.I don't know if it's a good idea, but don't foresee a problem with it; note that I didn't go so far as making
Regex <: AbstractString
, as I didn't expect to get support for that ;-)