improve regex search functions signatures #24116
base/regex.jl
Outdated
@@ -282,10 +278,10 @@ function matchall(re::Regex, str::String, overlap::Bool=false)
matches
end

matchall(re::Regex, str::SubString, overlap::Bool=false) =
matchall(re, String(str), overlap)
matchall(re::Regex, str::AbstractString, overlap::Bool=false) = matchall(re, String(str),
Perhaps a line break after = would look better.
fixed
additionally fixed a bug in
I have also improved handling of bad
base/regex.jl
Outdated
@@ -214,16 +215,12 @@ function match(re::Regex, str::Union{SubString{String}, String}, idx::Integer, a
n = div(length(ovec),2) - 1
mat = SubString(str, ovec[1]+1, prevind(str, ovec[2]+1))
cap = Union{Void,SubString{String}}[ovec[2i+1] == PCRE.UNSET ? nothing :
SubString(str, ovec[2i+1]+1,
prevind(str, ovec[2i+2]+1)) for i=1:n]
SubString(str, ovec[2i+1]+1, prevind(str, ovec[2i+2]+1)) for i=1:n]
Incorrect alignment.
This was the alignment used before my PR. What alignment would be correct? In my local file I have reduced the indentation to one tab, so the S in SubString is exactly below the = in the line above.
I'd say the alignment from your PR was OK, at least it made it clear that these were inside the comprehension. (Anyway, if you didn't align everything inside the comprehension, I think it should use only four spaces, not eight; we never use tabs, so one character = one space, that's a simple rule.)
ok - reverting to the original alignment from the PR (by tab I meant 4 spaces)
base/regex.jl
Outdated
@@ -269,7 +266,7 @@ function matchall(re::Regex, str::String, overlap::Bool=false)
end
end

push!(matches, SubString(str, ovec[1]+1, ovec[2]))
push!(matches, SubString(str, ovec[1]+1, prevind(str, ovec[2]+1)))
Can you add a comment to explain what this does? It's quite hard to guess.
You have to correct for the fact that ovec is 0-based and that ovec[2] points at the last byte of the match, which is not necessarily the first byte of a character. I am adding an appropriate comment to the source code.
You've added the comment above, but not here.
This issue occurs in three places in this file, and each follows exactly the same rule. I intended to add the comment only to the first occurrence from the top of the file.
Do you think it is worth adding it everywhere?
Mmm, tough call, but given how subtle this is, I'd rather be too verbose than not enough.
ok - adding additional comment as it will not hurt
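The index correction discussed in this thread can be illustrated without calling PCRE. This is a minimal sketch, assuming Julia 1.x; `start0` and `stop0` are hypothetical stand-ins for `ovec[1]` and `ovec[2]` and are written by hand rather than returned by a real match.

```julia
# Sketch only: start0 and stop0 are hand-written stand-ins for ovec[1] and
# ovec[2]. They are 0-based byte offsets, so read as a 1-based index stop0
# names the last byte of the match.
s = "café"               # 'é' occupies bytes 4 and 5, so ncodeunits(s) == 5
start0, stop0 = 2, 5     # the bytes of "fé"

first_idx = start0 + 1              # 1-based index of the first character
last_idx = prevind(s, stop0 + 1)    # index where the last character starts

matched = SubString(s, first_idx, last_idx)

# Using stop0 directly as the end index would point into the middle of 'é'.
stop0_is_valid = isvalid(s, stop0)
```

Here `matched` is "fé", while `stop0_is_valid` is false, which is why the end index cannot be passed to `SubString` unchanged for strings with multi-byte characters.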
base/regex.jl
Outdated
throw(BoundsError())
end
function search(str::Union{String,SubString{String}}, re::Regex, idx::Integer)
isvalid(str, idx) || throw(ArgumentError("invalid index $i"))
"invalid character index" would be more explicit. Also throw a UnicodeError for consistency?
OK. I will separate this into BoundsError and UnicodeError. It will introduce a slight overhead though, as isvalid does not distinguish them.
Ah, that's annoying. Make it throw either BoundsError or UnicodeError, and let's say we'll improve that later by replacing isvalid everywhere it's appropriate with a function which would throw the errors directly.
I reverted it to a corrected version of the original test to minimize the number of changes in this PR (as this change is breaking). I propose to implement more strict testing when your search Julep is decided.
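The idea of a function that throws the two errors directly, instead of relying on `isvalid` alone, could look roughly like the sketch below. It assumes Julia 1.x and uses `StringIndexError`, the exception Julia 1.x raises for invalid string indices, in place of the `UnicodeError` named in the thread. The helper name `check_char_index` is hypothetical and not part of Base.

```julia
# Hypothetical helper (not in Base): report out-of-range indices and indices
# that fall inside a multi-byte character with different exceptions.
function check_char_index(s::AbstractString, i::Integer)
    # Outside the code units of the string: a bounds problem.
    1 <= i <= ncodeunits(s) || throw(BoundsError(s, i))
    # In range, but not at the start of a character: an indexing problem.
    isvalid(s, i) || throw(StringIndexError(s, i))
    return i
end

ok = check_char_index("café", 4)   # 'é' starts at byte 4
```

With this sketch, index 5 of "café" raises `StringIndexError` and index 6 raises `BoundsError`, at the cost of the extra range check mentioned above.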
base/regex.jl
Outdated
))
search(s::AbstractString, r::Regex) = search(s,r,start(s))
search(str::Union{String,SubString{String}}, re::Regex) =
str == "" ? (0:-1) : search(str, re, start(s))
I don't really like this, as it means passing the index 1 is not equivalent to omitting the index. Why not add this check to the method which takes an index? Also, the current behavior had its logic, as noted at #24103.
Better move this to a separate PR and only keep uncontroversial changes in this one. It already contains many different things.
I will remove this change from this PR and add it to #24103.
I will discuss my rationale in #24103 (I have not read through it yet - I am working through this PR to fix #24157 first).
But in short, passing 1 is exactly not equivalent to omitting the index. The corner case is an empty string "". You can search for something in an empty string (and it should not throw an error if you do not specify an index), but for this string index 1 is invalid, so specifying it should be an error.
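The empty-string corner case can be checked directly. This is a small sketch assuming Julia 1.x, where `firstindex` plays the role of the `start(s)` call in the diff above. It only inspects indices and does not reproduce the behavior of `search` at the time of this PR.

```julia
s = ""

first_i = firstindex(s)       # where a search without an explicit index would begin
last_i = lastindex(s)         # last valid index, smaller than first_i for ""
index_ok = isvalid(s, first_i)  # whether a character actually starts at that index
```

For the empty string `first_i` is 1 and `last_i` is 0, and `index_ok` is false, so a default starting index and a valid explicit index are not the same thing here.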
test/regex.jl
Outdated
# regex match / search string must be a String
@test_throws ArgumentError match(r"test", GenericString("this is a test"))
@test_throws ArgumentError search(GenericString("this is a test"), r"test")
# regex match / search string must be a String, changed in #24116
Remove if that no longer applies.
removed
I have accidentally started a review of my own code 👶.
@nalimilan I am trying to locate the reason for build error (one could be allowing to pass signed integers to
I think it would be good to have it consistent. So the question is:
Good catch. It would be good to make everything consistent indeed, even if that's breaking, since that's the last chance to fix these before 1.0. The fact that they are not documented makes it less of an issue. Regarding BTW, note that some of these functions might be merged with generic
I have an additional question relating to type of For instance in I ask because it is not unthinkable (or even I would say in a few years it might be normal in some types of applications) to have a string larger than If
In #define PCRE2_SIZE size_t So it looks like you found a bug and it should be
3dc6013 to c24f155
@nalimilan Given your search Julep I propose to leave this PR with:
What I did not do:
base/regex.jl
Outdated
@@ -154,22 +155,25 @@ r"a.a"
julia> ismatch(rx, "aba")
true

julia> ismatch(rx, "aba", 2)
true
false?
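The reviewer's point can be checked with the Julia 1.x counterparts of these calls. This is a sketch under that assumption: `ismatch` is not available in Julia 1.x, so `occursin` and the three-argument `match` are used instead, and the pattern is the `r"a.a"` from the docstring.

```julia
rx = r"a.a"

whole = occursin(rx, "aba")                    # search the whole string
from_two = occursin(rx, SubString("aba", 2))   # only "ba" is left to search
m = match(rx, "aba", 2)                        # start the search at index 2
```

Here `whole` is true, `from_two` is false and `m` is `nothing`, since no three-character match can start at or after index 2.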
base/regex.jl
Outdated
compile(r)
return PCRE.exec(r.regex, String(s), offset, r.match_options,
return PCRE.exec(r.regex, String(s), Csize_t(idx-1), r.match_options,
Csize_t is a C implementation detail, better keep it inside PCRE.exec as much as possible. Can't you adapt that function directly?
Actually, is any change really needed? PCRE.exec uses ccall, which will take care of converting offset to Csize_t. There's no need to make the conversion manually.
You are right. I had misread the documentation of ccall and thought that the conversion was unsafe, but it is actually safe. Removing.
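The conversion that `ccall` performs on an argument declared as `Csize_t` can be observed through `Base.cconvert`, the first conversion step `ccall` applies to each argument. The sketch below assumes Julia 1.x and makes no real C call; the variable names are illustrative only.

```julia
offset = 3                                   # an ordinary Int, as passed by the caller
converted = Base.cconvert(Csize_t, offset)   # what an argument declared Csize_t becomes

# A negative value is rejected with an error rather than silently wrapped around.
rejected = try
    Base.cconvert(Csize_t, -1)
    false
catch err
    err isa InexactError
end
```

This is the sense in which the implicit conversion is safe: a valid offset converts to `Csize_t(3)`, and an invalid negative one raises `InexactError` before any C code runs.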
test/regex.jl
Outdated
@@ -34,9 +37,10 @@ show(buf, r"")
# see #10994, #11447: PCRE2 allows NUL chars in the pattern
@test ismatch(Regex("^a\0b\$"), "a\0b")

# regex match / search string must be a String
@test_throws ArgumentError match(r"test", GenericString("this is a test"))
If these are now supposed to work, better keep them and check that they return the correct result.
fixed
fixed
Can we merge this? It fixes a bug in
Any objections to merge this?
Even though the offset argument was undocumented, I don't think we should just change its meaning without a deprecation like this – it will silently break people's code. I'm also not even 100% on exposing the old So the alternative would be to make the methods with the offset/index arguments internal (e.g.
Makes sense, anyway
Do you want to update this now that
8acc9b5 to c3e002b
Rebased and readjusted (we had some major changes in the meantime in strings so I hope I have not mixed up anything 😄). Once CI passes, a review is needed. Going back to the old changes, I have one thing I am not 100% sure of (it was used earlier, but I want to make sure that this is correct - I have checked that it works OK, but I do not understand why):
passes
c98b15f to 822f659
Thanks. Though reading @StefanKarpinski's comment above, looks like we should not document the Regarding the pointer question, I guess
test/regex.jl
Outdated
@@ -21,6 +21,9 @@ for f in [matchall, collect_eachmatch]
@test f(r"GCG","GCGCG",overlap=true) == ["GCG","GCG"]
end

# Issue #24157
+@test matchall(r"fé", "café") == ["fé"] |
Typo.
fixed
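The behavior covered by these tests can be reproduced in Julia 1.x with `eachmatch`. This is a sketch under that assumption: `matchall` is not available there, and `collect_matches` is a hypothetical helper written for this example, not the `collect_eachmatch` from the test file.

```julia
# Hypothetical helper: gather the matched substrings, similar in spirit to matchall.
collect_matches(re::Regex, s::AbstractString; overlap::Bool=false) =
    [m.match for m in eachmatch(re, s; overlap=overlap)]

accented = collect_matches(r"fé", "café")                     # match ends in a 2-byte character
overlapping = collect_matches(r"GCG", "GCGCG"; overlap=true)  # matches may share characters
disjoint = collect_matches(r"GCG", "GCGCG")                   # default: no overlap
```

The first call is the case from issue #24157, where the match ends in a multi-byte character.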
As for the
CI passes so this should be good for a review.
base/deprecated.jl
Outdated
@@ -1240,6 +1240,14 @@ end

@deprecate ismatch(r::Regex, s::AbstractString) contains(s, r)

function ismatch(r::Regex, s::AbstractString, offset::Integer)
Wouldn't @deprecate ismatch(r::Regex, s::AbstractString, offset::Integer) contains(SubString(s, offset+1), r) be enough?
it would :). Fixing
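The `offset+1` in the suggested deprecation maps a 0-based byte offset to a 1-based string index. A minimal sketch of that mapping, assuming Julia 1.x (where `occursin` takes the place of `contains(s, r)`); `legacy_ismatch` is a hypothetical name and not part of Base.

```julia
# Hypothetical helper: `offset` is a 0-based byte offset, so the search
# starts at the 1-based index offset + 1.
legacy_ismatch(r::Regex, s::AbstractString, offset::Integer) =
    occursin(r, SubString(s, offset + 1))

from_start = legacy_ismatch(r"é", "café", 0)   # searches all of "café"
from_e = legacy_ismatch(r"é", "café", 3)       # searches "é", which starts at byte 4
past_e = legacy_ismatch(r"f", "café", 3)       # 'f' lies before the offset
```

In this sketch the offset has to land on the first byte of a character, because `SubString` rejects indices that fall inside a multi-byte character.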
This is so outdated that I am closing it.
This PR proposes the following changes:
- ismatch should not accept any SubString but only SubString{String}, as PCRE.exec assumes a UTF-8 encoded string;
- some functions accepted AbstractString and converted it to String if needed; some of them threw an error for a non-UTF-8 string; as all their docstrings indicated they should accept AbstractString, I have made all functions accept any AbstractString and convert it to String if needed.

All the changes should be non-breaking.
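The convert-then-match approach described in the second item can be sketched as follows. This assumes Julia 1.x and uses `GenericString` from the Test standard library as an example of a non-String `AbstractString`. The wrapper name `match_any` is hypothetical and is not the code from this PR.

```julia
using Test: GenericString   # a non-String AbstractString from the Test stdlib

# Hypothetical wrapper: convert to String first, then match on the String.
match_any(re::Regex, s::AbstractString) = match(re, String(s))

m = match_any(r"test", GenericString("this is a test"))
matched_text = m.match    # the matched substring
matched_at = m.offset     # 1-based index where the match starts
```

One consequence of converting first is that the returned substring refers to the converted `String` copy rather than to the original string object.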