Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve regex search functions signatures #24116

Closed
wants to merge 3 commits into from

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Oct 13, 2017

This PR proposes the following changes:

  • Line 170: ismatch should not accept any SubString but only SubString{String} as PCRE.exec assumes UTF-8 encoded string;
  • there was a mix of two approaches in search functions: some of them accepted any AbstractString and converted it to String if needed; some of them threw an error for non-UTF-8 string; as all their docstrings indicated they should accept AbstractString I have made all functions accept any AbstractString and convert it to String if needed.

All the changes should be non-breaking.

base/regex.jl Outdated
@@ -282,10 +278,10 @@ function matchall(re::Regex, str::String, overlap::Bool=false)
matches
end

matchall(re::Regex, str::SubString, overlap::Bool=false) =
matchall(re, String(str), overlap)
matchall(re::Regex, str::AbstractString, overlap::Bool=false) = matchall(re, String(str),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps linebreak after = would look better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@fredrikekre fredrikekre added the strings "Strings!" label Oct 13, 2017
@bkamins
Copy link
Member Author

bkamins commented Oct 13, 2017

additionally fixed a bug in SubString indexing in match and matchall via ovec for unicode characters (caught by Documenter build).

@bkamins
Copy link
Member Author

bkamins commented Oct 13, 2017

I have also improved handling of bad idx along with #24103.

base/regex.jl Outdated
@@ -214,16 +215,12 @@ function match(re::Regex, str::Union{SubString{String}, String}, idx::Integer, a
n = div(length(ovec),2) - 1
mat = SubString(str, ovec[1]+1, prevind(str, ovec[2]+1))
cap = Union{Void,SubString{String}}[ovec[2i+1] == PCRE.UNSET ? nothing :
SubString(str, ovec[2i+1]+1,
prevind(str, ovec[2i+2]+1)) for i=1:n]
SubString(str, ovec[2i+1]+1, prevind(str, ovec[2i+2]+1)) for i=1:n]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect alignment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the alignment used before my PR. What alignment should be correct. I have decreased in my local file the alignment to one tab so S in SubString is exactly below = in the above line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say the alignment from your PR was OK, at least it made it clear that these where inside the comprehension. (Anyway, if you didn't align everything inside the comprehension, I think it should use only four spaces, not eight; we never use tabs, so one character = one space, that's a simple rule.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - reverting to the original alignment from the PR (by tab I meant 4 spaces)

base/regex.jl Outdated
@@ -269,7 +266,7 @@ function matchall(re::Regex, str::String, overlap::Bool=false)
end
end

push!(matches, SubString(str, ovec[1]+1, ovec[2]))
push!(matches, SubString(str, ovec[1]+1, prevind(str, ovec[2]+1)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to explain what this does? It's quite hard to guess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to correct for the fact that ovec is 0-based and that ovec[2] is pointing at the last byte of the match. I am adding an appropriate comment to the source code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've added the comment above, but not here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue is in three places in this file. Each follows exactly the same rule. I thought to add it to the first occurrence from top of the file.
Do you think it is worth to add it everywhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, tough call, but given how subtle this is, I'd rather be too verbose than not enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - adding additional comment as it will not hurt

base/regex.jl Outdated
throw(BoundsError())
end
function search(str::Union{String,SubString{String}}, re::Regex, idx::Integer)
isvalid(str, idx) || throw(ArgumentError("invalid index $i"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"invalid character index" would be more explicit. Also throw a UnicodeError for consistency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I will separate this into BoundsError and UnicodeError. It will introduce a slight overhead though as isvalid does not distinguish them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's annoying. Make it throw either BoundsError or UnicodeError, and let's say we'll improve that later by replacing isvalid everywhere it's appropriate with a function which would throw the errors directly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted it to a corrected version of the original test to minimize the number of changes in this PR (as this change is breaking). I propose to implement more strict testing when your search Julep is decided.

base/regex.jl Outdated
))
search(s::AbstractString, r::Regex) = search(s,r,start(s))
search(str::Union{String,SubString{String}}, re::Regex) =
str == "" ? (0:-1) : search(str, re, start(s))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like this, as it means passing the index 1 is not equivalent to omitting the index. Why not add this check the the method which takes an index? Also, the current behavior had its logic, as noted at #24103.

Better move this to a separate PR and only keep uncontroversial changes in this one. It already contains many different things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this change from this PR and add it to #24103.
I will discuss my rationale in #24103 (I have not read though it yet - I am working though this PR to fix #24157 first).
But in short passing 1 is exactly not equivalent to omitting the index. The corner case is an empty string "". You can search for something in an empty string (and it should not throw an error if you do not specify an index), but for this string index 1 is invalid so specifying it should be an error.

test/regex.jl Outdated
# regex match / search string must be a String
@test_throws ArgumentError match(r"test", GenericString("this is a test"))
@test_throws ArgumentError search(GenericString("this is a test"), r"test")
# regex match / search string must be a String, changed in #24116
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove if that no longer applies.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Copy link
Member Author

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have accidentally started a review of my own code 👶.

@bkamins
Copy link
Member Author

bkamins commented Oct 16, 2017

@nalimilan I am trying to locate the reason for build error (one could be allowing to pass signed integers to PCRE.exec as offset but I have to dig into it), but I want to discuss and additional thing I have spotted now.

ismatch takes offset as its third argument, on the other hand (which is not documented now) which is 0 based.
match takes idx as its third argument which is 1 based. Additionally it takes add_opts argument that is not accepted either by ismatch not matchall (and not all signatures of match allow it).

matchall has no idx nor add_opts argument.

I think it would be good to have it consistent. So the question is:

  • should we change offset in ismatch to idx and make it 1-based (breaking);
  • should idx argument be added to matchall (non breaking);
  • shuold add_opts options be added to ismatch, matchall and all signatures of match? (non breaking); additionally I would make add_opts a keyword argument as someone might want to specify add_opts without specifying idx (breaking for some signatures of match).

@nalimilan
Copy link
Member

Good catch. It would be good to make everything consistent indeed, even if that's breaking, since that's the last chance to fix these before 1.0. The fact that they are not documented makes it less of an issue.

Regarding add_opts, there's no hurry to add it to functions which do not support it since adding it won't break anything. I'm not sure about making it a keyword argument. Currently that can hurt performance, and it's easy to pass the default index using start(s) anyway.

BTW, note that some of these functions might be merged with generic find functions (see the Search & Find Julep). ismatch could be findfirst(regexp, string) > 0, eachmatch could be findeach(regexp, string), etc. Though match is a bit special so maybe it will stay as is. So overall maybe better not make improvements to them until this is sorted out, except of course breaking changes which need to happen before 1.0.

@bkamins
Copy link
Member Author

bkamins commented Oct 16, 2017

I have an additional question relating to type of offset used in PCRE.exec (I do not know whom it is best to ask this).

For instance in regex.jl, e.g. in matchall offset is defined as UInt32. The question is: is this a limitation of PCRE distribution that Julia uses or it might be UInt64 or UInt32 depending on the build of PCRE.

I ask because it is not unthinkable (or even I would say in a few years it might be normal in some types applications) to have a string larger than typemax(UInt32) in which case the for example the line offset = UInt32(ovec[1]+1) will throw InexactError, especially as we have a definition that ovec::Vector{Csize_t} and Csize_t does not have to be UInt32 (and on 64 builds it will not be).

If UInt32 is a maximum limit then I would document it. If it is not then the question is how to check what is PCRE2_SIZE used in PCRE build (is it guaranteed to be equal to Csize_t?).

@nalimilan
Copy link
Member

In deps/build/pcre2-10.21/src/pcre2.h, there's this:

#define PCRE2_SIZE            size_t

So it looks like you find a bug and it should be Csize_t, a.k.a. UInt instead. Time for another PR? :-)

@bkamins bkamins force-pushed the patch-16 branch 2 times, most recently from 3dc6013 to c24f155 Compare October 16, 2017 11:27
@bkamins
Copy link
Member Author

bkamins commented Oct 16, 2017

@nalimilan Given your search Julep I propose to leave this PR with:

  • change offset to idx in ismatch and document it; it is breaking in a nasty way, as it will cause silent errors, but I believe we should unify it - I have checked Julia base and I did not find ismatch used with offset argument;
  • make all PCRE.exec calls with Csize_t conversion of the offset (I decided to put it into this PR as it is related to function signatures, where we allow any Integer; I thought it is a better approach than restricting the signatures)
  • make match and matchall accept any AbstractString
  • correct indexing into ovec in matchall

What I did not do:

  • improve bounds checking of idx as it is breaking and probably needs a separate PR (when a discussion of String search inconsistency #24103 and seach Julep are decided)
  • Correct use of Csize_t everywhere (not only in PCRE.exec) calls: in particular the definitions of RegexMatch and Regex are inconsistent (Csize_t vs Int), but this is a separate issue and maybe it is not critical (when limits are reached an error should be thrown, in PCRE.exec we have to fix it as it is a low level C call).

base/regex.jl Outdated
@@ -154,22 +155,25 @@ r"a.a"
julia> ismatch(rx, "aba")
true

julia> ismatch(rx, "aba", 2)
true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false?

base/regex.jl Outdated
compile(r)
return PCRE.exec(r.regex, String(s), offset, r.match_options,
return PCRE.exec(r.regex, String(s), Csize_t(idx-1), r.match_options,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Csize_t is a C implementation detail, better keep it inside PCRE.exec as much as possible. Can't you adapt that function directly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, is any change really needed? PCRE.exec uses ccall, which will take care of converting offset to Csize_t. There's no need to make the conversion manually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I thought, mistakenly understanding the documentation of ccall, that the conversion is unsafe, but it is actually safe. Removing.

test/regex.jl Outdated
@@ -34,9 +37,10 @@ show(buf, r"")
# see #10994, #11447: PCRE2 allows NUL chars in the pattern
@test ismatch(Regex("^a\0b\$"), "a\0b")

# regex match / search string must be a String
@test_throws ArgumentError match(r"test", GenericString("this is a test"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are now supposed to work, better keep them and check that they return the correct result.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

nalimilan
nalimilan previously approved these changes Oct 16, 2017
@nalimilan nalimilan dismissed their stale review October 16, 2017 16:20

Woops, CI still fails.

@bkamins
Copy link
Member Author

bkamins commented Oct 17, 2017

fixed

@nalimilan
Copy link
Member

Can we merge this? It fixes a bug in matchall which makes CI fail for CategoricalArrays.

@bkamins
Copy link
Member Author

bkamins commented Nov 21, 2017

Any objections to merge this?

@StefanKarpinski
Copy link
Member

Even though the offset argument was undocumented, I don't think we should just change its meaning without a deprecation like this – it will silently break people's code. I'm also not even 100% on exposing the old offset or the new idx arguments at all since once can accomplish the same effect with a SubString without complicating the API (and very soon this will be zero-cost).

So the alternative would be to make the methods with the offset/index arguments internal (e.g. _match) and deprecate the undocumented offset methods to use SubString – in case anyone is currently using them. Does that seem reasonable?

@nalimilan
Copy link
Member

Makes sense, anyway ismatch should be combined with contains (#19250). At this stage better simplify APIs, we can always add arguments later, but we won't be able to remove them.

@nalimilan
Copy link
Member

Do you want to update this now that ismatch has been merged with contain? We really have to fix #24157 before 0.7.

@bkamins bkamins force-pushed the patch-16 branch 2 times, most recently from 8acc9b5 to c3e002b Compare February 11, 2018 15:58
@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2018

Rebased and readjusted (we had some major changes in the meantime in strings so I hope I have not mixed up anything 😄). I when CI passes a review is required.

Going back to the old changes I have one thing I am not 100% sure of (it was earlier used, but I want to make sure that this is correct - I have checked that it works OK, but I do not understand why):

function contains(s::SubString{String}, r::Regex, idx::Integer=firstindex(s))
    compile(r)
    return PCRE.exec(r.regex, s, idx-1, r.match_options,
                     r.match_data)
end

passes SubString{String} to pcre2_match_8 from PCRE_LIB as Ptr{UInt8}. How does Julia resolve the correct pointer to the start of SubString{String} in this call?

@bkamins bkamins force-pushed the patch-16 branch 3 times, most recently from c98b15f to 822f659 Compare February 11, 2018 16:13
@nalimilan
Copy link
Member

Thanks. Though reading @StefanKarpinski's comment above, looks like we should not document the idx argument, but instead deprecate it in favor of using SubString.

Regarding the pointer question, I guess PCRE.exec takes care of the conversion, or maybe ccall does that thanks to a cconvert method?

test/regex.jl Outdated
@@ -21,6 +21,9 @@ for f in [matchall, collect_eachmatch]
@test f(r"GCG","GCGCG",overlap=true) == ["GCG","GCG"]
end

# Issue #24157
+@test matchall(r"fé", "café") == ["fé"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2018

idx - I have forgotten about that. I will deprecate offset. Actually it even more sense as offset/idx could be an invalid index and lead to weird results. Should I put the deprecation in deprecated.jl?
offset is not used in Base anywhere so this should not be a problem. I will change it when I know if CI passes

As for the SubString{String} issue - the reason why I have asked as cconvert is no-op for SubString{String}. Probably ccall does the job, but I do not understand how.

@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2018

CI passes so this should be good for a review.

@@ -1240,6 +1240,14 @@ end

@deprecate ismatch(r::Regex, s::AbstractString) contains(s, r)

function ismatch(r::Regex, s::AbstractString, offset::Integer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't @deprecate ismatch(r::Regex, s::AbstractString, offset::Integer) contains(SubString(s, offset+1), r) be enough?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would :). Fixing

@bkamins
Copy link
Member Author

bkamins commented Oct 24, 2020

This is so outdated that I close it.

@bkamins bkamins closed this Oct 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants